
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (New Staff Post, Please read)



I forgot about that leak; the Sonic game ended up being real. Maybe there really is/was an Odyssey 2 in development... though if it was real, I can't see them just scrapping the project if a lot of the work was already done at that time. A new game expanding on the open concept of Bowser's Fury is personally what I'm hoping for, but Odyssey 2 would still be very exciting and I'm sure a great experience.
 
Series S only has 512 GB, with 360 GB usable. A 256 GB Switch 2 with the same usable ratio would be at around 180 GB, but likely more could be usable, since around 80% of the original Switch's 32/64 GB is usable.
I believe one major reason why storage space is so much smaller than expected for Series S and X is that they reserve space for multi-resume: you need to reserve dozens of GBs to dump the RAM of a game onto, and you need to have that space for multiple games (up to 5 I believe). So that will be 50 GB or more for the Series S.
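As a rough back-of-envelope sketch of the figures above (every number here is an estimate quoted from these posts, not an official spec), the arithmetic works out like this:

```python
# Rough storage estimate using the figures quoted above
# (Series S capacities and the quick-resume reservation are estimates, not specs).
series_s_total = 512           # GB advertised
series_s_usable = 360          # GB usable, per the post above

# Scale the same usable ratio down to a hypothetical 256 GB Switch 2.
switch2_total = 256
usable_at_series_ratio = switch2_total * series_s_usable / series_s_total
print(f"Usable at Series S ratio: {usable_at_series_ratio:.0f} GB")   # ~180 GB

# The original Switch keeps roughly 80% of its 32/64 GB usable; applying that instead:
usable_at_switch_ratio = switch2_total * 0.80
print(f"Usable at Switch 1 ratio: {usable_at_switch_ratio:.0f} GB")   # ~205 GB

# Part of the Series S gap is plausibly the multi-resume reservation:
# a RAM dump per suspended game, for several games (say 5 titles x ~10 GB of RAM).
quick_resume_reserve = 5 * 10
print(f"Hypothetical quick-resume reservation: ~{quick_resume_reserve} GB")
```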
 
COD is a lot like Ark: Survival Evolved. Installing it says it takes something like 180 GB, but when you look, that includes a million things you may not even want installed. Ark, for instance, installs every single unique file for every expansion pack you own, even for content you aren't playing. If you tell Ark to install only what you're actually using, suddenly around 120 GB of space is freed up.
 
Assets will be cut down in size, so not really.

EDIT: Actually, looking it up, the reason for CoD's large file size is that it reuses assets from previous games, which are deletable. Since Drake doesn't have those games, the size will be much smaller anyway.

Yes I agree that it's deliberately bloated nonsense, so it's a question of whether or not MS is going to actually force Activision to stop doing that shit.
Some porting studio will somehow make it work.

Also, there's always the chance of requiring a download code for the physical release.

With the PS5 version of MW3 being 140 GB, there's always a chance that the Switch 2 version will be 60-100 GB.

But I'm more curious about how much the Switch 2 cartridge will hold, since I'm guessing there must have been some sort of breakthrough for it to hold at least 64 GB.
Considering that the latest COD discs no longer even contain the game, there's a good chance Activision pulls similar BS with Switch 2... unless MS makes them not do that.
I'm not sure why it's a debate whether Black Ops 6 will be on Switch 2.

  • Switch 2 will be around Series S power
  • Nintendo and Microsoft have made a 10-year deal to bring COD to Switch platforms
  • Activision's CEO himself said he deeply regrets not bringing COD to Switch

Switch 2 will be getting COD.

No more "but file size" talk.
I didn't say COD won't be coming to Switch 2; I said that the biggest hurdle is going to be file size. If Activision actually puts effort into keeping the file size manageable, then it should be fine, but right now their trend has been to deliberately bloat file sizes for anti-competitive reasons. Hopefully that trend changes.
I brought this up as a possible stumbling block a while back, but the more I think about it, the less sense it makes.

Series S only has 512 GB, with 360 GB usable. A 256 GB Switch 2 with the same usable ratio would be at around 180 GB, but likely more could be usable, since around 80% of the original Switch's 32/64 GB is usable.

There's zero chance Microsoft makes a COD that can't be installed out of the box on a Series S, and I'm unaware of a game that's actually 360 GB in size.
Even a 200 GB COD install could be made installable on a 256 GB Switch 2 on day one, with less bloated (no 4K) assets and by targeting the device itself.

Microsoft would not have made a deal if it had no intention of porting those games to Nintendo platforms. In hindsight, my own initial thoughts were misplaced, and I'd guess yours are too.

I'm going to make a lame prediction that Switch 2 COD will be around 50-100 GB with optional installs.



Ditto, I agree.
It's a question of whether or not MS is willing to force Activision to put some actual effort into file size management.
 
Samsung seems desperate to approach NVIDIA with its 3nm process.
According to industry sources on the 20th, Samsung Electronics' Semiconductor (DS) Division, Foundry Business Unit, has set securing 3nm product orders from Nvidia as its top priority for this year.

An industry insider stated, "Each department has been notified to prioritize tasks related to securing orders from 'Nemo' over their existing duties, indicating a concerted effort." Within Samsung Electronics, "Nemo" is the code name for Nvidia.

However, it appears that the Foundry Business Unit has not established a dedicated task force or specialized team specifically for securing Nvidia-related orders.

Previously, in 2020, Nvidia entrusted the manufacturing of its consumer graphics processing unit (GPU) GeForce RTX 30 to Samsung Electronics' 8nm process, and they have continued to receive chips from this process. However, Nvidia has recently allocated most of the volume for its advanced process chips to TSMC, leading to a halt in orders for Samsung Foundry. Currently, Nvidia's AI semiconductors 'H100' and 'A100' are also manufactured using TSMC's 4nm and 7nm processes.
 
So I was actually watching one of Rich's videos where he dismantles the Xbox Series S, and holy crap, I had no idea that thing was so tiny! I know it's less powerful than the Series X, but it's still way smaller than I expected.
It's a really well designed piece of kit, at least at an industrial design level.

Which kind of makes it a little more disappointing: had Nintendo still been making full-on consoles and not hybrids, something even the size of the Wii U released in 2025 might have been significantly better than the PS5/XSX.
This is probably a bridge too far. You could clearly make a jump beyond the PS5 at the same size - not a leap but a jump. Or you could make the PS5 somewhat smaller. I don't think the tech exists to put PS5 levels of power in a Wii U shaped box, much less do it for less than 500 dollars.

It's not a perfect proxy, but the PS5 is roughly as powerful as a $400 card at the time it launched. Considering the console launched at $500, with an SSD, a CPU, a Blu-ray drive and 16 GB of RAM, you can see how much Sony was losing on the thing. That's why Nintendo had to get out of the graphics wars - they don't have gigantic non-video game lines of business that can subsidize 2 years of selling their core product at a loss.

In a side reality where Nintendo is still making dedicated TV consoles, and is launching an Nvidia-based device in 2025, they're probably still charging $400, and trying to make money on it. A good proxy then might be a sub-$300 card from Nvidia. Let's get optimistic and say the 4060. All of Nvidia's coolest bells and whistles, including Frame Generation, but basically no more performance than the Series X. And probably as big as the Series S (which is admittedly quite small).
 
Regarding a physical CoD BO6 version for the Switch, I wonder if Microsoft couldn't just put the single-player mode on the cartridge and use an additional download for the multiplayer mode. The multiplayer mode will receive constant updates anyway.
Yeah, I could see that being the case
If Activision actually puts effort into keeping the file size manageable, then it should be fine, but right now their trend has been to deliberately bloat file sizes for anti-competitive reasons. Hopefully that trend changes.
I thought it was simply a matter of them not caring; what are these anti-competitive reasons?
 
Thraktor's Guide to Concurrency in Ampere

One topic which comes up quite frequently in this thread (particularly in the context of DLSS) is how concurrency works in Ampere GPUs. In Nvidia's Ampere whitepaper, they point out new concurrency features in the architecture without providing much detail on how they work, or what kind of limitations there are in using them, besides some graphs showing a reduction in frame time from running graphics, RT and tensor cores at the same time in Wolfenstein Youngblood.

I've been trying to understand how this works myself, and in the process get a better understanding of how Nvidia's GPU architectures work at a lower level, and I feel like I've got a good enough understanding of concurrency in Ampere, at least between regular shader code and tensor cores, to explain it in a way which might be useful to people. And, in the process of writing, I will hopefully clarify some things for myself.

One thing I should mention is that what I'm describing below is a simplified explanation of how SMs work, mainly to make it easier to understand, but also because I don't know enough of the lower level details to speak with any confidence on it. The main simplification I'm making is ignoring pipelining, which is quite important, but also quite complex, and I feel the general points are the same even if we ignore pipelining, although the specific implementation differs a bit. I'm also going to ignore things like warps getting split at branches, complex instructions, etc.

A Quick Intro to GPUs

To start, I should cover some basic points on how GPUs work which will be relevant later. The most important of these is the concept of SIMT, or single-instruction-multiple-threads, which is the paradigm by which GPUs operate. This means what it says, which is that GPUs execute a single instruction across multiple threads of data at once. So, for example, if you have a pixel shader with a thread for each pixel, and there's an instruction which states "multiply X by Y and store the answer in Z", it will execute that instruction for every pixel in that thread group, even though they all may have different X and Y values.

In Nvidia's case, a group of threads which executes together is called a Warp, and a warp contains 32 threads. So each time an instruction is executed on a Nvidia GPU, it's run on a warp of 32 threads. At a higher level these are organised into what are called thread blocks, but that's not too important here.
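As a loose analogy (this is just the programming model, not how the hardware physically executes anything), a warp behaves like a vectorised operation over 32 lanes; a minimal Python/NumPy sketch:

```python
import numpy as np

# Loose SIMT analogy: one "instruction" (z = x * y) applied across all 32 threads
# of a warp at once, even though every thread holds different data.
WARP_SIZE = 32
x = np.random.rand(WARP_SIZE).astype(np.float32)   # per-thread X values
y = np.random.rand(WARP_SIZE).astype(np.float32)   # per-thread Y values

z = x * y          # a single operation executed for the whole warp

print(z.shape)     # (32,) -- one result per thread in the warp
```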

Each warp is issued to an SM (which are the building blocks of Nvidia's GPUs) to execute on, continuing to execute instructions until the shader has completed. Ordinarily a GPU would have a very large number of warps issued to its SMs at any one time. In Ampere's case, it can handle up to 48 warps per SM, and with 12 SMs on T239's GPU, that would mean up to 576 total warps, or 18,432 threads issued at a time.
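For what it's worth, the occupancy figure above is just straightforward arithmetic, using the Ampere per-SM warp limit and the SM count assumed for T239 in this thread:

```python
# Maximum resident work for the GPU as described above (the T239 SM count is the
# figure assumed in this thread, not an official spec).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48    # Ampere limit per SM
SM_COUNT = 12            # assumed T239 GPU

max_warps = MAX_WARPS_PER_SM * SM_COUNT       # 576 warps
max_threads = max_warps * WARP_SIZE           # 18,432 threads
print(max_warps, max_threads)
```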

The Ampere SM

Here's Nvidia's diagram of an Ampere SM from the whitepaper:

[Diagram: Ampere SM, from the Ampere whitepaper]


The Ampere SM is divided into four partitions, each of which contains registers, shader cores, tensor cores, instruction dispatch, etc. Each of these executes instructions independently from each other, and we'll look at them in more detail below. In addition to what's in the partitions, there is also an L1 cache/shared memory pool, texture units, and the RT core. We'll come back to the RT core later, but for the moment, let's focus on those partitions. Here's a diagram of a partition:

[Diagram: a single Ampere SM partition]


An SM partition contains everything needed to execute shader instructions independently. There is a register file, which stores the data being executed on by the threads, load/store units to move data in and out of those registers, and a warp scheduler and dispatch capable of dispatching instructions across three different data paths. The first data path is capable of executing FP32 and INT32 instructions, the second one is capable of executing just FP32 instructions, and the third datapath is capable of executing "tensor core" instructions and FP16 instructions.

Dispatching and Executing Instructions

If you look at the diagram of the SM partition, you'll see it notes (32 threads/clk) next to the dispatch unit. This is pretty important, as what it's saying is that, within each SM partition, one warp of 32 threads can be dispatched to one of the three data paths each clock cycle. This means that you can't simultaneously dispatch instructions to, say, both FP32 data paths within the same clock cycle. You would have to dispatch an instruction for one warp to one data path on one clock cycle, and then dispatch an instruction for another warp to the other data path on the next clock cycle.

Just because you can't dispatch to multiple data paths on the same clock cycle doesn't mean you can't have multiple data paths executing concurrently, though. Otherwise having multiple FP32 capable data paths would be useless if you can't use them at the same time. The key to this is that instructions typically take multiple cycles to execute.

I'm going to ignore pipelining here to keep things simple, but if you look at the two FP32 capable data paths in the diagram, you'll see each one is divided into 16 blocks. Nvidia calls these "CUDA cores" in marketing, although they're not really cores. What they actually tell us is that each one of these data paths can execute 16 FP32 operations per clock cycle. Now, if an FP32 instruction for a warp containing 32 threads is dispatched to one of these data paths, and it can execute 16 ops per clock, then it's straightforward to see that a standard FP32 operation (like fused multiply add, for example) would take two clock cycles to execute on one of these data paths.

If it takes 2 cycles to execute an FP32 operation on a warp, and the dispatch unit can issue one warp per cycle, then we can see how having two FP32 data paths becomes useful, as you can dispatch to each data path on alternate clock cycles, and, in theory at least, get 100% utilisation of both simultaneously.
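Here's a tiny toy scheduler (heavily simplified: no pipelining, no stalls, an endless supply of ready FP32 warps, structure made up purely for illustration) showing why alternating dispatch can keep both FP32 data paths essentially fully busy:

```python
# Toy model of one SM partition: one dispatch per clock, two FP32 data paths,
# each FP32 instruction (a 32-thread warp on a 16-wide path) busy for 2 cycles.
FP32_CYCLES = 2
CYCLES = 1000

busy_until = [0, 0]            # cycle at which each FP32 data path becomes free
busy_cycles = [0, 0]

for clk in range(CYCLES):
    # dispatch unit: issue one warp this cycle to the first free FP32 path
    for path in range(2):
        if busy_until[path] <= clk:
            busy_until[path] = clk + FP32_CYCLES
            busy_cycles[path] += FP32_CYCLES
            break              # only one dispatch per clock

for path in range(2):
    print(f"FP32 path {path} utilisation: {busy_cycles[path] / CYCLES:.0%}")
# With 2-cycle instructions and 1 dispatch/clock, both paths approach 100%.
```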

Tensor Code Is Just Shader Code

One important thing to note here is that tensor cores are pretty much just big shader cores designed for a very specific operation. While shader cores perform operations like add or multiply on individual numbers (across multiple threads at the same time), tensor cores perform multiplication operations on matrices. They run instructions which sit in shader code just like the other data paths do, and if you look at Ampere's instruction set, you can see those instructions, labelled HMMA and IMMA. So when the dispatch unit comes across an FP32 instruction, it will send it to either of the first two data paths, when it comes across an INT32 instruction it will send it to the first data path, and when it comes across a matrix instruction, it will send it to the tensor core data path.

For those curious, it seems that these matrix multiplication instructions are synchronised across the entire warp, where a single matrix multiplication is split over the 32 threads in the warp. This makes a lot more sense than trying to execute 32 separate matrix multiplication operations simultaneously, which would require a huge amount of register space.

To understand how well the tensor core can operate concurrently with the other data paths, we need to know a bit more about these instructions. I'm going to focus on FP16 matrix multiplications but the same logic applies to TF32, BF16, INT8, etc. From Nvidia's documentation we know that Ampere supports two matrix sizes for these operations, 16x8x8 and 16x8x16. We'll first look at the 16x8x8 case, which means multiplying a 16x8 matrix by an 8x8 matrix. This requires 1024 FMA operations to execute.

We can calculate from Nvidia's advertised performance figures that each SM is capable of executing 512 FP16 tensor ops per clock, ignoring sparsity (their numbers claim double this, by counting FMA as two operations). This means that the tensor core in each SM partition can execute 128 FP16 operations per clock. So, a 16x8x8 multiplication which requires 1024 operations to complete would execute in 8 clock cycles. By the same logic, a 16x8x16 multiplication would execute in 16 cycles. Tensor ops in FP16 with sparsity use 16x8x32 multiplications (or really 16x8x16 after accounting for the sparsity structure) and also execute in 16 cycles.
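The cycle counts above fall straight out of the arithmetic, using the 128-FMAs-per-clock-per-partition figure derived above:

```python
# FMA count of an MxNxK matrix multiply instruction, and how many cycles it takes
# at the assumed 128 dense FP16 tensor FMAs per clock per SM partition.
TENSOR_FMA_PER_CLK = 128   # per SM partition, derived above from Nvidia's figures

def mma_cycles(m, n, k):
    fma_ops = m * n * k
    return fma_ops, fma_ops / TENSOR_FMA_PER_CLK

print(mma_cycles(16, 8, 8))    # (1024, 8.0)  -> 8 cycles
print(mma_cycles(16, 8, 16))   # (2048, 16.0) -> 16 cycles
```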

Running Tensor Code Concurrently With Shader Code

Knowing that tensor cores run shader instructions, just like regular shader cores, and knowing how long those instructions take to execute, we can start to get an idea of how running tensor code like DLSS alongside regular shader code works.

Firstly, tensor code is bundled up in threads and warps just the same as any other code. For those 48 warps issued to an SM, you would have to have some regular shader code warps issued, and some warps which use tensor cores. For the sake of an example, let's say you have 32 warps of regular shader code, and 16 warps which use tensor cores. Then, for each SM partition, we can assume there are 8 warps of regular shader code issued to it, and 4 warps which use tensor cores.

Each cycle, the dispatch unit in an SM partition can dispatch one instruction from one of those 12 warps to one of the three data paths. To help illustrate how this would work, I've created a diagram showing a theoretically optimal case involving just standard FP32 instructions (let's say FMA) and FP16 matrix instructions of size 16x8x16:

[Diagram: dispatch timeline mixing FP32 instructions and 16x8x16 FP16 matrix instructions]


Each row is one clock cycle, going from top to bottom, and each column is one of the three data paths. On the left side I've shown what the dispatch unit is doing that cycle, and in each column I've shown colour when an operation is being executed (blue for FP32, green for tensor) with a dark line at the top where it's dispatched. If a cell is white, that data path is idle.

You can see here how all three data paths can be kept reasonably busy even though the dispatch unit can only issue an instruction to one of them each cycle. The first two data paths could theoretically achieve full utilisation if there was no tensor code, but even with tensor code being dispatched it has a relatively small effect, just forcing one idle cycle for every 16 execution cycles. The key to this is how long the tensor ops take to execute. If they completed very quickly, then there would be more of the idle cycles you see in the other two data paths whenever a tensor op needs to be issued.
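To put a number on the "one idle cycle for every 16 execution cycles" point, here's the earlier toy scheduler extended with a tensor path (again: no pipelining, no stalls, a simple made-up dispatch policy; only meant to illustrate the shape of the behaviour, not real hardware):

```python
# Toy dispatch model of one SM partition: two FP32 paths with 2-cycle FMAs, one
# tensor path with 16-cycle 16x8x16 HMMA ops, one dispatch per clock, no stalls.
FP32_CYCLES, TENSOR_CYCLES = 2, 16
CYCLES = 10_000

free_at = {"fp32_a": 0, "fp32_b": 0, "tensor": 0}
busy = {k: 0 for k in free_at}

for clk in range(CYCLES):
    # Simple policy: feed the tensor path whenever it is free, otherwise feed
    # whichever FP32 path has been idle the longest. One dispatch per clock.
    if free_at["tensor"] <= clk:
        free_at["tensor"] = clk + TENSOR_CYCLES
        busy["tensor"] += TENSOR_CYCLES
    else:
        ready = [p for p in ("fp32_a", "fp32_b") if free_at[p] <= clk]
        if ready:
            path = min(ready, key=lambda p: free_at[p])
            free_at[path] = clk + FP32_CYCLES
            busy[path] += FP32_CYCLES

for path, cycles in busy.items():
    print(f"{path}: {cycles / CYCLES:.1%} utilisation")
# The FP32 paths settle around 15/16 ≈ 94% busy; the tensor path stays essentially full,
# which lines up with the efficiency figure in the TL;DR below.
```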

It's important to note that this is a theoretically optimal case. The dispatch unit isn't always going to be able to issue an instruction on every cycle, because there may not be a warp with an instruction ready for dispatch. These warp stalls, as they're called, can be caused by a number of reasons, for example waiting on data to arrive from RAM, and are the reason for the large number of warps which are issued at a time. Even if some of the warps are stalled for whatever reason, having 12 of them to dispatch from means it's very likely that at least one of them will be ready to dispatch on any given clock cycle. Still, that's not guaranteed, and there are inevitably going to be missed cycles here and there.

On the specific issue of running shader code concurrently with tensor code like DLSS, although they can execute concurrently with almost full efficiency in a theoretical setup, there are additional bottlenecks to running them together. For instance, issuing a few DLSS warps to an SM alongside your shader code means there are fewer shader code warps available, so you're more likely to see all of them stalled, and the dispatch unit unable to issue to the FP32/INT32 data paths, than if you had a full complement of shader code warps. The same is the case with warps using tensor core code, where it's more likely that (say) 4 warps are all stalled waiting for memory than if all 12 warps were lining up to use the tensor cores.

Speaking of memory, regular shader code and DLSS would be competing for both cache and memory bandwidth. Hopefully this isn't too much of an issue with LPDDR5X providing more bandwidth than we expected, but it's still a potential limiter on performance when running them concurrently. Finally, I should note that ML models like DLSS aren't 100% matrix multiplication, and there are things like activation layers in there which will require regular shader code (likely FP16), but that should be a relatively small portion of the execution time.

Running FP16 Code Concurrently With FP32 Code

Another closely related topic that has come up a few times is the possibility of running non-tensor FP16 code concurrently with FP32 code, as the Ampere whitepaper states "standard FP16 operations are handled by the Tensor Cores in GA10x GPUs". In fact, I've argued myself in the past that this could be useful for developers with a mix of FP32 and FP16 code, but it seems like it's less useful in practice than running tensor code concurrently with FP32 code.

The reason for this is execution time. While tensor operations are on nice big matrices which take up to 16 cycles to complete, which leaves a lot of time for the dispatch unit to also issue FP32 instructions, non-tensor FP16 instructions are at the other end of the spectrum, executing very fast. From Nvidia's performance figures, we know that non-tensor FP16 instructions are executed at a rate of 128 per SM per clock, which means that a tensor core data path in one of the SM partitions can execute 32 FP16 operations per clock when running non-tensor code.

With the dispatch unit issuing one warp of 32 threads per clock, though, this means that non-tensor FP16 instructions execute in a single clock cycle. So, the dispatch unit would have to dispatch to the tensor core data path every single clock cycle to fully utilise the non-tensor FP16 performance available, and it can't do that while also dispatching to the other two data paths.

Here's another diagram like the one above, but now showing a combination of FP32 instructions and non-tensor FP16 instructions (the latter in red):

[Diagram: dispatch timeline mixing FP32 instructions and non-tensor FP16 instructions]


You can see the issue here, as the dispatch unit is dispatching on every clock cycle, but there's still a lot of idle cycles across the three data paths. In fact, so long as it's dispatching every clock cycle, then the achievable performance is 32 operations per clock (or 128 per clock for the entire SM) regardless of whether those are FP32 instructions or FP16 instructions or a mixture of both, just by virtue of the dispatch limitation. Taking pipelining into account would change the behaviour a bit, but the dispatch limitation would remain the same either way.
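The dispatch cap can also be expressed as simple arithmetic: whatever the mix of issue slots, one 32-thread warp per clock means the partition tops out at 32 operations per clock (a sketch under the same no-stall assumptions as above):

```python
# Dispatch-limit arithmetic for one SM partition: each dispatched warp eventually
# retires 32 operations, and the dispatch unit can issue at most one warp per clock,
# so average throughput is capped at 32 ops/clock per partition regardless of the
# FP32 / non-tensor FP16 mix.
WARP_SIZE = 32
PARTITIONS_PER_SM = 4

for fp16_share in (0.0, 0.25, 0.5, 1.0):      # fraction of issue slots given to FP16
    fp32_ops = (1 - fp16_share) * WARP_SIZE   # average FP32 ops retired per clock
    fp16_ops = fp16_share * WARP_SIZE         # average FP16 ops retired per clock
    total = fp32_ops + fp16_ops
    print(f"FP16 share {fp16_share:.0%}: {total:.0f} ops/clk per partition, "
          f"{total * PARTITIONS_PER_SM:.0f} ops/clk per SM")
```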

That doesn't mean that using FP16 isn't worthwhile, as it takes up less space in memory, less register space, and less bandwidth, even if you're not getting faster execution. I'm also very curious if any developers can make use of the tensor cores' matrix multiplication operations for non-ML use-cases. If you've got a problem which maps well to matrix multiplication and where FP16 is sufficient, then you could achieve a very large speedup by rewriting your shaders to make use of the HMMA ops. (I'm also curious if Nvidia have provided tools to do so, as the way in which matrices are synchronised across the warp means these use cases would have to be handled a bit differently from regular shader code by the compiler).

Running RT Cores Concurrently With Everything Else

I've focussed mostly on running tensor core code like DLSS concurrently with regular shader code, but another feature of Ampere is the ability to run RT concurrently with both of these. There's less to say here, as Nvidia is more explicit about the functionality, saying in the whitepaper that "The new GA10x SM allows RT Core and graphics, or RT Core and compute workloads to run concurrently, significantly accelerating many ray tracing operations."

The RT core is responsible for BVH traversal and triangle intersection testing, ie finding exactly what triangle a given ray intersects, and where, and it's fixed-function hardware which sits apart from the SM partitions we talked about before, so it makes sense that it can be made to operate independently. This doesn't cover the entirety of RT workloads, as you still need shaders to create rays and process them after a hit is found, and to perform any work required after that (eg shading reflections), but it means the shader cores don't have to sit idle while the RT cores are doing their thing, or vice-versa.

TL;DR:

From purely a point of view of executing instructions within an SM, Ampere GPUs should be able to concurrently execute both regular shader code and tensor core code like DLSS, with theoretically up to about 94% efficiency achievable given the limits of the dispatch units. In reality there's likely to be contention between the workloads over things like cache and bandwidth, so real-world benefits would be lower, but I'd still imagine you'd get a good performance boost over running them sequentially, particularly given the relatively high bandwidth from LPDDR5X.

For FP32 and non-tensor FP16 code, while they can operate somewhat concurrently, the limitation of the dispatch unit only being able to dispatch one warp per cycle, combined with the very quick execution time of the FP16 instructions, means the benefit to running them together is small. There still can be register/memory/bandwidth benefits to using FP16 code, though.

RT cores can operate concurrently with everything else, as per Nvidia's whitepaper. Not much more to say here.

This is all based on what I can understand from Nvidia's whitepapers and other sources online, but it's not something that Nvidia have ever fully documented, publicly at least. So if anyone has any corrections or anything else to add it would be much appreciated.
This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.
 
Yeah, I could see that being the case

I thought it was simply a matter of them not caring; what are these anti-competitive reasons?
I don't know if there's reason to believe it beyond a joke, but I've seen people say that the reason some games are so big is so you don't have room left to install the competition.
 
What if 'SP' is the abbreviation of the next console 🤯.
Maybe this project started development around the same time as, and alongside, the potential "Switch Pro" that ended up not materializing because of COVID and chip shortages. So "Switch Pro Red" or "Switch Pro Mario", and the codename stuck.
 

However, Nvidia has recently allocated most of the volume for its advanced process chips to TSMC, leading to a halt in orders for Samsung Foundry.
Sounds like good news for TeamTSMC4N.
 
I don't know if there's reason to believe it beyond a joke
For me, I really don't see it as anything other than a joke; the idea that publishers would be that adversarial about things just seems too outlandish. Like, surely these publishers have to realize that it could just as easily happen the other way around? As in, the competition locks you out with no chance for coexistence because your game won't fit in the remaining storage.
 
This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.
4N, a big L2 cache and the OFA from Mrs. Wong, and everyone is happy 😄
 
For me, I really don't see it as anything other than a joke; the idea that publishers would be that adversarial about things just seems too outlandish. Like, surely these publishers have to realize that it could just as easily happen the other way around? As in, the competition locks you out with no chance for coexistence because your game won't fit in the remaining storage.
Part of why AAA publishers keep going for GaaS over single-player games is that a core goal of these products is to make money by monopolizing players' time.

So yes, they would in fact be that adversarial. Especially a company like Activision, which was one of the earliest companies to aggressively push loot boxes (starting with CoD: Advanced Warfare and later Black Ops 3).
 


I forgot about that leak; the Sonic game ended up being real. Maybe there really is/was an Odyssey 2 in development... though if it was real, I can't see them just scrapping the project if a lot of the work was already done at that time. A new game expanding on the open concept of Bowser's Fury is personally what I'm hoping for, but Odyssey 2 would still be very exciting and I'm sure a great experience.

That was a survey made by Sega. It has no relation to any game that may or may not be/have been in development at Nintendo.
 
Series S has the same CPU as PS5 and Series X: two Zen 2 clusters, each with 4 cores. I believe the A78C will end up clocked at 2.0-2.5 GHz, and yes, I know it must be the same clock in both modes. Of course that's on TSMC 4N; Samsung 8nm shouldn't be a debate anymore.

I would be very surprised if the Switch 2's CPU clock isn't below 2 GHz.
 
SilverStar would very clearly be Super Luigi Odyssey, given silver is used for second best. It was cancelled when, during development, Luigi couldn't throw his hat very far no matter how many takes they filmed.
 
Nintendo will open a second US store, in San Francisco, in 2025:


I see this as related to Switch 2 in some ways.

I see Nintendo pushing more movies, stores and theme parks as a way to increase Nintendo's mindshare among consumers, to lessen the risk of having Wii U moments in the future. The greater the mindshare Nintendo has among gamers and the general public, the lower the risk of releasing products that don't sell well enough.

And I think they are pushing hardest in the US because the US is their largest single market, and the dollar exchange rate makes it even more lucrative due to the very weak yen.

But I also think they will start to push for stores in Europe in the future, and maybe a theme park in Europe as well.
 
This is probably a bridge too far. You could clearly make a jump beyond the PS5 at the same size - not a leap but a jump. Or you could make the PS5 somewhat smaller. I don't think the tech exists to put PS5 levels of power in a Wii U shaped box, much less do it for less than 500 dollars.

You don't think? I wonder - a Zen4/RDNA3 APU made on N4P with a vapour chamber seems like it could be made very small, and the cooling solution could be cut down significantly. Though that might push it beyond 500 bones.

This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.

Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
 
My 2 cents on the next 3D Super Mario game: what if it's a sequel to BOTH Galaxy AND Odyssey?
Like "Super Mario Odyssey into the Galaxy" or something.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
Even turning the majority of those super-easy-to-get ones into moon fragments or silver moons or something instead would've helped a lot in differentiating them from the more involved ones.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.

I liked it fine - how many moons were there? 830? I got them all (I have the star on my save file).

For some, they just want to get to the ending of the main story (which can be done with far fewer moons).

For others, they want some longevity to the game itself after completing the main story, with extra challenges increasing replayability (i.e. me).

Personally I love grabbing moon after moon (sometimes seconds apart); it's a constant dopamine hit.

Looks like they struck a good balance; it tries to appeal to those with different goals in mind.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
Not a flaw. Moons weren't supposed to be a big hype get; they were supposed to be a satisfying end to a bite-sized gameplay segment you could enjoy in a quick handheld play moment.

Want the end of handheld game design? Say goodbye to the lucrative hybrid form factor.
 
Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
If I understand what you're getting at properly, I think you're right. So if it was a 30 fps game but the concurrent DLSS (and whatever further post-processing) work took less than 16.6 ms, the output wouldn't need to be 33.3 ms behind just because that's what each frame has been given.

30 fps frame output (on a 60 Hz screen) could go from
BBCCDDEE
to
ABBCCDDE

and maybe the difference could be even less on a variable refresh rate screen? Though I don't feel I have a solid enough understanding of how all the timing stuff works there to say with certainty.
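To illustrate that argument (purely this post's reasoning, with made-up numbers, assuming a fixed 60 Hz scanout and that only DLSS plus post-processing run after the base render):

```python
import math

# Sketch of the vblank-slot argument above. Illustrative numbers only, not measurements.
VBLANK_MS = 1000 / 60          # 16.7 ms display refresh
FRAME_MS = 1000 / 30           # 33.3 ms game frame

def extra_presentation_delay(dlss_plus_post_ms):
    # The finished image can go out at the first vblank after DLSS/post completes.
    slots_late = math.ceil(dlss_plus_post_ms / VBLANK_MS)
    return slots_late * VBLANK_MS

print(f"{extra_presentation_delay(10):.1f} ms late")   # < one vblank of work -> 16.7 ms (ABBCCDDE)
print(f"{extra_presentation_delay(20):.1f} ms late")   # > one vblank of work -> 33.3 ms (a full frame)
```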
Not a flaw. Moons weren't supposed to be a big hype get; they were supposed to be a satisfying end to a bite-sized gameplay segment you could enjoy in a quick handheld play moment.
In my day we called those "blue coins"!
 
You don't think? I wonder - a Zen4/RDNA3 APU made on N4P with a vapour chamber seems like it could be made very small, and the cooling solution could be cut down significantly. Though that might push it beyond 500 bones.



Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
The thing is that when you overlap DLSS, you get the following situation: on the CPU, the updates for frame 1 (including, importantly, the user input) have finished, and now the GPU will render the base image before DLSS using the ALUs (non-tensor compute hardware); let's call that stage A. Then, frame 1 needs to go through DLSS, which we call stage B, and at the end frame 1 needs to have post-processing applied to the post-DLSS image, called stage C. By the time frame 1 enters stage B, frame 2 has performed its CPU processes (and gathered input based on what is on the screen) and can enter stage A. Once stage B of frame 1 finishes, stage C can start for frame 1. Only once stage C is finished do we output frame 1 to the screen.

What this means is that we can only output frame 1 once stage C is finished. The benefit we extract from overlapping DLSS is that instead of having stage A + B + C take the total time of one frame (17 ms for 60 fps, 33 ms for 30 fps), we can now have stage A + C take up almost all of the frame time without stage B reducing the available time for native rendering and for post-processing. In practice, this likely means that there is more time to produce a high quality base image before DLSS. However, the consequence is that the full rendering of a frame, which includes stages A, B, and C, does not finish within the boundary of a frame. This means that we "miss" outputting the current frame, and have to wait one frame before we can output the current frame. Because of overlapping of the stages, this does not compound, but it does mean that each frame's display happens one frame later than the moment the CPU gathered user input for it, and therefore we experience one frame's worth of input lag. Essentially, the CPU has gathered input for frame 2 while the player is viewing frame 1, and therefore their input is mismatched by one frame.

This is my understanding of it, at least. If my understanding was wrong or someone wants to add additional nuance to it, please reply (it helps the discussion and everyone's understanding of the process)!
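As a timeline sketch of that description (the stage costs below are made-up placeholders, purely to show where the extra frame of latency comes from):

```python
# Timeline sketch of the overlapped pipeline described above: stage A (base render)
# fills frame N's own slot, while stage B (DLSS) and stage C (post-processing) for
# frame N run during frame N+1's slot, concurrently with frame N+1's stage A.
# All durations are illustrative assumptions, not measurements.
FRAME_MS = 33.3                       # 30 fps frame budget
B_MS, C_MS = 4.0, 3.0                 # hypothetical DLSS + post-processing cost

for n in range(3):
    input_ms = n * FRAME_MS                     # CPU samples input for frame n
    a_done = (n + 1) * FRAME_MS                 # stage A fills frame n's own slot
    bc_done = a_done + B_MS + C_MS              # B + C run early in frame n+1's slot
    presented = (n + 2) * FRAME_MS              # shown one frame boundary later
    print(f"frame {n}: input at {input_ms:5.1f} ms, A done {a_done:5.1f}, "
          f"B+C done {bc_done:5.1f}, presented {presented:5.1f} "
          f"(input-to-display ≈ {presented - input_ms:.1f} ms)")
# Without overlapping, frame n would be presented at (n + 1) * FRAME_MS instead,
# so the overlap costs roughly one extra frame (~33 ms at 30 fps) of input lag.
```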
 
Maybe this project started development around the same time as, and alongside, the potential "Switch Pro" that ended up not materializing because of COVID and chip shortages. So "Switch Pro Red" or "Switch Pro Mario", and the codename stuck.
Maybe SPR = Super? A la Super Switch?
 
I liked it fine - how many moons were there? 830? I got them all (I have the star on my save file).

For some, they just want to get to the ending of the main story (which can be done with far fewer moons).

For others, they want some longevity to the game itself after completing the main story, with extra challenges increasing replayability (i.e. me).

Personally I love grabbing moon after moon (sometimes seconds apart); it's a constant dopamine hit.

Looks like they struck a good balance; it tries to appeal to those with different goals in mind.
The problem is that they don't feel as rewarding as the old stars (they're literally lying around every corner). They could've added other collectibles instead of devaluing the moons.
 
One pretty major issue with the idea of DLSS concurrency is that any tensor core operation outside of the DLSS step would be unusable without incredible timing with regards to programming.

I assume the cost of DLSS will be like 4 ms which is substantial but not killer.

If you could get Super Resolution and Ray Reconstruction to work with this concurrency idea, it becomes interesting, but Ray Reconstruction appears to be much more expensive than Super Resolution, and I assume concurrency will add plenty of frame-time cost from cache misses, so it's hard to imagine Super Resolution and Ray Reconstruction completing within 16.66 ms with constant cache misses.

This also depends on whether NVIDIA is willing to spend the resources to make a shittier, faster version of Ray Reconstruction that is designed around much more limited ray tracing.
 
The problem is that they don't feel as rewarding as the old stars (they're literally lying around every corner). They could've added other collectibles instead of devaluing the moons.
But then you can't beat the game while doing only the moons you want
 
Playing Ghost of Tsushima on Steam Deck has only increased my appetite for Switch 2's potential. Maybe it's odd that I'm thinking about Switch 2 while playing a Sony exclusive title on a handheld PC... but I can't help myself. Comparison is inevitable.

Some thoughts:
  • This is one of the prettiest games I've ever played, and it is a 'mere' PS4 game running at PS4 equivalent settings (dynamic 1080p with FSR2, 30 FPS, medium-low settings). Asset quality is what you'd expect - but the art direction - the way light bounces, how grass is shaded, how the world presents itself - stunning. And this is despite me cranking the settings way down - it's very scalable.
  • HDR is a must for Switch 2. It just looks stunning and gives an immediate upgrade to visuals regardless of resolution. I expect it in both modes.
  • I'm playing with a 4K framebuffer, so the Deck is always outputting a 2160p 60 Hz HDR signal to the TV. This is what I expect for Switch 2. The game is running at a dynamic upscaled 1080p, and the image quality is outstanding. Increasing the resolution past this point results in noticeable but negligible differences in sharpness. The distinction between 1440p and 4K is especially not noticeable when I'm 8 feet away from my 65'' TV.
  • 30 FPS should be serviceable for most people, yes even on an OLED TV, as long as the game is responsive, the FPS is locked, and/or there is a good motion blur solution. I've had no issues playing this game for hours this way. (I know that some people physically cannot handle 30 frames, this is where I hope developers offer performance modes or support for 40 FPS in some way)
  • I expect many demanding multiplatform ports to target 1080p 30 FPS after DLSS, and that would be a fantastic outcome. Aiming for higher is good of course. I don't expect the same degree of diminished visuals as with the Switch's 'impossible ports', which sacrificed resolution/framerate/detail all at once.
  • Fuck it, if some games need to get 720p/900p docked after DLSS just to make it onto the Switch 2, I'll take it, as long as the image itself is still well anti-aliased and packed with detail.
  • At living room distance (which is what Nintendo would care about most for docked mode) - the Switch 2 should provide consistently high quality visuals in docked mode, to the point where many complaints about image quality and performance should disappear. We'll see the inevitable complaints about '1080p in 2025' but there will be disinterest in the spec wars, considering how good games should look.
I've enjoyed using my Steam Deck as a mini console and have finished up recent games like Lies of P on it, I'm willing to accept all the cutbacks to have a portable device. It's not a very seamless hybrid device though, so the Switch 2 will end up being my go-to for recent releases if third parties step up their game.
 
Based on testing, Ray Reconstruction costs about 1 ms more frame time (on a 4070) than the denoisers needed for just ray-traced reflections.

And a 4070 has 4x the core count of the Switch 2, slightly more than 4x the bandwidth, and most likely nearly twice the Switch 2's clock speed in docked mode and around 3.5x its clock speed in handheld mode.

So we're talking about Ray Reconstruction costing 1 ms more than RT-reflections-only denoisers on a card that is something like 6x more powerful. Assuming RT reflection denoisers cost around 0.5 to 1 ms on a 4070, we're looking at an estimated Ray Reconstruction cost of 1.5 to 2 ms on a 4070. And this quickly scales to unusable on the Switch 2.

Note that DLSS Super Resolution 1440p costs 0.66 ms on a 3070 and 0.37 ms on a 4080 (4070 frametimes are not given) so Super Resolution 1440p probably costs around 0.5 ms on a 4070.
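Scaling this post's own rough numbers (the ~6x relative-power figure and the per-feature costs are all the post's estimates, not measurements) gives a feel for the budget:

```python
# Back-of-envelope scaling of this post's rough figures; every input is an estimate.
RR_COST_4070_MS = (1.5, 2.0)      # estimated Ray Reconstruction cost on a 4070
SR_COST_4070_MS = 0.5             # estimated DLSS Super Resolution 1440p cost on a 4070
RELATIVE_POWER = 6                # the post's "~6x more powerful than Switch 2" guess

rr_low, rr_high = (ms * RELATIVE_POWER for ms in RR_COST_4070_MS)
sr = SR_COST_4070_MS * RELATIVE_POWER
print(f"Ray Reconstruction on Switch 2: ~{rr_low:.0f}-{rr_high:.0f} ms per frame")
print(f"DLSS SR 1440p on Switch 2:      ~{sr:.1f} ms per frame")
print("Frame budgets: 16.7 ms at 60 fps, 33.3 ms at 30 fps")
```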

So, will NVIDIA make a more limited version of ray reconstruction to help the Switch 2 as well as like 3060 owners? Uhhhhh, not sure, but would be nice.
 
Makes me wonder if Nintendo still has some games ready for the Switch 2 pipeline, but the delays were mostly because of 3D Mario.

Like, I'm quite impressed that Nintendo now has the capability of withholding games, compared to the Wii U era, in which we waited so long for games to arrive.

 
Makes me wonder if Nintendo still has some games ready for the Switch 2 pipeline, but the delays were mostly because of 3D Mario.

Like, I'm quite impressed that Nintendo now has the capability of withholding games, compared to the Wii U era, in which we waited so long for games to arrive.


They already held back FE Engage and XB3 before, which was a little bit beyond my expectations.
 


So regarding Hellblade 2 and how it runs on Xbox Series S and the Deck...

Assuming this game launched on Switch 2, what would be the expected performance we could get, considering that the resolution and lack of Lumen on Series S are the biggest downgrades, and upscaling via DLSS plus RT would, in theory, be better on Switch 2?


Well, I tried to play around with the RTX 3050 Mobile 4 GB and underclock it like I did for Control, but it's not possible. 4 GB, even with a 720p target resolution and DLSS Ultra Performance, eats up all the VRAM. I know that the game needs 6 GB of VRAM at its minimum spec, but I thought, why not try it out 😅.



The difference between the simulated "T239" speed and the GPU at full speed is quite marginal.

[Screenshots: underclocked + voltage limit (lowest possible), sub-20 fps vs. default, ~20 fps]
 
I expect many demanding multiplatform ports to target 1080p 30 FPS after DLSS, and that would be a fantastic outcome. I don't expect the same degree of diminished visuals as with the Switch's 'impossible ports', which sacrificed resolution/framerate/detail all at once.
This seems to come up a lot, but I doubt it. Unless DLSS is much more costly than we've been guesstimating, it's decreasing the input resolution that will have a much bigger effect on system resources, especially if it only needs to be done 30 times each second. Unless they're really pushed as far as doing something like sub-360p input in docked mode, in which case the advantages of upscaling to 1440p rather than 1080p would be smaller.

Though I'm tempted to play with DLSSTweaks and see what some stuff like 360p->1440p comes out like.
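A quick pixel-count comparison behind that argument (treating base render cost as scaling roughly with input resolution and DLSS cost as scaling mostly with output resolution, which is a rule of thumb rather than a measurement):

```python
# Pixel counts for common DLSS input and output resolutions.
def pixels(w, h):
    return w * h

inputs = {"360p": (640, 360), "540p": (960, 540), "720p": (1280, 720)}
outputs = {"1080p": (1920, 1080), "1440p": (2560, 1440)}

for name, (w, h) in inputs.items():
    print(f"input  {name}: {pixels(w, h) / 1e6:.2f} MP")
for name, (w, h) in outputs.items():
    print(f"output {name}: {pixels(w, h) / 1e6:.2f} MP")

print(f"{1 - pixels(960, 540) / pixels(1280, 720):.0%} fewer base-render pixels going 720p -> 540p")
print(f"{pixels(2560, 1440) / pixels(1920, 1080) - 1:.0%} more DLSS-output pixels going 1080p -> 1440p")
```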
 
This seems to come up a lot, but I doubt it. Unless DLSS is much more costly than we've been guesstimating, it's decreasing the input resolution that will have a much bigger effect on system resources, especially if it only needs to be done 30 times each second. Unless they're really pushed as far as doing something like sub-360p input in docked mode, in which case the advantages of upscaling to 1440p rather than 1080p would be smaller.

Though I'm tempted to play with DLSSTweaks and see what some stuff like 360p->1440p comes out like.

To me it's less a question of the technical capability of the system and more one of priority for developers, i.e. the minimum they might be willing to ship to deliver the port on time. If dynamic DLSS is in play, they may allow the input to vary as it needs to, and keep 1080p output as a 'safe' resolution that can guarantee a nice image. Especially since it meets the minimum target of handheld mode being 1080p.

Preferably they aim as high as they can; I don't really know how much work it takes to optimize for one resolution vs. the other.
 
Well, I tried to play around with the RTX 3050 Mobile 4 GB and underclock it like I did for Control, but it's not possible. 4 GB, even with a 720p target resolution and DLSS Ultra Performance, eats up all the VRAM. I know that the game needs 6 GB of VRAM at its minimum spec, but I thought, why not try it out 😅.
Yeah that is almost entirely VRAM bottleneck city right there.

I've watched some more RTX 2050 gaming benchmarks and mentally deducted ~25% of the performance, and honestly most of them look and run pretty well. Compared to the Switch, the improvement in visuals is astronomical.
 

