
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (New Staff Post, Please read)

[Image: TTYD remake "Super Nintendo Switch" button-prompt theory comparison]




About the Super Famicom color scheme theory, I'm in agreement. Partially because I want the "Super Nintendo Switch" name theory to come true haha, but also because the pattern is real -- Nintendo has quietly made a habit of matching their button prompts to the Super Famicom controller's color scheme. The new Paper Mario TTYD remake, the Super Mario RPG remake, and even Fashion Dreamer of all things. It's exactly the kind of thing Nintendo would do, because they've done it in past gens: on-screen button prompts shared the same color and shape as the buttons on the controller to make them easy to understand.

Once is a coincidence, twice is a pattern, three times is...

(Using this image by PyrpleForever on Reddit for the Paper Mario TTYD remake analysis)
I wish they'd at least adopt the colored button letters they used on the New 3DS XL if they want to maintain that professional look, but having muted/slightly desaturated colored buttons like the ones in Fashion Dreamer would also work well.

The Super Famicom/UK Super Nintendo glyphs have aged particularly well.
 
I had a dream that it was announced and the gimmick was attaching two consoles together with magnets lol. It was still just a codename for some reason though.
 
Been meaning to ask this question, originally had it as a reply to some random post a couple of days ago, but here it goes: how much do people here think Nintendo's marketing this go-around is actually going to focus on technical bullet points or simply highlight better specs, or are we overselling this new device? This isn't to criticize getting excited about tech specs in the least, because honestly the swirl of tech leaks has me just as excited as anyone here, but Nintendo's history of sacrificing powerful-but-costly hardware for affordability and accessibility does have me interested in how all this is going to look in practice...

Once again, this isn't me asking "do you honestly think this is happening?!" so much as asking through a number of specifics: has tech gotten to the point that affordable and competitive are no longer mutually exclusive? Does Nintendo announce beefier hardware, or do the games speak for themselves? How does a potentially 400-500 dollar teched-up device look given Switch's base of families/kids? Is there a hook, or is "the thing you own but better" the new device's entire selling point? Does the potential hole left by Microsoft give Nintendo more of an incentive to focus on the triple-A gaming market?

So in short: how do you all think this is going to look in practice??
 
I wish they'd at least adopt the colored button letters they used on the New 3DS XL if they want to maintain that professional look, but having muted/slightly desaturated colored buttons like the ones in Fashion Dreamer would also work well.

The Super Famicom/UK Super Nintendo glyphs have aged particularly well.
What I hope for is that the grey circle around the buttons seen on the SFC makes a return with the coloured buttons. So they're always against a plain background, no matter the colour or special edition design of the controller itself.
 
MLID gives me huge SMD vibes...
MLID is known to delete older posts/videos when proven wrong to erase any evidence. Unfortunately for him, we remember. His insistence on it being on 8nm was no more than him being a clown with zero understanding about the subject matter.

8nm may have made sense years ago, with an SM count of 6 or fewer. That was before the Nvidia leak, so before we had hard evidence for 12 SMs. Beyond that point SEC8N made no sense due to power and efficiency concerns, and anyone with a shred of knowledge about chip manufacturing should have seen how improbable it was that Nintendo would buy, en masse, a die potentially the size of GA107 (or bigger) and put it inside the Switch 2. Even much larger devices like the Steam Deck and Lenovo Legion Go barely have enough space for a die of that size. The 16nm die-shrunk version of the TX1 was exactly half the size of GA107.

SEC8N has given Nvidia some of the worst perf/watt uplifts since Fermi; I watched Iceberg Tech's video on the 3080 Ti today and wasn't surprised to see power consumption values as high as 380-400W to deliver 10-15% more performance than my 6800, which gets by at 200-230W. Now, there are a lot more variables at play, so this is by no means a scientific comparison, but Ampere on SEC8N really wasn't very efficient compared to the entire RDNA2 portfolio on TSMC 7nm. With 12 SMs on SEC8N we would've been looking at worse perf/watt than the OG Steam Deck.
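Just to put rough numbers on that (a back-of-the-envelope sketch only; the wattage and performance figures are the ones quoted above, and the die areas are approximate published estimates, so treat all of them as assumptions):

```python
# Rough perf/watt comparison using the midpoints of the figures quoted above.
rtx_3080ti_watts = 390        # midpoint of the 380-400 W range
rx_6800_watts = 215           # midpoint of the 200-230 W range
relative_perf_3080ti = 1.125  # assume ~12.5% faster than the 6800 (midpoint of 10-15%)

ratio = (relative_perf_3080ti / rtx_3080ti_watts) / (1.0 / rx_6800_watts)
print(f"3080 Ti perf/watt vs 6800: {ratio:.2f}x")   # ~0.62x, i.e. ~40% worse in this crude comparison

# Die-size sanity check for the 8nm argument (approximate published areas).
ga107_mm2 = 200               # rough GA107 die area
tx1_16nm_mm2 = 100            # rough die-shrunk TX1 ("Mariko") area, about half of GA107
print(f"GA107 is ~{ga107_mm2 / tx1_16nm_mm2:.0f}x the area of the 16nm TX1")
```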
 
MLID is known to delete older posts/videos when proven wrong to erase any evidence. [...]
So why the hell are we still listening to this guy as if he's a reputable source?
 
The most recent posts before he left discussed the internal delay with his sources. If the message to developers was abrupt, then I don't see the incongruity.

There's no chance Nintendo set a deadline of Q2 2024 for a system that they were very uncertain could release in 2024 imo.

Just would be wildly incompetent and setting a pointlessly aggressive deadline for partners for no reason.
 
* Hidden text: cannot be quoted. *
That's the one I was thinking of, thank you.

There's no chance Nintendo set a deadline of Q2 2024 for a system that they were very uncertain could release in 2024 imo.

Just would be wildly incompetent and setting a pointlessly aggressive deadline for partners for no reason.
No chance? Huh?

There was a precedent too, Nintendo also internally delayed Switch 1. The Q2 2024 might very well have been the original plan, but when there's a rumored scenario occurring where the star lineup (think: 3D Mario, or new MKart, or whatever) isn't ready, Nintendo isn't going to have any hesitation about delaying it for another year, just so the lineup can be ready for launch day. It's not exactly unprecedented.

How do you think it would look for Nintendo if Nintendo did this instead:

"Whoops! We realized we were ready to go Q2 2024 when we thought there might be a chance we'd delay to Q2 2025. Sorry partners about not telling you we were planning to release in Q2 2024.. any of you partners have something ready to go by any chance?" crossing fingers
 


So regarding Hellblade 2 and how it runs on Xbox Series S and the Deck...

Assuming this game launched on Switch 2, what would be the expected performance we could get, considering the resolution and lack of lumen on Series S are the biggest downgrades and upscaling via DLSS plus RT would, in theory, be better on Switch 2?
 
That's the one I was thinking of, thank you.


No chance? Huh?

There was a precedent too, Nintendo also internally delayed Switch 1. The Q2 2024 might very well have been the original plan, but when there's a rumored scenario occurring where the star lineup (think: 3D Mario, or new MKart, or whatever) isn't ready, Nintendo isn't going to have any hesitation about delaying it for another year, just so the lineup can be ready for launch day. It's not exactly unprecedented.

Why would you set a deadline five months ahead of time for third-party games?

Like, "sorry, no thanks on Assassin's Creed Red, it may be releasing near our Switch 2 release date, but it wasn't finished five months ahead of the release, so we're not going to allow you to be there on launch day."

And why would you set a deadline five months ahead of time when... you're very uncertain whether you'll even hit that date yourself?
 


So regarding Hellblade 2 and how it runs on Xbox Series S and the Deck...

Assuming this game launched on Switch 2, what would be the expected performance we could get, considering the resolution and lack of lumen on Series S are the biggest downgrades and upscaling via DLSS plus RT would, in theory, be better on Switch 2?


This game bombed horribly and I doubt Microsoft puts in the effort to port it to Switch 2.
 
This game bombed horribly and I doubt Microsoft puts in the effort to port it to Switch 2.
...which is why they'll probably leave it to someone like Saber Interactive, Iron Galaxy, or heck, maybe even Shiver Interactive.

I mean, people were in doubt if something like Hellblade was ever going to get ported to Switch, and it happened anyway.
 
Why would you set a deadline five months ahead of time for third-party games?

Like, "sorry, no thanks on Assassin's Creed Red, it may be releasing near our Switch 2 release date, but it wasn't finished five months ahead of the release, so we're not going to allow you to be there on launch day."

And why would you set a deadline five months ahead of time when... you're very uncertain whether you'll even hit that date yourself?

I'm sorry - what exactly do you find so surprising or unbelievable about Nintendo doing an internal delay? It happens all the time in the industry, it's not unique to Nintendo.
 
I'm sorry - what exactly do you find so surprising or unbelievable about Nintendo doing an internal delay? It happens all the time in the industry, it's not unique to Nintendo.

I feel like you're intentionally not understanding this series of posts.

The poster I'm criticizing reported over and over again that third-party developers had a Q2 2024 deadline to complete their games if they wanted them to be launch titles, for a launch that was probably Q4 2024.

This deadline makes no sense at all, is pointlessly strict on third parties, and would cause problems getting fall games onto the system on launch day.

But this deadline becomes even more absurd when you consider that the Switch 2 was likely delayed to 2025, and delays are always preceded by uncertainty about whether or not you'll be able to hit the original target.

So according to this poster, Nintendo was telling third parties that they had to have their games done by June 30th at the latest to launch with the Switch 2, a system that would release in early November at the earliest, and most likely not until March or April based on Nintendo's own internal projections at the time.

Can you imagine crunching for this pointless June 30th deadline only to hear that the system wasn't going to release until March 2025 at the earliest?
 
...which is why they'll probably leave it to someone like Saber Interactive, Iron Galaxy, or heck, maybe even Shiver Interactive.

I mean, people were in doubt if something like Hellblade was ever going to get ported to Switch, and it happened anyway.

Will Microsoft even bother to spend the money hiring a port studio?

The Steam numbers for Hellblade 2 are catastrophic.
 
I'm guessing Hellblade 1's sales are going to end up much higher than Hellblade 2's as shown by how badly Hellblade 2 is selling based on public Steam data.
As I've said, I don't think sales numbers on Steam are particularly indicative of whether a certain platform gets a port or not.
 
I think LM2HD looks nice! I never played the original, so I'm excited to dive in.

I just wonder about the Samus Returns effect. 3DS had a good install base, and Samus Returns was a big juicy game that looked good on the hardware. Came out six months into the Switch's lifetime, with at least 70 million people who hadn't upgraded. But still Switch games kinda paved it over. Now I'm imagining the same thing, but for a game that, as an uprezzed 3DS game, doesn't even look at the top end of Switch 1 games. Though maybe being cross gen helps - no one feels like they're throwing away their dollars on a game that they won't be able to play on the system they're already planning on buying.

On the other hand, I also think about how it opens up options. Games that would require a substantial investment to rework their control scheme seem off the table for a late-stage Switch "remaster." The ROI isn't great, you probably want more senior Nintendo involvement to get the tuning correct, and that kind of game might be more in remake than remaster/uprez territory. But if it's a Switch 2 launch-year game, suddenly that kind of investment makes sense, even if the primary audience is the Switch 1.

I also think about something like Tomodachi Life. That's a class of game that feels like it benefits from a big install base rather than helps drive the install base. At least for a console positioned how Switch was positioned. With DS/3DS, Tomodachi Life feels a little more like the core pitch of the console. Maybe I'm misreading the situation, or maybe Nintendo decides that, the Switch having been solidly established as a "core gamer" product, they don't need to aggressively appeal to that demo at launch in the same way this time.

None of this is "will or won't" or "should or shouldn't"; it just shuffles up my perception of what is likely.
I think it's worth remembering that Samus Returns had a fair bit more working against it than just being a 3DS game. It was a somewhat controversial remake of a game that was already considered the black sheep of the series, and (aside from the weird Prime spin-offs) is the only 30fps Metroid to this day. Being cross gen with Switch (or even just playable on it) certainly would have helped, but it wouldn't have solved all that game's issues.

Also, when Nintendo feels comfortable releasing emulated GCN/Wii games with minimal graphical touchups on Switch, I don't think a 3DS game with actually substantial overhauls to its assets (like LM2HD) would be especially out of place on Switch 2, especially if they're able to boost the resolution significantly above 1080p.
 
There's no chance Nintendo set a deadline of Q2 2024 for a system that they were very uncertain could release in 2024 imo.

Just would be wildly incompetent and setting a pointlessly aggressive deadline for partners for no reason.
I kinda agree, 2025 was ALWAYS in the cards. Nate said he believed Q1 2025 was an option, and that was in 2023 - if he knew that, then partners were likely given a timeframe; I'm less inclined to believe it was a hard deadline. If his info is to be believed, of course.
 
lol 3 fucking days to be labelled a catastrophe - some industry/fanbase we got here
For some reason he is always complaining about Xbox or its studios lol.

Microsoft didn't spend a dollar marketing Hellblade, that's for sure, but it's a niche game (focusing on a cinematic/film experience) and it's on Game Pass day 1.
 
This game bombed horribly and I doubt Microsoft puts in the effort to port it to Switch 2.
Respectfully, that's a bit beside the point since I'm only talking in hypotheticals.

Regarding what you said, I don't think we're really at a point where we can accurately know if it bombed or not. MS wants Game Pass subs, and if they got them, it didn't bomb.
Plus, even if it bombs, porting it to other systems can give it a second chance.
 
For some reason he is always complaining about Xbox or its studios lol.

Microsoft didn't spend a dollar marketing Hellblade, that's for sure, but it's a niche game (focusing on a cinematic/film experience) and it's on Game Pass day 1.
Microsoft DID spend money advertising it; I saw adverts for it on almost every subreddit I read, and I saw ads for it on several gaming sites.
 
I'm guessing Hellblade 1's sales are going to end up much higher than Hellblade 2's as shown by how badly Hellblade 2 is selling based on public Steam data.
Isn't that all the more reason for a Switch 2 release, though? Some Switch games sold by piggybacking on their technical achievement, and Hellblade was one of them. The first game was also not really that much better, yet it sold so many copies (not just on Nintendo) because it was impressive on a technical level, which fueled the hype around it.

Ignoring all that, I think MS will do anything they can to sell a few more copies if they can help it. Making a Switch 2 version shouldn't be too hard or costly, maybe even cheaper than Hellblade on the Switch. In fact I think MS is considering putting most of their AA and some AAA games on other platforms to help inflate sales figures, it's all about numbers these days.

I really feel bad for MS. They've been taking Ls one after another, and not all of them were their fault. Internally I think MS has been suffering from mismanagement for a while, and I think they bet too much money in the wrong areas, resulting in excessive lay-offs and even more bad decision making. Buying out ABK was a last ditch effort, and I hope it pays off; I don't want the Xbox brand to die off by next gen.
 
So why the hell are we still listening to this guy as if he's a reputable source?
So there are reports giving him credit for something he said, which is odd, because right before he said anything, a couple of users here were already talking about the tool AND posted the link here. So here we have this known website that just so happened to post a link to test this theory of 8nm vs 4nm, and suddenly, somehow, MLID had a change of heart? 🙄

But what I noticed is that it really illustrates how these tech/gaming journalists don't really do a good job researching.
 
Assuming this game launched on Switch 2, what would be the expected performance we could get, considering the resolution and lack of lumen on Series S are the biggest downgrades and upscaling via DLSS plus RT would, in theory, be better on Switch 2?
I think the performance would be the same, because 30fps is the target, and I don't think the CPU load is going to be prohibitive. The question is how good would it look at 30fps. I wrote a little post on it already, but my guess would be "noticeably lower quality, but not a nightmare downgrade."

The game really is beautifully and impressively optimized for Series S, and that means a rare game that feels "truly next gen" running well on hardware in Switch 2's ballpark. But it also means that the game is kinda optimized against the Switch 2's strengths.

The game is impressively optimized on VRAM... which undermines some of the benefit of Switch 2's larger RAM pool.

The game looks great with software lumen... which means that Switch 2's hardware lumen won't be transformative.

The game is built for upscaling... which means DLSS isn't a "free" performance win.

If I had to guess, a theoretical port would need to swap the 2x TSR upscale here for a 4x DLSS upscale: the output resolution stays the same, but the input resolution is halved. I think that is going to play merry hell with the volumetric lighting and fog, and the visual downgrade will be noticeable. Decent chance they enable hardware Lumen, but again, not transformative.
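To put numbers on that upscale swap (a hypothetical sketch; the output resolution below is an example figure, not the game's actual resolution, and I'm treating the 2x/4x factors as pixel-count ratios):

```python
# Hypothetical illustration of swapping a 2x TSR upscale for a 4x DLSS upscale.
def input_resolution(out_w, out_h, area_factor):
    """Internal render resolution for a given output resolution and pixel-count upscale factor."""
    scale = area_factor ** 0.5            # per-axis scale is the square root of the area factor
    return round(out_w / scale), round(out_h / scale)

output = (2560, 1440)                     # example docked output resolution (assumption)
print("2x TSR input: ", input_resolution(*output, 2))   # ~1810 x 1018
print("4x DLSS input:", input_resolution(*output, 4))   # 1280 x 720, half the pixel count of the 2x case
```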
 
So there are reports giving him credit for something he said, which is odd, because right before he said anything, a couple of users here were already talking about the tool AND posted the link here. So here we have this known website that just so happened to post a link to test this theory of 8nm vs 4nm, and suddenly, somehow, MLID had a change of heart? 🙄

But what I noticed is that it really illustrates how these tech/gaming journalists don't really do a good job researching.
He does have a few genuinely good sources with a good track record, and when he regurgitates their information he is seen as credible. He also (allegedly) happens to have more connections to insiders in the industry than most other self-proclaimed leakers (like RGT, for example) who are in actuality just information collectors, often from the same sources, which they then parrot to viewers in video form.

The problem is he also has bad sources mixed in with the good ones, and he himself isn't really all that smart, so he makes uninformed guesses by over-extrapolating from good information with no real logical input of his own, which he then passes off either as fact or as information he is "80% confident" or "90% confident" in, and so on.

There's another YouTuber named AdoredTV, and I personally think he's much more credible. At the very least he uploads only when he thinks he has enough credible information, which can often take months to accumulate. He also does journalism pieces with actual critical input, such as his work on Intel back in 2020 and their dubious history with AMD. He's also known to have made some bad takes in the past, which he at least acknowledges. Unfortunately I don't think he does this sort of content anymore, or maybe he's waiting again. The last time he made a video was 11 months ago, not counting a PC-building video hosted by someone else on his channel.
 
Thraktor's Guide to Concurrency in Ampere

One topic which comes up quite frequently in this thread (particularly in the context of DLSS) is how concurrency works in Ampere GPUs. In Nvidia's Ampere whitepaper, they point out new concurrency features in the architecture, without providing much detail on how they work, or what kind of limitations there are in using it, besides some graphs showing a reduction in frame time from running graphics, RT and tensor cores at the same time for Wolfenstein Youngblood.

I've been trying to understand how this works myself, and in the process get a better understanding of how Nvidia's GPU architectures work at a lower level, and I feel like I've got a good enough understanding of concurrency in Ampere, at least between regular shader code and tensor cores, to explain it in a way which might be useful to people. And, in the process of writing, will hopefully clarify some things for myself.

One thing I should mention is that what I'm describing below is a simplified explanation of how SMs work, mainly to make it easier to understand, but also because I don't know enough of the lower level details to speak with any confidence on it. The main simplification I'm making is ignoring pipelining, which is quite important, but also quite complex, and I feel the general points are the same even if we ignore pipelining, although the specific implementation differs a bit. I'm also going to ignore things like warps getting split at branches, complex instructions, etc.

A Quick Intro to GPUs

To start, I should cover some basic points on how GPUs work which will be relevant later. The most important of these is the concept of SIMT, or single-instruction-multiple-threads, which is the paradigm by which GPUs operate. This means what it says, which is that GPUs execute a single instruction across multiple threads of data at once. So, for example, if you have a pixel shader with a thread for each pixel, and there's an instruction which states "multiply X by Y and store the answer in Z", it will execute that instruction for every pixel in that thread group, even though they all may have different X and Y values.

In Nvidia's case, a group of threads which executes together is called a Warp, and a warp contains 32 threads. So each time an instruction is executed on a Nvidia GPU, it's run on a warp of 32 threads. At a higher level these are organised into what are called thread blocks, but that's not too important here.

Each warp is issued to an SM (which are the building blocks of Nvidia's GPUs) to execute on, continuing to execute instructions until the shader has completed. Ordinarily a GPU would have a very large number of warps issued to its SMs at any one time. In Ampere's case, it can handle up to 48 warps per SM, and with 12 SMs on T239's GPU, that would mean up to 576 total warps, or 18,432 threads issued at a time.
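As a quick illustration of SIMT and those occupancy numbers (a toy sketch in Python, not how a GPU actually executes anything):

```python
import numpy as np

# SIMT in miniature: one "instruction" (z = x * y) applied across a warp of 32 threads,
# each thread holding its own x and y values.
THREADS_PER_WARP = 32
x = np.arange(THREADS_PER_WARP, dtype=np.float32)
y = np.full(THREADS_PER_WARP, 2.0, dtype=np.float32)
z = x * y                                   # one instruction, 32 results

# Occupancy totals quoted above: 48 warps per SM, 12 SMs on T239.
MAX_WARPS_PER_SM, SM_COUNT = 48, 12
print(MAX_WARPS_PER_SM * SM_COUNT)                        # 576 warps in flight
print(MAX_WARPS_PER_SM * SM_COUNT * THREADS_PER_WARP)     # 18,432 threads in flight
```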

The Ampere SM

Here's Nvidia's diagram of an Ampere SM from the whitepaper:

[Image: Ampere SM diagram from Nvidia's whitepaper]


The Ampere SM is divided into four partitions, each of which contains registers, shader cores, tensor cores, instruction dispatch, etc. Each of these executes instructions independently from each other, and we'll look at them in more detail below. In addition to what's in the partitions, there is also an L1 cache/shared memory pool, texture units, and the RT core. We'll come back to the RT core later, but for the moment, let's focus on those partitions. Here's a diagram of a partition:

[Image: Ampere SM partition diagram]


An SM partition contains everything needed to execute shader instructions independently. There is a register file, which stores the data being executed on by the threads, load/store units to move data in and out of those registers, and a warp scheduler and dispatch capable of dispatching instructions across three different data paths. The first data path is capable of executing FP32 and INT32 instructions, the second one is capable of executing just FP32 instructions, and the third datapath is capable of executing "tensor core" instructions and FP16 instructions.

Dispatching and Executing Instructions

If you look at the diagram of the SM partition, you'll see it notes (32 threads/clk) next to the dispatch unit. This is pretty important, as what it's saying is that, within each SM partition, one warp of 32 threads can be dispatched to one of the three data paths each clock cycle. This means that you can't simultaneously dispatch instructions to, say, both FP32 data paths within the same clock cycle. You would have to dispatch an instruction for one warp to one data path on one clock cycle, and then dispatch an instruction for another warp to the other data path on the next clock cycle.

Just because you can't dispatch to multiple data paths on the same clock cycle doesn't mean you can't have multiple data paths executing concurrently, though. Otherwise having multiple FP32 capable data paths would be useless if you can't use them at the same time. The key to this is that instructions typically take multiple cycles to execute.

I'm going to ignore pipelining here to keep things simple, but if you look at the two FP32 capable data paths in the diagram, you'll see each one is divided into 16 blocks. Nvidia calls these "CUDA cores" in marketing, although they're not really cores. What they actually tell us is that each one of these data paths can execute 16 FP32 operations per clock cycle. Now, if an FP32 instruction for a warp containing 32 threads is dispatched to one of these data paths, and it can execute 16 ops per clock, then it's straightforward to see that a standard FP32 operation (like fused multiply add, for example) would take two clock cycles to execute on one of these data paths.

If it takes 2 cycles to execute an FP32 operation on a warp, and the dispatch unit can issue one warp per cycle, then we can see how having two FP32 data paths becomes useful, as you can dispatch to each data path on alternate clock cycles, and, in theory at least, get 100% utilisation of both simultaneously.
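Here's the arithmetic from the last two paragraphs written out (same simplifications as above, pipelining ignored):

```python
# A 32-thread warp on a 16-lane FP32 data path takes 2 cycles per instruction,
# so one dispatch per cycle is exactly enough to keep two FP32 paths fully busy.
THREADS_PER_WARP = 32
FP32_LANES_PER_PATH = 16
FP32_PATHS = 2

cycles_per_fp32_warp = THREADS_PER_WARP // FP32_LANES_PER_PATH        # 2 cycles
dispatches_needed_per_cycle = FP32_PATHS / cycles_per_fp32_warp       # 1.0, i.e. one per cycle
print(cycles_per_fp32_warp, dispatches_needed_per_cycle)
```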

Tensor Code Is Just Shader Code

One important thing to note here is that tensor cores are pretty much just big shader cores designed for a very specific operation. While shader cores perform operations like add or multiply on individual numbers (across multiple threads at the same time), tensor cores perform multiplication operations on matrices. They run instructions which sit in shader code just like the other data paths do, and if you look at Ampere's instruction set, you can see those instructions, labelled HMMA and IMMA. So when the dispatch unit comes across an FP32 instruction, it will send it to either of the first two data paths, when it comes across an INT32 instruction it will send it to the first data path, and when it comes across a matrix instruction, it will send it to the tensor core data path.

For those curious, it seems that these matrix multiplication instructions are synchronised across the entire warp, where a single matrix multiplication is split over the 32 threads in the warp. This makes a lot more sense than trying to execute 32 separate matrix multiplication operations simultaneously, which would require a huge amount of register space.

To understand how well the tensor core can operate concurrently with the other data paths, we need to know a bit more about these instructions. I'm going to focus on FP16 matrix multiplications but the same logic applies to TF32, BF16, INT8, etc. From Nvidia's documentation we know that Ampere supports two matrix sizes for these operations, 16x8x8 and 16x8x16. We'll first look at the 16x8x8 case, which means multiplying a 16x8 matrix by an 8x8 matrix. This requires 1024 FMA operations to execute.

We can calculate from Nvidia's advertised performance figures that each SM is capable of executing 512 FP16 tensor ops per clock, ignoring sparsity (their numbers claim double this, by counting FMA as two operations). This means that the tensor core in each SM partition can execute 128 FP16 operations per clock. So, a 16x8x8 multiplication which requires 1024 operations to complete would execute in 8 clock cycles. By the same logic, a 16x8x16 multiplication would execute in 16 cycles. Tensor ops in FP16 with sparsity use 16x8x32 multiplications (or really 16x8x16 after accounting for the sparsity structure) and also execute in 16 cycles.
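Those cycle counts as a quick calculation (using the 128 FP16 FMAs per clock per partition figure derived above):

```python
# Execution time of Ampere's FP16 matrix instructions on one SM partition's tensor core.
FP16_FMA_PER_CLK_PER_PARTITION = 128      # 512 per SM / 4 partitions

def mma_cycles(m, n, k):
    fma_ops = m * n * k                    # one FMA per output element per k step
    return fma_ops / FP16_FMA_PER_CLK_PER_PARTITION

print(mma_cycles(16, 8, 8))    # 1024 FMAs -> 8.0 cycles
print(mma_cycles(16, 8, 16))   # 2048 FMAs -> 16.0 cycles
```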

Running Tensor Code Concurrently With Shader Code

Knowing that tensor cores run shader instructions, just like regular shader cores, and knowing how long those instructions take to execute, we can start to get an idea of how running tensor code like DLSS alongside regular shader code works.

Firstly, tensor code is bundled up in threads and warps just the same as any other code. For those 48 warps issued to an SM, you would have to have some regular shader code warps issued, and some warps which use tensor cores. For the sake of an example, let's say you have 32 warps of regular shader code, and 16 warps which use tensor cores. Then, for each SM partition, we can assume there are 8 warps of regular shader code issued to it, and 4 warps which use tensor cores.

Each cycle, the dispatch unit in an SM partition can dispatch one instruction from one of those 12 warps to one of the three data paths. To help illustrate how this would work, I've created a diagram showing a theoretically optimal case involving just standard FP32 instructions (let's say FMA) and FP16 matrix instructions of size 16x8x16:

[Image: dispatch/execution timeline for mixed FP32 and 16x8x16 tensor instructions]


Each row is one clock cycle, going from top to bottom, and each column is one of the three data paths. On the left side I've shown what the dispatch unit is doing that cycle, and in each column I've shown colour when an operation is being executed (blue for FP32, green for tensor) with a dark line at the top where it's dispatched. If a cell is white, that data path is idle.

You can see here how all three data paths can be kept reasonably busy even though the dispatch unit can only issue an instruction to one of them each cycle. The first two data paths could theoretically achieve full utilisation if there was no tensor code, but even with tensor code being dispatched it has a relatively small effect, just forcing one idle cycle for every 16 execution cycles. The key to this is how long the tensor ops take to execute. If they completed very quickly, then there would be more of the idle cycles you see in the other two data paths whenever a tensor op needs to be issued.
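The same schedule can be written out as a small sketch (my own simplified model of the diagram above: a 17-cycle repeating pattern of one tensor dispatch followed by 16 alternating FP32 dispatches, with a warp always assumed ready):

```python
# Utilisation of the three data paths under the "theoretically optimal" pattern described above.
PERIOD = 17                      # 1 tensor dispatch + 16 FP32 dispatches (8 per FP32 path)
FP32_OP_CYCLES = 2               # 32-thread warp on a 16-lane path
TENSOR_OP_CYCLES = 16            # 16x8x16 FP16 matrix instruction

fp32_busy_per_path = 8 * FP32_OP_CYCLES      # 16 busy cycles per FP32 path per period
tensor_busy = 1 * TENSOR_OP_CYCLES           # 16 busy cycles on the tensor path per period

for name, busy in [("FP32 path A", fp32_busy_per_path),
                   ("FP32 path B", fp32_busy_per_path),
                   ("tensor path", tensor_busy)]:
    print(f"{name}: {busy}/{PERIOD} cycles busy = {busy / PERIOD:.1%}")
# Each path comes out at ~94%, i.e. one idle cycle per 16 execution cycles,
# which is where the ~94% figure in the TL;DR below comes from.
```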

It's important to note that this is a theoretically optimal case. The dispatch unit isn't always going to be able to issue an instruction on every cycle, because there may not be a warp with an instruction ready for dispatch. These warp stalls, as they're called, can be caused by a number of reasons, for example waiting on data to arrive from RAM, and are the reason for the large number of warps which are issued at a time. Even if some of the warps are stalled for whatever reason, having 12 of them to dispatch from means it's very likely that at least one of them will be ready to dispatch on any given clock cycle. Still, that's not guaranteed, and there are inevitably going to be missed cycles here and there.

On the specific issue of running shader code concurrently with tensor code like DLSS, although they can execute concurrently with almost full efficiency in a theoretical setup, there are additional bottlenecks to running them together. For instance, issuing a few DLSS warps to an SM alongside your shader code means there are fewer shader code warps available, so you're more likely to see all of them stalled, and the dispatch unit unable to issue to the FP32/INT32 data paths, than if you had a full complement of shader code warps. The same is the case with warps using tensor core code, where it's more likely that (say) 4 warps are all stalled waiting for memory than if all 12 warps were lining up to use the tensor cores.

Speaking of memory, regular shader code and DLSS would be competing for both cache and memory bandwidth. Hopefully this isn't too much of an issue with LPDDR5X providing more bandwidth than we expected, but it's still a potential limiter on performance when running them concurrently. Finally, I should note that ML models like DLSS aren't 100% matrix multiplication, and there are things like activation layers in there which will require regular shader code (likely FP16), but that should be a relatively small portion of the execution time.

Running FP16 Code Concurrently With FP32 Code

Another closely related topic that has come up a few times is the possibility of running non-tensor FP16 code concurrently with FP32 code, as the Ampere whitepaper states "standard FP16 operations are handled by the Tensor Cores in GA10x GPUs". In fact, I've argued myself in the past that this could be useful for developers with a mix of FP32 and FP16 code, but it seems like it's less useful in practice than running tensor code concurrently with FP32 code.

The reason for this is execution time. While tensor operations are on nice big matrices which take up to 16 cycles to complete, which leaves a lot of time for the dispatch unit to also issue FP32 instructions, non-tensor FP16 instructions are at the other end of the spectrum, executing very fast. From Nvidia's performance figures, we know that non-tensor FP16 instructions are executed at a rate of 128 per SM per clock, which means that a tensor core data path in one of the SM partitions can execute 32 FP16 operations per clock when running non-tensor code.

With the dispatch unit issuing one warp of 32 threads per clock, though, this means that non-tensor FP16 instructions execute in a single clock cycle. So, the dispatch unit would have to dispatch to the tensor core data path every single clock cycle to fully utilise the non-tensor FP16 performance available, and it can't do that while also dispatching to the other two data paths.

Here's another diagram like the one above, but now showing a combination of FP32 instructions and non-tensor FP16 instructions (the latter in red):

[Image: dispatch/execution timeline for mixed FP32 and non-tensor FP16 instructions]


You can see the issue here: the dispatch unit is dispatching on every clock cycle, but there are still a lot of idle cycles across the three data paths. In fact, so long as it's dispatching every clock cycle, the achievable performance is 32 operations per clock (or 128 per clock for the entire SM) regardless of whether those are FP32 instructions or FP16 instructions or a mixture of both, just by virtue of the dispatch limitation. Taking pipelining into account would change the behaviour a bit, but the dispatch limitation would remain the same either way.
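Or, as arithmetic (same dispatch assumptions as above):

```python
# With one warp-instruction dispatched per cycle and single-cycle non-tensor FP16 ops,
# the dispatch unit itself caps throughput at 32 ops/clk per partition (128 per SM),
# regardless of how FP32 and non-tensor FP16 instructions are mixed.
THREADS_PER_WARP = 32        # ops completed per dispatched warp-instruction
DISPATCHES_PER_CYCLE = 1
PARTITIONS_PER_SM = 4

ops_per_clk_partition = THREADS_PER_WARP * DISPATCHES_PER_CYCLE
print(ops_per_clk_partition)                       # 32
print(ops_per_clk_partition * PARTITIONS_PER_SM)   # 128 per SM, matching the figure above
```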

That doesn't mean that using FP16 isn't worthwhile, as it takes up less space in memory, less register space, and less bandwidth, even if you're not getting faster execution. I'm also very curious whether any developers can make use of the tensor cores' matrix multiplication operations for non-ML use cases. If you've got a problem which maps well to matrix multiplication and where FP16 is sufficient, then you could achieve a very large speedup by rewriting your shaders to make use of the HMMA ops. (I'm also curious whether Nvidia have provided tools to do so, as the way in which matrices are synchronised across the warp means these use cases would have to be handled a bit differently from regular shader code by the compiler.)

Running RT Cores Concurrently With Everything Else

I've focussed mostly on running tensor core code like DLSS concurrently with regular shader code, but another feature of Ampere is the ability to run RT concurrently with both of these. There's less to say here, as Nvidia is more explicit about the functionality, saying in the whitepaper that "The new GA10x SM allows RT Core and graphics, or RT Core and compute workloads to run concurrently, significantly accelerating many ray tracing operations."

The RT core is responsible for BVH traversal and triangle intersection testing, ie finding exactly what triangle a given ray intersects, and where, and it's fixed-function hardware which sits apart from the SM partitions we talked about before, so it makes sense that it can be made to operate independently. This doesn't cover the entirety of RT workloads, as you still need shaders to create rays and process them after a hit is found, and to perform any work required after that (eg shading reflections), but it means the shader cores don't have to sit idle while the RT cores are doing their thing, or vice-versa.

TL;DR:

From purely a point of view of executing instructions within an SM, Ampere GPUs should be able to concurrently execute both regular shader code and tensor core code like DLSS, with theoretically up to about 94% efficiency achievable given the limits of the dispatch units. In reality there's likely to be contention between the workloads over things like cache and bandwidth, so real-world benefits would be lower, but I'd still imagine you'd get a good performance boost over running them sequentially, particularly given the relatively high bandwidth from LPDDR5X.

For FP32 and non-tensor FP16 code, while they can operate somewhat concurrently, the limitation of the dispatch unit only being able to dispatch one warp per cycle, combined with the very quick execution time of the FP16 instructions, means the benefit to running them together is small. There still can be register/memory/bandwidth benefits to using FP16 code, though.

RT cores can operate concurrently with everything else, as per Nvidia's whitepaper. Not much more to say here.

This is all based on what I can understand from Nvidia's whitepapers and other sources online, but it's not something that Nvidia have ever fully documented, publicly at least. So if anyone has any corrections or anything else to add it would be much appreciated.
 


So regarding Hellblade 2 and how it runs on Xbox Series S and the Deck...

Assuming this game launched on Switch 2, what would be the expected performance we could get, considering the resolution and lack of lumen on Series S are the biggest downgrades and upscaling via DLSS plus RT would, in theory, be better on Switch 2?

Lumen GI is mandatory in this game; you can't turn it off. Series S just drops Lumen reflections. The game could scale lower still if Lumen could be turned off completely,

which leads me to ask what the game would look like without it 🤔
 
Switch 2 isn't a last-gen console though.
I think the idea is that if the PS4 and Xbox One are getting it, there shouldn't be any technical reason why it shouldn't come to the Switch 2. Not that I had any doubt, outside of MS and/or Nintendo feeling like CoD isn't going to sell on the Switch 2 or some other business reasons that we're not privy to.
 
They mean that the game is releasing on PS4/One, so there shouldn't be a problem with Switch 2 receiving a port.
I assumed as much anyway, CoD doesn't seem like the most demanding game and Switch 2 is more capable than those consoles.
 
Thraktor's Guide to Concurrency in Ampere [...]
Thanks for this writeup. I don't have anything to add to it, but I do want to mention to people for context that concurrency and overlapped workloads aren't a theoretical subject for Nintendo's new hardware. On the Switch today, Nintendo recommends different approaches to the render loop that trade off between latency and frame time for choosing to do CPU and GPU processing sequentially or concurrently. When DLSS is introduced, it can work the same way with additional concurrency inside the GPU, and hopefully Nintendo will provide concurrency examples as part of their documentation for it too.

Switch 2 isn't a last-gen console though.
And also Microsoft is contractually required to release CoD on Nintendo platforms, so it wouldn't matter even if the new one was skipping PS4/XBO.
 
Switch 2 isn't a last-gen console though.

I’m saying that most big publishers are still doing cross-gen. (Heck, Yakuza and Persona 3 were cross-gen, and even the RE4 remake.)

Like COD still getting PS4 support is pretty crazy to think about.

Like, from everything we’re hearing about the Switch 2, it’s pretty much a souped-up PS4 Pro with modern architecture that will almost reach Series S level with DLSS.
 