StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (Read the staff posts before commenting!)

Oldpuck's leaving #teamleapday? Guess I'll have to continue flying the flag solo.

Regarding the 18 to 24 months from sampling to release, I assume you're talking about the TX1 and the original Switch? If so, I don't think it's really comparable, as the TX1 was already entering production independently of Nintendo; Nvidia had other customers for it (not many, but still...). Correct me if I'm wrong, but I believe Nintendo only officially signed with Nvidia around the time tape out happened, so the question wasn't really about hardware timelines, it was about how quickly Nintendo could produce a lineup of games for hardware they had no experience with, using APIs and tools they had no experience with. Even two years is cutting it a bit tight in those circumstances.

Now, however, we're looking at a chip designed for Nintendo, with hardware, tools and APIs that are evolutions of what they're already familiar with. It appears that Nintendo have already had at least three years to develop software for this new console, and rather than tape out being the start of a mad dash to get a games lineup ready, tape out would have been the point where Nintendo had all their ducks in a row and gave Nvidia the thumbs up to get the hardware side in motion.

From leakers, it appears that Nvidia's typical tape out to product launch is under a year these days, with around 8 months for the first Ada GPUs. I'd expect a Nintendo chip to have a longer gap than that, between wanting a higher volume launch than is typical for GPUs, and a bit of a buffer for safety (if they need to do a second stepping). Let's say 14 or 15 months. As you say, sampling began between April and August last year, which is pretty much right where I'd expect it to be for a holiday 2023 launch window. That is to say, I'm reasonably confident that Nintendo's plan was to launch [redacted] holiday 2023, and that was their plan as recently as the middle of last year.

Of course, plans can change, and it's entirely possible that Nintendo have reconsidered since then, with software delays being the most likely culprit, but I don't agree that pushing back a year to holiday 2024 is the most likely outcome. Holiday launches are a good bet for initial plans, as if you're looking at things three or four years in advance, before you've started work on any games for the system, you'll probably go for a holiday launch. Delaying a system with a pipeline of games already in development is a very different thing, though. Nintendo absolutely made the right choice launching Switch in March rather than pushing back to the following holiday, as they had the pipeline of games to support it, and ended up with far more consoles in players hands by the end of 2017 than if they had gone for a holiday launch.

In [redacted]'s case, this is further complicated by the almost guaranteed presence of cross-gen titles. I don't think it's a stretch to assume that any games Nintendo releases on Switch after [redacted]'s launch will be cross-gen, with some kind of enhancement on the new hardware. That being the case, if Nintendo were planning for [redacted] to launch in Q4 2023, they would have two types of games in their pipeline for release from that point onwards: [redacted] exclusives and cross-gen games. On the one hand, this gives them a buffer, as they can release those cross-gen games as regular Switch titles if [redacted] is delayed. On the other hand, they almost certainly don't want to do this, as it means they can't announce these titles as [redacted] games, which would weaken the next-gen lineup.

Let's say Nintendo's planned lineup for the first year of [redacted] consisted of Major Flagship Game™️, a [redacted] exclusive, at launch, with maybe a couple of other exclusives over the year, and the rest of the lineup consisting of cross-gen games. Let's also say that Major Flagship Game™️ doesn't look quite flagship-worthy, and they decide to give it a few more months in the oven, and hence also delay [redacted] itself. If they delay a full year, then in order to keep money coming in they have to release almost all of those cross-gen games as regular old Switch titles, so rather than the delay resulting in a better game lineup, the delay would actually be eating into [redacted]'s software offering. Put another way, if Nintendo has enough games to keep Switch alive another year, they have enough games to keep [redacted] going through its first. Sacrificing the latter for the former doesn't make much sense to me.
I'm with Thraktor, Switch 2 will be in holiday 2023, not holiday 2024
 
The likelihood of Nintendo going with anything but T239 for a console that will be released before 2026 is near zero.

It's a chip that's tailor-made for a gaming console. Many of the design decisions don't make sense any other way (the 8-core A78 CPU, the FDE). And it's not likely Nvidia can offer anything that's worth upgrading for anytime soon.
Indeed, T239 Drake is the chip, and it will not change.
 
1 TFLOPS, with a 720p VRR screen and Nintendo's software stack, would be a PS4 in your hand, with power left over. And on the TV, a simple 2.25x jump would give you plenty of power to take those 1080p PS4 images up to 4K. I would be very happy with that.
Supplemental: There is the possibility that Nintendo goes with a more elaborate setup. Nintendo already lets game devs select multiple performance profiles (including one specifically optimized for loading areas), and Nvidia has some unusual dynamic scaling tech.

There is a chance we get something more like console DVFS, or a setup where Nintendo offers a "high GPU/low CPU" profile in addition to the "balanced" profile and "high CPU/low GPU" profile that Switch has now. In which case, all options may be true, just not all at once.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).
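
(For reference, those headline figures fall straight out of the SM count and clock. A quick back-of-envelope sketch, assuming the standard 128 FP32 cores per Ampere SM and using my guessed clocks above:)

Code:
# Back-of-envelope FP32 throughput for an Ampere GPU.
# Assumes 128 FP32 CUDA cores per SM and 2 FLOPs per core per cycle (FMA).
def ampere_tflops(sms: int, clock_ghz: float) -> float:
    return sms * 128 * 2 * clock_ghz / 1000  # GFLOPS -> TFLOPS

print(ampere_tflops(12, 0.550))  # ~1.69 TFLOPS (my portable guess)
print(ampere_tflops(12, 1.100))  # ~3.38 TFLOPS (my docked guess)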

To explain my reasoning, let's play a game of Why does Thraktor think a TSMC 4N manufacturing process is likely for T239?

The short answer is that a 12 SM GPU is far too large for Samsung 8nm, and likely too large for any intermediate process like TSMC's 6nm or Samsung's 5nm/4nm processes. There's a popular conception that Nintendo will go with a "cheap" process like 8nm and clock down to oblivion in portable mode, but that ignores both the economic and physical realities of microprocessor design.

To start, let's quickly talk about power curves. A power curve for a chip, whether a CPU or GPU or something else, is a plot of the amount of power the chip consumes against the clock speed of the chip. A while ago I extracted the power curve for Orin's 8nm Ampere GPU from a Nvidia power estimator tool. There are more in-depth details here, here and here, but for now let's focus on the actual power curve data:
Code:
Clock      W per TPC
0.42075    0.96
0.52275    1.14
0.62475    1.45
0.72675    1.82
0.82875    2.21
0.93075    2.73
1.03275    3.32
1.23675    4.89
1.30050    5.58

The first column is the clock speed in GHz, and the second is the Watts consumed per TPC (which is a pair of SMs). Let's create a chart for this power curve:

[Chart: orin-power-curve.png, power per TPC (W) against GPU clock (GHz)]


We can see that the power consumption curves upwards as clock speeds increase. The reason for this is that to increase clock speed you need to increase voltage, and power consumption is proportional to voltage squared. As a result, higher clock speeds are typically less efficient than lower ones.

So, if higher clock speeds are typically less efficient, doesn't that mean you can always reduce clocks to gain efficiency? Not quite. While the chart above might look like a smooth curve, it's actually hiding something; at that lowest clock speed of 420MHz the curve breaks down completely. To illustrate, let's look at the same data, but chart power efficiency (measured in Gflops per Watt) rather than outright power consumption:

[Chart: orin-efficiency-curve.png, power efficiency (Gflops per Watt) against GPU clock]
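
For anyone who wants to reproduce the numbers behind that chart from the table above, here's a minimal sketch (assuming 2 SMs per TPC and the usual 128 FP32 cores x 2 FLOPs per cycle per Ampere SM):

Code:
# Derive Gflops per Watt from the Orin power-estimator table above.
curve = [  # (clock in GHz, W per TPC)
    (0.42075, 0.96), (0.52275, 1.14), (0.62475, 1.45), (0.72675, 1.82),
    (0.82875, 2.21), (0.93075, 2.73), (1.03275, 3.32), (1.23675, 4.89),
    (1.30050, 5.58),
]

def gflops_per_tpc(clock_ghz: float) -> float:
    return 2 * 128 * 2 * clock_ghz  # 2 SMs per TPC, 128 cores, 2 FLOPs/cycle

for clock, watts in curve:
    print(f"{clock * 1000:4.0f} MHz: {gflops_per_tpc(clock) / watts:5.1f} GFLOPS/W")
# The best sampled point is 522 MHz; the actual kink (the minimum-voltage point)
# sits somewhere between 420 and 522 MHz, hence the ~470 MHz estimate used later.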


There are two things going on in this chart. For all the data points from 522MHz onwards, we see what you would usually expect, which is that efficiency drops as clock speeds increase. The relationship is exceptionally clear here, as it's a pretty much perfect straight line. But then there's that point on the left. The GPU at 420MHz is less efficient than it is at 522MHz. Why is that?

The answer is relatively straight-forward if we consider one important point: there is a minimum voltage that the chip can operate at. Voltage going up with clock speed means efficiency gets worse, and voltage going down as clock speed drops means efficiency gets better. But what happens when you want to reduce clocks but can't reduce voltage any more? Not only do you stop improving power efficiency, but it actually starts to go pretty sharply in the opposite direction.

Because power consumption is mostly related to voltage, not clock speed, when you reduce clocks but keep the voltage the same, you don't really save much power. A large part of the power consumption called "static power" stays exactly the same, while the other part, "dynamic power", does fall off a bit. What you end up with is much less performance, but only slightly less power consumption. That is, power efficiency gets worse.
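
To put that in toy-model terms (the static/dynamic split below is invented purely for illustration, not measured Orin or T239 data):

Code:
# Toy model of power below the minimum-voltage point: voltage is pinned,
# so static power stays constant and only dynamic power scales with clock.
# The 0.40 W / 0.56 W split is an assumption for illustration only.
P_STATIC = 0.40       # W per TPC, leakage at minimum voltage (assumed)
P_DYN_420 = 0.56      # W per TPC, dynamic power at 420 MHz (assumed)

def power_at_min_voltage(clock_mhz: float) -> float:
    return P_STATIC + P_DYN_420 * (clock_mhz / 420.0)

for clock in (420, 300, 200):
    rel_perf = clock / 420.0
    watts = power_at_min_voltage(clock)
    print(f"{clock} MHz: {watts:.2f} W, relative perf/W = {rel_perf / watts:.2f}")
# 420 MHz: 0.96 W, 1.04   300 MHz: 0.80 W, 0.89   200 MHz: 0.67 W, 0.71
# Performance falls faster than power, so efficiency keeps getting worse.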

So that kink in the efficiency graph, between 420MHz and 522MHz, is the point at which you can't reduce the voltage any more. Any clocks below that point will all operate at the same voltage, and without being able to reduce the voltage, power efficiency gets worse instead of better below that point. The clock speed at that point can be called the "peak efficiency clock", as it offers higher power efficiency than any other clock speed.

How does this impact how chips are designed?

There are two things to take from the above. First, as a general point, every chip on a given manufacturing process has a peak efficiency clock, below which you lose power efficiency by reducing clocks. Secondly, the Orin data gives us a pretty good idea of where this point is for a GPU very similar to T239's on a Samsung 8nm process: around 470MHz.

Now let's talk designing chips. Nvidia and Nintendo are in a room deciding what GPU to put in their new SoC for Nintendo's new console. Nintendo has a financial budget of how much they want to spend on the chip, but they also have a power budget, which is how much power the chip can use up to keep battery life and cooling in check. Nvidia and Nintendo's job in that room is to figure out the best GPU they can fit within those two budgets.

GPUs are convenient in that you can make them basically as wide as you want (that is use as many SMs as you want) and developers will be able to make use of all the performance available. The design space is basically a line between a high number of SMs at a low clock, and a low number of SMs at a high clock. Because there's a fixed power budget, the theoretically ideal place on that line is the one where the clock is the peak efficiency clock, so you can get the most performance from that power.

That is, if the power budget is 3W for the GPU, and the peak efficiency clock is 470MHz, and the power consumption per SM at 470MHz is 0.5W, then the best possible GPU they could include would be a 6 SM GPU running at 470MHz. Using a smaller GPU would mean higher clocks, and efficiency would drop, but using a larger GPU with lower clocks would also mean efficiency would drop, because we're already at the peak efficiency clock.

In reality, it's rare to see a chip designed to run at exactly that peak efficiency clock, because there's always a financial budget as well as the power budget. Running a smaller GPU at higher clocks means you save money, so the design is going to be a trade-off between a desire to get as close as possible to the peak efficiency clock, which maximises performance within a fixed power budget, and as small a GPU as possible, which minimises cost. Taking the same example, another option would be to use 4 SMs and clock them at around 640MHz. This would also consume 3W, but would provide around 10% less performance. It would, however, result in a cheaper chip, and many people would view losing 10% of the performance as a worthwhile trade-off for reducing the number of SMs by 33%.
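
Putting rough numbers on that trade-off (using the same rounded per-SM figures as above, taken from the Orin curve: about 0.5W per SM at 470MHz and about 0.75W per SM at 640MHz):

Code:
# Compare the two 3 W options above: 6 SMs at the peak efficiency clock
# vs 4 SMs clocked higher. Per-SM wattages are rounded from the Orin data.
BUDGET_W = 3.0
configs = {
    "6 SM @ 470 MHz": (6, 470, 0.50),   # peak-efficiency option
    "4 SM @ 640 MHz": (4, 640, 0.75),   # cheaper, slightly slower option
}
for name, (sms, clock_mhz, w_per_sm) in configs.items():
    power = sms * w_per_sm                        # total GPU power in W
    gflops = sms * 128 * 2 * clock_mhz / 1000     # Ampere FP32 throughput
    print(f"{name}: {power:.1f} W, {gflops:.0f} GFLOPS")
# 6 SM @ 470 MHz: 3.0 W, 722 GFLOPS
# 4 SM @ 640 MHz: 3.0 W, 655 GFLOPS (~10% less for the same power, smaller die)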

However, while it's reasonable to design a chip with intent to clock it at the peak efficiency clock, or to clock it above the peak efficiency clock, what you're not going to see is a chip that's intentionally designed to run at a fixed clock speed that's below the peak efficiency clock. The reason for this is pretty straight-forward; if you have a design with a large number of SMs that's intended to run at a clock below the peak efficiency clock, you could just remove some SMs and increase the clock speed and you would get both better performance within your power budget and it would cost less.

How does this relate to Nintendo and T239's manufacturing process?

The above section wasn't theoretical. Nvidia and Nintendo did sit in a room (or have a series of calls) to design a chip for a new Nintendo console, and what they came out with is T239. We know that the result of those discussions was to use a 12 SM Ampere GPU. We also know the power curve, and peak efficiency clock for a very similar Ampere GPU on 8nm.

The GPU in the TX1 used in the original Switch units consumed around 3W in portable mode, as far as I can tell. In later models with the die-shrunk Mariko chip, it would have been lower still. Therefore, I would expect 3W to be a reasonable upper limit to the power budget Nintendo would allocate for the GPU in portable mode when designing the T239.

With a 3W power budget and a peak efficiency clock of 470MHz, the (again, not theoretical) numbers above tell us the best possible performance would be achieved by a 6 SM GPU operating at 470MHz, and that you'd get 90% of that performance from a 4 SM GPU operating at 640MHz. Note that neither of these is 12 SMs. A 12 SM GPU on Samsung 8nm would be an awful design for a 3W power budget. It would be twice the size and cost of a 6 SM GPU while offering much less performance, if it's even possible to run it within 3W at any clock.

There's no world where Nintendo and Nvidia went into that room with an 8nm SoC in mind and a 3W power budget for the GPU in handheld mode, and came out with a 12 SM GPU. That means either the manufacturing process, or the power consumption must be wrong (or both). I'm basing my power consumption estimates on the assumption that this is a device around the same size as the Switch and with battery life that falls somewhere between TX1 and Mariko units. This seems to be the same assumption almost everyone here is making, and while it could be wrong, I think them sticking with the Switch form-factor and battery life is a pretty safe bet, which leaves the manufacturing process.

So, if it's not Samsung 8nm, what is it?

Well, from the Orin data we know that a 12 SM Ampere GPU on Samsung 8nm at the peak efficiency clock of 470MHz would consume a bit over 6W, which means we need something twice as power efficient as Samsung 8nm. There are a couple of small differences between T239's and Orin's GPUs, like smaller tensor cores and improved clock-gating, but they are likely to have only a marginal impact on power consumption, nowhere near the 2x we need, which will have to come from a better manufacturing process.

One note to add here is that we actually need a bit more than a 2x efficiency improvement over 8nm, because as the manufacturing process changes, so does the peak efficiency clock. The peak efficiency clock will typically increase as an architecture is moved to a more efficient manufacturing process, as the improved process allows higher clocks at given voltages. From DVFS tables in Linux, we know that Mariko's peak efficiency clock on 16nm/12nm is likely 384MHz. That's increased to around 470MHz for Ampere on 8nm, and will increase further as it's migrated to more advanced processes.

I'd expect peak efficiency clocks of around 500-600MHz on improved processes, which means that instead of running at 470MHz the chip would need to run at 500-600MHz within 3W to make sense. A clock of 550MHz would consume around 7.5W on 8nm, so we would need a 2.5x improvement in efficiency instead.
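
As a quick sanity check on those numbers (linear interpolation between the measured Orin points; my own arithmetic, nothing official):

Code:
# What 12 SMs (6 TPCs) would draw on Samsung 8nm at the relevant clocks,
# interpolated from the Orin table, and the efficiency gain needed to hit 3 W.
curve = [(0.42075, 0.96), (0.52275, 1.14), (0.62475, 1.45), (0.72675, 1.82)]

def w_per_tpc(clock_ghz: float) -> float:
    for (c0, w0), (c1, w1) in zip(curve, curve[1:]):
        if c0 <= clock_ghz <= c1:
            return w0 + (w1 - w0) * (clock_ghz - c0) / (c1 - c0)
    raise ValueError("clock outside interpolation range")

BUDGET_W = 3.0
for clock_ghz in (0.470, 0.550):
    total_w = 6 * w_per_tpc(clock_ghz)   # 6 TPCs = 12 SMs
    print(f"{clock_ghz * 1000:.0f} MHz: {total_w:.1f} W on 8nm, "
          f"~{total_w / BUDGET_W:.1f}x efficiency gain needed")
# ~470 MHz: ~6.3 W; ~550 MHz: ~7.3 W, i.e. in the ballpark of the 2.5x above.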

So, what manufacturing process can give a 2.5x improvement in efficiency over Samsung 8nm? The only reasonable answer I can think of is TSMC's 5nm/4nm processes, including 4N, which just happens to be the process Nvidia is using for every other product (outside of acquired Mellanox products) from this point onwards. In Nvidia's Ada white paper (an architecture very similar to Ampere), they claim a 2x improvement in performance per Watt, which appears to come almost exclusively from the move to TSMC's 4N process, plus some memory changes.

They don't provide any hard numbers for similarly sized GPUs at the same clock speed, with only a vague unlabelled marketing graph here, but they recently announced the Ada based RTX 4000 SFF workstation GPU, which has 48 SMs clocked at 1,565MHz and a 70W TDP. The older Ampere RTX A4000 also had 48 SMs clocked at 1,560MHz and had a TDP of 140W. There are differences in the memory setup, and TDPs don't necessarily reflect real world power consumption, but the indication is that the move from Ampere on Samsung 8nm to an Ampere-derived architecture on TSMC 4N reduces power consumption by about a factor of 2.

What about the other options? TSMC 6nm or Samsung 5nm/4nm?

Honestly, the more I think about it, the less I think these other possibilities are viable. Even aside from the issue that these aren't processes Nvidia is using for anything else, I just don't think a 12 SM GPU would make sense on either of them. Even on TSMC 4N it's a stretch. Evidence suggests that it would achieve a 2x efficiency improvement, but we would be looking for 2.5x in reality. There's enough wiggle room there (Ada has some additional features not in T239, and we don't have hard data on Ada's power consumption) that the actual improvement in T239's case may be 2.5x, but even that would mean that Nintendo have gone for the largest GPU possible within the power limit.

With 4N just about stretching to the 2.5x improvement in efficiency required for a 12 SM GPU to make sense, I don't think the chances for any other process are good. We don't have direct examples for other processes like we have for Ada, but from everything we know, TSMC's 5nm class processes are significantly more efficient than either their 6nm or Samsung's 5nm/4nm processes. If it's a squeeze for 12 SMs to work on 4N, then I can't see how it would make sense on anything less efficient than 4N.

But what about cost, isn't 4nm really expensive?

Actually, no. TSMC's 4N wafers are expensive, but they're also much higher density, which means you fit many more chips on a wafer. This SemiAnalysis article from September claimed that Nvidia pays 2.2x as much for a TSMC 4N wafer as they do for a Samsung 8nm wafer. However, Nvidia is achieving 2.7x higher transistor density on 4N, which means that a chip with the same transistor count would actually be cheaper manufactured on 4N than on 8nm (even more so once you take yields into account).
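
In rough numbers (assuming die area scales inversely with transistor density, and ignoring yields, which would favour the smaller 4N die even further):

Code:
# Cost per chip at equal transistor count, from the ratios quoted above.
wafer_cost_ratio = 2.2    # 4N wafer price relative to Samsung 8nm
density_ratio = 2.7       # 4N transistor density relative to Samsung 8nm

# Same transistor budget -> 1/2.7 the die area -> ~2.7x more dies per wafer.
cost_per_chip_ratio = wafer_cost_ratio / density_ratio
print(f"4N chip cost is ~{cost_per_chip_ratio:.2f}x the 8nm chip")  # ~0.81x, i.e. ~19% cheaper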

Are there any caveats?

Yes, the major one being the power consumption of the chip. I'm assuming that Nintendo's next device is going to be roughly the same size and form-factor as the Switch, and they will want a similar battery life. If it's a much larger device (like Steam Deck sized) or they're ok with half an hour of battery life, then that changes the equations, but I don't think either of those are realistic. Ditto if it turned out to be a stationary home console for some reason (again, I'm not expecting that).

The other one is that I'm assuming that Nintendo will use all 12 SMs in portable mode. It's theoretically possible that they would disable half of them in portable mode, and only run the full 12 in docked mode. This would allow them to stick within 3W even on 8nm. However, it's a pain from the software point of view, and it assumes that Nintendo is much more focussed on docked performance than handheld, including likely running much higher power draw docked. I feel it's more likely that Nintendo would build around handheld first, as that's the baseline of performance all games will have to operate on, and then use the same setup at higher clocks for docked.

That's a lot of words. Is this just all confirmation bias or copium or hopium or whatever the kids call it?

I don't think so. Obviously everyone should be careful of their biases, but I actually made the exact same argument over a year ago back before the Nvidia hack, when we thought T239 would be manufactured on Samsung 8nm but didn't know how big the GPU was. At the time a lot of people thought I was too pessimistic because I thought 8 SMs was unrealistic on 8nm and a 4 SM GPU was more likely. I was wrong about T239 using a 4 SM GPU, but the Orin power figures we got later backed up my argument, and 8 SMs is indeed unrealistic on 8nm. The 12 SM GPU we got is even more unrealistic on 8nm, so by the same logic we must be looking at a much more efficient manufacturing process. What looked pessimistic back then is optimistic now only because the data has changed.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).
I haven't read it yet, and will read it in full, but I'm curious about your CPU expectations, if they aren't covered in there.

And this is also a magnum opus 😹


I've read it all, and now I'm curious what the frequency ceiling for the A78 cores would be on 8N versus on the 4N node.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).
What about CPU clocks? I'm personally hoping for 2GHz.
 
I'm with Thraktor, Switch 2 will be in holiday 2023, not holiday 2024
Everything points to the Switch successor launching in holiday 2024 or early 2025. Nintendo is not going to launch their next-gen hardware this year and risk killing the Switch's sales momentum.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).
Snipping a truly excellent analysis, and I agree with basically every word of it, with a few caveats. The biggest is that we've seen power curves are somewhat under the manufacturer's control. AMD tweaked power curves for the Ally, and Aerith is substantially improved at the bottom end. In Orin's case I expect the power draw numbers to be slightly muddled by the double-rate tensor cores.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).

To explain my reasoning, let's play a game of Why does Thraktor think a TSMC 4N manufacturing process is likely for T239?

The short answer is that a 12 SM GPU is far too large for Samsung 8nm, and likely too large for any intermediate process like TSMC's 6nm or Samsung's 5nm/4nm processes. There's a popular conception that Nintendo will go with a "cheap" process like 8nm and clock down to oblivion in portable mode, but that ignores both the economic and physical realities of microprocessor design.

To start, let's quickly talk about power curves. A power curve for a chip, whether a CPU or GPU or something else, is a plot of the amount of power the chip consumes against the clock speed of the chip. A while ago I extracted the power curve for Orin's 8nm Ampere GPU from a Nvidia power estimator tool. There are more in-depth details here, here and here, but for now let's focus on the actual power curve data:
Code:
Clock      W per TPC
0.42075    0.96
0.52275    1.14
0.62475    1.45
0.72675    1.82
0.82875    2.21
0.93075    2.73
1.03275    3.32
1.23675    4.89
1.30050    5.58

The first column is the clock speed in GHz, and the second is the Watts consumed per TPC (which is a pair of SMs). Let's create a chart for this power curve:

orin-power-curve.png


We can see that the power consumption curves upwards as clock speeds increase. The reason for this is that to increase clock speed you need to increase voltage, and power consumption is proportional to voltage squared. As a result, higher clock speeds are typically less efficient than lower ones.

So, if higher clock speeds are typically less efficient, doesn't that mean you can always reduce clocks to gain efficiency? Not quite. While the chart above might look like a smooth curve, it's actually hiding something; at that lowest clock speed of 420MHz the curve breaks down completely. To illustrate, let's look at the same data, but chart power efficiency (measured in Gflops per Watt) rather than outright power consumption:

orin-efficiency-curve.png


There are two things going on in this chart. For all the data points from 522 MHz onwards, we see what you would usually expect, which is that efficiency drops as clock speeds increase. The relationship is exceptionally clear here, as it's a pretty much perfect straight line. But then there's that point on the left. The GPU at 420MHz is less efficient than it is at 522MHz, why is that?

The answer is relatively straight-forward if we consider one important point: there is a minimum voltage that the chip can operate at. Voltage going up with clock speed means efficiency gets worse, and voltage going down as clock speeds increase means efficiency gets better. But what happens when you want to reduce clocks but can't reduce voltage any more? Not only do you stop improving power efficiency, but it actually starts to go pretty sharply in the opposite direction.

Because power consumption is mostly related to voltage, not clock speed, when you reduce clocks but keep the voltage the same, you don't really save much power. A large part of the power consumption called "static power" stays exactly the same, while the other part, "dynamic power", does fall off a bit. What you end up with is much less performance, but only slightly less power consumption. That is, power efficiency gets worse.
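For reference, the textbook way to write that split (this is the generic CMOS approximation, not something extracted from the Orin data) is:

$$P \;\approx\; \underbrace{\alpha C V^2 f}_{\text{dynamic}} \;+\; \underbrace{V I_{\text{leak}}}_{\text{static}}$$

Once $V$ hits its floor, lowering $f$ only shrinks the first term, so performance falls faster than power does.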

So that kink in the efficiency graph, between 420MHz and 522MHz, is the point at which you can't reduce the voltage any more. Any clocks below that point will all operate at the same voltage, and without being able to reduce the voltage, power efficiency gets worse instead of better below that point. The clock speed at that point can be called the "peak efficiency clock", as it offers higher power efficiency than any other clock speed.
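If you want to sanity-check that kink yourself, here's a quick Python sketch (my own, assuming the usual Ampere figures of 128 FP32 cores per SM and 2 FLOPs per core per clock) that turns the table above into Gflops per Watt:
Code:
# Orin 8nm Ampere GPU power curve: clock (GHz) -> Watts per TPC (a TPC = 2 SMs)
power_curve = {
    0.42075: 0.96,
    0.52275: 1.14,
    0.62475: 1.45,
    0.72675: 1.82,
    0.82875: 2.21,
    0.93075: 2.73,
    1.03275: 3.32,
    1.23675: 4.89,
    1.30050: 5.58,
}

GFLOPS_PER_TPC_PER_GHZ = 2 * 128 * 2  # 2 SMs x 128 FP32 cores x 2 FLOPs (FMA)

for clock_ghz, watts in power_curve.items():
    gflops = GFLOPS_PER_TPC_PER_GHZ * clock_ghz
    print(f"{clock_ghz * 1000:7.0f} MHz: {gflops / watts:6.1f} Gflops/W")

# ~224 Gflops/W at 421MHz vs ~235 Gflops/W at 523MHz, then falling steadily
# from there: the peak efficiency clock sits between those first two points.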

How does this impact how chips are designed?

There are two things to take from the above. First, as a general point, every chip on a given manufacturing process has a peak efficiency clock, below which you lose power efficiency by reducing clocks. Secondly, we have the data from Orin to know pretty well where this point is for a GPU very similar to T239's on a Samsung 8nm process, which is around 470MHz.

Now let's talk designing chips. Nvidia and Nintendo are in a room deciding what GPU to put in their new SoC for Nintendo's new console. Nintendo has a financial budget of how much they want to spend on the chip, but they also have a power budget, which is how much power the chip can use up to keep battery life and cooling in check. Nvidia and Nintendo's job in that room is to figure out the best GPU they can fit within those two budgets.

GPUs are convenient in that you can make them basically as wide as you want (that is use as many SMs as you want) and developers will be able to make use of all the performance available. The design space is basically a line between a high number of SMs at a low clock, and a low number of SMs at a high clock. Because there's a fixed power budget, the theoretically ideal place on that line is the one where the clock is the peak efficiency clock, so you can get the most performance from that power.

That is, if the power budget is 3W for the GPU, and the peak efficiency clock is 470MHz, and the power consumption per SM at 470MHz is 0.5W, then the best possible GPU they could include would be a 6 SM GPU running at 470MHz. Using a smaller GPU would mean higher clocks, and efficiency would drop, but using a larger GPU with lower clocks would also mean efficiency would drop, because we're already at the peak efficiency clock.

In reality, it's rare to see a chip designed to run at exactly that peak efficiency clock, because there's always a financial budget as well as the power budget. Running a smaller GPU at higher clocks means you save money, so the design is going to be a tradeoff between a desire to get as close as possible to the peak efficiency clock, which maximises performance within a fixed power budget, and as small a GPU as possible, which minimises cost. Taking the same example, another option would be to use 4 SMs and clock them at around 640MHz. This would also consume 3W, but would provide around 10% less performance. It would, however, result in a cheaper chip, and many people would view a 10% performance loss as a worthwhile trade-off for reducing the number of SMs by 33%.
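To put some numbers on that trade-off, here's a rough sketch (mine, using straight linear interpolation of the Orin table above, so treat the outputs as approximations) comparing a few configurations against a 3W budget:
Code:
# Orin 8nm power curve: clock (GHz) -> Watts per TPC (a TPC = 2 SMs)
CURVE = [(0.42075, 0.96), (0.52275, 1.14), (0.62475, 1.45), (0.72675, 1.82),
         (0.82875, 2.21), (0.93075, 2.73), (1.03275, 3.32), (1.23675, 4.89)]

def watts_per_tpc(clock_ghz):
    # Linear interpolation between the two nearest measured points
    for (c0, w0), (c1, w1) in zip(CURVE, CURVE[1:]):
        if c0 <= clock_ghz <= c1:
            return w0 + (w1 - w0) * (clock_ghz - c0) / (c1 - c0)
    raise ValueError("clock outside the measured range")

def config(sms, clock_ghz):
    watts = watts_per_tpc(clock_ghz) * sms / 2            # sms / 2 = number of TPCs
    gflops = sms * 128 * 2 * clock_ghz                    # 128 FP32 cores/SM, 2 FLOPs/clock
    return watts, gflops

for sms, clock in [(6, 0.47), (4, 0.64), (12, 0.47)]:
    watts, gflops = config(sms, clock)
    print(f"{sms:2d} SMs @ {clock * 1000:.0f}MHz -> ~{watts:.1f}W, ~{gflops:.0f} Gflops")

# Roughly 3W for both the 6 SM and 4 SM options (the 4 SM one ~10% slower),
# but over 6W for 12 SMs even at the peak efficiency clock.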

However, while it's reasonable to design a chip with intent to clock it at the peak efficiency clock, or to clock it above the peak efficiency clock, what you're not going to see is a chip that's intentionally designed to run at a fixed clock speed that's below the peak efficiency clock. The reason for this is pretty straight-forward; if you have a design with a large number of SMs that's intended to run at a clock below the peak efficiency clock, you could just remove some SMs and increase the clock speed and you would get both better performance within your power budget and it would cost less.

How does this relate to Nintendo and T239's manufacturing process?

The above section wasn't theoretical. Nvidia and Nintendo did sit in a room (or have a series of calls) to design a chip for a new Nintendo console, and what they came out with is T239. We know that the result of those discussions was to use a 12 SM Ampere GPU. We also know the power curve, and peak efficiency clock for a very similar Ampere GPU on 8nm.

The GPU in the TX1 used in the original Switch units consumed around 3W in portable mode, as far as I can tell. In later models with the die-shrunk Mariko chip, it would have been lower still. Therefore, I would expect 3W to be a reasonable upper limit to the power budget Nintendo would allocate for the GPU in portable mode when designing the T239.

With a 3W power budget and a peak efficiency clock of 470MHz, then the (again, not theoretical) numbers above tell us the best possible performance would be achieved by a 6 SM GPU operating at 470MHz, and that you'd be able to get 90% of that performance with a 4 SM GPU operating at 640MHz. Note that neither of these say 12 SMs. A 12 SM GPU on Samsung 8nm would be an awful design for a 3W power budget. It would be twice the size and cost of a 6 SM GPU while offering much less performance, if it's even possible to run within 3W at any clock.

There's no world where Nintendo and Nvidia went into that room with an 8nm SoC in mind and a 3W power budget for the GPU in handheld mode, and came out with a 12 SM GPU. That means either the manufacturing process, or the power consumption must be wrong (or both). I'm basing my power consumption estimates on the assumption that this is a device around the same size as the Switch and with battery life that falls somewhere between TX1 and Mariko units. This seems to be the same assumption almost everyone here is making, and while it could be wrong, I think them sticking with the Switch form-factor and battery life is a pretty safe bet, which leaves the manufacturing process.

So, if it's not Samsung 8nm, what is it?

Well, from the Orin data we know that a 12 SM Ampere GPU on Samsung 8nm at the peak efficiency clocks of 470MHz would consume a bit over 6W, which means we need something twice as power efficient as Samsung 8nm. There are a couple of small differences between T239 and Orin's GPUs, like smaller tensor cores and improved clock-gating, but they are likely to have only marginal impact on power consumption, nowhere near the 2x we need, which will have to come from a better manufacturing process.

One note to add here is that we actually need a bit more than a 2x efficiency improvement over 8nm, because as the manufacturing process changes, so does the peak efficiency clock. The peak efficiency clock will typically increase as an architecture is moved to a more efficient manufacturing process, as the improved process allows higher clocks at given voltages. From DVFS tables in Linux, we know that Mariko's peak efficiency clock on 16nm/12nm is likely 384MHz. That's increased to around 470MHz for Ampere on 8nm, and will increase further as it's migrated to more advanced processes.

I'd expect peak efficiency clocks of around 500-600MHz on improved processes, which means that instead of running at 470MHz the chip would need to run at 500-600MHz within 3W to make sense. A clock of 550MHz would consume around 7.5W on 8nm, so we would need a 2.5x improvement in efficiency instead.
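A quick self-contained version of that arithmetic (linear interpolation between the 522MHz and 625MHz rows of the Orin table; it lands a touch under my rounded figures above, but in the same ballpark):
Code:
# Per-TPC draw at ~550MHz on 8nm, interpolated from the table above
w_522, w_625 = 1.14, 1.45
w_550 = w_522 + (0.550 - 0.52275) / (0.62475 - 0.52275) * (w_625 - w_522)
total_8nm = w_550 * 6                                        # 12 SMs = 6 TPCs
print(f"12 SMs @ 550MHz on 8nm: ~{total_8nm:.1f}W")          # ~7.3W
print(f"Efficiency gain needed for a 3W budget: ~{total_8nm / 3:.1f}x")  # ~2.4x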

So, what manufacturing process can give a 2.5x improvement in efficiency over Samsung 8nm? The only reasonable answer I can think of is TSMC's 5nm/4nm processes, including 4N, which just happens to be the process Nvidia is using for every other product (outside of acquired Mellanox products) from this point onwards. In Nvidia's Ada white paper (an architecture very similar to Ampere), they claim a 2x improvement in performance per Watt, which appears to come almost exclusively from the move to TSMC's 4N process, plus some memory changes.

They don't provide any hard numbers for similarly sized GPUs at the same clock speed, with only a vague unlabelled marketing graph here, but they recently announced the Ada based RTX 4000 SFF workstation GPU, which has 48 SMs clocked at 1,565MHz and a 70W TDP. The older Ampere RTX A4000 also had 48 SMs clocked at 1,560MHz and had a TDP of 140W. There are differences in the memory setup, and TDPs don't necessarily reflect real world power consumption, but the indication is that the move from Ampere on Samsung 8nm to an Ampere-derived architecture on TSMC 4N reduces power consumption by about a factor of 2.
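For what it's worth, the raw perf-per-Watt arithmetic on those two cards (using advertised boost clocks and TDPs, which are only a rough proxy for real-world behaviour):
Code:
# Both cards have 48 SMs; FP32 Tflops = SMs x 128 cores x 2 FLOPs x clock (GHz) / 1000
cards = {
    "RTX A4000 (Ampere, Samsung 8nm)": {"sms": 48, "clock_ghz": 1.560, "tdp_w": 140},
    "RTX 4000 SFF (Ada, TSMC 4N)":     {"sms": 48, "clock_ghz": 1.565, "tdp_w": 70},
}
for name, c in cards.items():
    tflops = c["sms"] * 128 * 2 * c["clock_ghz"] / 1000
    print(f"{name}: {tflops:.1f} Tflops / {c['tdp_w']}W "
          f"= {tflops / c['tdp_w'] * 1000:.0f} Gflops/W")

# ~137 Gflops/W vs ~275 Gflops/W: almost exactly a 2x gap on paper.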

What about the other options? TSMC 6nm or Samsung 5nm/4nm?

Honestly the more I think about it the less I think these other possibilities are viable. Even aside from the issue that these aren't processes Nvidia is using for anything else, I just don't think a 12 SM GPU would make sense on either of them. Even on TSMC 4N it's a stretch. Evidence suggests that it would achieve a 2x efficiency improvement, but we would be looking for 2.5x in reality. There's enough wiggle room there, in terms of Ada having some additional features not in T239 and not having hard data on Ada's power consumption, so the actual improvement in T239's case may be 2.5x, but even that would mean that Nintendo have gone for the largest GPU possible within the power limit.

With 4N just about stretching to the 2.5x improvement in efficiency required for a 12 SM GPU to make sense, I don't think the chances for any other process are good. We don't have direct examples for other processes like we have for Ada, but from everything we know, TSMC's 5nm class processes are significantly more efficient than either their 6nm or Samsung's 5nm/4nm processes. If it's a squeeze for 12 SMs to work on 4N, then I can't see how it would make sense on anything less efficient than 4N.

But what about cost, isn't 4nm really expensive?

Actually, no. TSMC's 4N wafers are expensive, but they're also much higher density, which means you fit many more chips on a wafer. This SemiAnalysis article from September claimed that Nvidia pays 2.2x as much for a TSMC 4N wafer as they do for a Samsung 8nm wafer. However, Nvidia is achieving 2.7x higher transistor density on 4N, which means that a chip with the same transistor count would actually be cheaper if manufactured on 4N than 8nm (even more so when you factor yields into account).
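Running those SemiAnalysis numbers through the obvious arithmetic (treating density as a direct proxy for dies per wafer, which ignores edge effects and yields):
Code:
wafer_cost_ratio = 2.2   # reported 4N wafer price vs Samsung 8nm wafer price
density_ratio = 2.7      # Nvidia's achieved transistor density, 4N vs 8nm

# Relative cost of a die with the same transistor count on 4N vs 8nm
relative_die_cost = wafer_cost_ratio / density_ratio
print(f"Iso-transistor die cost on 4N: ~{relative_die_cost:.2f}x of 8nm")
# ~0.81x, i.e. roughly 20% cheaper per chip before yields are even considered.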

Are there any caveats?

Yes, the major one being the power consumption of the chip. I'm assuming that Nintendo's next device is going to be roughly the same size and form-factor as the Switch, and they will want a similar battery life. If it's a much larger device (like Steam Deck sized) or they're ok with half an hour of battery life, then that changes the equations, but I don't think either of those are realistic. Ditto if it turned out to be a stationary home console for some reason (again, I'm not expecting that).

The other one is that I'm assuming that Nintendo will use all 12 SMs in portable mode. It's theoretically possible that they would disable half of them in portable mode, and only run the full 12 in docked mode. This would allow them to stick within 3W even on 8nm. However, it's a pain from the software point of view, and it assumes that Nintendo is much more focussed on docked performance than handheld, including likely running much higher power draw docked. I feel it's more likely that Nintendo would build around handheld first, as that's the baseline of performance all games will have to operate on, and then use the same setup at higher clocks for docked.

That's a lot of words. Is this just all confirmation bias or copium or hopium or whatever the kids call it?

I don't think so. Obviously everyone should be careful of their biases, but I actually made the exact same argument over a year ago back before the Nvidia hack, when we thought T239 would be manufactured on Samsung 8nm but didn't know how big the GPU was. At the time a lot of people thought I was too pessimistic because I thought 8 SMs was unrealistic on 8nm and a 4 SM GPU was more likely. I was wrong about T239 using a 4 SM GPU, but the Orin power figures we got later backed up my argument, and 8 SMs is indeed unrealistic on 8nm. The 12 SM GPU we got is even more unrealistic on 8nm, so by the same logic we must be looking at a much more efficient manufacturing process. What looked pessimistic back then is optimistic now only because the data has changed.
Wow, what a post! Thank you for writing all of this down and explaining your thoughts!

I haven’t been able to follow all of the leaks and rumors, far from it even, but what exactly is the assumption that the next system is going to have 12 SM based on? Is that based on that same Nvidia Hack that had several titles on it that were actually announced later on? Is there any possible explanation why this chip could have 12 SM and NOT be the next handheld device for Nintendo? Others seem to expect 8 SM (think I have seen it on the DF Discord, but also here), do you see a reason why they would come to that assumption?
(Edit: ) Could this chip have 12 SM but 4 deactivated for better yields?
Also, how does the rumor that devkits have been recalled play into this?

Just going by gut feeling, I struggle to imagine a system from Nintendo that has more performance than the Steam deck, has battery life comparable to Logan Switch, has the form factor of the original Switch and also is sub $429, as that would be my upper bound. (And Nintendo makes a profit on it)
Again, just gut feeling, based on nothing else. You clearly understand A LOT from this. I‘d appreciate if you could possibly address some of my questions from above. :)
 
Wow, what a post! Thank you for writing all of this down and explaining your thoughts!

I haven’t been able to follow all of the leaks and rumors, far from it even, but what exactly is the assumption that the next system is going to have 12 SM based on? Is that based on that same Nvidia Hack that had several titles on it that were actually announced later on? Is there any possible explanation why this chip could have 12 SM and NOT be the next handheld device for Nintendo? Others seem to expect 8 SM (think I have seen it on the DF Discord, but also here), do you see a reason why they would come to that assumption?
Also, how does the rumor that devkits have been recalled play into this?

Just going by gut feeling, I struggle to imagine a system from Nintendo that has more performance than the Steam deck, has battery life comparable to Logan Switch, has the form factor of the original Switch and also is sub $429, as that would be my upper bound. (And Nintendo makes a profit on it)
Again, just gut feeling, based on nothing else. You clearly understand A LOT from this. I‘d appreciate if you could possibly address some of my questions from above. :)
I'm not Thraktor, but I think I have the answers for the bolded portion.

Yes, the information about Drake having 12 SMs did come from the illegal Nvidia leaks.

And considering that the information about Drake in the illegal Nvidia leaks was found in the NVN2 folder, and there are references to NVN, which is the name of the Nintendo Switch's API, I don't see why Drake won't be used for Nintendo's new hardware.
 
Wow, what a post! Thank you for writing all of this down and explaining your thoughts!

I haven’t been able to follow all of the leaks and rumors, far from it even, but what exactly is the assumption that the next system is going to have 12 SM based on? Is that based on that same Nvidia Hack that had several titles on it that were actually announced later on? Is there any possible explanation why this chip could have 12 SM and NOT be the next handheld device for Nintendo? Others seem to expect 8 SM (think I have seen it on the DF Discord, but also here), do you see a reason why they would come to that assumption?
(Edit: ) Could this chip have 12 SM but 4 deactivated for better yields?
Also, how does the rumor that devkits have been recalled play into this?

Just going by gut feeling, I struggle to imagine a system from Nintendo that has more performance than the Steam deck, has battery life comparable to Logan Switch, has the form factor of the original Switch and also is sub $429, as that would be my upper bound. (And Nintendo makes a profit on it)
Again, just gut feeling, based on nothing else. You clearly understand A LOT from this. I‘d appreciate if you could possibly address some of my questions from above. :)
it's based on the nvidia hack. for Drake to not be used for Nintendo, you'd have to believe NVN2 (NVN is the API for switch) and Hovi (Nvidia's internal codename for Nintendo) aren't Nintendo related given both of these are tied to the T239 (Drake).

there's a chance that Drake isn't used, but it wouldn't be because we were wrong in associating the chip with Nintendo, but because it was turned down for whatever reason

could the chip be cut down for yields, yes, but why make a larger and more expensive chip to cut it down? just make a smaller chip. it's not like they don't have predictive tools to figure out yields.
 
Hey, y'all, what did I miss? I've been busy with work and the Street Fighter 6 beta.
 
Oldpuck's leaving #teamleapday? Guess I'll have to continue flying the flag solo.

On the time of 18 to 24 months from sampling to release, I assume you're talking about the TX1 and the original Switch? If so, I don't think it's really comparable, as TX1 was already entering production independently of Nintendo; Nvidia had other customers for it (not many, but still...). Correct me if I'm wrong, but I believe Nintendo only officially signed with Nvidia around when tape out happened, so the question wasn't really about hardware timelines, it was about how quickly Nintendo could produce a lineup of games for hardware they had no experience with, using APIs and tools they had no experience with. Even two years is cutting it a bit tight in those circumstances.

Now, however, we're looking at a chip designed for Nintendo, with hardware, tools and APIs that are evolutions of what they're already familiar with. It appears that Nintendo have already had at least three years to develop software for this new console, and rather than tape out being the start of a mad dash to get a games lineup ready, tape out would have been the point where Nintendo had all their ducks in a row and gave Nvidia the thumbs up to get the hardware side in motion.

From leakers, it appears that Nvidia's typical tape out to product launch is under a year these days, with around 8 months for the first Ada GPUs. I'd expect a Nintendo chip to have a longer gap than that, between wanting a higher volume launch than is typical for GPUs, and a bit of a buffer for safety (if they need to do a second stepping). Let's say 14 or 15 months. As you say, sampling began between April and August last year, which is pretty much right where I'd expect it to be for a holiday 2023 launch window. That is to say, I'm reasonably confident that Nintendo's plan was to launch [redacted] holiday 2023, and that was their plan as recently as the middle of last year.

Of course, plans can change, and it's entirely possible that Nintendo have reconsidered since then, with software delays being the most likely culprit, but I don't agree that pushing back a year to holiday 2024 is the most likely outcome. Holiday launches are a good bet for initial plans, as if you're looking at things three or four years in advance, before you've started work on any games for the system, you'll probably go for a holiday launch. Delaying a system with a pipeline of games already in development is a very different thing, though. Nintendo absolutely made the right choice launching Switch in March rather than pushing back to the following holiday, as they had the pipeline of games to support it, and ended up with far more consoles in players hands by the end of 2017 than if they had gone for a holiday launch.

In [redacted]'s case, this is further complicated by the almost guaranteed presence of cross-gen titles. I don't think it's a stretch to assume that any games Nintendo releases on Switch after [redacted]'s launch will be cross-gen, with some kind of enhancement on the new hardware. That being the case, if Nintendo were planning for [redacted] to launch in Q4 2023, they would have two types of games in their pipeline for release from that point onwards; [redacted] exclusives and cross-gen games. On the one hand, this gives them a buffer, as they can release those cross-gen games as regular Switch titles if [redacted] is delayed. On the other hand, they almost certainly don't want to do this, as it means they can't announce these titles as [redacted] games, and hence weaken the next-gen lineup.

Let's say Nintendo's planned lineup for the first year of [redacted] consisted of Major Flagship Game™️, a [redacted] exclusive, at launch, with maybe a couple of other exclusives over the year, and the rest of the lineup consisting of cross-gen games. Let's also say that Major Flagship Game™️ doesn't look quite flagship-worthy, and they decide to give it a few more months in the oven, and hence also delay [redacted] itself. If they delay a full year, then in order to keep money coming in they have to release almost all of those cross-gen games as regular old Switch titles, so rather than the delay resulting in a better game lineup, the delay would actually be eating into [redacted]'s software offering. Put another way, if Nintendo has enough games to keep Switch alive another year, they have enough games to keep [redacted] going through its first. Sacrificing the latter for the former doesn't make much sense to me.

Cool except no company has ever done anything ever like this (not announcing anything even six months before launch and managing zero leaks from third-parties) before because it doesn't really make any sense to do.
 
Cool except no company has ever done anything ever like this (not announcing anything even six months before launch and managing zero leaks from third-parties) before because it doesn't really make any sense to do.
Well, yes they have, and from a marketing perspective, makes plenty of sense. Times change. A fast turnaround would be beneficial to their market position.
 
Well, yes they have, and from a marketing perspective, makes plenty of sense. Times change. A fast turnaround would be beneficial to their market position.

No it wouldn't, announcing the Switch 2 so late will either have no impact on the Switch 2's sales due to shortages, or will have a negative short-term impact as some people will have already spent their money allotted for video games for the year on the ZOLED. A business plan of none of the 100k US and European third-party programmers and artists in video games leaking anything to anyone is less a business plan and more just magic.
 
No it wouldn't, announcing the Switch 2 so late will either have no impact on the Switch 2's sales due to shortages, or will have a negative short-term impact as some people will have already spent their money allotted for video games for the year on the ZOLED. A business plan of none of the 100k US and European third-party programmers and artists in video games leaking anything to anyone is less a business plan and more just magic.

🙄🙄🙄 Anyways, #Team2023
 
Cool except no company has ever done anything ever like this (not announcing anything even six months before launch and managing zero leaks from third-parties) before because it doesn't really make any sense to do.
A short announcement cycle is doable imo, but leaks are trickier. While I don't think they're as strong of an indicator as some here would suggest, I do wonder if the lack of leaks implies that Nintendo is cracking down hard on them or things just aren't ready yet.
 
You have a comment from Furukawa saying that the case with the Nintendo Switch, NX, was a unique one, with the implication that it's not something that will be done again.

What were they referring to? The long cycle between announcement, official reveal, and release.
 
We’ll know soon enough whether it is this year or not. While they can have a short cycle, they’ll need to start producing soon for a sufficient holiday release. While Nintendo can crack down hard, these types of spin-ups usually leak in one form or another, as we see from the uncles or the Lite backplate. This goes doubly so with 3rd party companies that are known water faucets of information.

I do think it's a little reckless how they are playing it with investors. We’ll see what they have to say at the quarterly report, especially if they lower the forecast.
You have a comment from Furukawa saying that the case with the Nintendo Switch, NX, was a unique one, with the implication that it's not something that will be done again.

What were they referring to? The long cycle between announcement, official reveal, and release.
It could be that. Could also be how they had to announce the code name due to their primary business faltering & their move into mobile. This was well in advance of any information about the device.
 
Wait, game titles come from Geforce Now hack/leak, right? Was that part of the Lapsus hack, or was that a whole separate thing?

Thraktor's GPU clock guess further emboldens my belief in LPDDR5X ✊
(although my actual guess tends to line up more with oldpuck's, actually; I currently got 1.4 tflops handheld/3.15 tflops docked. Assuming LPDDR5X-7500 MT/s, I can set aside ~30 GB/s:tflop to land at ~95 GB/s for the GPU, then have ~25 GB/s leftover for the CPU, which should be about 2.5x the amount that the PS4's CPU uses. And heeeey, 8 A78's in the 1.4-1.6 ghz range shouldn't be too far off from 2.5-2.7x the PS4's CPU, what a coincidence...
...but if it's 8533 MT/s instead, sign me up for Thraktor's numbers :p)



Huh, not much of a difference between the 25W and 30W modes for the SoC. Bandwidth starvation at that point? Or thermal throttling? Or both.

The non-SoC power needs are pretty nutty. A 15W SoC potentially doubled after everything's factored in? So the mainboard + RAM + storage + display + audio + fan + wireless + etc. potentially add up to an amount you could easily run a docked V2 within?
A 25W SoC inflated to nearly double? Again, all the non-SoC stuff potentially adds up to an amount you could fit a docked OG Switch inside?

Also, that max of 48 W in portable turbo mode is insane.
According to this article*, docked OG Switch can be pushed to 16W at max.
I ask, "What is the difference between 48 and 16?"
Raise your hand if your answer is "32".
.
Alright, now, raise your hand if your answer is "The former is three times the latter, aka in the ballpark of the proportional difference between the PS5 and the Series S"
(admittedly, I am fudging this by going with the lower half of the S's power draw here...)

Portable OG Switch should be 9W or under, right? Let's say ~9W under heavy load.
The Ally's portable turbo mode going to as low as 37W is still pretty nutty.
I ask, "What is the difference between 37-48 and 9?"
Raise your hand if your answer is "28-39"
.
Alright, now, raise your hand if your answer is "The former range is about four to five times the latter, aka the proportional difference between the Series S (~60's-80's) and the docked OG Switch at its max (16 as stated earlier)."

*also, notice that the hottest it got in that article was ~52C? Room temperature's unspecified, but the delta between that 52C and room temperature shouldn't exceed ~30.
And remember, despite having multiple tens of degrees Celsius to spare before thermal throttling at usual room temperatures, I think Nintendo advises only playing in ambient temperatures between 5 and 35C.
Considering that the Ally can potentially hit 95C (in plugged-in turbo mode, that is) in what is presumably 'room temperature', I wonder what conditions you could play in and still let the Ally draw enough power to have any legroom? :unsure:

What I'm hinting at here is, if you're paying a pretty penny for the Ally expecting a major performance uplift over say, the Steam Deck, I hope that you have air conditioning during the summer.
 
Thraktor's post
Excellent Post, and pretty much confirms/gives power curve info behind 8N being too inefficient (both size and power-wise)

Beyond the CPU question, what are the odds/math for them targeting 1.3GHz when docked in your mind? Primarily because that would peg them at Virtually 4TFLOPs (3.999 but still).
The reason for this is that the number/guess on your end was just doubling the Portable Mode clock, meanwhile OG switch is 157.28GFLOPs portable at 307.2MHz, and 393.2GFLOPs docked at 768MHz.

So scaling the clock by the same factor would be 2.5X, so 550MHz * 2.5 = 1.37GHz (4.2TFLOPs)

Just wondering why you think they'd undercut the Portable-Docked Difference by that degree versus OG Switch. Especially for a system intending to push 4K Outputs. Also not to mention it would help keep software behavior in relation to clock scaling even closer to OG switch for B/C purposes if that happens to be a problem.
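(Showing my working on those figures, with the usual assumption of 1536 CUDA cores at 2 FLOPs per clock:)
Code:
cores = 1536                                   # 12 SMs x 128 CUDA cores
print(f"1.3GHz docked: {cores * 2 * 1.3 / 1000:.2f} Tflops")              # ~3.99 Tflops
og_ratio = 768 / 307.2                         # OG Switch docked vs portable clock ratio
print(f"550MHz x {og_ratio:.1f} = {550 * og_ratio:.0f}MHz -> "
      f"{cores * 2 * 0.550 * og_ratio / 1000:.2f} Tflops")                # ~4.22 Tflops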
 
Wow, what a post! Thank you for writing all of this down and explaining your thoughts!

I haven’t been able to follow all of the leaks and rumors, far from it even, but what exactly is the assumption that the next system is going to have 12 SM based on? Is that based on that same Nvidia Hack that had several titles on it that were actually announced later on? Is there any possible explanation why this chip could have 12 SM and NOT be the next handheld device for Nintendo? Others seem to expect 8 SM (think I have seen it on the DF Discord, but also here), do you see a reason why they would come to that assumption?
(Edit: ) Could this chip have 12 SM but 4 deactivated for better yields?
Also, how does the rumor that devkits have been recalled play into this?

Just going by gut feeling, I struggle to imagine a system from Nintendo that has more performance than the Steam deck, has battery life comparable to Logan Switch, has the form factor of the original Switch and also is sub $429, as that would be my upper bound. (And Nintendo makes a profit on it)
Again, just gut feeling, based on nothing else. You clearly understand A LOT from this. I‘d appreciate if you could possibly address some of my questions from above. :)
This information is from a separate Nvidia hack unrelated to Geforce Now. Some Nvidia source code leaked and incidentally contained stuff related to an in development version of NVN, the native Switch graphics API.
 
Thraktor's post
First off: do we know if Drake won't ship with software-binned GPU cores? (maybe even CPU cores too). The info from leaks we got so far could be pointing out to 1536 CUDA cores and 8 CPU cores but those could be disabled software wise even if they do exist on silicon.
Secondly: Why are you saying it would be a pain to disable SMs in portable mode? would it be hard to implement without a system restart?
And finally: If the hardware has been taped out back in 2022, how could it be using anything but Samsung 8nm?
 
As of today, I haven't heard any whispers suggesting a 2023 release is in play. Most chatter suggests late 2024 as the intended window.
I believe you, but if Nintendo really is planning on a late 2024 release for the damn thing I can’t help but be convinced they’re making a massive mistake. Even the DS didn’t last that long. If they wanted to blow the momentum they’d built up from the Switch I couldn’t possibly imagine a better way to do it.

Legit starting to have Wii U flashbacks. Trying to convince myself that no system that has mainline Pokémon on it can possibly flop, let alone as badly as the Wii U, but the way they’ve dragged out the Switch’s lifespan makes me worry they could be setting up Redacted to do just that.
 
First off: do we know if Drake won't ship with software-binned GPU cores? (maybe even CPU cores too). The info from leaks we got so far could be pointing out to 1536 CUDA cores and 8 CPU cores but those could be disabled software wise even if they do exist on silicon.
When the idea comes up, you’d have to ask yourself, “Why would Nintendo actually do this? What’s the benefit? They’re paying for more silicon, yet they aren’t even using it. Why did they pay for more on the R&D front? And why are they willing to pay more per unit and not just go with a cheaper option? What’s the rationale behind this decision?” There’s always logical reasoning behind every decision, especially one that will last them several years, well into the next decade even.

Though if you mean when going from TV mode to handheld mode, why not disable GPU and potentially CPU cores? Well, you wouldn’t want that to happen. The game would crash regardless. You’d have to reboot to go from an 8-core mode into a 6-core mode (7 cores available to games vs 5, respectively). An engine can be taking advantage of all of those cores, and then suddenly the cores aren’t there. Tragic.

For the second, the GPU,
Secondly: Why are you saying it would be a pain to disable SMs in portable mode? would it be hard to implement without a system restart?
It would be more of a pain for developers.

And finally: If the hardware has been taped out back in 2022, how could it be using anything but Samsung 8nm?
5nm, which is what the 4N node is based on, is several years old, but it's still the leading-edge node right now.

When you have no better alternative, it remains the leading edge.

Second, Nvidia's Ada GPUs were taped out well before their retail release in Q4 2022, so tape-out timing isn't necessarily indicative of anything.
 
I believe you, but if Nintendo really is planning on a late 2024 release for the damn thing I can’t help but be convinced they’re making a massive mistake. Even the DS didn’t last that long. If they wanted to blow the momentum they’d built up from the Switch I couldn’t possibly imagine a better way to do it.

Legit starting to have Wii U flashbacks. Trying to convince myself that no system that has mainline Pokémon on it can possibly flop, let alone as badly as the Wii U, but the way they’ve dragged out the Switch’s lifespan makes me worry they could be setting up Redacted to do just that.
I’m not sure how you’re getting WiiU flashbacks when the circumstances leading into that device aren't being replicated here. Nor do I really see how dragging out Switch’s lifespan sets Drake up to mirror the WiiU. The device in question would need:
  • To replicate the Wii & DS’ poor EOL software slate between 1st & 3rd party. Among these issues were droughts, confused/poor software, & audience mismatch
  • Be a completely different device that eschews that paradigm Switch started. Possibly restarting BC, NSO, & Eshop again.
  • Be a conceptually unsound device that was made in a complete bubble divorced from everyone else & made with no one in mind.
As I said in another post, I’m not really sure WiiU/3DS comps are applicable here.
 
First off: do we know if Drake won't ship with software-binned GPU cores? (maybe even CPU cores too). The info from leaks we got so far could be pointing out to 1536 CUDA cores and 8 CPU cores but those could be disabled software wise even if they do exist on silicon.
Secondly: Why are you saying it would be a pain to disable SMs in portable mode? would it be hard to implement without a system restart?
And finally: If the hardware has been taped out back in 2022, how could it be using anything but Samsung 8nm?
12 SMs is what the graphics API has access to. There might be more on the chip, for yields.
 
First off: do we know if Drake won't ship with software-binned GPU cores? (maybe even CPU cores too). The info from leaks we got so far could be pointing out to 1536 CUDA cores and 8 CPU cores but those could be disabled software wise even if they do exist on silicon.
Secondly: Why are you saying it would be a pain to disable SMs in portable mode? would it be hard to implement without a system restart?
And finally: If the hardware has been taped out back in 2022, how could it be using anything but Samsung 8nm?
You would be spending extra money making a 12 SM chip only to use 6 SMs. When a company uses binned chips, it's because they can sell them as multiple product lines: an "RTX 500 Super" would be the chip with all SMs enabled, while the regular "RTX 500" would be the chip with a defective SM, so it gets binned. There probably aren't two separate pieces of hardware coming soon.

I'm no expert on this, but I assume it would be similar to switching GPUs on the fly. You have graphics work split among all 12 SMs, so now you're missing half of it when switching to 6 SMs. Or vice versa: all your work is on the 6 SMs and needs to be spread across all 12 before they can do their job. A lot more difficult to manage than changing the resolution setting between docked and undocked.
 
With DLSS, 2TF is enough to offer PS4 Pro level experiences; anything else is just gravy. If you really want Series S performance in TV mode, then you can buy a Series S. They're less than 200 bucks used. Mario is going to look great no matter what.
No idea why you write that tbh, everyone on this forum will buy a REDACTED to play Nintendo games so Series S is irrelevant.
 


everyone marveling at the little details in Zelda. kinda reminds me of Half Life Alyx's bottles in a way. I kinda doubt this is taking up so much processing power, but it does make me excited to see what the team can do with better hardware

Yeah, you're right, this is really comparable to the bottles in HL Alyx, which were "just" a shader.

There is no way that this is real-time fog. I feel like they made some workaround where the fog is 2D animated and then gets masked as soon as it interacts with weapons or Link's movement. The result looks crazy good, and it's probably difficult to implement in a way that looks convincing. Similar to the clouds, which almost look volumetric when you're not looking too closely.
 
Yeah, you're right, this is really comparable to the bottles in HL Alyx, which were "just" a shader.

There is no way that this is real-time fog. I feel like they made some workaround where the fog is 2D animated and then gets masked as soon as it interacts with weapons or Link's movement. The result looks crazy good, and it's probably difficult to implement in a way that looks convincing. Similar to the clouds, which almost look volumetric when you're not looking too closely.
That's what Nintendo does best, though: computationally cheap workarounds that look almost as good as the real thing and work great thanks to sublime art.
 
People really believe this device is coming Holiday 2024 because there hasn't been any SDK leak from third parties? You're telling me you think the SDK will somehow be released to devs between now and Holiday 2024 and they will have time enough to develop something in that time frame? Huh.

Even if this thing is out Holiday 2024, third parties already have dev kits so the "no third party leak" argument is null and void. If dev kits are not out right now, this thing is not even coming out in 2024 regardless.
 
I'm with Thraktor, Switch 2 will be in holiday 2023, not holiday 2024

To be clear, I think the original plan (up until mid-2022 at least) was a launch in holiday 2023, but I wouldn't rule out a delay into 2024. If it does get pushed into 2024, though, I would expect it early in the year, not late.

I haven’t read it yet, and will read it in full, but I’m curious on your CPU expectations if it isn’t mentioned.

And this is also a magnum opus 😹


I’ve read it all, and now I’m curious about your thoughts on the frequency ceiling for the A78 on 8N vs. what it would be on the 4N node

On the CPU front, eight A78C cores seems almost certain. It's a bit more difficult to extract CPU power curves from Nvidia's Jetson tool (and the A78AE isn't exactly the same as the A78C), but from what I can see I'd guess around a 1.7 or 1.8GHz clock speed. For 8N that would be around 1.2GHz. Unlike the GPU, though, an 8-core A78C is very much viable on 8nm.

Wait, game titles come from Geforce Now hack/leak, right? Was that part of the Lapsus hack, or was that a whole separate thing?

Thraktor's GPU clock guess further emboldens my belief in LPDDR5X ✊
(although my actual guess tends to line up more with oldpuck's, actually; I currently got 1.4 tflops handheld/3.15 tflops docked. Assuming LPDDR5X-7500 MT/s, I can set aside ~30 GB/s:tflop to land at ~95 GB/s for the GPU, then have ~25 GB/s leftover for the CPU, which should be about 2.5x the amount that the PS4's CPU uses. And heeeey, 8 A78's in the 1.4-1.6 ghz range shouldn't be too far off from 2.5-2.7x the PS4's CPU, what a coincidence...
...but if it's 8533 MT/s instead, sign me up for Thraktor's numbers :p)

Just curious, but when you're looking at Tflops to bandwidth ratios of Ampere cards, are you using the official Tflops figure for the cards, or calculating it from the actual clock speed the cards achieve in game? The reason I ask is that Nvidia's official figures actually underestimate the Tflops by a bit, as the GPUs typically run at a higher clock than the advertised boost clock. For example, the RTX 3070 advertises a boost clock of 1,725MHz, for 20.31 Tflops, but in game clock speeds average around 1,890MHz, which would give 22.26 Tflops. With 448GB/s of bandwidth, the former would come out at 22.06GB/s per Tflop, but the latter would give us 20.13GB/s per Tflop.

With standard LPDDR5 at 102GB/s and 25GB/s for the CPU, that would give a cap of 3.5Tflops if we're using official figures, or 3.8Tflops if we're basing it off actual in-game clocks. In either case I think they could manage 3.4Tflops without being any more bandwidth constrained than other Ampere GPUs. Of course any additional bandwidth on top of that would surely help, so I definitely wouldn't complain if it were LPDDR5X.
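
To spell out the back-of-the-envelope maths (using the RTX 3070 figures above; this is a rough sketch, not a rigorous model):
Code:
# Bandwidth-per-Tflop sanity check using the RTX 3070 figures above
def ampere_tflops(cuda_cores, clock_ghz):
    return 2 * cuda_cores * clock_ghz / 1000  # 2 FLOPs per core per cycle (FMA)

cores_3070 = 5888
bw_3070 = 448.0                                   # GB/s
official = ampere_tflops(cores_3070, 1.725)       # advertised boost clock
in_game = ampere_tflops(cores_3070, 1.890)        # typical in-game clock

ratio_official = bw_3070 / official               # ~22.1 GB/s per Tflop
ratio_in_game = bw_3070 / in_game                 # ~20.1 GB/s per Tflop

# Standard LPDDR5: 102 GB/s total, with ~25 GB/s set aside for the CPU
gpu_bw = 102.0 - 25.0
print(f"Cap using official clocks: {gpu_bw / ratio_official:.1f} Tflops")  # ~3.5
print(f"Cap using in-game clocks:  {gpu_bw / ratio_in_game:.1f} Tflops")   # ~3.8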

Excellent post, and it pretty much confirms (and gives the power curve info behind) 8N being too inefficient, both size- and power-wise.

Beyond the CPU question, what are the odds/math, in your mind, for them targeting 1.3GHz when docked? Primarily because that would peg them at virtually 4TFLOPs (3.999, but still).
The reason I ask is that your guess just doubles the portable mode clock, whereas the OG Switch is 157.28GFLOPs portable at 307.2MHz and 393.2GFLOPs docked at 768MHz.

So scaling the clock by the same factor would be 2.5x: 550MHz × 2.5 = 1.375GHz (4.2TFLOPs).

Just wondering why you think they'd undercut the portable-to-docked difference by that degree versus the OG Switch, especially for a system intending to push 4K output. Not to mention it would help keep software behaviour in relation to clock scaling even closer to the OG Switch for BC purposes, if that happens to be a problem.

Do any games actually use a 307MHz clock on Switch? My understanding is that it was replaced prior to launch with the 384MHz clock, which has a 1:2 ratio with the 768MHz docked clock (and then a 460MHz portable clock was added too, reducing the ratio further).

In any case, my reasoning for the 1.1GHz docked clock is partly the clean 1:2 ratio like the 384MHz/768MHz clocks Switch launched with, but also power consumption and bandwidth limits. A 1.1GHz clock would put power consumption at around 10W for the GPU, so maybe around 15W for the full system. They're not concerned about battery life in docked mode, but they still have to cool the system, and with a similar thickness and small fan setup it would be tricky to cool much more than 15W without the fan becoming distractingly loud.

On the memory bandwidth side, as discussed by @Look over there above, there's only so much performance they can get before they become bandwidth-starved. Comparing to desktop Ampere GPUs, a 1.1GHz clock would put them in a similar bandwidth per Tflops ratio. Maybe they could get away with 1.2GHz if we're looking at in-game Ampere clocks for our comparison.

First off: do we know whether Drake will ship with software-binned GPU cores (maybe even CPU cores too)? The info from the leaks so far could point to 1536 CUDA cores and 8 CPU cores, but some of those could be disabled in software even if they exist on silicon.
Secondly: why do you say it would be a pain to disable SMs in portable mode? Would it be hard to implement without a system restart?
And finally: if the hardware was taped out back in 2022, how could it be using anything but Samsung 8nm?

We don't know if they'll ship with binned GPU cores, but personally I don't think it's very likely. Binning GPU cores is common on console chips as a way to improve yields, but those chips are typically pretty large. The PS5 SoC is around 300mm² and the XBSX SoC is 360mm², and yields get worse the bigger the chip, so you basically need to disable something to get decent yields on a chip that big. By comparison, if T239 is on TSMC 4N, then it's going to be a tiny chip, well under 100mm². Yields should be good enough with a die that small that there's no need for binning out any cores.

On disabling the SMs, I don't think it would need a restart, but it would require much more careful management than the current change from docked to handheld. The SMs being disabled would have code running on them at the time, so that in-flight code and data would need to be migrated to other SMs. Developers would also have to account for the two different GPU configurations when working on handheld and docked modes, which would be more work than the current "identical GPU at different clocks" paradigm.

Regarding taping out in 2022, here is a list of products Nvidia taped out in 2022:

Hopper: TSMC 4N
Lovelace: TSMC 4N
T239: ?
Grace: TSMC 4N

It's not exactly a difficult game of fill-in-the-blank.
 
Tying leaks to release timing is a bit of a folly, because in a perfect world there would be no leaks to begin with. Since leaks aren't guaranteed, the absence of them is just that, not an indication of when a product might release. I think we take leaks for granted because they happen so often nowadays.
 
On the CPU front, eight A78C cores seems almost certain. It's a bit more difficult to extract CPU power curves from Nvidia's Jetson tool (and the A78AE isn't exactly the same as the A78C), but from what I can see I'd guess around a 1.7 or 1.8GHz clock speed. For 8N that would be around 1.2GHz. Unlike the GPU, though, an 8 core A78C is very much viable on 8nm.
Awww, I would have hoped for a 1.9 to 2.1 GHz range for the CPU if it's on 4N.
 
My guess is currently 1.7 Tflops in portable mode (12 SMs @ 550MHz) and 3.4 Tflops docked (12 SMs @ 1.1GHz).
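
(For reference, those Tflops numbers fall straight out of the SM count and clock; a quick sketch:)
Code:
# Ampere: 128 FP32 "cores" per SM, 2 FLOPs per core per cycle (FMA)
def ampere_tflops(sms, clock_ghz):
    return 2 * sms * 128 * clock_ghz / 1000

print(f"Portable: {ampere_tflops(12, 0.550):.2f} Tflops")  # ~1.69
print(f"Docked:   {ampere_tflops(12, 1.100):.2f} Tflops")  # ~3.38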

To explain my reasoning, let's play a game of Why does Thraktor think a TSMC 4N manufacturing process is likely for T239?

The short answer is that a 12 SM GPU is far too large for Samsung 8nm, and likely too large for any intermediate process like TSMC's 6nm or Samsung's 5nm/4nm processes. There's a popular conception that Nintendo will go with a "cheap" process like 8nm and clock down to oblivion in portable mode, but that ignores both the economic and physical realities of microprocessor design.

To start, let's quickly talk about power curves. A power curve for a chip, whether a CPU or GPU or something else, is a plot of the amount of power the chip consumes against the clock speed of the chip. A while ago I extracted the power curve for Orin's 8nm Ampere GPU from a Nvidia power estimator tool. There are more in-depth details here, here and here, but for now let's focus on the actual power curve data:
Code:
Clock      W per TPC
0.42075    0.96
0.52275    1.14
0.62475    1.45
0.72675    1.82
0.82875    2.21
0.93075    2.73
1.03275    3.32
1.23675    4.89
1.30050    5.58

The first column is the clock speed in GHz, and the second is the Watts consumed per TPC (which is a pair of SMs). Let's create a chart for this power curve:

[Chart: Orin 8nm Ampere GPU power curve, W per TPC against clock speed]


We can see that the power consumption curves upwards as clock speeds increase. The reason for this is that to increase clock speed you need to increase voltage, and power consumption is proportional to voltage squared. As a result, higher clock speeds are typically less efficient than lower ones.

So, if higher clock speeds are typically less efficient, doesn't that mean you can always reduce clocks to gain efficiency? Not quite. While the chart above might look like a smooth curve, it's actually hiding something; at that lowest clock speed of 420MHz the curve breaks down completely. To illustrate, let's look at the same data, but chart power efficiency (measured in Gflops per Watt) rather than outright power consumption:

[Chart: Orin 8nm Ampere GPU power efficiency, Gflops per Watt against clock speed]


There are two things going on in this chart. For all the data points from 522 MHz onwards, we see what you would usually expect, which is that efficiency drops as clock speeds increase. The relationship is exceptionally clear here, as it's a pretty much perfect straight line. But then there's that point on the left. The GPU at 420MHz is less efficient than it is at 522MHz, why is that?

The answer is relatively straight-forward if we consider one important point: there is a minimum voltage that the chip can operate at. Voltage going up with clock speed means efficiency gets worse, and voltage going down as clock speeds decrease means efficiency gets better. But what happens when you want to reduce clocks but can't reduce voltage any more? Not only do you stop improving power efficiency, it actually starts to go pretty sharply in the opposite direction.

Because power consumption is mostly related to voltage, not clock speed, when you reduce clocks but keep the voltage the same, you don't really save much power. A large part of the power consumption called "static power" stays exactly the same, while the other part, "dynamic power", does fall off a bit. What you end up with is much less performance, but only slightly less power consumption. That is, power efficiency gets worse.

So that kink in the efficiency graph, between 420MHz and 522MHz, is the point at which you can't reduce the voltage any more. Any clocks below that point will all operate at the same voltage, and without being able to reduce the voltage, power efficiency gets worse instead of better below that point. The clock speed at that point can be called the "peak efficiency clock", as it offers higher power efficiency than any other clock speed.
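
To make that kink concrete, here's a small sketch that turns the power curve above into Gflops per Watt (a TPC is two SMs, i.e. 256 FP32 cores on Ampere). Of the sampled clocks, 522MHz comes out on top, with the true peak sitting somewhere between 420MHz and 522MHz:
Code:
# Gflops per Watt for one Orin TPC (2 SMs = 256 FP32 cores), using the table above
power_curve = {  # clock (GHz) -> W per TPC
    0.42075: 0.96, 0.52275: 1.14, 0.62475: 1.45, 0.72675: 1.82,
    0.82875: 2.21, 0.93075: 2.73, 1.03275: 3.32, 1.23675: 4.89, 1.30050: 5.58,
}

def gflops_per_watt(clock_ghz, watts_per_tpc):
    gflops = 2 * 256 * clock_ghz  # FMA = 2 FLOPs per core per cycle
    return gflops / watts_per_tpc

for clock, watts in power_curve.items():
    print(f"{clock * 1000:7.1f} MHz: {gflops_per_watt(clock, watts):6.1f} Gflops/W")

# Efficiency falls off above the peak (rising voltage) and below it (voltage floor).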

How does this impact how chips are designed?

There are two things to take from the above. First, as a general point, every chip on a given manufacturing process has a peak efficiency clock, below which you lose power efficiency by reducing clocks. Secondly, we have the data from Orin to know pretty well where this point is for a GPU very similar to T239's on a Samsung 8nm process, which is around 470MHz.

Now let's talk designing chips. Nvidia and Nintendo are in a room deciding what GPU to put in their new SoC for Nintendo's new console. Nintendo has a financial budget of how much they want to spend on the chip, but they also have a power budget, which is how much power the chip can use up to keep battery life and cooling in check. Nvidia and Nintendo's job in that room is to figure out the best GPU they can fit within those two budgets.

GPUs are convenient in that you can make them basically as wide as you want (that is use as many SMs as you want) and developers will be able to make use of all the performance available. The design space is basically a line between a high number of SMs at a low clock, and a low number of SMs at a high clock. Because there's a fixed power budget, the theoretically ideal place on that line is the one where the clock is the peak efficiency clock, so you can get the most performance from that power.

That is, if the power budget is 3W for the GPU, and the peak efficiency clock is 470MHz, and the power consumption per SM at 470MHz is 0.5W, then the best possible GPU they could include would be a 6 SM GPU running at 470MHz. Using a smaller GPU would mean higher clocks, and efficiency would drop, but using a larger GPU with lower clocks would also mean efficiency would drop, because we're already at the peak efficiency clock.

In reality, it's rare to see a chip designed to run at exactly that peak efficiency clock, because there's always a financial budget as well as the power budget. Running a smaller GPU at higher clocks means you save money, so the design is going to be a tradeoff between a desire to get as close as possible to the peak efficiency clock, which maximises performance within a fixed power budget, and as small a GPU as possible, which minimises cost. Taking the same example, another option would be to use 4 SMs and clock them at around 640MHz. This would also consume 3W, but would provide around 10% less performance. It would, however, result in a cheaper chip, and many people would view 10% performance as a worthwhile trade-off when reducing the number of SMs by 33%.
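
A minimal sketch of that tradeoff, using the illustrative 3W budget; the Watts-per-SM figures are the rough example numbers above, not measured data:
Code:
# Illustrative comparison of GPU configs within a fixed 3W power budget
configs = [
    # (SM count, clock in MHz, W per SM at that clock -- example figures only)
    (6, 470, 0.50),   # at the peak efficiency clock
    (4, 640, 0.75),   # smaller GPU at a higher clock
]

for sms, clock_mhz, w_per_sm in configs:
    total_w = sms * w_per_sm
    gflops = 2 * sms * 128 * clock_mhz / 1000
    print(f"{sms} SMs @ {clock_mhz} MHz: {total_w:.1f} W, {gflops:.0f} Gflops")

# 6 SMs @ 470 MHz: 3.0 W, 722 Gflops
# 4 SMs @ 640 MHz: 3.0 W, 655 Gflops (~10% less performance for 33% fewer SMs)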

However, while it's reasonable to design a chip with intent to clock it at the peak efficiency clock, or to clock it above the peak efficiency clock, what you're not going to see is a chip that's intentionally designed to run at a fixed clock speed that's below the peak efficiency clock. The reason for this is pretty straight-forward; if you have a design with a large number of SMs that's intended to run at a clock below the peak efficiency clock, you could just remove some SMs and increase the clock speed and you would get both better performance within your power budget and it would cost less.

How does this relate to Nintendo and T239's manufacturing process?

The above section wasn't theoretical. Nvidia and Nintendo did sit in a room (or have a series of calls) to design a chip for a new Nintendo console, and what they came out with is T239. We know that the result of those discussions was to use a 12 SM Ampere GPU. We also know the power curve, and peak efficiency clock for a very similar Ampere GPU on 8nm.

The GPU in the TX1 used in the original Switch units consumed around 3W in portable mode, as far as I can tell. In later models with the die-shrunk Mariko chip, it would have been lower still. Therefore, I would expect 3W to be a reasonable upper limit to the power budget Nintendo would allocate for the GPU in portable mode when designing the T239.

With a 3W power budget and a peak efficiency clock of 470MHz, then the (again, not theoretical) numbers above tell us the best possible performance would be achieved by a 6 SM GPU operating at 470MHz, and that you'd be able to get 90% of that performance with a 4 SM GPU operating at 640MHz. Note that neither of these say 12 SMs. A 12 SM GPU on Samsung 8nm would be an awful design for a 3W power budget. It would be twice the size and cost of a 6 SM GPU while offering much less performance, if it's even possible to run within 3W at any clock.

There's no world where Nintendo and Nvidia went into that room with an 8nm SoC in mind and a 3W power budget for the GPU in handheld mode, and came out with a 12 SM GPU. That means either the manufacturing process, or the power consumption must be wrong (or both). I'm basing my power consumption estimates on the assumption that this is a device around the same size as the Switch and with battery life that falls somewhere between TX1 and Mariko units. This seems to be the same assumption almost everyone here is making, and while it could be wrong, I think them sticking with the Switch form-factor and battery life is a pretty safe bet, which leaves the manufacturing process.

So, if it's not Samsung 8nm, what is it?

Well, from the Orin data we know that a 12 SM Ampere GPU on Samsung 8nm at the peak efficiency clocks of 470MHz would consume a bit over 6W, which means we need something twice as power efficient as Samsung 8nm. There are a couple of small differences between T239 and Orin's GPUs, like smaller tensor cores and improved clock-gating, but they are likely to have only marginal impact on power consumption, nowhere near the 2x we need, which will have to come from a better manufacturing process.

One note to add here is that we actually need a bit more than a 2x efficiency improvement over 8nm, because as the manufacturing process changes, so does the peak efficiency clock. The peak efficiency clock will typically increase as an architecture is moved to a more efficient manufacturing process, as the improved process allows higher clocks at given voltages. From DVFS tables in Linux, we know that Mariko's peak efficiency clock on 16nm/12nm is likely 384MHz. That's increased to around 470MHz for Ampere on 8nm, and will increase further as it's migrated to more advanced processes.

I'd expect peak efficiency clocks of around 500-600MHz on improved processes, which means that instead of running at 470MHz the chip would need to run at 500-600MHz within 3W to make sense. A clock of 550MHz would consume around 7.5W on 8nm, so we would need a 2.5x improvement in efficiency instead.
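
Putting rough numbers on that (the per-TPC Watts are interpolated from the Orin table above, so treat them as approximate):
Code:
# How much more efficient than 8nm does the process need to be to fit 12 SMs in ~3W?
PORTABLE_BUDGET_W = 3.0
TPCS = 6  # 12 SMs = 6 TPCs

# Approximate W per TPC on Samsung 8nm, interpolated from the Orin power curve
w_per_tpc_8nm = {
    470: 1.05,   # around 8nm's peak efficiency clock
    550: 1.25,   # around the assumed peak efficiency clock on a better process
}

for clock_mhz, w_per_tpc in w_per_tpc_8nm.items():
    total_8nm = TPCS * w_per_tpc
    print(f"12 SMs @ {clock_mhz} MHz on 8nm: ~{total_8nm:.1f} W "
          f"-> needs ~{total_8nm / PORTABLE_BUDGET_W:.1f}x better efficiency")

# 12 SMs @ 470 MHz on 8nm: ~6.3 W -> needs ~2.1x better efficiency
# 12 SMs @ 550 MHz on 8nm: ~7.5 W -> needs ~2.5x better efficiency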

So, what manufacturing process can give a 2.5x improvement in efficiency over Samsung 8nm? The only reasonable answer I can think of is TSMC's 5nm/4nm processes, including 4N, which just happens to be the process Nvidia is using for every other product (outside of acquired Mellanox products) from this point onwards. In Nvidia's Ada white paper (an architecture very similar to Ampere), they claim a 2x improvement in performance per Watt, which appears to come almost exclusively from the move to TSMC's 4N process, plus some memory changes.

They don't provide any hard numbers for similarly sized GPUs at the same clock speed, with only a vague unlabelled marketing graph here, but they recently announced the Ada based RTX 4000 SFF workstation GPU, which has 48 SMs clocked at 1,565MHz and a 70W TDP. The older Ampere RTX A4000 also had 48 SMs clocked at 1,560MHz and had a TDP of 140W. There are differences in the memory setup, and TDPs don't necessarily reflect real world power consumption, but the indication is that the move from Ampere on Samsung 8nm to an Ampere-derived architecture on TSMC 4N reduces power consumption by about a factor of 2.

What about the other options? TSMC 6nm or Samsung 5nm/4nm?

Honestly, the more I think about it the less I think these other possibilities are viable. Even aside from the issue that these aren't processes Nvidia is using for anything else, I just don't think a 12 SM GPU would make sense on either of them. Even on TSMC 4N it's a stretch. Evidence suggests it would achieve a 2x efficiency improvement, but we would be looking for 2.5x in reality. There's enough wiggle room there, in terms of Ada having some additional features not in T239 and our not having hard data on Ada's power consumption, that the actual improvement in T239's case may be 2.5x, but even that would mean that Nintendo have gone for the largest GPU possible within the power limit.

With 4N just about stretching to the 2.5x improvement in efficiency required for a 12 SM GPU to make sense, I don't think the chances for any other process are good. We don't have direct examples for other processes like we have for Ada, but from everything we know, TSMC's 5nm class processes are significantly more efficient than either their 6nm or Samsung's 5nm/4nm processes. If it's a squeeze for 12 SMs to work on 4N, then I can't see how it would make sense on anything less efficient than 4N.

But what about cost, isn't 4nm really expensive?

Actually, no. TSMC's 4N wafers are expensive, but they're also much higher density, which means you fit many more chips on a wafer. This SemiAnalysis article from September claimed that Nvidia pays 2.2x as much for a TSMC 4N wafer as they do for a Samsung 8nm wafer. However, Nvidia is achieving 2.7x higher transistor density on 4N, which means that a chip with the same transistor count would actually be cheaper if manufactured on 4N than on 8nm (even more so when you take yields into account).
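
As a rough sketch of that cost argument, using the 2.2x wafer cost and 2.7x density figures from the article (and ignoring yield, which only helps the smaller die further):
Code:
# Relative cost per transistor, TSMC 4N vs Samsung 8nm
wafer_cost_ratio = 2.2   # 4N wafer cost / 8nm wafer cost (SemiAnalysis figure)
density_ratio = 2.7      # 4N transistor density / 8nm transistor density

cost_per_transistor = wafer_cost_ratio / density_ratio
print(f"4N cost per transistor is ~{cost_per_transistor:.0%} of 8nm's")  # ~81%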

Are there any caveats?

Yes, the major one being the power consumption of the chip. I'm assuming that Nintendo's next device is going to be roughly the same size and form-factor as the Switch, and they will want a similar battery life. If it's a much larger device (like Steam Deck sized) or they're ok with half an hour of battery life, then that changes the equations, but I don't think either of those are realistic. Ditto if it turned out to be a stationary home console for some reason (again, I'm not expecting that).

The other one is that I'm assuming that Nintendo will use all 12 SMs in portable mode. It's theoretically possible that they would disable half of them in portable mode, and only run the full 12 in docked mode. This would allow them to stick within 3W even on 8nm. However, it's a pain from the software point of view, and it assumes that Nintendo is much more focussed on docked performance than handheld, including likely running much higher power draw docked. I feel it's more likely that Nintendo would build around handheld first, as that's the baseline of performance all games will have to operate on, and then use the same setup at higher clocks for docked.

That's a lot of words. Is this just all confirmation bias or copium or hopium or whatever the kids call it?

I don't think so. Obviously everyone should be careful of their biases, but I actually made the exact same argument over a year ago back before the Nvidia hack, when we thought T239 would be manufactured on Samsung 8nm but didn't know how big the GPU was. At the time a lot of people thought I was too pessimistic because I thought 8 SMs was unrealistic on 8nm and a 4 SM GPU was more likely. I was wrong about T239 using a 4 SM GPU, but the Orin power figures we got later backed up my argument, and 8 SMs is indeed unrealistic on 8nm. The 12 SM GPU we got is even more unrealistic on 8nm, so by the same logic we must be looking at a much more efficient manufacturing process. What looked pessimistic back then is optimistic now only because the data has changed.
What if Nintendo uses batteries with higher density? Cheap Chinese smartphones are already getting close to the 6000mAh level today.
 
That's what Nintendo does best though. Computationally cheap workarounds that look almost as good as the real thing, and work great thanks to sublime art.
I'd like for them to keep this mentality for developing games on the next console, even if it has PS5-level hardware. Optimization doesn't hurt.
 
I'd like for them to keep this mentality for developing games on the next console, even if it has PS5-level hardware. Optimization doesn't hurt.
I wouldn't call the console versions of current games unoptimized outside of some edge cases (the PC versions, though...). And I wouldn't put it past Nintendo to adopt some heavier features over the cheaper workarounds. For example, Zelda's take on global illumination could really benefit from a costlier version, since it has a tendency to produce errors.

What if Nintendo uses batteries with higher density? Cheap Chinese smartphones are already getting close to the 6000mAh level today.
Definitely not out of the question and something I expect. I don't think we'll see 6000mAh, but 5000mAh like the Steam Deck seems likely.
 
What if Nintendo uses batteries with higher density? Cheap Chinese smartphones are already getting close to the 6000mAh level today.
Although possible, that somehow seems excessive to me? It would also be heavier.

Currently the Switch uses a 4310mAh battery; we've suggested a 5000mAh battery before.


They have upgraded the battery gen-on-gen before, from the DS to the 3DS, so it's possible they'll do it again here.

Who knows? They may even deliver OLED/V2 levels of battery life with this 5000mAh battery while targeting the same power budget…

While also being a huge gen over gen leap.
 
1) Nintendo sells new hardware models that play the same games, hoping to sell to people who already own a previous version of the system, all the time. Selling one that plays games better would be even more attractive.

2) The PS5 was in higher demand than supply, but it's still been one of the fastest-selling systems in history. I don't think a lack of new systems is what's different, but rather that software development ambitions/budgets/development times didn't grow as fast as would be necessary for passing on the earlier machines to make sense.
1) You're right, and there have been a ton of Wii U ports on Switch, for example. However, the Wii U only sold 13 million units, so for the vast majority of Nintendo Switch owners these ports were completely new gaming experiences. Even ports like Skyward Sword HD, Miitopia or Xenoblade Chronicles: Definitive Edition were aimed at a new audience, at least in part. There are certainly people who have TotK and will want an improved version on Switch 2, but I don't know if the appeal is comparable.


What I also don't know is to what extent having to wait five years for a new 3D Zelda or a new 3D Mario (if the next one is cross-gen) that really uses what the Switch 2 can do will be a problem, in a way it hasn't been for the PS5, knowing that development times are getting longer and longer, as you say.


2) Indeed, demand for the PS5 is very strong and it sold very well despite the shortages. However, the fact that Sony effectively had to turn away people who wanted a PS5 but couldn't buy one remains a good reason to have a longer-than-expected cross-gen period.

I totally agree with what you say about development times getting longer, and I think that's precisely the difference between Nintendo and PlayStation: longer development times mean more difficulty maintaining a sustained release schedule. Sony can count on The Last of Us as much as on Call of Duty, on Spider-Man as well as on GTA. If both Zelda and Mario are cross-gen, then even with other big sellers like Mario Kart or Pokémon, there is a risk of a gap in the schedule at some point.

1) I think we all agree with this. The point of divergence is when people expect the next Switch to be readily available to anyone who wants one.

When we look at the OLED model accounting for 51% of Switch sales in the last FY (9.2 million out of 17.9 million), we can see a lot of demand for upgrades, and also demand from new users choosing the best experience over the cheaper ones. So, as long as they have a big lineup of system sellers, they should have enough demand to sell out for around two years, even if those system sellers are also available on the Switch.

And of course, if both we and Nintendo miscalculate and there's not much demand for the new one (say, another 3DS), then they can cancel the Switch versions to boost demand and have the next wave of games a little bit earlier.


2) I agree, but it's not just PlayStation first-party, though. COD, FIFA, GTA, GI, Fortnite, ER, HL, etc. and many other PS best-sellers and system-sellers are also on PS4 (and on the $300 Series S, some even on phones and/or the Switch), and that didn't prevent the huge demand for the PS5.

And considering that 70% of Switch owners in the US by 2018 had a PS or Xbox, the audiences may not differ as much as you think, especially the early adopters. But even if they do, replacing functional electronics with ones that do the same thing but better has become quite widespread nowadays. I'm not saying that a huge part of their user base will do it without exclusive games, but even a small chunk of 130+ million is still a lot of demand.
1) I agree with you, and I think that given the success of the Switch, the smooth transition Nintendo wants will necessarily involve a cross-gen period. I don't believe in an abrupt break at all. However, the time games take to develop means that if you offer major titles as cross-gen, people who bought the Switch 2 will wait a long time before getting future iterations designed specifically for it.

That's why I see a balance between cross-gen titles and some exclusives. In the short term it's fine to sell tons of copies of TotK, but at some point in the life cycle of the next console, people will wonder where their next-gen Zelda or Mario is. And I don't know if remakes and remasters will be attractive enough in the meantime. But perhaps the market has changed and I may be wrong.

2) I think that PlayStation consoles are attractive enough to sell games, and Nintendo games are attractive enough to sell consoles. For me, that explains the very high demand for the PS5 despite the extended cross-gen period. This is why I insist on the different market positioning.

You can have an Xbox or a PlayStation AND a Nintendo console, and this was already the case with Nintendo's traditional handhelds before the Switch, which were complementary to other consoles. However, I don't deny that the hybrid approach, which allows the Switch to be used both as a companion mobile console and as a home console, may have (re)brought in a new audience; that's true.
 
Wow, what a post! Thank you for writing all of this down and explaining your thoughts!

I haven't been able to follow all of the leaks and rumors, far from it even, but what exactly is the assumption that the next system is going to have 12 SMs based on? Is that based on the same Nvidia hack that included several titles that were actually announced later on? Is there any possible explanation why this chip could have 12 SMs and NOT be the next handheld device for Nintendo? Others seem to expect 8 SMs (I think I've seen it on the DF Discord, but also here); do you see a reason why they would come to that assumption?
(Edit:) Could this chip have 12 SMs but 4 deactivated for better yields?
Also, how does the rumor that devkits have been recalled play into this?

Just going by gut feeling, I struggle to imagine a system from Nintendo that has more performance than the Steam Deck, battery life comparable to the Logan Switch, the form factor of the original Switch, and is also sub-$429, which would be my upper bound (with Nintendo making a profit on it).
Again, just gut feeling, based on nothing else. You clearly understand A LOT about this. I'd appreciate it if you could address some of my questions above. :)

Yes, the 12 SMs is from the Nvidia hack. I assume most people who aren't aware of the hack are expecting 8 SMs because it's exactly half of Orin's GPU, or it just lines up with whatever their performance expectations are.

I don't think the rumoured recall of dev kits has much of an impact. We have pretty solid details on T239, and strong evidence that it was taped out in mid-2022, so it's the chip we're getting regardless of anything Nintendo might be doing with dev kits. One thing I'll note is that the timeframe for when the dev kits were rumoured to be recalled was pretty much the time when we would expect dev kits using actual T239 silicon to begin appearing, a few months after tape out. If any dev kits were recalled at that time they would have been early dev kits using non-final hardware (eg just PCs with Ampere GPUs, or jury-rigged Orin dev kits). I wonder if they recalled the old dev kits, but only sent the new T239-based kits to a much more limited number of third parties.

I admit outperforming Steam Deck at under $400 and with a much lower power draw seems unlikely, but there are quite a few architectural and economic factors in Nintendo's favour. Firstly, Nintendo is just using more power-efficient architectures. The CPU is an obvious case here, where A78 cores are much more power efficient than Zen 2, but even on 8nm, Ampere is around as efficient as RDNA2 on TSMC 7nm. If we take those architectures and migrate them to TSMC 4N, we get a much more efficient chip than Steam Deck uses.

Another factor is that Nintendo seems to be going more for a slow-and-wide approach, which yields better power efficiency. They're using 8 CPU cores at (almost certainly) lower clocks, compared to 4 cores at higher clocks. On the GPU side, the Steam Deck uses a relatively narrow-and-fast configuration for a device like this, running 512 "cores" at 1.6GHz to achieve 1.6Tflops. Comparatively, Nintendo would be running 1536 "cores" at 550MHz to hit 1.7Tflops. The latter configuration should be much more power efficient.
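
(A quick sketch of that comparison, just to show the two configurations land at similar Tflops despite very different clocks; the power advantage comes from the wider part running at a lower voltage:)
Code:
# Narrow-and-fast vs slow-and-wide: similar Tflops, very different clocks
def tflops(fp32_cores, clock_ghz):
    return 2 * fp32_cores * clock_ghz / 1000

print(f"Steam Deck:  {tflops(512, 1.6):.1f} Tflops (512 cores @ 1.6 GHz)")
print(f"T239 guess:  {tflops(1536, 0.55):.1f} Tflops (1536 cores @ 550 MHz)")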

On the economic side, Nintendo have the benefit of releasing at least a year and a half later than the Steam Deck, during which time prices of electronic components have dropped significantly. Nintendo is also operating at a much larger scale than Valve is. I don't know if Valve have given any sales figures for the Steam Deck, but I would imagine they would be very happy with sales of 1 or 2 million units. Nintendo is coming off a device that's sold 125 million and counting, and that counts for a lot in terms of economies of scale.

In particular, designing Steam Deck's SoC was a nice side-gig for AMD, but not something that had a major impact on their business, so there wouldn't have been much of an incentive to give Valve a particularly good deal, and the R&D costs would have had to be recouped in relatively few sales. Conversely, Nintendo are now Nvidia's only customer for consumer SoCs, and thanks to the Switch, the TX1 is now almost certainly the best-selling chip Nvidia have ever produced. Nvidia have a strong incentive to keep Nintendo on-board, and can amortise R&D costs over a much larger number of expected sales, which means Nintendo are in a position to negotiate a much better deal than Valve are.

What if Nintendo uses batteries with higher density? Cheap Chinese smartphones are already getting close to the 6000mAh level today.

Yeah, I do expect an increase in battery capacity, but more likely to around 5000mAh, which is in line with battery density improvements since the Switch. I'm taking that into account when I say I'm expecting a battery life in between the original Switch and the Mariko models, though, because running at around the same power draw as the original Switch with a 5000mAh battery would give a battery life somewhat better than OG units, but not quite as good as Mariko models.
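
For a rough sense of the numbers, here's a back-of-the-envelope battery life sketch; the ~3.7V nominal cell voltage and the ~6W portable system draw are illustrative assumptions, not leaked figures:
Code:
# Back-of-the-envelope battery life (all inputs are rough assumptions)
NOMINAL_V = 3.7  # typical Li-ion nominal cell voltage

def battery_life_hours(capacity_mah, system_draw_w):
    watt_hours = capacity_mah / 1000 * NOMINAL_V
    return watt_hours / system_draw_w

# OG Switch battery vs a hypothetical 5000mAh pack at a similar ~6W portable draw
print(f"4310 mAh @ 6 W: {battery_life_hours(4310, 6):.1f} h")  # ~2.7
print(f"5000 mAh @ 6 W: {battery_life_hours(5000, 6):.1f} h")  # ~3.1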
 
Sadly, #teamleapday was already dead before it was formed, because Leap Day 2024 is a Thursday, unless Nintendo switches over to Thursday releases for next-gen.

#teamleapday knew it was a Thursday from the beginning:

Well, he said exactly. I don't think anyone has predicted February 29th yet, so maybe that's it.

Edit: Actually, I'd unironically say that the 29th of February would be a pretty good launch date if they are planning on Q1 2024, now that I think of it. It's only a few days off the original Switch launch date, it's definitely memorable, and although it's not the usual Friday launch, it is a Thursday, so close enough.

The glory of Leap Day transcends your petty concerns about days of the week.
 