
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (Read the staff posts before commenting!)

So basically, 5LPP = 8nm (actual) and the Samsung 8nm people have been talking about as a worst case scenario = 10nm (actual)?
Err, sorry, I really meant 7nm not 8nm in that previous comment. But otherwise, yes, roughly.

You could see Samsung's 5LPP as loosely the equivalent of TSMC's 7nm, which is the node that the Steam Deck/PS5/Xbox Series consoles are all built on.
 
How much power would each 4 GB stick of LPDDR5 RAM take up?

To try to get an estimate of how much they would have to sacrifice otherwise if they had 4 sticks of LPDDR5 instead of just 2.
I'm gonna assume you just mean 4 chips, since sticks aren't used on a board like this. But they also wouldn't use 4 chips, for space reasons; that already cuts down power consumption. Also, we need to consider the speeds the RAM runs at. Without knowing that, there's not too much we can say beyond what the max could be.
 
The 27% TDP improvement from SEC 8nm to SEC 5nm in this specific example happens to be right on the money for the gap between SEC 8nm and TSMC 7nm for Ampere. It also matches ARM's own rough numbers for A78 on various nodes.

Which matches with the general consideration of SEC 5LPP as their "catch up" to TSMC 7nm.

Take the Jetson Power Tool, plug in Switch clocks and T239 specs, set the load to medium, and apply a 30% power savings and you get roughly the Switch's power draw. One of the reasons I've been thinking TSMC 7nm for so long. If Sammy can offer that level of gain over 8nm, for better than TSMC cost, on a long lived node, then that is a very comfortable place to be in.

Yeah, it puts Samsung's 5nm processes at around the same place as TSMC's 7nm family, which is generally in line with expectations.

Regarding the use of original Switch clocks on the new model, although I don't think that specific rumour about using Samsung 5LPP has any weight to it, it's helpful in highlighting one of the reasons why I think the size of T239's GPU is indicative of a more advanced process, and actually higher clocks than the original Switch.

By poking around at the Jetson Power Tool, we can find that the power curve used for the Orin GPU fits very closely to the following equation:

P = N × 0.4132 × e^(2.01 × C)

Where N is the number of TPCs (each TPC contains 2 SMs, so this is half the SM count), C is the clock speed measured in GHz, and P is the power consumption measured in Watts. If we take the 27% reduction for 5LPP as a flat value, we can just multiply the equation by 0.73, giving us a hypothetical 5LPP power curve of:

P = N × 0.3016 × e^(2.01 × C)

So, for a 6 TPC design like T239, we would get 3.92W for the GPU at 384MHz and 8.47W for it at 768MHz. Both of these are within the ballpark of what we'd expect for a new Switch, but still don't really explain why they would use such a large GPU. If we invert the equation, we can calculate the clock speed that can be achieved at a given power consumption for a given number of TPCs:
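If you want to play with these numbers yourself, here's a minimal Python sketch of that fitted curve; the coefficients and the flat 0.73 scaling are just the assumptions described above, not anything official from Nvidia or Samsung:

```python
import math

def gpu_power_watts(tpcs, clock_ghz, coeff=0.4132, exponent=2.01):
    """Fitted Orin-style curve: P = N * 0.4132 * e^(2.01 * C)."""
    return tpcs * coeff * math.exp(exponent * clock_ghz)

def gpu_power_5lpp(tpcs, clock_ghz):
    """Hypothetical 5LPP curve: same shape, scaled by the assumed flat 27% saving."""
    return 0.73 * gpu_power_watts(tpcs, clock_ghz)

# 6 TPC (12 SM) design like T239 at original Switch GPU clocks:
print(round(gpu_power_5lpp(6, 0.384), 2))  # ~3.92 W (portable clock)
print(round(gpu_power_5lpp(6, 0.768), 2))  # ~8.47 W (docked clock)
```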

8nm: C = ln( P / (N × 0.4132) ) / 2.01
5nm: C = ln( P / (N × 0.3016) ) / 2.01

Let's assume that Nintendo were choosing between an 8 SM design and a 12 SM design, both on 5LPP with this hypothetical power curve. If their docked-mode GPU budget was around 8.5W (the ~8.47W that a 12 SM design draws at 768MHz on this curve), then they could either have an 8 SM design clocked at 970MHz, providing 1,986 Gflops, or a 12 SM design clocked at 768MHz, providing 2,359 Gflops. Effectively, they're increasing their GPU size by 50%, but only achieving a 19% performance increase out of it. It's not zero return on investment, but it's not great.

Portable mode makes less sense, though. An 8 SM GPU within a 3.92W limit could clock to 586MHz, which would give 1,201 Gflops. A 12 SM design clocked at 384MHz consumes the same amount of power, and hits 1,178 Gflops. That is, they're actually getting slightly less performance with 12 SMs than they would have with 8.
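Here's the same idea as a small sketch, inverting the curve and converting to Gflops (assuming Ampere's 128 FP32 cores per SM at 2 flops per clock); the ~8.47W and 3.92W budgets are just the figures from the comparison above:

```python
import math

COEFF_8NM = 0.4132
COEFF_5LPP = 0.73 * COEFF_8NM  # assumed flat 27% saving

def clock_at_power(power_w, tpcs, coeff):
    """Invert P = N * coeff * e^(2.01 * C): C = ln(P / (N * coeff)) / 2.01, in GHz."""
    return math.log(power_w / (tpcs * coeff)) / 2.01

def gflops(sms, clock_ghz):
    """Ampere: 128 FP32 cores per SM, 2 flops per core per clock."""
    return sms * 128 * 2 * clock_ghz

for budget_w, label in [(8.47, "docked"), (3.92, "portable")]:
    for sms in (8, 12):
        c = clock_at_power(budget_w, sms / 2, COEFF_5LPP)  # 2 SMs per TPC
        print(f"{label}: {sms} SM -> {c * 1000:.0f} MHz, {gflops(sms, c):.0f} Gflops")
# Reproduces the ~970/768 MHz docked and ~586/384 MHz portable comparison above
# (give or take a MHz of rounding).
```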

Of course this analysis is inherently limited by assuming that 5LPP provides a simple, scalar reduction in power over 8nm. However, I'd still expect roughly similar behaviour. Effectively what we're looking at here is the marginal return on an increased number of TPCs for a given power draw, or equivalently for given clocks. This is related to the power efficiency curve, and we should see that it's 0 at the clock speed which provides peak power efficiency, tending up towards 1 at peak clocks, and negative below peak power efficiency. Hence why the 12 SM GPU at 3.92W performs worse than an 8 SM one does: it's dealing with clock speeds below the point of peak efficiency.

As a point of reference, if we take the 8nm Orin power curve from above, we can calculate the clock speed which achieves maximum efficiency: 477MHz. This explains why they don't clock below 420MHz on Orin Jetson products and instead disable TPCs at lower power settings, because it actually provides more performance given they're below peak efficiency. If we do the same for the hypothetical 5LPP curve, we get 644MHz as the peak efficiency. This probably doesn't bear much relationship to the actual point of peak efficiency on 5LPP, given the crude nature of applying a scalar shift to the curve, but we should definitely see this peak efficiency point increase as we move onto more efficient manufacturing processes.

I would be very surprised to see Nintendo using clock speeds lower than the peak efficiency point for the process they're using. If they were, then they'd effectively be paying extra for a less powerful GPU. If money weren't an issue, then the hypothetical ideal design for a power-limited chip should be to identify the peak efficiency clock speed and then choose however many TPCs fit in your power budget at that clock speed. Removing TPCs would reduce performance slightly while lowering your costs, while adding TPCs would also reduce performance but raise your costs.

If you're more constrained by cost than power consumption, then the optimal design is simply as many TPCs as you can afford, clocked as high as you can. In a more realistic scenario where you're balancing cost and power draw of various components, the design will sit somewhere between these two extremes, sitting at a sweet spot in the power/clock/cost space where the marginal benefit of adding more TPCs isn't worth the additional cost.

Of course Nintendo actually have two power profiles to be concerned about, portable and docked, but power efficiency is far more important in portable mode, whereas docked mode is going to be more balanced against cost. Running at below peak efficiency clocks in portable mode would effectively mean they've chosen a design which trades away power efficiency in portable mode in favour of improved power efficiency in docked mode, and increased their costs in doing so, which doesn't make a whole lot of sense to me.

I would expect Nintendo to have chosen a GPU such that they're clocking somewhere above peak efficiency clocks in portable mode, and around 2x that in docked mode. This gives them a good balance of performance, power draw and cost, and it's exactly what they did with TX1. At 8nm we can easily see that T239 doesn't have such a GPU, as clocking at peak efficiency clocks of 477MHz would draw 6.47W for the GPU alone in portable mode. Assuming about 3W for the GPU, the optimal number of TPCs on 8nm would be 2.78, or 5.57 SMs.
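As a quick sketch of that last calculation, taking the fitted 8nm coefficients and the ~477MHz peak-efficiency figure as given, plus an assumed ~3W portable GPU budget:

```python
import math

COEFF_8NM, EXP = 0.4132, 2.01

def watts_per_tpc(clock_ghz):
    """Power drawn by a single TPC at a given clock on the fitted 8nm curve."""
    return COEFF_8NM * math.exp(EXP * clock_ghz)

peak_eff_clock = 0.477  # GHz, the 8nm peak-efficiency point quoted above

# 6 TPCs (T239) at that clock: too much for a portable power budget.
print(round(6 * watts_per_tpc(peak_eff_clock), 2))    # ~6.47 W

# How many TPCs fit in a ~3 W portable GPU budget at that clock?
print(round(3.0 / watts_per_tpc(peak_eff_clock), 2))  # ~2.78 TPCs, i.e. ~5.6 SMs
```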

On our hypothetical 5LPP power curve, with a peak efficiency point at 644MHz, we would get 6.6W for T239's GPU. As I said, this is a very crude estimate, and to be honest it's probably a good illustration of why we shouldn't just treat the differences in power consumption between manufacturing processes as a flat percentage. The peak efficiency point is likely to be lower than this, and the 27% power improvement is unlikely to be representative of the lower end of the power curve. Still, I would be surprised if 5LPP were so much more efficient than 8N that 12 SMs would be a sensible design choice. You'd need around a 50% reduction in power draw compared to 8N at the low end of the curve for 12 SMs to make sense. Judging by Ampere/Ada comparisons, 4N does seem to offer around a 50% reduction in power draw over 8N.

Anyway, my point is that using original Switch clocks on any process of 8nm or better would mean that they've chosen a GPU that's too big for their requirements, and are paying more for something that's giving them less performance in portable mode and only marginal improvements docked. As I see it, increases in the minimum viable clock speed (i.e. peak efficiency clock) with improved manufacturing processes make a clock speed of 500MHz+ in portable mode more likely, and a similar increase to the docked clock. That being the case, it's impossible to justify 12 SMs on 8nm, and honestly hard to justify them on either Samsung 5nm or TSMC 7nm. Only on TSMC's 5nm/4nm processes do 12 SMs seem sensible to me.
 
How can we know if the T239 that received Linux support is the same T239 from NVN2? I mean, in that gap of time maybe there were some changes to the project, like a change of node; would something like that change the numbering of the project?
Ah, there just so happens to be a particular example near and dear to us.
OG Switch, Erista, is T210.
Red box/V2/Lite/OLED, Mariko, is... T214. Node change + updated memory controller from LPDDR4 to 4X, still a change in the number.

---

@Thraktor
So... should the clocks end up such that they nail a balance of both energy and silicon cost efficiency, how are you feeling about the odds of 7500 MT/s LPDDR5X to properly feed the thing? :p
 
After lurking and reading a lot of theories here I have a big question.... do we have some kind of info about the console type (portable, table, hybrid)? I want to believe that after seeing how many sales the Switch made till now, next gen will still be a hybrid type (I searched a while but couldn't find anything about this, like everyone just assumes that it's going to be)
 
Strictly speaking, the design itself points towards a battery powered device of some sort. Something that can fit within a thin & light laptop, or even smaller. There is also an absence of another device, so I think that it's safe to assume that this battery powered device has to fill the role of both portable and TV play. In other words, I interpret this as a hybrid.
 
After lurking and reading a lot of theories here I have a big question.... do we have some kind of info about the console type (portable, table, hybrid)? I want to believe that after seeing how many sales the Switch made till now, next gen will still be a hybrid type (I searched a while but couldn't find anything about this, like everyone just assumes that it's going to be)
No, that wouldn't be in any of the leaks. But it's expected to be a hybrid, because that's why the Switch sells.
 
@oldpuck really interesting article that is worth looking into, especially when considering the consoles' RT capabilities; there's theorizing when it comes to the Ampere lineup, but it is aiming to fix a Turing issue.


Also, why FLOPs aren’t everything people :p:



AMD implements raytracing acceleration by adding intersection test instructions to the texture units. Instead of dealing with textures though, these instructions take a box or triangle node in a predefined format. Box nodes can represent four boxes, and triangle nodes can represent four triangles. The instruction computes intersection test results for everything in that node, and hands the results back to the shader. Then, the shader is responsible for traversing the BVH and handing the next node to the texture units. RDNA 3 additionally has specialized LDS instructions to make managing the traversal stack faster.
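As a purely illustrative sketch of the traversal pattern described in that excerpt (shader-managed stack, fixed-width nodes), here's a simplified Python version; the node layout is invented for clarity and doesn't mirror AMD's actual in-memory format:

```python
from dataclasses import dataclass, field

@dataclass
class TriNode:
    triangles: list                                  # up to four triangles per node

@dataclass
class BoxNode:
    bounds: tuple                                    # ((xmin, ymin, zmin), (xmax, ymax, zmax))
    children: list = field(default_factory=list)     # up to four child nodes

def ray_hits_box(origin, direction, bounds):
    """Slab test: does a ray starting at `origin` enter the axis-aligned box?"""
    tmin, tmax = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, *bounds):
        if abs(d) < 1e-12:
            if not lo <= o <= hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        tmin, tmax = max(tmin, min(t0, t1)), min(tmax, max(t0, t1))
    return tmin <= tmax

def traverse(origin, direction, root):
    """Stack-based traversal: the 'shader' owns the stack and decides which
    children to visit after each box test, so every hop is a dependent
    (latency-sensitive) memory access, as the article describes."""
    stack, candidates = [root], []
    while stack:
        node = stack.pop()
        if isinstance(node, TriNode):
            candidates.extend(node.triangles)        # triangle intersection tests go here
        elif ray_hits_box(origin, direction, node.bounds):
            stack.extend(node.children)
    return candidates
```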

Cyberpunk 2077

Cyberpunk 2077 can make extensive use of raytracing, and is one of the best showcases of what the technology can enable. Turning on raytracing often produces an immediately noticeable difference, as reflections that the game doesn't handle with rasterization become visible. To do raytracing, Cyberpunk 2077 uses the DirectX Raytracing (DXR) API. It defines BVH-es in two structures – a top level acceleration structure (TLAS), and a bottom level acceleration structure (BLAS). Traversing the TLAS gets you to a BLAS, which eventually gets you to the relevant geometry.

With a capture from inside “The Mox”, we get a TLAS that covers a massive portion of the gameplay setting. Most of Night City seems to be included, as well as the surrounding areas. The TLAS has 70,720 nodes, of which 22,404 are box nodes and 48,316 are “instance nodes” that link to BLAS instances. Traversing the TLAS leads you to 8,315 BLAS instances, which collectively represent 11,975,029 triangles. The TLAS occupies 11 MB of storage, while everything together (including the BLAS-es) occupies 795 MB.

This further exemplifies the issue with the Series S. Regardless of people's thoughts on there being or not being RT on Drake, it is there hardware-wise and it is supported API-wise; it is meant to be used and will occupy space that needs to be accounted for. Of course, Nvidia is more efficient in this regard than AMD, but it's not zero.

The Series S only has 7.5-8GB of RAM available, with 6.7-7.2GB for games; by not using RT they avoid this overhead and get more out of the small RAM pool.
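To put that 795MB figure against the Series S numbers above (a crude back-of-the-envelope that assumes the whole structure would need to stay resident, which games can mitigate to some degree):

```python
accel_structure_mb = 795             # TLAS + BLASes from the Cyberpunk capture above
series_s_game_ram_mb = 6.7 * 1024    # lower end of the 6.7-7.2 GB available to games

print(f"{accel_structure_mb / series_s_game_ram_mb:.1%} of the game-visible pool")  # ~11.6%
```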

AMD thus uses a rather deep BVH with a lot of subdividing. That means less demand on intersection test throughput. But cache and memory latency will have a large impact, because each jump between nodes is dependent on the intersection test results from the previous node. GPUs have high cache and memory latency compared to CPUs, so RDNA 2 will need to keep a lot of rays in flight to hide that latency.

We previously looked at how RDNA 2 handled raytracing at the hardware level, and noted that the L2 cache played a very significant role. Looking at the raytracing structure sizes, the hitrates we saw make a lot of sense. The TLAS alone is 11 MB, so unless a lot of rays happen to go in the same direction, caches with capacities in the kilobyte range will probably have difficulty coping.

This is really important to note, as the consoles do not have an Infinity Cache to help with latency; they use GDDR6, which has notably high latency as a trade-off for its high bandwidth.

So they are more sensitive and it needs to be taken into account.

More in the article and really worth the read…

Here’s a final tidbit:

However, Nvidia takes a very different approach to BVH construction. The way Nsight Graphics presents the BVH suggests it’s an extremely wide tree. Expanding the top node immediately reveals thousands of bounding boxes, each of which points to a BLAS. Each BLAS then contains anywhere from a few dozen to thousands of primitives. If Nsight’s representation corresponds to the actual raytracing structure, then Nvidia can get to the bottom of their acceleration structure in just three hops. That makes Nvidia’s implementation far less sensitive to cache and memory latency.

To enable this approach, Nvidia likely has more flexible hardware, or is handling a lot more work with the general purpose vector units. Unlike AMD, where nodes can only point to four children, Nvidia does not have fixed size nodes. One node can point to two triangle nodes, while another points to six. A single triangle node can contain hundreds of triangles.

(They also have an 11MB TLAS)
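To get a rough feel for why tree width matters so much for latency, here's a purely illustrative comparison using the ~12M triangle figure from the capture above; real trees aren't this uniform, so treat it as an order-of-magnitude sketch:

```python
import math

triangles = 11_975_029    # from the Cyberpunk TLAS/BLAS capture quoted above

# Narrow (4-wide) tree: reaching a leaf takes on the order of log4(N) dependent fetches.
narrow_hops = math.log(triangles, 4)

# Wide tree as described for Nvidia: top node -> BLAS -> triangle node, roughly 3 hops.
wide_hops = 3

print(f"~{narrow_hops:.0f} dependent hops vs ~{wide_hops}")  # ~12 vs ~3
```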
 
 
@Thraktor
So... should the clocks end up such that they nail a balance of both energy and silicon cost efficiency, how are you feeling about the odds of 7500 MT/s LPDDR5X to properly feed the thing? :p

I'm also curious where most people stand on the LPDDR5 vs LPDDR5X issue. Bandwidth is going to be a bottleneck no matter what. Would 5X provide enough of a reason for Nintendo to choose it? Or is the improvement nominal? Is it too bleeding edge/expensive for Nintendo?
 
So theoretically, do we think this thing could manage 4K resolution in docked mode, but with everything essentially on the lowest settings possible, depending on the game?
"depending on the game" is doing a lot of heavy lifting.

But I'm gonna say no. Maybe 1080p > 2160p.

Maybe. But it's just as likely that it can't.
 
Anyway, my point is that using original Switch clocks on any process of 8nm or better would mean that they've chosen a GPU that's too big for their requirements, and are paying more for something that's giving them less performance in portable mode and only marginal improvements docked. As I see it, increases in the minimum viable clock speed (i.e. peak efficiency clock) with improved manufacturing processes make a clock speed of 500MHz+ in portable mode more likely, and a similar increase to the docked clock. That being the case, it's impossible to justify 12 SMs on 8nm, and honestly hard to justify them on either Samsung 5nm or TSMC 7nm. Only on TSMC's 5nm/4nm processes do 12 SMs seem sensible to me.

Assuming this math is accurate (and I don't care enough to see if these equations for modern mobile engineering are a good estimate, I haven't done electrical engineering and it would be a huge pain in the ass), this reads as

1. The Switch 2 will not come out for many years as Nintendo waits for more power efficient chips to be available that can deliver the clock speeds they want while staying <=11W

2. The Drake was cancelled because NVIDIA/Nintendo realized that they could not get chips power efficient enough to run at the clocks needed for this chip to work well and they miscalculated how easy it would be to get these chips when they designed the Drake.
 
Anyway, my point is that using original Switch clocks on any process of 8nm or better would mean that they've chosen a GPU that's too big for their requirements, and are paying more for something that's giving them less performance in portable mode and only marginal improvements docked. As I see it, increases in the minimum viable clock speed (i.e. peak efficiency clock) with improved manufacturing processes make a clock speed of 500MHz+ in portable mode more likely, and a similar increase to the docked clock. That being the case, it's impossible to justify 12 SMs on 8nm, and honestly hard to justify them on either Samsung 5nm or TSMC 7nm. Only on TSMC's 5nm/4nm processes do 12 SMs seem sensible to me.
Could it be that they've chosen a bigger GPU simply because that gets them more tensor cores and RT cores?
 
Could it be that they've chosen a bigger GPU simply because that gets them more tensor cores and RT cores?

I cannot imagine how badly the Switch 2 would perform generating BVHs so I can't imagine Nintendo cares at all about the raytracing capabilities of the Switch 2.

Sub RTX 2060 performance combined with an extremely weak CPU is just going to make RT performance ultra shitty.

I don't think NVIDIA has made custom hardware for generating BVHs in general so I doubt the Switch 2 would be the first thing to have that hardware.
 
After lurking and reading a lot of theories here I have a big question.... do we have some kind of info about the console type (portable, table, hybrid)? I want to believe that after seeing how many sales the Switch made till now, next gen will still be a hybrid type (I searched a while but couldn't find anything about this, like everyone just assumes that it's going to be)
Strictly speaking, the design itself points towards a battery powered device of some sort. Something that can fit within a thin & light laptop, or even smaller. There is also an absence of another device, so I think that it's safe to assume that this battery powered device has to fill the role of both portable and TV play. In other words, I interpret this as a hybrid.
DisplayPort. Despite aggressively pruning unneeded IO controllers, they've preserved the one needed to dock.
 
Could it be that they've chosen a bigger GPU simply because that gets them more tensor cores and RT cores?
At least in terms of tensor performance, they could just pull an Orin and run double speed, right?
 
Assuming this math is accurate (and I don't care enough to see if these equations for modern mobile engineering are a good estimate, I haven't done electrical engineering and it would be a huge pain in the ass), this reads as

1. The Switch 2 will not come out for many years as Nintendo waits for more power efficient chips to be available that can deliver the clock speeds they want while staying <=11W

2. The Drake was cancelled because NVIDIA/Nintendo realized that they could not get chips power efficient enough to run at the clocks needed for this chip to work well and they miscalculated how easy it would be to get these chips when they designed the Drake.
if this isn't a joke then I'm really questioning your reading comprehension
 
Assuming this math is accurate (and I don't care enough to see if these equations for modern mobile engineering are a good estimate, I haven't done electrical engineering and it would be a huge pain in the ass), this reads as

1. The Switch 2 will not come out for many years as Nintendo waits for more power efficient chips to be available that can deliver the clock speeds they want while staying <=11W

2. The Drake was cancelled because NVIDIA/Nintendo realized that they could not get chips power efficient enough to run at the clocks needed for this chip to work well and they miscalculated how easy it would be to get these chips when they designed the Drake.
How on earth did you even come to these conclusions from that post? It’s like you read something else entirely or inserted something to make a post.

Ah, there just so happens to be a particular example near and dear to us.
OG Switch, Erista, is T210.
Red box/V2/Lite/OLED, Mariko, is... T214. Node change + updated memory controller from LPDDR4 to 4X, still a change in the number.

---

@Thraktor
So... should the clocks end up such that they nail a balance of both energy and silicon cost efficiency, how are you feeling about the odds of 7500 MT/s LPDDR5X to properly feed the thing? :p
And what about LPDDR5T :p
 
@Thraktor
So... should the clocks end up such that they nail a balance of both energy and silicon cost efficiency, how are you feeling about the odds of 7500 MT/s LPDDR5X to properly feed the thing? :p

That's one I'm not sure about. On the memory controller side Nvidia just announced at GTC that their Grace CPU is sampling, which means they've already got a product manufactured on TSMC 4N with an LPDDR5X controller in the hands of partners. I believe Grace was designed by the Tegra group, like T239, so conceivably if T239 were manufactured on 4N there shouldn't have been any reason not to use it across both products.

What I'm more concerned about is LPDDR5X part cost and availability. We're only just seeing it in phones, and the smallest capacity I've seen is 8GB (with a 64-bit interface), which means for a 128-bit bus you've got a minimum of 16GB, which likely pushes up the price quite a bit compared to what seems to be a 12GB baseline using standard LPDDR5 parts. Then again, I don't know if this is just because LPDDR5X hasn't made it to mid-range and low-end phones yet. In 2019, when LPDDR4X was around a year and a half old, Nintendo switched from LPDDR4 to 4X, without even using the higher clock speeds. Since then LPDDR4X seems to have completely displaced LPDDR4, so it seems likely that Nintendo were aware that 4X would be a better choice for availability long-term, and that's what informed the change. If the memory industry expects 5X to completely displace LPDDR5 over the next few years, then we could see the same thing here.
 
Nvidia would make the T239 on it. Nintendo buys the chips from Nvidia. It's not like Nintendo wouldn't have known it was on 4N, because it would have been made for it from the start. Remember that Nvidia bought almost $10B of 4N capacity in 2021/2022.

Is there math showing whether NVIDIA would be able to purchase enough 4N chips to produce 30m Switch 2s within its first two years without spending a metric fuckton due to phone competition?
 
How would Nintendo actually get 4N chips?
They don't??? That's not directly their responsibility.

Nor was it Sony's or Microsoft's.

It was their vendors: AMD and Nvidia (if it is on 4N).

Is there math showing whether NVIDIA would be able to purchase enough 4N chips to produce 30m Switch 2s within its first two years without spending a metric fuckton due to phone competition?
4N is only for Nvidia.
 
Okay, but like... 4nm is sold to many different vendors.
I don’t think anyone has 4nm.

And 4N is Nvidia-specific, hence why it's 4N and not N4, which is the general TSMC 5nm-based node evolved from the N5 process. Not to be mixed up with N5+ and N5P, which are Apple-specific if I remember right.
 
Doesn't this capacity depend heavily on how much TSMC charged them for these chips (which could have been very expensive due to competition!)?
Yeah, but they are stuck with too much prepaid 4N allocation. They need customers to offload the excess.


Millions of Switch 2s would come in handy.
 
Is there math showing whether NVIDIA would be able to purchase enough 4N chips to produce 30m Switch 2s within its first two years without spending a metric fuckton due to phone competition?
As per the person you're quoting, they already spent a "metric fuckton", in the form of a reported $10 billion to pre-allocate capacity from TSMC. For a point of reference, TSMC's revenue for their entire 5nm family (which includes 4N) in 2022 was on the order of $15 billion (source: TSMC 4Q22 earnings conference). They paid TSMC enough to buy around two thirds of their overall capacity for a year, and although the pre-payments likely covered multiple years, it still represents a huge chunk of TSMC's 5nm business.

The chip would also be very small on 4N, meaning it wouldn't really account for that many wafers. I estimated a 66.1mm2 die size and 854.4 yielded dies per wafer (here). So, with your estimate of 30m units over two years, that would mean 35,112 wafers over two years, or 1,463 wafers per month. In Q4 2022 TSMC shipped 3.7 million wafers, or 1.23 million wafers per month. While TSMC don't publish wafer numbers per process, even if 5nm accounts for only 5% of that, Nvidia could make 30m of them in two years with only around 2.4% of that 5nm capacity, after paying enough to secure 66% of it.
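Here's the arithmetic in one place, in case anyone wants to tweak the assumptions (the die size, yield and 5% capacity share are all estimates from the posts above):

```python
units = 30_000_000                     # hypothetical Switch 2 volume over two years
good_dies_per_wafer = 854.4            # from the 66.1 mm^2 die size estimate above

wafers_total = units / good_dies_per_wafer        # ~35,112 wafers
wafers_per_month = wafers_total / 24              # ~1,463 wafers/month

tsmc_wafers_per_month = 3_700_000 / 3             # Q4 2022 shipments, per month
n5_share = 0.05                                   # assumed 5nm share of total wafer output

fraction_of_5nm = wafers_per_month / (tsmc_wafers_per_month * n5_share)
print(f"{wafers_total:,.0f} wafers, {wafers_per_month:,.0f}/month, "
      f"{fraction_of_5nm:.1%} of assumed 5nm capacity")   # ~2.4%
```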
 
How would Nintendo actually get 4N chips?
Make them, like everyone else who has them? Nvidia bought billions of dollars of capacity years ago. Presumably they had some idea what they were going to manufacture at the time. They're not fighting tooth and nail with mobile companies or AMD for the chips in real time. Will Nvidia and Nintendo put it on 4N? Fuck if I know, but it's not exactly insane for Nvidia to put an SOC on the same process node as their graphics cards.

Ampere was on 8nm, Orin was on 8nm. Kepler was on 28nm, K1 was on 28nm. Pascal was on 16nm, TX2 was on 16nm. Volta was on 12nm, Xavier was on 12nm.

Nvidia intentionally puts multiple products on the same process node, because it allows them to move capacity around, rather than being stuck with a bunch of capacity on a node they can't use.

2. The Drake was cancelled because NVIDIA/Nintendo realized that they could not get chips power efficient enough to run at the clocks needed for this chip to work well and they miscalculated how easy it would be to get these chips when they designed the Drake.
I don't know if you're a troll, but I'm going to assume you're not. You're misreading Thraktor's post.

They bought the capacity in 2021, and the Nvidia hack was 2022. So at the time that the specs of the chips leaked, Nvidia and Nintendo knew, down to the penny, how much 5nm (and 8nm, and TSMC 7nm) would cost. And, because Ampere was already running on 8nm, and ARM was already running on multiple nodes, they knew, down to the Watt, what the power draw would be on SEC 8nm and TSMC 7nm, where Ampere was already shipping and A78 had already been manufactured.

So when Nvidia/Nintendo made the decision to make Drake a certain size, they did so knowing full well whether they were going to hit their power requirements on 8nm and 7nm, and exactly how much it would cost to move to 5nm. Thraktor isn't suggesting that Nvidia couldn't deliver a certain level of performance on 8nm; he's suggesting (as we have been speculating for a year now) that you wouldn't make the decisions Nvidia made if you weren't planning to use some 5nm capacity to build it.
 
Even looking only at Nintendo-published titles dated up until July, only Pikmin 4 and TOTK are actually from EPD, and TOTK is only where it is due to multiple delays. That sounds extremely light for a year of EPD releases.

I don't see many reasons why they'd be quiet if there are indeed more, outside of a new console that they'll be coming out on.
 
I'm also curious where most people stand on the LPDDR5 vs LPDDR5X issue. Bandwidth is going to be a bottleneck no matter what. Would 5X provide enough of a reason for Nintendo to choose it? Or is the improvement nominal? Is it too bleeding edge/expensive for Nintendo?
Mmm, the potential improvement would be a help, but nothing so crazy as to be like, a 'Pro' level of difference. Noticeable and welcome, but at the same time, we could live without it. I do, however, think that a change to 5X is inevitable within the lifetime of the system for availability reasons (regular 4 stopped being made while the regular Switch is still going, for example). So I personally think that there's value in skipping the step where one updates the memory controller and just going straight to 5X.
But as Thraktor points out, there are cost/availability questions. There should be 6 GB modules (according to Samsung's LPDDR5X page, at least), so I think they just haven't made their way to mid-range phones yet.
As per the person you're quoting, they already spent a "metric fuckton", in the form of a reported $10 billion to pre-allocate capacity from TSMC. For a point of reference, TSMC's revenue for their entire 5nm family (which includes 4N) in 2022 was on the order of $15 billion (source: TSMC 4Q22 earnings conference). They paid TSMC enough to buy around two thirds of their overall capacity for a year, and although the pre-payments likely covered multiple years, it still represents a huge chunk of TSMC's 5nm business.

The chip would also be very small on 4N, meaning it wouldn't really account for that many wafers. I estimated a 66.1mm2 die size and 854.4 yielded dies per wafer (here). So, with your estimate of 30m units over two years, that would mean 35,112 wafers over two years, or 1,463 wafers per month. In Q4 2022 TSMC shipped 3.7 million wafers, or 1.23 million wafers per month. While TSMC don't publish wafer numbers per process, even if 5nm accounts for only 5% of that, Nvidia could make 30m of them in two years with only around 2.4% of that 5nm capacity, after paying enough to secure 66% of it.
Nothing important to add, but just wanted to add a fun remark:
The N5 fab in Arizona is expected to be capable of what, at least 20k wafers a month? Other clients aside, in the scenario where the REDACTED gets made on 4N, Eagleland can potentially handle a rather decent chunk of the supply :whistle:
 


I'm more curious as to how this scales. It's running on a 13900K and 4090, so it's a question of how well even a Series X and PS5 could run it, let alone the Series S and Drake.
 