Z0m3le said:
> A few things that people might miss on here... 5LPP is actually Samsung 4LPX*, which has a further reduction according to the earlier chart you posted of ~10%, putting the power consumption difference between "8nm" and "4nm" at ~40% power reduction in that chart. It's also been noted that Drake has Ada power reduction enhancements via the Nvidia hack. This should further reduce this power consumption. Comparing nodes multiple generations apart is not a science, too many variables to deal with, but what I've come up with is somewhere around 45% for Drake power reduction vs Orin.

Yeah, it puts Samsung's 5nm processes at around the same place as TSMC's 7nm family, which is generally in line with expectations.
Regarding the use of original Switch clocks on the new model, although I don't think that specific rumour about using Samsung 5LPP has any weight to it, it's helpful in highlighting one of the reasons why I think the size of T239's GPU is indicative of a more advanced process, and actually higher clocks than the original Switch.
By poking around at the Jetson Power Tool, we can find that the power curve used for the Orin GPU fits very closely to the following equation:
P = N × 0.4132 × e^(2.01C)
Where N is the number of TPCs (each TPC containing two SMs), C is the clock speed measured in GHz, and P is the power consumption measured in Watts. If we take the 27% reduction for 5LPP as a flat value, we can just multiply the coefficient by 0.73, giving us a hypothetical 5LPP power curve of:
P = N × 0.3016 × e^(2.01C)
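As a quick sanity check, the curve is easy to play with in code (a sketch; the coefficients are just the fitted values above):

```python
import math

# Fitted Orin GPU power curve: P = N * a * e^(2.01 * C)
# N = number of TPCs, C = clock in GHz, P = power in Watts.
# a = 0.4132 for the 8nm fit, or 0.3016 for the hypothetical
# 5LPP curve (0.73 * 0.4132).
def gpu_power_w(tpcs, clock_ghz, coeff=0.4132):
    return tpcs * coeff * math.exp(2.01 * clock_ghz)

# T239's 6 TPCs on the hypothetical 5LPP curve:
print(round(gpu_power_w(6, 0.384, coeff=0.3016), 2))  # ~3.92
print(round(gpu_power_w(6, 0.768, coeff=0.3016), 2))  # ~8.47
```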
So, for a 6 TPC design like T239, we would get 3.92W for the GPU at 384MHz and 8.47W for it at 768MHz. Both of these are within the ballpark of what we'd expect for a new Switch, but still don't really explain why they would use such a large GPU. If we invert the equation, we can calculate the clock speed that can be achieved at a given power consumption for a given number of TPCs:
8nm: C = ln(P / (N × 0.4132)) / 2.01
5nm: C = ln(P / (N × 0.3016)) / 2.01
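Or, as a sketch in code (same fitted coefficients as above; the function just inverts the power curve):

```python
import math

# Invert P = N * a * e^(2.01 * C) for C: the highest clock (GHz)
# reachable at a given power budget (W) with a given TPC count.
def max_clock_ghz(power_w, tpcs, coeff=0.4132):
    return math.log(power_w / (tpcs * coeff)) / 2.01

# e.g. a 4 TPC (8 SM) GPU with a 3.92W budget on the hypothetical
# 5LPP curve:
print(round(max_clock_ghz(3.92, 4, coeff=0.3016) * 1000))  # ~586 MHz
```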
Let's assume that Nintendo were choosing between an 8 SM design and a 12 SM design, both on 5LPP with this hypothetical power curve. If their goal was 8.75W for the GPU in docked mode, then they could either have an 8 SM design clocked at 970MHz, providing 1,986 Gflops, or a 12 SM design clocked at 768MHz, providing 2,359 Gflops. Effectively, they're increasing their GPU size by 50%, but only achieving a 19% performance increase out of it. It's not zero return on investment, but it's not great.
Portable mode makes less sense, though. An 8 SM GPU within a 3.92W limit could clock to 586MHz, which would give 1,201 Gflops. A 12 SM design clocked at 384MHz consumes the same amount of power, and hits 1,178 Gflops. That is, they're actually getting slightly less performance with 12 SMs than they would have with 8.
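Putting the four configurations side by side (a sketch using the hypothetical 5LPP coefficient; Gflops here is just SMs × 128 FP32 cores × 2 FLOPs per clock, so expect small rounding differences from the figures above):

```python
import math

def gpu_power_w(sms, clock_ghz, coeff=0.3016):
    return (sms / 2) * coeff * math.exp(2.01 * clock_ghz)  # one TPC = two SMs

def gflops(sms, clock_ghz):
    return sms * 128 * 2 * clock_ghz  # 128 FP32 cores/SM, 2 FLOPs per FMA

for label, sms, mhz in [("docked    8 SM", 8, 970), ("docked   12 SM", 12, 768),
                        ("portable  8 SM", 8, 586), ("portable 12 SM", 12, 384)]:
    c = mhz / 1000
    print(f"{label} @ {mhz}MHz: {gpu_power_w(sms, c):.2f}W, "
          f"{gflops(sms, c):.0f} Gflops")
```

Both docked configurations land at roughly the same power (~8.5W), as do both portable configurations (~3.9W), which is the point of the comparison.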
Of course, this analysis is inherently limited by assuming that 5LPP provides a simple scalar reduction in power over 8nm. However, I'd still expect roughly similar behaviour. Effectively, what we're looking at here is the marginal return on an increased number of TPCs for a given power draw, or equivalently for given clocks. This is related to the power efficiency curve: the marginal return is zero at the clock speed which provides peak power efficiency, tends up towards 1 at peak clocks, and is negative below the peak efficiency point. Hence the 12 SM GPU at 3.92W performs worse than an 8 SM one, because it's dealing with clock speeds below the point of peak efficiency.
As a point of reference, if we take the 8nm Orin power curve from above, we can calculate the clock speed which achieves maximum efficiency: 477MHz. This explains why Orin Jetson products don't clock below 420MHz and instead disable TPCs at lower power settings: below the peak efficiency point, fewer TPCs at higher clocks actually provide more performance for the same power. If we do the same for the hypothetical 5LPP curve, we get 644MHz as the peak efficiency clock. This probably doesn't bear much relationship to the actual point of peak efficiency on 5LPP, given the crude nature of applying a scalar shift to the curve, but we should definitely see this peak efficiency point increase as we move onto more efficient manufacturing processes.
I would be very surprised to see Nintendo using clock speeds lower than the peak efficiency point for the process they're using. If they were, then they'd effectively be paying extra for a less powerful GPU. If money weren't an issue, then the hypothetical ideal design for a power-limited chip would be to identify the peak efficiency clock speed and then choose however many TPCs fit in your power budget at that clock speed. Removing TPCs would slightly reduce performance while lowering your costs, and adding TPCs would also reduce performance (as clocks would have to drop below the peak efficiency point to stay within budget) while raising your costs.
If you're more constrained by cost than power consumption, then the optimal design is simply as many TPCs as you can afford, clocked as high as you can. In a more realistic scenario where you're balancing cost and power draw of various components, the design will sit somewhere between these two extremes, sitting at a sweet spot in the power/clock/cost space where the marginal benefit of adding more TPCs isn't worth the additional cost.
Of course Nintendo actually have two power profiles to be concerned about, portable and docked, but power efficiency is far more important in portable mode, whereas docked mode is going to be more balanced against cost. Running at below peak efficiency clocks in portable mode would effectively mean they've chosen a design which trades away power efficiency in portable mode in favour of improved power efficiency in docked mode, and increased their costs in doing so, which doesn't make a whole lot of sense to me.
I would expect Nintendo to have chosen a GPU such that they're clocking somewhere above peak efficiency clocks in portable mode, and around 2x that in docked mode. This gives them a good balance of performance, power draw and cost, and it's exactly what they did with TX1. At 8nm we can easily see that T239 doesn't have such a GPU, as clocking at peak efficiency clocks of 477MHz would draw 6.47W for the GPU alone in portable mode. Assuming about 3W for the GPU, the optimal number of TPCs on 8nm would be 2.78, or 5.57 SMs.
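Under the same model, the "ideal design" exercise for 8nm is one line of arithmetic (a sketch; the 477MHz peak-efficiency clock and ~3W portable GPU budget are the figures from above):

```python
import math

# Invert P = N * 0.4132 * e^(2.01 * C) for N: how many TPCs fit
# in a given power budget at a given clock, on the 8nm curve.
def optimal_tpcs(budget_w, clock_ghz, coeff=0.4132):
    return budget_w / (coeff * math.exp(2.01 * clock_ghz))

n = optimal_tpcs(3.0, 0.477)
print(round(n, 2), round(n * 2, 2))  # ~2.78 TPCs, ~5.57 SMs

# For comparison, 6 TPCs at 477MHz on 8nm:
print(round(6 * 0.4132 * math.exp(2.01 * 0.477), 2))  # ~6.47 W
```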
On our hypothetical 5LPP power curve, with a peak efficiency point at 644MHz, we would get 6.6W for T239's GPU. As I said, this is very crude, and to be honest it's probably a good illustration of why we shouldn't just treat the differences in power consumption between manufacturing processes as a flat percentage. The peak efficiency point is likely to be lower than this, and the 27% power improvement is unlikely to be representative of the lower end of the power curve. Still, I would be surprised if 5LPP were so much more efficient than 8N that 12 SMs would be a sensible design choice. You'd need around a 50% reduction in power draw compared to 8N at the low end of the curve for 12 SMs to make sense. Judging by Ampere/Ada comparisons, 4N does seem to offer around a 50% reduction in power draw over 8N.
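For completeness, the 6.6W figure falls straight out of the hypothetical 5LPP curve (same caveats about the crude scalar shift apply):

```python
import math

# T239's 6 TPCs at the hypothetical 5LPP peak-efficiency clock of 644MHz.
p = 6 * 0.3016 * math.exp(2.01 * 0.644)
print(round(p, 1))  # ~6.6 W
```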
Anyway, my point is that using original Switch clocks on any process of 8nm or better would mean that they've chosen a GPU that's too big for their requirements, and are paying more for something that's giving them less performance in portable mode and only marginal improvements docked. As I see it, increases in the minimum viable clock speed (i.e. the peak efficiency clock) with improved manufacturing processes make a clock speed of 500MHz+ in portable mode more likely, along with a similar increase to the docked clock. That being the case, it's impossible to justify 12 SMs on 8nm, and honestly hard to justify them on either Samsung 5nm or TSMC 7nm. Only on TSMC's 5nm/4nm processes does 12 SMs seem sensible to me.
*See dakhil's post below for correction