
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (Read the staff posts before commenting!)

Is it theoretically possible that the Switch can get more "impossible" PS4/XOne ports thanks to FSR 2.0? This could explain the HWL port to the Switch🤔
Ehh. Even if FSR 2.0 is still cheap (and it sounds more complicated than the original), the most it can do is make games that would've had bad resolution cuts on Switch end up with slightly less bad resolution cuts.
 
Is it theoretically possible that the Switch can get more "impossible" PS4/XOne ports thanks to FSR 2.0? This could explain the HWL port to the Switch🤔
It's very likely this was in development for Switch for years, I don't think it's only "now" possible due to relatively recent developments.
 
I’m not saying it necessarily is new to Ampere, just that it’s executed a bit differently than it was in the past, i.e. I'm implying it was done back then as well.

Really it’s the FP32 that is technically done differently; the FP16 is done the same.


CUDA is a programming model, and Nvidia cards have had cores that run CUDA code for a long time. However, the microarchitecture that encompasses those cores differs from generation to generation.

Ampere, for example, has 128 CUDA cores per SM, while Turing has 64 CUDA cores per SM.

There are other changes as well to the cache, ROPs, TMUs, and other features that were added or changed. Turing for example has 8 Tensor Cores per SM, but Ampere only has 4 tensor cores per SM and they are the 3rd generation of Tensor Cores while Turing has 2nd generation.

This is what makes up the architecture and what differentiates one from the other. They all run CUDA, yes, but they are not all equal cards. The layout of the card differs from one generation to another, but the core ability to run CUDA is still there.
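To put those per-SM numbers side by side, here's a tiny Python sketch (the 12 SM example is just the configuration rumoured for Drake in this thread, not a confirmed spec):

```python
# Per-SM figures quoted above (not a full spec sheet).
SM_CONFIG = {
    "Turing": {"cuda_cores": 64,  "tensor_cores": 8, "tensor_gen": 2},
    "Ampere": {"cuda_cores": 128, "tensor_cores": 4, "tensor_gen": 3},
}

def gpu_totals(arch: str, sm_count: int) -> dict:
    """Scale the per-SM figures up to a GPU with `sm_count` SMs."""
    cfg = SM_CONFIG[arch]
    return {
        "cuda_cores": cfg["cuda_cores"] * sm_count,
        "tensor_cores": cfg["tensor_cores"] * sm_count,
        "tensor_gen": cfg["tensor_gen"],
    }

# Example: the 12 SM Ampere configuration rumoured for Drake in this thread.
print(gpu_totals("Ampere", 12))  # {'cuda_cores': 1536, 'tensor_cores': 48, 'tensor_gen': 3}
```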





This is an interesting way to put it 🤣
You are doing God's work, Thraktor.
 
I've been doing a bit of research into this, and I'm actually not sure that the bolded is correct. Samsung 8N is surely the cheapest plausible node per wafer, but once you take into account density and yields, it's entirely possible that a more advanced node like TSMC N5 is actually cheaper per chip. In fact, my back-of-a-paper-envelope maths suggests that a Samsung 8N Drake could cost 70% more than a TSMC N5 Drake.

I should emphasise that I have no expertise in this field, my analysis contains a lot of assumptions and estimations which may deviate significantly from reality, and you shouldn't take what I'm about to write any more seriously than any other random person on the internet. That said, I can run through the maths of it.

A few pages back, I posted an estimate of Drake's die size on various manufacturing processes. I've revised these figures in two ways since then. The first is that I'm now estimating Drake's transistor count to be around 8 billion transistors. This is based partly on Nvidia's Orin die photo actually being for an older 17 billion transistor configuration of the chip, and partly on the fact that Xbox Series S's "Lockhart" SoC reportedly comes in at 8 billion transistors itself. That chip has the same number of CPU cores (8) and GPU shader "cores" (1536) on the silicon as Drake, but we know that the Zen 2 CPU is larger and uses more transistors than A78, and RDNA2 similarly is larger and uses more transistors per "core" than Ampere.

There are some differences between Drake and base Ampere, though: the 4MB of L2 cache will add considerably to the total (based on the GA102 die, it looks like it could be around 1.3 billion transistors for that alone), and there might be some additional components on there courtesy of Nvidia that Nintendo don't really need, but which might be useful for Nvidia's other customers (e.g. an 8K codec block). I'm just going with 8 billion as a round figure, but again there's a large margin of error.

The second change is that I'm changing my estimate for TSMC N7->N6 density improvement from 18% (TSMC's claim) to 8.1% (actual measured improvement from Navi 23 to Navi 24). That being the case, my new estimates are as follows:

Process        Density (million transistors/mm²)   Drake die size (mm²)
Samsung 8nm    45.6                                 175.4
Samsung 7nm    59.2                                 135.1
Samsung 5nm    83.4                                 95.9
Samsung 4nm    109.7                                72.9
TSMC N7        65.6                                 122.0
TSMC N6        70.9                                 112.8
TSMC N5        106.1                                75.4

In terms of cost per wafer, my starting point was the figures shown in Ian Cutress's video on wafer prices (which incidentally is very informative if you're curious about how this kind of stuff works). This contains wafer cost figures for many of TSMC's nodes. It's important to note here that these numbers are a few years old at this point, and that the exact prices per wafer have surely changed (in fact they've probably gone down and come back up again since then), however I'm not really that interested in the absolute numbers, but rather the relative costs across different processes. The cost Nintendo pay for a Drake chip has a lot of other factors involved (packaging, testing, and obviously Nvidia's margins), which are difficult to estimate, so it's simpler to think about costs in relative terms.

The costs per wafer (in USD) quoted in that video for more recent nodes are:

Node             28nm       20nm       16nm       10nm       7nm
Cost per wafer   $2,361.84  $2,981.75  $4,081.22  $5,126.35  $5,859.28

These are just TSMC nodes, and this predated their 5nm processes. To estimate the 5nm wafer costs, I'm relying on this chart which TSMC released in mid-2021, showing the relative wafer manufacturing capacity of 16nm, 7nm and 5nm process families. This shows that the capacity of 7nm relative to 5nm in 2020 was 3.87:1, and the estimated capacity ratio in 2021 is shown as 1.76:1. We also know from TSMC's 2021 Q4 financials that 5nm accounted for 19% of revenue in 2021, compared to 31% for 7nm. The capacity figure from the chart doesn't reflect actual output, and it seems to reflect installed capacity at year-end, which obviously wouldn't be in operation over the entire year they're reporting revenue for. Therefore, if we assume that capacity was added uniformly over the year, the actual ratio of 7nm to 5nm wafers produced should be half way between the 2020 and 2021 year-end capacity numbers. That is, we would expect that over the course of 2021, TSMC produced about 2.4x as many 7nm wafers as 5nm wafers. With a 1.63x ratio of revenue between the two nodes, we can estimate that the revenue per wafer was approximately 47% higher for 5nm than 7nm. This would put a 5nm wafer at $8,622.76. Again, this may not be the correct absolute figure, but I'm mostly interested in whether the relative prices are accurate.
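For anyone who wants to check that arithmetic, here's a rough Python sketch of the estimate, assuming 7nm capacity stayed flat and 5nm capacity was added uniformly over the year; it lands slightly above the $8,622.76 figure because I rounded the wafer ratio to 2.4x above:

```python
# Rough reconstruction of the 5nm wafer-price estimate above.
# Assumptions: 7nm capacity held flat through 2021, 5nm capacity added uniformly
# over the year, and the 2021 revenue split (31% N7 vs 19% N5) applied to wafer counts.
cap_ratio_2020 = 3.87     # 7nm : 5nm installed capacity, end of 2020
cap_ratio_2021 = 1.76     # 7nm : 5nm installed capacity, end of 2021
rev_share_7nm, rev_share_5nm = 0.31, 0.19
wafer_cost_7nm = 5859.28  # USD, from the wafer-price video above

# Average 5nm capacity over 2021 (with 7nm normalised to 1.0), then the wafer ratio.
avg_5nm_capacity = (1 / cap_ratio_2020 + 1 / cap_ratio_2021) / 2
wafer_ratio_7_to_5 = 1 / avg_5nm_capacity                                    # ~2.4x as many 7nm wafers

# Revenue per wafer for 5nm relative to 7nm, and the implied 5nm wafer price.
rev_per_wafer_ratio = (rev_share_5nm / rev_share_7nm) * wafer_ratio_7_to_5   # ~1.48
print(round(wafer_ratio_7_to_5, 2), round(rev_per_wafer_ratio, 2))
print(round(wafer_cost_7nm * rev_per_wafer_ratio, 2))                        # ~8,690 unrounded
```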

So, onto the cost per die. To do this we have to estimate the number of dies per wafer, for which I use this yield calculator. I take the die sizes above and assume all dies are square. For the defect density, I'm using a figure of 0.1 defect/cm2, which is based on this Anandtech article. It's likely yields are actually a bit better than this by now, but it won't make a huge difference to the analysis.
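For reference, here's roughly what that calculation looks like in Python, using the standard dies-per-wafer approximation for a 300mm wafer and a simple Poisson yield model. It won't match the online calculator exactly (different gross-die packing and yield model), so it comes out a few percent off the table below, but the ratios land in the same ballpark:

```python
import math

WAFER_DIAMETER_MM = 300.0

def gross_dies(die_area_mm2):
    """Standard approximation for the number of (square) dies fitting on a round wafer."""
    d = WAFER_DIAMETER_MM
    return math.pi * (d / 2) ** 2 / die_area_mm2 - math.pi * d / math.sqrt(2 * die_area_mm2)

def die_yield(die_area_mm2, d0_per_cm2):
    """Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-(die_area_mm2 / 100.0) * d0_per_cm2)

def cost_per_die(die_area_mm2, wafer_cost_usd, d0_per_cm2=0.1):
    good_dies = gross_dies(die_area_mm2) * die_yield(die_area_mm2, d0_per_cm2)
    return wafer_cost_usd / good_dies

nodes = {"TSMC N7": (122.0, 5859.28), "TSMC N6": (112.8, 5859.28), "TSMC N5": (75.4, 8622.76)}
baseline = cost_per_die(*nodes["TSMC N5"])
for name, (area, wafer_cost) in nodes.items():
    c = cost_per_die(area, wafer_cost)
    print(f"{name}: ${c:.2f} per die, ratio {c / baseline:.2f}")
# Prints roughly $12.8 / $11.6 / $10.8 per die (ratios ~1.18 / 1.08 / 1.00),
# close to the figures in the table above.
```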

Process    Die area (mm²)   Dies per wafer   Cost per wafer ($)   Cost per die ($)   Cost per die ratio
TSMC N7    122.0            427              5,859.28             13.72              1.15
TSMC N6    112.8            462              5,859.28             12.68              1.06
TSMC N5    75.4             723              8,622.76             11.93              1.00

For N6 TSMC are probably charging a bit more per wafer than N7, but as I have no way of estimating this, I'm just leaving the price per wafer the same. The actual cost per die here won't be even close to what Nintendo will have to pay, both with the old numbers being used for wafer prices, and with packaging, testing and Nvidia's margins being added on top. However, the cost per die ratio in the last column is independent of those things. I've chosen TSMC N5 here as the baseline, and you can see that N7 and N6 are actually calculated as being more expensive per die than N5. The dies per wafer gives you a clue as to why, with the substantial increase in density of N5 (plus the smaller die resulting in a better yield ratio) meaning that even a significantly more expensive wafer cost doesn't necessarily mean more expensive chips themselves.

For the Samsung manufacturing processes, I haven't been able to find any information (even rough estimates) on wafer costs, or wafer output and revenue splits that might be used to estimate revenue per wafer. However, we can look at the cost per wafer required to hit a cost per die ratio of 1.0 (ie the same cost per die as TSMC N5) and evaluate whether that's feasible. For defect density on 5nm I'm going to use 0.5, as it was rumoured to be resulting in 50% yields for mobile SoCs that should be roughly 100mm2 in size. For 8nm defect density it's a bit trickier, but I'm estimating 0.3 defects per square cm, based on product distribution of Nvidia's desktop GPUs (if it were lower, then they wouldn't have to bin quite so heavily, if higher they wouldn't be able to sell full-die chips like the 3090Ti at all). These are only very rough estimates, so I'll also look at a range of estimates for both of these.

Process       Defect density (per cm²)   Dies per wafer   Cost per wafer ($) for 1.00 ratio
Samsung 5nm   0.5                        383              4,569.19
Samsung 5nm   0.3                        459              5,475.87
Samsung 8nm   0.5                        148              1,765.64
Samsung 8nm   0.3                        201              2,397.93
Samsung 8nm   0.1                        280              3,340.40

Samsung's 5nm processes are a bit more realistically priced here. They're most comparable to TSMC's 7nm family in terms of performance and efficiency, and if they've got the defect density down to 0.3 then they could charge a similar amount per wafer to TSMC N7 and be competitive on a per-chip cost. If the defect density is actually 0.5, then they'd have to be much more aggressive on price per wafer, coming in below TSMC 10nm, and not that far off TSMC's 16nm family. Note that the manufacturing costs on Samsung's side are likely quite a bit higher for their 5nm processes than even TSMC's N7, as Samsung are using EUV extensively in their 5nm process, so there's only a limited extent to which they can be aggressive on price.

On the 8nm side, wafer costs get a lot more unrealistic if we're trying to assume that they can be competitive on a cost per die basis with N5. If we use the 0.3 defect density estimate, then they'd have to charge about $2,400 per wafer for N8, which is basically the same as TSMC's 28nm process. Keep in mind that Samsung have their own 28nm and 14nm processes that are pretty competitive with TSMC's 28nm and 16nm families, which means Samsung would either have to be charging a similar amount for an 8nm wafer as they charge for a 28nm wafer, or they are massively undercharging for their 28nm and 14nm processes if they're proportionally cheaper than 8nm. Both of these seem very unlikely. Even with only a 0.1 defect density (similar to TSMC's processes), they would have to charge $3,340 per wafer, which is quite a bit less than TSMC 16nm.

If we assume the cheapest Samsung could charge for an 8nm wafer is the same as a TSMC 16nm wafer (which would make it very aggressively priced), and the defect density is 0.3, the cost per die would be $20.30, which gives a cost per die ratio of 1.70, or 70% more expensive than the same die on TSMC N5. This is even ignoring the significant performance and efficiency benefits of going with TSMC's N5 process over Samsung's 8nm process.

We can also plug Mariko into these to figure out a relative cost. For the Mariko die size, I measured some photos I found online in comparison to the original TX1, and it looks to be approximately 10.1mm by 10.2mm. With an assumed 0.1 defect density on 16nm, this would put it at 507 dies per wafer, and therefore $8.05 per die. Again this doesn't represent the actual price Nintendo pay, but it means a TSMC N5 Drake (with about 4x the transistor count) would cost about 50% more than Mariko does.
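Running Mariko through the same kind of sketch (again assuming a 300mm wafer, a Poisson yield model and the old 16nm wafer price, so the absolute numbers are only rough):

```python
import math

def cost_per_die(area_mm2, wafer_cost_usd, d0_per_cm2, wafer_d_mm=300.0):
    """Same approximation as the sketch above: gross dies minus edge loss, Poisson yield."""
    gross = (math.pi * (wafer_d_mm / 2) ** 2 / area_mm2
             - math.pi * wafer_d_mm / math.sqrt(2 * area_mm2))
    good = gross * math.exp(-(area_mm2 / 100.0) * d0_per_cm2)
    return wafer_cost_usd / good

mariko = cost_per_die(10.1 * 10.2, 4081.22, 0.1)   # ~103mm2 die at the old TSMC 16nm wafer price
drake_n5 = cost_per_die(75.4, 8622.76, 0.1)
print(round(drake_n5 / mariko, 2))                  # ~1.48, i.e. roughly 50% more than Mariko
```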

This might explain why Nvidia is moving so aggressively onto TSMC's 5nm process. I had assumed that they would keep lower-end Ada chips on Samsung 8nm, or maybe Samsung 5nm, but this would suggest that it's actually cheaper per chip to use TSMC 5nm, even before the clock speed/efficiency benefits of the better node. It also, from my perspective, makes Drake's 12 SM GPU a lot more reasonable. For an 8nm chip in a Switch form factor, 12 SMs is much more than any of us expected, but if you were to design a TSMC N5 chip for a Switch-like device, 12 SMs is actually not excessive at all. It's a small ~75mm2 die, and there shouldn't be any issue running all 12 SMs at reasonable clocks in both handheld and docked modes. Yields would be extremely high, and as TSMC N5 will be a very long-lived node, there would be no pressure to do a node shrink any time soon.

Now, to caveat all of this again, I'm just a random person on the internet with no relevant expertise or insight, so it's entirely possible (probable?) that there are inaccurate assumptions and estimates above, or just straightforward misunderstandings of how these things work. So take it all with a huge grain of salt. Personally I still think 8nm is very likely, possibly even more so than TSMC N5, but I think it's nonetheless interesting to run through the numbers to try to actually test my assumptions.
Just out of curiosity, if you know, do these costs per wafer literally just include the raw materials associated with the wafer and that particular process designated for that wafer, or does it also roll in all of the other costs associated with that production line, i.e. salaries, maintenance, etc.?

I would also imagine that the cost associated with each production line would vary significantly depending on the actual compositional makeup of the chips involved, or how many discrete layers are formed and with how much raw material. What you've cited of course are estimates but it does seem like the true cost could be quite different depending on a ton of factors.


As it relates to Nintendo here, they obviously want the best deal they can get for these chips, but on the other hand they didn't have to put 12 SMs into this thing. It's possible that they're willing to spend a little extra on design, and possibly even production, to better meet whatever needs they have for this product.
 
I guess this must be true since the CPU has more boxes to check to find the data it's looking for but then I assume this is also true for any memory type? The bigger the RAM quantity the slower it will be because the CPU has more indexes to look through? So, 16 GB of RAM will always be slower than its 8 GB counterpart?
When I store Link's hearts in memory, I know where they are. I don't have to search all of memory to find them, because they're always in the same place. You set your coffee on the table, picking it back up is the same speed no matter how big your house is. RAM is the same. You know the address of the thing you're looking for, so it doesn't matter how big your RAM is.
Cache - and here I don't mean CPU cache, I mean all kinds of caches - is weird because you don't actually know what's in there. Your "memory" of what you've got stored in the cache is... the cache! And because there are lots of different kinds of possible things in there, you don't have addresses for them; you have to go looking. There are lots of clever ways to store things so you don't have to scan cache from top to bottom every time you go looking (which would probably be so slow that the cache wouldn't help you).

Think of a library. Lots of books, and you don't know if the book you're looking for is in there. Bigger libraries are harder to search, absolutely, but you don't have to search the whole library. You can go straight to fiction, and then in fiction, fantasy fiction, alphabetized by author, hit the 'T's. Only now do you have to scan to see if they have Lord of the Rings, starting at Judith Tarr and stopping at Harry Turtledove.

L1/L2/L3 caches specifically get very slightly slower as they get bigger under certain conditions. Specifically, if the smaller cache was plenty for you, then you might pay a tiny cost for a bigger cache you don't actually use. The term "working set" is used to mean "the things the CPU cares about right now". When you're doing heavy optimizing you're trying to keep your working set smaller than your cache, because it's so much slower to go get new data from RAM, and swap it. So if you're able to keep your working set smaller than cache, then a bigger cache which is even microscopically slower could be a performance drop.

A thing to keep in mind is that this is a problem that a game is encountering thousands of times a second, if not more. We're not talking about enough L2 cache to keep a whole frame of game play data in the working set. We're talking about something like computing each Bokoblin's current momentum, the sort of task you might perform dozens of in order to render a single frame of gameplay. More L2 cache might make a single pathway a tiny bit slower, and every other pathway a lot faster.

Or it might not! Sometimes the way a game does "really critical and incredible thing" is super-duper optimized, and these kind of tradeoffs really do go the wrong way. Often this is a medium sized game problem. Smaller games using off-the-shelf engines rarely need to get this deep in the weeds. Huge AAA games can stress hardware a lot, but they're also doing so many different kinds of things that they're rarely getting bogged down in little edge cases like this. Clever medium sized games, however, might build their entire loop on one specific, highly optimized path and totally fall down in the edge cases.
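If you want to see that working-set effect in miniature, here's a toy LRU cache simulation in Python. It's a cartoon of real hardware (no set associativity, uniformly random accesses), but it shows how the hit rate falls away once the working set no longer fits in the cache:

```python
import random
from collections import OrderedDict

def hit_rate(cache_lines: int, working_set: int, accesses: int = 100_000) -> float:
    """Toy LRU cache fed uniformly random accesses over `working_set` distinct addresses."""
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        addr = random.randrange(working_set)
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)           # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)     # evict the least recently used line
    return hits / accesses

for ws in (512, 1024, 2048, 4096):
    print(ws, round(hit_rate(cache_lines=1024, working_set=ws), 2))
# Fits in cache (512, 1024): nearly every access is a hit.
# 2x the cache: ~50% of accesses now go out to RAM; 4x the cache: ~75%.
```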
 
Yeah, Samsung 5nm and TSMC 7nm/6nm are definitely possible, but going forward it looks like Nvidia will have one Samsung 8nm chip (Orin) and a whole load of TSMC 5nm chips (Ada, Hopper, Grace, etc.), so I'd say Drake is more likely to be on one of those two processes.
Assuming TSMC's been chosen by Nintendo and Nvidia as the foundry company of choice for the fabrication of Drake, I think whether Nintendo and Nvidia choose TSMC's N6 process node or TSMC's N5 process node for the fabrication of Drake comes down to which process node has more additional capacity available for the fabrication of Drake, more so than the cost of the process node.

Although Nvidia did pay a premium to TSMC for securing additional capacity for TSMC's N5 process node, I imagine the capacity for TSMC's N5 process node for Nvidia is going to be extremely tight, considering how ridiculously high demand for TSMC's N5 process node is (e.g. Dimensity 8000 and Dimensity 8100, Zen 4, Navi 31 and 32, etc.), and most of Nvidia's products in 2022 and beyond seem to be fabricated using TSMC's N5 process node (e.g. Hopper, Ada, etc.).

I feel confident that Grace's going to be fabricated using TSMC's N5 process node, since Grace's a datacentre CPU, and TSMC so far has been Nvidia's default choice for the fabrication of datacentre chips. And I think there's a good chance Atlan's going to be fabricated using TSMC's N5 process node, considering Atlan's sampling in 2023; and Nvidia's Arm based SoCs are generally fabricated on the same process node as the consumer and/or professional and/or datacentre GPUs with the same GPU architecture. And Orin seems to be no exception. (Of course, customised variants of Nvidia's Arm based SoCs, such as Drake for example, are a different story.)
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving? I personally would prefer them stick to a 720p screen and use the power and cost savings for other parts of the system, but I don't know if the savings are worth not having better picture quality and 1080p will be a good marketing bullet point to help show advancement over the current oled screen, especially if this system is marketed as a next-gen switch.
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving? I personally would prefer them stick to a 720p screen and use the power and cost savings for other parts of the system, but I don't know if the savings are worth not having better picture quality and 1080p will be a good marketing bullet point to help show advancement over the current oled screen, especially if this system is marketed as a next-gen switch.
My current thinking is that they'll want to share as many components as possible with the OLED model so that they don't have 3+ different process lines and supply chains set up for 3+ models. So I'm expecting a 720p OLED screen.
 
When I store Link's hearts in memory, I know where they are. I don't have to search all of memory to find them, because they're always in the same place. You set your coffee on the table, picking it back up is the same speed no matter how big your house is. RAM is the same. You know the address of the thing you're looking for, so it doesn't matter how big your RAM is.
Cache - and here I don't mean CPU cache, I mean all kinds of caches - is weird because you don't actually know what's in there. Your "memory" of what you've got stored in the cache is... the cache! And because there are lots of different kinds of possible things in there, you don't have addresses for them; you have to go looking. There are lots of clever ways to store things so you don't have to scan cache from top to bottom every time you go looking (which would probably be so slow that the cache wouldn't help you).

Think of a library. Lots of books, and you don't know if the book you're looking for is in there. Bigger libraries are harder to search, absolutely, but you don't have to search the whole library. You can go straight to fiction, and then in fiction, fantasy fiction, alphabetized by author, hit the 'T's. Only now do you have to scan to see if they have Lord of the Rings, starting at Judith Tarr and stopping at Harry Turtledove.

L1/L2/L3 caches specifically get very slightly slower as they get bigger under certain conditions. Specifically, if the smaller cache was plenty for you, then you might pay a tiny cost for a bigger cache you don't actually use. The term "working set" is used to mean "the things the CPU cares about right now". When you're doing heavy optimizing you're trying to keep your working set smaller than your cache, because it's so much slower to go get new data from RAM, and swap it. So if you're able to keep your working set smaller than cache, then a bigger cache which is even microscopically slower could be a performance drop.

A thing to keep in mind is that this is a problem that a game is encountering thousands of times a second, if not more. We're not talking about enough L2 cache to keep a whole frame of game play data in the working set. We're talking about something like computing each Bokoblin's current momentum, the sort of task you might perform dozens of in order to render a single frame of gameplay. More L2 cache might make a single pathway a tiny bit slower, and every other pathway a lot faster.

Or it might not! Sometimes the way a game does "really critical and incredible thing" is super-duper optimized, and these kind of tradeoffs really do go the wrong way. Often this is a medium sized game problem. Smaller games using off-the-shelf engines rarely need to get this deep in the weeds. Huge AAA games can stress hardware a lot, but they're also doing so many different kinds of things that they're rarely getting bogged down in little edge cases like this. Clever medium sized games, however, might build their entire loop on one specific, highly optimized path and totally fall down in the edge cases.
For fuck's sake, where were you until now in my life, oldpuck. I have needed you all along! I still don't understand why the cache orders the information in it the way it does but you just clarified why some games are more difficult to port (or an aspect of that I assume) than others. That is more than what I asked for.

And now, I have even more questions. That's the telltale sign that this convo is bearing fruits!
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving?
I think that depends on if Nintendo wants to support a refresh rate higher than 60 Hz (e.g. 120 Hz) and VRR in handheld mode since a 1080p display seems to be the minimum requirement to support a refresh rate higher than 60 Hz and VRR for OLED displays (e.g. iPhone 13 Pro and iPhone 13 Pro Max). Of course, Nintendo could theoretically customise the 720p OLED displays to add support for refresh rates higher than 60 Hz and VRR, but I imagine that won't come cheap.

Considering Nintendo once considered using a 480p 120 Hz display for Project Indy, I do think the likelihood of Nintendo using a 1080p display, especially one with support for a refresh rate higher than 60 Hz and VRR, is not as unlikely as originally thought, albeit still not a very high likelihood.
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving? I personally would prefer them stick to a 720p screen and use the power and cost savings for other parts of the system, but I don't know if the savings are worth not having better picture quality and 1080p will be a good marketing bullet point to help show advancement over the current oled screen, especially if this system is marketed as a next-gen switch.
Even with the old spec expectation (4-8 SMs and Orin's twice-as-performant tensor cores), 1080p would have been fine with a reliance on DLSS in portable mode.

The way I see it, if it ain’t broke don’t fix it. There are very few complaints about the image quality of the OLED. And that screen should also be HDR-capable in theory, further boosting image quality if utilized.

Unless Nintendo is planning to use VR, which 720p sucks at.
 
Do we have a list of prices for OLED panels in that form factor? In my opinion, a 4K screen makes more sense because you can run games at either 720p or 1080p and they would still scale evenly to the native res. Plus, you could run VR games at 2 * 2160 * 1920 pixels.
 
Do we have a list of prices for OLED panels in that form factor? In my opinion, a 4K screen makes more sense because you can run games at either 720p or 1080p and they would still scale evenly to the native res. Plus, you could run VR games at 2 * 2160 * 1920 pixels.
I don't believe so. But considering Sony's the only company using a 4K display for smartphones, I imagine the price is far from inexpensive.

The problem with attaching the console to a VR viewer headset (e.g. Google Cardboard, Nintendo Labo VR) is the VR viewer headset becomes top heavy when the console's attached, which doesn't translate to a pleasant VR experience, going by my personal experience with Nintendo Labo VR.
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving? I personally would prefer them stick to a 720p screen and use the power and cost savings for other parts of the system, but I don't know if the savings are worth not having better picture quality and 1080p will be a good marketing bullet point to help show advancement over the current oled screen, especially if this system is marketed as a next-gen switch.
I'm literally expecting this device to be the Switch OLED with the new chip inside.
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving? I personally would prefer them stick to a 720p screen and use the power and cost savings for other parts of the system, but I don't know if the savings are worth not having better picture quality and 1080p will be a good marketing bullet point to help show advancement over the current oled screen, especially if this system is marketed as a next-gen switch.
Nah. People at large don't notice the difference when the pixel density is sufficiently high, and as someone has shown, 7in 720p has a higher density than 27in 2160p. No point in wasting battery lighting all those extra pixels. I'd rather they use the extra headroom in some other way.
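For what it's worth, the pixel-density arithmetic behind that comparison looks like this (the 7" 1080p panel is purely hypothetical):

```python
import math

def ppi(width_px: int, height_px: int, diagonal_inches: float) -> float:
    """Pixels per inch along the diagonal."""
    return math.hypot(width_px, height_px) / diagonal_inches

print(round(ppi(1280, 720, 7.0)))    # ~210 PPI for a 7" 720p panel (OLED-model size)
print(round(ppi(3840, 2160, 27.0)))  # ~163 PPI for a 27" 4K monitor
print(round(ppi(1920, 1080, 7.0)))   # ~315 PPI for a hypothetical 7" 1080p panel
```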
 
Are there any details as to how big the contract was for the Samsung panels in the OLED model? I seem to recall speculation it was for quite a bit of capacity.

Aka, is it likely the next Switch just uses the same panel because Nintendo already has stock?
 
I could see Hogwarts having a Switch 4K upgrade if the console is indeed coming within the next year; maybe that’s why they have been quiet about a Switch version? It’s interesting though, as it’s not something I expected to see running on Switch.

I had a similar thought. Usually games that are too complex (visually) skip Switch but MAYBE they worked on the version for the next model, then tried to adapt it to the current model as well?

Or is it simply a matter of installed base? 100 million+ is hard to ignore.

Hypothesis...
 
@brainchild what is your opinion on the E3 2019 BOTW2 trailer compared to the 2021 trailer, technically? Also, if Nintendo did do a Switch 2/4K version of BOTW 2, do you think there would be significant visual improvements aside from resolution and framerate? I would think ray tracing would be a neat thing to add if the rumored specs are true.
 
So in order to increase a fixed quantity of RAM during the design phase (and more RAM is always better, as one poster mentioned), the only prerequisite is to have settled on a certain die size, or rather a memory bus width. And that makes the bus width a very critical quantity. So, in the case of the succ, since we know that the hardware is pretty much constrained to some form factor, the first thing we would check would be the bus width, I assume, since it helps us calculate the output of the GPU. Then we would be interested in the memory clock, and then in the possible VRAM configurations (if not straight up mentioned). Finally, we would turn our eyes to the CPUs, but those are bound to not be a surprise given ARM dominance in the low power space.

All of that in that order, correct? Assuming the CPUs will not bottleneck the GPU (which is a whole other topic).
More or less, yep! Although one more thing: since clock rates at runtime should be controlled software-side, not only can they be decided late in the design process, they can even be changed after launch! There are profiles/presets for the OG Switch that were added post-launch. Or so I recall reading in this thread before.
I have a feeling that cache might be the succ's 'secret sauce' along with RTX and machine learning cores. The more we learn about it the better. That said, although I kinda understand why a cache should not be big and why the physically closest cache must preferably be the fastest too, I still fail to see why you would increase the size of a lower-ranked cache? Unless you have to because it is shared with other components that are necessarily further away from the CPU? But then, what is the point in a SoC like the succ's, in which CPU and GPU have access to the same memory pool?

I think I am close to reaching a satisfactory level of knowledge about the design of SoCs. I thank everyone for being so great at explaining stuff.
The unbolded's already answered, so to respond to the bolded: as you've alluded to, different caches can be shared with other components. It'll vary depending on the product in question.
Heck, for an example, let's look at the Jetson AGX Orin!
Technical brief pdf
If the direct link to the pdf doesn't work, then it should be linked to at this page, where it says 'Download our latest technical brief to learn more about the Jetson AGX Orin'.
On page 3, there's the block diagram for the entire SoC. Note the 4 MB 'System Cache' for later.
On page 5, there's the GPU block diagram. There, you can see that each SM gets its own 192 KB L1 cache. Then there's a GPU-wide L2 cache. Presumably, the 'System Cache' serves as the de facto L3 cache for the GPU.
Then on page 8, there's the CPU block diagram. Each A78 core gets its own 64 KB L1 cache for Instruction, 64 KB L1 cache for Data, and 256 KB L2 cache. Oh, yea, separate L1 caches for Instruction and Data are normal for CPUs nowadays. Then for a given cluster of 4 cores, there's 2 MB L3 cache. Then presumably, that 'System Cache' sits above as the de facto L4 cache for the CPU. Remember that the number's descriptive; it'd just be the 4th level of cache to look through from the CPU's perspective, while it'd presumably be the 3rd level for the GPU to search.

For fuck's sake, where were you until now in my life, oldpuck. I have needed you all along! I still don't understand why the cache orders the information in it the way it does but you just clarified why some games are more difficult to port (or an aspect of that I assume) than others. That is more than what I asked for.

And now, I have even more questions. That's the telltale sign that this convo is bearing fruits!
So, given how small caches are, relatively speaking, sooner or later it's basically a case of 'I need to kick something out to create space for this new thing'.
So, 'why information within the cache is where it is' kinda becomes 'the order in which I evicted old stuff'.
Naturally, the next question is 'so how do you decide what to clear out?' There are a lot of different policies. And honestly, we're getting beyond my depth here :unsure:. That said, if you're curious, I can at least point you towards this wikipedia page.
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with these upscaling techniques being more and more prevalent and improving?
Man, I hope so. Sticking this thing with a 720p screen would be like giving base Switch just 3DS's top screen.
 
For fuck's sake, where were you until now in my life, oldpuck. I have needed you all along! I still don't understand why the cache orders the information in it the way it does but you just clarified why some games are more difficult to port (or an aspect of that I assume) than others. That is more than what I asked for.

And now, I have even more questions. That's the telltale sign that this convo is bearing fruits!

Disclaimer: I have zero formal education in this field, so if what I'm about to say is incorrect I apologize in advance.

My understanding is that the cache is more or less 'what you've used recently'. Once L1 gets filled up (which happens faster than you can blink -- microprocessors are about as fast as the sun is big) whatever was accessed longest ago gets bumped into L2, and when that's full things get bumped into L3. I usually picture each level of the cache as being a deck of index cards, and each time you find a card you want it gets moved to the top of the pile. If you take a card from L2 or L3 or the VRAM or a server in Iceland then L1 overflows and the card at the bottom is put at the top of L2, causing L2 to overflow its bottom card to the top of L3.

There are complicated algorithms these days that are used to modify the relative importance of data in the cache, so even if card A hasn't been used in awhile the system knows it will be soon and keeps it from going below, say, L2 if it can help it, but it's all automated. The programmer generally doesn't directly control the cache, just designing around it if optimization is critical, but it's probably done now and then in embedded systems and might be possible for standardized hardware like consoles? But even if it is possible, in 99.9% of cases the cache... arranger? scheduler? The cache organizer will be automated.

Fake edit: @Look over there explained it better than me with just two sentences and more confidence. I'll definitely be reading that wiki page to up my knowledge!
 
For fuck's sake, where were you until now in my life, oldpuck. I have needed you all along! I still don't understand why the cache orders the information in it the way it does but you just clarified why some games are more difficult to port (or an aspect of that I assume) than others. That is more than what I asked for.

And now, I have even more questions. That's the telltale sign that this convo is bearing fruits!
The aim of cache is to have the information that you need to use in a closer location to the actual compute units, so that latency is minimised. The ideal situation happens where all the data you need next is present in the cache, and the stuff you don't use is not present in the cache. This would theoretically maximise performance, and is the ideal. The cache eviction algorithm needs to approach this optimal situation under two constraints: it can't hold all the things you need in the near future in cache due to size constraints, and even if it could, it generally doesn't know which pieces of data those are. Under those constraints, the best we can do is to keep those pieces of data in cache that are most likely to be used soon.

Which algorithm would do that? An often-used one is LRU (least recently used). We basically assign an age (in number of ticks since its last use, as registered on an internal clock) and evict the oldest item whenever a new piece of data that is not in the cache needs to be loaded. You then place that piece of data in cache and remove the least recently used one. This uses the heuristic that you are most likely to use a piece of data that you used recently (since it's associated with part of the code you are executing). Whenever that piece of data gets reused, you can set the age to 0 again.

You would probably design this in a hierarchical fashion: whenever an L1 cache element is evicted, it goes to L2, and then L3, and then to main memory. There are probably more sophisticated algorithms than vanilla LRU, but each should be a variant of LRU at its core. Edit: I would imagine that more complex algorithms can mix information about the frequency of use into an LRU variant as well, in order to combine as much information as possible into the eviction algorithm.
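Here's a minimal Python sketch of that LRU idea, using an ordered dictionary as the 'age' bookkeeping. Real hardware caches are set-associative and use approximations like pseudo-LRU rather than a literal list, so treat this as the conceptual version only:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: on a miss with a full cache, evict the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # kept ordered from least to most recently used

    def get(self, key):
        if key not in self.entries:
            return None                    # miss: the caller would fetch from the next level
        self.entries.move_to_end(key)      # reuse "resets the age"
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            evicted = self.entries.popitem(last=False)
            # In a cache hierarchy, the evicted entry would be handed down to L2/L3/RAM here.
```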
 
Man, I hope so. Sticking this thing with a 720p screen would be like giving base Switch just 3DS's top screen.
Well, the 3DS's top screen is a whopping 400x240 which is vastly insufficient when the Switch can render plenty of 720p games.
Maybe a closer comparison would be if the Switch had the Vita's screen res (and Xenoblade would still be sub-native ! haha gottem)

Anyways, Steam Deck right now has a 1280x800 screen and I don't hear many complaints about the screen resolution.
On the Deck and current Switch, native 800p/720p games look crisp.
I wouldn't complain if Drake had a 720p screen since games will render at that res or even be supersampled.

Makes sense to me that they would reuse OLED parts. A 1080p screen would be nice-to-have, but I'd rather have HDR before a res bump.
 
Will Nintendo pay extra to put a bigger L3 cache onto Drake to speed up ray tracing? I simply feel that 88/102GB/s is too little bandwidth to make ray tracing work as it should. Maybe it also depends on how efficient the RT cores on Drake are: are they still Gen 2, or will they be based on the next gen?
 
Maybe it also depends on how efficient the RT cores on Drake are: are they still Gen 2, or will they be based on the next gen?
So far, nobody knows. Nvidia made no mention of which generation the RT cores on Orin are part of on the Jetson AGX Orin Data Sheet.
 
Hopefully NVIDIA mentions it at GDC with their Orin AGX showcase thingy
If the RT cores on Orin are part of the same generation as the RT cores on Ada GPUs, then I could see Nvidia still not mentioning what generation the RT cores on Orin are part of until the next GTC conference during late 2022.

Also, you're confusing GTC with GDC, which I don't blame you for, since GDC and GTC are occurring in the same week this year.
 
I would definitely prefer a 720p OLED screen over a 1080p one (be it LCD or OLED).

At 7" there is really not much difference between 720p and 1080p (I meant there is but is not very noticeable). If the screen has less pixels, it wouldn't only have better battery, but games could have more graphical effects by rendering everything at 720p instead of 1080.

The power needed to render a game natively at 1080p compared to 720p is kinda big. You need way more power to do it.

I know things do not scale like this, but just to oversimplify, 1080p has more than double the pixels... You basically need twice the processing power to render at 1080p instead of 720.

I would rather they use that power to get better graphical fidelity and increased framerates over increased resolution.
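For reference, the raw pixel arithmetic behind that (actual GPU cost doesn't scale perfectly linearly with pixel count, as noted above):

```python
pixels_720p = 1280 * 720       # 921,600 pixels per frame
pixels_1080p = 1920 * 1080     # 2,073,600 pixels per frame
print(pixels_1080p / pixels_720p)   # 2.25x as many pixels to shade at 1080p
```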
 
The original Switch screen is close to 220 pixels-per-inch (PPI). A 1080p screen makes no sense unless the screen is going to be larger (approaching 10 inches).
 
@brainchild what is your opinion on the the e3 2019 BOTW2 trailer compared to the 2021 trailer technically.
Hard to say, as one took place primarily underground and the other outside. As far as I can tell, the rendering features are on par with the original, which was already future-proofed with nearly every standard rendering feature seen in AAA consoles today (global illumination, ambient occlusion, volumetric lighting, screen space reflections, physically-based materials, etc.), just with poorer fidelity. BOTW 2 seems to be roughly more of the same (rendering-wise) with some slight parameter refinements.
Also, if Nintendo did do a Switch 2/4K version of BOTW 2 do you think there would be significant visual improvements aside from resolution and framerate? I would think ray tracing would be a neat thing to add if the rumored specs are true.

Generally speaking, hardware that can allow the resolution of all of the visual effects/rendering features/textures to be increased in addition to a higher rendered output resolution will allow for significant visual improvements as those were really the only elements holding back the art style (aside from more complex geometry).

So basically, resolution is intrinsically tied to the quality of the rendering features on offer. Ray-tracing would be nice, but I don't know how performant it would be.

Personally, I would like to see all light sources have shadow-casting enabled but that might be asking for a bit much.
 
Hard to say, as one took place primarily underground and the other outside. As far as I can tell, the rendering features are on par with the original, which was already future-proofed with nearly every standard rendering feature seen in AAA consoles today (global illumination, ambient occlusion, volumetric lighting, screen space reflections, physically-based materials, etc.), just with poorer fidelity. BOTW 2 seems to be roughly more of the same (rendering-wise) with some slight parameter refinements.


Generally speaking, hardware that can allow the resolution of all of the visual effects/rendering features/textures to be increased in addition to a higher rendered output resolution will allow for significant visual improvements as those were really the only elements holding back the art style (aside from more complex geometry).

So basically, resolution is intrinsically tied to the quality of the rendering features on offer. Ray-tracing would be nice, but I don't know how performant it would be.

Personally, I would like to see all light sources have shadow-casting enabled but that might be asking for a bit much.
Honestly, I could see Nintendo going hard on RTGI/RTXGI for their first-party stuff, as lighting has always been a major focus in their games.

And with how scalable RTGI is nowadays (SVOGI running on OG Switch, and NVIDIA's RTXGI being scalable enough to run on the OG Xbox One in software), I could see them adopting RTXGI or a custom version of it for Switch 2 versions of the games.

Not to mention, NVN2's drivers apparently have DLSS and RT acceleration all over them, and with them also listing a PS4 Pro-level GPU raster-wise when docked (assuming at least 1GHz) and 12 RT cores (which should accelerate RT faster than the PS5), I could see RTXGI getting a fair bit of use in Nintendo first-party titles.
 
You are doing God's work, Thraktor.
I’m not Thraktor 😂


On a different note, I wonder if Nintendo will opt to have a 4-8MB SLC that is accessible to both the GPU and the CPU. It will be slower than the higher cache levels, but it would certainly help reduce the need to hit up the RAM and increase the effective bandwidth, while also increasing the level of efficiency present in this chip.

Making the 88-102GB/s a lot more manageable than just, well, 88-102GB/s that we see on paper.

Also worth mentioning that the changing of GPU frequency can affect the bandwidth of the cache.

Though I don’t know how this affects the SLC in this case.
 
I am curious as to whether or not this uncovered patent regarding a newer Dock with an internal swivel might help address concerns about Drake's proposed docked clocks.

Perhaps they're having such a hard time trying to dissipate heat from a regular or OLED Dock that they needed to redesign it again to accommodate greater exhaust pressure?

Perhaps they were anticipating the poor yields of Samsung 8nm and were ready to shell out (no pun intended) for another dock to accommodate the greater airflow, but in the end perhaps didn't need it because they switched over to TSMC 7nm?

Hard to say as one took place primarily underground and the other outside. As far as I can tell, the rendering features are on par with the original, which was already future proofed with nearly every standard rendering feature seen in AAA consoles today (global illumination, ambient occlusion, volumetric lighting, screen space reflections, physically-based materials, etc.) just with poorer fidelity. BOTW 2 seems to roughly more of the same with same (rendering-wise) with some slight parameter refinements.
I do wonder if AMD FSR 2.0 can help in those matters, though I'm not sure if it's too late in the development pipeline to implement that for this game, or any other subsequent game being released for Holiday 2022. (Then again, we do not know yet for how long Nintendo Switch Sports had been in development before it implemented FSR 1.0.) AMD has only just announced it in the press, but the results do look rather promising (as a short-term solution).

I'm still concerned about how Nintendo/Nvidia is going to handle storage speed issues.
 
I'm still concerned about how Nintendo/Nvidia is going to handle storage speed issues.
Well, I have two ideas.

1. Make the internal storage fast (NVMe)... 128GB minimum. And use microSD cards only for cold storage (physical cartridges would need to be installed). Nintendo would need to find a way to make the process user-friendly. Everything should happen automatically and in the background with little user interference.

2. Make the internal storage fast (NVMe)... 128GB minimum. Physical cartridges would need to be installed. Ditch the microSD slot and replace it with a user-friendly (no tools required) NVMe slot.
 
Well, I have two ideas.

1. Make the internal storage fast (NVMe)... 128GB minimum. And use microSD cards only for cold storage (physical cartridges would need to be installed). Nintendo would need to find a way to make the process user-friendly. Everything should happen automatically and in the background with little user interference.

2. Make the internal storage fast (NVMe)... 128GB minimum. Physical cartridges would need to be installed. Ditch the microSD slot and replace it with a user-friendly (no tools required) NVMe slot.
I'd say that #1 would be the cheapest option, and heck, they could even make it 256GB minimum to allow for a 128GB partition for cold-storage movement.

But #2 would be the most straightforward.
 
I'd say that #1 would be the cheapest option, and heck, they could even make it 256GB minimum to allow for a 128GB partition for cold-storage movement.

But #2 would be the most straightforward.
I would prefer #2.

Heck, a 1TB NVMe drive is cheaper than a 1TB microSD card.
 
Well, I have two ideas.

1. Make the internal storage fast (NVMe)... 128GB minimum. And use microSD cards only for cold storage (physical cartridges would need to be installed). Nintendo would need to find a way to make the process user-friendly. Everything should happen automatically and in the background with little user interference.

2. Make the internal storage fast (NVMe)... 128GB minimum. Physical cartridges would need to be installed. Ditch the microSD slot and replace it with a user-friendly (no tools required) NVMe slot.
The only option is to not do either of these.

Unless you want an even more underclocked switch 2 or a 1 hour battery life 🫠
 
Wouldn't the power usage be compensated by the shorter loading times?
It actively streams in data.

It’s like 4W for the smallest external NVMe SSD.

And I don’t see Nintendo applying the NVMe protocol and making their own super fast SSD like Sony or Apple did.
 
I don't believe so. But considering Sony's the only company using a 4K display for smartphones, I imagine the price is far from inexpensive.

The problem with attaching the console to a VR viewer headset (e.g. Google Cardboard, Nintendo Labo VR) is the VR viewer headset becomes top heavy when the console's attached, which doesn't translate to a pleasant VR experience, going by my personal experience with Nintendo Labo VR.
That is something I didn't test for myself. I compared the Oculus Quest's weight with the Switch's in a former post and assumed that it weighing 200g less would make it comfortable for VR. I guess not. So I guess it is not a use case for the 12 SMs.
More or less, yep! Although one more thing: since clock rates at runtime should be controlled software-side, not only can they be decided late in the design process, they can even be changed after launch! There are profiles/presets for the OG Switch that were added post-launch. Or so I recall reading in this thread before.

The unbolded's already answered, so to respond to the bolded: as you've alluded to, different caches can be shared with other components. It'll vary depending on the product in question.
Heck, for an example, let's look at the Jetson AGX Orin!
Technical brief pdf
If the direct link to the pdf doesn't work, then it should be linked to at this page, where it says 'Download our latest technical brief to learn more about the Jetson AGX Orin'.
On page 3, there's the block diagram for the entire SoC. Note the 4 MB 'System Cache' for later.
On page 5, there's the GPU block diagram. There, you can see that each SM gets its own 192 KB L1 cache. Then there's a GPU-wide L2 cache. Presumably, the 'System Cache' serves as the de facto L3 cache for the GPU.
Then on page 8, there's the CPU block diagram. Each A78 core gets its own 64 KB L1 cache for Instruction, 64 KB L1 cache for Data, and 256 KB L2 cache. Oh, yea, separate L1 caches for Instruction and Data are normal for CPUs nowadays. Then for a given cluster of 4 cores, there's 2 MB L3 cache. Then presumably, that 'System Cache' sits above as the de facto L4 cache for the CPU. Remember that the number's descriptive; it'd just be the 4th level of cache to look through from the CPU's perspective, while it'd presumably be the 3rd level for the GPU to search.


So, given how small caches are, relatively speaking, sooner or later it's basically a case of 'I need to kick something out to create space for this new thing'.
So, 'why information within the cache is where it is' kinda becomes 'the order in which I evicted old stuff'.
Naturally, the next question is 'so how do you decide what to clear out?' There are a lot of different policies. And honestly, we're getting beyond my depth here :unsure:. That said, if you're curious, I can at least point you towards this wikipedia page.
Thanks for directing me to a diagram. Now that I know what the caches' function is and what benefit they bring, I can now read them, and honestly they are a lot simpler to understand than pure text. I've reached the level at which abstract information makes sense to me! Thanks everyone. The cache-management leads you mentioned and the Wikipedia page are exactly the kind of stuff I want to have a better grasp of (shared memory, cache coherency, locality of reference). I could never have understood a word of what's written on those pages without all of you.
Disclaimer: I have zero formal education in this field, so if what I'm about to say is incorrect I apologize in advance.

My understanding is that the cache is more or less 'what you've used recently'. Once L1 gets filled up (which happens faster than you can blink -- microprocessors are about as fast as the sun is big) whatever was accessed longest ago gets bumped into L2, and when that's full things get bumped into L3. I usually picture each level of the cache as being a deck of index cards, and each time you find a card you want it gets moved to the top of the pile. If you take a card from L2 or L3 or the VRAM or a server in Iceland then L1 overflows and the card at the bottom is put at the top of L2, causing L2 to overflow its bottom card to the top of L3.

There are complicated algorithms these days that are used to modify the relative importance of data in the cache, so even if card A hasn't been used in awhile the system knows it will be soon and keeps it from going below, say, L2 if it can help it, but it's all automated. The programmer generally doesn't directly control the cache, just designing around it if optimization is critical, but it's probably done now and then in embedded systems and might be possible for standardized hardware like consoles? But even if it is possible, in 99.9% of cases the cache... arranger? scheduler? The cache organizer will be automated.

Fake edit: @Look over there explained it better than me with just two sentences and more confidence. I'll definitely be reading that wiki page to up my knowledge!
The aim of cache is to have the information that you need to use in a closer location to the actual compute units, so that latency is minimised. The ideal situation happens where all the data you need next is present in the cache, and the stuff you don't use is not present in the cache. This would theoretically maximise performance, and is the ideal. The cache eviction algorithm needs to approach this optimal situation under two constraints: it can't hold all the things you need in the near future in cache due to size constraints, and even if it could, it generally doesn't know which pieces of data those are. Under those constraints, the best we can do is to keep those pieces of data in cache that are most likely to be used soon.

Which algorithm would do that? An often-used one is LRU (least recently used). We basically assign an age (in number of ticks since its last use, as registered on an internal clock) and evict the oldest item whenever a new piece of data that is not in the cache needs to be loaded. You then place that piece of data in cache and remove the least recently used one. This uses the heuristic that you are most likely to use a piece of data that you used recently (since it's associated with part of the code you are executing). Whenever that piece of data gets reused, you can set the age to 0 again.

You would probably design this in a hierarchical fashion: whenever an L1 cache element is evicted, it goes to L2, and then L3, and then to main memory. There are probably more sophisticated algorithms than vanilla LRU, but each should be a variant of LRU at its core. Edit: I would imagine that more complex algorithms can mix information about the frequency of use into an LRU variant as well, in order to combine as much information as possible into the eviction algorithm.
Let's use myself as an example of a computer. Let's imagine I gathered all this hentai I borrowed from the library at home. When my man cave is full of it and I can't hide it anymore from the others, I have to let go of some of them. And I would obviously get rid of those I have read and that don't excite me anymore.

So, in my mind, I have a constant label attached to each and every book (its name or cover image or the fact if I have read them or not, or my personal appreciation of them) I have in my man cave. If I am smart, I would arrange the magazines in a way that I can find them easily and not just pile them on each other. That way, even if I have a thousand magazines, I have an easy reference to find the ones that don't excite me and bring them back to the library. If I have planned things well in advance I would label each magazine with my star rating. The higher the rating, the less likely I will remove them when my cave is full. The first ones to go are obviously the ones that have the lowest ratings. I would then borrow new books and read them and apply another rating, and so on and so forth.

By this logic, I assume that the CPU stores a little data about each entry in the cache, or a... cache of what is in the cache, in some sense? And this 'cache of a cache' is just a set of labels and memory addresses describing what is in the actual cache?

I think I will continue investigating this topic by myself.
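For what it's worth, that "cache of a cache" intuition is roughly right: each cache line carries a small amount of metadata (a tag derived from the memory address, a valid bit, a dirty bit, some recency bits) that the hardware checks on every access. A hand-wavy Python illustration, with field and function names invented for the example:

from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheLine:
    tag: int        # which block of memory this line currently holds
    data: bytes     # the cached bytes themselves (e.g. 64 bytes per line)
    valid: bool     # does this line hold anything meaningful yet?
    dirty: bool     # written to but not yet flushed back to memory?
    age: int        # recency counter used by the replacement policy

def lookup(cache_set: list, address_tag: int) -> Optional[CacheLine]:
    # Compare the incoming address's tag against every line in the set.
    for line in cache_set:
        if line.valid and line.tag == address_tag:
            return line   # hit: the data is already close to the core
    return None           # miss: go to the next cache level or main memory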
I would definitely prefer a 720p OLED screen over a 1080p one (be it LCD or OLED).

At 7" there is really not much difference between 720p and 1080p (I meant there is but is not very noticeable). If the screen has less pixels, it wouldn't only have better battery, but games could have more graphical effects by rendering everything at 720p instead of 1080.

The power needed to render a game natively at 1080p compared to 720p is kind of big. You need way more power to do it.

I know things do not scale exactly like this, but just to oversimplify: 1080p has more than double the pixels, so you basically need over twice the processing power to render at 1080p instead of 720p.

I would rather they use that power for better graphical fidelity and higher framerates instead of increased resolution.
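For anyone who wants the raw numbers behind the "more than double" point (with the obvious caveat, noted above, that rendering cost doesn't scale perfectly linearly with pixel count):

pixels_720p  = 1280 * 720    # =   921,600
pixels_1080p = 1920 * 1080   # = 2,073,600
pixels_4k    = 3840 * 2160   # = 8,294,400

print(pixels_1080p / pixels_720p)   # 2.25x the pixels of 720p
print(pixels_4k / pixels_720p)      # 9.0x the pixels of 720p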
I agree that a 1080p screen is overkill. However, as Dakhil mentioned, it is the lowest resolution available with variable and high refresh rates, and there is a clear use case for those features. Since 720p is likely to be the resolution target for games running in handheld mode on the succ, it would however mean games not being played at the native resolution of the screen, which generates noticeable artefacts.

That is why I proposed going for a 4K screen instead. It would have all the bells and whistles and allow 720p games to scale cleanly (4K is exactly 9 times 720p, i.e. 3x in each dimension).
I’m not Thraktor 😂


On a different note, I wonder if Nintendo will opt for a 4-8MB SLC that is accessible to both the GPU and the CPU. It would be slower than the higher cache levels, but it would certainly help reduce the need to go out to RAM, increasing the effective bandwidth and the overall efficiency of the chip.

Making the 88-102GB/s a lot more manageable than just, well, the 88-102GB/s we see on paper.

Also worth mentioning that changing the GPU frequency can affect the bandwidth of the cache.

Though I don’t know how this affects the SLC in this case.
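A crude way to picture the benefit: any request that hits the SLC never touches the LPDDR bus, so the DRAM only has to serve the misses. The hit rate below is a placeholder picked for illustration; real figures depend entirely on the workload:

dram_bandwidth_gbs = 102    # the on-paper figure discussed above (GB/s)
slc_hit_rate = 0.4          # placeholder, not a real measurement

# Requests served per second can exceed the raw DRAM bandwidth,
# because the hits are absorbed by the SLC.
effective_gbs = dram_bandwidth_gbs / (1 - slc_hit_rate)
print(round(effective_gbs))  # ~170 GB/s worth of serviced traffic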
:-/

The setup you propose is interesting. I wonder how much of a boost the bandwidth could theoretically get if you pair the bus with a cache that makes sense economically.
 
That is something I didn't test for myself. I compared the Oculus Quest's weight with the Switch's in a former post and assumed that weighing 200g less would make it comfortable for VR. I guess not. So I guess it is not a use case for the 12 SMs.

Thanks for directing me to a diagram. Now that I know what caches are for and what benefit they bring, I can actually read them, and honestly they are a lot simpler to understand than pure text. I've reached the level at which abstract information makes sense to me! Thanks everyone. What you mentioned about cache management and the Wikipedia pages are exactly the kind of stuff I want to get a better grasp of (shared memory, cache coherency, locality of reference). I could never have understood a word of those pages without all of you.


The idea of assigning a rating makes sense, but note that this type of data typically isn't one-and-done the way reading a magazine tends to be. Imagine you're playing BOTW and you see a blue moblin ahead of you. For that moblin, you need to load in the blue textures. Over the next few seconds, you are very likely to need that texture again, even if you temporarily rotate the camera away. Therefore it makes sense to keep the blue moblin textures in cache instead of, say, the black moblin textures you haven't seen for minutes. If you then encounter a snow fox, the system loads its textures into cache and throws out some other data to make room. The evicted data will be the black moblin's, because it was the least recently used. And this makes sense: you're much less likely to see a black moblin when you haven't seen one in minutes than you are to see a blue moblin that was in camera view literal seconds ago. You'd have to search the world for a black moblin, whereas the blue one can be found by simply panning the camera back to where you were just looking. That is the basic idea behind LRU.
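That moblin scenario maps directly onto the OrderedDict sketch from earlier in the thread; here it is replayed as a tiny two-entry texture cache (the capacity and texture names are obviously just for illustration):

from collections import OrderedDict

texture_cache = OrderedDict()
CAPACITY = 2

def touch(texture):
    if texture in texture_cache:
        texture_cache.move_to_end(texture)              # reused: most recent again
        return
    if len(texture_cache) >= CAPACITY:
        evicted, _ = texture_cache.popitem(last=False)  # drop least recently used
        print("evicted:", evicted)
    texture_cache[texture] = "texture data"

touch("black_moblin")   # saw one minutes ago
touch("blue_moblin")    # the one on screen right now
touch("blue_moblin")    # camera pans back: a hit, stays most recent
touch("snow_fox")       # new arrival -> prints "evicted: black_moblin"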
 
I am curious whether this uncovered patent for a newer Dock with an internal swivel might help address concerns about Drake's proposed docked clocks.

Perhaps they're having a harder time dissipating heat from a regular or OLED Dock, such that they needed to redesign it again to accommodate greater exhaust pressure?

Perhaps they were anticipating the poor yields of Samsung 8nm and were ready to shell out (no pun intended) for another dock to accommodate greater airflow, but in the end didn't need it because they switched over to TSMC 7nm?
The patent application is really not relevant at all, no. First off, it was originally filed two years ago, so it can't be related to current or recent testing they're doing on Drake. Second, the only new set of air vents mentioned is on the swivel block itself, and this is explicitly stated to be because the construction of the saucer swivel block seals off those connectors (including the one to the AC adapter), whereas that area was internally open before.

Third, if they were ever planning to use this product, the application would not have been published prior to being granted; almost all of the patents that describe products Nintendo actually sells get non-publication requests, so that they remain secret until the products are officially revealed.
 

Thanks, I wasn't aware of that. It's from the Mellanox side of the company, though, so the decision to use N7 might have predated the acquisition being finalised.

Just out of curiosity, if you know, do these costs per wafer literally just include the raw materials associated with the wafer and that particular process designated for that wafer, or does it also roll in all of the other costs associated with that production line, i.e. salaries, maintenance, etc.?

I would also imagine that the cost associated with each production line would vary significantly depending on the actual compositional makeup of the chips involved, or how many discrete layers are formed and with how much raw material. What you've cited of course are estimates but it does seem like the true cost could be quite different depending on a ton of factors.


As it relates to Nintendo here, they obviously want the best deal they can get for these chips, but on the other hand they didn't have to put 12 SMs into this thing. It's possible they're willing to spend a little extra on design, and possibly even production, to better meet whatever needs they have for this product.

They're presented in the video as the prices TSMC charge customers per wafer, rather than TSMC's own production costs, so they should include all costs associated with manufacturing (salaries, maintenance, amortised facility and machinery costs, etc.) plus whatever margin TSMC makes. Sophie Wilson, who provides the numbers, works for Broadcom and would be in a very good position to know what TSMC charge, and she actually states that the numbers are "from TSMC", which would imply they're actual per-wafer prices quoted by TSMC, not estimates.

That said, the exact numbers aren't that important, as they're a few years old, and don't include additional costs like packaging, testing and Nvidia's margins. I'm more interested in the relative prices between nodes, which I don't have any reason to believe will have changed much. I thought about normalising all the wafer prices in the post, but that would have just made everything more confusing, with ratios of ratios and all that.
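For anyone who wants to redo the back-of-the-envelope comparison themselves, the shape of the calculation is roughly the one below. Every input here is a placeholder rather than a figure from the talk, and the yield model is the usual crude defect-density approximation, so treat the outputs as illustrative only:

import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Standard approximation: wafer area over die area, minus edge losses.
    radius = wafer_diameter_mm / 2
    return math.floor(math.pi * radius**2 / die_area_mm2
                      - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_rate(die_area_mm2, defects_per_cm2):
    # Simple Poisson defect model; real foundry yield curves are more involved.
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100)

def cost_per_good_die(wafer_price, die_area_mm2, defects_per_cm2=0.1):
    good_dies = dies_per_wafer(die_area_mm2) * yield_rate(die_area_mm2, defects_per_cm2)
    return wafer_price / good_dies

# Placeholder wafer prices and die sizes -- swap in whichever numbers you believe.
print(round(cost_per_good_die(wafer_price=6000,  die_area_mm2=200), 2))  # "cheaper node, bigger die"
print(round(cost_per_good_die(wafer_price=17000, die_area_mm2=100), 2))  # "pricier node, smaller die"

The interesting part is exactly the point made above: a denser node shrinks the die and improves yield at the same time, so a higher wafer price doesn't automatically mean a higher cost per chip.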

Assuming TSMC has been chosen by Nintendo and Nvidia as the foundry for the fabrication of Drake, I think the choice between TSMC's N6 process node and TSMC's N5 process node comes down to which node has more additional capacity available for Drake, more so than the cost of the process node.

Although Nvidia did pay a premium to TSMC for securing additional capacity for TSMC's N5 process node, I imagine the capacity for TSMC's N5 process node for Nvidia is going to be extremely tight, considering how ridiculously high demand for TSMC's N5 process node is (e.g. Dimensity 8000 and Dimensity 8100, Zen 4, Navi 31 and 32, etc.), and most of Nvidia's products in 2022 and beyond seem to be fabricated using TSMC's N5 process node (e.g. Hopper, Ada, etc.).

I feel confident that Grace is going to be fabricated using TSMC's N5 process node, since Grace is a datacentre CPU and TSMC has so far been Nvidia's default choice for fabricating datacentre chips. And I think there's a good chance Atlan will be fabricated using TSMC's N5 process node, considering Atlan's sampling in 2023, and Nvidia's Arm-based SoCs are generally fabricated on the same process node as the consumer and/or professional and/or datacentre GPUs with the same GPU architecture. Orin seems to be no exception. (Of course, customised variants of Nvidia's Arm-based SoCs, such as Drake, are a different story.)

I honestly don't think capacity is that big of an issue. TSMC are rapidly expanding 5nm capacity at the moment, and going by their projections from June last year, by 2023 their 5nm capacity (in wafer output) will match or exceed their 7nm capacity. Considering the increased density on 5nm, that would mean Drake would take up a lower proportion of total output on 5nm than 7nm (or any other manufacturing node in production, for that matter) by some margin. Meanwhile TSMC are starting volume production of N3 in the second half of this year, so bleeding-edge customers (read Apple, who have been by far their largest 5nm customer) will be migrating away from 5nm at the same time as Drake production would be ramping up.

Regarding Nvidia's large pre-payments for 5nm allocation, I would argue that increases the likelihood of them using the process for Nintendo's chip, if anything. If Drake was being designed for N5, they would have included it in the arithmetic of how much capacity they need and have already secured it. Given Nvidia's size, I wouldn't be surprised if they've paid for as much as 20-30% of TSMC's total 5nm capacity, and one of the benefits of having to pre-pay for allocation is that they're now guaranteed the wafer output they've paid for. Their sales to Nintendo are probably the most predictable part of their business at the moment, so managing guaranteed supply with near guaranteed demand shouldn't be too difficult. I should also note that there's no claim in the article you linked (or elsewhere that I've seen) that Nvidia are paying a "premium", or paying more per wafer for TSMC's 5nm process than any other customer, only that they have to pre-pay for their allocation, unlike AMD or Apple, who would pay much closer to delivery.

You're right that Nvidia's SoCs are generally fabbed on the same process as their GPU partners, but that actually wasn't the case with TX1, which was manufactured on a more advanced 20nm node than the 28nm node used by Maxwell desktop GPUs. I think the issue with trying to identify trends in previous Nvidia SoCs is that they've never designed a semi-custom SoC for an individual customer like this before, so there's no guarantee that it will follow prior patterns.

Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with upscaling techniques becoming more prevalent and improving? I personally would prefer they stick with a 720p screen and put the power and cost savings into other parts of the system, but I don't know if the savings are worth giving up the better picture quality, and 1080p would be a good marketing bullet point to show advancement over the current OLED screen, especially if this system is marketed as a next-gen Switch.

I think a 1080p screen is likely, but not really because of the specs, or even because I think it's necessary from an image quality point of view; it's because they're targeting 4K output in docked mode. If they stick with a 720p handheld screen, then developers will have to deal with a 9x difference in resolution between docked and handheld (compared to 2.25x now). The difference in GPU performance between the two modes won't be anywhere near 9x, so it would be a pain for developers to have to deal with vastly different performance per pixel between the two modes.
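Putting very rough numbers on that performance-per-pixel point (the 2x docked-to-handheld GPU ratio below is a hypothetical placeholder, not a leaked clock):

pixels_handheld_720p = 1280 * 720
pixels_docked_4k = 3840 * 2160

docked_vs_handheld_gpu = 2.0   # hypothetical throughput ratio, for illustration only

resolution_ratio = pixels_docked_4k / pixels_handheld_720p        # 9.0
perf_per_pixel_ratio = docked_vs_handheld_gpu / resolution_ratio  # ~0.22
print(perf_per_pixel_ratio)  # docked would have roughly 4.5x less GPU time per pixel

With a 1080p handheld screen the resolution gap drops to 4x, which is much easier to budget for.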
 
Do you guys think a 1080p screen is much more likely now that the specs may be much better than expected, especially with upscaling techniques becoming more prevalent and improving? I personally would prefer they stick with a 720p screen and put the power and cost savings into other parts of the system, but I don't know if the savings are worth giving up the better picture quality, and 1080p would be a good marketing bullet point to show advancement over the current OLED screen, especially if this system is marketed as a next-gen Switch.
Probably depends on what the actual node is. 720p could make sense for a larger node, especially if they're disabling SMs. OTOH, if it's 5nm, I don't think it makes sense to go as low as 720p; there'd be way too much overhead in portable mode compared to docked.
 
Probably depends on what the actual node is. 720p could make sense for a larger node, especially if they're disabling SMs. OTOH, if it's 5nm, I don't think it makes sense to go as low as 720p; there'd be way too much overhead in portable mode compared to docked.
How can you have too much overhead? What harm does it do?
 