StarTopic Future Nintendo Hardware & Technology Speculation |ST|

kvetcha

radicalized by deerhoof
Pronouns
he/him
If it were just creating in-between frames from two finished frames, it would be little different than a higher quality version of what most TVs already offer. It would also necessarily add image latency, since you'd need to render frame A to create in-between frame A-1, and hold back frame A another 1/60 or 1/120 of a second or whatever the case may be.
I believe their own documentation has confirmed that they are generating tweens.

Or at least that they require two frames to begin frame generation, maybe?
 

Teal'c

Cappy
Don’t get your hopes up.
more than a hope was a curiosity 😃

I was thinking about how there was talk of custom chips when Switch was introduced (which was later denied if I remember correctly) and how Nvidia had helped Nintendo with APIs.
I've been going through the recent Linux drops for Tegra, where they've added support for Drake. Drake's OFA driver - the OFA is what DLSS 3.0 uses for its special sauce - is the same as Orin's, unlike some other parts of T239, where it overrides the Orin driver.

So as close to 0% as we can imagine at this point.
I see, thank you 🙂
 
Linux updates:
Interesting Timing Bits: Drake Linux was being developed on software simulation in January of last year. In April, there was a set of Drake-related updates that seem to indicate that actual engineering samples were being produced. In July, the code was branched to consolidate Orin changes for public release so that Drake work could continue. Any further references to Drake in the various public repos since are entirely updates to places where Drake shares Orin's driver, but needs a Drake-specific exception (like the cpu-freq updates). This is likely so that there is a One True Source for Orin drivers and Drake dev can simply pull it from upstream, rather than maintaining multiple forks at Nvidia that are constantly cross merging.

By April, you mean April of this year, not last?
 

LinkURL

Rattata
Pronouns
He/him
Would it be possible to use some DLSS-like technology (developed by NVidia or even NERD) to "decompress" textures? Something much more efficient than we have today.
It would be interesting to get around the problems of bandwidth, and internal memory of the console and cartridges.
 

JoshuaJSlone

Bob-omb
Pronouns
he / him
I believe their own documentation has confirmed that they are generating tweens.

Or at least that they require two frames to begin frame generation, maybe?
I take it to mean that they're using data from frame A and B to create B+1. If the machinery can see how things changed from A to B, it can continue those same motions and make a 150% version while also working on frame C.
 

oldpuck

Like Like
Pronouns
he/him/his
I believe their own documentation has confirmed that they are generating tweens.

Or at least that they require two frames to begin frame generation, maybe?
Unless I am vastly misunderstanding, they just need two frames to seed the process. They're absolutely generating future frames - specifically they call out generating the next frame entirely on the GPU while the CPU is totally bottlenecked.

The target is extreme high frame rate games. 60FPS+ starts to exceed human reaction time for most folks. The speed of your neurons from your eye to your hand is slower than a frame rate update, but you can absolutely perceive the smoothness visually. So if your game engine is running at 60FPS you're probably fine for human reaction, but if you're getting 120 frames in smoothness, that's the win.

As folks point out, this implies perceived latency in moving the camera rapidly, but NVidia claims some integration with their latency tools so I'm waiting for a breakdown before speculating heavily.
 

ILikeFeet

Pikmin
Would it be possible to use some DLSS-like technology (developed by NVidia or even NERD) to "decompress" textures? Something much more efficient than we have today.
It would be interesting to get around the problems of bandwidth, and internal memory of the console and cartridges.
I haven't heard anything about using ai to decompress textures. but Nvidia has tools for using the gpu to decompress assets
 

oldpuck

Like Like
Pronouns
he/him/his
I take it to mean that they're using data from frame A and B to create B+1. If the machinery can see how things changed from A to B, it can continue those same motions and make a 150% version while also working on frame C.
You need two frames of data to run the Optical Flow Analyzer to detect motion vectors, I believe
 

kvetcha

radicalized by deerhoof
Pronouns
he/him
I take it to mean that they're using data from frame A and B to create B+1. If the machinery can see how things changed from A to B, it can continue those same motions and make a 150% version while also working on frame C.
Unless I am vastly misunderstanding, they just need two frames to seed the process. They're absolutely generating future frames - specifically they call out generating the next frame entirely on the GPU while the CPU is totally bottlenecked.

The target is extreme high frame rate games. 60FPS+ starts to exceed human reaction time for most folks. The speed of your neurons from your eye to your hand is slower than a frame rate update, but you can absolutely perceive the smoothness visually. So if your game engine is running at 60FPS you're probably fine for human reaction, but if you're getting 120 frames in smoothness, that's the win.

As folks point out, this implies perceived latency in moving the camera rapidly, but NVidia claims some integration with their latency tools so I'm waiting for a breakdown before speculating heavily.
I understood it this way, but their own visual aid shows alternating upscaled and generated frames, so I think they've muddied the waters a bit.
 

ReddDreadtheLead

Chain Chomp
You are correct, of course, but I think we're stuck with that no matter what. Ampere-B has been on Samsung 8nm the entire time, Orin was built on 8nm, and we know T239 shares design with T234. There is likely not a cost advantage porting to TSMC now over porting to TSMC later, because it's going to be a port either way, as opposed to being built fresh on TSMC.

Of course, Ampere-A was already on TSMC 7nm, and ARM uses custom Verilog libraries to support a wide range of nodes out of the box for their CPUs. At least some of the groundwork was laid well before Drake dev started.
OK let me phrase this better: Nintendo funds the R&D for Nvidia to design a customized chip just for them, and they use their AI supercomputer to basically make a chip.


Scenario one: Nintendo spends the R&D budget to do the chip at Samsung, and then spends the R&D budget again to have it rebuilt at a completely different foundry, such as TSMC, and that is your Lite/battery saver model.


Scenario two: Nintendo skips the Samsung foundry and pays the larger upfront cost once rather than paying twice, going straight to TSMC’s 7nm process, since the IP already exists there and Nvidia already has the ARM engineers for implementation, and it’s supposed to be a Tegra as always. The design doesn't depend on any of Samsung’s IP, and it has a cheaper upgrade path in the long run for the next refresh. That is your Lite/battery saver model.



[Image: chip design and manufacturing cost under different process nodes (data source: IBS)]



I’m taking a pause on this because it seems a bit far-fetched that Nintendo would spend hundreds of millions to derive a chip from an existing one for their own purposes, sell it to you in a new video game console, and then spend several hundred million more to rebuild the entire chip at a different foundry where other Nvidia products are made (I’m pretty sure they’d also incorporate new RAM for better power savings), only to sell it to you as a revision and as a mini model.

Basically a rehash of what they did with the Nintendo Switch, but with a new piece of silicon inside. I find that hard to believe.

The only way I could believe it, and this is just me, is if the 8nm one is treated as a Switch Pro while they fund, in the background, a chip on a newer process that goes into the device taking the position of the actual Switch successor platform, and they introduce a Switch and a Switch Lite 2 on day one.

Basically do what they did with the Nintendo GameCube and Nintendo Wii, and treat that as a new generation, even though it’s the exact same hardware, just one is faster than the other. This way they can recoup the cost longer-term and have a new family of systems that replaces the old Switch and starts a new era of the Switch family of systems later on.



And the 8N version is solely a stopgap, over time it’ll be a smoother smartphone model.


All using the “same” chip but different targets.
 

mjayer

Piranha Plant
I'm just sitting here so stoked that it looks like the BOTTOM FLOOR of what we're getting is a NINTENDO CONSOLE that is hitting above a PS4 PRO... and it's also portable.

I. Am. Pleased.
 

oldpuck

Like Like
Pronouns
he/him/his
engineering samples being out in April 2021 is interesting. either that means devs were working on simulated kits or they were working on analogues. I would bet on there being orin-based kits before then
We know NVN2 was running on Orin at some point, yes? And that Orin doesn't have RT hardware? This tracks with devkits having "RT problems" which we had heard in the rumor mill before
 

JoshuaJSlone

Bob-omb
Pronouns
he / him
Would it be possible to use some DLSS-like technology (developed by NVidia or even NERD) to "decompress" textures? Something much more efficient than we have today.
It would be interesting to get around the problems of bandwidth, and internal memory of the console and cartridges.
DLSS1 is just plain image upscaling, and could be used on textures I suppose. Lots of people make mods for PC games where they put the textures through some form of machine upscaling for cheap improved results. But in that case it's still being treated like a regular texture file in the end, taking up more space and bandwidth. Making it instead an alternative to bilinear filtering or whatever for final use of textures... how feasible or costly that is is beyond my layman's knowledge.
I understood it this way, but their own visual aid shows alternating upscaled and generated frames, so I think they've muddied the waters a bit.
Sure.
Frame B
Frame B+1 (created from A and B)
Frame C
Frame C+1 (created from B and C)
Frame D
Frame D+1 (created from C and D)
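
To put that ordering in code form, here's a rough sketch. It only illustrates the extrapolation reading described in this post, not Nvidia's confirmed pipeline, and the function name `display_order` and the "+1" labels are made up for the example.

```python
def display_order(rendered):
    """Return the frames shown on screen under the extrapolation reading.

    `rendered` is the list of frames the game engine actually produced.
    The first frame only seeds the process; every later rendered frame is
    followed by one generated frame built from it and its predecessor.
    """
    shown = []
    for prev, cur in zip(rendered, rendered[1:]):
        shown.append(cur)                                          # real frame
        shown.append(f"{cur}+1 (created from {prev} and {cur})")   # generated
    return shown


for line in display_order(["A", "B", "C", "D"]):
    print(line)
```

With a single rendered frame the list comes back empty, which matches the idea that two frames are needed before generation can start.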
 

Skittzo

Yoshi
Founder
Pronouns
He/Him
FamiGold
13
Would it be possible to use some DLSS-like technology (developed by NVidia or even NERD) to "decompress" textures? Something much more efficient than we have today.
It would be interesting to get around the problems of bandwidth, and internal memory of the console and cartridges.
As I understand it Nvidia does have AI tools that upres existing textures, though I'm not sure that type of thing can be done on the fly which would be necessary for decompressing textures like you seem to be suggesting.
 

Thraktor

Tektite
Pronouns
He/Him
I just had a play around with the Jetson power tool myself, and I'm getting quite different estimates than were posted before.

For the GPU, my methodology was to turn off the DVA/PVA, etc., set the CPU to a fixed value, and compare the power consumption with the GPU turned off entirely to the power consumption at each GPU clock speed for 4 TPC (8 SM) and 8 TPC (16 SM) configurations. Then for each clock, I took the mid-point of the 4 TPC and 8 TPC power consumption, and subtracted the power with the GPU turned off. I set the GPU load level to high for all these tests. I feel this is a reasonable way to isolate the GPU power consumption and estimate it in a 12 SM configuration.
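
For anyone who wants to repeat this, the arithmetic boils down to the small function below. It's a sketch of the method as described, and the three readings passed in at the end are placeholders, not values taken from the Jetson power tool.

```python
def estimate_12sm_gpu_power(p_4tpc_w, p_8tpc_w, p_gpu_off_w):
    """Estimate 12 SM GPU power from whole-module readings.

    p_4tpc_w    -- module power with 4 TPCs (8 SMs) enabled, in watts
    p_8tpc_w    -- module power with 8 TPCs (16 SMs) enabled, in watts
    p_gpu_off_w -- module power with the GPU turned off, in watts
    """
    midpoint = (p_4tpc_w + p_8tpc_w) / 2   # stands in for 6 TPCs (12 SMs)
    return midpoint - p_gpu_off_w          # remove everything that is not GPU


# Placeholder readings, NOT values from the Jetson power tool.
print(estimate_12sm_gpu_power(10.0, 14.0, 5.0))
```

Taking the mid-point assumes power scales roughly linearly with the number of active TPCs between the two configurations.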

The power consumption figures I got for the GPU with 12 SMs are:

420.75MHz - 5.70W
522.75MHz - 6.80W
624.75MHz - 8.65W
726.75MHz - 10.85W
828.75MHz - 13.20W
930.75MHz - 16.35W
1032.75MHz - 19.90W
1236.75MHz - 29.35W
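
To see how efficiency moves across that range, the figures can be turned into peak GFLOPS per watt. This is a rough sketch: the lane count (12 SMs x 128 FP32 lanes, 2 FLOPs per lane per clock) is an assumption based on desktop Ampere, and the helper names are made up for the example.

```python
# (clock in MHz, estimated 12 SM GPU power in W), copied from the list above.
GPU_POINTS = [
    (420.75, 5.70), (522.75, 6.80), (624.75, 8.65), (726.75, 10.85),
    (828.75, 13.20), (930.75, 16.35), (1032.75, 19.90), (1236.75, 29.35),
]

# Assumption: 12 SMs x 128 FP32 lanes, 2 FLOPs per lane per clock (one FMA).
FP32_LANES = 12 * 128


def gflops(clock_mhz):
    """Peak FP32 throughput in GFLOPS at the given clock."""
    return FP32_LANES * 2 * clock_mhz / 1000


def gflops_per_watt(clock_mhz, watts):
    """Peak FP32 GFLOPS delivered per watt of estimated GPU power."""
    return gflops(clock_mhz) / watts


for clock, watts in GPU_POINTS:
    print(f"{clock:8.2f} MHz  {gflops(clock):7.1f} GFLOPS  "
          f"{gflops_per_watt(clock, watts):6.1f} GFLOPS/W")
```

On these numbers, efficiency is best near the bottom of the range and falls off steadily as the clock rises, which is the usual shape for a voltage/frequency curve.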

These probably overestimate the power consumption of an 8nm Drake GPU by a bit, as it uses smaller tensor cores, has fewer ROPs, and should have some small power savings by using only a single GPC rather than two. Still, I wouldn't expect the power consumption to be significantly lower than this. A few months ago I made some rough estimates of power consumption of Ampere on 8nm by measuring my RTX 3070 and scaling from there, and I got a figure of 16.1W for 12 SMs at 1155MHz. This was a very crude estimate (just take the reported power consumption from GPU-Z and divide by the number of SMs), and seems to come in a bit lower than we see above. I'd err on the side of trusting Nvidia's official tool for estimating Orin power consumption over my rough estimate based on a GPU 4x as big, though.

I believe the original Switch's GPU consumed somewhere around 3W in portable mode, which means we're already pushing things at the base clock of 420MHz. Even if Drake comes down to about 5W at 420MHz from the smaller tensor cores and other optimisations, that's still a big increase. It's impossible to say how power consumption behaves at lower clocks, and maybe they could squeeze it in at the same 384MHz clock as the original Switch, but my guess is if Nvidia doesn't clock below 420MHz in any configuration, it's because it's not efficient to do so. I'm still of the opinion that running a full 12 SM GPU on 8nm isn't feasible in handheld mode for a device the same size as the current Switch, which means either disabling SMs in handheld mode, or a different manufacturing process.

On the docked side of things, they're more limited by cooling than power consumption, and a 768MHz clock seems reasonable with a similar cooling set-up to the original Switch, if running a bit hotter. They could possibly push up higher to 900MHz-1GHz or so if they really wanted to, but it's more a question of Nintendo's tolerance for a more expensive cooling setup and/or more fan noise than a technical limitation.

For the CPU, as it's not possible to turn the CPU off entirely, I simply took the difference between the 4 core and 12 core CPU configuration for each clock, again with the CPU load level set to high. As a bit of a sense-check, I also checked the reported power consumption for each clock on an 8 core configuration, and validated that the difference in power draw between 4 and 8 cores was approximately the same as between 8 and 12 cores. This is true typically to within 0.1W, so I think it's reasonable to assume the power cost is linear with respect to the number of clusters activated (ie if going from 1 to 2 clusters adds the same amount of power consumption as moving from 2 to 3, then we can infer that going from 0 to 1 cluster of cores costs a similar amount).
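
The sense-check can be written down as follows. Again this is only a sketch of the reasoning: the function names are made up, and the readings used at the bottom are placeholders rather than numbers from the tool.

```python
def per_cluster_cost(p_4core_w, p_8core_w, p_12core_w, tolerance_w=0.1):
    """Return (cost of 4->8 cores, cost of 8->12 cores, looks_linear)."""
    step_1 = p_8core_w - p_4core_w
    step_2 = p_12core_w - p_8core_w
    return step_1, step_2, abs(step_1 - step_2) <= tolerance_w


def eight_core_estimate(p_4core_w, p_12core_w):
    """Power attributed to two clusters: the 12 core minus 4 core reading."""
    return p_12core_w - p_4core_w


# Placeholder module readings in watts, NOT values from the power tool.
print(per_cluster_cost(6.0, 7.5, 9.0))
print(eight_core_estimate(6.0, 9.0))
```

If the two steps match within the tolerance, the 12 core minus 4 core difference is a fair stand-in for what 8 cores cost on their own.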

The figures I got for an 8 core CPU are as follows:

1113.6MHz - 2.2W
1267.2MHz - 2.5W
1497.6MHz - 3.1W
1651.2MHz - 3.8W
1728.0MHz - 4.1W
1881.6MHz - 4.9W
2035.2MHz - 5.8W
2188.8MHz - 7.1W

CPU power consumption on Drake might be a bit lower due to using a single cluster and potentially less cache, but I would expect only pretty small differences here. Compared to the GPU, though, I don't think there's as big an issue. The CPU on the original Switch consumed a bit under 2W, so Nintendo could clock a bit higher than the old 1GHz. While ~1.1GHz or so might not sound like much, an 8 core A78 clocked at 1.1GHz (with 7 cores for games) is still a huge increase over 4 A57 cores (with 3 for games) at 1GHz. If they push the power draw of the CPU up a bit (which I feel is more likely than increasing handheld GPU power draw), then at 3W they could get up to around 1.5GHz, which would put them in a pretty nice spot.
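
To read clocks off the table for a given power budget, a plain linear interpolation between neighbouring points is enough for a rough idea. The function name is made up, and interpolating linearly between data points is my simplification, not something the power tool does.

```python
# (clock in MHz, estimated 8 core CPU power in W), copied from the list above.
CPU_POINTS = [
    (1113.6, 2.2), (1267.2, 2.5), (1497.6, 3.1), (1651.2, 3.8),
    (1728.0, 4.1), (1881.6, 4.9), (2035.2, 5.8), (2188.8, 7.1),
]


def clock_at_budget(points, budget_w):
    """Linearly interpolate the clock (MHz) that fits a power budget (W).

    Returns None when the budget is outside the measured range, since the
    data says nothing about behaviour there.
    """
    for (c0, w0), (c1, w1) in zip(points, points[1:]):
        if w0 <= budget_w <= w1:
            return c0 + (budget_w - w0) / (w1 - w0) * (c1 - c0)
    return None


print(clock_at_budget(CPU_POINTS, 3.0))   # a little under 1.5 GHz
print(clock_at_budget(CPU_POINTS, 2.0))   # None: below the lowest data point
```

A 2W budget falls below the lowest data point, so the sketch refuses to guess rather than extrapolate.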

The CPU side of things is a bit better than I expected, to be honest. You're obviously not going to clock 8 A78s up to 2GHz+ on 8nm in Switch's form-factor, but they don't have to clock them down to oblivion to get them to work. It would still be comfortably more capable than PS4/XBO's CPUs in a portable device while on an old manufacturing process, which isn't a bad position to be in.
 
ES6 being Xbox exclusive is more likely going to be a blocker than Switch 2 performance I’m guessing. Otherwise 100% with you. I’ve been thinking for quite some time that around PS4 Pro quality on Switch is my dream hardware for at least another 5-10 years.
Ever since the hack, my internal goalpost for a 15 watt device that I'd be very happy with was pretty much 'take a PS4 Pro, double the CPU power, and kick the RAM amount up a notch'. Approximately a midpoint between PS4 and PS5. Shouldn't be hard to clear at this rate!

Edit: well, I was in the midst of typing this post out when Thraktor posted his. But that merely resets my internal dream CPU target to translating to TSMC N5, or somewhat pushed N7/N6. Still doable!

I'd really be wary to assume they'll increase the battery size. I think it's safer to stick with the ~4-5W range Erista had for estimations.

This Switch should see improvements in screen power efficiency at the very least. I'm not sure about RAM or storage efficiency.


Not quite true, look at the post I quoted.
RAM efficiency, in the energy per bit sense, will improve from a combination of nodes and moving to LPDDR5. Buuut, it's highly unlikely that the energy efficiency has improved to the point of compensating for up to four times the amount of bits being moved. I think that with the amounts Thraktor estimated before, at maximum usage, we're looking at somewhere around +1 watt being used, give or take some tenths.
Storage energy efficiency... if it gets upgraded from eMMC to eUFS, then under the same operating conditions we should be seeing a total power usage bump of maybe 2 to 3 deciwatts (that's deci, or tenths of a watt to be clear) in return for multiple times the sequential reading speed, multiple times the sequential writing speed, random read IOPs getting kicked up from high 4 digits to mid 5 digits-low 6 digits, and random write IOPs going from ... 3 or 4 digits to the 5 digits era. (short version: please, I want eUFS :whistle:)
 

oldpuck

Like Like
Pronouns
he/him/his
By April, you mean April of this year, not last?
April of this year. To be clear, the indication is that the folks developing the Linux drivers had access to hardware at this time. It is unclear where in the overall pipeline Linux driver development sat.

In January, Linux driver devs were running T239 in Synopsys Virtualizer, a tool for running simulated hardware for exactly this purpose. T239 was reporting as T239 instead of "T23x" in some places, which meant the Orin drivers which should work on Drake were not correctly being detected by the kernel and the system wasn't booting. This is very early days driver development.

There are no references to the VDK after February in the commit logs, and in April commits start fixing timing bugs with the PCIe bus, which implies to me we're on hardware now.
 

ILikeFeet

Pikmin
We know NVN2 was running on Orin at some point, yes? And that Orin doesn't have RT hardware? This tracks with devkits having "RT problems" which we had heard in the rumor mill before
Orin does have RT cores, but Nvidia doesn't advertise them. my assumption with the RT problem is that the games aren't too optimized yet. maybe devs were using similar settings as PC/consoles, maybe the SDK isn't running as well as it could, maybe this was before dlss. there's too much unknown to make a conclusion other than what they did test ran through battery
 

arsen

Rattata
Falcon/TSEC: FAst Logic CONtroller, used in lots of things. Drake's seems to be different from Orin. A Falcon is used in TSEC, the Tegra Security Coprocessor in the X1, which accelerates cryptography and is part of the secure OS boot process that prevents the Switch from being jailbroken
I thought Falcon was replaced by Peregrine since Ampere?
 

Mr Doggo

I will uh uhmmmmmuh... hie :)
DLSS really makes me happy. For me the main problem with heavy third party games isn't 30fps, as most people here would say; it's resolution.
 

Stinky Horse

Shriekbat
@Thraktor those seem MUCH more in line with the Orin spec figures, and desktop Ampere in general. If they could really match the other figures, we might be seeing a whole new range of 8NM GPUs for laptops etc using whatever witchcraft that would have been.
 

ReddDreadtheLead

Chain Chomp
What would the draw be with 4 SMs then? Compared to the 8 and 16, that is.

This can probably be charted on a graph and scaled appropriately to see what it should be about for the in-between
 

ILikeFeet

Pikmin
@Thraktor those seem MUCH more in line with the Orin spec figures, and desktop Ampere in general. If they could really match the other figures, we might be seeing a whole new range of 8NM GPUs for laptops etc using whatever witchcraft that would have been.
the problem with going even lower is that Nvidia faces stiff competition from AMD and Intel. I don't think the MX570 is that widely adopted
 

Anatole

Cappy
Pronouns
He/Him/His
I read the blog post again, and I’m waffling a bit on what I’ve said before on DLSS 3. To me, there are two sections in the blog post that don’t fit together.

First:
The DLSS Frame Generation convolutional autoencoder takes 4 inputs – current and prior game frames, an optical flow field generated by Ada’s Optical Flow Accelerator, and game engine data such as motion vectors and depth.
This sounds like extrapolation to me. You render the current frame N using the super resolution network, calculate the optical flow field between frame N and N-2, and pass that to the neural network, which internally predicts how that flow field will evolve in the next frame.

It could also be frame N and frame N-1, but I am assuming that we only use non-generated frames for optical flow.

I also assume that we are only calculating one optical flow field. You could calculate several fields pairwise, which gives you a notion of acceleration like I was describing a couple of posts ago, but on closer inspection, the blog post seems pretty specific about using “an optical flow field” in the singular.

Second:
For each pixel, the DLSS Frame Generation AI network decides how to use information from the game motion vectors, the optical flow field, and the sequential game frames to create intermediate frames. By using both engine motion vectors and optical flow to track motion, the DLSS Frame Generation network is able to accurately reconstruct both geometry and effects, as seen in the picture below.

This paragraph sounds like interpolation. In that loop, you would render frame N+2 using super resolution, calculate the optical flow field between frame N and N+2, and pass that to the neural network.

The disadvantage of this method is that you are rendering the frames in advance, but the tradeoff is that you no longer need to have any notion of acceleration, since you have the exact translations at each point in the optical flow field. One optical flow field is sufficient, which fits the singular description. The neural network would only have to estimate what fraction of that translation occurs between frame N and N+1 versus between N+1 and N+2.
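
A toy one-dimensional version of the two readings, to make the tradeoff concrete. Everything here is hypothetical: the `position` function is just a made-up decelerating object so there is a ground truth to compare against, and the real network works on per-pixel flow fields, not a single number.

```python
def extrapolate(prev_pos, cur_pos, fraction=0.5):
    """Predict the in-between position by continuing the last motion.

    prev_pos and cur_pos are positions in frames N-2 and N; the flow between
    them spans two frame slots, so half of it is applied to reach N+1.
    """
    return cur_pos + fraction * (cur_pos - prev_pos)


def interpolate(cur_pos, next_pos, fraction=0.5):
    """Place the in-between position on the path from frame N to N+2."""
    return cur_pos + fraction * (next_pos - cur_pos)


def position(t):
    # Hypothetical decelerating object, used only to have a ground truth.
    return 10 * t - t * t


n_minus_2, n, n_plus_1, n_plus_2 = (position(t) for t in (0, 2, 3, 4))
print("true N+1:     ", n_plus_1)
print("extrapolated: ", extrapolate(n_minus_2, n))
print("interpolated: ", interpolate(n, n_plus_2))
```

In this toy case the extrapolated guess overshoots because it has no notion of acceleration, while the interpolated one lands closer but only becomes available after frame N+2 has been rendered.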

At this point, idk. This all hinges on how Nvidia is defining the “current frame;” if all they mean is “the frame being rendered,” then the distinction is a moot point, and it’s frame interpolation. I’m sure there will be clarification when the press embargo lifts.
 

mjayer

Piranha Plant
Okay ya, thanks for raining on our parade I guess. You know, it's still not too shabby at the end of the day with these numbers. Only disappointed a little! Just a little. So based on your updated information I changed the estimate for the Samsung options below. Any possibility you could help get the TSMC nodes more accurately estimated as well?

ESTIMATED DRAKE DOCKED FLOPS BY NODE:
8 nm Samsung = 2.4 TF
5 nm Samsung = 3 TF
7 nm TSMC = 3.5 TF
6 nm TSMC = 3.8 TF
5 nm TSMC = 4.1 TF
4 nm TSMC = 4.5 TF
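
For a rough sense of what GPU clocks those figures imply, here's a quick sketch. It assumes the 12 SM count discussed in this thread and Ampere's 128 FP32 lanes per SM, each doing one FMA (2 FLOPs) per clock. The function names are my own, and the clocks are just back-solved from the list above, not leaked or measured.

```python
# Back-of-envelope FP32 throughput for an Ampere-style GPU.
# Assumptions: 12 SMs (as discussed in the thread), 128 FP32 lanes per SM,
# 2 FLOPs per lane per clock (one fused multiply-add).
SMS = 12
FP32_LANES_PER_SM = 128
FLOPS_PER_LANE_PER_CLOCK = 2


def tflops(clock_mhz, sms=SMS):
    """Peak FP32 TFLOPS at a given GPU clock in MHz."""
    per_clock = sms * FP32_LANES_PER_SM * FLOPS_PER_LANE_PER_CLOCK
    return per_clock * clock_mhz * 1e6 / 1e12


def clock_mhz_for(target_tflops, sms=SMS):
    """GPU clock in MHz needed to reach a peak FP32 TFLOPS target."""
    per_clock = sms * FP32_LANES_PER_SM * FLOPS_PER_LANE_PER_CLOCK
    return target_tflops * 1e12 / per_clock / 1e6


# The per-node docked estimates from the list above.
estimates = {
    "8 nm Samsung": 2.4,
    "5 nm Samsung": 3.0,
    "7 nm TSMC": 3.5,
    "6 nm TSMC": 3.8,
    "5 nm TSMC": 4.1,
    "4 nm TSMC": 4.5,
}

for node, tf in estimates.items():
    print(f"{node}: {tf} TF -> ~{clock_mhz_for(tf):.0f} MHz")

print(f"768 MHz -> {tflops(768):.2f} TF")
```

On those assumptions 768MHz works out to roughly 2.36 TF, which is where the 2.4 TF figure for 8nm sits, and 4.5 TF would need a clock a little under 1.5GHz.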
 

RennanNT

Octorok
These probably overestimate the power consumption of an 8nm Drake GPU by a bit, as it uses smaller tensor cores, has fewer ROPs, and should have some small power savings by using only a single GPC rather than two.
Is it possible to test with 2 TPCs? If so, would we get a closer approximation of 12 SMs in a single GPC if we add the difference between 4 and 8 SMs to the 8 SM results?
 

Skittzo

Yoshi
Founder
Pronouns
He/Him
FamiGold
13
What would the draw be with 4SMs then? Compared to the 8 and 16 that is.

This can probably be charted on a graph and scaled appropriately to see what it should be about for the in-between
Yeah it's clear there's some nonlinearity happening with such a large discrepancy over the other numbers (was it adding the 4SM number to 8SM number or multiplying the 16SM number by 0.75?) so it would be interesting to see how much we could narrow down these rough estimates.
 

Stinky Horse

Shriekbat
At this point, idk. This all hinges on how Nvidia is defining the “current frame;” if all they mean is “the frame being rendered,” then the distinction is a moot point, and it’s frame interpolation. I’m sure there will be clarification when the press embargo lifts.
Yes I was about to post the same. 'Current and prior' 100% refers to rendered frames only, because those rendered frames are the inputs. so it's a question of 'does 'current' also mean something else in relation to final frame order', which I don't think it necessarily does. I think it is more sensible to slot in the generated frame between the two inputs, rather than after, as it will undoubtedly look better. And the latency will typically be better, at least compared to ... uh brute forcing full 4K, which Nvidia unhelpfully offers as the comparison. (i'd much prefer a comparison to DLSS 2.1!)
 
@oldpuck
Hmm, here's my stab at the A78 power usage on Samsung 8nm question.
First off, I'm assuming that Samsung's 4LPP is comparable to TSMC N5, so I'm subbing in 3 Ghz@1 watt for 4LPP. I actually don't think that's the case, but I need a number to start with, and hey, maybe I was wrong and 4LPP actually is comparable to N5. So consider this an optimistic/best case estimation.
Then, I take what Samsung claims in this slide from July 2022 at face value. Like some others here, that's not necessarily something I prefer to do, but for this exercise, fine.
[Image: Samsung slide]


Working backwards from 4LPP to get to 8LPP, take the 3 Ghz@1 watt, then go 3/(1.11*1.11*1.09) ~= 2.23 Ghz@1 watt.
Given that, my guess for how much power 1.7 Ghz would require would then be...
2.23/1.7 ~= 1.31 (ratio of the two frequencies)
1.31^2 ~= 1.73 (the relative increase in power to go from 1.7 Ghz to ~2.23 Ghz)
1/1.73 ~= 0.58
So, at best, 0.58 watts to get a single A78 to 1.7 ghz on Samsung's 8 nm node. More likely to be a bit higher than that. Probably fair to round it up to 0.6 watts?
So then if I were to take ~2.23 Ghz@1 watt, then apply some more napkin math (half the frequency for quarter the power and all that jazz)...
~1.116 Ghz@0.25 watt. Still a bit better than what Thraktor got above (8x 1.113Ghz totaling 2.2 watts, so 1x 1.113 Ghz@0.275 watt). But I did not expect this handwavey crap to land anywhere near that close.
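
For anyone who wants to poke at the napkin math above, here's a minimal sketch of it. It assumes the starting point is 3GHz at 1 watt (which is what the 1/1.73 step implies), treats the 1.11, 1.11 and 1.09 factors as per-node-step frequency gains at equal power, and uses the same "power goes with the square of frequency" rule of thumb. The names are made up for illustration, and none of this is measured data.

```python
# Napkin math: walk a best-case A78 figure back from 4LPP to 8LPP,
# then scale power with the square of frequency.
BASE_GHZ = 3.0                   # assumed A78 clock at 1 W on 4LPP (optimistic)
BASE_WATTS = 1.0
NODE_STEPS = (1.11, 1.11, 1.09)  # per-step gains used in the post above


def derate(freq_ghz, steps=NODE_STEPS):
    """Undo each node step's claimed frequency gain at the same power."""
    for step in steps:
        freq_ghz /= step
    return freq_ghz


def watts_at(freq_ghz, ref_ghz, ref_watts=BASE_WATTS):
    """Rule of thumb: power scales with frequency squared."""
    return ref_watts * (freq_ghz / ref_ghz) ** 2


ghz_8lpp = derate(BASE_GHZ)
print(f"8LPP: ~{ghz_8lpp:.2f} GHz at {BASE_WATTS} W")
print(f"1.7 GHz: ~{watts_at(1.7, ghz_8lpp):.2f} W per core")
print(f"{ghz_8lpp / 2:.3f} GHz: ~{watts_at(ghz_8lpp / 2, ghz_8lpp):.2f} W per core")
```

That reproduces the ~2.23GHz, ~0.58W and 0.25W figures. Swapping in a less optimistic BASE_GHZ is the easy way to see how sensitive the result is to the 4LPP assumption.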
 

oldpuck

Like Like
Pronouns
he/him/his
Yuzu devs are allowed to (and do) have a buggy product. Nintendo doesn't have that luxury for a system that costs hundreds. That's all there is to it.
The Yuzu development team is also larger than NERD, isn't legally reusable by Nintendo, and based on SLOC alone is a 10+ million dollar project for the core emulator alone.
 

oldpuck

Like Like
Pronouns
he/him/his
So then if I were to take ~2.23 Ghz@1 watt, then apply some more napkin math (half the frequency for quarter the power and all that jazz)...
~1.116 Ghz@0.25 watt. Still a bit better than what Thraktor got above (8x 1.113Ghz totaling 2.2 watts, so 1x 1.113 Ghz@0.275 watt). But I did not expect this handwavey crap to land anywhere near that close.
@Thraktor's numbers are also in the ballpark of my handwavy Orin math, if somewhat better. Which lands me back where I was a month ago: "8nm feasible, but man, it's pushing it."
 

Kenka

Moblin
Pronouns
He/Him
Thanks a ton @Thraktor. You never let us down. We now know what this chip would do if manufactured using Samsung 8nm.

Let's hope for the best case scenario and assume that Nintendo has chosen a more recent process node. But if not, then I guess a Drake with half the power is still OK.
 

Alovon11

Bob-omb
Pronouns
He/Them
I just had a play around with the Jetson power tool myself, and I'm getting quite different estimates than were posted before.

For the GPU, my methodology was to turn off the DVA/PVA, etc., set the CPU to a fixed value, and compare the power consumption with the GPU turned off entirely to the power consumption at each GPU clock speed for 4 TPC (8 SM) and 8 TPC (16 SM) configurations. Then for each clock, I took the mid-point of the 4 TPC and 8 TPC power consumption, and subtracted the power with the GPU turned off. I set the GPU load level to high for all these tests. I feel this is a reasonable way to isolate the GPU power consumption and estimate it in a 12 SM configuration.
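
The method described above can be sketched roughly as follows. The readings are made-up placeholders, not output from the Jetson power tool, and the function names are my own. The second function is the "multiply the 16 SM number by 0.75" alternative that came up in the thread, included only for comparison.

```python
# Sketch of estimating 12 SM GPU power from 4 TPC (8 SM) and 8 TPC (16 SM)
# total-power readings taken at the same GPU clock.

def estimate_12sm_midpoint(p_4tpc_w, p_8tpc_w, p_gpu_off_w):
    """Midpoint of the 8 SM and 16 SM readings, minus the GPU-off baseline."""
    return (p_4tpc_w + p_8tpc_w) / 2 - p_gpu_off_w


def estimate_12sm_scaled(p_8tpc_w, p_gpu_off_w):
    """Alternative: scale the 16 SM GPU-only power by 12/16."""
    return 0.75 * (p_8tpc_w - p_gpu_off_w)


# HYPOTHETICAL total-power readings in watts at one clock, for illustration only.
P_GPU_OFF = 5.0
P_4TPC = 9.0
P_8TPC = 12.4

midpoint = estimate_12sm_midpoint(P_4TPC, P_8TPC, P_GPU_OFF)
scaled = estimate_12sm_scaled(P_8TPC, P_GPU_OFF)
print(f"midpoint estimate: {midpoint:.2f} W")
print(f"0.75 x 16 SM estimate: {scaled:.2f} W")
```

The two only agree when GPU power is strictly proportional to SM count. If there's a fixed cost that doesn't scale with SMs, the midpoint keeps all of it while the 0.75 scaling shrinks it, so the midpoint comes out a bit higher.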

The power consumption figures I got for the GPU with 12 SMs are:

420.75MHz - 5.70W
522.75MHz - 6.80W
624.75MHz - 8.65W
726.75MHz - 10.85W
828.75MHz - 13.20W
930.75MHz - 16.35W
1032.75MHz - 19.90W
1236.75MHz - 29.35W
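
To read off clocks that aren't in the table, a straight-line interpolation between neighbouring rows is probably good enough. A small sketch using the figures above (the helper name is mine, and linear interpolation between points is an assumption, since the real curve is not linear):

```python
# Linear interpolation over the 12 SM GPU power estimates listed above.
GPU_12SM = [
    (420.75, 5.70),
    (522.75, 6.80),
    (624.75, 8.65),
    (726.75, 10.85),
    (828.75, 13.20),
    (930.75, 16.35),
    (1032.75, 19.90),
    (1236.75, 29.35),
]


def power_at(clock_mhz, table=GPU_12SM):
    """Estimated watts at a clock inside the table's range."""
    if not table[0][0] <= clock_mhz <= table[-1][0]:
        raise ValueError("clock outside the measured range, not extrapolating")
    for (c0, p0), (c1, p1) in zip(table, table[1:]):
        if c0 <= clock_mhz <= c1:
            return p0 + (p1 - p0) * (clock_mhz - c0) / (c1 - c0)


print(f"460 MHz: ~{power_at(460):.1f} W")
print(f"768 MHz: ~{power_at(768):.1f} W")
```

That gives roughly 6.1W at 460MHz and 11.8W at 768MHz. It deliberately refuses to guess below 420.75MHz, since the tool offers nothing lower.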

These probably overestimate the power consumption of an 8nm Drake GPU by a bit, as it uses smaller tensor cores, has fewer ROPs, and should have some small power savings by using only a single GPC rather than two. Still, I wouldn't expect the power consumption to be significantly lower than this. A few months ago I made some rough estimates of power consumption of Ampere on 8nm by measuring my RTX 3070 and scaling from there, and I got a figure of 16.1W for 12 SMs at 1155MHz. This was a very crude estimate (just take the reported power consumption from GPU-Z and divide by the number of SMs), and seems to come in a bit lower than we see above. I'd err on the side of trusting Nvidia's official tool for estimating Orin power consumption over my rough estimate based on a GPU 4x as big, though.

I believe the original Switch's GPU consumed somewhere around 3W in portable mode, which means we're already pushing things at the base clock of 420MHz. Even if Drake comes down to about 5W at 420MHz from the smaller tensor cores and other optimisations, that's still a big increase. It's impossible to say how power consumption behaves at lower clocks, and maybe they could squeeze it in at the same 384MHz clock as the original Switch, but my guess is if Nvidia doesn't clock below 420MHz in any configuration, it's because it's not efficient to do so. I'm still of the opinion that running a full 12 SM GPU on 8nm isn't feasible in handheld mode for a device the same size as the current Switch, which means either disabling SMs in handheld mode, or a different manufacturing process.

On the docked side of things, they're more limited by cooling than power consumption, and a 768MHz clock seems reasonable with a similar cooling set-up to the original Switch, if running a bit hotter. They could possibly push up higher to 900MHz-1GHz or so if they really wanted to, but it's more a question of Nintendo's tolerance for a more expensive cooling setup and/or more fan noise than a technical limitation.

For the CPU, as it's not possible to turn the CPU off entirely, I simply took the difference between the 4 core and 12 core CPU configuration for each clock, again with the CPU load level set to high. As a bit of a sense-check, I also checked the reported power consumption for each clock on an 8 core configuration, and validated that the difference in power draw between 4 and 8 cores was approximately the same as between 8 and 12 cores. This is true typically to within 0.1W, so I think it's reasonable to assume the power cost is linear with respect to the number of clusters activated (ie if going from 1 to 2 clusters adds the same amount of power consumption as moving from 2 to 3, then we can infer that going from 0 to 1 cluster of cores costs a similar amount).

The figures I got for an 8 core CPU are as follows:

1113.6MHz - 2.2W
1267.2MHz - 2.5W
1497.6MHz - 3.1W
1651.2MHz - 3.8W
1728.0MHz - 4.1W
1881.6MHz - 4.9W
2035.2MHz - 5.8W
2188.8MHz - 7.1W

CPU power consumption on Drake might be a bit lower due to using a single cluster and potentially less cache, but I would expect only pretty small differences here. Compared to the GPU, though, I don't think there's as big an issue. The CPU on the original Switch consumed a bit under 2W, so Nintendo could clock a bit higher than the old 1GHz. While ~1.1GHz or so might not sound like much, an 8 core A78 clocked at 1.1GHz (with 7 cores for games) is still a huge increase over 4 A57 cores (with 3 for games) at 1GHz. If they push the power draw of the CPU up a bit (which I feel is more likely than increasing handheld GPU power draw), then at 3W they could get up to around 1.5GHz, which would put them in a pretty nice spot.
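
The same table can be turned around to ask "what clock fits in a given CPU power budget". A minimal sketch, again interpolating linearly between the rows above (my assumption, as is the helper name):

```python
# Highest 8-core CPU clock that fits a power budget, from the table above.
CPU_8CORE = [
    (1113.6, 2.2),
    (1267.2, 2.5),
    (1497.6, 3.1),
    (1651.2, 3.8),
    (1728.0, 4.1),
    (1881.6, 4.9),
    (2035.2, 5.8),
    (2188.8, 7.1),
]


def max_clock_mhz(budget_w, table=CPU_8CORE):
    """Interpolated clock for a budget; None if below the lowest row."""
    if budget_w < table[0][1]:
        return None  # would need a clock lower than anything in the table
    if budget_w >= table[-1][1]:
        return table[-1][0]
    for (c0, p0), (c1, p1) in zip(table, table[1:]):
        if p0 <= budget_w <= p1:
            return c0 + (c1 - c0) * (budget_w - p0) / (p1 - p0)


for budget in (2.0, 3.0, 4.0):
    clock = max_clock_mhz(budget)
    label = "below table range" if clock is None else f"~{clock:.0f} MHz"
    print(f"{budget} W -> {label}")
```

A 3W budget lands at about 1.46GHz, in line with the "around 1.5GHz" above, while 2W falls just under the lowest row in the table.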

The CPU side of things is a bit better than I expected, to be honest. You're obviously not going to clock 8 A78s up to 2GHz+ on 8nm in Switch's form-factor, but they don't have to clock them down to oblivion to get them to work. It would still be comfortably more capable than PS4/XBO's CPUs in a portable device while on an old manufacturing process, which isn't a bad position to be in.
Honestly, I feel that it may be too much still.

Erista in portable mode was around 5.5W for the CPU/GPU iirc. On these numbers a 1.0GHz CPU would be around 2W, and 460MHz on the GPU (have to consider the highest portable clocks for B/C games) would be around 6W, so it'd blow way past that. The GPU would eat up all of Erista's portable wattage on its own, making the package at equivalent Erista clocks consume over 8W from the CPU/GPU alone.

And even assuming they follow battery density increases over the years and go with a 6000mAh battery, that would likely result in lower battery life than Erista units, not to mention the heat output.

With those numbers, 8N just seems impossible to fit into a handheld with a GPU that big, and with 0 indication of them turning off GPU cores there is no real grounds to assume that "They can just turn the GPU cores off to make it work in B/C" atm...

Primarily because some Switch games seemingly do have logic tied to clocks here and there, along with the fact that games need to declare which of the 3 GPU performance modes they are running in while in portable mode. So dropping clocks below TX1 clocks would pretty much obliterate B/C without emulation, which would be extremely expensive to do.

Also, not to mention Boost mode, which even in BOTW1 cranks the CPU to 1.7GHz+ on the fly and sets arbitrary GPU clocks per game. For proper back-compat you'd need to make sure Drake can run the full clock range that Boost mode provides within the same wattage target.
 
Dakhil

2010 experience points!
Founder

You are correct, of course, but I think we're stuck with that no matter what. Ampere-B has been on Samsung 8nm the entire time, Orin was built on 8nm, and we know T239 shares design with T234. There is likely not a cost advantage porting to TSMC now over porting to TSMC later, because it's going to be a port either way, as opposed to being built fresh on TSMC.
I think there's more work involved with transitioning from Samsung's 8N process node to TSMC's 4N process node for a die shrink since Samsung's 8N process node uses DUV lithography and TSMC's 4N process node uses EUV lithography. So I imagine Nintendo and Nvidia are probably going to have to redesign Drake with EUV lithography and TSMC's IPs in mind if going from Samsung's 8N process node to TSMC's 4N process node for a die shrink.

But if Nintendo and Nvidia do decide to use TSMC's N6 process node to fabricate Drake initially, then I think transitioning from TSMC's N6 process node to TSMC's 4N process node for a die shrink should be relatively more straightforward, especially with TSMC's N6 process node and TSMC's 4N process node both using EUV lithography, and both obviously using TSMC's IPs.

So I think although the costs between using TSMC initially vs using TSMC later are probably not drastically different, I think the amount of work associated with using TSMC later is at least considerably larger compared to using TSMC initially.

(And remember that TSMC's trying to shift customers from TSMC's N7 process node to TSMC's N6 process node.)
 

Gay Bowser

i’m lost out here for good
Founder
Pronouns
they/them
Let's hope for the best case scenario and assume that Nintendo has chosen a more recent process node. But if not, then I guess a Drake with half the power is still OK.
the performance difference between Samsung 8nm and TSMC N7/N6 at the same power is not double, can we maybe not go full negative exaggeration on this
 

Kenka

Moblin
Pronouns
He/Him
the performance difference between Samsung 8nm and TSMC N7/N6 at the same power is not double, can we maybe not go full negative exaggeration on this
I said double because Thraktor supports the idea that a GPU on 8nm would have half its cores disabled in portable mode.
 

Alovon11

Bob-omb
Pronouns
He/Them
I said double because Thraktor supports the idea that a GPU on 8nm would have half its cores disabled in portable mode.
There
is
no
indication
that
is
the
case

Not to mention the fact that doing that sort of thing would be a nightmare switching between docked and portable mode (Pretty much changing between 2 whole different GPUs)

Not to mention it'd pretty much lock B/C to 6SMs

And that's not to mention whether turning off that much of the GPU would lead to power leakage.
 

JoshuaJSlone

Bob-omb
Pronouns
he / him
Second:

This paragraph sounds like interpolation. In that loop, you would render frame N+2 using super resolution, calculate the optical flow field between frame N and N+2, and pass that to the neural network.
It sounds like the same thing as the first to me, not sure what part makes you think frame N+2 comes into play?
I think it is more sensible to slot in the generated frame between the two inputs, rather than after, as it will undoubtedly look better.
Look better, yes. Taking a full second to render a frame also looks better than doing so dozens of times a second. But for real-time functionality like a game, being done sooner is important.
 

Mercury_Sagit

┏(‘▀_▀’)ノ♬♪
Contributor
Pronouns
He/Him
I don't think introducing variable core count into power profiles would make sense for Drake because:
  • If wattage is enough of a concern that reduced SM and CPU core counts in handheld mode are necessary, I'd assume that NV (and Nintendo) engineers would have spotted it during the design phase, not close to release.
  • Prioritizing variable core count also doesn't make sense for a hybrid device imo. The Tegra line does have power profiles with such a feature, but I'm not sure if they can switch between modes on the fly. Internet searches so far point to nvpmodel, a tool used to configure Jetson/AGX power modes. However, as I understand it, these modes are set before running, not during a session (training, inference, etc). Such profiles do not really exist on desktop PCs either.
I read the blog post again, and I’m waffling a bit on what I’ve said before on DLSS 3. To me, there are two sections in the blog post that don’t fit together.

First:

This sounds like extrapolation to me. You render the current frame N using the super resolution network, calculate the optical flow field between frame N and N-2, and pass that to the neural network, which internally predicts how that flow field will evolve in the next frame.

It could also be frame N and frame N-1, but I am assuming that we only use non-generated frames for optical flow.

I also assume that we are only calculating one optical flow field. You could calculate several fields pairwise, which gives you a notion of acceleration like I was describing a couple of posts ago, but on closer inspection, the blog post seems pretty specific about using “an optical flow field” in the singular.

Second:


This paragraph sounds like interpolation. In that loop, you would render frame N+2 using super resolution, calculate the optical flow field between frame N and N+2, and pass that to the neural network.

The disadvantage of this method is that you are rendering the frames in advance, but the tradeoff is that you no longer need to have any notion of acceleration, since you have the exact translations at each point in the optical flow field. One optical flow field is sufficient, which fits the singular description. The neural network would only have to estimate what fraction of that translation occurs between frame N and N+1 versus between N+1 and N+2.
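
To make the latency tradeoff between the two readings concrete, here's a toy timing model. It is only a sketch of the argument in this post, not a description of how Nvidia actually schedules frames: it assumes a fixed render time per real frame, output paced at half that interval, and zero cost for generating a frame. All names are made up for illustration.

```python
# Toy model: when does each frame reach the screen under interpolation
# versus extrapolation? Rendered frames are N+0, N+2, N+4 and so on,
# generated frames are the odd ones in between.

def schedule(mode, render_ms=33.3, rendered=3):
    """Return (present_time_ms, label) pairs in display order."""
    half = render_ms / 2
    events = []
    for k in range(rendered):
        n = 2 * k
        done = (k + 1) * render_ms  # when rendered frame N+n is finished
        if mode == "interpolate":
            # The in-between frame needs both neighbours, so the newest
            # rendered frame is held back until the generated one is shown.
            if k > 0:
                events.append((done, f"generated N+{n - 1}"))
            events.append((done + half, f"rendered N+{n}"))
        elif mode == "extrapolate":
            # The rendered frame goes out at once, then a predicted one.
            events.append((done, f"rendered N+{n}"))
            events.append((done + half, f"generated N+{n + 1}"))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return events


def added_latency_ms(mode, render_ms=33.3):
    """Delay between a rendered frame finishing and being displayed."""
    first_rendered = next(t for t, label in schedule(mode, render_ms)
                          if label.startswith("rendered"))
    return first_rendered - render_ms


for mode in ("interpolate", "extrapolate"):
    print(mode, f"added latency ~{added_latency_ms(mode):.1f} ms")
    for t, label in schedule(mode):
        print(f"  {t:6.1f} ms  {label}")
```

In this model interpolation holds each real frame back by half a render interval, while extrapolation adds nothing but has to guess the motion. Any real generation cost would come on top of both.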

At this point, idk. This all hinges on how Nvidia is defining the “current frame;” if all they mean is “the frame being rendered,” then the distinction is a moot point, and it’s frame interpolation. I’m sure there will be clarification when the press embargo lifts.
I'd lean more toward frame extrapolation, since the paragraph right after the one you cited says:
For each pixel, the DLSS Frame Generation AI network decides how to use information from the game motion vectors, the optical flow field, and the sequential game frames to create intermediate frames. By using both engine motion vectors and optical flow to track motion, the DLSS Frame Generation network is able to accurately reconstruct both geometry and effects, as seen in the picture below.
I'd assume that both the game motion vectors and the optical flow fields take their values from the most recently rendered frame(s) within the game itself, i.e. before the DLSS upscale. Therefore it would be most efficient if the same set of motion vectors from previous frames could be used for both DLSS upscaling and the OFA.
Besides, there would be no stall if the OFA finishes its job in less than the frametime, such that the generated frame can be presented before the next upscaled frame is ready. So technically the OFA would have a much larger frametime budget compared to DLSS upscaling, since it does not need to wait for the current natively rendered image from the game itself before it can start.
 

Skittzo

Yoshi
Founder
Pronouns
He/Him
FamiGold
13
If a 6SM configuration was always needed for handheld mode, wouldn't that be in the API as often as 12SM is? Yet as I understand it the API only ever sees 12SMs, there is no indication that it can utilize any other GPU configuration.
 

Alovon11

Bob-omb
Pronouns
He/Them
If a 6SM configuration was always needed for handheld mode, wouldn't that be in the API as often as 12SM is? Yet as I understand it the API only ever sees 12SMs, there is no indication that it can utilize any other GPU configuration.
Yep, exactly. There is no indication of any change to the GPU core count in NVN2 as it stands, nor of any use of that kind of tech, outside of one mention of a tangential feature, iirc. And that single mention means nothing in itself with regard to actually being able to do that.

NVN2 only sees 12SMs

We should expect 12SMs in both modes.
8nm based on Thak's numbers is not viable for 12SMs in portable mode

If Tangmaster's numbers are right, then it is viable in portable mode, albeit more at OG Switch battery life even with 12SMs.

end of the story on that
 

Skittzo

Yoshi
Founder
Pronouns
He/Him
FamiGold
13
Yep, exactly. There is no indication of any change to the GPU core count in NVN2 as it stands, nor of any use of that kind of tech, outside of one mention of a tangential feature, iirc. And that single mention means nothing in itself with regard to actually being able to do that.

NVN2 only sees 12SMs

We should expect 12SMs in both modes.
8nm based on Thak's numbers is not viable for 12SMs in portable mode

If Tangmaster's numbers are right, then it is viable in portable mode, albeit more at OG Switch battery life even with 12SMs.

end of the story on that
I mean, it's always possible that the portion of the API that leaked just happened to not have any mentions of a different core configuration for portable mode, but like I said if this theoretical 6 or 8SM mode was always used in handheld mode, then you'd expect it to show up as often as the docked 12SM configuration, right?

So while it's theoretically possible perhaps that we just simply didn't get that bit, that would be exceedingly unlikely no?
 

Alovon11

Bob-omb
Pronouns
He/Them
I mean, it's always possible that the portion of the API that leaked just happened to not have any mentions of a different core configuration for portable mode, but like I said if this theoretical 6 or 8SM mode was always used in handheld mode, then you'd expect it to show up as often as the docked 12SM configuration, right?

So while it's theoretically possible perhaps that we just simply didn't get that bit, that would be exceedingly unlikely no?
Yes, it's extremely unlikely it wouldn't show up in the GPU Definition section in NVN2
 

ReddDreadtheLead

Chain Chomp
I’m ready for Nintendo to put so much R&D for 12SMs, 8,12-16GB of LPDDR5, and 8 cores clocked to 1.9GHz

And then cap it to 2SMs, 4-6GB of memory, and only 3 cores, 2 for games. For fun. Return of the Tricore for the Triforce!!!!


please do not take an obvious joke as something serious, they aren’t going to spend that much money and not use it.
 

