StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (New Staff Post, Please read)

Dakhil · Nov 11, 2021

Mr Swine said:
Does the SoC really need to fit 120mm2 if they plan to use more CPU cores and more SM at a lower speed to keep heat down?

Can’t they have a 135-140mm2 SoC?

That depends on if Nintendo plans on having the DLSS model*'s motherboard be very similar to the OLED model's motherboard by design, or if Nintendo's going to be designing and manufacturing a much smaller and more compact motherboard that can fit a much larger die for the DLSS model*.

Skittzo · Nov 11, 2021

Glom said:
This makes me think of Wooded Temple in Spirit Tracks, where you have to use the whirlwind to puff away the toxic clouds that are covering the place.

Can't wait for Link's new item, the Industrial Etchant.

ILikeFeet · Nov 11, 2021

BlackTangMaster said:
This is a real die shot of TX1. CPU and GPU are taking a little less than 1/3 of the total chip. We don't have a real Xavier die shot but Nvidia PR 'die shots' from the Pascal-Turing era were more or less on point for their GPU products. That said I would expect their Xavier/Orin die shots to be slightly different from their PR die shot representations due to their nature of being SoCs with more hardware accelerated parts than GPU only chipsets.

this isn't the real Xavier chip?

nvidia_xavier_die_shot_(annotated).png

davec00ke · Nov 11, 2021

Dakhil said:
That depends on if Nintendo plans on having the DLSS model*'s motherboard be very similar to the OLED model's motherboard by design, or if Nintendo's going to be designing and manufacturing (or paying Nvidia to design and manufacture) a much smaller and more compact motherboard that can fit a much larger die.

Again! Nvidia doesn't have anything to do with motherboards

BlackTangMaster · Nov 11, 2021

ILikeFeet said:
this isn't the real Xavier chip?

That's a 'press die shot'. That said, as I said earlier, most recent press shots from Nvidia have been on point when it comes to real die shots.

NVIDIA-GeForce-RTX-3090-RTX-3080-Ampere-Die-Shot-GA102-GPU-_4-scaled.jpg

*GA102 real die shot

*GA102 press die shot

Pretty similar but I would be more suspicious over mobile SoC press representations. They tend to diverge more from their press die shots as they have to deal with more different types of parts than their GPU counterparts.

ILikeFeet · Nov 11, 2021

BlackTangMaster said:
That's a 'press die shot'. That said, as I said earlier, most recent press shots from Nvidia have been on point when it comes to real die shots.

*GA102 real die shot

*GA102 press die shot

Pretty similar but I would be more suspicious over mobile SoC press representations. They tend to diverge more from their press die shots as they have to deal with more different types of parts than their GPU counterparts.

that xavier shot looks way too messy to be a press shot, lol. unless it's when they started to make them accurate

Dakhil · Nov 11, 2021

davec00ke said:
Again! Nvidia doesn't have anything to do with motherboards

I was actually editing my comment when you were commenting since I recently woke up when I originally wrote the comment.

BlackTangMaster · Nov 11, 2021

davec00ke said:
Again! Nvidia doesn't have anything to do with motherboards

Nvidia has heavily pushed for massive SoCs that can run on a credit-card-sized board at 10W. It is the sole company that has a 450mm2 chip running at 10W in something that is not a high-end laptop or tablet.

BlackTangMaster · Nov 11, 2021

ILikeFeet said:
that xavier shot looks way too messy to be a press shot, lol. unless it's when they started to make them accurate

The Xavier shot is a press shot. Just like the Orin one. We have never seen a real die shot of an Nvidia SoC since TX1, which has been pictured twice (first in 2015 and a second time in 2017, in order to compare the Switch '500 man hours custom' chipset with the original TX1).

NateDrake · Nov 11, 2021

ShadowFox08 said:
@NateDrake This is kind of random, but have you heard anything from devs about the CPU for Dane? Is it the A78C or the A78AE? The A78C is optimized for mobile gaming and is smaller than the A78AE in the Orin NX.

Don't usually get that technical with them during discussions.

Dakhil · Nov 11, 2021

NateDrake said:
Don't usually get that technical with them during discussions.

And I think for now, it's for the best. Right now, I don't think asking for technical details is worth being in hot water with Nintendo.

~

Anyway, I wonder if Dane's already taped out or if Dane's going to be taped out very soon.

Alovon11 · Nov 11, 2021

Dakhil said:
And I think for now, it's for the best. Right now, I don't think asking for technical details is worth being in hot water with Nintendo.

~

Anyway, I wonder if Dane's already taped out or if Dane's going to be taped out very soon.

If not already, it's gonna be soon, as the Orin family has been taped out, since we have the official specs and they are coming out in Q1 2022.

Alovon11 · Nov 11, 2021

Okay, so I did some re-number crunching for Orin's TFLOPs.

Considering the L1 and L2 cache improvements we can judge it as a 30% increase in performance at the minimum before any potential FP16 Doubling improvements on the performance.

So we can actually use that number to figure out how Orin compares to other NVIDIA GPUs per-FLOP.

My main comparison for figuring out things will be

Ampere to Turing = RTX 3060Ti vs RTX 2080 Super (2% Difference)
Turing to Pascal = RTX 2070 Super vs GTX 1080Ti (1% Difference)
Ampere to Pascal = RTX 3060 vs GTX 1080Ti (7% Difference)

So first off, Ampere to Turing and trying to find the actual "How many more FLOPs does it take to match performance" number, we know the 3060Ti is very very marginally more powerful than the 2080 Super (Techpowerup posts it as a 2% difference)
So if the 3060Ti is 16.2TFLOPs, and the 2080 Super is 11.15TFLOPs, then that means a roughly 45.3% increase in TFLOPs for a 2% increase in performance, so a 43.3% Increase in TFLOPs for the same performance.

AKA Ampere takes 43.3% more TFLOPs to equal Turing's Performance.

-------------------
For Pascal to Turing, we can look at the 2070 Super vs the 1080Ti. The 2070 Super was only 9.06TFLOPs, and the 1080Ti was 11.34TFLOPs.
So it took 20.1% fewer TFLOPs to have a 1% performance increase between Pascal and Turing, so a 19.1% drop in TFLOP number going from Pascal to Turing normalized.

Or when looking at Turing to Pascal, it takes 24.2% more TFLOPs for Pascal to match Turing.
-------------------
Now for Ampere to Pascal: the 3060 is 7% weaker than the 1080Ti. The 3060 has 12.74TFLOPs, the 1080Ti has 11.34TFLOPs.
So from Pascal to the 3060, it has a 12.4% difference, but you have to add that 7% performance difference, so that becomes a 19.4% increase in TFLOPs for the same performance.

So 7% stronger than the 3060 would be 13.63TFLOPs, which would be equivalent to the 1080Ti.

-------------------
With those numbers out of the way, we can extrapolate Orin's comparative TFLOPs.

Let's say 2 Orin TFLOPs. If an Orin TFLOP is a 30% increase over an Ampere TFLOP, that would mean it would be equivalent to 2.6 Ampere TFLOPs.

If the gap between Ampere and Turing is 43.3%, then subtract 30% from that to account for the Orin boost, so an Orin takes 13.3% more TFLOPs than Turing to meet the same performance.

So 2 Orin TFLOPs is equivalent to 1.73 Turing TFLOPs before FP16 Doubling

------------
For a Pascal comparison, that would mean Orin actually becomes more performant per-FLOP than Pascal, 10.6% more. So 2TFLOPs of Orin would be equivalent to 2.2TFLOPs of Pascal.
-----

This gives us some interesting numbers for GPU comparisons

Like, for example, how Orin natively, before FP16, would actually edge out the 1050Ti.

As 2 Orin TFLOPs is equivalent to 2.2 TFLOPs of Pascal, that would put it 0.1TFLOPs ahead of the 2.1TFLOPs of the 1050Ti.

And for Turing, it would be a tad ahead of the T600 as the T600 has 1.7 Turing TFLOPs.

Even further validating my idea that the Devkits likely could've been Clara AGXs with T600s in them

TL;DR: Redid Numbers, 2TFLOPs of Dane is now ahead of the 1050Ti and the T600 before FP16 Optimizations at native performance.

AKA, we have surpassed the PS4 by anywhere from 10-20% now for Dane GPU-Side
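For anyone who wants to rerun the numbers, here is a minimal Python sketch of the arithmetic above. It follows the same shortcut of adding and subtracting percentage points, which is only a rough linear approximation, and the 30% Orin uplift is the guess made earlier in this post rather than a measured figure. The function and variable names are made up for illustration.

```python
# Sketch of the post's method: percentage points are added and subtracted
# directly, which is a rough linear approximation, not a benchmark result.

def pct_more(a: float, b: float) -> float:
    """Percent by which a exceeds b."""
    return (a / b - 1.0) * 100.0

# TFLOPs and performance gaps as quoted in the post.
ampere_vs_turing = pct_more(16.2, 11.15) - 2.0   # 3060 Ti is ~2% faster than 2080 Super
ampere_vs_pascal = pct_more(12.74, 11.34) + 7.0  # 3060 is ~7% slower than 1080 Ti

ORIN_UPLIFT_PCT = 30.0  # assumed per-FLOP gain over desktop Ampere (the post's guess)

def equivalent_tflops(orin_tflops: float, ampere_penalty_pct: float) -> float:
    """Convert Orin TFLOPs into another architecture's TFLOPs, the post's way."""
    net_penalty = ampere_penalty_pct - ORIN_UPLIFT_PCT
    return orin_tflops * (1.0 - net_penalty / 100.0)

turing_equiv = equivalent_tflops(2.0, ampere_vs_turing)
pascal_equiv = equivalent_tflops(2.0, ampere_vs_pascal)

print(f"Ampere vs Turing penalty: {ampere_vs_turing:.1f}%")
print(f"Ampere vs Pascal penalty: {ampere_vs_pascal:.1f}%")
print(f"2 Orin TFLOPs ~ {turing_equiv:.2f} Turing TFLOPs")
print(f"2 Orin TFLOPs ~ {pascal_equiv:.2f} Pascal TFLOPs")
```

Changing `ORIN_UPLIFT_PCT` shows how sensitive the 1050Ti and T600 comparisons are to that one assumption.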

Dakhil · Nov 11, 2021

Alovon11 said:
If not already, it's gonna be soon, as the Orin family has been taped out, since we have the official specs and they are coming out in Q1 2022.

Outside of the process node being used, that is. I still think Samsung's 8N process node is being used to fabricate Dane, especially when taking into account that the max CPU and GPU frequencies of Orin are pretty much identical to the max CPU and GPU frequencies of the Tegra X1.

NateDrake · Nov 11, 2021

Dakhil said:
And I think for now, it's for the best. Right now, I don't think asking for technical details is worth being in hot water with Nintendo.

~

Anyway, I wonder if Dane's already taped out or if Dane's going to be taped out very soon.

That's one of the reasons I don't ask. There is little reason to know the CPU and clockspeeds or RAM this far in advance. The info serves no value to the casual audience, and niche details that please a small subsection of the community aren't worth the risk or exposure.

ILikeFeet · Nov 11, 2021

NateDrake said:
That's one of the reasons I don't ask. There is little reason to know the CPU and clockspeeds or RAM this far in advance. The info serves no value to the casual audience, and niche details that please a small subsection of the community aren't worth the risk or exposure.

BREAKING NEWS! NateDrake calls Famiboard nerds worthless!

it tru tho

Dakhil · Nov 11, 2021

fwd-bwd said:
This GTC session introduces the Orin SoC's optimization strategies for deep neural networks (DNNs), namely sparsity, tiling, and chaining, and is worth a watch if interested. It also includes the following slide showing the hardware features that improve DNN optimization.

Edit: typo

Here are the slides for the "Optimizing Deep Neural Networks for NVIDIA DRIVE [SE31472]" session. There are many tidbits in the slides that I think @Anatole would find interesting and informative, especially the sections about how Orin deals with sparsity, tiling, and chaining.

Thraktor · Nov 11, 2021

Dakhil said:
Anandtech mentioned that Jetson AGX Orin (and Jetson Orin NX by extension) has 17 billion transistors. I think Orin X could be the only chip in the Orin family that has 21 billion transistors.

I think they're just basing that on the old (2019?) figure that Nvidia gave, and aren't aware of the updated 21 billion transistor number. Nvidia haven't talked transistor numbers at all during GTC, and the most recent official number of 21 billion lines up exactly with expectations based on the process and die size (17 billion would mean a lower density process than desktop Ampere, which would be very unusual). I also don't think there are two different chips here. If anything, Orin X is just a different binning of the same Orin die (possibly a higher clocked bin, given that they've reported 254 TOPS for automotive Orin vs 200 TOPS for Orin Jetson).

ILikeFeet said:
in Nvidia's talk about medical stuff, they mention using ray tracing there, so Orin would be used there. however, this is also for the updated Clara, so there's also a separate GPU for extra performance. there's also some research into using ray tracing in depth perception and identification stuff

NineTailSage said:
I have to find the video but there's one of an engineer, I believe, talking about RT being used in automotive guidance systems to help with object detection.

Thanks. Yeah, that does make sense, and I can see why limited RT would be useful for a few of Orin's applications.

BlackTangMaster said:
This is a real die shot of TX1. CPU and GPU are taking a little less than 1/3 of the total chip. We don't have a real Xavier die shot but Nvidia PR 'die shots' from the Pascal-Turing era were more or less on point for their GPU products. That said I would expect their Xavier/Orin die shots to be slightly different from their PR die shot representations due to their nature of being SoCs with more hardware accelerated parts than GPU only chipsets.

To add to this, one thing worth keeping in mind is that the GPU is quite a bit more than the two SMs on TX1, and quite a few of those blocks around and between the SMs are also the GPU, perhaps as much as doubling the size of the GPU. This is pretty important for estimating the size of a small GPU like you'd get on TX1 (or Dane), in that it's not going to scale directly with the number of SMs. There's a certain amount of logic that's going to be per-GPU, a certain amount that's per-GPC, and then the part that's per-SM. For a GPU this small, there's only going to be one GPC, so the first two are effectively static, and it's important to distinguish between those parts and the SM parts. (We've also got per-TPC on the newer architectures, but I'll bundle them into the SMs, assuming the SMs are usually a multiple of two anyway).

On the positive side, this means that you could double the number of SMs and only increase the GPU die area by (let's say) 50%. On the other hand, though, it means you can't just say "If Dane GPU has X% as many SMs as Orin, it will take up X% as much space", as there's a bunch of logic there which doesn't scale.

That said, we could make some inferences from the relative size of the SMs on the different systems. The Orin die shot, as you say, is a PR shot, but probably reflects the die as it was at the time in terms of areas well enough. "As it was at the time" is pretty important, though, as at the same time they said the transistor count was 17 billion and it was capable of 200 TOPS, but later updated that to 21 billion and 254 TOPS. Given it's a different shape than the final die, it's likely that the chip layout has changed a bit since then. Assuming an identical transistor density between the two, that would mean the old version of the Orin die was about 377mm2, and therefore the per-TPC die area is 8.8mm2, or 4.4mm2 per SM.
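As a rough check on the 377mm2 figure, here is a small Python sketch. It assumes the density of desktop Ampere's GA102 (28.3 billion transistors on 628.4mm2), which is my stand-in for "identical transistor density" and not necessarily the exact input used above. The per-TPC and per-SM areas come from measuring the PR die shot and are not reproduced here.

```python
# Assumption: use GA102 (desktop Ampere) as the density reference.
GA102_TRANSISTORS = 28.3e9
GA102_AREA_MM2 = 628.4

DENSITY_PER_MM2 = GA102_TRANSISTORS / GA102_AREA_MM2  # roughly 45 million per mm^2

def die_area_mm2(transistors: float, density: float = DENSITY_PER_MM2) -> float:
    """Estimated die area for a transistor count at a given density."""
    return transistors / density

old_orin_area = die_area_mm2(17e9)  # the earlier 17 billion transistor figure
new_orin_area = die_area_mm2(21e9)  # the updated 21 billion transistor figure

print(f"17B transistors: ~{old_orin_area:.0f} mm^2")
print(f"21B transistors: ~{new_orin_area:.0f} mm^2")
```

Under that assumption the 17 billion figure lands at about 377mm2, in line with the estimate above.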

So if you assume that, from one arch to the next, the die area (not transistor count) of the per-GPU and per-GPC logic is pretty consistent, then for the same die area you'd get a 3 SM Ampere GPU in about the same die area as TX1's 2SM Maxwell GPU. Which is... not fantastic.

If you switch that around and assume that Nvidia somehow managed to keep the transistor count of GPU-level and GPC-level logic the same all the way from Maxwell to Ampere, then that logic will shrink by a factor of about 3 due to the density improvements, and you'll have more space left for the SMs. If we take the GPU of TX1 above to be 12mm2 for the SMs plus another 12mm2 for everything else, the everything else becomes 4mm2, and you're left with 20mm2, enough for 4 Ampere SMs, or 5 at a push. Even that's wildly optimistic, though, as there's obviously some transistor growth from generation to generation, and the bigger L2 cache in Orin's GPU would add considerably to that if they carried it over to Dane.
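The two scenarios above (fixed logic keeps its area, versus fixed logic shrinks by about 3x) can be put side by side in a short Python sketch. All inputs are the rough estimates from this post: 4.4mm2 per Ampere SM and a 24mm2 TX1 GPU split evenly between SMs and everything else. The names are made up for illustration.

```python
SM_AREA_MM2 = 4.4        # per-SM Ampere estimate from the Orin PR die shot
TX1_SM_AREA_MM2 = 12.0   # rough TX1 SM area
TX1_FIXED_MM2 = 12.0     # rough TX1 per-GPU and per-GPC logic area
GPU_BUDGET_MM2 = TX1_SM_AREA_MM2 + TX1_FIXED_MM2  # keep the same total GPU area

def sm_count(fixed_logic_mm2: float) -> float:
    """How many Ampere SMs fit in the TX1 GPU area after fixed logic is paid for."""
    return (GPU_BUDGET_MM2 - fixed_logic_mm2) / SM_AREA_MM2

# Scenario 1: fixed logic occupies the same die area as on TX1.
same_area = sm_count(TX1_FIXED_MM2)
# Scenario 2: fixed logic keeps its transistor count and shrinks ~3x with density.
shrunk = sm_count(TX1_FIXED_MM2 / 3)

print(f"Same-area fixed logic: {same_area:.2f} SMs")
print(f"3x-shrunk fixed logic: {shrunk:.2f} SMs")
```

The first case comes out a little under 3 SMs and the second at about 4.5, which is where "3 SMs" and "4, or 5 at a push" come from.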

The other possibility is that other logic on the SoC outside of the GPU shrinks due to density improvements, and perhaps there's something to be gained there, or from dropping functionality which Nintendo doesn't need, but if people are also expecting an 8 core A78 CPU and a 128-bit memory bus, then those are two extra things which are going to be competing for space, so something's gotta give between these.

I've been saying for quite a while now that the most likely config for a new Switch SoC is 4 big CPU cores (ie A78), possibly with a few A55 cores, and a 4 SM GPU. That still seems the most likely outcome, and if anything the Orin reveal has solidified that, as the transistor density is the same as desktop Ampere, which means we're not getting the higher-density mobile libraries or 8LPA process improvements which would have helped them squeeze more onto Dane.

As I see it, a GPU bigger than 4 SMs is only plausible on Dane if one of three things happens:

They use higher density mobile libraries or process improvements on 8nm. This seems very unlikely if Orin hasn't used them.
They use a more advanced manufacturing process (eg 7nm or smaller). This seems very unlikely if the much more expensive Orin is on 8nm, and we've got reliable leaks saying it's 8nm.
Dane is significantly bigger than the 121mm2 Tegra X1. This again seems very unlikely, and TX1 is relatively large for this type of SoC to begin with.

I don't want to rain on anyone's parade, but I see a lot of people looking at Jetson Orin NX as a basis for Dane, but if anything it's a guide for what Dane won't do (if Dane was coming along a few months later with 1024 CUDA cores, they wouldn't have bothered releasing a Jetson with a full Orin die binned to 1024 cores). We're not going to get half of Orin at a quarter of the die size.

That said, I'd still be very excited by a Dane with 4 A78s and a 4 SM GPU based on the Orin architecture. There's a huge leap in performance and efficiency on the CPU front, and a big architectural jump on the GPU. Perhaps "only" ~2.5x as powerful in raw flops as TX1, but LPDDR5 and hopefully the much bigger caches inherited from Orin would make a big difference for any dev struggling with bandwidth, plus all the other features and improvements Ampere brings. Then, of course, the DLSS on top, and the potential for other uses of the tensor cores as ML becomes more and more common in games over the next few years.

ILikeFeet · Nov 11, 2021

it begs the question just how much acceleration 2 RT cores even provide. does it even speed up a work load by a usable amount?

going back to watch Digital Foundry's video on Neon Noir, Dictator talks about all the shortcuts Crytek made to allow the demo to run on low end hardware (voxel cone tracing and triangle intersection). given the limited triangle rt, 2 RT cores could speed that up some amount

Alovon11 · Nov 11, 2021

Thraktor said:
I think they're just basing that on the old (2019?) figure that Nvidia gave, and aren't aware of the updated 21 billion transistor number. Nvidia haven't talked transistor numbers at all during GTC, and the most recent official number of 21 billion lines up exactly with expectations based on the process and die size (17 billion would mean a lower density process than desktop Ampere, which would be very unusual). I also don't think there are two different chips here. If anything, Orin X is just a different binning of the same Orin die (possibly a higher clocked bin, given that they've reported 254 TOPS for automotive Orin vs 200 TOPS for Orin Jetson).

Thanks. Yeah, that does make sense, and I can see why limited RT would be useful for a few of Orin's applications.

To add to this, one thing worth keeping in mind is that the GPU is quite a bit more than the two SMs on TX1, and quite a few of those blocks around and between the SMs are also the GPU, perhaps as much as doubling the size of the GPU. This is pretty important for estimating the size of a small GPU like you'd get on TX1 (or Dane), in that it's not going to scale directly with the number of SMs. There's a certain amount of logic that's going to be per-GPU, a certain amount that's per-GPC, and then the part that's per-SM. For a GPU this small, there's only going to be one GPC, so the first two are effectively static, and it's important to distinguish between those parts and the SM parts. (We've also got per-TPC on the newer architectures, but I'll bundle them into the SMs, assuming the SMs are usually a multiple of two anyway).

On the positive side, this means that you could double the number of SMs and only increase the GPU die area by (let's say) 50%. On the other hand, though, it means you can't just say "If Dane GPU has X% as many SMs as Orin, it will take up X% as much space", as there's a bunch of logic there which doesn't scale.

That said, we could make some inferences from the relative size of the SMs on the different systems. The Orin die shot, as you say, is a PR shot, but probably reflects the die as it was at the time in terms of areas well enough. "As it was at the time" is pretty important, though, as at the same time they said the transistor count was 17 billion and it was capable of 200 TOPS, but later updated that to 21 billion and 254 TOPS. Given it's a different shape than the final die, it's likely that the chip layout has changed a bit since then. Assuming an identical transistor density between the two, that would mean the old version of the Orin die was about 377mm2, and therefore the per-TPC die area is 8.8mm2, or 4.4mm2 per SM.

So if you assume that, from one arch to the next, the die area (not transistor count) of the per-GPU and per-GPC logic is pretty consistent, then for the same die area you'd get a 3 SM Ampere GPU in about the same die area as TX1's 2SM Maxwell GPU. Which is... not fantastic.

If you switch that around and assume that Nvidia somehow managed to keep the transistor count of GPU-level and GPC-level logic the same all the way from Maxwell to Ampere, then that logic will shrink by a factor of about 3 due to the density improvements, and you'll have more space left for the SMs. If we take the GPU of TX1 above to be 12mm2 for the SMs plus another 12mm2 for everything else, the everything else becomes 4mm2, and you're left with 20mm2, enough for 4 Ampere SMs, or 5 at a push. Even that's wildly optimistic, though, as there's obviously some transistor growth from generation to generation, and the bigger L2 cache in Orin's GPU would add considerably to that if they carried it over to Dane.

The other possibility is that other logic on the SoC outside of the GPU shrinks due to density improvements, and perhaps there's something to be gained there, or from dropping functionality which Nintendo doesn't need, but if people are also expecting an 8 core A78 CPU and a 128-bit memory bus, then those are two extra things which are going to be competing for space, so something's gotta give between these.

I've been saying for quite a while now that the most likely config for a new Switch SoC is 4 big CPU cores (ie A78), possibly with a few A55 cores, and a 4 SM GPU. That still seems the most likely outcome, and if anything the Orin reveal has solidified that, as the transistor density is the same as desktop Ampere, which means we're not getting the higher-density mobile libraries or 8LPA process improvements which would have helped them squeeze more onto Dane.

As I see it, a GPU bigger than 4 SMs is only plausible on Dane if one of three things happens:

They use higher density mobile libraries or process improvements on 8nm. This seems very unlikely if Orin hasn't used them.

They use a more advanced manufacturing process (eg 7nm or smaller). This seems very unlikely if the much more expensive Orin is on 8nm, and we've got reliable leaks saying it's 8nm.

Dane is significantly bigger than the 121mm2 Tegra X1. This again seems very unlikely, and TX1 is relatively large for this type of SoC to begin with.

I don't want to rain on anyone's parade, but I see a lot of people looking at Jetson Orin NX as a basis for Dane, but if anything it's a guide for what Dane won't do (if Dane was coming along a few months later with 1024 CUDA cores, they wouldn't have bothered releasing a Jetson with a full Orin die binned to 1024 cores). We're not going to get half of Orin at a quarter of the die size.

That said, I'd still be very excited by a Dane with 4 A78s and a 4 SM GPU based on the Orin architecture. There's a huge leap in performance and efficiency on the CPU front, and a big architectural jump on the GPU. Perhaps "only" ~2.5x as powerful in raw flops as TX1, but LPDDR5 and hopefully the much bigger caches inherited from Orin would make a big difference for any dev struggling with bandwidth, plus all the other features and improvements Ampere brings. Then, of course, the DLSS on top, and the potential for other uses of the tensor cores as ML becomes more and more common in games over the next few years.

Orin seems to synchronize memory bandwidth to SM Count so with 4SMs they'd have to go to 4GBs at 51.2GB/s which is unrealistic for a system in 2022
- Big Orin has 16SMs, 2048 CUDA cores, it has 204.8GB/s of memory bandwidth
- Orin NX has 8SMs, 1024 CUDA cores, it has 102.4GB/s of memory bandwidth
  - And this is despite it being literally a binned down Orin, so something about cutting the GPU informed the speed they cut the Memory.
- Something about Orin seems to be tied to having 0.1GB/s (100MB/s) of bandwidth for each of the CUDA cores.
The A55s are wasted sand in a system like this as they are weaker than the A57s in the TX1, it would be better time/silicon spent on going for 6 A78Cs as the A78C's cache would help gaming a lot without much extra space taken.
- Also on the space argument, it seems you gloss over the point that the A78AEs are bigger than the A78s and the A78Cs due to the extra automotive/AI logic they have
  - As for how much bigger, it's TBD, but they are bigger.
4SMs puts DLSS into question, which puts the whole point of the system into question, as 4SMs may not be enough for DLSS Performance mode unless NVIDIA really did triple the speed of DL tasks on the Tensor Cores with Orin like Jensen implied.
- But even then it would likely only be able to hit 1440p which makes the marketability of the system harder and also would only really bring a 10% increase over the GT1030 due to the 30% lowball improvement Orin/Dane gets from the 50% more L1 and Double L2 cache increases over Ampere.
  - That is a very weak GPU and would almost kill Next-Gen support unless by some miracle they get DLSS 4k Ultra Performance working on that low Tensor Core count.
Cost
- It would be cheaper to keep Dane closer to Orin and Orin NX, and while yes, Orin NX is a binned Orin, that doesn't mean they can't just cut Orin down physically to the config of Orin NX for a dedicated chip like Dane, then throw out the A78AEs for A78Cs, and throw out the DLAs.
  - Also, the memory surplus we are in RN makes 12GB of LPDDR5 for that 102.4GB/s number easier to do, so more elements of Orin NX can be inherited by Dane, therefore costs drop further.
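The bandwidth-per-core pattern in the first point can be checked with a few lines of Python. This is only a sketch using the two Orin configurations listed above and assuming 128 CUDA cores per SM. Whether Dane would actually follow the same ratio is pure speculation.

```python
CORES_PER_SM = 128  # Ampere SM, consistent with 16 SMs = 2048 cores above

# (SM count, memory bandwidth in GB/s) as listed in this post
ORIN_CONFIGS = {
    "Orin": (16, 204.8),
    "Orin NX": (8, 102.4),
}

def bandwidth_per_core(sms: int, bandwidth_gbps: float) -> float:
    """Memory bandwidth in GB/s available per CUDA core."""
    return bandwidth_gbps / (sms * CORES_PER_SM)

def projected_bandwidth(sms: int, per_core_gbps: float) -> float:
    """Bandwidth a GPU with `sms` SMs would get if the same ratio held."""
    return sms * CORES_PER_SM * per_core_gbps

ratios = {name: bandwidth_per_core(*cfg) for name, cfg in ORIN_CONFIGS.items()}
four_sm_bandwidth = projected_bandwidth(4, ratios["Orin"])

for name, ratio in ratios.items():
    print(f"{name}: {ratio * 1000:.0f} MB/s per CUDA core")
print(f"4 SMs at the same ratio: {four_sm_bandwidth:.1f} GB/s")
```

Both configs work out to the same per-core figure, which is where the 51.2GB/s number for a hypothetical 4SM part comes from.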

Sorry, but it just seems the whole stance here is overly pessimistic and has flaws of its own that would make the SoC not make much sense when put into context next to the rest of the Orin Family.

NineTailSage · Nov 11, 2021

Thraktor said:
I think they're just basing that on the old (2019?) figure that Nvidia gave, and aren't aware of the updated 21 billion transistor number. Nvidia haven't talked transistor numbers at all during GTC, and the most recent official number of 21 billion lines up exactly with expectations based on the process and die size (17 billion would mean a lower density process than desktop Ampere, which would be very unusual). I also don't think there are two different chips here. If anything, Orin X is just a different binning of the same Orin die (possibly a higher clocked bin, given that they've reported 254 TOPS for automotive Orin vs 200 TOPS for Orin Jetson).

Thanks. Yeah, that does make sense, and I can see why limited RT would be useful for a few of Orin's applications.

To add to this, one thing worth keeping in mind is that the GPU is quite a bit more than the two SMs on TX1, and quite a few of those blocks around and between the SMs are also the GPU, perhaps as much as doubling the size of the GPU. This is pretty important for estimating the size of a small GPU like you'd get on TX1 (or Dane), in that it's not going to scale directly with the number of SMs. There's a certain amount of logic that's going to be per-GPU, a certain amount that's per-GPC, and then the part that's per-SM. For a GPU this small, there's only going to be one GPC, so the first two are effectively static, and it's important to distinguish between those parts and the SM parts. (We've also got per-TPC on the newer architectures, but I'll bundle them into the SMs, assuming the SMs are usually a multiple of two anyway).

On the positive side, this means that you could double the number of SMs and only increase the GPU die area by (let's say) 50%. On the other hand, though, it means you can't just say "If Dane GPU has X% as many SMs as Orin, it will take up X% as much space", as there's a bunch of logic there which doesn't scale.

That said, we could make some inferences from the relative size of the SMs on the different systems. The Orin die shot, as you say, is a PR shot, but probably reflects the die as it was at the time in terms of areas well enough. "As it was at the time" is pretty important, though, as at the same time they said the transistor count was 17 billion and it was capable of 200 TOPS, but later updated that to 21 billion and 254 TOPS. Given it's a different shape than the final die, it's likely that the chip layout has changed a bit since then. Assuming an identical transistor density between the two, that would mean the old version of the Orin die was about 377mm2, and therefore the per-TPC die area is 8.8mm2, or 4.4mm2 per SM.

So if you assume that, from one arch to the next, the die area (not transistor count) of the per-GPU and per-GPC logic is pretty consistent, then for the same die area you'd get a 3 SM Ampere GPU in about the same die area as TX1's 2SM Maxwell GPU. Which is... not fantastic.

If you switch that around and assume that Nvidia somehow managed to keep the transistor count of GPU-level and GPC-level logic the same all the way from Maxwell to Ampere, then that logic will shrink by a factor of about 3 due to the density improvements, and you'll have more space left for the SMs. If we take the GPU of TX1 above to be 12mm2 for the SMs plus another 12mm2 for everything else, the everything else becomes 4mm2, and you're left with 20mm2, enough for 4 Ampere SMs, or 5 at a push. Even that's wildly optimistic, though, as there's obviously some transistor growth from generation to generation, and the bigger L2 cache in Orin's GPU would add considerably to that if they carried it over to Dane.

The other possibility is that other logic on the SoC outside of the GPU shrinks due to density improvements, and perhaps there's something to be gained there, or from dropping functionality which Nintendo doesn't need, but if people are also expecting an 8 core A78 CPU and a 128-bit memory bus, then those are two extra things which are going to be competing for space, so something's gotta give between these.

I've been saying for quite a while now that the most likely config for a new Switch SoC is 4 big CPU cores (ie A78), possibly with a few A55 cores, and a 4 SM GPU. That still seems the most likely outcome, and if anything the Orin reveal has solidified that, as the transistor density is the same as desktop Ampere, which means we're not getting the higher-density mobile libraries or 8LPA process improvements which would have helped them squeeze more onto Dane.

As I see it, a GPU bigger than 4 SMs is only plausible on Dane if one of three things happens:

They use higher density mobile libraries or process improvements on 8nm. This seems very unlikely if Orin hasn't used them.

They use a more advanced manufacturing process (eg 7nm or smaller). This seems very unlikely if the much more expensive Orin is on 8nm, and we've got reliable leaks saying it's 8nm.

Dane is significantly bigger than the 121mm2 Tegra X1. This again seems very unlikely, and TX1 is relatively large for this type of SoC to begin with.

I don't want to rain on anyone's parade, but I see a lot of people looking at Jetson Orin NX as a basis for Dane, but if anything it's a guide for what Dane won't do (if Dane was coming along a few months later with 1024 CUDA cores, they wouldn't have bothered releasing a Jetson with a full Orin die binned to 1024 cores). We're not going to get half of Orin at a quarter of the die size.

That said, I'd still be very excited by a Dane with 4 A78s and a 4 SM GPU based on the Orin architecture. There's a huge leap in performance and efficiency on the CPU front, and a big architectural jump on the GPU. Perhaps "only" ~2.5x as powerful in raw flops as TX1, but LPDDR5 and hopefully the much bigger caches inherited from Orin would make a big difference for any dev struggling with bandwidth, plus all the other features and improvements Ampere brings. Then, of course, the DLSS on top, and the potential for other uses of the tensor cores as ML becomes more and more common in games over the next few years.

My only concern with treating AGX Orin's manufacturing process as a stand-in for whatever Nintendo and Nvidia might use for Dane is that Samsung is most likely using an automotive-grade 8nm process designed to work at extreme temperatures, which in and of itself might deliberately limit transistor density for the sake of reliability.

All of this gives us a great blueprint of possibilities, but the end products will serve different purposes: Nintendo needs a chip that is not only performant in handheld mode but can scale up enough while docked, and that is as efficient as possible.
A 4 SM part would need to be clocked at 1.2-1.4GHz to be in the Xbox One range of raw theoretical numbers, and this would be much higher than even where Nvidia has capped the clocks on the Orin SoC...

I also believe that Nvidia knows the Ampere architecture becomes very inefficient at higher clocks, which is why it's capped the way it is in Orin. They also address this in the mobile RTX 30 variants, which have lower base clocks to achieve a much better TDP range.
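
The 1.2-1.4GHz range can be sanity-checked with a quick sketch. It assumes 128 FP32 CUDA cores per Ampere SM (consistent with the 1024 cores quoted for an 8 SM Orin NX), 2 FLOPs per core per clock, and an Xbox One peak of roughly 1.31 TFLOPs; the function names are just for illustration.

```python
# Rough peak-FP32 arithmetic; all constants are assumptions, not confirmed Dane specs.
CORES_PER_SM = 128        # assumed FP32 CUDA cores per Ampere SM (1024 cores / 8 SMs)
FLOPS_PER_CORE_CLOCK = 2  # one fused multiply-add counts as 2 FLOPs
XBOX_ONE_TFLOPS = 1.31    # assumed commonly quoted Xbox One peak figure


def peak_tflops(sms, clock_ghz):
    """Theoretical peak FP32 throughput in TFLOPs for a given SM count and clock."""
    return sms * CORES_PER_SM * FLOPS_PER_CORE_CLOCK * clock_ghz / 1000.0


def clock_for_target(sms, target_tflops):
    """Clock in GHz that a given SM count needs to reach a target peak throughput."""
    return target_tflops * 1000.0 / (sms * CORES_PER_SM * FLOPS_PER_CORE_CLOCK)


if __name__ == "__main__":
    for sms in (4, 6, 8):
        print(f"{sms} SMs need {clock_for_target(sms, XBOX_ONE_TFLOPS):.2f} GHz")
```

Under those assumptions a 4 SM part lands at about 1.28GHz, inside the range above, while 6 or 8 SMs would need well under 1GHz for the same raw number.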

ILikeFeet · Nov 11, 2021

512 cores wouldn't be too bad since we'd be getting close to a 3x jump or so

Alovon11 · Nov 11, 2021

ILikeFeet said:
512 cores wouldn't be too bad since we'd be getting close to a 3x jump or so

Problem is what NineTailSage mentioned.

Clock speeds.
4SMs would need to be clocked way too high to get that 3X jump.

It's unrealistic vs 6 or 8SMs which can get that jump and better at clocks that make sense.

If Orin, even in a setting that can be cooled far better than Dane, is hitting <1.2GHz clocks, then as NineTailSage mentioned, there seems to be a uArch thing with Ampere on clock speeds, at least on 8nm.

So 4SMs at this point is just unrealistic as it would be a wasted investment power/thermal/design-wise vs 6 or 8SMs and designing a better cooler (which if you were to try to force 4SMs to the 1.2-1.4Ghz range to get that 3x boost, you'd need a new cooler anyway so why cut the GPU).

Z0m3le · Nov 11, 2021

Thraktor said:
I think they're just basing that on the old (2019?) figure that Nvidia gave, and aren't aware of the updated 21 billion transistor number. Nvidia haven't talked transistor numbers at all during GTC, and the most recent official number of 21 billion lines up exactly with expectations based on the process and die size (17 billion would mean a lower density process than desktop Ampere, which would be very unusual). I also don't think there are two different chips here. If anything, Orin X is just a different binning of the same Orin die (possibly a higher clocked bin, given that they've reported 254 TOPS for automotive Orin vs 200 TOPS for Orin Jetson).

Thanks. Yeah, that does make sense, and I can see why limited RT would be useful for a few of Orin's applications.

To add to this, one thing worth keeping in mind is that the GPU is quite a bit more than the two SMs on TX1, and quite a few of those blocks around and between the SMs are also the GPU, perhaps as much as doubling the size of the GPU. This is pretty important for estimating the size of a small GPU like you'd get on TX1 (or Dane), in that it's not going to scale directly with the number of SMs. There's a certain amount of logic that's going to be per-GPU, a certain amount that's per-GPC, and then the part that's per-SM. For a GPU this small, there's only going to be one GPC, so the first two are effectively static, and it's important to distinguish between those parts and the SM parts. (We've also got per-TPC on the newer architectures, but I'll bundle them into the SMs, assuming the SMs are usually a multiple of two anyway).

On the positive side, this means that you could double the number of SMs and only increase the GPU die area by (let's say) 50%. On the other hand, though, it means you can't just say "If Dane GPU has X% as many SMs as Orin, it will take up X% as much space", as there's a bunch of logic there which doesn't scale.

That said, we could make some inferences from the relative size of the SMs on the different systems. The Orin die shot, as you say, is a PR shot, but probably reflects the die as it was at the time in terms of areas well enough. As it was at the time is pretty important, though, as at the same time they said the transistor count was 17 billion and it was capable of 200 TOPS, but later updated that to 21 billion and 254 TOPS. Given it's a different shape than the final die, it's likely that the chip layout has changed a bit since then. Assuming an identical transistor density between the two, that would mean the old version of the Orin die was about 377mm2, and therefore the per-TPC die area is 8.8mm2, or 4.4mm2 per SM.

So if you assume that, from one arch to the next, the die area (not transistor count) of the per-GPU and per-GPC logic is pretty consistent, then for the same die area you'd get a 3 SM Ampere GPU in about the same die area as TX1's 2SM Maxwell GPU. Which is... not fantastic.

If you switch that around and assume that Nvidia somehow managed to keep the transistor count of GPU-level and GPC-level logic the same all the way from Maxwell to Ampere, then that logic will shrink by a factor of about 3 due to the density improvements, and you'll have more space left for the SMs. If we take the GPU of TX1 above to be 12mm2 for the SMs plus another 12mm2 for everything else, the everything else becomes 4mm2, and you're left with 20mm2, enough for 4 Ampere SMs, or 5 at a push. Even that's wildly optimistic, though, as there's obviously some transistor growth from generation to generation, and the bigger L2 cache in Orin's GPU would add considerably to that if they carried it over to Dane.
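
The two scenarios above reduce to simple area arithmetic. The sketch below only reuses the rough figures from this post (12mm2 of SMs plus 12mm2 of other GPU logic on TX1, 4.4mm2 per Ampere SM, a ~3x density gain); none of these are measured values, and the names are illustrative.

```python
# Area-budget sketch using the post's own rough estimates (not measurements).
SM_AREA_MM2 = 4.4           # estimated area of one Ampere SM
TX1_SM_AREA_MM2 = 12.0      # assumed area of TX1's two Maxwell SMs
TX1_OTHER_AREA_MM2 = 12.0   # assumed per-GPU and per-GPC logic on TX1
DENSITY_GAIN = 3.0          # rough density improvement from TX1's node to 8nm

GPU_BUDGET_MM2 = TX1_SM_AREA_MM2 + TX1_OTHER_AREA_MM2  # keep TX1's total GPU area


def sms_that_fit(fixed_logic_mm2):
    """Ampere SMs that fit once the non-SM logic is subtracted from the budget."""
    return (GPU_BUDGET_MM2 - fixed_logic_mm2) / SM_AREA_MM2


# Scenario 1: non-SM logic keeps the same die area from one architecture to the next.
same_area = sms_that_fit(TX1_OTHER_AREA_MM2)

# Scenario 2: non-SM logic keeps the same transistor count, so it shrinks with density.
same_transistors = sms_that_fit(TX1_OTHER_AREA_MM2 / DENSITY_GAIN)

if __name__ == "__main__":
    print(f"same area: {same_area:.1f} SMs, same transistors: {same_transistors:.1f} SMs")
```

That gives roughly 2.7 SMs in the first case and 4.5 in the second, which is where the "3 SM" and "4, or 5 at a push" figures come from.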

The other possibility is that other logic on the SoC outside of the GPU shrinks due to density improvements, and perhaps there's something to be gained there, or from dropping functionality which Nintendo doesn't need, but if people are also expecting an 8 core A78 CPU and a 128-bit memory bus, then those are two extra things which are going to be competing for space, so something's gotta give between these.

I've been saying for quite a while now that the most likely config for a new Switch SoC is 4 big CPU cores (ie A78), possibly with a few A55 cores, and a 4 SM GPU. That still seems the most likely outcome, and if anything the Orin reveal has solidified that, as the transistor density is the same as desktop Ampere, which means we're not getting the higher-density mobile libraries or 8LPA process improvements which would have helped them squeeze more onto Dane.

As I see it, a GPU bigger than 4 SMs is only plausible on Dane if one of three things happens:

They use higher density mobile libraries or process improvements on 8nm. This seems very unlikely if Orin hasn't used them.

They use a more advanced manufacturing process (eg 7nm or smaller). This seems very unlikely if the much more expensive Orin is on 8nm, and we've got reliable leaks saying it's 8nm.

Dane is significantly bigger than the 121mm2 Tegra X1. This again seems very unlikely, and TX1 is relatively large for this type of SoC to begin with.

I don't want to rain on anyone's parade, but I see a lot of people looking at Jetson Orin NX as a basis for Dane, but if anything it's a guide for what Dane won't do (if Dane was coming along a few months later with 1024 CUDA cores, they wouldn't have bothered releasing a Jetson with a full Orin die binned to 1024 cores). We're not going to get half of Orin at a quarter of the die size.

That said, I'd still be very excited by a Dane with 4 A78s and a 4 SM GPU based on the Orin architecture. There's a huge leap in performance and efficiency on the CPU front, and a big architectural jump on the GPU. Perhaps "only" ~2.5x as powerful in raw flops as TX1, but LPDDR5 and hopefully the much bigger caches inherited from Orin would make a big difference for any dev struggling with bandwidth, plus all the other features and improvements Ampere brings. Then, of course, the DLSS on top, and the potential for other uses of the tensor cores as ML becomes more and more common in games over the next few years.

Why are you using unknown numbers for TX1, which has 2 unique SMs and a 7-year-old design, instead of Xavier, an 8 SM Volta GPU for which we have exact numbers? The Volta GPU, which includes 64 tensor cores, is 89mm² on 12nm. That is the official size of the GPU, and even if you assume Ampere has larger GPU logic, you are looking at a process node twice as dense...

Orin NX's architecture for 8SM is a single GPC too, so you are required to include some logic from 1SM all the way to 8SM, while a whole new GPC logic would be needed to increase the size further. Also just looking at official thermal numbers and your own expectations for the CPU clock, Orin NX's configuration without the extra logic makes a lot of sense.

We have seen numbers for Volta, I doubt pretty heavily that we are looking at a GPU in Dane with this configuration, anywhere near as large as Orin NX.

Dekuman · Nov 11, 2021

Alovon11 said:
Problem is what NineTailSage mentioned.

Clock speeds.
4SMs would need to be clocked way too high to get that 3X jump.

It's unrealistic vs 6 or 8SMs which can get that jump and better at clocks that make sense.

If Orin, even in a setting that can be cooled far better than Dane, is hitting <1.2GHz clocks, then as NineTailSage mentioned, there seems to be a uArch thing with Ampere on clock speeds, at least on 8nm.

So 4SMs at this point is just unrealistic as it would be a wasted investment power/thermal/design-wise vs 6 or 8SMs and designing a better cooler (which if you were to try to force 4SMs to the 1.2-1.4Ghz range to get that 3x boost, you'd need a new cooler anyway so why cut the GPU).

Worst case what are we looking at with a 4SM Switch 2?

Alovon11 · Nov 11, 2021

Dekuman said:
Worst case what are we looking at with a 4SM Switch 2?

4SMs at this point is outright unrealistic to me due to how many hoops would need to be jumped through to make it work (clock speeds, cooler, DLSS wouldn't run well).

Worst realistic case is 6SMs now.

although 8SMs is the most likely config RN TBH

Sol · Nov 11, 2021

There's no way it would have just four CPU cores, right? A78 cores would certainly be a massive step up, but only four cores for a system that's going to last well beyond 2025 just doesn't seem like a lot.

Alovon11 · Nov 11, 2021

Sol said:
There's no way it would have just four CPU cores, right? A78 cores would certainly be a massive step up, but only four cores for a system that's going to last well beyond 2025 just doesn't seem like a lot.

also funnily enough it would cost more to do a 4+4 config versus 6 or 8 A78Cs.

BIG.LITTLE is a completely different CPU Configuration than the Cache-Paired A78C and A78AEs

Also, the A55s would have to sit on their own cluster rather than be part of the same cluster unlike unified A78Cs.

So it would actually cost Nintendo more to use A78+A55 vs A78Cs.

So 6 or 8 Cores is the actual likelihood still.

ILikeFeet · Nov 11, 2021

Alovon11 said:
also funnily enough it would cost more to do a 4+4 config versus 6 or 8 A78Cs.

BIG.LITTLE is a completely different CPU Configuration than the Cache-Paired A78C and A78AEs

Also, the A55s would have to sit on their own cluster rather than be part of the same cluster unlike unified A78Cs.

So it would actually cost Nintendo more to use A78+A55 vs A78Cs.

So 6 or 8 Cores is the actual likelihood still.

where are you getting A55s from? they're not in Orin, so they definitely wouldn't be here

Alovon11 · Nov 11, 2021

ILikeFeet said:
where are you getting A55s from? they're not in Orin, so they definitely wouldn't be here

Thraktor

Thraktor said:
I think they're just basing that on the old (2019?) figure that Nvidia gave, and aren't aware of the updated 21 billion transistor number. Nvidia haven't talked transistor numbers at all during GTC, and the most recent official number of 21 billion lines up exactly with expectations based on the process and die size (17 billion would mean a lower density process than desktop Ampere, which would be very unusual). I also don't think there are two different chips here. If anything, Orin X is just a different binning of the same Orin die (possibly a higher clocked bin, given that they've reported 254 TOPS for automotive Orin vs 200 TOPS for Orin Jetson).

Thanks. Yeah, that does make sense, and I can see why limited RT would be useful for a few of Orin's applications.

To add to this, one thing worth keeping in mind is that the GPU is quite a bit more than the two SMs on TX1, and quite a few of those blocks around and between the SMs are also the GPU, perhaps as much as doubling the size of the GPU. This is pretty important for estimating the size of a small GPU like you'd get on TX1 (or Dane), in that it's not going to scale directly with the number of SMs. There's a certain amount of logic that's going to be per-GPU, a certain amount that's per-GPC, and then the part that's per-SM. For a GPU this small, there's only going to be one GPC, so the first two are effectively static, and it's important to distinguish between those parts and the SM parts. (We've also got per-TPC on the newer architectures, but I'll bundle them into the SMs, assuming the SMs are usually a multiple of two anyway).

On the positive side, this means that you could double the number of SMs and only increase the GPU die area by (let's say) 50%. On the other hand, though, it means you can't just say "If Dane GPU has X% as many SMs as Orin, it will take up X% as much space", as there's a bunch of logic there which doesn't scale.

That said, we could make some inferences from the relative size of the SMs on the different systems. The Orin die shot, as you say, is a PR shot, but probably reflects the die as it was at the time in terms of areas well enough. As it was at the time is pretty important, though, as at the same time they said the transistor count was 17 billion and it was capable of 200 TOPS, but later updated that to 21 billion and 254 TOPS. Given it's a different shape than the final die, it's likely that the chip layout has changed a bit since then. Assuming an identical transistor density between the two, that would mean the old version of the Orin die was about 377mm2, and therefore the per-TPC die area is 8.8mm2, or 4.4mm2 per SM.

So if you assume that, from one arch to the next, the die area (not transistor count) of the per-GPU and per-GPC logic is pretty consistent, then for the same die area you'd get a 3 SM Ampere GPU in about the same die area as TX1's 2SM Maxwell GPU. Which is... not fantastic.

If you switch that around and assume that Nvidia somehow managed to keep the transistor count of GPU-level and GPC-level logic the same all the way from Maxwell to Ampere, then that logic will shrink by a factor of about 3 due to the density improvements, and you'll have more space left for the SMs. If we take the GPU of TX1 above to be 12mm2 for the SMs plus another 12mm2 for everything else, the everything else becomes 4mm2, and you're left with 20mm2, enough for 4 Ampere SMs, or 5 at a push. Even that's wildly optimistic, though, as there's obviously some transistor growth from generation to generation, and the bigger L2 cache in Orin's GPU would add considerably to that if they carried it over to Dane.

The other possibility is that other logic on the SoC outside of the GPU shrinks due to density improvements, and perhaps there's something to be gained there, or from dropping functionality which Nintendo doesn't need, but if people are also expecting an 8 core A78 CPU and a 128-bit memory bus, then those are two extra things which are going to be competing for space, so something's gotta give between these.

I've been saying for quite a while now that the most likely config for a new Switch SoC is 4 big CPU cores (ie A78), possibly with a few A55 cores, and a 4 SM GPU. That still seems the most likely outcome, and if anything the Orin reveal has solidified that, as the transistor density is the same as desktop Ampere, which means we're not getting the higher-density mobile libraries or 8LPA process improvements which would have helped them squeeze more onto Dane.

As I see it, a GPU bigger than 4 SMs is only plausible on Dane if one of three things happens:

They use higher density mobile libraries or process improvements on 8nm. This seems very unlikely if Orin hasn't used them.

They use a more advanced manufacturing process (eg 7nm or smaller). This seems very unlikely if the much more expensive Orin is on 8nm, and we've got reliable leaks saying it's 8nm.

Dane is significantly bigger than the 121mm2 Tegra X1. This again seems very unlikely, and TX1 is relatively large for this type of SoC to begin with.

I don't want to rain on anyone's parade, but I see a lot of people looking at Jetson Orin NX as a basis for Dane, but if anything it's a guide for what Dane won't do (if Dane was coming along a few months later with 1024 CUDA cores, they wouldn't have bothered releasing a Jetson with a full Orin die binned to 1024 cores). We're not going to get half of Orin at a quarter of the die size.

That said, I'd still be very excited by a Dane with 4 A78s and a 4 SM GPU based on the Orin architecture. There's a huge leap in performance and efficiency on the CPU front, and a big architectural jump on the GPU. Perhaps "only" ~2.5x as powerful in raw flops as TX1, but LPDDR5 and hopefully the much bigger caches inherited from Orin would make a big difference for any dev struggling with bandwidth, plus all the other features and improvements Ampere brings. Then, of course, the DLSS on top, and the potential for other uses of the tensor cores as ML becomes more and more common in games over the next few years.

His main argument is that anything bigger than 4 A78 + 4 A55s with 4SMs would be "Very Unlikely" due to size restrictions.

(Although, as Z0m3le pointed out, the numbers he is running on the size argument are likely far off, because he is comparing the TX1 to Orin when he should be comparing Xavier to Orin: we have a proper GPU measurement for Xavier, Xavier is 8 SMs too, and Orin is on a node twice as dense.)

TBH, I feel Thraktor is sort of ignoring the reality of the situation here and that that sort of config he is suggesting could cost as much, or more than the 8 A78C + 8SM config, but perform far worse.

It would cost that much because the R&D costs pile up from diverging so much from Orin
(A55s, non-cache-based A78s, a redesigned cooler and power delivery to push 4SMs to clocks that make everything work, upfront and accelerated development costs for a DLSS replacement for launch versus being able to use DLSS at launch and develop the in-house version for longer, etc.)

Like, TBH, 6 A78Cs with 6SMs is the bottom-line now realistically, and even then that would only cost marginally cheaper than 8 A78Cs and 8SMs due to the die size GPU-wise not going down too much due to it needing that full GPC.

Z0m3le · Nov 11, 2021

My point in using 4xA78+4xA55 is that it would require new R&D costs for Nintendo that A78C cores wouldn't, and that we know a 4xA55 cluster is about the same size as an A78 core, and the performance per clock would also be the same... So 6xA78C cores make more sense than 4xA78 plus 4xA55 cores, as you get an extra A78 core and 1 A78 core can make up for the 4 A55 cores. DynamIQ also allows that single core to be clocked separately from the gaming cores if needed.

The transistor cost for 6xA78C, not including the extra cache, would only be ~20% larger than Thraktor's CPU configuration, with no separate R&D cost and over 30% more performance. Depending on how much the cache helps, it could be 50% more performance.
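
The ~20% figure can be reproduced with a one-line ratio. This is only a sketch in "A78-core equivalents", taking the claim above at face value that a 4xA55 cluster occupies about the area of one A78 core and ignoring cache; the helper name is made up for illustration.

```python
# CPU area in "A78-core equivalents" (cache excluded); the sizes are the post's rough claim.
A78_CORE_AREA = 1.0       # one A78 core defines the unit
A55_CLUSTER_AREA = 1.0    # assumption from the post: a 4xA55 cluster ~ one A78 core


def cpu_area(a78_cores, a55_clusters=0):
    """Total CPU area in A78-core equivalents for a given core mix."""
    return a78_cores * A78_CORE_AREA + a55_clusters * A55_CLUSTER_AREA


four_plus_four = cpu_area(4, a55_clusters=1)   # 4xA78 + 4xA55
six_a78c = cpu_area(6)                         # 6xA78C

area_ratio = six_a78c / four_plus_four

if __name__ == "__main__":
    print(f"6xA78C is {100 * (area_ratio - 1):.0f}% larger than 4xA78+4xA55")
```

The performance figures (30-50%) depend on clocks and cache behaviour and are not modelled here.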

ReddDreadtheLead · Nov 11, 2021

I think there is a disconnect here, if we are to assume that they will use the A78C, then we can also assume that they can add A55 cores for this product.

Despite it using the A78AE. It’s not off the table per se

and it isn’t like they don’t care to use the A53s in the switch, the cores serve no purpose in how they functioned at the time.

With Dane it is unknown if they would use the A55s or not, but it isn’t like they have zero use for the device as a whole.

ShadowFox08 · Nov 11, 2021

Perhaps Nano Next is Dane/Switch 2, a smaller customized version of Orin NX without the unnecessary A.I. stuff?

Alright, I'm gonna do some crazy brainstorming and try to get a gauge of what Dane could be, based on comparing Tegra Xavier at 30 watts and Xavier NX in its 15/10 watt modes with Orin NX. Basically, what Dane's performance could hypothetically be at 15 watts.

Tegra - Wikipedia

en.m.wikipedia.org

If we compare the 30 watt T194 Xavier to Orin at 25 watts, they both have 8 core CPUs. The 15 watt Xavier NX, however, has up to 6 CPU cores and 75% of the GPU cores (384 vs 512). The bus width also drops from 256-bit on Xavier to 128-bit on Xavier NX.

In this hypothetical scenario I'm going to assume that Dane has the same number of GPU cores as Orin NX (1024 CUDA cores), but at 75% of the max clock speed. That is essentially the same as, or close to, Switch's docked clock (768 MHz), which would give us around 1.5 TFLOPs.

And then lower the CPU count from 8 A78AE cores to a 6 core A78C CPU. A78C also happens to be smaller than the A78AE. Also, I'm gonna say the 128-bit bus stays.

So somehow I can see Dane/Switch 2 being a 1.5-1.6 TFLOPs GPU with 6 A78C CPU cores (1.5 GHz max?) and 8-12 GB of LPDDR5 RAM on a 128-bit bus to achieve 102 GB/s of bandwidth, at 15 watts on a 6-7 inch 720p screen.
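
The 102 GB/s figure follows from the bus width and the memory data rate. A minimal sketch, assuming LPDDR5 running at 6400 MT/s (the rate is an assumption here, as the post doesn't state one):

```python
# Peak memory bandwidth = bytes per transfer * transfers per second.
LPDDR5_MTPS = 6400  # assumed LPDDR5 data rate in mega-transfers per second


def peak_bandwidth_gb_s(bus_width_bits, data_rate_mtps):
    """Theoretical peak bandwidth in GB/s for a given bus width and data rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mtps / 1000.0


if __name__ == "__main__":
    for width in (64, 128):
        print(f"{width}-bit: {peak_bandwidth_gb_s(width, LPDDR5_MTPS):.1f} GB/s")
```

A 64-bit bus at the same rate would give half of that, which is why the bus width matters so much for this estimate.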

Maybe this is possible at 15 watts? 25% fewer CPU cores clocked 25% lower, along with 25% lower GPU clocks, and take out the camera and other unnecessary stuff from Orin NX. And so somehow all of this is on a chip similar in size to TX1. I haven't even accounted for DLSS or the larger memory cache. Also, no comment on RT cores.

When you think about it, it's really impressive that we are capable of matching GCN PS4 GPU performance on an 8nm node. And it's pretty cool to see that Ampere at 8nm is very comparable to 7nm RDNA2 in performance. And guess what.. Steam Deck runs up to a 1.6 TFLOPs GPU and 128-bit LPDDR5, all at 15 watts like the OG Switch.

And man, Switch 2 with DLSS and the extra memory cache would be insane if they could fit it all in. The only thing is that if they use A78C, they might not necessarily get the 3MB L2 + 6MB L3 cache upgrade. Then again, A78C gives up to 8MB of L3 cache.

Cortex-A78C

Providing market-specific solutions with advanced security features and large big-core configurations

www.arm.com

Z0m3le · Nov 11, 2021

ReddDreadtheLead said:
I think there is a disconnect here, if we are to assume that they will use the A78C, then we can also assume that they can add A55 cores for this product.

Despite it using the A78AE. It’s not off the table per se

and it isn’t like they don’t care to use the A53s in the switch, the cores serve no purpose in how they functioned at the time.

With Dane it is unknown if they would use the A55s or not, but it isn’t like they have zero use for the device as a whole.

First, it's clear we are discussing Orin NX, previously discussed as Orin S, and its relationship to Dane, which uses the same architecture and was developed alongside it. We know that Nintendo and Nvidia spent R&D on Dane/Orin.

What is the point of A55 cores? To handle background tasks. A single A78 core performs the same at the same clock, so is it useful to have 4 A55 cores vs 1 A78C core? Is it worth the R&D budget to add them over just limiting the CPU to 6 cores and having a 20% higher CPU budget?

We know the configuration for Orin NX can hit the power consumption needed for Dane. The Orin NX devkit system consumes 15 watts at some unknown clocks, but it also has a ton of extra components that won't be in Dane, I think we could see 1.6GHz 8 core A78C, with possibly the OS core at 1GHz. GPU wise, it could use the same clock as Switch at 768MHz, that would give it 1.6TFLOPs of Ampere+, which again puts it at PS4 with DLSS on top.

If we want to say that Dane is nothing like Orin and is a much lower end device with a completely different R&D budget, that discussion can be made, but the logic for them not configuring Dane after Orin NX doesn't make sense to me given that we know they use the same architecture and that Orin NX fits the right power consumption for the device too.

Deleted member 2 · Nov 11, 2021

Branduil said:
If "S" stands for Super, then why not just call it Super? I see no value in the S by itself, and it has the potential to introduce confusion, especially with multiple other Switches with appellations like Lite and OLED. Nintendo doesn't benefit from ambiguous console names. If you name it "Super Nintendo Switch" it's clear this is a step up from previous Switch systems, "Switch S" requires explanation and has an unclear relationship with other Switch systems.

If you like S better aesthetically that's whatever but I don't see any actual benefit to the name as a selling point.

Game Boy Advance SP
Nintendo DS
Nintendo DSi
Nintendo Switch S

Z0m3le · Nov 12, 2021

ShadowFox08 said:
Perhaps Nano Next is Dane/Switch 2, a smaller customized version of Orin NX without the unnecessary A.I. stuff?

Alright, I'm gonna do some crazy brainstorming and try to get a gauge of what Dane could be, based on comparing Tegra Xavier at 30 watts and Xavier NX in its 15/10 watt modes with Orin NX. Basically, what Dane's performance could hypothetically be at 15 watts.

Tegra - Wikipedia

en.m.wikipedia.org

If we compare the 30 watt T194 Xavier to Orin at 25 watts, they both have 8 core CPUs. The 15 watt Xavier NX, however, has up to 6 CPU cores and 75% of the GPU cores (384 vs 512). The bus width also drops from 256-bit on Xavier to 128-bit on Xavier NX.

In this hypothetical scenario I'm going to assume that Dane has the same number of GPU cores as Orin NX, but at 75% of the max clock speed. That is essentially the same as, or close to, Switch's docked clock (768 MHz), which would give us around 1.5 TFLOPs.

And then lower the CPU count from 8 A78AE cores to a 6 core A78C CPU. A78C also happens to be smaller than the A78AE. Also, I'm gonna say the 128-bit bus stays.

So somehow I can see Dane/Switch 2 being a 1.5-1.6 TFLOPs GPU with 6 A78C CPU cores (1.5 GHz max?) and 8-12 GB of LPDDR5 RAM on a 128-bit bus to achieve 102 GB/s of bandwidth, at 15 watts on a 6-7 inch 720p screen.

And so somehow all of this is on a chip similar in size to TX1. I haven't even accounted for DLSS or the larger memory cache. Also, no comment on RT cores.

When you think about it, it's really impressive that we are capable of matching GCN PS4 GPU performance on an 8nm node. And it's pretty cool to see that Ampere at 8nm is very comparable to 7nm RDNA2 in performance. And guess what.. Steam Deck runs up to a 1.6 TFLOPs GPU and 128-bit LPDDR5, all at 15 watts like the OG Switch.

And man, Switch 2 with DLSS and the extra memory cache would be insane if they could fit it all in.

Nano Next was supposed to predate Orin and be on the market right now. The only problem is that it's a 5 watt design and isn't the same architecture as Orin, which we know Dane to be.

ShadowFox08 · Nov 12, 2021

Z0m3le said:
Nano Next was supposed to predate Orin and be on the market right now. The only problem is that it's a 5 watt design and isn't the same architecture as Orin, which we know Dane to be.

Fair enough. Since we are ruling out Dane being based on the Orin NX because it's too big (like Xavier NX), I think the Orin S can still exist, despite it not showing up in this week's presentation. We got thrown off thinking the Orin S had just been consolidated and merged into the NX.. But maybe Nvidia really left it out on purpose because they didn't want us putting 2+2 together until the time was right.. They still have more to reveal by the end of the year or Q1 2022, don't they? Nintendo would be the ones officially announcing the Switch 2, of course.

I think we are all in the ballpark on what Dane's power could be. We've discussed this before.. 1.6 TFLOPs and a 6-8 core A78 CPU with 128-bit LPDDR5 doesn't seem impossible. AMD and the Steam Deck are already matching this on 7nm in a form factor slightly bigger than Switch, and with the same 15 watt usage.

Z0m3le · Nov 12, 2021

ShadowFox08 said:
Fair enough. Since we are ruling out Dane being based on the NX because it's too big (like Xavier NX), I think the Orin S can still exist, despite it not showing up in this week's presentation. We got thrown off thinking the Orin S had just been consolidated and merged into the NX.. But maybe Nvidia really left it out on purpose because they didn't want us putting 2+2 together until the time was right.. They still have more to reveal by the end of the year or Q1 2022, don't they? Nintendo would be the ones officially announcing the Switch 2, of course.

I think we are all in the ballpark on what Dane's power could be. We've discussed this before.. 1.6 TFLOPs and a 6-8 core A78 CPU with 128-bit LPDDR5 doesn't seem impossible. AMD and the Steam Deck are already matching this on 7nm in a form factor slightly bigger than Switch, and with the same 15 watt usage.

Orin NX chip size is unknown, so don't rule it out. Nvidia isn't making separate Orin NX dies; they are making Orin chips around the same size as Xavier and binning some as Orin NX. A dedicated Orin NX die would be at best 60% the size, and Dane would lose another ~20% by removing all the AI accelerator engines.

AMD is doing it with a much less efficient x86 chip at higher clocks too, right? There is nothing strange here, especially because Ampere has far more flops per watt than RDNA2 as well, and Nintendo likely isn't using NVMe either.

As we can see here, at the end of 2019, Orin S was a 15W solution and Orin a 40W solution. What Nvidia announced this week is Orin NX, a 10-15-25W solution, and Orin, a 15-45W solution. There is no doubt that Orin NX is Orin S, and there is no doubt that Orin NX's configuration can fit into Switch's power requirements just fine, as these are total system power draws, and a 15-20W Tegra X1 was what was used in Switch. Orin NX's die size, in terms of active configured transistors, is unknown; they are binning Orin chips and sending bad chips down the performance tier as Orin NX.

Dane is based on Orin, there are two separate configurations of Orin, and we know that Dane will be targeting the same TDPs as Orin NX, so am I missing something here? Why does another configuration need to exist? Yes, Dane could absolutely be 6xA78C cores with 6 SMs, but it would just be clocked higher, because with these architectures the wattage is all that matters, not how many x, y, z components you have, at least not once things are narrowed down this far. It is far more likely that they underclock an 8 core CPU and 8 SMs than increase the clocks on a lower configuration, simply because chip size won't change much. Again, Xavier has 8 SMs and that GPU is 89mm^2. Dane's density is double Xavier's, meaning that even if the GPU is 50% bigger, the GPU that Orin NX has would still take ~65mm^2. Shrinking it to 6 SMs would at best bring it down to ~58mm^2, and would require a 30% higher clock to reach the same performance.
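
Those GPU area numbers can be sketched as follows. The 89mm^2, 2x density and 50% growth inputs come from this post; the 4.4mm^2 per removed SM is borrowed from Thraktor's earlier estimate and is an assumption about how the ~58mm^2 figure might be reached, not something stated here.

```python
# Scaling Xavier's GPU area to an Orin NX-class GPU; every input is a rough estimate.
XAVIER_GPU_MM2 = 89.0   # 8 SM Volta GPU on 12nm, as quoted in the post
DENSITY_RATIO = 2.0     # assumed: the newer node is twice as dense
ARCH_GROWTH = 1.5       # the post's "even if the GPU is 50% bigger"
SM_AREA_MM2 = 4.4       # assumed per-SM area, taken from the earlier estimate in the thread

eight_sm_gpu_mm2 = XAVIER_GPU_MM2 / DENSITY_RATIO * ARCH_GROWTH
six_sm_gpu_mm2 = eight_sm_gpu_mm2 - 2 * SM_AREA_MM2

# Clock increase a 6 SM GPU needs to match an 8 SM GPU's raw throughput.
clock_scale_6sm = 8 / 6

if __name__ == "__main__":
    print(f"8 SM GPU: {eight_sm_gpu_mm2:.1f} mm^2, 6 SM GPU: {six_sm_gpu_mm2:.1f} mm^2")
    print(f"6 SMs need {100 * (clock_scale_6sm - 1):.0f}% higher clocks")
```

With these inputs the 8 SM GPU comes out near 67mm^2 and the 6 SM version near 58mm^2, close to the rounded figures above, and the clock penalty for dropping to 6 SMs is about a third.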

You are right, we have it nailed down, and the specifics really don't matter: the GPU is a PS4+ with DLSS on top, just like Wii U is a 360+ with no DLSS. There is nothing too surprising here. In fact, I've been sticking around these numbers for about a year, with these same configurations we see in Orin NX, and even pushing similar CPU clocks to the ones we are now suggesting... Orin NX is confirming that we had it right, and given that they are moving from a bad, flat, old process node from 7 years ago to a much more modern node and modern architectures... this performance bump of 3-5x the Switch, before DLSS is taken into account, is expected.

ReddDreadtheLead · Nov 12, 2021

Z0m3le said:
First, it's clear we are discussing Orin NX, previously discussed as Orin S, and its relationship to Dane, which uses the same architecture and was developed alongside it. We know that Nintendo and Nvidia put R&D into Dane/Orin.

What is the point of A55 cores? To handle background tasks. A single A78 core performs the same at the same clock, so is it useful to have 4 A55 cores vs 1 A78C core? Is it worth the R&D budget to add them over just limiting the CPU to 6 cores and having a 20% higher CPU budget?

We know the configuration for Orin NX can hit the power consumption needed for Dane. The Orin NX devkit system consumes 15 watts at some unknown clocks, but it also has a ton of extra components that won't be in Dane, I think we could see 1.6GHz 8 core A78C, with possibly the OS core at 1GHz. GPU wise, it could use the same clock as Switch at 768MHz, that would give it 1.6TFLOPs of Ampere+, which again puts it at PS4 with DLSS on top.

If we want to say that Dane is nothing like Orin and is a much lower end device with a completely different R&D budget, that discussion can be made, but the logic for them not configuring Dane after Orin NX doesn't make sense to me given that we know they use the same architecture and that Orin NX fits the right power consumption for the device too.

You don’t need 4 to do that task. It is supposed to be using the same OS as the Switch, correct? Which is Horizon OS. The clocks for a single A55 core, which is significantly smaller than an A78 might I add, would only need to match what the OG Switch needed in terms of performance. 1.3GHz is enough to do the same task as one A57 at 1GHz while consuming a fraction of the power and taking up a fraction of the space.

This, of course, is a completely custom setup and won’t happen really, having a single A55 core. 2 is possible imo if they want a set of cores dedicated just to OS stuff. If there are 2 A55s they can be clocked lower and it would consume even less power.

This in turn prevents the issue of say 6A78 cores where 1 is used for the OS and only 5 are available to games. The set up with little + big is to free up as much of the A78 to be clocked a lot higher and leave the OS to the little cores and taking up even less space that can be allocated to other things like more cache perhaps. While you can use an A78 in a 6 core setup, you leave a noticeable amount of performance that can be allocated to a game directly.

In theory, you can get much better battery life and an even more performant device in its lowest-case and its highest-case scenarios, while not consuming more die space, fitting within a smaller budget, and still keeping 6-8SMs intact.

On another note, since Dane will be based off an architecture that has RT cores, those RT cores can be used for Audio in theory if they aren’t using it for docked or portable mode.

ShadowFox08 said:
Perhaps Nano Next is Dane/Switch 2, which is a smaller customized version of Orin NX without the unnecessary A.I. stuff?

Alright, I'm gonna do some crazy brainstorm b.s. thinking and try to get a gauge of what Dane could be, based on comparing Tegra Xavier at 30 watts vs Xavier NX in 15/10 watt modes to Orin NX. Basically, what Dane's performance hypothetically could be at 15 watts.

Tegra - Wikipedia

en.m.wikipedia.org

If we compare the 30 watt T194 Xavier to Orin at 25 watts, they both have 8 core CPUs. The 15 watt Xavier NX, however, has up to 6 CPU cores and 75% of the GPU cores (384 vs 512). From Xavier to Xavier NX, the bus width also drops from 256 bit to 128 bit.

In this hypothetical scenario I'm gonna assume that Dane has the same # of GPU cores as Orin NX (1024 CUDA cores), but at 75% of the max clock speed. That is essentially going to be the same clock speed as, or close to, Switch's docked mode (768 MHz), which would give us around 1.5 TFLOPs.

And then lower the CPU count from 8 A78AE cores to a 6 core A78C CPU. A78C also happens to be smaller than the A78AE. Also, I'm gonna say the 128 bit bus stays.

So somehow I can see Dane/Switch 2 being a 1.5-1.6 TFLOPs GPU, 6 A78C CPU cores (1.5 GHz max?), and 8-12 GB of LPDDR5 RAM on a 128 bit bus to achieve 102 GB/s bandwidth, at 15 watts on a 6-7 inch 720p screen.

Maybe this is possible at 15 watts? 25% fewer CPU cores, clocked 25% lower, along with 25% lower GPU speeds, and take out the camera and other unnecessary stuff from Orin NX. And so somehow all of this is on a chip similar in size to TX1. I haven't even accounted for DLSS or the larger memory cache. Also, no comment on RT cores.

When you think about it, it's really impressive that we are capable of matching GCN PS4 GPU performance on an 8nm node. And it's pretty cool to see that Ampere at 8nm is very comparable to 7nm RDNA2 in performance. And guess what... Steam Deck runs up to a 1.6 TFLOPs GPU and 128 bit LPDDR5, all at 15 watts like OG Switch.

And man, Switch 2 with DLSS and the extra memory cache would be insane if they could fit it all in. Only thing is, if they use A78C, they might not necessarily get the 3MB L2 + 6MB L3 cache upgrade. Then again, A78C gives up to 8MB of L3 cache.

Cortex-A78C

Providing market-specific solutions with advanced security features and large big-core configurations

www.arm.com

Why did you remove 2 CPU cores

ShadowFox08 · Nov 12, 2021

@Z0m3le I don't know. I'm not exactly an SME on this, but I was thinking about what Thraktor said about possibly not having enough space. Did we not conclude that the Orin NX board is too big to fit in a form factor that matches Switch?

And yeah, I do think it will cost more money to make a modified version of the NX with smaller A78s, even though those are meant for handheld gaming and will save space. So maybe it doesn't make sense to have the A78Cs, but can they really fit 8 CPU cores, 8 GPU SMs, DLSS cores, RAM and potentially 2 RT cores?

Either way, there is no way the Switch 2 will not be a modified version of NX. A lot of things aren't needed: the camera and A.I. stuff is irrelevant for gaming. Could they make the board smaller?

ReddDreadtheLead said:
You don’t need 4 to do that task. It is supposed to be using the same OS as the Switch, correct? Which is Horizon OS. The clocks for a single A55 core, which is significantly smaller than an A78 might I add, would only need to match what the OG Switch needed in terms of performance. 1.3GHz is enough to do the same task as one A57 at 1GHz while consuming a fraction of the power and taking up a fraction of the space.

This, of course, is a completely custom setup and won’t happen really, having a single A55 core. 2 is possible imo if they want a set of cores dedicated just to OS stuff. If there are 2 A55s they can be clocked lower and it would consume even less power.

This in turn prevents the issue of say 6A78 cores where 1 is used for the OS and only 5 are available to games. The set up with little + big is to free up as much of the A78 to be clocked a lot higher and leave the OS to the little cores and taking up even less space that can be allocated to other things like more cache perhaps. While you can use an A78 in a 6 core setup, you leave a noticeable amount of performance that can be allocated to a game directly.

In theory, you can get much better battery life and an even more performant device in its lowest-case and its highest-case scenarios, while not consuming more die space, fitting within a smaller budget, and still keeping 6-8SMs intact.

On another note, since Dane will be based off an architecture that has RT cores, those RT cores can be used for Audio in theory if they aren’t using it for docked or portable mode.

Why did you remove 2 CPU cores

I was basically making some sort of educated guess at Dane's specs (10-15 watts) by comparing it to the 30 watt Volta, and noticed the 10-15 watt Volta versions had 6 CPU cores instead of 8, not to mention slower clock speeds as well. They also retained 75% of the CUDA cores, at slower clock speeds, with 8 GB of RAM on a 128 bit bus (instead of 256 bit and more RAM), in an effort to meet the power draw requirements of 10-15 watts.
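For the bus width side of that guess, the 102 GB/s figure quoted earlier in the thread can be reproduced with a quick sketch. It assumes LPDDR5 running at its 6400 MT/s top data rate (an assumption, the actual memory speed of Dane is not known), and `peak_bandwidth_gbs` is a hypothetical helper name.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mtps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bytes per transfer = bus width / 8; multiply by mega-transfers per second,
    then convert MB/s to GB/s.
    """
    return bus_width_bits / 8 * transfer_rate_mtps / 1000


LPDDR5_MTPS = 6400.0  # assumed data rate (MT/s)

bw_128 = peak_bandwidth_gbs(128, LPDDR5_MTPS)
bw_256 = peak_bandwidth_gbs(256, LPDDR5_MTPS)

print(f"128 bit bus: {bw_128:.1f} GB/s")
print(f"256 bit bus: {bw_256:.1f} GB/s")
```

Halving the bus from 256 bit to 128 bit halves the peak figure, which is the trade being described for hitting a 10-15 watt budget.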

Z0m3le · Nov 12, 2021

ReddDreadtheLead said:
You don’t need 4 to do that task. It is supposed to be using the same OS as the Switch, correct? Which is Horizon OS. The clocks for a single A55 core, which is significantly smaller than an A78 might I add, would only need to match what the OG Switch needed in terms of performance. 1.3GHz is enough to do the same task as one A57 at 1GHz while consuming a fraction of the power and taking up a fraction of the space.

This, of course, is a completely custom setup and won’t happen really, having a single A55 core. 2 is possible imo if they want a set of cores dedicated just to OS stuff. If there are 2 A55s they can be clocked lower and it would consume even less power.

This in turn prevents the issue of say 6A78 cores where 1 is used for the OS and only 5 are available to games. The set up with little + big is to free up as much of the A78 to be clocked a lot higher and leave the OS to the little cores and taking up even less space that can be allocated to other things like more cache perhaps. While you can use an A78 in a 6 core setup, you leave a noticeable amount of performance that can be allocated to a game directly.

In theory, you can get much better battery life and an even more performant device in its lowest-case and its highest-case scenarios, while not consuming more die space, fitting within a smaller budget, and still keeping 6-8SMs intact.

On another note, since Dane will be based off an architecture that has RT cores, those RT cores can be used for Audio in theory if they aren’t using it for docked or portable mode.

Why did you remove 2 CPU cores

I'm discussing 4xA78+4xA55 cores vs 6xA78C cores. A55 cores are more of a want from how people would build the next Switch than something we are seeing from Nvidia, who hasn't used A53 or A55 cores in any design since the Tegra X1 days. Please take into account that A78C cores are designed for dedicated gaming devices, and that developers would expect 8 high performance cores, so they would expect 8xA78C cores, and that is what we are seeing in Orin. The whole big.LITTLE thing is on us: there are no rumors or found evidence, no hint that Nvidia or Nintendo are going that route. Also, because you can clock a core separately from the others, you could run 7xA78C cores at 1.6GHz and 1xA78C core at 450MHz. It takes up the same space as 4xA55 cores and gives the same performance as those cores at 1.35GHz, so power consumption should not be as much of an issue. This allows 8 cores on 1 cluster with the same memory cache as 8xA78AE cores, which would be nice because they would share the same code and similar register/cache space.

Terrell · Nov 12, 2021

ReddDreadtheLead said:
It is supposed to be using the same OS as the Switch, correct? Which is Horizon OS.

Well, that's a bit of an unknowable variable. It's right to assume it will be as lightweight as possible like Horizon, but whether or not it's the exact same? Who knows.

Z0m3le · Nov 12, 2021

ShadowFox08 said:
@Z0m3le I don't know. I'm not exactly an SME on this, but I was thinking about what Thraktor said about possibly not having enough space. Did we not conclude that the Orin NX board is too big to fit in a form factor that matches Switch?

And yeah, I do think it will cost more money to make a modified version of the NX with smaller A78s, even though those are meant for handheld gaming and will save space. So maybe it doesn't make sense to have the A78Cs, but can they really fit 8 CPU cores, 8 GPU SMs, DLSS cores, RAM and potentially 2 RT cores?

Either way, there is no way the Switch 2 will not be a modified version of NX. A lot of things aren't needed: the camera and A.I. stuff is irrelevant for gaming. Could they make the board smaller?

I was basically making some kind of educated guess at Dane's specs (10-15 watts) by comparing it to the 30 watt Volta, and noticed the 10-15 watt Volta versions had 6 CPU cores instead of 8, not to mention slower clock speeds as well. They also retained 75% of the CUDA cores, at slower clock speeds, with 8 GB of RAM on a 128 bit bus (instead of 256 bit and more RAM), in an effort to meet the power draw requirements of 10-15 watts.

Thraktor's post was comparing TX1's unknown numbers to Orin's unknown numbers... it didn't make much sense, to be honest. He should have used Xavier's known numbers. We know what Xavier's 8SM GPU takes in terms of size: it is officially 89mm^2, at 25M transistors per mm^2. Dane would be ~50M transistors per mm^2, and assuming a ~50% increase in transistor count gets you to 66mm^2. If the GPU is ~40% of the die, that gives you a ~160mm^2 die, which is the worst case IMO and is a similar size to Wii U's entire GPU die, so it isn't unreasonable. Realistically, Xavier has 64 tensor cores and Orin NX has 32, and Xavier has fp64 hardware while Orin NX does not... I'd suggest only a 25% transistor increase here, based on how big Volta was. With these numbers and the GPU being 40% of the die, you get a die area under 140mm^2. Considering the higher price point, I think it makes sense...
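The die-size estimate above can be written out as a short sketch. Every input is the post's own figure (89mm^2 Xavier GPU, 25M vs ~50M transistors per mm^2, GPU at ~40% of the die); the function names are hypothetical and the 40% share is a guess from the post, not a measured value.

```python
XAVIER_GPU_AREA_MM2 = 89.0   # Xavier 8SM GPU area, per the post
XAVIER_DENSITY_MTR = 25.0    # M transistors per mm^2, per the post
DANE_DENSITY_MTR = 50.0      # assumed density for Dane, per the post


def gpu_area_mm2(transistor_growth: float) -> float:
    """GPU area on the denser node after growing the transistor count."""
    xavier_gpu_mtr = XAVIER_GPU_AREA_MM2 * XAVIER_DENSITY_MTR  # millions
    return xavier_gpu_mtr * (1 + transistor_growth) / DANE_DENSITY_MTR


def die_area_mm2(gpu_mm2: float, gpu_share: float = 0.40) -> float:
    """Whole-die area if the GPU occupies gpu_share of it (assumed 40%)."""
    return gpu_mm2 / gpu_share


for growth in (0.50, 0.25):
    gpu = gpu_area_mm2(growth)
    die = die_area_mm2(gpu)
    print(f"+{growth:.0%} transistors: GPU {gpu:.1f} mm^2, die {die:.1f} mm^2")
```

With these inputs the +50% case lands a little above the rounded ~160mm^2 worst case, and the +25% case stays just under 140mm^2.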

What we ultimately know is that power consumption is the biggest factor, and that seems to be capable of both 8 CPU cores and 8SM.

https://min.news/en/tech/11cb594ada6aec65b3d0d9040c4de0d9.html not sure about the source here, but the article claims that Nintendo has Dane's (T239) GPU clocked at 1GHz docked and 768MHz when portable... kind of crazy if you ask me, but maybe that is why they aren't worried about running DLSS in portable mode. That is almost 1.6TFLOPs in portable mode and only 2TFLOPs in docked mode... but it might take less energy to run portable mode with a small clock reduction and no tensor cores active than to use DLSS... Interesting to discuss at least, IMO.
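For reference, those TFLOPs figures follow from the usual peak FP32 formula. This is a minimal sketch assuming 8 SMs with 128 CUDA cores each (the 1024-core Orin NX configuration discussed in this thread) and 2 FLOPs per core per clock from fused multiply-add; the clocks are the unverified ones from the article.

```python
CUDA_CORES_PER_SM = 128  # Ampere SM layout; assumed to carry over to Dane


def fp32_tflops(sms: int, clock_mhz: float) -> float:
    """Peak FP32 throughput: cores * 2 FLOPs (one FMA) per clock."""
    cores = sms * CUDA_CORES_PER_SM
    return cores * 2 * clock_mhz * 1e6 / 1e12


portable = fp32_tflops(sms=8, clock_mhz=768)   # rumored portable clock
docked = fp32_tflops(sms=8, clock_mhz=1000)    # rumored docked clock

print(f"portable: {portable:.2f} TFLOPs")
print(f"docked:   {docked:.2f} TFLOPs")
```

These are theoretical peaks only; they say nothing about sustained performance or power draw.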

ShadowFox08 · Nov 12, 2021

Z0m3le said:
Thraktor's post was comparing TX1's unknown numbers to Orin's unknown numbers... it didn't make much sense, to be honest. He should have used Xavier's known numbers. We know what Xavier's 8SM GPU takes in terms of size: it is officially 89mm^2, at 25M transistors per mm^2. Dane would be ~50M transistors per mm^2, and assuming a ~50% increase in transistor count gets you to 66mm^2. If the GPU is ~40% of the die, that gives you a ~160mm^2 die, which is the worst case IMO and is a similar size to Wii U's entire GPU die, so it isn't unreasonable. Realistically, Xavier has 64 tensor cores and Orin NX has 32, and Xavier has fp64 hardware while Orin NX does not... I'd suggest only a 25% transistor increase here, based on how big Volta was. With these numbers and the GPU being 40% of the die, you get a die area under 140mm^2. Considering the higher price point, I think it makes sense...

What we ultimately know is that power consumption is the biggest factor, and that seems to be capable of both 8 CPU cores and 8SM.

https://min.news/en/tech/11cb594ada6aec65b3d0d9040c4de0d9.html not sure about the source here, but the article claims that Nintendo has Dane's (T239) GPU clocked at 1GHz docked and 768MHz when portable... kind of crazy if you ask me, but maybe that is why they aren't worried about running DLSS in portable mode. That is almost 1.6TFLOPs in portable mode and only 2TFLOPs in docked mode... but it might take less energy to run portable mode with a small clock reduction and no tensor cores active than to use DLSS... Interesting to discuss at least, IMO.

That article is giving off multiple red flags.

For one, they are referring to it as "switch pro."
Second, I can't find a source anywhere for a YouTuber working for Digital Foundry (Shu Maoshe is the name) saying that he heard devs got a prototype that has 8GB of RAM.

And I'm a bit suspicious about a full 1GHz docked and 768MHz in handheld (even with DLSS turned off) being used. 1.6 TFLOPs in handheld is a reach. It doesn't sound like something Nintendo would do, for the sake of battery, and that small gap of 1.6 vs 2 in docked makes no sense. I can see them doing half of 1GHz (or closer to 400 MHz), if docked really goes to the full 1GHz.

Mercury_Sagit · Nov 12, 2021

ShadowFox08 said:
That article is giving off multiple red flags.

For one, they are referring to it as "switch pro."
Second, I can't find a source anywhere for a YouTuber working for Digital Foundry (Shu Maoshe is the name) saying that he heard devs got a prototype that has 8GB of RAM.

And I'm a bit suspicious about a full 1GHz docked and 768MHz in handheld (even with DLSS turned off) being used. 1.6 TFLOPs in handheld is a reach. It doesn't sound like something Nintendo would do, for the sake of battery, and that small gap of 1.6 vs 2 in docked makes no sense. I can see them doing half of 1GHz (or closer to 400 MHz), if docked really goes to the full 1GHz.

This is purely hypothetical, but I don't think half of the docked clock is a hard limit for Dane. Nintendo used the 460 MHz clock profile on Erista, of all things, for their high profile games, and that is more than half of the docked clock profile (768 MHz). That said, we may never know Dane's clock profiles till it's released.

Z0m3le said:
I wasn't suggesting the info was legit, but the idea that the portable GPU would have a higher clock than expected, because it doesn't use DLSS, could be a trade off Nintendo explores... for instance, currently Switch uses a 768MHz docked performance and a 460MHz portable performance (60% of docked performance)... If Gen1 Switch could use DLSS, then it would make sense to have docked and portable clocks even closer together (say 80%) this way you can render in 720p and output much higher resolutions when docked, but stay at 720p when portable... Which Dane still might be 720p.

Or both I'd guess. At least for Erista, Nintendo opened up the options for 3rd party developers to use either the 372 or 460 Mhz profiles in handheld mode (and possibly the CPU OC mode but Idk if any one used them so far), so I hope that such freedom is maintained for Dane.

Z0m3le · Nov 12, 2021

ShadowFox08 said:
That article is giving off multiple red flags.

For one, they are referring to it as "switch pro."
Second, I can't find a source anywhere for a YouTuber working for Digital Foundry (Shu Maoshe is the name) saying that he heard devs got a prototype that has 8GB of RAM.

And I'm a bit suspicious about a full 1GHz docked and 768MHz in handheld (even with DLSS turned off) being used. 1.6 TFLOPs in handheld is a reach. It doesn't sound like something Nintendo would do, for the sake of battery, and that small gap of 1.6 vs 2 in docked makes no sense. I can see them doing half of 1GHz (or closer to 400 MHz), if docked really goes to the full 1GHz.

I wasn't suggesting the info was legit, but the idea that the portable GPU would have a higher clock than expected, because it doesn't use DLSS, could be a trade off Nintendo explores... for instance, currently Switch uses a 768MHz docked performance and a 460MHz portable performance (60% of docked performance)... If Gen1 Switch could use DLSS, then it would make sense to have docked and portable clocks even closer together (say 80%) this way you can render in 720p and output much higher resolutions when docked, but stay at 720p when portable... Which Dane still might be 720p.
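The portable-to-docked ratios being compared here are simple to tabulate. This sketch uses the Switch clocks mentioned in the thread (768MHz docked, 460MHz and 372MHz portable) and the article's unverified Dane clocks (1GHz docked, 768MHz portable); `portable_ratio` is a hypothetical helper.

```python
def portable_ratio(portable_mhz: float, docked_mhz: float) -> float:
    """Portable GPU clock as a fraction of the docked GPU clock."""
    return portable_mhz / docked_mhz


profiles = {
    "Switch 460/768": portable_ratio(460, 768),
    "Switch 372/768": portable_ratio(372, 768),
    "Rumored Dane 768/1000": portable_ratio(768, 1000),
}

for name, ratio in profiles.items():
    print(f"{name}: {ratio:.0%} of docked")
```

The rumored Dane profile sits close to the ~80% example, noticeably higher than either Switch profile.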

ReddDreadtheLead · Nov 12, 2021

ShadowFox08 said:
I was basically making some sort of educated guess at Dane's specs (10-15 watts) by comparing it to the 30 watt Volta, and noticed the 10-15 watt Volta versions had 6 CPU cores instead of 8, not to mention slower clock speeds as well. They also retained 75% of the CUDA cores, at slower clock speeds, with 8 GB of RAM on a 128 bit bus (instead of 256 bit and more RAM), in an effort to meet the power draw requirements of 10-15 watts.

It doesn’t really make sense though. The Volta you were comparing to is the non-NX, which is 30W and has all 8 cores; the 15W Volta has 1/4th of the cores disabled, and that is the NX.

You were applying the non-NX to NX comparison from Volta to the Orin and basing the conclusion on that, which is inconsistent.

You basically removed more cores for no reason.

It would have 8 cores, not 6 cores.

The other things were lowered the right way; the CPU count was unnecessarily lowered.

Terrell said:
Well, that's a bit of an unknowable variable. It's right to assume it will be as lightweight as possible like Horizon, but whether or not it's the exact same? Who knows.

I doubt it would be a heavy OS. The past Nintendo system with the heaviest OS was the Wii U; other systems had a non-existent OS. Switch was made to be as lean as possible on purpose, so I doubt they will try to make the OS heavy like the other systems.

Z0m3le said:
I'm discussing 4xA78+4xA55 cores vs 6xA78C cores. A55 cores are more of a want from how people would build the next Switch than something we are seeing from Nvidia, who hasn't used A53 or A55 cores in any design since the Tegra X1 days. Please take into account that A78C cores are designed for dedicated gaming devices, and that developers would expect 8 high performance cores, so they would expect 8xA78C cores, and that is what we are seeing in Orin. The whole big.LITTLE thing is on us: there are no rumors or found evidence, no hint that Nvidia or Nintendo are going that route. Also, because you can clock a core separately from the others, you could run 7xA78C cores at 1.6GHz and 1xA78C core at 450MHz. It takes up the same space as 4xA55 cores and gives the same performance as those cores at 1.35GHz, so power consumption should not be as much of an issue. This allows 8 cores on 1 cluster with the same memory cache as 8xA78AE cores, which would be nice because they would share the same code and similar register/cache space.

You can clock a core separately from the others, but the cost of doing so is a more complex design. You don’t simply clock it lower; it has to be designed to be clocked lower by having a separate rail for its required voltage.

Especially for clocks that are so far apart, like 450MHz and 1.6GHz.

Z0m3le · Nov 12, 2021

ReddDreadtheLead said:
It doesn’t really make sense though. The Volta you were comparing to is the non-NX, which is 30W and has all 8 cores; the 15W Volta has 1/4th of the cores disabled, and that is the NX.

You were applying the non-NX to NX comparison from Volta to the Orin and basing the conclusion on that, which is inconsistent.

You basically removed more cores for no reason.

It would have 8 cores, not 6 cores.

The other things were lowered the right way; the CPU count was unnecessarily lowered.

I doubt it would be a heavy OS. The past Nintendo system with the heaviest OS was the Wii U; other systems had a non-existent OS. Switch was made to be as lean as possible on purpose, so I doubt they will try to make the OS heavy like the other systems.

You can clock a core separately from the others, but the cost of doing so is a more complex design. You don’t simply clock it lower; it has to be designed to be clocked lower by having a separate rail for its required voltage.

Especially for clocks that are so far apart, like 450MHz and 1.6GHz.

Yes, 450MHz was one example, it could be 500MHz for instance, but doing this to a single core is less complex than a separate cluster with its own rail system. Technically you can do it for all 8 cores, but since you'd only do this for the OS core, I think doing it once makes the most sense IMO.

Blomqvist · Nov 12, 2021

Orin NX, which is a binned version of Orin X (because yields on 8N are low), will hardly be the SoC for the next Nintendo system, because it is huge and hence expensive. I recall dismissing Xavier NX as a possibility even though it is smaller than this Orin chip. Orin S is not Orin NX; it will be a small, cheap-to-produce die with good yields and the ability to hit a 5W TDP. And it will probably still be binned to make Jetson Nano Next. This thing is probably coming in Q1 2023, and everyone will be disappointed because it will be the lowest end of Nvidia's offerings.

Z0m3le · Nov 12, 2021

Blomqvist said:
Orin NX, which is a binned version of Orin X (because yields on 8N are low), will hardly be the SoC for the next Nintendo system, because it is huge and hence expensive. I recall dismissing Xavier NX as a possibility even though it is smaller than this Orin chip. Orin S is not Orin NX; it will be a small, cheap-to-produce die with good yields and the ability to hit a 5W TDP. And it will probably still be binned to make Jetson Nano Next. This thing is probably coming in Q1 2023, and everyone will be disappointed because it will be the lowest end of Nvidia's offerings.

No, Orin NX is Orin S, look at the picture above. Dane is what Nintendo will use, and that isn't Orin NX: it removes those AI components, but it will still have a 10-15 watt TDP and use the same architecture. Core counts don't really matter when talking about Dane; it will have the same performance per watt as Orin NX outside of the automated engine stuff.
