
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (New Staff Post, Please read)

Consoles don't just drop dead when their successor releases, especially not recently.
Tell that to the PS4 and XB1 sales.
Better yet, the Wii and Wii U.
And the GameCube too! And the N64. And the GBA. And so on.

You know what? I think the only console that kept selling at good levels after the successor came out was the PS2.

But as Pokemaniac said, the games keep coming for previous platforms. That’s the important thing. PS4/XBO stopped selling abruptly, but the cross-gen period is letting their successors get more games than if there had been a clean break in compatibility. Both generations are actively buying software because both can use it.
Today one of PlayStation's bigger moneymakers is PSN; it even allowed them to release the PS5 at a loss without losing money during the transition.
So going back to the “BC or not BC” problem: Dane will have BC if Nintendo wants NSO to be anything more than the barebones service it is today, and obviously Nintendo wants that. PSN showed where the money is.
 
Eh, the Wii dropped dead even before its successor released. The Wii U dropped dead as soon as it was released.
badum-tshhh 🥁

I'm curious if the influx of PC portable gaming machines over the next year and a half will move the needle for Switch 4K system design. E.g. force their hand with a super-fast SSD and memory, 16 GB of RAM, hardware acceleration for folder and theme support, etc.
 
badum-tshhh 🥁

I'm curious if the influx of PC portable gaming machines over the next year and a half will move the needle for Switch 4K system design. E.g. force their hand with a super-fast SSD and memory, 16 GB of RAM, hardware acceleration for folder and theme support, etc.
why would they? barring the Steam Deck, all of these devices might add up to a million units
 
why would they? barring the Steam Deck, all of these devices might add up to a million units
Ah yea, good point on the volumes. 👈

I'm anticipating tremendous fanfare among enthusiast gaming and tech circles when the Steam Deck is released with its comparatively outrageous specs. But admittedly, it's not like having technologically superior upstart competition has ever moved Nintendo's strategy before. 🤷‍♀️
 
An interesting post about tensor core scaling vs ALU scaling on Beyond3D. It really explains why there's more potential in AI upscaling than in brute-forcing performance with more ALUs, whose cost snowballs rapidly.

 
An interesting post about tensor core scaling vs ALU scaling on Beyond3D. It really explains why there's more potential in AI upscaling than in brute-forcing performance with more ALUs, whose cost snowballs rapidly.


Interesting read, and I definitely agree that although the electrical-component side of tech is becoming extremely dense and complex, bandwidth overall isn't progressing fast enough at a reasonable cost to sustain raw performance growth at mainstream product pricing. This is why creative solutions are intriguing; I want to see if we get something from Switch 4K that comes close to PS5 and Series X in the IQ department...

Also, side note about GTC coming up on Nov 10th: will we get confirmation from Nvidia on whether Orin is being manufactured on 8nm or 7nm? With the platform expected to debut in products next year, I kind of think Jen-Hsun will show off a finished Orin product with full spec details.
 
Looking at the example used in that post, thinking of what's needed to match the 3090 w/ DLSS performance mode via brute force is... oof. The size, the bandwidth needed, the resulting price, and the power needed (reminds me of that Igor's Lab mention of a new connector introduced to allow for GPUs to hit 600 watts)... Native rendering at the top end will die by economics.
Edit: By top end, I'm referring to resolution/frame rate/performance
 
Looking at the example used in that post, thinking of what's needed to match the 3090 w/ DLSS performance mode via brute force is... oof. The size, the bandwidth needed, the resulting price, and the power needed (reminds me of that Igor's Lab mention of a new connector introduced to allow for GPUs to hit 600 watts)... Native rendering at the top end will die by economics.
Edit: By top end, I'm referring to resolution/frame rate/performance
Even Digital Foundry is starting to call it the "Post-Resolution Era", so they seem to see the writing on the wall there.
 
Looking at the example used in that post, thinking of what's needed to match the 3090 w/ DLSS performance mode via brute force is... oof. The size, the bandwidth needed, the resulting price, and the power needed (reminds me of that Igor's Lab mention of a new connector introduced to allow for GPUs to hit 600 watts)... Native rendering at the top end will die by economics.
Edit: By top end, I'm referring to resolution/frame rate/performance

Agreed, it's pretty astronomical what hardware growth is needed to achieve parity with the 3090 using DLSS, although the next-gen Lovelace cards will cover most of this on the transistor density front by moving to 5nm. Not only will the TDP be high, but needing close to 2 TB/s of memory bandwidth to achieve similar performance to DLSS would make for insane prices at that top end.

It does make one wonder if Nvidia's next 4090 or Titan card ends up using HBM2 memory to gain a performance edge over other manufacturers' top-end cards and to better differentiate the 4080 from the 4090 (the 3080 and 3090 were way too close in performance to justify the price at the top end). It's also pretty interesting to think that in just one GPU generation, Ampere's top-end performance should be matched by what is considered the next mid-tier card on Lovelace, the 4070.

Again, so many interesting topics to discuss about where Nvidia places its bets going forward, and I think expounding on the use of tech beyond the norm of expectations is needed to take game design and creativity to another level.

 
Even Digital Foundry is starting to call it the "Post-Resolution Era", so they seem to see the writing on the wall there.

It's also getting harder and harder to make compelling content about why this 4K game looks better than that one on GPUs running at 300-400 watts. That's one of the main reasons impossible ports on the Switch always draw major attention, and I'm sure they will continue to be a thing on the next iteration of Switch hardware. If Nvidia and Nintendo can strike the perfect balance of CPU, GPU and memory bandwidth performance, the next hardware could be the perfect example of what the original Switch set out to achieve...
 
The thing that worries me most is the RAM bandwidth. No one is talking about it, and it could be the biggest bottleneck for the entire system.
It makes me wonder if Nintendo will once again invest in "exotic RAM" a la GameCube and 1T-SRAM.

I wonder if Nvidia might be influencing their decisions when it comes to solutions regarding bandwidth. I know they probably want to get a good deal for whichever type of memory they go with.
 
I am incredibly excited for the DLSS properties of the successor. I just hope that we'll get to play the classics like Breath of the Wild at 4k/30 or 4k/60 as a treat.
 
Maybe they can use HBM-PIM to support DLSS.
I had to look this acronym up online, and it seems like it's some kind of memory tailored to AI tasks. I didn't know this was a thing.
According to Samsung's PR, this type of memory could reduce power consumption significantly if the scenario is optimal (I assume). Also, its modules, called AXDIMM, do have the same shape as a regular DIMM, which makes it at least a valid candidate for Dane.

Interesting, I guess. I wonder if this product will creep further up in conversations.
 
I am incredibly excited for the DLSS properties of the successor. I just hope that we'll get to play the classics like Breath of the Wild at 4k/30 or 4k/60 as a treat.

I remember when everyone thought raytracing was the new insane sh*t when the RTX 2000 series was introduced. Barely anyone mentioned DLSS. It didn't help that DLSS 1.0 wasn't all that great. Now, a few years later, I could not give a damn about raytracing, but DLSS is the bees knees. DLSS is literally the only thing I WANT Nintendo and Nvidia to utilise. A real game changer!
 
I remember when everyone thought raytracing was the new insane sh*t when the RTX 2000 series was introduced. Barely anyone mentioned DLSS. It didn't help that DLSS 1.0 wasn't all that great. Now, a few years later, I could not give a damn about raytracing, but DLSS is the bees knees. DLSS is literally the only thing I WANT Nintendo and Nvidia to utilise. A real game changer!
It is absolutely a game changer and a match made in heaven for Nintendo's hardware philosophy. Being able to display native 720p games at 4k resolution is exactly what Nintendo needs as more people adopt 4k TVs.
 
An interesting post about tensor core scaling vs ALU scaling on Beyond3D. It really explains why there's more potential in AI upscaling than in brute-forcing performance with more ALUs, whose cost snowballs rapidly.

I think a good point this post makes is that output quality is tied to how deep the network can be in real time. Some people have posted before speculating about a version of DLSS optimized for the Switch, but those optimizations would have to take the form of a shallower network architecture with somewhat poorer reconstruction quality. It will be a bit of a balancing act.
 
The thing that worries me most is the RAM bandwidth. No one is talking about it, and it could be the biggest bottleneck for the entire system.

People have been discussing this area; we just don't know enough about the architectural changes between Ampere and Lovelace, or what modifications might be made to Dane that differ from desktop Lovelace cards... The current Switch suffered from bandwidth limitations, so I fully expect both Nintendo and Nvidia to address this in the Dane Switch model (something Nintendo has always done in new hardware iterations).
It makes me wonder if Nintendo will once again invest in "exotic RAM" a la GameCube and 1T-SRAM.

I wonder if Nvidia might be influencing their decisions when it comes to solutions regarding bandwidth. I know they probably want to get a good deal for whichever type of memory they go with.
I don't know about exotic RAM, but I am curious to see if a good portion of the die is dedicated to cache memory in order to make up for the difference in unified bandwidth needed. That article about the Ampere architecture states that developers are having trouble keeping CUDA cores fed, which is why RDNA2 ends up more performant in rasterization (which makes one wonder whether we'll see a fix to this bottleneck with Lovelace, or whether they wait until Hopper to restructure the overall SM layout).
I am incredibly excited for the DLSS properties of the successor. I just hope that we'll get to play the classics like Breath of the Wild at 4k/30 or 4k/60 as a treat.
I think Nintendo are betting on this being the system-selling feature for the next model.
Everyone understands the concept of Switch now, and if they can manage PS4-quality games on the go that output 4K graphics while docked, the Switch family of systems will continue to print money...
I had to look this acronym up online, and it seems like it's some kind of memory tailored to AI tasks. I didn't know this was a thing.
According to Samsung's PR, this type of memory could reduce power consumption significantly if the scenario is optimal (I assume). Also, its modules, called AXDIMM, do have the same shape as a regular DIMM, which makes it at least a valid candidate for Dane.

Interesting, I guess. I wonder if this product will creep further up in conversations.
I don't know if we'll ever see the console manufacturers go back to using exotic memory applications with limited functionality (outside of just using regular HBM, that is). The switch to unified memory in the first place was to simplify and reduce the cost of needing both a pool of system memory and a pool of graphics memory. So if the cost of regular HBM comes down, maybe it becomes feasible down the road in a future console, but it seems solutions like AMD's Infinity Cache might be a more cost-effective, practical option.
 
I think a good point this post makes is that output quality is tied to how deep the network can be in real time. Some people have posted before speculating about a version of DLSS optimized for the Switch, but those optimizations would have to take the form of a shallower network architecture with somewhat poorer reconstruction quality. It will be a bit of a balancing act.
It makes me wonder if the needs of Orin will require more tensor performance over CUDA performance. Or, with Dane being made for Nintendo first, they decided to tack on more tensor cores to Dane in a style like the GA100. That tradeoff would have to find a way to keep the tensor cores from being idle, though. Maybe with some non-inference tasks, if possible.
 
I think a good point this post makes is that output quality is tied to how deep the network can be in real time. Some people have posted before speculating about a version of DLSS optimized for the Switch, but those optimizations would have to take the form of a shallower network architecture with somewhat poorer reconstruction quality. It will be a bit of a balancing act.
Do they really need a version optimized for Switch, though?
They currently have ultra performance mode, and as long as that keeps getting better with each DLSS iteration, by the time the Dane Switch comes along it will probably meet the minimum requirements needed to achieve decent image quality in both handheld and docked modes. Again, we don't fully know the advantages of Lovelace over the Ampere architecture, so maybe where it greatly excels is in Tensor and RT calculations over Ampere.

There are image-quality competitors to Nvidia's solution, so as amazing as DLSS is, there still aren't that many games utilizing it just yet.
I fully expect that once this new Switch is even announced, it will double the number of games incorporating DLSS overnight, because it lends itself more to the hybrid hardware. It's one thing to see a 300W graphics card using DLSS to put out pretty images, but a portable device using less than 20 watts to do more impossible ports than the first Switch is a Trojan horse for the AI solution.
 
I don't know if we'll ever see the console manufacturers go back to using exotic memory applications with limited functionality (outside of just using regular HBM, that is). The switch to unified memory in the first place was to simplify and reduce the cost of needing both a pool of system memory and a pool of graphics memory. So if the cost of regular HBM comes down, maybe it becomes feasible down the road in a future console, but it seems solutions like AMD's Infinity Cache might be a more cost-effective, practical option.
Shoot. I forgot about the advantage of a unified memory pool. I still remember the stories about the Xbox 360 and PS3 in this regard. Good observation.
 
[Image: Global-audience-graphic.png]

Software RT Shadows running on a Mali G78 (embedded media)

RT demo by Tencent (embedded media)

Now they're starting to put the focus on the GPU after years of it just being "there". Phones aren't very GPU-oriented devices, but I feel like we know so little about the GPUs in these.
 
Do they really need a version optimized for Switch, though?
They currently have ultra performance mode, and as long as that keeps getting better with each DLSS iteration, by the time the Dane Switch comes along it will probably meet the minimum requirements needed to achieve decent image quality in both handheld and docked modes. Again, we don't fully know the advantages of Lovelace over the Ampere architecture, so maybe where it greatly excels is in Tensor and RT calculations over Ampere.
I think he’s talking about a Switch-optimized DLSS in the sense that it’s optimized to run faster (not prettier), considering that Dane will have fewer Tensor Cores than any other RTX 2000 or 3000 GPU, and they will probably run at a lower clock.

Edit: I just realized this is the 2nd most viewed thread on the forum. The demand for a better Switch exists.
 
It does make one wonder if Nvidia's next 4090 or Titan card ends up using HBM2 memory to gain a performance edge over other manufacturers' top-end cards and to better differentiate the 4080 from the 4090 (the 3080 and 3090 were way too close in performance to justify the price at the top end). It's also pretty interesting to think that in just one GPU generation, Ampere's top-end performance should be matched by what is considered the next mid-tier card on Lovelace, the 4070.
Maybe they can use HBM-PIM to support DLSS.
Given that AD102's rumoured to continue using GDDR6X, HBM's probably still cost prohibitive.
 
badum-tshhh 🥁

I'm curious if the influx of PC portable gaming machines over the next year and a half will move the needle for Switch 4K system design. E.g. force their hand with a super-fast SSD and memory, 16 GB of RAM, hardware acceleration for folder and theme support, etc.
Force Nintendo's hand 🤣. Nintendo has no problem going against the grain and against whatever is considered an "industry standard."
 
The thing that worries me most is the RAM bandwidth. No one is talking about it, and it could be the biggest bottleneck for the entire system.
We were discussing it before (not here), and personally I think they will opt for an IC-like implementation for the system to help with the bandwidth.

It wouldn't be the first time they've opted for super-fast embedded memory for the system.


But yeah the memory bandwidth situation is a pretty big one for this system.
 
Force Nintendo's hand 🤣. Nintendo has no problem going against the grain and against whatever is considered an "industry standard."
Yep, Nintendo goes at their own pace. Their online systems are still behind Sony's and Microsoft's and they aren't rushing to be better there, so I don't think they would do the same with hardware, especially since Nintendo doesn't 100% care about having the best specs around anymore.
 
Yep, Nintendo goes at their own pace. Their online systems are still behind Sony's and Microsoft's and they aren't rushing to be better there, so I don't think they would do the same with hardware, especially since Nintendo doesn't 100% care about having the best specs around anymore.

They can also get away with quite a lot less raw grunt and still have gaming performance comparable to those AMD APUs, thanks to console optimisation and DLSS.
 
Doesn't it seem strange to anyone that there are no rumors about new titles (impossible on Switch) despite the fact that devkits have been around for quite some time now?
Or have I missed something? 😶

If I understand correctly, the big software houses (and smaller ones) already have devkits and a goal of having titles "ready" by late 2022.
 
Doesn't it seem strange to anyone that there are no rumors about new titles (impossible on Switch) despite the fact that devkits have been around for quite some time now?
Or have I missed something? 😶

If I understand correctly, the big software houses (and smaller ones) already have devkits and a goal of having titles "ready" by late 2022.
probably not best to talk about specific games during this period. makes it easier to nail down sources
 
Do they really need a version optimized for Switch, though?
They currently have ultra performance mode, and as long as that keeps getting better with each DLSS iteration, by the time the Dane Switch comes along it will probably meet the minimum requirements needed to achieve decent image quality in both handheld and docked modes. Again, we don't fully know the advantages of Lovelace over the Ampere architecture, so maybe where it greatly excels is in Tensor and RT calculations over Ampere.

There are image-quality competitors to Nvidia's solution, so as amazing as DLSS is, there still aren't that many games utilizing it just yet.
I fully expect that once this new Switch is even announced, it will double the number of games incorporating DLSS overnight, because it lends itself more to the hybrid hardware. It's one thing to see a 300W graphics card using DLSS to put out pretty images, but a portable device using less than 20 watts to do more impossible ports than the first Switch is a Trojan horse for the AI solution.
Ultra performance is good at reducing rendering cost, but the computational cost of DLSS itself is mostly dictated by output resolution. This is all assuming that it’s a convolutional neural network; Nvidia has said ‘convolutional autoencoder,’ but that’s just a specific genre of CNN architecture that looks like this:

[Image: convolutional autoencoder architecture diagram]


Or in the case of the Facebook neural supersampling paper written by Anton Kaplanyan, who Intel just poached for XeSS, like this:

[Image: network diagram from the Facebook neural supersampling paper]


I’ve posted about this in more detail on Era, but the computational cost of a CNN is proportional to pixel resolution, and the reconstruction has to happen at the output resolution.

Gory details: the computational cost going from one layer to the next layer for a CNN with zero padding, a step size of 1, and a 3x3 filter is the cost of doing a dot product with (3 * 3 * number of channels in the current layer) elements in whatever precision you are working in times (height in pixels * width in pixels * number of channels in the next layer * cost of activation).

The Kaplanyan method also uses two other CNNs: one to weight previous frames at output resolution, and one to learn features at input resolution. The latter or its equivalent in DLSS is a place where you could have some savings in ultra performance mode, but the reconstruction network is the bottleneck by far in the case of the Facebook/Kaplanyan paper. If that holds true for DLSS, running the same architecture at 4K output resolution would approach being twice as expensive as at 1440p.

(I am assuming here that ultra performance uses the same neural network(s) as the other modes and that only resolution scales.)

I believe that someone on Era made an estimate of how long it would take DLSS to complete on a few potential versions of Dane, but I can’t find the post right now and don’t remember the poster. In any case, the gist was that reconstructing at 4K with DLSS may consume a significant portion of the frame time.

The post that ILikeFeet shared makes the point that DLSS likely could be even better at reconstruction if it were deeper, but that Nvidia is potentially limiting the number of layers to hit their 2 ms target. Similarly, the way to optimize DLSS for an even tighter performance budget would be to do the opposite and make the network even shallower at risk of somewhat decreasing the reconstruction quality.

Lovelace could indeed have some secret sauce that further leverages parallelization or sparsity, but there’s still a hearty chunk of calculations to do at the end of the day.
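(Aside: to make that resolution scaling concrete, here's a rough back-of-the-envelope sketch of the per-layer cost formula above. The layer count and channel widths are invented purely for illustration; this is not the actual DLSS network.)

```python
# Back-of-the-envelope cost of a stack of 3x3 conv layers (stride 1, zero padding),
# following the formula above:
#   MACs per layer = (3 * 3 * in_channels) * (height * width * out_channels)
# The channel widths below are made up for illustration, not the real DLSS network.

def conv3x3_macs(height, width, in_channels, out_channels):
    """Multiply-accumulates for one 3x3 convolution layer at a given resolution."""
    return (3 * 3 * in_channels) * (height * width * out_channels)

def stack_macs(height, width, channels):
    """Total MACs for consecutive 3x3 layers with the given channel widths."""
    return sum(conv3x3_macs(height, width, c_in, c_out)
               for c_in, c_out in zip(channels, channels[1:]))

toy_channels = [12, 32, 64, 64, 32, 3]  # hypothetical shallow reconstruction network

for label, (width, height) in {"1440p": (2560, 1440), "4K": (3840, 2160)}.items():
    print(f"{label}: {stack_macs(height, width, toy_channels) / 1e9:.1f} GMACs per frame")

# 4K has 2.25x the pixels of 1440p, so the same network costs ~2.25x as much per frame,
# which is why output resolution dominates the cost of the reconstruction pass.
```

Shrinking the channel widths or dropping layers is exactly the "shallower network" tradeoff described above, trading compute for reconstruction quality.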
 
Oh yea, memory bandwidth. Switch does need more, but there are always those caveats of cost, area, and power draw.
To recap: cost in terms of the memory chips themselves (be it LPDDR4X or LPDDR5) (or packaging a stack of HBM, but...), area in terms of the memory bus, and power because the active transfer of data still costs power.
A cache solution added on top to crank up effective bandwidth at minimal area/power cost would be pretty neat, though I don't know the cost impact of that.

What was the base range we settled on expecting? 128-bit LPDDR4X to 128-bit LPDDR5?
Recap: Original Switch uses 64-bit LPDDR4. 2019 and OLED models should be 64-bit LPDDR4X. What 4X offers over 4 is an improvement in power efficiency per bit as well as potentially up to 1/3 more speed/bandwidth. 2019 and OLED models should have chosen to keep the speed/bandwidth the same as the original launch Switch and pocketed the power efficiency improvement for battery life.
128-bit LPDDR4X would be potentially (double)*(plus a third) = 2 2/3 times the bandwidth of the original model (or, an increase of 1 2/3 times).
LPDDR5, by specifications, can reach up to double the speed of base LPDDR4, so 128-bit LPDDR5 would potentially be (double)*(double) = 4 times the bandwidth of the original model (or, an increase of 3 times).
JEDEC did publish the standard for LPDDR5X towards the end of July this year. Just like 4->4X, 5-5X is another potentially 1/3 increase in speed/bandwidth. However, I don't think that I've seen mention of power efficiency improvement over 5. There are rumors of a phone or two using 5X by the end of this year, implying actual production. That said, I don't think that any of us reasonably expect 5X to be an option here for Dane. But hey, there's that random dark horse option of 64-bit LPDDR5X; get the bandwidth of 128-bit 4X, but with the efficiency of 5 at the area cost of 64-bit.
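(For reference, a quick sketch of the peak-bandwidth arithmetic behind those multipliers. The transfer rates below are the commonly cited spec maximums, not confirmed clocks for any Switch model or for Dane.)

```python
# Peak theoretical bandwidth = (bus width in bits / 8) bytes per transfer * rate in MT/s.
# Transfer rates are spec maximums, not confirmed clocks for any actual Switch/Dane config.

def peak_gb_s(bus_width_bits, mt_per_s):
    """Peak bandwidth in GB/s for a given bus width and transfer rate."""
    return bus_width_bits / 8 * mt_per_s / 1000

configs = {
    "Original Switch: 64-bit LPDDR4 @ 3200 MT/s": (64, 3200),
    "128-bit LPDDR4X @ 4266 MT/s":                (128, 4266),
    "128-bit LPDDR5 @ 6400 MT/s":                 (128, 6400),
    "Dark horse: 64-bit LPDDR5X @ 8533 MT/s":     (64, 8533),
}

for label, (width, rate) in configs.items():
    print(f"{label}: ~{peak_gb_s(width, rate):.1f} GB/s")

# Prints ~25.6, ~68.3, ~102.4, and ~68.3 GB/s respectively, which is where the
# "2 2/3x" and "4x" multipliers over the original model come from.
```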
 
Ultra performance is good at reducing rendering cost, but the computational cost of DLSS itself is mostly dictated by output resolution. This is all assuming that it’s a convolutional neural network; Nvidia has said ‘convolutional autoencoder,’ but that’s just a specific genre of CNN architecture that looks like this:

[Image: convolutional autoencoder architecture diagram]


Or in the case of the Facebook neural supersampling paper written by Anton Kaplanyan, who Intel just poached for XeSS, like this:

[Image: network diagram from the Facebook neural supersampling paper]


I’ve posted about this in more detail on Era, but the computational cost of a CNN is proportional to pixel resolution, and the reconstruction has to happen at the output resolution.

Gory details: the computational cost going from one layer to the next layer for a CNN with zero padding, a step size of 1, and a 3x3 filter is the cost of doing a dot product with (3 * 3 * number of channels in the current layer) elements in whatever precision you are working in times (height in pixels * width in pixels * number of channels in the next layer * cost of activation).

The Kaplanyan method also uses two other CNNs: one to weight previous frames at output resolution, and one to learn features at input resolution. The latter or its equivalent in DLSS is a place where you could have some savings in ultra performance mode, but the reconstruction network is the bottleneck by far in the case of the Facebook/Kaplanyan paper. If that holds true for DLSS, running the same architecture at 4K output resolution would approach being twice as expensive as at 1440p.

(I am assuming here that ultra performance uses the same neural network(s) as the other modes and that only resolution scales.)

I believe that someone on Era made an estimate of how long it would take DLSS to complete on a few potential versions of Dane, but I can’t find the post right now and don’t remember the poster. In any case, the gist was that reconstructing at 4K with DLSS may consume a significant portion of the frame time.

The post that ILikeFeet shared makes the point that DLSS likely could be even better at reconstruction if it were deeper, but that Nvidia is potentially limiting the number of layers to hit their 2 ms target. Similarly, the way to optimize DLSS for an even tighter performance budget would be to do the opposite and make the network even shallower at risk of somewhat decreasing the reconstruction quality.

Lovelace could indeed have some secret sauce that further leverages parallelization or sparsity, but there’s still a hearty chunk of calculations to do at the end of the day.

Yes, I definitely remember those discussions of approximately how long it takes (I believe the RTX 2060/S was the card used) as the comparative means to measure the theoretical window the next Switch would need in order to render a DLSS output of 4K at both 30fps and 60fps. We still don't know how Lovelace will compare vs Turing when it comes to Tensor core performance, or whether they found a unique solution with a custom design that differs from desktop Lovelace, or whether Lovelace as a whole has a different, more performant Tensor setup.

We know from NateDrake that devkits are not only out in the wild, but have continued to go out to developers both small and large alike, and the statement that many are excited about the possibilities is promising enough that they must be happy with what they've received...
 
Oh yea, memory bandwidth. Switch does need more, but there are always those caveats of cost, area, and power draw.
To recap: cost in terms of the memory chips themselves (be it LPDDR4X or LPDDR5) (or packaging a stack of HBM, but...), area in terms of the memory bus, and power because the active transfer of data still costs power.
A cache solution added on top to crank up effective bandwidth at minimal area/power cost would be pretty neat, though I don't know the cost impact of that.

What was the base range we settled on expecting? 128-bit LPDDR4X to 128-bit LPDDR5?
Recap: Original Switch uses 64-bit LPDDR4. 2019 and OLED models should be 64-bit LPDDR4X. What 4X offers over 4 is an improvement in power efficiency per bit as well as potentially up to 1/3 more speed/bandwidth. 2019 and OLED models should have chosen to keep the speed/bandwidth the same as the original launch Switch and pocketed the power efficiency improvement for battery life.
128-bit LPDDR4X would be potentially (double)*(plus a third) = 2 2/3 times the bandwidth of the original model (or, an increase of 1 2/3 times).
LPDDR5, by specifications, can reach up to double the speed of base LPDDR4, so 128-bit LPDDR5 would potentially be (double)*(double) = 4 times the bandwidth of the original model (or, an increase of 3 times).
JEDEC did publish the standard for LPDDR5X towards the end of July this year. Just like 4->4X, 5-5X is another potentially 1/3 increase in speed/bandwidth. However, I don't think that I've seen mention of power efficiency improvement over 5. There are rumors of a phone or two using 5X by the end of this year, implying actual production. That said, I don't think that any of us reasonably expect 5X to be an option here for Dane. But hey, there's that random dark horse option of 64-bit LPDDR5X; get the bandwidth of 128-bit 4X, but with the efficiency of 5 at the area cost of 64-bit.
I don't know how viable this is, but I was thinking maybe Nintendo and Nvidia could add four 32-bit LPDDR5 channels or eight 16-bit LPDDR5 channels on Dane (similar to what Apple's doing with the Apple M1 and what Nvidia's doing with Xavier), alongside Nintendo adding two 64-bit LPDDR5 RAM chips next to Dane, and have the LPDDR5 channels and the LPDDR5 RAM chips run at the same time. I was thinking that would allow the DLSS model* to have a bus width of 256-bit; and assuming the LPDDR5 channels and the LPDDR5 RAM chips are running at the max I/O rate of 6400 MT/s, which is the max I/O rate for LPDDR5, then the DLSS model* could theoretically achieve a max bandwidth of 204.8 GB/s (6400 MT/s = 51.2 GB/s (64-bit)).
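(The arithmetic there checks out, for what it's worth; a quick sanity check, with the 256-bit configuration being purely hypothetical:)

```python
# Hypothetical 256-bit LPDDR5 configuration from the post above,
# at LPDDR5's 6400 MT/s max I/O rate.
bus_width_bits = 256
mt_per_s = 6400
print(bus_width_bits / 8 * mt_per_s / 1000)  # 204.8 GB/s, i.e. 4 x 51.2 GB/s per 64 bits of bus
```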
 
I don't know how viable this is, but I was thinking maybe Nintendo and Nvidia could add four 32-bit LPDDR5 channels or eight 16-bit LPDDR5 channels on Dane (similar to what Apple's doing with the Apple M1 and what Nvidia's doing with Xavier), alongside Nintendo adding two 64-bit LPDDR5 RAM chips next to Dane, and have the LPDDR5 channels and the LPDDR5 RAM chips run at the same time. I was thinking that would allow the DLSS model* to have a bus width of 256-bit; and assuming the LPDDR5 channels and the LPDDR5 RAM chips are running at the max I/O rate of 6400 MT/s, which is the max I/O rate for LPDDR5, then the DLSS model* could theoretically achieve a max bandwidth of 204.8 GB/s (6400 MT/s = 51.2 GB/s (64-bit)).

Apple has done this since 2018 with the RAM right?
Has anyone ever done a deep dive on how much this costs Apple in the grand scheme of things?
 
Yes, I definitely remember those discussions of approximately how long it takes (I believe the RTX 2060/S was the card used) as the comparative means to measure the theoretical window the next Switch would need in order to render a DLSS output of 4K at both 30fps and 60fps. We still don't know how Lovelace will compare vs Turing when it comes to Tensor core performance, or whether they found a unique solution with a custom design that differs from desktop Lovelace, or whether Lovelace as a whole has a different, more performant Tensor setup.

We know from NateDrake that devkits are not only out in the wild, but have continued to go out to developers both small and large alike, and the statement that many are excited about the possibilities is promising enough that they must be happy with what they've received...
I do agree with the spirit of what you are saying about the devkits, and I have no doubt that the quality will still be a great improvement over the current Switch. I think I am more pessimistic about Lovelace’s potential to significantly improve tensor calculation throughput on the same process node, but I am not well versed in the SOC design aspect of this thread, so it’s definitely possible.

Barring that, I am expecting Nintendo and Nvidia to make some tradeoffs in framerate, resolution, and/or network architecture to hit their performance and quality targets. My expectations are calibrated somewhat lower than the ideal scenario of 4K at 60 FPS with quality equivalent to ultra performance on current desktops.
 
I do agree with the spirit of what you are saying about the devkits, and I have no doubt that the quality will still be a great improvement over the current Switch. I think I am more pessimistic about Lovelace’s potential to significantly improve tensor calculation throughput on the same process node, but I am not well versed in the SOC design aspect of this thread, so it’s definitely possible.

Barring that, I am expecting Nintendo and Nvidia to make some tradeoffs in framerate, resolution, and/or network architecture to hit their performance and quality targets. My expectations are calibrated somewhat lower than the ideal scenario of 4K at 60 FPS with quality equivalent to ultra performance on current desktops.

Well the sheer difference in Tensor performance from GA100 to GA102 is purely in cache amounts, so maybe if the rumored cache increase on Lovelace ring true, the next hardware could see an increase in both Tensor performance and rasterization of similar results if not more.
 
[Image: Global-audience-graphic.png]

Software RT Shadows running on a Mali G78 (embedded media)

RT demo by Tencent (embedded media)

Now they're starting to put the focus on the GPU after years of it just being "there". Phones aren't very GPU-oriented devices, but I feel like we know so little about the GPUs in these.

I've said it before: RT shadows and RTGI are usually the most scalable/cheapest RT solutions.
Heck, we can see it even with SVOGI, a voxel-traced GI so scalable that it runs on OG Switch hardware in Crysis 1 and 2 Remastered for the system.
 
Ultra performance is good at reducing rendering cost, but the computational cost of DLSS itself is mostly dictated by output resolution. This is all assuming that it’s a convolutional neural network; Nvidia has said ‘convolutional autoencoder,’ but that’s just a specific genre of CNN architecture that looks like this:

[Image: convolutional autoencoder architecture diagram]


Or in the case of the Facebook neural supersampling paper written by Anton Kaplanyan, who Intel just poached for XeSS, like this:

[Image: network diagram from the Facebook neural supersampling paper]


I’ve posted about this in more detail on Era, but the computational cost of a CNN is proportional to pixel resolution, and the reconstruction has to happen at the output resolution.

Gory details: the computational cost going from one layer to the next layer for a CNN with zero padding, a step size of 1, and a 3x3 filter is the cost of doing a dot product with (3 * 3 * number of channels in the current layer) elements in whatever precision you are working in times (height in pixels * width in pixels * number of channels in the next layer * cost of activation).

The Kaplanyan method also uses two other CNNs: one to weight previous frames at output resolution, and one to learn features at input resolution. The latter or its equivalent in DLSS is a place where you could have some savings in ultra performance mode, but the reconstruction network is the bottleneck by far in the case of the Facebook/Kaplanyan paper. If that holds true for DLSS, running the same architecture at 4K output resolution would approach being twice as expensive as at 1440p.

(I am assuming here that ultra performance uses the same neural network(s) as the other modes and that only resolution scales.)

I believe that someone on Era made an estimate of how long it would take DLSS to complete on a few potential versions of Dane, but I can’t find the post right now and don’t remember the poster. In any case, the gist was that reconstructing at 4K with DLSS may consume a significant portion of the frame time.

The post that ILikeFeet shared makes the point that DLSS likely could be even better at reconstruction if it were deeper, but that Nvidia is potentially limiting the number of layers to hit their 2 ms target. Similarly, the way to optimize DLSS for an even tighter performance budget would be to do the opposite and make the network even shallower at risk of somewhat decreasing the reconstruction quality.

Lovelace could indeed have some secret sauce that further leverages parallelization or sparsity, but there’s still a hearty chunk of calculations to do at the end of the day.
Aren’t Nintendo also aiming to use a CNN, per their own recent public patent?


A reasonable guess. I think the neural network Nintendo seems to be developing in the background will be shown at GDC 2023, or probably not at all =P

Reason I say that is that Dane would presumably be out by then, probably to better showcase it in application on the console, on top of their own simulations, maybe.


Oh yea, memory bandwidth. Switch does need more, but there are always those caveats of cost, area, and power draw.
To recap: cost in terms of the memory chips themselves (be it LPDDR4X or LPDDR5) (or packaging a stack of HBM, but...), area in terms of the memory bus, and power because the active transfer of data still costs power.
A cache solution added on top to crank up effective bandwidth at minimal area/power cost would be pretty neat, though I don't know the cost impact of that.

What was the base range we settled on expecting? 128-bit LPDDR4X to 128-bit LPDDR5?
Recap: Original Switch uses 64-bit LPDDR4. 2019 and OLED models should be 64-bit LPDDR4X. What 4X offers over 4 is an improvement in power efficiency per bit as well as potentially up to 1/3 more speed/bandwidth. 2019 and OLED models should have chosen to keep the speed/bandwidth the same as the original launch Switch and pocketed the power efficiency improvement for battery life.
128-bit LPDDR4X would be potentially (double)*(plus a third) = 2 2/3 times the bandwidth of the original model (or, an increase of 1 2/3 times).
LPDDR5, by specifications, can reach up to double the speed of base LPDDR4, so 128-bit LPDDR5 would potentially be (double)*(double) = 4 times the bandwidth of the original model (or, an increase of 3 times).
JEDEC did publish the standard for LPDDR5X towards the end of July this year. Just like 4->4X, 5-5X is another potentially 1/3 increase in speed/bandwidth. However, I don't think that I've seen mention of power efficiency improvement over 5. There are rumors of a phone or two using 5X by the end of this year, implying actual production. That said, I don't think that any of us reasonably expect 5X to be an option here for Dane. But hey, there's that random dark horse option of 64-bit LPDDR5X; get the bandwidth of 128-bit 4X, but with the efficiency of 5 at the area cost of 64-bit.
Range is difficult to really, well, "range", as Orin is assumed to use LPDDR5, so if Dane is a customized version of that family of SoCs then it would presumably be LPDDR5 too. Plus, there's the incompatibility between LPDDR4/4X and LPDDR5/5X memory controllers, where you'd need to redesign the controller, so the selection is more limited here imo. 5 and 5X go without saying and don't need a redesign to work with each other. While they also don't mention the power savings, we can extrapolate based on the 33% improvement at the same power as LPDDR5, which would be 8533 MT/s compared to 6400 MT/s.


In other news, possibly good but it is just a rumor, and I don't remember if it was mentioned or not, but the Mi 12 is supposed to ship with LPDDR5X this year, so Dane might have LPDDR5X, which further increases the memory bandwidth the device can have, by ~33% of course, at the highest (docked mode) for 136 GB/s.

but of course, rumors are rumors
 
In other news, possibly good but it is just a rumor, and I don't remember if it was mentioned or not, but the Mi 12 is supposed to ship with LPDDR5X this year, so Dane might have LPDDR5X, which further increases the memory bandwidth the device can have, by ~33% of course, at the highest (docked mode) for 136 GB/s.

but of course, rumors are rumors
And that's not considering the idea of them using a more unique RAM solution to accelerate bandwidth for those who program for it (EX: Xbox One's ESRAM which likely resulted in "effective bandwidth" for that system being around 120GB/s+ despite the DDR3 being only 68GB/s)


A Switch Dane with ESRAM or an Infinity Cache sort of alternative for the GPU could very well accelerate that 103-136GB/s number from LPDDR5/LPDDR5X up to the 200GB/s range in effective bandwidth
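(As a very rough illustration of how an on-die pool lifts "effective bandwidth": the toy blend below mixes traffic served from a fast ESRAM/Infinity-Cache-style pool with traffic served from DRAM. The hit rate and on-die bandwidth are made-up placeholders, not estimates for Dane.)

```python
# Toy model of "effective bandwidth" with a small, fast on-die pool in front of DRAM.
# Hit rate and on-die bandwidth are made-up placeholders; real figures depend entirely
# on the workload and the actual design.

def effective_gb_s(dram_gb_s, ondie_gb_s, hit_rate):
    """Blend of traffic served from the fast on-die pool vs. from main memory."""
    return hit_rate * ondie_gb_s + (1.0 - hit_rate) * dram_gb_s

print(effective_gb_s(dram_gb_s=102.4, ondie_gb_s=400.0, hit_rate=0.35))
# ~206 GB/s "effective" from 128-bit LPDDR5, if ~35% of traffic hits a ~400 GB/s pool.
```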
 
Oh yea, memory bandwidth. Switch does need more, but there are always those caveats of cost, area, and power draw.
To recap: cost in terms of the memory chips themselves (be it LPDDR4X or LPDDR5) (or packaging a stack of HBM, but...), area in terms of the memory bus, and power because the active transfer of data still costs power.
A cache solution added on top to crank up effective bandwidth at minimal area/power cost would be pretty neat, though I don't know the cost impact of that.

What was the base range we settled on expecting? 128-bit LPDDR4X to 128-bit LPDDR5?
Recap: Original Switch uses 64-bit LPDDR4. 2019 and OLED models should be 64-bit LPDDR4X. What 4X offers over 4 is an improvement in power efficiency per bit as well as potentially up to 1/3 more speed/bandwidth. 2019 and OLED models should have chosen to keep the speed/bandwidth the same as the original launch Switch and pocketed the power efficiency improvement for battery life.
128-bit LPDDR4X would be potentially (double)*(plus a third) = 2 2/3 times the bandwidth of the original model (or, an increase of 1 2/3 times).
LPDDR5, by specifications, can reach up to double the speed of base LPDDR4, so 128-bit LPDDR5 would potentially be (double)*(double) = 4 times the bandwidth of the original model (or, an increase of 3 times).
JEDEC did publish the standard for LPDDR5X towards the end of July this year. Just like 4->4X, 5-5X is another potentially 1/3 increase in speed/bandwidth. However, I don't think that I've seen mention of power efficiency improvement over 5. There are rumors of a phone or two using 5X by the end of this year, implying actual production. That said, I don't think that any of us reasonably expect 5X to be an option here for Dane. But hey, there's that random dark horse option of 64-bit LPDDR5X; get the bandwidth of 128-bit 4X, but with the efficiency of 5 at the area cost of 64-bit.
Realistically we can expect 68 GB/s for LPDDR4X and 102 GB/s for LPDDR5, both with a 128-bit bus width.

I'm not holding my breath on LPDDR5X for Switch 2. Even if the tech comes out next year and offers 33% more bandwidth than LPDDR5, like its LPDDR4X predecessor did, I don't see it happening that soon. LPDDR5X would be perfect for a revision though (even though the extra bandwidth probably wouldn't be used then).

And having a larger cache will help a lot.
 
Realistically we can expect 68 GB/s for LPDDR4X and 102 GB/s for LPDDR5, both with a 128-bit bus width.

I'm not holding my breath on LPDDR5X for Switch 2. Even if the tech comes out next year and offers 33% more bandwidth than LPDDR5, like its LPDDR4X predecessor did, I don't see it happening that soon. LPDDR5X would be perfect for a revision though (even though the extra bandwidth probably wouldn't be used then).

And having a larger cache will help a lot.
Why would they go with LPDDR4X over regular LPDDR5?
 
Realistically we can expect 68 GB/s for LPDDR4X and 102 GB/s for LPDDR5, both with a 128-bit bus width.

I'm not holding my breath on LPDDR5X for Switch 2. Even if the tech comes out next year and offers 33% more bandwidth than LPDDR5, like its LPDDR4X predecessor did, I don't see it happening that soon. LPDDR5X would be perfect for a revision though (even though the extra bandwidth probably wouldn't be used then).

And having a larger cache will help a lot.

The Steam Deck uses LPDDR5 RAM and has a bandwidth of 88 GB/s; I fully expect it to be somewhere close to that number or slightly higher, to be honest. A larger cache would definitely make all of this a different topic altogether, so here's hoping we get some news on the Lovelace architecture sooner rather than later.
 
IDC if they do or not but 4G/5G would be huge
oh boy another data plan to sign up for. I'm not sure if people would be into that. it didn't work for the Vita, and I'm unsure it would work here. not when people can just tether to their phones or a bespoke hotspot.



I finally read AnandTech's review of the new Apple mobile SoC, and I like the bubble chart for efficiency. Going by GFXBench's database, this puts the sustained performance on par with a 750 Ti.

[Image: GFXBench Aztec High results chart]
 