I was playing around a bit with the new "RT Overdrive" mode in Cyberpunk, and although it's a bit of a stretch for my 3070, it was still interesting to experiment with, even at very low resolutions.
It's actually a very good test case for Ada's architectural improvements over Ampere. Most of the changes to Ada (shader execution re-ordering, improved RT cores, larger L2 cache) are aimed at improving RT performance, and the RT Overdrive mode in Cyberpunk is the heaviest stress test of RT performance at the moment. It also uses Nvidia's RTXDI, and seems to have been implemented in close collaboration with Nvidia, so we would expect it to be well-optimised for the newest Nvidia GPUs.
Tom's Hardware did some benchmarks here, and for a point of comparison I'm going to use the difference between the RTX 3070 and RTX 4070, as they both use 46 SMs. According to the benchmarks, the RTX 4070 outperformed the RTX 3070 by 42.3% at native 1080p. The RTX 4070 runs at higher clocks, though, with a boost clock of 2,475MHz compared to 1,725MHz on the 3070. That's a 43.5% increase in clocks for a 42.3% increase in performance with the same number of SMs. Not exactly what I was expecting.
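To put rough numbers on that, here's a quick back-of-envelope sketch using the advertised boost clocks and the 42.3% figure from the benchmarks (with all the caveats below about what clocks the cards actually ran at):

```python
# Implied per-clock scaling of the 4070 vs the 3070 (both 46 SMs),
# using advertised boost clocks and the Tom's Hardware native 1080p
# RT Overdrive result. Real sustained clocks will differ.

clock_3070 = 1725    # MHz, advertised boost
clock_4070 = 2475    # MHz, advertised boost
perf_uplift = 1.423  # 4070 is 42.3% faster in the benchmark

clock_uplift = clock_4070 / clock_3070   # ~1.435, i.e. +43.5%
per_clock = perf_uplift / clock_uplift   # ~0.99

print(f"Clock uplift:       {clock_uplift:.3f}x")
print(f"Performance uplift: {perf_uplift:.3f}x")
print(f"Implied per-clock:  {per_clock:.3f}x")
```

Taken at face value, that works out to roughly 0.99x per clock, i.e. basically parity with Ampere per SM per clock in this workload.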
There's a big caveat here in that advertised boost clocks typically don't represent the actual clock speeds GPUs run at. Without knowing what clock speeds the GPUs were running at during Tom's Hardware's benchmarks, it's impossible to say exactly what the per-clock performance difference is, but it seems unlikely that Ada offers a significant performance-per-clock benefit over Ampere even in RT-heavy workloads, which is honestly quite surprising to me. There are other caveats too, like memory bandwidth being only about 10% higher on the 4070, although I'd expect the much larger L2 to be a big benefit there.
Offering little to no performance-per-clock improvement even on heavy RT tasks is particularly disappointing when you consider the massive increase in transistor counts with Ada. The GA104 die used for the RTX 3070 is a 17.4 billion transistor part. The AD104 used for the RTX 4070 is a 35.8 billion transistor part. The AD104 does have more SMs on board, 60 compared to the 48 on GA104, but accounting for that we're still looking at 65% more transistors per SM. Of course most of those added transistors aren't in the SMs themselves (I would guess the 48MB L2 cache accounts for a lot of them), but it's still a significant increase in transistor count overall for very little increase in per-clock performance.
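For reference, here's the transistors-per-SM arithmetic. Note this uses whole-die transistor counts and full-die SM counts, so it's a rough whole-die proxy rather than a measure of how big the SMs themselves are:

```python
# Rough transistors-per-SM comparison between GA104 and AD104.
# Most of the extra transistors sit outside the SMs (e.g. the 48MB L2),
# so this is a whole-die proxy, not a measure of SM size.

ga104_transistors = 17.4e9
ga104_sms = 48
ad104_transistors = 35.8e9
ad104_sms = 60

ga104_per_sm = ga104_transistors / ga104_sms   # ~363M per SM
ad104_per_sm = ad104_transistors / ad104_sms   # ~597M per SM

print(f"GA104: {ga104_per_sm/1e6:.0f}M transistors per SM")
print(f"AD104: {ad104_per_sm/1e6:.0f}M transistors per SM")
print(f"Ratio: {ad104_per_sm/ga104_per_sm:.2f}x")  # ~1.65x, i.e. ~65% more
```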
Bringing this back to the topic of the thread, it doesn't look like Nintendo's missing much by using Ampere over Ada. I was assuming Ada would offer a nice boost to RT performance, but that doesn't seem to be the case. Ada is much more power efficient than Ampere, but that tracks with what we'd expect from the move to a better manufacturing process, and we should expect the same benefit from shrinking Ampere to 4N.
In fact, with Ada being a much more transistor-heavy architecture, if you're trying to optimise for the best performance within a limited die size and transistor budget, it seems like Ampere on 4N would actually comfortably outperform Ada, even on RT-heavy workloads.
Frame generation in DLSS 3 is of course another aspect to consider, but as I've said before, I don't think frame generation would be a sensible or feasible option for Nintendo even if they were using Ada. The additional latency and artefacting when running at lower frame rates (i.e. rendering at ~30fps for 60fps output) would be one thing, but I strongly suspect that the tensor core performance required for frame generation (which runs at the output resolution of DLSS's upscaling step, so 4K if that's the target output) would be beyond what we could expect from a GPU small enough to be viable for use in a device like the Switch.
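To illustrate why the output-resolution point matters, here's a quick pixel-count comparison. The actual tensor-core cost per pixel of frame generation isn't public, and the 720p internal render resolution is just a hypothetical example, so this only shows how the workload scales with output resolution rather than render resolution:

```python
# Frame generation operates on the upscaled output image, so its
# per-frame pixel workload scales with the output resolution, not the
# internal render resolution. Rough pixel counts only; 720p internal
# is a hypothetical example, and per-pixel tensor cost isn't public.

render = (1280, 720)    # hypothetical internal render resolution
output = (3840, 2160)   # 4K output after DLSS upscaling

render_px = render[0] * render[1]   # ~0.92M pixels
output_px = output[0] * output[1]   # ~8.29M pixels

print(f"Render pixels: {render_px/1e6:.2f}M")
print(f"Output pixels: {output_px/1e6:.2f}M")
print(f"Frame-gen works on {output_px/render_px:.1f}x the pixels of the render pass")
```

In other words, under those assumptions the frame generation step has to chew through roughly 9x the pixels of the render pass every generated frame, which is why I doubt it scales down to a Switch-sized GPU.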