It does seem like 25% more would be better, but it's probably not. Which isn't intuitive!
The 3070 and the 3070Ti have nearly identical TFLOPS, but a huge gap in memory bandwidth. If that extra memory bandwidth matters, we'd expect the 3070Ti to overperform relative to its small compute advantage. Here are Digital Foundry's benchmarks, summed up:
| Card | TFLOPS | Bandwidth (GB/s) | 1080p Average FPS | 1440p Average FPS | 4k Average FPS |
|---|---|---|---|---|---|
| RTX 3070 | 20.3 | 448 | 167 | 122 | 76 |
| RTX 3070Ti | 21.7 | 608 | 174 | 130 | 82 |
| % Improvement | 6.9% | 36% | 4% | 6.5% | 7.8% |
Basically nothing. Under the highest possible load, with 4k textures and LODs, the 36% bandwidth improvement buys less than 1% of performance on top of what the 6.9% compute bump already explains. The 3080Ti and the 3090 show similar results. But those are huge cards, so the second question is: how does Ampere scale down with regard to memory bandwidth?
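If you want to check the math on the 3070 vs 3070Ti comparison yourself, here's a rough sketch. The numbers are copied straight from the table; the dictionary keys and the `pct_gain` helper are just names I made up for illustration.

```python
# Figures copied from the 3070 / 3070Ti table (Digital Foundry averages).
rtx_3070 = {"tflops": 20.3, "bandwidth_gbs": 448,
            "fps_1080p": 167, "fps_1440p": 122, "fps_4k": 76}
rtx_3070_ti = {"tflops": 21.7, "bandwidth_gbs": 608,
               "fps_1080p": 174, "fps_1440p": 130, "fps_4k": 82}


def pct_gain(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


# Percentage gain of the 3070Ti over the 3070 for every column.
gains = {key: pct_gain(rtx_3070[key], rtx_3070_ti[key]) for key in rtx_3070}

# How much of the 4k gain is NOT already explained by the compute gain.
excess_4k = gains["fps_4k"] - gains["tflops"]

for key, value in gains.items():
    print(f"{key:>14}: {value:5.1f}%")
print(f"4k gain beyond the compute gain: {excess_4k:.2f} points")
```

The interesting number is the last one: if bandwidth were the bottleneck, the 4k gain should sit far above the compute gain, not right on top of it.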
| Card | 3050 | 3060 | 3070 | 3080 | 3090 |
|---|---|---|---|---|---|
| Bandwidth (GB/s) per TFLOP | 24.6 | 28.3 | 22.1 | 25.6 | 26.3 |
| 4k FPS per TFLOP | 3.7 | 3.9 | 3.8 | 3.1 | 3.0 |
What we can see is that, overall, the architecture scales down better than it scales up. And the 3060, which has roughly 28% more bandwidth (relative to compute) than the 3070, performs essentially identically per TFLOP. So there is no reason to believe that the smaller cards have higher bandwidth needs. In fact, the opposite. The data suggests the GPU needs no more than about 25GB/s per TFLOP, and likely sees marginal, if any, benefit above 22GB/s per TFLOP.
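Here's a small sketch of that comparison, using only the ratios from the table and the 3070 as the baseline. The `relative_to` helper and the variable names are mine, not anything official.

```python
# Ratios copied from the Ampere scaling table.
bandwidth_per_tflop = {"3050": 24.6, "3060": 28.3, "3070": 22.1,
                       "3080": 25.6, "3090": 26.3}
fps_4k_per_tflop = {"3050": 3.7, "3060": 3.9, "3070": 3.8,
                    "3080": 3.1, "3090": 3.0}


def relative_to(baseline, table):
    """Percent difference of every card's value versus the baseline card."""
    base = table[baseline]
    return {card: (value / base - 1.0) * 100.0 for card, value in table.items()}


bw_vs_3070 = relative_to("3070", bandwidth_per_tflop)
fps_vs_3070 = relative_to("3070", fps_4k_per_tflop)

for card in bandwidth_per_tflop:
    print(f"{card}: bandwidth/TFLOP {bw_vs_3070[card]:+6.1f}%, "
          f"4k FPS/TFLOP {fps_vs_3070[card]:+6.1f}%")
```

If extra bandwidth per TFLOP were buying real performance, the two columns would move together. For the 3060 the bandwidth column jumps while the FPS column barely moves.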
A 3.5 TFLOP Drake could be running balls to the wall and still have 14GB/s of spare memory bandwidth left over for the CPU, and likely closer to 25GB/s. Whole phones, with 8 CPU cores plus their GPUs, run on that amount of bandwidth, so this is probably generous for the CPU as well. Even more so if Nintendo takes advantage of the larger cache configuration that the A78C offers.
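To show where the 14GB/s and 25GB/s figures come from, here's a back-of-the-envelope sketch. The total bandwidth of about 102GB/s is my assumption here: it's the number those two estimates imply (3.5 × 25 + 14 ≈ 102), not a confirmed spec. The function and constant names are made up for illustration.

```python
# ASSUMPTION: total memory bandwidth implied by the estimates in the post,
# not a confirmed hardware spec.
ASSUMED_TOTAL_BANDWIDTH_GBS = 102.0
DRAKE_TFLOPS = 3.5  # GPU compute figure used in the post


def spare_bandwidth(total_gbs, tflops, gbs_per_tflop):
    """Bandwidth left for the CPU if the GPU uses gbs_per_tflop at full load."""
    return total_gbs - tflops * gbs_per_tflop


# Pessimistic case: the GPU really does want 25 GB/s per TFLOP.
spare_at_25 = spare_bandwidth(ASSUMED_TOTAL_BANDWIDTH_GBS, DRAKE_TFLOPS, 25.0)
# Likely case: the GPU sees little benefit past 22 GB/s per TFLOP.
spare_at_22 = spare_bandwidth(ASSUMED_TOTAL_BANDWIDTH_GBS, DRAKE_TFLOPS, 22.0)

print(f"Spare for CPU at 25 GB/s/TFLOP: {spare_at_25:.1f} GB/s")
print(f"Spare for CPU at 22 GB/s/TFLOP: {spare_at_22:.1f} GB/s")
```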
TL;DR: The extra 25% of bandwidth would likely result in little to no performance gain.
So I'm an obsessive (can you tell?) and I put craploads of GPU data in a big spreadsheet. And I turned up something which should have been obvious to me, but wasn't until it stared me in the face.
| | 3050 | 3060 | 3070 | 3080 | 3090 | T239 |
|---|---|---|---|---|---|---|
| Cache KB per GB/s of bandwidth | 9.14 | 8.53 | 9.14 | 6.73 | 6.56 | 10.03 |
| Cache KB per TFLOP | 225.05 | 241.89 | 201.77 | 172.39 | 172.68 | 292.57 |
If you look at it this way, even the 1MB cache in Drake is actually crazy high relative to the rest of the line. This is probably a side effect of scaling Ampere down. Cache size scales with the number of memory controllers, not the number of GPU cores or their clock speeds. Drake may only have one GPC, clocked way down, but it's 1MB of cache no matter what.
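A quick sketch of how far ahead the T239 sits, using only the ratios from the cache table plus the 1MB cache figure. The helper name is mine, and backing the bandwidth and TFLOPS out of the ratios is just a sanity check on the spreadsheet, not a spec claim.

```python
# Ratios copied from the cache table.
cache_kb_per_gbs = {"3050": 9.14, "3060": 8.53, "3070": 9.14,
                    "3080": 6.73, "3090": 6.56, "T239": 10.03}
cache_kb_per_tflop = {"3050": 225.05, "3060": 241.89, "3070": 201.77,
                      "3080": 172.39, "3090": 172.68, "T239": 292.57}


def lead_over_best_desktop(table, chip="T239"):
    """Percent by which chip beats the best of the other entries."""
    best_other = max(value for name, value in table.items() if name != chip)
    return (table[chip] / best_other - 1.0) * 100.0


lead_per_gbs = lead_over_best_desktop(cache_kb_per_gbs)
lead_per_tflop = lead_over_best_desktop(cache_kb_per_tflop)

# Sanity check: back out what the ratios imply for a 1MB (1024KB) cache.
T239_CACHE_KB = 1024
implied_bandwidth_gbs = T239_CACHE_KB / cache_kb_per_gbs["T239"]
implied_tflops = T239_CACHE_KB / cache_kb_per_tflop["T239"]

print(f"Cache per GB/s lead over best desktop card: {lead_per_gbs:.1f}%")
print(f"Cache per TFLOP lead over best desktop card: {lead_per_tflop:.1f}%")
print(f"Implied bandwidth: {implied_bandwidth_gbs:.1f} GB/s, "
      f"implied compute: {implied_tflops:.2f} TFLOPS")
```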
Of course, this might not be the best or only way to look at it. I'm not sure whether these ratios are good proxies for cache hit rate. But again, Ampere just seems to get more efficient as it gets smaller, so I tend to think that Drake will overperform, without any changes to its memory architecture, rather than underperform.