If you look at tx1 the same way, how does it hold up vs other maxwell cards?
I think it would be interesting to have a similar comparison between the Tegra X1 and NVidia's Maxwell line.
Tegra X1 is fairly in line with its fellow Gen 2 Maxwell GPUs.
TL;DR: I tried to prove that the method was right and ended up coming across some not-so-encouraging numbers: compared to its own generation, the Tegra X1 already had a better ratio of bandwidth to raw power than the PC GPUs. In the end I believe that either the CPU consumes much more bandwidth than we imagined,
Neat! I didn't have the Maxwell data handy. Let me slightly tweak your data:
GPU | TX1 Switch 307MHz | TX1 Switch 460MHz | TX1 Switch 768MHz | GTX 950 | GTX 960 | GTX 970 | GTX 980 | GTX 980 Ti |
---|---|---|---|---|---|---|---|---|
Bandwidth (GB/s) / TFLOP | 135.4 | 90.4 | 65.10 | 57.9 | 46.5 | 50 | 45 | 55.6 |
(Bandwidth - 10 GB/s) / TFLOP | 82.5 | 55.0 | 39.67 | | | | | |
So, I've added a column for the larger handheld profile that Nintendo introduced with Breath of the Wild. I've also added a row where the CPU cluster is sucking up a pretty small 10 GB/s of bandwidth. Admittedly, that's a ballpark assumption, but it does put the bandwidth constraints right where you'd expect them.
, or the bandwidth scale should be thought of more in relation to the rendering resolution than to the raw power.
You're absolutely right that the bandwidth constraint depends on the workload: resolution × frame rate. In our benchmarks, resolution is locked but frame rate is free, and we measure that frame rate to gauge each config's performance.
Memory workload and compute workload should scale together. We get this intuitively, right? You lock your resolution but double the frame rate, and you double the number of times shaders run, and double the number of times those shaders access memory.
But "memory workload" doesn't mean "bandwidth usage." Cache is the obvious example here - you've doubled the number of memory accesses, sure, but cache keeps some of those from using bandwidth. The same actually applies to compute, as well. Some operations are repeated, and if their results are cached, then you skip the compute.
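A toy model of that cache effect (my own illustrative numbers, not measurements): DRAM traffic is only the accesses that miss, so doubling the frame rate doubles bandwidth usage only if the hit rate stays fixed.

```python
# Toy model (an assumption for illustration, not measured data):
# DRAM bandwidth consumed = total memory accesses * cache miss rate.
def dram_gbs(access_gbs: float, hit_rate: float) -> float:
    return access_gbs * (1.0 - hit_rate)

print(dram_gbs(40.0, 0.50))  # 20.0 GB/s at a baseline frame rate
print(dram_gbs(80.0, 0.50))  # 40.0 GB/s: doubled fps, same hit rate
print(dram_gbs(80.0, 0.75))  # 20.0 GB/s: a better cache absorbs it
```

The point being: the mapping from "memory workload" to "bandwidth usage" runs through the cache, which is why cache size shows up again below.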
Ideally, we'd just benchmark T239 ourselves, with total control over its clocks.
But since we can't, we try to answer questions about the architecture generally. Do memory workload and compute workload scale with each other on Ampere? And is Ampere's ratio of bandwidth to compute a good one?
That's why I pulled out the direct performance/TFLOP comparisons across the range. Ampere actually shows signs of not scaling well at the top of the range. Why isn't clear, but a good guess would be cache again. To maintain the same cache hit rate across the range, cache probably needs to grow faster than compute, but in Ampere it grows slower. The bottom of the range is what we care about, though, and it looks like Ampere scales down quite well.
Second question: is Ampere bandwidth starved? It doesn't seem to be. Our best data is at the top of the range, unfortunately, but at the bottom, we can see that compute predicts performance better than bandwidth does. That tends to indicate that we're compute limited, not bandwidth limited.
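One way to picture "compute predicts performance better than bandwidth": a toy roofline-style model where the predicted frame rate is the lower of the two limits. The per-frame costs here are made-up numbers for illustration, not from our benchmarks.

```python
# Toy roofline-style model (illustrative only): the achievable frame
# rate is capped by whichever limit is lower.
def predicted_fps(tflops: float, bw_gbs: float,
                  tflop_per_frame: float, gb_per_frame: float) -> float:
    compute_fps = tflops / tflop_per_frame  # compute-limited rate
    bandwidth_fps = bw_gbs / gb_per_frame   # bandwidth-limited rate
    return min(compute_fps, bandwidth_fps)

# If raising GPU clocks (TFLOPs) keeps raising measured fps, the
# compute term is the binding one, i.e. the config is compute limited.
print(predicted_fps(0.4, 25.6, 0.01, 0.4))
```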
That doesn't guarantee that it will hold up on T239. There may be a performance cliff that Ampere falls over when it gets smaller. Or it may be that graphics workloads radically change in the next year or two. That's what happened to Maxwell, actually. It was designed before Physically Based Rendering became common in games, and suddenly bandwidth mattered a lot more than it used to. But considering the state of gaming now, that seems doubtful.