Thraktor
"[✄]. [✄]. [✄]. [✄]." -Microsoft
- Pronouns
- He/Him
REAL FLOPS AND FAKE FLOPS STOP COUNTING FLOPS
FLOPS means Floating Point Operations Per Second.
Floating point basically means "fractions." In graphics, fractions are common. The other thing about graphics is you're often performing the same operation with that fraction over and over again. See that object in the distance? It's smaller on screen because it's far away, so multiply its left side by 0.5. Now multiply its right side by 0.5. Now multiply its top side by 0.5...
Two things about CPUs: they're bad at fractions, and they're bad at performing lots of near-identical operations. Enter the GPU. The GPU is good at Graphics, and the core of it is a calculator designed to be good at the math that graphics needs. Rather than doing lots of similar floating point operations one after another, it does them all at once. Because graphics do Lots of Similar operations, there are two ways to make it more powerful. You can make the calculators faster (run at a higher clock speed) or you can add more calculators.
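To make the "same operation over and over" idea concrete, here's a toy Python sketch, with NumPy's vectorized multiply standing in for the GPU's bank of parallel calculators (the vertex values are made up):

```python
import numpy as np

# A made-up set of vertex x-coordinates for that distant object.
vertices = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# CPU-style: one multiply at a time, in a loop.
halved_one_at_a_time = [x * 0.5 for x in vertices]

# GPU-style: the same multiply applied to every element at once.
halved_all_at_once = vertices * np.float32(0.5)

assert np.allclose(halved_one_at_a_time, halved_all_at_once)
```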
The calculator cores in an Nvidia GPU are organized into something called SMs. Here is the SM in Turing, the architecture behind the RTX 20 cards.
Right now, I want you to pay attention to the sections labeled "FP32" and "INT32". FP32 stands for "32 bit floating point". When you count FLOPS on a GPU, you count how many of these FP32 units it has and multiply by how many operations each unit can execute per second.
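As a rough sketch of how that count works (the factor of 2 is the usual convention of counting a fused multiply-add as two operations; the RTX 2080 numbers are its published core count and boost clock):

```python
def theoretical_tflops(fp32_units, clock_ghz, ops_per_clock=2):
    """Peak FP32 TFLOPS = units x ops-per-clock x clock speed.
    A fused multiply-add counts as 2 floating point operations,
    hence the default of 2 ops per clock per unit."""
    return fp32_units * ops_per_clock * clock_ghz / 1000.0

# RTX 2080: 2944 FP32 units at ~1.71 GHz boost.
print(theoretical_tflops(2944, 1.71))  # ~10.07, the "10 TFLOP GPU" below
```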
The INT32 block stands for "32 bit integer." Graphics love floating point, but they're not exclusively floating point. They need integer operations too, and that is what this bank is for. Before Turing, both AMD and Nvidia did this basically the same way: a bank of INT and a bank of FP. We talked about FLOPS, but increasing FLOPS also increased INTOPS (integer operations per second) by similar amounts, so we didn't need to talk about them separately.
Then, in RTX 30, Nvidia did this. Let's play spot the difference.
There are a few differences here, to be fair, but the one we care about is that there isn't an INT32 block anymore. There is a block labeled "FP32/INT32" beside the "FP32" block that we got before. Nvidia has changed the INT32 block to be able to execute either a floating point operation or an integer operation. FLOPS HAVE DOUBLED
...but only if no integer operations are happening. That INT32/FP32 block can only execute one or the other. Only half the FLOPS are always available; the other half are shared with the INTOPS. If an integer operation is performed, the floating point operations are stuck with only the old FP32 bank.
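Here's a toy Python model of that trade-off, assuming one bank of 64 dedicated FP32 units and one shared bank of 64, which is how the Ampere SM diagrams are drawn (the real scheduler is more complicated than this):

```python
def ampere_sm_issue(fp_wanted, int_wanted, dedicated_fp=64, shared=64):
    """Toy per-cycle model: the shared bank serves integer work first,
    and whatever is left over can execute floating point."""
    int_done = min(int_wanted, shared)
    fp_capacity = dedicated_fp + (shared - int_done)
    fp_done = min(fp_wanted, fp_capacity)
    return fp_done, int_done

print(ampere_sm_issue(fp_wanted=128, int_wanted=0))   # (128, 0): the doubled FLOPS
print(ampere_sm_issue(fp_wanted=128, int_wanted=64))  # (64, 64): back to the old rate
```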
With Ampere, the architecture that is in Switch NG, Nvidia doubled FLOPS, but did not double performance. This was considered a bullshit marketing tactic, and was labeled "fake flops." The thing is, it isn't bullshit. It does increase performance, but not as much as GPU customers had been used to for 20 years.
RTX 30 cards had 2x the FLOPS of RTX 20 cards, but only 1.6x more performance. You cannot use FLOPS alone as a way to compare performance. Just because a system is "newer" doesn't mean that each of its FLOPS is as good or better than an older machine's FLOPS.
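The arithmetic, as a quick sketch:

```python
flops_ratio = 2.0  # RTX 30 has 2x the paper FLOPS of RTX 20
perf_ratio = 1.6   # ...but only ~1.6x the measured performance

# Performance delivered per FLOP, relative to the older cards:
print(perf_ratio / flops_ratio)  # 0.8, i.e. each new FLOP "buys" ~20% less
```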
I got up this morning thinking "you know, I should write a post explaining the whole 'doubled FLOPS' Ampere thing", only to find oldpuck got there ahead of me, as per usual...
There is one thing I'd like to add/clarify, though, because I think it's important. Most of the discussion is about how Ampere 'doubled FLOPS' compared to Turing, but it's Turing, not Ampere, that's the odd-one-out.
To clarify, as far as I can tell, Turing is the only consumer GPU architecture ever to have fully independent INT32 ALUs, completely separate from the FP32 ALUs. You're correct that in other architectures when FP32 FLOPS increased, so did INT32 OPS, but from my reading every other architecture could only execute either FP32 or INT32 calculations at any one time, not both. Here's a quote from Nvidia's Turing whitepaper, on page 11 (emphasis added):
Turing implements a major revamping of the core execution datapaths. Modern shader workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler instructions such as integer adds for addressing and fetching data, floating point compare or min/max for processing results, etc. In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating point math.
I've read through as much AMD documentation as I can find, and from what I can tell all of their architectures follow the same approach, with each block of ALUs either able to run INT or FP operations, but not both simultaneously.
Turing was a weird architecture. Nvidia realised that there was a decent bit of integer code required alongside floating point, so they split the computational hardware in two, dedicating a full 50% of it exclusively to integer operations. They went from 128 combined FP/INT "cores" per SM on Maxwell and Pascal to 64 dedicated FP "cores" and 64 dedicated INT "cores" on Turing.
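Laid out side by side (lane counts per the respective Nvidia whitepapers; this is just my summary, not an official table):

```python
# (dedicated FP32 lanes, dedicated INT32 lanes, shared FP32/INT32 lanes) per SM
sm_layouts = {
    "Maxwell/Pascal": (0, 0, 128),   # one combined FP/INT bank
    "Turing":         (64, 64, 0),   # the split-hardware experiment
    "Ampere":         (64, 0, 64),   # back to sharing; FLOPS "double"
}

for arch, (fp, integer, shared) in sm_layouts.items():
    peak_fp32 = fp + shared           # FP32 ops/cycle if no INT work is issued
    peak_any = fp + integer + shared  # ops/cycle of any kind
    print(f"{arch:>14}: {peak_fp32} FP32/cycle, {peak_any} total/cycle")
```

Note how FLOPS captures all 128 lanes on Maxwell, Pascal and Ampere, but only 64 of Turing's 128.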
This is where the whole FLOPS comparison really broke down. While FLOPS did, at least theoretically, measure the peak computational throughput of other GPUs, it only measured half of Turing's. Here's a review of the Turing RTX 2080. It's a 10 TFLOP GPU, and it beats the 10 TFLOP Vega 56 by up to 60%! This is a slightly unfair comparison, as Vega 56 was a couple of years old at that point, and RDNA1/2 improved performance a lot for AMD, but it's mostly unfair because the RTX 2080 has an entire extra 10 TOPS of INT32 performance as well that's not accounted for when looking only at TFLOPS figures.
Of course Nvidia wasn't exactly getting great use out of those dedicated INT32 ALUs (which is likely why nobody else ever tried it). From the Turing whitepaper again, Nvidia claim an average of 36 integer operations for every 100 floating point operations. That means they were likely only getting around 36% usage from the INT32 units.
To flip the comparison between the RTX 2080 and Vega 56 another way, let's compare them in terms of combined operations per second for both FP32 and INT32. Vega 56 is still a 10 trillion operations per second GPU by this metric, but RTX 2080, using all its hardware, is a 20 trillion ops per second card. By this measure, Turing underperforms Vega by 20% or more, because those INT32 units are sitting idle most of the time. Turing is often treated as an architecture that was very efficient "per FLOP", but if we include the integer hardware as well, it was actually pretty inefficient overall, even compared to AMD's relatively inefficient Vega architecture.
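Here's that comparison as a sketch, using the thread's own numbers plus the 36:100 integer-to-float ratio from the whitepaper:

```python
# Combined FP32 + INT32 throughput, in trillions of ops per second.
vega56_total_tops = 10.0          # FP and INT share the same ALUs
rtx2080_total_tops = 10.0 + 10.0  # 10 TFLOPS plus a separate 10 INT32 TOPS

best_case_perf_ratio = 1.6  # RTX 2080 beats Vega 56 by up to 60%
per_op_ratio = best_case_perf_ratio / (rtx2080_total_tops / vega56_total_tops)
print(per_op_ratio)  # 0.8: ~20% less performance per theoretical op, at best

# And why: ~36 INT ops per 100 FP ops means the INT32 bank,
# sized 1:1 against the FP32 bank, is busy only ~36% of the time.
print(36 / 100)  # 0.36
```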
So, Nvidia quickly recognised that the dedicated INT32 hardware experiment had failed, and switched back to the old combined FP32/INT32 approach with Ampere, this time having one bank of dedicated FP32 hardware and one combined INT/FP bank. This meant that the TFLOPS doubled without performance doubling (as oldpuck described above), but this wasn't because Ampere was using "fake FLOPS" or over-counting; it was because Turing was under-counting. The use of FLOPS to compare GPUs completely broke down with Turing, as it had all this additional integer hardware, but Ampere's changes meant these execution units were counted within FLOPS measures again. If we look at it per operation overall, rather than just per FLOP, Ampere gets up to 30% more performance per theoretical operation than Turing, because it doesn't have integer hardware sitting idle.
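For a sense of where that gain comes from, here's an idealized per-cycle ceiling under the whitepaper's instruction mix; it ignores scheduling, memory, and everything else, so real-world gains (the ~30% above) land well below it:

```python
fp_ops, int_ops = 100, 36  # the whitepaper's average instruction mix

# Turing: 64 dedicated FP lanes + 64 dedicated INT lanes in parallel.
# The FP side is the bottleneck, so the INT bank is mostly idle.
turing_cycles = max(fp_ops / 64, int_ops / 64)
turing_ops_per_cycle = (fp_ops + int_ops) / turing_cycles  # ~87 of 128 lanes busy

# Ampere: 64 dedicated FP lanes + 64 shared lanes can, in this
# idealized model, fill all 128 lanes with any FP/INT mix.
ampere_ops_per_cycle = 128.0

print(ampere_ops_per_cycle / turing_ops_per_cycle)  # ~1.47x, the theoretical ceiling
```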
Ignoring Turing for a second, let's compare Ampere to AMD's contemporary RDNA2 architecture. The RTX 3070 and the RX 6800XT are both 20 TFLOP GPUs, and the 6800XT outperforms the 3070 by 10-20%, with the bigger gains at higher resolutions. The 6800XT has double the RAM, more bandwidth and 16x the cache of the 3070, so I suspect that plays a part in the higher-res wins. Taking that out, you're probably looking at closer to a 10% win for RDNA2.
This deficit for Ampere doesn't really have anything to do with the "doubled FLOPS"; it's simply that Nvidia spent the previous few generations on weird experiments like dedicated integer hardware (and more useful experiments like tensor cores and hardware RT acceleration), whereas AMD spent the previous few generations completely redesigning their GPU architecture and achieving significant performance gains over their old designs, overtaking Nvidia in the process. Cut Turing out of the picture, and Ampere is part of a relatively straightforward line of stagnating performance in traditional rendering for Nvidia GPUs while they focus on RT and tensor cores, alongside AMD making significant advances in traditional rendering performance over the same time period.