Paul_Subsonic
So to my understanding, we can estimate the tensor performance of Drake using (the number of tensor cores per SM on desktop Ampere) × 4 × (the tensor performance per tensor core on Ampere) × 2?

The classic way is just ALUs × frequency.
FLOPS = Floating Point Operations Per Second. A FLOP is performed by an ALU, and frequency is a measure of how many times the ALU runs its execution cycle per second. So FLOPS is, obviously, the number of ALUs you have × how many times each can do an operation per second.
In the past, ALUs did one instruction per cycle, but now they can do more. That's the factor of 2 we use for Nvidia now, though it's been 1 in the past.
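To make that arithmetic concrete, here's a minimal sketch. The specs below are hypothetical round numbers, not any real card's figures:

```python
# Peak FLOPS = (number of ALUs) * (clock frequency) * (ops per cycle).
# The per-cycle factor is 2 for modern Nvidia parts (a fused multiply-add
# counts as two floating-point operations); older designs used 1.

def peak_flops(alus: int, freq_hz: float, ops_per_cycle: int = 2) -> float:
    """Peak theoretical FLOPS: every ALU doing maximum work every cycle."""
    return alus * freq_hz * ops_per_cycle

# Hypothetical example: 2048 ALUs at 1.0 GHz, FMA counted as 2 ops.
print(peak_flops(2048, 1.0e9) / 1e12)  # -> 4.096 (TFLOPS)
```

This is peak theoretical throughput; real workloads never keep every ALU busy every cycle, which is part of why per-FLOP comparisons across vendors get messy.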
It's only useful in this specific case. If I said that Nvidia was 1.3x as performant per FLOP as AMD, no one would blink. But when I did the benchmarks (I included only one card in my original post), I found that the more predictive strategy was to look at the FP16/FP32 numbers. It comes out to about 1.3 in the end anyway, but only if you compare cards that are close to each other in power. The FLOP averaging is more predictive across big scales.
This matters because apples-to-apples comparisons between consoles aren't possible, since they're not open platforms. And if the Xbox/PlayStation GPU architectures have changes over their desktop versions, we want a way to predict the performance qualities.
If Series S is an 8 TFLOP machine on FP16 operations - especially if games are optimized to use it, and especially if those optimizations are common on the other two consoles - then that matters, because it makes it hard for desktop Ampere to ever get there, no matter how close the FP32 TFLOP number is.
Which leads me, actually, to a second topic I've been looking at: how many tensor cores Drake might have. We know that Orin runs double-rate tensor cores, but while doing some research* I discovered it actually has 8 times the tensor performance of desktop Ampere. Not only does Orin run double-rate tensor, it also has 4 times as many tensor cores per SM.
This matters because the DLSS numbers of Drake actually aren't all that great if it's just desktop Ampere. But Nvidia could decide to stick more Tensor cores in there, which would, in turn, up the FP16 speed on the device, and again close the gap.
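To spell out the scaling described above (the multipliers are the relationships from my post; `DESKTOP_TENSOR_PER_SM` is a placeholder count for illustration, not a quoted spec):

```python
# Orin vs. desktop Ampere tensor throughput per SM, at equal clocks.
DESKTOP_TENSOR_PER_SM = 4   # placeholder: tensor cores per desktop Ampere SM
ORIN_CORE_MULTIPLIER = 4    # Orin has 4x the tensor cores per SM
ORIN_RATE_MULTIPLIER = 2    # and each one runs at double rate

orin_tensor_per_sm = DESKTOP_TENSOR_PER_SM * ORIN_CORE_MULTIPLIER
orin_relative = ORIN_CORE_MULTIPLIER * ORIN_RATE_MULTIPLIER
print(orin_relative)  # -> 8 (times desktop Ampere tensor perf per SM)
```

If Drake inherits the Orin layout rather than the desktop one, the same 8x factor would apply to its tensor (and effectively DLSS-relevant) throughput per SM.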
*I say research - to be clear, this is for work. We're expanding the GPU cluster at my office, and performance measurement of GPUs - at least, datacenter GPUs - is now my job. Which is handy, because I didn't actually know anything about modern GPUs till I started hanging around this forum.
By finding the differences between Turing and Ampere in the number of tensor cores and tensor performance per core, could we estimate a millisecond cost for DLSS using Digital Foundry's numbers from that old DLSS-for-Switch-Pro video?
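Something like this could work, under heavy assumptions: that DLSS cost scales inversely with total tensor throughput, and that nothing else (memory bandwidth, clocks under load) is the bottleneck. The numbers below are placeholders, not Digital Foundry's actual measurements - you'd plug in the real frame cost from the video and real tensor TFLOPS figures:

```python
# Rough scaling of a measured Turing DLSS frame cost to another GPU.
# Assumption: DLSS execution time scales inversely with tensor throughput
# (tensor cores per SM * SM count * per-core rate * clock).

def scale_dlss_ms(measured_ms: float, measured_tensor_tflops: float,
                  target_tensor_tflops: float) -> float:
    """Scale a measured DLSS cost by the ratio of tensor throughputs."""
    return measured_ms * measured_tensor_tflops / target_tensor_tflops

# Placeholder numbers only: a 2.4 ms measured cost, scaled to a target
# with half the tensor throughput, doubles to 4.8 ms.
print(scale_dlss_ms(2.4, 100.0, 50.0))
```

Even as a back-of-the-envelope method this only gives a floor/ceiling, since DLSS's cost also depends on output resolution and how well the workload saturates the tensor units.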