I never picked up on this before, but the comment about completing DLSS for the previous frame while rendering the next one, so that all the systems run concurrently, was quite interesting.
Yeah - that's why something like 15ms of DLSS cost doesn't actually make 4K60 DLSS impossible on Drake. Upscaling for one frame can run while the next frame is being rendered, without interfering with that frame's rendering.
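To make the pipelining concrete, here's a minimal sketch of the idea with two CUDA streams. The render_frame and upscale_frame kernels are made-up stand-ins (this is not how you'd actually invoke DLSS); the overlap structure is the point:

```cuda
// Minimal sketch of frame-pipelined upscaling with two CUDA streams.
// While frame N sits in the upscale stream, frame N+1's render can be
// issued on the render stream, and the GPU overlaps the two.
#include <cuda_runtime.h>

__global__ void render_frame(float* lowRes, int frame)     { /* stand-in */ }
__global__ void upscale_frame(const float* in, float* out) { /* stand-in */ }

int main() {
    cudaStream_t renderStream, upscaleStream;
    cudaStreamCreate(&renderStream);
    cudaStreamCreate(&upscaleStream);

    float* lowRes[2];  float* highRes[2];   // double buffer across frames
    cudaEvent_t rendered[2], upscaled[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&lowRes[i],  1280 * 720  * sizeof(float));
        cudaMalloc(&highRes[i], 3840 * 2160 * sizeof(float));
        cudaEventCreate(&rendered[i]);
        cudaEventCreate(&upscaled[i]);
    }

    for (int frame = 0; frame < 100; ++frame) {
        int cur = frame & 1;  // alternate buffers each frame
        // Don't overwrite this buffer until the upscale that reads it is done.
        cudaStreamWaitEvent(renderStream, upscaled[cur], 0);
        render_frame<<<256, 256, 0, renderStream>>>(lowRes[cur], frame);
        cudaEventRecord(rendered[cur], renderStream);
        // Upscale frame N on the other stream once its render finishes;
        // meanwhile the next loop iteration enqueues frame N+1's render.
        cudaStreamWaitEvent(upscaleStream, rendered[cur], 0);
        upscale_frame<<<256, 256, 0, upscaleStream>>>(lowRes[cur], highRes[cur]);
        cudaEventRecord(upscaled[cur], upscaleStream);
    }
    cudaDeviceSynchronize();
    return 0;
}
```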
I'm not actually sure if Wolfenstein used this technique. It's a little silly in a PC game, though perhaps it's enabled only when settings are pushed to max? It really seems like Nvidia showing how far developers could push it, more than how far they should.
Oh man, now that you mention it, I looked at the linked documentation from that video and this thing is a gold mine. So much good info.
It's great - I had read that paper before. If you compare all the Nvidia architectures since the 360 era, they all kinda look alike - except Turing, which looks very strange, before Ampere reverses course. I had wondered why that was the case, and this offhand comment not only explained it, but led me to realize that Ampere isn't a course reversal like I thought. If you really care, I'll give you a dump, but warning: DEEPLY ESOTERIC BULLSHIT BELOW (also, paging
@Thraktor, since we talked about this in the past)
Pascal organizes an SM like this: Each SM has 4 partitions, each partition has 32 CUDA cores, and a scheduler.
Turing organizes an SM like this: Each SM has 4 partitions, each partition has 16 CUDA cores, plus 16 integer cores, and a scheduler.
Ampere organizes an SM like this: Each SM has 4 partitions, each partition has 32 CUDA cores, and a scheduler. Half the CUDA cores support integer operations.
It looks like Pascal had the basic way of organizing SMs down, Turing took this wild bet on integer math that kinda makes no sense for a GPU, and then Ampere came back to Pascal's structure, with a little extra support for integer math.
The problem is that Nvidia's architecture papers, in their rush to make themselves sound good and brand everything, have obscured some of the details. That's not actually what's happening, but to get there, I needed to know
why Nvidia thought they needed so much integer power in a GPU, which is basically a floating point machine.
The answer, which was in the video, is ray tracing. When you're navigating large data structures in memory, you do a lot of pointer arithmetic. If you want the 10th item in a list, the awful way is to start at the beginning of the list and scan all of memory till you find the one labeled "10." The faster way is to make every entry in the list the same size. Multiply that size by ten, add it to the address of the beginning of the list, and you'll be right there.
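In code, that fast path is literally one multiply and one add. Toy sketch (BVHNode is made up, just to pin down an entry size):

```cuda
// Toy illustration: with fixed-size entries, finding item n is pure
// integer math instead of a scan. BVHNode is invented for the example.
struct BVHNode { float bounds[6]; int left, right; };  // 32 bytes each

__device__ const BVHNode* nth_node(const BVHNode* list, int n) {
    // The compiler lowers this to: address of list + n * sizeof(BVHNode),
    // i.e. an integer multiply-add - no memory is touched at all.
    return list + n;
}
```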
That math is all integer operations, and ray tracing makes
lots of complex data structures. Hence the need for a lot of integer support. But why did Ampere pull it back? The answer is, they didn't.
If you go back to Pascal, each SM has 4 partitions, and each partition has a scheduler. On the surface that means each partition can execute one instruction across 32 cores, and each SM can execute 4 instructions at a time. But Pascal's schedulers actually have two dispatch units. Each partition has the ability to execute 2 instructions at the same time, if and only if the 2 instructions never share a resource at any time.
This is called Instruction Level Parallelism. There is this tension in GPUs - you want to add cores for more power, but you need to keep all those cores fed. With lots of partitions, each holding only a few cores, it's easy to fully load a partition, but it can be hard to generate enough threads that all the partitions have something to do. With a smaller number of partitions, each holding more cores, it's easier to make sure all the partitions have work, but each partition might contain cores that are twiddling their thumbs.
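A toy example of the resource rule the dual-dispatch scheduler is checking:

```cuda
// In independent_math the two operations read and write completely
// different registers, so Pascal's dual-dispatch scheduler could issue
// them in the same cycle. In dependent_math the second op needs the
// first's result, so it can't.
__device__ float independent_math(float a, float b, float c, float d) {
    float x = a * b;  // uses a, b
    float y = c + d;  // uses c, d - no shared resource, can co-issue
    return x + y;
}

__device__ float dependent_math(float a, float b, float c) {
    float x = a * b;  // this must produce x first...
    return x + c;     // ...because this consumes it
}
```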
ILP is really hard to take advantage of, unfortunately, and it adds complexity to everything. Turing came up with a totally different solution that allowed them to eliminate the hardware that makes ILP possible, but actually
increase parallelism. Here is how.
Pascal's CUDA cores actually weren't just FP32 units. They were also integer units. But an integer operation would take up an instruction slot and prevent any FP32 instructions from running in that partition at the same time. No worries - in most shaders, integer ops are rare. Then along comes ray tracing, and integer operations go up.
Turing's design doesn't add special INT units like it seems to. It just breaks the existing CUDA cores into two pieces. One piece is just the INT ALU, the other piece is just the FP32 ALU. It actually removes the ability for the scheduler to dispatch two instructions at once. But since most instructions take multiple clock cycles to run, the scheduler has a chance to fire off a second instruction while the first is running. And if the first is an FP instruction, it knows for a fact that the INT pipeline is free, and vice versa.
This is a much simpler way of deciding when two instructions can run at the same time, and it's much easier to take advantage of, as long as you have a good mix of integer and floating point operations. Which Turing does...
in games that use ray tracing.
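Here's a made-up snippet with the kind of instruction mix that keeps both of Turing's pipes busy - the shape is typical of traversal code:

```cuda
// RT-style code alternates index/address math (INT pipe) with shading
// math (FP pipe), so Turing's scheduler can keep issuing on back-to-back
// cycles: an INT instruction while an FP one is in flight, and vice versa.
__device__ float traverse_ish(const int* nextIndex, const float* dist,
                              int node, int steps) {
    float accum = 0.0f;
    for (int i = 0; i < steps; ++i) {
        node = nextIndex[node];      // INT pipe: index arithmetic
        accum += 1.0f / dist[node];  // FP pipe: shading-style math
    }
    return accum;
}
```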
As the video says, the result was just enough of a bump in performance that you could get the same frame rates with RT as you used to be able to get without RT. Which is a huge accomplishment, but as the video points out, less than people expected. What the video doesn't mention is that because this design removes even the small ILP that Pascal was able to extract, the leap in performance when RT is off is also lower than expected. The design isn't as efficient.
Now Ampere starts to make sense. Instead of being a return to the Pascal design, it's more like an evolution of Turing. It keeps Turing's structure of having no second dispatch port, instead getting parallelism from two data paths. It just restores the combined INT/FP ALU design in one of the two paths. And it goes further, allowing the schedulers to dispatch to the RT unit or the tensor unit as well. Again, instructions take multiple clocks to run, so instead of the scheduler issuing an instruction, going to sleep till it completes, then waking up again, the scheduler is constantly able to dispatch instructions, at the cost of potentially leaving one of the paths inactive for an instruction or two.
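A toy way to picture the payoff: a stretch of pure FP32 code, which would leave Turing's INT pipe idle, can fill both of Ampere's data paths:

```cuda
// Hypothetical FP-heavy loop. On Turing, only the FP32 pipe could run
// this, leaving the INT pipe idle. On Ampere, the INT/FP32 path can pick
// up FP32 work too, so both paths stay busy on the same code.
__device__ float fp_heavy(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];      // one FMA stream for the FP32 path...
        s1 += a[i + 1] * b[i + 1];  // ...an independent one for the other
    }
    return s0 + s1;
}
```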
The downside of this design is that while on paper the amount of INT compute has stayed the same, it's now competing with FP compute again. They're not blocking each other like before, but they do share resources and can compete. That's probably fine - the amount of INT compute was probably too high, even for RT workloads. But it means the RT core needs to claw back a little of the performance lost to competing for the INT/FP32 data path.
Part of Nvidia's solution is to just allow the RT cores to run in parallel with everything else. That hides the cost of RT - but it doesn't make RT faster, it just keeps the cost hidden as long as RT isn't the slowest operation in flight. You still need to speed up the RT core itself. That's why the second gen RT core was such a redesign: it needed to run faster. Nvidia hasn't talked much about how they doubled throughput on the RT core, but at least part of the solution was upping the size of the L1 cache in each SM.
Right now, none of this changes my thoughts on what Ampere can do in the Switch, but it makes some of the decisions click for me in a new way. If nothing else, the emphasis on parallel operation shows how dedicated ports might squeeze out extra performance by overlapping RT/upscaling/rendering tasks in ways that PC games don't.