Remember how I said each compiler optimization was a
pass?
In a normal compiler, the first pass is the
parsing pass. It does zero optimization; it just reads the original code and turns it into an "intermediate representation" - IR for short - basically a version of the original code that the compiler can take notes on. Then subsequent passes go through that IR to make all their optimizations, and a final pass walks the updated IR to generate the final code.
But in this case, we already have optimized code. We just need to convert it to a new architecture. And those architectures are almost identical. So we can skip all the optimization passes. And actually, if we're not optimizing, we don't need the intermediate representation, because we're not taking notes. And if we don't have an IR, we don't even need to bother to read all the original code at once, because we're never going to make changes to one part of the code based on a part far far away. We just need to scan for lines that use old instructions, and then replace them with new ones.
For example, imagine that Drake actually removes the move instruction. Move takes some data, removes it from where it used to be, and puts it in a new location. Nvidia decides that's rare enough that anyone who wants to do it can just
copy the data and then delete the old data. Let's look at our final code from before.
Code:
move 2.5 to screen.address
move 5 to screen.address
move 7.5 to screen.address
move 10 to screen.address
move 12.5 to screen.address
move 15 to screen.address
move 17.5 to screen.address
move 20 to screen.address
move 22.5 to screen.address
move 25 to screen.address
We don't actually need to delete the old values, so
copy works just fine for us. But we want our transpiler to be
fast, so we don't scan the code at all to figure that out. Instead we just go line by line, and whenever we see a
move, we replace it with two lines: a copy and a delete.
Code:
copy 2.5 to screen.address
delete NULL ; it's a constant, there is nothing to actually delete
copy 5 to screen.address
delete NULL
copy 7.5 to screen.address
delete NULL
copy 10 to screen.address
delete NULL
copy 12.5 to screen.address
delete NULL
copy 15 to screen.address
delete NULL
copy 17.5 to screen.address
delete NULL
copy 20 to screen.address
delete NULL
copy 22.5 to screen.address
delete NULL
copy 25 to screen.address
delete NULL
We can do this extremely fast because we didn't bother to optimize away the deletes. No IR, no optimization passes - heck, you don't even need to keep the whole program in memory before you start. You can do it line by line, squirting out new instructions while the game is still sending you the old ones.
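That line-by-line rewrite can be sketched in a few lines of code. This is a minimal illustration, not any real transpiler: the `move`/`copy`/`delete` syntax is the made-up instruction set from the example above, and the function names are invented here.

```python
def transpile_line(line: str) -> list[str]:
    """Rewrite one old-architecture instruction into new ones.

    No IR, no whole-program analysis: each line is handled entirely
    on its own, which is what lets us stream the output.
    """
    stripped = line.strip()
    if stripped.startswith("move "):
        # The removed 'move' becomes a copy plus a delete. We never
        # check whether the delete is actually needed -- that check
        # would be an optimization pass, and we're skipping those.
        return ["copy " + stripped[len("move "):], "delete NULL"]
    return [stripped]

def transpile(old_program):
    """Stream old instructions in, new instructions out, one at a time."""
    for line in old_program:
        yield from transpile_line(line)

old = ["move 2.5 to screen.address", "move 5 to screen.address"]
new = list(transpile(old))
# new is:
# ['copy 2.5 to screen.address', 'delete NULL',
#  'copy 5 to screen.address', 'delete NULL']
```

Because `transpile` is a generator, it emits each rewritten instruction as soon as the corresponding old one arrives - nothing is buffered, which is exactly the "squirting out new instructions while the game is sending you the old ones" behavior described above.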
But... all our beautiful optimizations! This new shader takes
twice as long to execute as the old shader. Yet we can't afford to optimize much either. We've got a game
running as we speak, and it can't do
anything until this shader starts executing. If we take too long to deliver this shader, we create stutter, and the longer we spend optimizing it, the longer that stutter lasts. What to do?
Don't worry about it. Drake is 6x as fast as the original Switch! This is the opposite of the normal situation. Normally you want to spend as long compiling as possible and make the shader as fast as possible, because you only compile once but you run the shader over and over again.
But now we're in this weird position where compiling slows the game down, while even a slow shader will still run blazingly fast on the new hardware. It's more important to make compiling - or in this case, transpiling - as fast as possible than to produce the fastest possible shader.