Universe 1.5: For reasons of cost/availability/R&D, they settle on using the 8nm process after Nvidia estimates how much the power consumption can be improved with a custom SOC design that prioritizes efficiency where Orin doesn't, removes various unnecessary components, and implements FLCG.
Oldpuck has already responded, but I'd agree with him that the 2x improvement in power efficiency required to make this work is way too much to expect from Drake's GPU. The biggest gen-on-gen power efficiency improvement of recent years without a change in manufacturing process is RDNA to RDNA2, where AMD claimed a 54% improvement in performance per Watt, and that required pretty fundamental changes right across the architecture. Here's AMD's slide on how they hit a 54% improvement:
Nvidia would need almost twice that improvement over Orin's GPU to hit a handheld power draw similar to TX1's (let alone Mariko's), which is just too big a stretch for me.
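To put rough numbers on that comparison, here's a back-of-the-envelope sketch. The function name is just for illustration, and the only inputs are the 2x figure discussed above and AMD's claimed 54%:

```python
def improvement_pct(perf_per_watt_ratio):
    """Convert a performance-per-Watt ratio (e.g. 2.0 for 2x) to a percentage gain."""
    return (perf_per_watt_ratio - 1.0) * 100.0

required = improvement_pct(2.0)  # the 2x perf/W gain discussed above
rdna2 = 54.0                     # AMD's claimed RDNA -> RDNA2 gain, in percent

ratio = required / rdna2
print(f"required: {required:.0f}%, RDNA2: {rdna2:.0f}%, ratio: {ratio:.2f}x")
```

So a 2x gain is a 100% improvement, a bit under twice the 54% AMD claimed.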
That doesn't mean I'm ruling out 8nm by any means, it just means that if it's 8nm, then I don't think they can run all 12 SMs in portable mode, in a device the same size and form-factor as the current models, with at least as good a battery life as the original model. Something's got to give, whether it's size or battery life or form-factor or whatever. I honestly wouldn't rule out a surprising change to the form factor. We're all assuming this new device is going to be a Switch-but-faster, even though we don't have any conclusive evidence on that front.
Thanks for porting this. To provide context for those who aren't sure what this means, Nvidia assigns a "compute capability" version number to every GPU which precisely defines the GPU architecture used for the purposes of CUDA compilation. So, for example, the Ampere A100 chip is 8.0 (or sm_80), the desktop Ampere chips are 8.6 (sm_86), and Ada chips are 8.9 (sm_89). If you compile CUDA directly to a binary for a specific architecture (which creates a "cubin" file), you have to specify the version you want to compile for, and this determines compatibility. From Nvidia's CUDA documentation:
A cubin generated for a certain compute capability is supported to run on any GPU with the same major revision and same or higher minor revision of compute capability. For example, a cubin generated for compute capability 8.6 is supported to run on a GPU with compute capability 8.9; however, a cubin generated for compute capability 8.9 is not supported to run on a GPU with compute capability 8.6, and a cubin generated with compute capability 8.x is not supported to run on a GPU with compute capability 9.0.
So if you compile for 8.x, then you can run that on a GPU supporting 8.y, so long as y>=x, with one caveat that I'll get to later.
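As a minimal sketch of that rule (the function name and the tuple representation are my own, and it deliberately ignores the caveat I'll get to later), the check boils down to two comparisons:

```python
def cubin_supported(compiled_for, gpu):
    """Return True if a cubin built for `compiled_for` is supported on `gpu`.

    Both arguments are (major, minor) compute capability tuples.
    Rule as quoted from the CUDA documentation: same major revision,
    same or higher minor revision.
    """
    built_major, built_minor = compiled_for
    gpu_major, gpu_minor = gpu
    return built_major == gpu_major and gpu_minor >= built_minor

# The three cases from the quoted documentation:
print(cubin_supported((8, 6), (8, 9)))  # 8.6 cubin on an 8.9 GPU
print(cubin_supported((8, 9), (8, 6)))  # 8.9 cubin on an 8.6 GPU
print(cubin_supported((8, 6), (9, 0)))  # 8.x cubin on a 9.0 GPU
```

The first case is supported and the other two are not, matching the documentation's examples.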
Underlying that is the way Nvidia manages their shader ISA. Nvidia's GPUs have used variations of the same shader ISA on GPUs going back for quite a long time now, and one of the benefits of GPU shader ISAs over CPU ISAs is that backwards compatibility isn't really essential. Shaders in PC games are generally run-time compiled, and compute code can be recompiled, or use an intermediate format like PTX with run-time compilation. Nvidia therefore take the opportunity every few years to reboot their shader ISA, dropping support for old instructions or changing behaviour in a way that would break backwards compatibility. These are the major versions of the SM numbers above. Between those, we only see additions to the ISA without breaking backwards compatibility. For example Ada (sm_89) is fully backwards compatible with code compiled for Ampere (sm_80/sm_86), but also adds some new functionality (eg FP8 tensor core instructions).
The specifics of that tweet aren't directly relevant to the new Switch (as it's not going to use an sm_90/sm_90a GPU, which is Hopper), but the topic is definitely of interest, as it relates to two different topics that are still a bit up in the air: backwards compatibility and architectural changes from Orin.
Backwards compatibility on the GPU side is relatively easy to explain when you compare the compute capability version numbers of the GPUs. TX1's GM20B GPU is 5.3, and Drake's GA10F GPU is 8.8. There's a difference in major version, so shader code compiled for GM20B won't run directly on GA10F. Console games typically (always?) ship with compiled shaders, which is why this is an issue for console BC, but not for PCs.
That said, it's worth noting that it's not a completely different ISA, so not like moving from ARM to x86 or from an Nvidia GPU ISA to an AMD GPU ISA. Nvidia don't publish full documentation on their shader ISA, but they do provide lists of supported instructions on each architecture, which you can find here. We can see from there that about 75% of the instructions in the Maxwell instruction set are still in place in Ampere, meaning 75% of instructions might not need translation to run on an Ampere GPU. Of course, BC isn't much use unless it covers 100% of an instruction set, and Nvidia and Nintendo need a way of handling the other 25%, along with any idiosyncrasies within those shared instructions.
One option is to fully decompile and recompile each shader. This has the benefit that it's probably the easiest to implement in a crude format (Nvidia already has decompilation tools), and it would work from any source architecture to any target architecture, regardless of whether there's any similarities in the ISA. It would also allow you to make more efficient use of the new architecture (eg the 2xFP32 in Ampere), because you're performing the full compilation for that architecture. The problem is that recompilation can be slow, and if you're performing it at runtime you can get noticeable shader compilation stutter which can be annoying for players.
Another option is binary translation. In this case you don't decompile the code, you just go through the binary one line at a time, leaving in place every instruction that's still supported by the new architecture and replacing every instruction that's no longer supported. The advantage of this is that, compared to recompilation, it's very quick and could probably be performed at runtime with no noticeable performance impact. The disadvantage is that there isn't always a one-to-one mapping of old instructions to new instructions, so it's not necessarily a trivial translation process, even if you're on very similar ISAs.
Some of these changes aren't a big deal. For example, Maxwell supports both DSET (FP64 Compare And Set) and DSETP (FP64 Compare And Set Predicate), whereas Ampere only supports DSETP. I assume you could convert a DSET instruction to a DSETP instruction by simply adding an always true predicate, and FP64 shader code is never used in games anyway, so you can get easy compatibility without having to worry about performance (the commonly used FP32 compare and set instructions are unchanged from Maxwell). However, if you look at the control instructions, there are a lot of changes there, and it's not necessarily obvious how you would precisely match the behaviour of the Maxwell code with Ampere instructions, as Ampere approaches control instructions like synchronisation in a different way.
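To make the binary translation idea a bit more concrete, here's a toy sketch of that kind of pass. Everything in it is an assumption for illustration: the instruction sets are tiny hand-picked subsets rather than Nvidia's real lists, the operands are made up, the DSET to DSETP rewrite just follows my guess above without modelling the operand fix-up, and SSY stands in for the Maxwell control instructions that have no obvious one-to-one replacement.

```python
# Toy subsets only: hand-picked mnemonics for illustration, not Nvidia's full lists.
TARGET_SUPPORTED = {"FADD", "FSETP", "DSETP"}

# Old opcodes assumed to have a simple one-to-one replacement on the target.
# DSET -> DSETP follows the guess in the post; operand changes are not modelled.
SIMPLE_REWRITES = {"DSET": "DSETP"}

def translate(program):
    """Translate a list of (opcode, operands) pairs one instruction at a time.

    Returns (translated, unresolved), where `unresolved` holds the indices of
    instructions with no direct mapping, which would need a hand-written
    replacement sequence.
    """
    translated, unresolved = [], []
    for index, (opcode, operands) in enumerate(program):
        if opcode in TARGET_SUPPORTED:
            translated.append((opcode, operands))  # still supported: leave in place
        elif opcode in SIMPLE_REWRITES:
            translated.append((SIMPLE_REWRITES[opcode], operands))
        else:
            translated.append((opcode, operands))  # placeholder, still untranslated
            unresolved.append(index)
    return translated, unresolved

maxwell_like = [
    ("FADD", "R0, R1, R2"),
    ("DSET", "R4, R6, R8"),
    ("SSY", "0x120"),  # control instruction used here as the hard case
]
translated, unresolved = translate(maxwell_like)
print(translated)
print(unresolved)
```

The first two branches are the cheap part that makes translation attractive at runtime. The `unresolved` list is where the real work would be, since those are the instructions whose behaviour has to be matched with a different sequence on the new architecture.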
The other side of these compute capability numbers is that Drake has one to itself: 8.8. Looking at the 8.x family of compute capability, we've got:
8.0 - A100 - HPC Ampere
8.6 - GA102, etc. - Gaming Ampere
8.7 - GA10B - Orin
8.8 - GA10F - Drake
8.9 - AD102, etc. - Ada
The fact that it doesn't have compute capability 8.7 means that Drake features some shader compatibility changes from Orin. That is, CUDA code compiled for Drake won't run on Orin. One important thing to note here is the caveat I mentioned above: guaranteed forward compatibility of compiled CUDA code within a major version excludes Tegra/SoC GPUs, or at least has done in the past. Orin and Drake are within the standard numbering system, but they're maybe better thought of as a fork off to the side. There's no guarantee that CUDA code compiled for Orin or Drake will run on Ada GPUs.
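Putting the 8.x list and the caveat together, here's a rough sketch of how I'd read the guarantee. The names are my own, and treating any pairing that involves an SoC GPU as "not guaranteed" is a conservative assumption on my part rather than something Nvidia spells out in this form:

```python
# Compute capability 8.x family as listed above.
CC_8X = {
    "A100": (8, 0),
    "GA102": (8, 6),
    "GA10B (Orin)": (8, 7),
    "GA10F (Drake)": (8, 8),
    "AD102": (8, 9),
}

# Tegra/SoC parts, treated as a fork off to the side of the main numbering.
SOC_GPUS = {"GA10B (Orin)", "GA10F (Drake)"}

def cubin_guaranteed(built_for, runs_on):
    """Return True only if running the cubin is covered by the documented rule.

    False means "no guarantee", not necessarily "will not run".
    """
    if built_for == runs_on:
        return True
    # Assumption: the forward-compatibility guarantee is not relied on
    # whenever an SoC GPU is on either side.
    if built_for in SOC_GPUS or runs_on in SOC_GPUS:
        return False
    built_major, built_minor = CC_8X[built_for]
    gpu_major, gpu_minor = CC_8X[runs_on]
    return built_major == gpu_major and gpu_minor >= built_minor

print(cubin_guaranteed("GA102", "AD102"))                 # desktop Ampere cubin on Ada
print(cubin_guaranteed("GA10F (Drake)", "GA10B (Orin)"))  # Drake cubin on Orin
print(cubin_guaranteed("GA10F (Drake)", "AD102"))         # Drake cubin on Ada
```

Only the first pairing comes back as guaranteed. The Drake cases fall outside it either way: 8.8 to 8.7 fails the minor version check on its own, and 8.8 to 8.9 is only excluded because of the SoC caveat.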
So there are some additional instructions or features which Drake has over Orin, which aren't necessarily pulled from Ada. I've been thinking about this for a while, and I think the most likely reason for this is actually what I was discussing above. Namely, that Nvidia has added or changed some number of instructions on the Drake shader ISA to more easily facilitate translation-based BC with TX1 shaders. There's not really much else that I can think of which would warrant any change over Orin. The only SM-level changes that could have been back-ported from Ada would have been the updated tensor cores and RT cores, but I think we have good indications from the leak that we're looking at standard Ampere tensor and RT cores. Meanwhile one of the benefits of designing a custom SoC for your new console is that you can make these kinds of changes to achieve better backwards compatibility with existing games.