I told you all it wouldn't have 8 SMs!
Seriously, though, that's nuts. Like "I know this is coming from a hack from Nvidia and Nvidia have actually confirmed the hack happened but I still kind of think it's fake" nuts. On 8nm, even ignoring portable mode for a moment, I can't even see them running all 12 SMs in docked mode in the same form factor as the base Switch, at least not without a very loud fan. Off the top of my head, I can think of the following possibilities:
- The T239-based Switch "Pro" was a TV-only console all along. This makes less business sense to me (a hybrid device would sell more, justifying the development costs), but it would make a larger, more power-hungry chip much more reasonable.
- This isn't Samsung 8nm (and it seems kopite isn't sure on this any more). My guess would be a Samsung 5nm process, but it could be anything. Even on a TSMC 5nm process (which I still think is very unlikely), I'd be surprised to see them running all 12 SMs at any clock in portable mode.
- They're disabling some of the SMs in portable mode, cutting it down to 6 SMs, or even just 4, to save power.
My guess is a combination of the second and third options: a Samsung 5nm process of some kind, running only 6 SMs in portable mode. With portable/docked clocks of 400/800MHz (being a bit conservative here), that would put us at about 600 Gflops handheld and 2.4 Tflops docked. This actually makes some sense when you consider the much bigger resolution difference between handheld and docked once 4K comes into play. The original Switch had a roughly 2x difference in resolution between 720p and 1080p, and roughly a 2x difference in performance. Here we have a 4x difference in performance, and possibly a 4x difference in resolution from 1080p handheld to 4K docked.
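For anyone who wants to check the arithmetic, here's a rough sketch of where those numbers come from. It assumes 128 FP32 CUDA cores per SM (as on desktop Ampere) and 2 FLOPs per core per clock for a fused multiply-add; the SM counts and clocks are just my guesses from above, and `fp32_gflops` and `pixels` are names made up for the example.

```python
# Assumptions (not taken from the leak): 128 FP32 CUDA cores per SM, as on
# desktop Ampere, and 2 FLOPs per core per clock (one fused multiply-add).
CORES_PER_SM = 128
FLOPS_PER_CORE_PER_CLOCK = 2


def fp32_gflops(sm_count: int, clock_mhz: float) -> float:
    """Peak FP32 throughput in Gflops for a given SM count and clock."""
    return sm_count * CORES_PER_SM * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000


def pixels(width: int, height: int) -> int:
    """Total pixel count of a frame."""
    return width * height


handheld = fp32_gflops(sm_count=6, clock_mhz=400)   # guessed portable config
docked = fp32_gflops(sm_count=12, clock_mhz=800)    # guessed docked config

print(f"handheld: {handheld:.1f} Gflops")
print(f"docked:   {docked:.1f} Gflops")
print(f"performance ratio: {docked / handheld:.2f}x")
print(f"720p -> 1080p: {pixels(1920, 1080) / pixels(1280, 720):.2f}x")
print(f"1080p -> 4K:   {pixels(3840, 2160) / pixels(1920, 1080):.2f}x")
```

Halving the SM count and halving the clock each cut peak throughput in half, which is where the 4x gap between the two modes comes from.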
Warp counts aren't related to CUDA cores; they're more about the amount of register memory, the capability of the thread scheduler and things like that. The actual number of warps you can run concurrently on an SM isn't always exactly that max, either, depending on how you're using them. It does indicate that there are more low-level differences between desktop Ampere and T239 than we're aware of, but my guess is that it's probably shared with Orin, and may be one of a number of low-level optimisations around power consumption.
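As a very simplified illustration of why the warps actually resident on an SM can fall short of the max, here's a sketch that only looks at register pressure. It ignores shared memory, block-size limits and register allocation granularity, and the numbers fed in (a 64K-entry register file, a 48-warp scheduler limit) are illustrative values, not T239 figures. `resident_warps` is a made-up helper, not anything from Nvidia's tooling.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs


def resident_warps(max_warps_per_sm: int, registers_per_sm: int,
                   registers_per_thread: int) -> int:
    """Warps that fit on one SM when only register use is considered."""
    registers_per_warp = registers_per_thread * WARP_SIZE
    register_limited = registers_per_sm // registers_per_warp
    # The scheduler's max warp count caps whatever the register file allows.
    return min(max_warps_per_sm, register_limited)


# Illustrative values only (assumptions, not T239 specs): a 65536-entry
# register file and a scheduler limit of 48 warps per SM.
for regs_per_thread in (32, 64, 128):
    warps = resident_warps(48, 65536, regs_per_thread)
    print(f"{regs_per_thread} registers/thread -> {warps} resident warps")
```

With light register use the scheduler limit is what binds; once a kernel uses more registers per thread, the register file becomes the limit instead, and none of that depends on the CUDA core count.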
Thraktor is very confused!
Clock gating means being able to stop delivering a clock signal to certain functional units to save power. Specifically in this case, it means being able to disable some SMs while keeping others running. So it would very much look like point 3 above is true, and they're looking at disabling SMs in portable mode.