But the double-wide Orin tensor cores would be the more efficient option in performance per Watt if Nintendo's looking purely for DLSS performance. Let's say you had a 6 SM part based on desktop Ampere and wanted to double the tensor core performance. If you switched to Orin's double-wide tensor cores, then you're approximately doubling the power consumption from the tensor cores*, and that's about it. However, if you kept the standard tensor cores but doubled up on the number of SMs, you're doubling the power consumption for everything. Tensor core power consumption doubles, because you've got twice as many of them, but you're also adding extra standard CUDA cores, extra RT cores, extra texture units, extra control logic, and all the additional wiring, logic and associated power consumption from moving data and instructions to and between these units.
* I'd actually assume the power consumption of Orin's tensor cores is less than double the power consumption of standard Ampere tensor cores, as while you're doubling the ALU width, there'll be a certain proportion of instruction decode and control logic which won't be doubled.
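To make the shape of that argument concrete, here's a quick back-of-the-envelope sketch in Python. The per-SM power split is completely made up (it's only there to show how the two options scale), so don't read anything into the absolute numbers:

```python
# Back-of-the-envelope comparison of the two ways to double tensor throughput.
# All per-SM power figures are made-up placeholders, not real Ampere/Orin numbers.

TENSOR_W_PER_SM = 0.5   # hypothetical: tensor cores' share of one SM's power
OTHER_W_PER_SM  = 1.0   # hypothetical: CUDA cores, RT core, TMUs, control, wiring

def option_a_double_wide(sms=6):
    """Keep 6 SMs, swap in Orin-style double-wide tensor cores."""
    # Only the tensor core power roughly doubles (likely a bit less in
    # practice, since decode/control logic isn't duplicated).
    return sms * (2 * TENSOR_W_PER_SM + OTHER_W_PER_SM)

def option_b_double_sms(sms=12):
    """Double the SM count, keep standard GA10x tensor cores."""
    # Everything doubles: tensor cores, CUDA cores, RT cores, texture units,
    # control logic and the wiring that feeds them.
    return sms * (TENSOR_W_PER_SM + OTHER_W_PER_SM)

print(f"Option A (6 SMs, double-wide tensor): {option_a_double_wide():.1f} units")
print(f"Option B (12 SMs, standard tensor):   {option_b_double_sms():.1f} units")
# With these placeholder figures: A = 12.0 vs B = 18.0, i.e. the same tensor
# throughput for noticeably less power, which is the whole point above.
```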
A100's tensor cores do operate at twice the operations per clock of the GA10x series; it's detailed in Nvidia's GA102 white paper (page 25). These are purely theoretical ops/clock figures, so the cache wouldn't come into it (although it surely does play a part in real-world performance differences).
My "evidence" is simply that, if it's manufactured on Samsung N8, I personally find it extremely unlikely that they would be able to run all 12 SMs in portable mode and manage an acceptable battery life. You're welcome to disagree with that. If it's on a better manufacturing process, or is just a physically larger device than Switch, and therefore able to fit a much larger battery, then potentially they could run all 12 SMs in portable mode, but on N8 and with a ~5000mAh battery I don't see all 12 SMs being viable in portable mode.
Incidentally, the things you mention aren't really all that different to the current Switch, as developers already have to perform a mode change when docking or undocking, changing resolutions and graphical effects, and managing changes to available GPU resources and bandwidth. The only issue I could see is if developers on Switch are able to assign warps to individual SMs. This isn't a thing on PC, for obvious reasons, but might be the case in the console space. This might require developers to have separate sets of SM affinity mappings for docked and handheld, although if it were a clean cut from 12 SMs to 6, Nintendo could just implement a system-level modulo 6 operation on SM affinity in portable mode (e.g. a warp assigned to SM 9 would go to SM 3), which would keep warps together and evenly distributed without additional developer effort.
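A rough sketch of what I mean, in Python for readability; the function and SM indexing are invented for illustration, not any real NVN/Nvidia API:

```python
# Minimal sketch of the "modulo 6" idea: if a game shipped SM affinity
# mappings assuming 12 SMs, the system could fold them onto 6 SMs in
# portable mode without any per-game work.

PORTABLE_SM_COUNT = 6

def remap_sm_affinity(requested_sm: int, portable: bool) -> int:
    """Return the SM a warp actually lands on for the current mode."""
    if portable:
        return requested_sm % PORTABLE_SM_COUNT
    return requested_sm

# Docked: warps go where the developer asked. Portable: SM 9 folds onto SM 3,
# SM 6 onto SM 0, etc., keeping warps grouped and evenly spread across the
# remaining SMs.
for sm in range(12):
    print(sm, "->", remap_sm_affinity(sm, portable=True))
```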
Thanks for the additional context on this. Out of interest, does it explicitly state that FLCG stands for first-level clock gating, or are you inferring it? It just strikes me as a bit odd that they would add something called second-level clock gating and only years later add first-level clock gating. Unless, as you say, FLCG was something that already existed in some form but was never really utilised or exposed.
Looking into this, there's actually an additional clock gating mode, BLCG (block-level clock gating). The info I can find on this seems to suggest it was the first type of clock gating implemented by Nvidia, and it seems to operate at a very high level. The next level is then SLCG (second-level clock gating), which is lower level. Incidentally, I found this commit for T210 (Tegra X1) which adds support for SLCG within Linux, and in that particular commit it looks like SLCG just covers interfaces, codec blocks, and so forth, rather than core GPU logic. The code you link, by the way, isn't actually driver source code; it's a Verilog file for Nvidia's DLA hardware. It's likely Nvidia uses the same conventions for what counts as SLCG there, though. My guess is that FLCG would then probably operate at a lower level again.
In any case, it's interesting that there is at least one Ada GPU feature which is supported in Drake but didn't make it into Orin. So if Orin's GPU is a half-step between Ampere and Ada, Drake's is maybe a two-thirds step.
Honestly I don't know. But there are two separate questions there. The first is what performance is required to get Death Stranding itself to render the image data that gets fed into DLSS, and the second is what performance is required to get DLSS to bring that up to 4K. Either of these could be the bottleneck. On the first question, that's entirely on a game-by-game basis. The second question is much more predictable, and Digital Foundry has a good video looking into it for a potential Switch Pro/2. However, there's also the possibility that the DLSS used on the new Switch won't be identical to the PC version, and it may have been optimised to perform better at the cost of a bit of image quality. So in that case the answer would change again.
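As a toy illustration of how those two questions interact, something like the below; the frame budget and the DLSS cost figure are placeholders, not measurements:

```python
# Toy frame-time budget for the "two separate questions" above. The DLSS
# cost is an invented placeholder standing in for the kind of roughly fixed
# per-output-resolution cost Digital Foundry measured; the render time is
# whatever the game itself needs, which varies per title.

FRAME_BUDGET_MS = 33.3   # 30 fps target (assumption)
DLSS_4K_COST_MS = 8.0    # hypothetical fixed cost to upscale to 4K on this GPU

def fits_in_budget(game_render_ms: float) -> bool:
    """Question 1 (game render) plus question 2 (DLSS) must fit the frame."""
    return game_render_ms + DLSS_4K_COST_MS <= FRAME_BUDGET_MS

print(fits_in_budget(20.0))  # True: the game, not DLSS, is the larger cost here
print(fits_in_budget(28.0))  # False: either the game render needs cutting back,
                             # or a cheaper, image-quality-trading DLSS pass
                             # could tip it back under budget
```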