Regarding the DLSS cost discussions, one thing that often comes up is that it's a fixed per-frame cost, and that 4K upscaling will be very heavy with the limited number of tensor cores a Switch 2 would have. But I almost never see any mention of DLSS concurrency, which Nvidia introduced with Ampere. See this video where Nvidia goes through some of the changes from Pascal to Turing to Ampere:
You can see here how, by running DLSS concurrently with the RT and shader work, the frame time can be reduced by 0.7 ms. That's obviously not a huge boost, but this is a 3080, so 0.7 ms is roughly the entire DLSS upscaling cost on a card that powerful; the slower the DLSS pass, the bigger the potential gain. There is obviously a latency cost, since DLSS runs on the previous frame while the current frame is being rendered, but while something like 10 ms would be a significant chunk of your frame-time budget, it would be a barely noticeable increase in input latency.
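To make that trade-off concrete, here's a toy back-of-the-envelope sketch (Python, with completely made-up numbers, nothing measured from real hardware):

```python
# Toy model of the pipelining idea above; all numbers are illustrative
# assumptions, not measurements of any real GPU.

render_ms = 14.0  # hypothetical shader/RT work per frame
dlss_ms = 4.0     # hypothetical DLSS upscale cost on a slower, Switch-2-class GPU

# Serial (Turing-style): each frame waits for its own DLSS pass to finish.
serial_frame_time = render_ms + dlss_ms

# Concurrent (Ampere-style): frame N's DLSS pass overlaps frame N+1's render,
# so the steady-state frame time is set by the longer of the two stages.
concurrent_frame_time = max(render_ms, dlss_ms)

frame_time_saved = serial_frame_time - concurrent_frame_time

# The trade-off: the upscaled image is presented while the *next* frame is
# rendering, i.e. up to one DLSS pass after the new, shorter frame boundary.
extra_display_lag = dlss_ms

print(f"serial frame time:     {serial_frame_time:.1f} ms")
print(f"concurrent frame time: {concurrent_frame_time:.1f} ms "
      f"(saves {frame_time_saved:.1f} ms per frame)")
print(f"image trails the frame boundary by ~{extra_display_lag:.1f} ms")
```

The point of the sketch is just that the frame-time saving scales with how long the DLSS pass takes, which is exactly why it matters more on a small tensor-core budget than on a 3080.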
Is there a reason this isn't talked about? I don't know if it's ever been implemented in PC games; most non-Turing cards capable of DLSS probably upscale so quickly that it isn't worth bothering with. But Nvidia clearly states that Ampere is capable of this type of concurrency, while Turing is not, so using a 20-series card as a benchmark for Switch 2 DLSS performance seems like it could be quite misleading.