The whole concurrency situation with CPU/GPU/DLSS is something even retro consoles have been doing for a long time with the CPU/PPU split. It's not about pushing a whole frame's worth of data through every phase from beginning to end before starting the next. It's about setting up the next stage's work before moving on, and not wasting cycles where possible.
Take the NES, for example. The PPU (Picture Processing Unit) handles all rendering of backgrounds, sprites, and whatnot by reading from memory-mapped registers, where each register is linked to a particular thing, like a background's origin/position, a sprite's color palette, etc., along with a small VRAM pool of graphics tiles these registers can also reference. Every 16.66ms (NTSC, forgive me for not going into the details of this vs PAL), the PPU runs through 2 phases: a VDraw phase, where it draws the frame one scanline at a time (scanlines 0-239; NTSC TVs typically crop the top and bottom 8 scanlines), and a VBlank phase, where it sits idle (scanlines 240-261), for a total of 262 scanlines. It doesn't wait for the CPU. It takes what's in its registers, renders the frame during the first phase, and sits idle for the second.
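That ~16.66ms figure falls right out of the standard NTSC NES clock numbers (these constants come from common NES hardware documentation, not anything above, so treat them as my assumption):

```python
# Standard NTSC NES timing figures (assumed from common NES documentation).
MASTER_CLOCK_HZ = 21_477_272        # NTSC master crystal
PPU_HZ = MASTER_CLOCK_HZ / 4        # PPU runs at master/4, about 5.369 MHz
CYCLES_PER_SCANLINE = 341           # PPU cycles per scanline
SCANLINES_PER_FRAME = 262           # VDraw + VBlank scanlines combined

cycles_per_frame = CYCLES_PER_SCANLINE * SCANLINES_PER_FRAME
frame_ms = cycles_per_frame / PPU_HZ * 1000
fps = PPU_HZ / cycles_per_frame

print(f"{frame_ms:.2f} ms per frame, {fps:.2f} fps")  # 16.64 ms per frame, 60.10 fps
```

(The real hardware skips one PPU cycle on alternating frames, which nudges the rate to the oft-quoted 60.0988fps, but that detail doesn't matter here.)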
From my understanding of games back then, the NES CPU handles the game logic for a frame, then calls the routine that waits for the next VBlank (scanline 240), regardless of how much work it had to do. Once that scanline is reached, it updates the PPU's registers with whatever needs updating based on the game logic it just processed, then immediately begins work on the next game logic frame. So think of the PPU's cycle as starting with VDraw, whereas the CPU's cycle starts with VBlank. Let's assume it takes just a few scanlines' worth of time to update those PPU registers, so the workflow could be something like this....
VBlank phase
- Scanline 240 --- VBlank phase starts. PPU goes idle. CPU starts updating PPU registers from last processed game logic frame.
- Scanline 240~244 --- CPU continues updating PPU registers from the last processed game logic frame.
- Scanline 244 --- CPU finishes PPU update. Begins processing next game logic frame. PPU remains idle.
- Scanline 244~261 --- CPU continues processing next game logic frame. PPU still idle.
- Scanline 261 --- VBlank ends; the count wraps around to scanline 0 next. PPU becomes active, ready to start rendering at scanline 0. CPU still processing the next game logic frame.
VDraw phase
- Scanline 0~199 --- VDraw phase starts. PPU renders 200 scanlines worth of the frame based on prior game logic frame in registers. CPU still processing next game logic frame.
- Scanline 200 --- CPU finishes the next game logic frame. Makes the call to wait for VBlank. PPU is rendering this scanline.
- Scanline 200~239 --- CPU waits. PPU finishes rendering based on last game logic frame.
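If it helps, that loop can be sketched in Python. This is purely illustrative (real games do this in 6502 assembly, usually via the NMI interrupt at VBlank; FakePPU and game_main_loop are made-up names), but it shows the key behaviors: the CPU only touches PPU registers during VBlank, and what you see on screen always lags the game logic by one frame:

```python
# Illustrative sketch of the classic NES frame loop; all names are made up.

class FakePPU:
    """Stand-in for the PPU's memory-mapped registers."""
    def __init__(self):
        self.registers = {}     # e.g. scroll position, palettes, sprite data

    def render_frame(self):
        # The PPU draws from whatever is in its registers; it never
        # waits on the CPU.
        return dict(self.registers)

def game_main_loop(ppu, n_frames):
    shadow = {}                 # CPU-side copy of the NEXT frame's PPU state
    drawn = []
    for frame in range(n_frames):
        # --- VBlank starts (scanline 240): safe to touch PPU registers ---
        ppu.registers.update(shadow)
        # --- Rest of VBlank + all of VDraw: compute the next logic frame ---
        shadow = {"scroll_x": frame + 1}    # placeholder game logic
        # --- CPU now idles ("wait for VBlank") while the PPU draws ---
        drawn.append(ppu.render_frame())
    return drawn

frames = game_main_loop(FakePPU(), 3)
print(frames)   # [{}, {'scroll_x': 1}, {'scroll_x': 2}]
```

Note the first rendered frame uses the initial (empty) registers: every register update the CPU computes shows up one frame later, which is exactly the pipelining.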
The CPU could finish its work sooner or later than that. Ever wonder why some games slow down? It's because the workload for a single game logic frame is too much to fit into one VDraw + VBlank span of time, so by the time the call to wait for VBlank is made, the start of VBlank has already passed, meaning the CPU has to wait until the next go-around. The game logic drops to 30fps. The PPU, however, continues to operate at 60fps regardless of the CPU. It has a fixed amount of time to spend per scanline before moving on to the next, using whatever is in its registers. If the CPU only updates those registers at 30fps, the PPU just renders from the same information twice before it changes.
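That 60-to-30 drop is just the hand-off being quantized to VBlank boundaries. A tiny sketch of the arithmetic (the function name is mine):

```python
import math

FRAME_MS = 1000 / 60    # one VDraw + VBlank period, ~16.67 ms

def effective_logic_fps(logic_ms):
    # The CPU can only hand results to the PPU at a VBlank boundary,
    # so its work time is rounded UP to a whole number of frame periods.
    periods = math.ceil(logic_ms / FRAME_MS)
    return 60 / periods

print(effective_logic_fps(12.0))   # fits in one frame  -> 60.0
print(effective_logic_fps(20.0))   # misses the VBlank  -> 30.0
print(effective_logic_fps(40.0))   # needs three frames -> 20.0
```

Which is why slowdown jumps straight from 60 to 30 to 20 instead of degrading smoothly: there's no such thing as handing off mid-frame.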
For something like the Switch 2 with more modern technology and different handling of the flow, to me, it's more like how instructions go through a CPU pipeline, or like an assembly line. First stage is the CPU, second stage is the GPU, and third stage in this case is DLSS. The CPU stage handles the game logic frame, and then waits for the GPU to be ready for the next frame. When the GPU is ready, the CPU hands over the information the GPU needs to begin. The GPU takes it, and begins rendering the frame based on that, letting the CPU start the next game logic frame. The GPU continues to render the frame, and once it's finished, it waits for the DLSS phase to finish its own task. It hands over the rendered frame (and whatever else it needs) so the DLSS stage can do its thing, thereby letting the GPU stage be ready to receive from the CPU stage.
Let's say there was only the CPU and GPU. If for a single frame the CPU stage took 3.33ms and the GPU stage 16.66ms, then every frame from start to finish would take ~20ms. But with concurrency, we visually see a frame change every 16.66ms (or 60fps). That is because the bottleneck is the GPU at 16.66ms, meaning the CPU sits idle for 13.33ms per frame, waiting on the GPU.
Now let's introduce DLSS, letting the GPU spend less time rendering by working at a lower resolution. If for a single frame the CPU stage took 3.33ms, the GPU stage 8.33ms, and the DLSS stage 8.33ms, that's still 20ms total from start to finish. But with concurrency, we'd visually see a frame change every 8.33ms (or 120fps). This is because the GPU bottleneck is reduced. The CPU only has to wait 5ms instead of 13.33ms after processing, so it's kept busier.
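You can check both scenarios with a small steady-state pipeline simulation (a generic in-order pipeline model, not anything Nintendo/NVIDIA actually publish): a stage starts frame N once it has finished frame N-1 and the previous stage has delivered frame N.

```python
def completion_times(stage_ms, n_frames):
    """Completion time of each frame at the final stage of a simple
    in-order pipeline. Stage s starts frame f once it has finished
    frame f-1 and stage s-1 has delivered frame f."""
    done = [[0.0] * len(stage_ms) for _ in range(n_frames)]
    for f in range(n_frames):
        for s, cost in enumerate(stage_ms):
            prev_frame = done[f - 1][s] if f else 0.0   # stage busy until then
            prev_stage = done[f][s - 1] if s else 0.0   # input available then
            done[f][s] = max(prev_frame, prev_stage) + cost
    return [done[f][-1] for f in range(n_frames)]

# CPU + GPU (3.33 + 16.66 ms): first frame done ~20 ms in,
# then one every 16.66 ms (~60 fps)
two_stage = completion_times([3.33, 16.66], 4)
print(two_stage)

# CPU + GPU + DLSS (3.33 + 8.33 + 8.33 ms): still ~20 ms end-to-end latency,
# but one frame every 8.33 ms (~120 fps)
three_stage = completion_times([3.33, 8.33, 8.33], 4)
print(three_stage)
```

The simulation shows the general rule: latency per frame is the SUM of the stage times (~20ms in both cases), but throughput is set by the SLOWEST single stage.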
The CPU has to wait on the GPU, the GPU has to wait on both the CPU and DLSS, and DLSS has to wait on the GPU. Like others have said, DLSS isn't free, but it can increase frame rates by giving the GPU the chance to spend less time rendering at a lower resolution. So even with something like BotW demoed on Switch 2 at 4K60, as long as no stage in this pipeline (whether 3 stages or 100) goes above 16.66ms, it can hit 60fps.
I'm done rambling for a bit now.