Just going through a bunch of stuff related to the DF video.
People are already aware of that; 4 extra SMs isn't enough of a difference to skew these results when the samples are unoptimized ports running on Windows.
I think it's fairer to think of this as a docked test. The lower-than-anticipated clock plus the extra cores works out to roughly 3 TFLOPS. Obviously it's not exactly the same, but it's very close.
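Back-of-the-envelope, assuming the usual Ampere layout of 128 FP32 CUDA cores per SM (the 2050M's 16 SMs are known; T239's 12 SMs are from the leak, and the ~1GHz docked clock is speculation, not confirmed):

```python
# Peak FP32 = SMs x 128 CUDA cores x 2 FLOPs per FMA x clock (GHz), in TFLOPS.
# 2050M is 16 SMs; T239 is the rumored 12 SMs; the 1GHz docked clock is speculative.
def tflops(sms: int, clock_ghz: float) -> float:
    return sms * 128 * 2 * clock_ghz / 1000

print(tflops(16, 0.75))  # 2050M @ 750MHz -> ~3.07 TFLOPS
print(tflops(12, 1.00))  # T239  @ 1GHz   -> ~3.07 TFLOPS
```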
Rich actually worked on a more aggressive handheld downclock experiment, but the system simply became unstable at 500MHz. While he was trying to fix that, a firmware update from Dell prevented the underclock/undervolt from going that low anymore.
A big miss with the DF test was ignoring power draw. How many watts was the RTX 2050 pulling when clocked at 750MHz? If it's over 20 watts, that should have been a red flag that T239 can't be on 8nm.
Rich collected some power draw numbers. There are two problems with them, however. The first is that the 2050 Mobile is the lowest bin of that GPU die, so it's about as power-inefficient as it gets. The second is that the power data can't distinguish between the GPU and the VRAM, and the laptop still uses GDDR6, which is very power hungry.
The only number Rich shared with me was from when he was trying to get the undervolt working, which is a tricky prospect: 17W at 500MHz. I think 8nm is ruled out, personally, but there are unknowns there.
If it's not on 8nm, then it's very likely 4N, and 750MHz is closer to the portable profile than to docked. If the docked profile does end up being 1.1GHz, those tensor cores would be clocked roughly 47% higher than in their test (the 750MHz test clock sits about 32% below 1.1GHz).
Yes, but there are more cores. In terms of tensor operations per second, the 2050M@750MHz is the same as T239@1GHz, which is why I encouraged Rich to go with it. One of the questions that gets asked by Smart People Who Don't Follow This Thread is "Can a 3 TFLOP Ampere chip actually do X, Y, and Z?" and at the very least this is a definitive answer to that question.
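(Assuming the standard Ampere layout of 4 tensor cores per SM, that's 64 on the 2050M's 16 SMs vs 48 on T239's 12, and with the speculated 1GHz clock the parity is just cores × clock:)

```python
# Relative tensor throughput within one architecture scales as tensor cores x clock.
# 4 tensor cores per SM is the standard Ampere layout; the 1GHz T239 clock is speculative.
print(16 * 4 * 0.75)  # 2050M: 64 tensor cores @ 750MHz -> 48.0
print(12 * 4 * 1.00)  # T239:  48 tensor cores @ 1GHz   -> 48.0 (identical)
```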
The relevance of the 2050M's version of 3 TFLOPS (slower clocks, more cores) to T239 is left as an exercise for the viewer.
Do we know how much VRAM the T239 has? If it's 8GB, that's pretty damn good overall. Running RE4R with a 4070 Laptop GPU was already pretty snazzy, even if the RT on that game is considered "Eh" in comparison to a lot of other games.
As others have pointed out, it's a shared pool. This gives games some flexibility in how much they allocate to textures vs game logic.
It should be noted that Rich found some "egregious stutters" in Death Stranding that he traced to VRAM thrashing (the game stuttering as assets were rapidly dropped and new ones loaded), which should disappear in a Switch NG port.
Coming from a place wholly inexperienced with building laptops/desktops, I'm guessing there was no real way to swap out the RAM from a pre-built Dell Vostro to bump it up from 4GB to 6GB, let alone 8GB?
Yeah, it's soldered on even.
With 8 CPU cores at 1190MHz, 4 TPCs (8 SMs) at 612MHz, and the EMC (memory fabric) at 6400 MT/s under high load, it's drawing 18.6W. And that's just an approximation, because T239 has 6 TPCs instead of 4. So I'm definitely puzzled as to why he thinks these clocks are the sweet spot, since he supposedly had access to the same Jetson Power Tool, and also why he still thinks it's fabbed on 8nm. Unless he thinks the Switch 2 in handheld mode will emulate PC handhelds like the Steam Deck/ROG Ally and draw 15W.
"Sweet spot" = "best performance per watt." I believe Rich has access to some more robust Nvidia data on power curves from previous testing, but the ARM data comes from me, and IIRC, his data pretty much matched Thraktor's for Ampere curves.
Checked the site for the first time today and I see chat exploded for a different reason. I can't watch the Digital Foundry video since I'm in class rn. Is it looking good or bad? I saw some gif reactions and a lot of back and forth on clock speeds while scrolling back through.
It matches what I expected, let's say. Which is not to say that I, or any others (like @LiC or @Thraktor) who tend to think in the same perf envelope, are "right", but that our extrapolations from desktop benchmarks seem to hold up.
This is a fair point.
Alex's video on Alan Wake 2 PC settings shows a pretty significant hit to performance from setting post-processing to run at the output res (which includes depth of field), so that could be contributing here. Although the impact won't necessarily be the same as in AW2, since the two games may have different implementations.
Unfortunately it's very tricky to isolate the actual run time of DLSS itself, because of things like this. I think Rich's main point was just to emphasise that DLSS isn't a free lunch and there is a cost there, which is something we should keep in mind, even if the specific numbers he presented should be taken with a pinch of salt.
Yeah, opinion in the DF forums/Discord falls into two camps: "Just DLSS it, it's basically free" and "no way DLSS will even run on a chip that small." This video puts both to bed.
Are the tensor cores of an RTX 30XX faster than the ones found on an RTX 20XX, beyond possibly higher clocks? If they are, is the difference enough that 48 TCs from the 30 series can match 64 TCs from the 20 series?
In regard to Rich's video - the RTX 2050M is secretly a binned RTX 3050M, rebranded, so those are Ampere tensor cores. But to answer your question: the 30 series tensor cores aren't exactly "faster" per SM (Ampere has half as many tensor cores per SM as Turing, each with roughly double the throughput), but they do support a new kind of optimization, structured sparsity, which some code can take advantage of.
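For the curious: "structured sparsity" here is Ampere's 2:4 scheme, where at least 2 of every 4 weights are zero and the hardware skips them, for up to 2x tensor throughput. A toy sketch of pruning a weight matrix to that pattern (illustrative numpy only, not anything Nvidia ships):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 (Ampere's 2:4 pattern)."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |w| in each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, 0.7],
              [0.2, -0.8, 0.6, -0.01]])
print(prune_2_4(w))
# [[ 0.9  0.   0.   0.7]
#  [ 0.  -0.8  0.6  0. ]]
```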