Samsung processes:
8N - Samsung have had a few different manufacturing processes referred to as 8nm (which are updated versions of their older 10nm process), but for Nvidia they've always used 8N, which is the Nvidia-specific version of their 8nm family. This has been used for all Ampere gaming GPUs, and is being used for Orin, so is unsurprisingly the default choice for another Ampere-based SoC. Density, performance and power consumption on the process are all well-known (and would have been known during development of Drake), so not much more to add here.
7LPP/6LPP - 7LPP was Samsung's first process using EUV, and was used in a variety of smartphone SoCs and more recently for some IBM CPUs. Outside of IBM it hasn't seen any new chips in several years, and IBM are a bit of a special case, both in terms of their longer time-to-market and their R&D involvement (I believe Samsung's 7LPP leverages IBM R&D). With EUV tools in short supply, Samsung would have also been eager to move production lines over to 5nm/4nm processes, which would bring in more revenue. 6LPP was announced as an improved version of 7LPP, but later disappeared off roadmaps with no chips ever manufactured on it. I would consider 7LPP or 6LPP very unlikely at this stage.
5LPE/5LPP - Samsung considers 5LPE and 5LPP to be their second-generation EUV processes, and lists them alongside 7LPP as part of the same family (see here). Here 5LPE is the early version of the process, and 5LPP the improved version. They're largely compatible with 7LPP designs, which would explain why they replaced 7LPP so quickly for everyone except IBM. The 5nm family is a possibility for Drake, but the line between Samsung's "5nm" chips and their "4nm" chips is a little blurry, as I'll mention below.
4LPE/4LPP/4LPX - There are currently three different processes referred to as "4nm" by Samsung. Originally, Samsung listed 4LPE as an evolution of the 7LPP/5LPE line, but in 2021 they started listing 4LPE (along with the new 4LPP) as a new branch on their slides. According to TechInsights, 4LPE is "a major process node change with pitch scaling", and 4LPP seems to be an evolution of it. There is, however, another process called 4LPX, which was used for the Snapdragon 8 Gen 1, and is, according to TechInsights again, "essentially a 5LPE technology".
Samsung's 5nm and 4nm processes have been used for a variety of smartphone SoCs from Qualcomm and Samsung themselves (and Qualcomm continue to use them for chips like the Snapdragon 6 Gen 1). Most of the news about chips made on these processes is that they have very poor yields and aren't very power efficient. On the power efficiency side, they're certainly behind TSMC's N5/N5P/N4 processes, but something like 4LPE is still a big improvement over 8N, bringing performance somewhere in the ballpark of TSMC's N7/N6.
On yields, there's some important context required about the type of yield. SemiAnalysis reported that, for the Exynos 2200 (the first 4LPE SoC), "the parametric yields were horrendous even though catastrophic yields were fine". Catastrophic yield here means "is the chip physically functional or not?", whereas parametric yield means "can I clock the chip to my expected speed with my expected power draw?". Effectively, it looks like Samsung Foundry were producing chips which were fully functional, but which weren't meeting the expectations of their customer (Samsung LSI) in terms of clocks and power draw. This is a big difference from a manufacturing line that is pumping out massive numbers of useless chips, and speaks more to Samsung Foundry overselling the performance of the process than anything else.
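To make the distinction concrete, here's a toy model of how the two yield types compound. All of the numbers below are made up for illustration; none are reported figures:

```python
# Toy model: catastrophic vs parametric yield (all numbers hypothetical).
dies_per_wafer = 500          # candidate dies on a wafer (assumed)
catastrophic_yield = 0.90     # fraction that are physically functional
parametric_yield = 0.50       # fraction of functional dies hitting target clocks/power

functional = dies_per_wafer * catastrophic_yield
meets_spec = functional * parametric_yield

print(f"Functional dies:   {functional:.0f}")   # 450
print(f"Dies meeting spec: {meets_spec:.0f}")   # 225

# "Horrendous parametric, fine catastrophic" means the second number is low
# while the first is high -- which can be fixed by relaxing the spec
# (e.g. lowering clocks), not only by fixing the manufacturing line.
```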
If we look at the actual chip in question, the main limiting factor was reportedly the GPU, namely the new RDNA2 GPU which Samsung LSI have licensed from AMD. Reports suggest that the initial plan was to clock the GPU at 1.69GHz, and by release this was reduced to 1.29GHz to improve parametric yields (reportedly to "around 80%"). To me, the idea of using a desktop GPU architecture in a smartphone and expecting it to clock to almost 1.7GHz seems crazy in the first place, but it puts a bit of context around the yields. RDNA2 clocks higher than Ampere in general, but I wouldn't expect Nintendo to be nearly as aggressive on clocks, and a Switch form factor obviously has the benefit of active cooling.
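For a sense of scale, the reported clock cut works out to roughly a 24% reduction, and first-order dynamic power scales with frequency times voltage squared, so the saving compounds if the lower clock also permits a lower voltage. The voltage figure below is my own assumption, not a reported number:

```python
planned_ghz, shipped_ghz = 1.69, 1.29   # reported Exynos 2200 GPU clocks

clock_cut = 1 - shipped_ghz / planned_ghz
print(f"Clock reduction: {clock_cut:.0%}")          # ~24%

# First-order dynamic power ~ f * V^2. Assume (hypothetically) that the
# lower clock allows a 10% voltage reduction:
v_scale = 0.90
power_scale = (shipped_ghz / planned_ghz) * v_scale**2
print(f"Dynamic power vs original plan: {power_scale:.0%}")  # ~62%
```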
I think either Samsung's 5LPP or 4LPE processes would be a reasonable choice for Drake. They would offer a solid performance improvement over 8N, putting Drake in the ballpark of TSMC's N7/N6 on performance and power consumption while likely beating them a bit on transistor density. The biggest argument against them is that Nvidia isn't using the same process for any other chips. It's also worth noting that, if Nintendo and Nvidia were making this decision back in late 2019/early 2020, Samsung's 5nm/4nm yield and performance would have been unknown, which would have made it a riskier choice than the well-known TSMC N7 (which Nvidia was already manufacturing A100 on).
3GAE/3GAP - I'm including these partially for completeness, but also because they're the only Samsung processes we know Nvidia is using other than 8N. According to a report a month ago, Nvidia is one of a number of clients due to use a new Samsung 3nm process, with volume shipments due in 2024. This is later than we expect for Drake, but Samsung originally expected 3GAE to begin volume manufacturing in 2021, so it's been long delayed. They started low-volume production of 3GAE back in June, and I think the expectation is that an Exynos SoC will be produced in low volumes on 3GAE next year, followed by external customers in 2024 (possibly on 3GAP).
3GAE is Samsung's first process using GAAFET (Gate All Around Field Effect Transistors), which is the first fundamental change in transistor structure since the introduction of FINFET with the 14nm/16nm nodes. 3GAP will be the improved version of this process. Samsung's claims for their 3nm processes would make them roughly competitive with TSMC's 5nm and 4nm processes.
So, would Nintendo and Nvidia have chosen Samsung 3GAE/3GAP back in late 2019 or early 2020 and been blindsided by an unexpected delay, leaving us pushed back to 2024? I don't think so. For one thing, the delay to 2024 was publicly known back in June 2021, and would have been known to partners like Nvidia before that. It's also hardly surprising that Samsung's attempts to beat the rest of the industry to GAA by several years would have some issues, and it would have been the riskiest possible option available for Nintendo and Nvidia at the time they were choosing a process (certainly riskier than TSMC's FINFET 3nm processes, which are already in volume production). They could have got the same performance from a TSMC 5nm/4nm process like 4N (which Nvidia's using for almost everything else) with very low risk, so it wouldn't really make any sense to roll the dice on Samsung's 3nm processes.
I am curious what Nvidia are using 3GAE/3GAP for, though. It's not likely to outperform 4N, so it doesn't really make sense for a successor to Hopper, Ada, or Grace. Thor perhaps? It's due to hit production vehicles in 2025, so volume manufacturing in 2024 would make sense. Still, I would guess that ASIL certification takes some time, and is likely more straightforward on an established manufacturing process (Orin's 8N is pretty bleeding edge by the standards of automotive chips), so I would have assumed 4N would again be the safer option. Maybe Atlan was a 4N design, and Samsung's offered Nvidia a good enough deal for them to replace it with a 3GAP-based Thor.
I realize I am replying to old messages here, but this damn thing is gonna get announced soon, and I want to make sure I get it right before the game ends
Warps = Compute Shaders = Pixel Shaders. Drake's SMs have the same number of partitions, registers, register memory, and CUDA cores as desktop SMs, but for some reason limit the number of warps available to 3/4 of the capacity of desktop. This is not VTG - vertex/tessellation/geometry - shaders. Usually that limit is half the number of warps, but it's not for Drake.
My understanding is that vertex shaders have become far less common than pixel and compute shaders in modern games. Because everything else is the same, this is either a software limitation, or they've pulled scheduling hardware for compute out of the SM. I can't imagine this saves power, but it might save die size. RT runs at the compute stage of the pipeline, but I don't know if those shaders are scheduled the same way.
Does anyone have strong knowledge of how the lack of compute shaders relative to VTG shaders might impact perf?
This might have been covered in the last few pages (I only skimmed through them), but I think 48 warps per SM is actually the same as desktop Ampere (and Ada). Per Nvidia's Ada tuning guide:
The maximum number of concurrent warps per SM is 48, remaining the same compared to compute capability 8.6 GPUs
Desktop Ampere is compute capability 8.6 (vs 8.0 for A100, which supports 64 warps per SM). I would guess that A100 and Orin would be considered compute-focussed designs, and necessitate the greater number of concurrent compute shaders compared to gaming-focussed hardware like desktop Ampere and Drake.
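As a quick sanity check on those warp limits (the per-SM figures for CC 8.0 and 8.6/8.9 are from Nvidia's published compute capability tables; the warp size of 32 applies to all of these parts):

```python
WARP_SIZE = 32  # threads per warp on all CUDA GPUs to date

max_warps_per_sm = {
    "A100 (CC 8.0)":           64,
    "Desktop Ampere (CC 8.6)": 48,
    "Ada (CC 8.9)":            48,
}

for part, warps in max_warps_per_sm.items():
    print(f"{part}: {warps} warps = {warps * WARP_SIZE} resident threads per SM")

# 48 warps -> 1536 threads/SM, same as desktop Ampere and Ada, so Drake's
# 48-warp limit reads as the standard gaming configuration, not a cut.
```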
Nvidia gets good deals on process nodes by having multiple products on the same node. Right now, Nvidia manufactures chips on three nodes - TSMC 7nm (datacenter Ampere), Samsung 8nm (desktop Ampere), and TSMC 5nm (Ada). Nvidia has no major new product lines to manufacture. It would be very strange to have a product not on one of those three nodes, and it would drive up costs.
20nm wasn't a weird choice by Nintendo; it was Nvidia building an integrated SoC out of their mature GPU tech. Kepler was on 28nm, so the K1 was on 28nm.
Hardware testing with Nvidia GPUs by Thraktor, Orin documentation, and power tests by third parties all suggest an 8W minimum for Drake's GPU at 620+MHz. The listed power ranges are a 50% reduction in power.
Drake has a new power saving technology called FLCG. However, FLCG is also in Ada chips according to internal documentation. Nvidia reports a 50% reduction in power usage for Ada at the same performance level as Ampere, with FLCG and the node shrink together. Early power tests with the 4090 suggest that power reduction claim is actually optimistic.
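To put rough numbers on the claim (the 8W baseline and the 50% figure are from the tests and documentation described above; this is just the arithmetic):

```python
baseline_watts = 8.0        # estimated minimum for Drake's GPU at 620+ MHz on 8N
claimed_reduction = 0.50    # reduction implied by the listed power ranges

implied_watts = baseline_watts * (1 - claimed_reduction)
print(f"Implied GPU power at the same clocks: {implied_watts:.1f} W")  # 4.0 W

# Getting from ~8 W to ~4 W without a node shrink is the hard part to
# believe -- hence the options below.
```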
Either
- Nvidia and Nintendo are spending extra money to be on a process node with no other shared products, or...
- They're on the same node as Lovelace, releasing after Lovelace, but still not running Lovelace, or...
- Drake has power saving magic that is not on any other device in the Nvidia product line, including their top of the line cards, but is so simple that it's still the same Ampere architecture, or...
- The wattage numbers in the test don't mean what we think they mean.
1) Has happened before, when Nvidia went with 20nm for the TX1, though it's worth noting they planned on moving Maxwell over to it, were very unhappy with the yields, and didn't make the move. 2) Is insane, but I've theorized it was possible myself. 3) I simply do not believe. 4) Seems likely.
When it comes to sharing manufacturing processes, I don't think it's just about cost (although I'm sure they can negotiate better wafer prices with larger orders). I suspect another part of it is just allowing them to be more responsive to demand on individual products. Say they manufacture Drake on Samsung 4LPE and order a given number of wafers from Samsung; if the new Switch doesn't sell well, Nvidia can't use those 4LPE wafers for anything but Drake. Conversely, if the RTX 4090 underperforms, they can use up their 4N orders on AD103, AD104, Hopper, Grace, etc.
Also, I wouldn't say option 2 is insane. I'd say it's probably unlikely at this point, but not insane. SoCs designed to release around the same time as new CPU or GPU lines don't typically use the full latest architectures (see PS5/XBS/etc.), and it seems likely to me that, for their target release window, using the full Ada architecture on 4N wasn't on the table. So if they wanted to push things to get the best chip possible, then as I see it they had two options: either push the architecture, or push the manufacturing process. That is, stick with 8N and spend the R&D time back-porting whatever Ada features they could, or forget any architectural upgrades and spend that R&D time porting vanilla Ampere to 4N.
For a power-limited device like the Switch, pushing the manufacturing process makes way more sense to me. The architectural improvements in Ada are largely improved tensor core performance (which they could have had on Ampere with Orin's double-size tensor cores), and improved ray tracing performance. The latter would be nice, but hardly essential. There's also the updated OFA and whatever else is necessary for DLSS 3.0, but as previously discussed I don't think DLSS 3.0 really makes sense for a device like the Switch, even if the hardware was there.
Conversely, moving Ampere to 4N would be an obvious and big win. They could have started as soon as the gaming Ampere chips taped out (early/mid 2020?) and it would have been a pretty straightforward, low-risk job.