Hey, I went completely overboard responding to this. It's not directed at you personally; I just wanted to do a deep dive on 4N vs. N7. One possible alternative is TSMC's N7 process node, since Nvidia currently uses N7 to fabricate BlueField-3, Quantum-2, and ConnectX-7. Nvidia also announced during GTC 2023 that BlueField-3 was in production.
And TrendForce's forecast for 2025 estimates that the minimum computing resources required globally for artificial intelligence generated content (AIGC) products is ~145k to ~233k A100 GPUs, which are fabricated on TSMC's N7 process node.
T239: 4N or N7?
The more I've thought about it, the more I find N7 unlikely. Here is my rationale, explained in far too much detail lol
1. Die Size
On N7, Nvidia achieved a transistor density of 65.6 MTr/mm² for GA100 (the die used in A100). Its successor, GH100 (Hopper/H100), manufactured on 4N, achieves a density of 98.3 MTr/mm². Dividing the two gives a reasonably accurate scaling factor for Nvidia's move from N7 to 4N: 0.6673. Taking T239 on 4N at an approximate die size of 91mm² and multiplying by the scaling factor's reciprocal (1.498), T239 manufactured on N7 would be about 136.2mm². That's only about a 50% increase in die size, which seems relatively insignificant... however, let's move on to how much this die would cost.
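If you want to sanity-check the scaling math, here it is in a few lines of Python (all figures are the ones quoted above):

```python
# Density scaling from N7 (GA100) to 4N (GH100), using the figures above.
ga100_density = 65.6   # MTr/mm^2 on N7
gh100_density = 98.3   # MTr/mm^2 on 4N
scale = ga100_density / gh100_density      # area shrink going N7 -> 4N
t239_4n = 91.0                             # approximate T239 die size on 4N, mm^2
t239_n7 = t239_4n / scale                  # same transistor count, scaled back to N7
print(f"scaling factor: {scale:.4f}")      # 0.6673
print(f"T239 on N7: {t239_n7:.1f} mm^2")   # ~136.4 mm^2 (the ~136.2mm^2 above, within rounding)
```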
2. Die Cost
Using the silicon cost calculator on Adapteva with the following values:
136.2mm² die size (T239 on N7)
Price per 300mm diameter N7 Wafer = $8K
Yield: 0.95
We find that the cost per die is $18.227. This represents a $2.61 reduction from my prior estimate of $20.84 per T239 on 4N. But there are a few more things we should consider.
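For anyone curious how a calculator like Adapteva's gets there, here's a sketch using the standard dies-per-wafer approximation. Exact gross-die counts vary slightly between calculators depending on edge-exclusion and scribe-line assumptions, which is why this lands at ~$18.27 rather than exactly $18.227:

```python
import math

def gross_dies(wafer_mm: float, die_mm2: float) -> int:
    """Standard dies-per-wafer approximation: wafer area over die area,
    minus a perimeter term for partial dies at the wafer edge."""
    return int(math.pi * (wafer_mm / 2) ** 2 / die_mm2
               - math.pi * wafer_mm / math.sqrt(2 * die_mm2))

def cost_per_good_die(wafer_price: float, wafer_mm: float,
                      die_mm2: float, yield_: float) -> float:
    return wafer_price / (gross_dies(wafer_mm, die_mm2) * yield_)

# T239 on N7: 136.2 mm^2 die, $8k per 300mm wafer, 0.95 yield
print(gross_dies(300, 136.2))                      # 461 gross dies
print(cost_per_good_die(8000, 300, 136.2, 0.95))   # ~$18.27 per good die
```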
Firstly is yield.
The yield figure I gave above is inaccurate for a couple of reasons. TSMC's N5 family is their best-yielding process family in a long time; you'd likely have to go back to 28nm to find a node that yielded as well or better. N7 by comparison, while still having an impressively low defect density, does not yield as well as N5/4N. This is a problem because:
A) 0.95 is too high of a baseline yield for N7, and
B) T239 on N7 is also ~50% larger in area than on 4N. Larger dies yield worse than smaller ones, as the likelihood of a defect landing on a critical portion of the die is higher, potentially rendering it entirely nonfunctional.
These represent pretty big issues for Nintendo. We know from NVN2/L4T that there must be 12 SMs present for GA10F, and 8x A78C cores for the CPU cluster. If any of these areas has a critical defect that would require disabling an SM or CPU core, the die is a complete dud to Nintendo. It can't be cut down like a desktop GPU or CPU and binned as a lower-tier SKU; all of these IP blocks have to be functional and hit the proper frequency targets at the given supply voltage. This problem can be designed around by adding redundant logic blocks and transistors, but that increases die area, which raises the cost further and again increases the chance of a critical defect appearing.
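To put rough numbers on why the bigger die yields worse, here's a sketch using the classic Poisson defect model, back-solving the defect density from the 0.95 yield assumed for the ~91mm² 4N die. This is illustrative only; N7's real defect density is higher than 4N's, so the result below is still optimistic for N7:

```python
import math

# Poisson defect model: yield = exp(-D0 * A), with die area A in cm^2.
# D0 is back-solved from the 0.95 yield assumed for the ~91 mm^2 die on 4N.
area_4n = 0.91    # cm^2
area_n7 = 1.362   # cm^2 (the ~136.2 mm^2 N7 die)
d0 = -math.log(0.95) / area_4n       # implied defect density, ~0.056 /cm^2
yield_n7 = math.exp(-d0 * area_n7)   # bigger die, same D0
print(f"{yield_n7:.3f}")             # ~0.926, even at 4N-class defect density
```

Even holding defect density constant, the larger die alone knocks a few points off the yield; N7's worse defect density pushes it down further, toward the 0.90 estimate used below.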
So overall, what does this mean for the yield of T239 on N7? Essentially, the approximate yield figure of 0.95 for 4N decreases to a best estimate of 0.90 for N7 (likely still too high, but let's overestimate so N7 gets a better shot), taking the aforementioned factors into account. Rerunning the silicon cost calculator with 0.90 instead gets us a cost per die of $19.24. That's only about $1 more expensive per die, and still about $1.50 cheaper per die than 4N. Let's move on to the next section, however, to see whether it remains cheaper once other expenses are considered.
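Quick check of those two cost figures, back-solving the gross dies per wafer (~462) from the $18.227 figure:

```python
# ~462 gross 136.2 mm^2 dies per $8k N7 wafer (back-solved from $18.227 at 0.95 yield)
gross, wafer_price = 462, 8000
for y in (0.95, 0.90):
    print(f"yield {y}: ${wafer_price / (gross * y):.2f} per die")
# yield 0.95 -> $18.23, yield 0.90 -> $19.24, vs. ~$20.84 on 4N
```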
3. R&D Costs
In terms of engineering resources, it is easier for Nvidia to create T239 for 4N than for N7. Why is this? It comes down to the logic blocks Nvidia has already created for 4N compared to N7. Let's examine 4N first:
On 4N, Nvidia already has the RTX 4000 series GPUs (Ada Lovelace). While Ampere (the architecture of the GA10F GPU in T239) is not the same architecture as Ada, the two resemble one another more than any two prior Nvidia architectures. The Streaming Multiprocessors (SMs) are the same, aside from the updated Tensor and RT cores on Ada. Many IP blocks, such as the CUDA cores, private/shared caches, ROPs, 3D fixed-function units, TMUs, etc., are identical or essentially identical. Beyond the SMs, the overall GPC layout is basically the same across most Ada and Ampere SKUs (most, because there are a few exceptions, but they're not significant here).
Now let's examine what logic blocks on N7 Nvidia has that are translatable to T239 on the same node:
The closest analog to T239 here is Nvidia's GA100 die, which is also based on Ampere. So, case closed, it's the same architecture as GA10F, or at least more similar to it than Ada, right? Surprisingly, no. GA100 goes into Nvidia's A100 AI and HPC accelerator GPUs for the datacenter, and as such many of its logic blocks differ from those of desktop/laptop Ampere (GA107 through GA102). Here are the notable differences between these two dies (GA100 and GA102), despite both being based on the Ampere architecture.
GA100:
- Each of the 4 processing blocks (partitions) per SM has 8 FP64 cores, 16 CUDA cores that handle only FP32, and 16 CUDA cores that handle only INT32
- Each partition has 8 load/store units
- 192KB L1 data cache/shared memory per SM
- Each Tensor core can handle 256 dense or 512 sparse FP16 FMA operations. FP64 operation capable
- FP32 to FP16 (non-tensor) ratio is 1:4
- GPC Structure: 2 SMs per TPC, 8 TPCs per GPC, 16 SMs per GPC
- No Raster Engine
- No RT Cores
GA102:
- Each partition has 8 CUDA cores for FP32 only, and 8 "hybrid" CUDA cores that can handle either FP32 or INT32. Only 2 FP64 units are present per SM, and they sit outside the partitions (unlike on GA100)
- Each partition has 4 load/store units
- 128KB L1 data cache/shared memory per SM
- Weaker Tensor cores, 1/2 the FP16 FMA ops versus GA100. No FP64 FMA
- FP32 to FP16 ratio is only 1:1
- GPC Structure: 2 SMs per TPC, 6 TPCs per GPC, 12 SMs per GPC
- Raster Engine per each GPC
- 1x 2nd Gen RT Core per SM
AD102 (differences vs. GA102):
- 3rd Gen RT Cores (many improvements from 2nd Gen)
- 4th Gen Tensor Cores (addition of FP8, double throughput of Ampere 3rd Gen Tensor Cores in respective data type)
Finally, let's look at T234 (Nvidia Orin) on Samsung 8N to see what IP is likely to be ported to T239 (Drake), and what will be absent.
T234
Ported
-64 bit LPDDR5 memory controllers and PHYs
-Some of the IO control logic and connections (USB, SD/eMMC)
-Data Fabric and interconnects (not identical, but they will use what's on Orin to help design Drake)
Removed
-2x DLA v2
-PVA v2
-HDR ISP
-VIC (Video Imaging Compositor) 2D Engine
-At least 3x CSI (x4)
-10GbE
-Some other IO required for image processing and debugging
-Probably a few more (not entirely relevant)
Keep in mind that Samsung 8N, TSMC N7, and TSMC 4N are all design-incompatible with one another. This means that regardless of whether T239 is on N7 or 4N, whatever is ported from Orin will have to be modified according to the design rules of TSMC's node. So with all of that out of the way, let's summarize the IP blocks already existing for T239 on 4N, and those that will need to be ported from either 8N or N7.
Present on 4N:
- Logic blocks within the SM: Polymorph Engine, CUDA cores, ROPs, TMUs, load/store units, SFUs, warp schedulers, dispatch units, registers, L0 instruction caches, L1 data cache/shared memory. Keep in mind these are all the same amount/structured in the same way as desktop Ampere
- GPC structure: raster engine, 2 SMs per TPC, 6 TPCs per GPC
- NVENC/NVDEC (H.264, H.265, AV1) [reduced stream counts vs. Ada]
- Display control logic + HDMI PHYs
- L2$ SRAM + control logic (2048KB per tile)
- PCIe PHYs + control logic
- 2nd Gen RT Cores, 3rd Gen Tensor Cores (1/2 the throughput of Orin and GA100)
Ported from 8N (Orin):
- NVOFA
- Various IO control/PHYs like USB, SD/eMMC
- Some design logic from data fabric, Interconnect, potentially logic controls/SRAM for caches in CPU
Present on N7 (GA100):
- SM logic blocks: FP32-only CUDA cores, FP64 CUDA cores, ROPs, TMUs, load/store units, SFUs, warp schedulers, dispatch units, registers, L0 instruction caches
- GPC structure: 2 SMs per TPC
- NVDEC only (minus AV1)
- L2$ SRAM + control logic (512KB per tile)
- PCIe PHYs + control logic
The IP on 8N will need to be ported regardless of whether T239 is manufactured on 4N or N7, so this represents a "fixed" R&D cost for Nvidia. By going with 4N over N7, however, far more IP blocks are already present, reducing the overall engineering resources required, and thus the overall design costs.
Does this narrow the gap between the cost per die of T239 on N7 versus 4N, or even tilt it in the former's favor?
Yes, but indirectly. It doesn't reduce the cost of the silicon itself, but it does reduce how much Nintendo either paid Nvidia outright for the R&D, or pays Nvidia per functional die (less margin markup), or both. It depends on how the cost structure between Nvidia and Nintendo was negotiated, but regardless, the overall amount Nintendo pays Nvidia is reduced.
4. Wafer Supply
To preface: this is where I really went overboard, if I hadn't done that already in the last section.
While it's true Nvidia has capacity on N7 for various datacenter products, they also have a massive 4N wafer allocation from TSMC. You may think that Nvidia is unable to divert wafers away from H100 and RTX 4000 to fulfill the massive order volume that Nintendo will need. However there have been some important developments recently that change this calculus in my opinion.
Firstly, it has been heavily rumored that due to poorer-than-expected sales of RTX 4000 (Ada), Nvidia is reducing how many AD102, 103, 104, etc. dies they produce. This is both to prevent an oversupply of RTX 4000, which would force down MSRPs (or at least actual retail pricing), and to free up wafers for additional H100 manufacturing. While I agree with the former, the latter comes with a large caveat.
Currently, wafer supply of 4N is not the bottleneck to H100 production; Nvidia has more than enough to allocate toward their high margin AI accelerator. Instead, it is actually CoWoS packaging that is the bottleneck, meaning the packaging of HBM (high bandwidth memory) side by side with the GH100 die on an interposer. With the AI boom in full swing, TSMC is unable to package dies together with HBM quickly enough to fulfill demand, despite HBM and GH100 supply being sufficient. TSMC is increasing CoWoS packaging capacity accordingly, however it will take time to build up this additional manufacturing.
Interestingly, Nvidia has also increased their 4N wafer supply, even with RTX 4000 missing sales expectations and H100 output unable to absorb the additional wafers. So what might all this extra capacity be used for? In my opinion, the beginning of high-volume manufacturing (HVM) for T239. We know that devkits are in the hands of 3rd-party developers at this point, and that a 2H 2024 launch is likely for the Switch NG. If HVM for the SoC seems too early, remember that these dies need to be manufactured, packaged with memory, integrated onto a PCB, and wired to additional PCBs containing things like the WiFi/Bluetooth modules, gyroscope, NAND, etc. Then all of these components need to be assembled into the console, with a screen attached. Joy-Cons, docks, and other accessories also need to be manufactured, all the components need to be QA tested, and everything needs to be packaged together in a box, shipped across the world, and in the hands of retailers before launch. And Nintendo needs millions of these consoles ready to go. The timeline for this process lines up extraordinarily well with both a 2H 2024 launch and HVM for T239 beginning now or a few months earlier.
Let's compare gross dies per wafer of T239 and GH100, silicon and other costs for each, and net margin of both products for Nvidia on 4N.
T239
Die size: 90.89 mm²
Yield: 0.95
Gross dies per 4N wafer: 686
Cost per die: $20.84
GH100
Die size: 814 mm²
Yield: 0.80
Gross dies per 4N wafer: 67
Cost per die: $277.78
A quick explanation for the 0.80 yield figure for GH100. It is an absolutely massive die at 814 mm², dwarfing T239's 90.89 mm² by 8.95x. Because of this almost nine-fold increase in die area, you might expect a much worse yield drop than 0.15 versus T239. However, unlike Drake, GH100 can be, and in fact is, cut down to remove critical defects. GH100 has 144 SMs, 12x 512-bit HBM controllers, and 60MB of L2 on the full die, but the top SKU (H100 SXM5) has only 132 SMs, 10 memory controllers, and 50MB of L2, and a further cut-down SKU (H100 PCIe) exists with even fewer SMs (114).
So now let's compare Nvidia's margin for each die when sold in packaged form. We'll go with GH100 first.
H100 has an average sale price of about $30k, and their margin is reported as being 1000%. Personally I think this is probably too high, but let's break down the costs of an assembled H100 anyway.
GH100 cost per die: $277.78
80GB HBM3: (at a reasonable estimate of $10 per GB) = $800
CoWoS interposer + packaging = ???
Power delivery, wiring, IO, PCB + packaging = ???
Heatsink/cooler = ???
I guess if reports are to be believed, the costs of the components other than the die and HBM, as well as packaging, validation, and shipping, total about $1,922? This seems pretty absurd to me; maybe the yield of the CoWoS packaging step is absolutely atrocious, but I doubt it's bad enough to account for this huge discrepancy. In my opinion the margin is more likely about 1500%, which would imply a build cost of $2,000 and put those remaining costs nearer $900. Anyway, moving back to the initial point: with 67 dies per wafer, each die makes Nvidia about $27,000, so each 4N wafer allocated to GH100 is worth about $1.809 million to Nvidia.
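Here's the back-of-envelope behind those numbers, reading "1000% margin" the way the report implies (price = 10x cost):

```python
# Back-of-envelope for the reported "1000% margin" on a ~$30k H100.
price = 30_000
total_cost = price // 10                 # $3,000 implied build cost at 1000% margin
die_cost = 277.78                        # GH100 cost per die (from above)
hbm_cost = 80 * 10                       # 80 GB HBM3 at ~$10/GB
other = total_cost - die_cost - hbm_cost # everything else: interposer, PCB, cooler...
print(f"implied other costs: ${other:.0f}")   # ~$1,922
profit_per_die = price - total_cost           # $27,000
per_wafer = 67 * profit_per_die               # 67 dies per 4N wafer
print(f"per-wafer value: ${per_wafer:,}")     # $1,809,000
```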
Let us now compare this to T239.
With 686 dies per wafer, a cost per die of about $20.84, and an estimated markup to Nintendo of 60%, we find that a 4N wafer of T239 makes Nvidia about $8,578. I don't think I did the math incorrectly, but regardless, a wafer of GH100 represents about 210x more potential profit for Nvidia.
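And the T239 side of the comparison:

```python
# Per-wafer profit for T239 vs. GH100, using the figures above.
die_cost, markup = 20.84, 0.60
dies_per_wafer = 686
t239_profit_per_wafer = dies_per_wafer * die_cost * markup
gh100_profit_per_wafer = 67 * 27_000        # from the H100 math above
print(f"T239: ${t239_profit_per_wafer:,.0f}/wafer")                      # ~$8,578
print(f"ratio: {gh100_profit_per_wafer / t239_profit_per_wafer:.0f}x")   # ~211x
```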
So why did I go on this tangent about margin and product costs and yada yada yada if it would just conclude with the implication that Nvidia would make vastly more money by not making T239 on 4N? Because all of that potential profit remains just that, potential, if your GH100 dies are sitting in a TSMC warehouse, waiting months and months or even years for CoWoS capacity to finally catch up. Nvidia aren't stupid; they will make only as many GH100s as they need to fill demand without a huge backlog of unpackaged H100s piling up. By the end of 2023, Nvidia is predicted to have sold 550,000 H100s. But "sold" doesn't mean delivered: some will be in customers' hands, and a huge portion of those customers will have prepaid and will only receive their H100s once they're actually fully complete.
Let's say that of those H100s, half are actually delivered to customers by the end of the year. Going back to our 67 dies per wafer, a die yield of 0.80, and a realistic yield estimate of 0.9 for the CoWoS packaging step, Nvidia would need about 4,700 4N wafers to produce the requisite dies for 225,000 H100s. If they make extra GH100 dies (say 400,000), they'd need about 8,300 4N wafers. The last available estimate of TSMC's total N5-family capacity was 150,000 wafers per month, but that was from April 2022; they're likely at around 200,000 per month now. Spreading 8,300 4N wafers (GH100 only) over the course of 2023, Nvidia would need to allocate about 700 wafers per month of their total 4N wafer allocation, and I can assure you they're allocated quite a bit more than that. To further dismantle the argument that 4N capacity isn't enough, let's look at TSMC's CoWoS packaging capacity. TSMC is estimated to be able to package 8,000-9,000 per month, and that's dies per month, not entire silicon wafers full of dies. In May, Nvidia reportedly wanted to increase their CoWoS allocation by 10,000 over the remainder of 2023. Split that into 2,000 for each of the last 5 months of 2023, and assume that prior to that they had about 60% of overall CoWoS capacity per month; this results in a total of about 75,000 packaged dies in 2023 for Nvidia. That's a whole lot less than 550,000 H100s actually delivered. Essentially, we can conclude that 4N supply to GH100 is no constraint on T239 production. But why stop there? Maybe Nvidia still wouldn't have enough 4N supply for T239 itself.
How many wafers would it take to fulfill demand for T239? Let's assume that Nintendo wants to go big and have 10 million units available at launch and to fulfill holiday demand.
At 686 dies per wafer and a yield of 0.9, Nintendo would need about 16,200 wafers. Say the lead time from T239 production to ready-to-sell console is about 6 months. If HVM of T239 started in July 2023 and Nintendo launches the Switch NG in October 2024, they would have about 10 months to produce enough T239 SoCs for 10 million consoles to be available immediately at launch. That works out to roughly 1,600 4N wafers per month that Nvidia would need to allocate to Nintendo out of their total wafer supply from TSMC. With Nvidia being a large customer of TSMC, and TSMC producing 200,000 N5-family wafers per month, let's just say that yeah, ~2,300 wafers per month for GH100 and T239, plus an additional few thousand for RTX 4000, is certainly within Nvidia's monthly 4N allocation.
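Re-deriving the wafer requirement from those inputs:

```python
# 10M good T239 dies at 686 gross dies per wafer and 0.9 yield.
gross_dies, yield_ = 686, 0.90
wafers = 10_000_000 / (gross_dies * yield_)
print(round(wafers))        # ~16,200 wafers total
print(round(wafers / 10))   # ~1,620 wafers/month over a 10-month ramp
```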
Going back to GH100 and T239, let's finally circle around to revenue.
For 550,000 GH100s sold, Nvidia makes $16.5 billion. For 10 million T239s sold to Nintendo, Nvidia generates about $333 million in revenue. However, that T239 figure is revenue on fully delivered products (B2B), while with the CoWoS constraints laid out earlier, Nvidia will only be able to actually deliver around 75,000 H100s in 2023, bringing the realized revenue from these GH100 dies down to only $2.25 billion. That's still about a 7-fold difference, but while CoWoS capacity is constraining H100 supply, T239 doesn't use CoWoS and has no such supply constraint on Nvidia making money. And if you have the node capacity, why not make more money?
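And the revenue math:

```python
# Revenue comparison under the CoWoS bottleneck.
h100_price = 30_000
t239_price = 20.84 * 1.60        # ~$33.34 per die at a 60% markup
print(550_000 * h100_price / 1e9)            # 16.5  -> $16.5B if every H100 shipped
print(round(10_000_000 * t239_price / 1e6))  # 333   -> ~$333M for 10M T239
print(75_000 * h100_price / 1e9)             # 2.25  -> $2.25B deliverable in 2023
```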