Wonder if it'll be some awkward naming scheme like the Switch Flip because they're "Flipping the script" on what they normally do in terms of sequel platforms. Or take a cue from the NES, and call it the Nintendo Switch Entertainment System.
That said, given that Nintendo understands how successful Apple has been with the iPhone, it only makes sense to keep the naming scheme simple. Apple went from the iPhone to the iPhone 3G to communicate that it had 3G connectivity. Simple, yet effective. Then it was the iPhone 3GS, and then the iPhone 4. They just kept the numbers, but made it clear each one was new.
That would make a case for Nintendo to simply call it the Switch 4K, though at the same time, you don't want to convey the wrong message that it can do 4K both docked and handheld, when only the former could do that.
"The Nintendo Switch 2: You were expecting Die Hard 2, but in fact it's Aliens."
Oh, if only Bill Paxton were still around for some of that "
Game over man,
GAME OVER!"
---
Alright, veering off course a bit, but I really wanted to type up something on seeing Baldur's Gate 3 benchmarks. If you already know your computers, I don't think that anything I say will be new. I'm hoping that it's at least a worthwhile read to the subset of readers who are less familiar.
Lots of rambling; as always, you are quite welcome to correct me where I'm wrong. In particular, I was motivated by Digital Foundry's article today. In addition, there's this from pcgameshardware.de; my focus is on the CPU benchmarks.
So Alex suggests that one of the reasons for the performance tanking in act 3 of BG3 is the significant increase in NPCs and the corresponding AI scripts weighing down the CPU.
First thing I wanted to blabber on about would actually be the CPU benchmarks on pcgameshardware.de. Particularly noticeable is the drastic difference between the regular Zen 3/4 SKUs and the X3D SKUs. Hardware-wise, the difference is that the regular SKU can clock a bit higher, while on the X3D SKU, a given CPU core has access to 96 MB of L3 cache instead of 32 MB.
Oh right, quick interjection about "the memory hierarchy" for the readers. A CPU needs data/instructions to do work. Said data/instructions need to be held somewhere so they can be retrieved. The hierarchy just describes the order of locations to search. Internal storage is at the bottom/end. Storage is slow, so RAM exists. RAM, while a massive improvement, eventually gets deemed 'can still be improved upon'. And lo, this thing called 'cache' now exists, and over time we get multiple levels of it. You'll see references to L1/L2/L3 cache. The number is descriptive; it's just the order in which they're searched. If you can't find what you want in L1, you move on to L2. If you can't find it in L2, you move on to L3 if there is an L3 (the Tegra X1 doesn't have one), otherwise to RAM you go. Once you run out of cache levels, you have to turn to RAM.
Furthermore, cache should be holding recently needed data. New data wouldn't already be there, and relatively older data that was placed there eventually gets replaced by new incoming data.
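The lookup order above can be sketched in a few lines of toy Python (all sizes and latencies invented; real caches work on address tags and sets, not Python sets):

```python
# Toy model of the hierarchy walk: check each level in order,
# fall through to RAM. All latencies are made up.
def access_latency(address, caches, ram_latency_ns=80):
    """Return (level name, latency) of the first level holding the address."""
    for name, contents, latency_ns in caches:
        if address in contents:
            return name, latency_ns
    return "RAM", ram_latency_ns

# Each entry: (level, addresses currently cached, assumed latency in ns).
hierarchy = [
    ("L1", {0x10, 0x20}, 1),
    ("L2", {0x10, 0x20, 0x30}, 3),
    ("L3", {0x10, 0x20, 0x30, 0x40}, 10),
]

print(access_latency(0x20, hierarchy))  # hits L1 -> ('L1', 1)
print(access_latency(0x40, hierarchy))  # misses L1/L2, hits L3 -> ('L3', 10)
print(access_latency(0x99, hierarchy))  # misses everything -> ('RAM', 80)
```

Each level down is a lot slower than the one above it, which is why where your data happens to be sitting matters so much.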
Back to the X3D vs non-X3D Zen chips.
X3D gets a win whenever you get a hit in L3 cache where the non-X3D would instead get a miss. That is, you're working with something that fits within 96 MB, but not 32 MB.
Non-X3D gets a win whenever an access resolves at the same level for both SKUs, due to higher clocks. As in, both hit in L1? Clocks should win out. Both hit in L2? Clocks should win out. Both hit in L3? Clocks win. Both need to go to RAM? Clocks win.
For the X3D SKU to pull ahead by so much, it should mean it's accumulating a lot of wins. That is, a lot of times where you miss L1 and miss L2, but then hit in the 33-96 MB portion of L3 (because it would have to miss the regular SKU's 32 MB of L3). That's a lot of times where you're accessing recent-ish data, and there's a decent chunk of said recent-ish data.
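To put rough numbers on that clocks-vs-cache trade (toy Python; every latency and hit rate below is invented for illustration, not measured):

```python
# Back-of-envelope: can a ~5% clock edge beat turning RAM trips into L3 hits?
# All numbers below are invented.
LAT_NS = {"L1": 1, "L2": 3, "L3": 10, "RAM": 80}

def avg_access_ns(mix):
    """Expected latency, given the fraction of accesses served by each level."""
    return sum(frac * LAT_NS[level] for level, frac in mix.items())

# Hot data fits in 96 MB but not 32 MB: X3D turns most of the RAM
# traffic into L3 hits; the non-X3D part claws back ~5% via clocks.
non_x3d = avg_access_ns({"L1": 0.90, "L2": 0.06, "L3": 0.01, "RAM": 0.03})
x3d     = avg_access_ns({"L1": 0.90, "L2": 0.06, "L3": 0.035, "RAM": 0.005})

print(f"non-X3D: {non_x3d:.2f} ns, X3D: {x3d:.2f} ns")
print(f"X3D latency advantage: {non_x3d / x3d:.2f}x (vs ~1.05x from clocks)")
```

With made-up numbers like these, converting even a few percent of RAM trips into L3 hits dwarfs what a small clock bump can recover.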
AI scripts... makes sense? They seem to be something that would be constantly running. They also seem like something that would keep touching the same data.
Moving away from X3D vs non-X3D to a quick look between different CPUs...
Going from 6C/12T to 8C/16T SKUs within a gen helps somewhat, but nowhere near proportionally. Like the 3600->3800XT isn't a 1/3 increase in average FPS. Similarly with 5600->5800X. It's mainly going from one gen to another that makes the big differences. For those less familiar, I'll do a recap.
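That sublinear scaling is what you'd expect if only part of the frame's CPU work actually spreads across cores. A quick Amdahl's-law sketch (the 60% parallel fraction is a pure assumption, not a BG3 measurement):

```python
# Amdahl's law: overall speedup when only a fraction of the work scales.
def speedup(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

p = 0.60  # assumed: 60% of per-frame CPU work scales with core count
s6, s8 = speedup(p, 6), speedup(p, 8)
print(f"6 cores: {s6:.2f}x, 8 cores: {s8:.2f}x")
print(f"6C -> 8C gain: {(s8 / s6 - 1) * 100:.1f}%")  # ~5%, nowhere near 33%
```

Meanwhile, a new generation speeds up the serial part too, which is why gen-over-gen jumps show up so much more clearly than core-count bumps.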
Ryzen 3000 (Zen 2) to Ryzen 5000 (Zen 3) series: The core got expanded (read as: higher potential IPC, or performance per clock), plus how the cores are organized changed from 4 cores per cluster to 8 cores per cluster. Latency between cores within the same cluster is much better than between different clusters. Also, each core has access to the L3 cache in its cluster. For the SKUs tested here, cores in the 3100/3600/3800XT have access to 16 MB of L3 cache, while cores in the 5600/5800X have access to 32 MB. Reminder: PS5/Series use monolithic Zen 2, so their cores have access to 4 MB of L3 cache. Also, remember that while GDDR has the bandwidth advantage over LPDDR or regular DDR, its latency is worse (i.e., worse for frequent access). It'll be interesting to see how the PS5 version fares...
Intel 8000/9000/10000 (they're all Skylake cores) to 13000 (Raptor Lake; uses Raptor Cove and (enhanced) Gracemont cores): Between Skylake and Raptor Cove, there have been two significant expansions to the core, plus a few changes to the memory subsystem:
L1: Skylake has 32 KB instruction/32 KB data per core; Raptor Cove has 32 KB instruction/48 KB data per core.
L2: Skylake has 256 KB per core; Raptor Cove has 2 MB per core.
L3: Skylake chips with hyperthreading have a total of 2 MB times the number of cores; Raptor Lake usually has a total of 3 MB times (# of Raptor Cove cores + # of Gracemont clusters).
Also, Raptor Lake generally gets pushed to higher clocks than the Skylake-based generations. So the performance uplift is coming from a combination of clocks, higher potential performance per clock, and a beefier memory subsystem.
Alright, now to what really got me itching to make this post. There's a particular paragraph in the Eurogamer article that stood out to me:
"Is this performance justified and can it be fixed? To answer that, I think we can look at how the performance scales with the amount of cores and threads. When looking at the Core i9 12900K, we see some interesting data when examining how the game runs across different amounts of cores and threads. The best performing combination here is actually eight cores without hyperthreading on. Eight cores only performs four percent better than six, while the fully enabled 12900K is just two percentage points better than the six-core result, despite doubling threads and available cores. Eight p-cores with hyperthreading enabled is the worst result of all, a touch slower than the six core result."
And yea, it sticks out to me because every now and then we get people here wondering 'how much does the lack of SMT matter?'
First off, hyperthreading/HT. What is that? (Besides being Intel's brand for their version of Simultaneous Multi-Threading, or SMT.) So, years and years ago (at least a couple of decades), a couple of inefficiencies were observed: temporal/time and electrical/energy.
Sometimes a core is just stalling for whatever reason. Maybe some part of the core needs to be used for what you want to do next, but it's also kind of busy working right now. So you're just waiting around, twiddling your thumbs doing nothing.
That sucks. So HT/SMT is the idea of enabling a core to have a 2nd thread running, to potentially convert some dead time into something productive, assuming said other thread doesn't need the same resources.
Electrically, this is back before clock gating got further refined. You're lighting up/turning on huge chunks of the core, if not all of it. But you're not necessarily using all that you've lit up. Well, that's a waste of energy, right? So HT/SMT can potentially utilize parts that are on but otherwise not in use; try to 'un-waste' some energy.
But notice that HT/SMT is effectively a method to try to cover for gaps. If your core is better at single thread utilization, there's less room for potential improvement from a 2nd thread. If your clock gating is refined enough such that you're better at only powering the parts you need, you're wasting less energy that a 2nd thread would've tried to cover for.
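That gap-filling framing can be reduced to a one-liner (toy Python; the utilization numbers are invented):

```python
# SMT as gap-filling: the sibling thread only gets the cycles
# thread 0 leaves idle. Purely illustrative numbers.
def smt_throughput(busy_t0, demand_t1):
    """Total core utilization with a second thread filling the gaps."""
    return busy_t0 + min(demand_t1, 1.0 - busy_t0)

# A core that's already 90% busy leaves little for the 2nd thread:
print(round(smt_throughput(0.90, 0.90) - 0.90, 2))  # reclaims 0.1
# A stall-heavy workload leaves big gaps, so SMT recovers far more:
print(round(smt_throughput(0.55, 0.55) - 0.55, 2))  # reclaims 0.45
```

Same mechanism, wildly different payoff, depending on how much dead time there was to begin with.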
Also remember that HT/SMT doesn't create additional resources. For example, you're sharing the cache that's private to each core...
Second, the 12900K. What is that? Alder Lake uses Golden Cove and (original) Gracemont cores. Golden Cove is... well, I should say that Raptor Cove is basically a refinement of Golden Cove. The notable difference here is that Golden Cove has 1.25 MB of L2 cache per core. Anyway, the full 12900K is 8 Golden Cove cores plus 2 clusters of Gracemont cores (1 cluster of Gracemont = 4 cores, so 2 clusters = 8 cores), with a total of 30 MB of L3 cache.
The (original) Gracemont cores themselves have 64 KB L1 instruction/32 KB L1 data per core. Then they have 2 MB of L2 per cluster (not per core). And for completeness' sake: I mentioned earlier that (enhanced) Gracemont is what's used by Raptor Lake. The "enhancement" here is increasing a cluster's L2 from 2 MB to 4 MB.
Couple more links for those interested...
https://www.anandtech.com/show/1704...hybrid-performance-brings-hybrid-complexity/6 - inter-core latency for the 12900K. Also, latencies for the memory subsystem in nanoseconds.
https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ - gotta scroll down quite a bit, but eventually you'll get to a section covering latencies for the memory subsystem in both nanoseconds and clock cycles.
Performance hierarchy seems to be:
8 Golden Cove only, hyperthreading OFF
8 Golden Cove + 8 Gracemont, hyperthreading ON (i.e., the full 12900K)
6 Golden Cove only, hyperthreading OFF
8 Golden Cove only, hyperthreading ON
Ok, my gut reaction:
8 Golden Cove only with no HT? At most 8 threads at a time, and you need a context switch every time a core wants to change threads. The plus side, though? You're confident that a thread has all of a core's L1 and L2 cache to itself. I'm guessing this config scores a lot of cache wins from the AI density, enough to compensate for the fewer threads.
Full 12900K? So, the default behavior is that the task scheduler first populates the P-cores (Golden Cove here) with 1 thread each, then the E-cores (the Gracemont cores) get a thread each, then you finally start putting hyperthreads on the P-cores. So it's actually not until thread #17 that HT starts kicking in. An individual Gracemont is far weaker than an individual Golden Cove, but a thread on a Gracemont still gets a full core to itself, along with its L1 and access to the cluster's L2, compared to a hyperthread on a P-core getting leftovers.
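A sketch of that placement order (assumed and simplified; real scheduling on hybrid chips involves Thread Director and the OS, and is messier than this):

```python
# Simplified default placement order on a 12900K-like chip:
# P-cores first, then E-cores, and only then P-core hyperthreads.
P_CORES, E_CORES = 8, 8

def placement(n):
    """Where the nth runnable thread (0-indexed) lands."""
    if n < P_CORES:
        return f"P-core {n} (full core)"
    if n < P_CORES + E_CORES:
        return f"E-core {n - P_CORES} (full core)"
    return f"P-core {n - P_CORES - E_CORES} (hyperthread)"

print(placement(0))   # P-core 0 (full core)
print(placement(8))   # E-core 0 (full core)
print(placement(16))  # P-core 0 (hyperthread) -- thread #17 is the first HT
```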
6 Golden Cove only with no HT? I suppose that 6 threads at a time just isn't quite enough.
8 Golden Cove only with HT on? How did that drop off so much? I'm looking in the direction of hyperthreads clashing with the main thread over... maybe the cache, due to the AI scripts. And/or both threads on a core needing the exact same execution units, or other parts of the core. But I'm thinking that 1.25 MB of L2 cache split between two threads maaaay be a bit of a tight squeeze in this scenario. So there may be a fair number of times where this config gets kicked out to L3 or beyond, while the other configs remain in L2.
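The L2-squeeze hypothesis in arithmetic form (toy Python; both the hit rates and the latencies are invented, the point is only the direction of the effect):

```python
# If two hyperthreads share one 1.25 MB L2, each effectively sees less
# cache, so more accesses fall through to L3. Assumed numbers throughout.
L2_NS, L3_NS = 3, 12   # assumed latencies in ns
HIT_HT_OFF = 0.95      # one thread, whole L2 to itself (assumed)
HIT_HT_ON = 0.80       # two threads competing for the same L2 (assumed)

def avg_ns(l2_hit_rate):
    """Expected latency for an access that's either an L2 hit or an L3 trip."""
    return l2_hit_rate * L2_NS + (1.0 - l2_hit_rate) * L3_NS

print(f"HT off: {avg_ns(HIT_HT_OFF):.2f} ns/access")  # 3.45
print(f"HT on:  {avg_ns(HIT_HT_ON):.2f} ns/access")   # 4.80
```

Even a modest drop in L2 hit rate makes every thread pay noticeably more per access, which would fit the "worst result of all" outcome.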
(...although, it would be interesting if Alex could run this test on a Raptor Lake CPU and see if 2 MB of L2 cache per P-core would help...)
...uh, oh, yea, maybe I should try to tie this to Nintendo somehow? Um, so guys, what would probably be necessary if we want a port of Baldur's Gate 3 on the NG? Slash as much unnecessary AI running concurrently as possible?
Edit: bwaaa, **** me, I forgot to cover Ryzen 5000 (Zen 3) -> Ryzen 7000 (Zen 4). Zen 3 to Zen 4 was another expansion to the core, so the performance uplift comes from higher clocks and higher potential IPC.