
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (New Staff Post, Please read)



I forgot about that leak; the Sonic game ended up being real. Maybe there really is/was an Odyssey 2 in development... though if it was real, I can't see them just scrapping the project if a lot of the work was already done at that time. A new game expanding on the open concept of Bowser's Fury is personally what I'm hoping for, but Odyssey 2 would still be very exciting and I'm sure a great experience.
 
Series S only has 512 GB, with 360 GB usable. A 256 GB Switch 2 with the same usable ratio would be at around 180 GB, but likely more could be usable, since around 80% of the original Switch's 32/64 GB is usable.
I believe one major reason why storage space is so much smaller than expected for Series S and X is that they reserve space for multi-resume: you need to reserve dozens of GBs to dump the RAM of a game onto, and you need to have that space for multiple games (up to 5 I believe). So that will be 50 GB or more for the Series S.
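As a rough back-of-envelope sketch of the figures above (every number here is an estimate quoted from these posts, not an official spec), the arithmetic works out like this:

```python
# Rough storage estimate using the figures quoted above
# (Series S capacities and the quick-resume reservation are estimates, not specs).
series_s_total = 512           # GB advertised
series_s_usable = 360          # GB usable, per the post above

# Scale the same usable ratio down to a hypothetical 256 GB Switch 2.
switch2_total = 256
usable_at_series_ratio = switch2_total * series_s_usable / series_s_total
print(f"Usable at Series S ratio: {usable_at_series_ratio:.0f} GB")   # ~180 GB

# The original Switch keeps roughly 80% of its 32/64 GB usable; applying that instead:
usable_at_switch_ratio = switch2_total * 0.80
print(f"Usable at Switch 1 ratio: {usable_at_switch_ratio:.0f} GB")   # ~205 GB

# Part of the Series S gap is plausibly the multi-resume reservation:
# a RAM dump per suspended game, for several games (say 5 titles x ~10 GB of RAM).
quick_resume_reserve = 5 * 10
print(f"Hypothetical quick-resume reservation: ~{quick_resume_reserve} GB")
```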
 
COD is a lot like Ark: Survival Evolved. Installing it says it takes something like 180 GB, but when you look, that includes a million things you may not even want installed. Ark, for instance, installs every single unique file for every expansion pack you own, even for content you aren't playing. If you tell Ark to install only what you're actually using, suddenly around 120 GB of space is freed up.
 
Assets will be cut down in size, so not really.

EDIT: Actually, looking it up, the reason for CoD's large file size is that it reuses assets from previous games, which are deletable. Since Drake doesn't have those games, the size will be much smaller anyway.

Yes I agree that it's deliberately bloated nonsense, so it's a question of whether or not MS is going to actually force Activision to stop doing that shit.
Some porting studio will somehow make it work.

Also, there's always the chance of requiring a download code for the physical release.

With the PS5 version of MW3 being 140 GB, there's always a chance that the Switch 2 version will be 60-100 GB.

But I'm more curious about how much the Switch 2 cartridge will hold, since I'm guessing there must have been some sort of breakthrough for it to hold at least 64 GB.
Considering that the latest COD discs no longer even contain the game, there's a good chance Activision pulls similar BS with Switch 2... unless MS makes them not do that.
I'm not sure why it's a debate whether Black Ops 6 will be on Switch 2.

  • Switch 2 will be around Series S power
  • Nintendo and Microsoft have made a 10-year deal to bring COD to Switch platforms
  • Activision's CEO himself said he deeply regrets not bringing COD to Switch

Switch 2 will be getting COD.

No more "but file size" talk.
I didn't say COD won't be coming to Switch 2; I said that the biggest hurdle is going to be file size. If Activision actually puts effort into keeping the file size manageable, then it should be fine, but right now their trend has been to deliberately bloat file sizes for anti-competitive reasons. Hopefully that trend changes.
I brought this up as a possible stumbling block a while back, but the more I think about it, the less sense it makes.

Series S only has 512 GB, with 360 GB usable. A 256 GB Switch 2 with the same usable ratio would be at around 180 GB, but likely more could be usable, since around 80% of the original Switch's 32/64 GB is usable.

There's zero chance Microsoft makes a COD that can't be installed out of the box on a Series S, and I'm unaware of a game that's actually 360 GB in size.
Even a 200 GB COD install could be made installable on a 256 GB Switch 2 on day one, with less bloated (no 4K) assets and by targeting the device itself.

Microsoft would not have made a deal if it had no intention of porting those games to Nintendo platforms. In hindsight, my own initial thoughts were misplaced, and I'd guess yours are too.

I'm going to make a lame prediction that Switch 2 COD will be around 50-100 GB with optional installs.



Ditto, I agree.
It's a question of whether or not MS is willing to force Activision to put some actual effort into file size management.
 
Samsung seems desperate to approach NVIDIA with its 3nm process.
According to industry sources on the 20th, Samsung Electronics' Semiconductor (DS) Division, Foundry Business Unit, has set securing 3nm product orders from Nvidia as its top priority for this year.

An industry insider stated, "Each department has been notified to prioritize tasks related to securing orders from 'Nemo' over their existing duties, indicating a concerted effort." Within Samsung Electronics, "Nemo" is the code name for Nvidia.

However, it appears that the Foundry Business Unit has not established a dedicated task force or specialized team specifically for securing Nvidia-related orders.

Previously, in 2020, Nvidia entrusted the manufacturing of its consumer graphics processing unit (GPU) GeForce RTX 30 to Samsung Electronics' 8nm process, and they have continued to receive chips from this process. However, Nvidia has recently allocated most of the volume for its advanced process chips to TSMC, leading to a halt in orders for Samsung Foundry. Currently, Nvidia's AI semiconductors 'H100' and 'A100' are also manufactured using TSMC's 4nm and 7nm processes.
 
So I was actually watching one of Rich's videos where he dismantles the Xbox Series S, and holy crap, I had no idea that thing was so tiny! I know it's less powerful than the Series X, but it's still way smaller than I expected.
It's a really well designed piece of kit, at least at an industrial design level.

Which kind of makes it a little more disappointing: had Nintendo still been making full-on consoles and not hybrids, something even the size of the Wii U released in 2025 might have been significantly better than the PS5/XSX.
This is probably a bridge too far. You could clearly make a jump beyond the PS5 at the same size - not a leap but a jump. Or you could make the PS5 somewhat smaller. I don't think the tech exists to put PS5 levels of power in a Wii U shaped box, much less do it for less than 500 dollars.

It's not a perfect proxy, but the PS5 is roughly as powerful as a $400 card at the time it launched. Considering the console launched at $500, with an SSD, a CPU, a Blu-ray drive and 16 GB of RAM, you can see how much Sony was losing on the thing. That's why Nintendo had to get out of the graphics wars - they don't have gigantic non-video game lines of business that can subsidize 2 years of selling their core product at a loss.

In a side reality where Nintendo is still making dedicated TV consoles, and is launching an Nvidia-based device in 2025, they're probably still charging $400, and trying to make money on it. A good proxy then might be a sub-$300 card from Nvidia. Let's get optimistic and say the 4060. All of Nvidia's coolest bells and whistles, including Frame Generation, but basically no more performance than the Series X. And probably as big as the Series S (which is admittedly quite small).
 
Regarding a physical CoD BO6 version for the Switch, I wonder if Microsoft couldn't just put the single-player mode on the cartridge and use an additional download for the multiplayer mode. The multiplayer mode will receive constant updates anyway.
Yeah, I could see that being the case
If Activision actually puts effort into keeping the file size manageable, then it should be fine, but right now their trend has been to deliberately bloat file sizes for anti-competitive reasons. Hopefully that trend changes.
I thought it was simply a matter of them not caring; what are these anti-competitive reasons?
 
Thraktor's Guide to Concurrency in Ampere

One topic which comes up quite frequently in this thread (particularly in the context of DLSS) is how concurrency works in Ampere GPUs. In Nvidia's Ampere whitepaper, they point out new concurrency features in the architecture without providing much detail on how they work, or what kind of limitations there are in using them, besides some graphs showing a reduction in frame time from running graphics, RT and tensor cores at the same time in Wolfenstein Youngblood.

I've been trying to understand how this works myself, and in the process get a better understanding of how Nvidia's GPU architectures work at a lower level, and I feel like I've got a good enough understanding of concurrency in Ampere, at least between regular shader code and tensor cores, to explain it in a way which might be useful to people. And, in the process of writing, I will hopefully clarify some things for myself.

One thing I should mention is that what I'm describing below is a simplified explanation of how SMs work, mainly to make it easier to understand, but also because I don't know enough of the lower level details to speak with any confidence on it. The main simplification I'm making is ignoring pipelining, which is quite important, but also quite complex, and I feel the general points are the same even if we ignore pipelining, although the specific implementation differs a bit. I'm also going to ignore things like warps getting split at branches, complex instructions, etc.

A Quick Intro to GPUs

To start, I should cover some basic points on how GPUs work which will be relevant later. The most important of these is the concept of SIMT, or single-instruction-multiple-threads, which is the paradigm by which GPUs operate. This means what it says, which is that GPUs execute a single instruction across multiple threads of data at once. So, for example, if you have a pixel shader with a thread for each pixel, and there's an instruction which states "multiply X by Y and store the answer in Z", it will execute that instruction for every pixel in that thread group, even though they all may have different X and Y values.

In Nvidia's case, a group of threads which executes together is called a Warp, and a warp contains 32 threads. So each time an instruction is executed on a Nvidia GPU, it's run on a warp of 32 threads. At a higher level these are organised into what are called thread blocks, but that's not too important here.
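As a loose analogy (this is just the programming model, not how the hardware physically executes anything), a warp behaves like a vectorised operation over 32 lanes; a minimal Python/NumPy sketch:

```python
import numpy as np

# Loose SIMT analogy: one "instruction" (z = x * y) applied across all 32 threads
# of a warp at once, even though every thread holds different data.
WARP_SIZE = 32
x = np.random.rand(WARP_SIZE).astype(np.float32)   # per-thread X values
y = np.random.rand(WARP_SIZE).astype(np.float32)   # per-thread Y values

z = x * y          # a single operation executed for the whole warp

print(z.shape)     # (32,) -- one result per thread in the warp
```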

Each warp is issued to an SM (which are the building blocks of Nvidia's GPUs) to execute on, continuing to execute instructions until the shader has completed. Ordinarily a GPU would have a very large number of warps issued to its SMs at any one time. In Ampere's case, it can handle up to 48 warps per SM, and with 12 SMs on T239's GPU, that would mean up to 576 total warps, or 18,432 threads issued at a time.
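For what it's worth, the occupancy figure above is just straightforward arithmetic, using the Ampere per-SM warp limit and the SM count assumed for T239 in this thread:

```python
# Maximum resident work for the GPU as described above (the T239 SM count is the
# figure assumed in this thread, not an official spec).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48    # Ampere limit per SM
SM_COUNT = 12            # assumed T239 GPU

max_warps = MAX_WARPS_PER_SM * SM_COUNT       # 576 warps
max_threads = max_warps * WARP_SIZE           # 18,432 threads
print(max_warps, max_threads)
```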

The Ampere SM

Here's Nvidia's diagram of an Ampere SM from the whitepaper:

[Diagram: Ampere SM, from the Ampere whitepaper]


The Ampere SM is divided into four partitions, each of which contains registers, shader cores, tensor cores, instruction dispatch, etc. Each of these executes instructions independently from each other, and we'll look at them in more detail below. In addition to what's in the partitions, there is also an L1 cache/shared memory pool, texture units, and the RT core. We'll come back to the RT core later, but for the moment, let's focus on those partitions. Here's a diagram of a partition:

[Diagram: a single Ampere SM partition]


An SM partition contains everything needed to execute shader instructions independently. There is a register file, which stores the data being executed on by the threads, load/store units to move data in and out of those registers, and a warp scheduler and dispatch capable of dispatching instructions across three different data paths. The first data path is capable of executing FP32 and INT32 instructions, the second one is capable of executing just FP32 instructions, and the third datapath is capable of executing "tensor core" instructions and FP16 instructions.

Dispatching and Executing Instructions

If you look at the diagram of the SM partition, you'll see it notes (32 threads/clk) next to the dispatch unit. This is pretty important, as what it's saying is that, within each SM partition, one warp of 32 threads can be dispatched to one of the three data paths each clock cycle. This means that you can't simultaneously dispatch instructions to, say, both FP32 data paths within the same clock cycle. You would have to dispatch an instruction for one warp to one data path on one clock cycle, and then dispatch an instruction for another warp to the other data path on the next clock cycle.

Just because you can't dispatch to multiple data paths on the same clock cycle doesn't mean you can't have multiple data paths executing concurrently, though. Otherwise having multiple FP32 capable data paths would be useless if you can't use them at the same time. The key to this is that instructions typically take multiple cycles to execute.

I'm going to ignore pipelining here to keep things simple, but if you look at the two FP32 capable data paths in the diagram, you'll see each one is divided into 16 blocks. Nvidia calls these "CUDA cores" in marketing, although they're not really cores. What they actually tell us is that each one of these data paths can execute 16 FP32 operations per clock cycle. Now, if an FP32 instruction for a warp containing 32 threads is dispatched to one of these data paths, and it can execute 16 ops per clock, then it's straightforward to see that a standard FP32 operation (like fused multiply add, for example) would take two clock cycles to execute on one of these data paths.

If it takes 2 cycles to execute an FP32 operation on a warp, and the dispatch unit can issue one warp per cycle, then we can see how having two FP32 data paths becomes useful, as you can dispatch to each data path on alternate clock cycles, and, in theory at least, get 100% utilisation of both simultaneously.
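Here's a tiny toy scheduler (heavily simplified: no pipelining, no stalls, an endless supply of ready FP32 warps, structure made up purely for illustration) showing why alternating dispatch can keep both FP32 data paths essentially fully busy:

```python
# Toy model of one SM partition: one dispatch per clock, two FP32 data paths,
# each FP32 instruction (a 32-thread warp on a 16-wide path) busy for 2 cycles.
FP32_CYCLES = 2
CYCLES = 1000

busy_until = [0, 0]            # cycle at which each FP32 data path becomes free
busy_cycles = [0, 0]

for clk in range(CYCLES):
    # dispatch unit: issue one warp this cycle to the first free FP32 path
    for path in range(2):
        if busy_until[path] <= clk:
            busy_until[path] = clk + FP32_CYCLES
            busy_cycles[path] += FP32_CYCLES
            break              # only one dispatch per clock

for path in range(2):
    print(f"FP32 path {path} utilisation: {busy_cycles[path] / CYCLES:.0%}")
# With 2-cycle instructions and 1 dispatch/clock, both paths approach 100%.
```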

Tensor Code Is Just Shader Code

One important thing to note here is that tensor cores are pretty much just big shader cores designed for a very specific operation. While shader cores perform operations like add or multiply on individual numbers (across multiple threads at the same time), tensor cores perform multiplication operations on matrices. They run instructions which sit in shader code just like the other data paths do, and if you look at Ampere's instruction set, you can see those instructions, labelled HMMA and IMMA. So when the dispatch unit comes across an FP32 instruction, it will send it to either of the first two data paths, when it comes across an INT32 instruction it will send it to the first data path, and when it comes across a matrix instruction, it will send it to the tensor core data path.

For those curious, it seems that these matrix multiplication instructions are synchronised across the entire warp, where a single matrix multiplication is split over the 32 threads in the warp. This makes a lot more sense than trying to execute 32 separate matrix multiplication operations simultaneously, which would require a huge amount of register space.

To understand how well the tensor core can operate concurrently with the other data paths, we need to know a bit more about these instructions. I'm going to focus on FP16 matrix multiplications but the same logic applies to TF32, BF16, INT8, etc. From Nvidia's documentation we know that Ampere supports two matrix sizes for these operations, 16x8x8 and 16x8x16. We'll first look at the 16x8x8 case, which means multiplying a 16x8 matrix by an 8x8 matrix. This requires 1024 FMA operations to execute.

We can calculate from Nvidia's advertised performance figures that each SM is capable of executing 512 FP16 tensor ops per clock, ignoring sparsity (their numbers claim double this, by counting FMA as two operations). This means that the tensor core in each SM partition can execute 128 FP16 operations per clock. So, a 16x8x8 multiplication which requires 1024 operations to complete would execute in 8 clock cycles. By the same logic, a 16x8x16 multiplication would execute in 16 cycles. Tensor ops in FP16 with sparsity use 16x8x32 multiplications (or really 16x8x16 after accounting for the sparsity structure) and also execute in 16 cycles.
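The cycle counts above fall straight out of the arithmetic, using the 128-FMAs-per-clock-per-partition figure derived above:

```python
# FMA count of an MxNxK matrix multiply instruction, and how many cycles it takes
# at the assumed 128 dense FP16 tensor FMAs per clock per SM partition.
TENSOR_FMA_PER_CLK = 128   # per SM partition, derived above from Nvidia's figures

def mma_cycles(m, n, k):
    fma_ops = m * n * k
    return fma_ops, fma_ops / TENSOR_FMA_PER_CLK

print(mma_cycles(16, 8, 8))    # (1024, 8.0)  -> 8 cycles
print(mma_cycles(16, 8, 16))   # (2048, 16.0) -> 16 cycles
```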

Running Tensor Code Concurrently With Shader Code

Knowing that tensor cores run shader instructions, just like regular shader cores, and knowing how long those instructions take to execute, we can start to get an idea of how running tensor code like DLSS alongside regular shader code works.

Firstly, tensor code is bundled up in threads and warps just the same as any other code. For those 48 warps issued to an SM, you would have to have some regular shader code warps issued, and some warps which use tensor cores. For the sake of an example, let's say you have 32 warps of regular shader code, and 16 warps which use tensor cores. Then, for each SM partition, we can assume there are 8 warps of regular shader code issued to it, and 4 warps which use tensor cores.

Each cycle, the dispatch unit in an SM partition can dispatch one instruction from one of those 12 warps to one of the three data paths. To help illustrate how this would work, I've created a diagram showing a theoretically optimal case involving just standard FP32 instructions (let's say FMA) and FP16 matrix instructions of size 16x8x16:

[Diagram: dispatch timeline mixing FP32 instructions and 16x8x16 FP16 matrix instructions]


Each row is one clock cycle, going from top to bottom, and each column is one of the three data paths. On the left side I've shown what the dispatch unit is doing that cycle, and in each column I've shown colour when an operation is being executed (blue for FP32, green for tensor) with a dark line at the top where it's dispatched. If a cell is white, that data path is idle.

You can see here how all three data paths can be kept reasonably busy even though the dispatch unit can only issue an instruction to one of them each cycle. The first two data paths could theoretically achieve full utilisation if there was no tensor code, but even with tensor code being dispatched it has a relatively small effect, just forcing one idle cycle for every 16 execution cycles. The key to this is how long the tensor ops take to execute. If they completed very quickly, then there would be more of the idle cycles you see in the other two data paths whenever a tensor op needs to be issued.
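To put a number on the "one idle cycle for every 16 execution cycles" point, here's the earlier toy scheduler extended with a tensor path (again: no pipelining, no stalls, a simple made-up dispatch policy; only meant to illustrate the shape of the behaviour, not real hardware):

```python
# Toy dispatch model of one SM partition: two FP32 paths with 2-cycle FMAs, one
# tensor path with 16-cycle 16x8x16 HMMA ops, one dispatch per clock, no stalls.
FP32_CYCLES, TENSOR_CYCLES = 2, 16
CYCLES = 10_000

free_at = {"fp32_a": 0, "fp32_b": 0, "tensor": 0}
busy = {k: 0 for k in free_at}

for clk in range(CYCLES):
    # Simple policy: feed the tensor path whenever it is free, otherwise feed
    # whichever FP32 path has been idle the longest. One dispatch per clock.
    if free_at["tensor"] <= clk:
        free_at["tensor"] = clk + TENSOR_CYCLES
        busy["tensor"] += TENSOR_CYCLES
    else:
        ready = [p for p in ("fp32_a", "fp32_b") if free_at[p] <= clk]
        if ready:
            path = min(ready, key=lambda p: free_at[p])
            free_at[path] = clk + FP32_CYCLES
            busy[path] += FP32_CYCLES

for path, cycles in busy.items():
    print(f"{path}: {cycles / CYCLES:.1%} utilisation")
# The FP32 paths settle around 15/16 ≈ 94% busy; the tensor path stays essentially full,
# which lines up with the efficiency figure in the TL;DR below.
```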

It's important to note that this is a theoretically optimal case. The dispatch unit isn't always going to be able to issue an instruction on every cycle, because there may not be a warp with an instruction ready for dispatch. These warp stalls, as they're called, can be caused by a number of reasons, for example waiting on data to arrive from RAM, and are the reason for the large number of warps which are issued at a time. Even if some of the warps are stalled for whatever reason, having 12 of them to dispatch from means it's very likely that at least one of them will be ready to dispatch on any given clock cycle. Still, that's not guaranteed, and there are inevitably going to be missed cycles here and there.

On the specific issue of running shader code concurrently with tensor code like DLSS, although they can execute concurrently with almost full efficiency in a theoretical setup, there are additional bottlenecks to running them together. For instance, issuing a few DLSS warps to an SM alongside your shader code means there are fewer shader code warps available, so you're more likely to see all of them stalled, and the dispatch unit unable to issue to the FP32/INT32 data paths, than if you had a full complement of shader code warps. The same is the case with warps using tensor core code, where it's more likely that (say) 4 warps are all stalled waiting for memory than if all 12 warps were lining up to use the tensor cores.

Speaking of memory, regular shader code and DLSS would be competing for both cache and memory bandwidth. Hopefully this isn't too much of an issue with LPDDR5X providing more bandwidth than we expected, but it's still a potential limiter on performance when running them concurrently. Finally, I should note that ML models like DLSS aren't 100% matrix multiplication, and there are things like activation layers in there which will require regular shader code (likely FP16), but that should be a relatively small portion of the execution time.

Running FP16 Code Concurrently With FP32 Code

Another closely related topic that has come up a few times is the possibility of running non-tensor FP16 code concurrently with FP32 code, as the Ampere whitepaper states "standard FP16 operations are handled by the Tensor Cores in GA10x GPUs". In fact, I've argued myself in the past that this could be useful for developers with a mix of FP32 and FP16 code, but it seems like it's less useful in practice than running tensor code concurrently with FP32 code.

The reason for this is execution time. While tensor operations are on nice big matrices which take up to 16 cycles to complete, which leaves a lot of time for the dispatch unit to also issue FP32 instructions, non-tensor FP16 instructions are at the other end of the spectrum, executing very fast. From Nvidia's performance figures, we know that non-tensor FP16 instructions are executed at a rate of 128 per SM per clock, which means that a tensor core data path in one of the SM partitions can execute 32 FP16 operations per clock when running non-tensor code.

With the dispatch unit issuing one warp of 32 threads per clock, though, this means that non-tensor FP16 instructions execute in a single clock cycle. So, the dispatch unit would have to dispatch to the tensor core data path every single clock cycle to fully utilise the non-tensor FP16 performance available, and it can't do that while also dispatching to the other two data paths.

Here's another diagram like the one above, but now showing a combination of FP32 instructions and non-tensor FP16 instructions (the latter in red):

[Diagram: dispatch timeline mixing FP32 instructions and non-tensor FP16 instructions]


You can see the issue here, as the dispatch unit is dispatching on every clock cycle, but there's still a lot of idle cycles across the three data paths. In fact, so long as it's dispatching every clock cycle, then the achievable performance is 32 operations per clock (or 128 per clock for the entire SM) regardless of whether those are FP32 instructions or FP16 instructions or a mixture of both, just by virtue of the dispatch limitation. Taking pipelining into account would change the behaviour a bit, but the dispatch limitation would remain the same either way.
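The dispatch cap can also be expressed as simple arithmetic: whatever the mix of issue slots, one 32-thread warp per clock means the partition tops out at 32 operations per clock (a sketch under the same no-stall assumptions as above):

```python
# Dispatch-limit arithmetic for one SM partition: each dispatched warp eventually
# retires 32 operations, and the dispatch unit can issue at most one warp per clock,
# so average throughput is capped at 32 ops/clock per partition regardless of the
# FP32 / non-tensor FP16 mix.
WARP_SIZE = 32
PARTITIONS_PER_SM = 4

for fp16_share in (0.0, 0.25, 0.5, 1.0):      # fraction of issue slots given to FP16
    fp32_ops = (1 - fp16_share) * WARP_SIZE   # average FP32 ops retired per clock
    fp16_ops = fp16_share * WARP_SIZE         # average FP16 ops retired per clock
    total = fp32_ops + fp16_ops
    print(f"FP16 share {fp16_share:.0%}: {total:.0f} ops/clk per partition, "
          f"{total * PARTITIONS_PER_SM:.0f} ops/clk per SM")
```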

That doesn't mean that using FP16 isn't worthwhile, as it takes up less space in memory, less register space, and less bandwidth, even if you're not getting faster execution. I'm also very curious if any developers can make use of the tensor cores' matrix multiplication operations for non-ML use-cases. If you've got a problem which maps well to matrix multiplication and where FP16 is sufficient, then you could achieve a very large speedup by rewriting your shaders to make use of the HMMA ops. (I'm also curious if Nvidia have provided tools to do so, as the way in which matrices are synchronised across the warp means these use cases would have to be handled a bit differently from regular shader code by the compiler).

Running RT Cores Concurrently With Everything Else

I've focussed mostly on running tensor core code like DLSS concurrently with regular shader code, but another feature of Ampere is the ability to run RT concurrently with both of these. There's less to say here, as Nvidia is more explicit about the functionality, saying in the whitepaper that "The new GA10x SM allows RT Core and graphics, or RT Core and compute workloads to run concurrently, significantly accelerating many ray tracing operations."

The RT core is responsible for BVH traversal and triangle intersection testing, ie finding exactly what triangle a given ray intersects, and where, and it's fixed-function hardware which sits apart from the SM partitions we talked about before, so it makes sense that it can be made to operate independently. This doesn't cover the entirety of RT workloads, as you still need shaders to create rays and process them after a hit is found, and to perform any work required after that (eg shading reflections), but it means the shader cores don't have to sit idle while the RT cores are doing their thing, or vice-versa.

TL;DR:

From purely a point of view of executing instructions within an SM, Ampere GPUs should be able to concurrently execute both regular shader code and tensor core code like DLSS, with theoretically up to about 94% efficiency achievable given the limits of the dispatch units. In reality there's likely to be contention between the workloads over things like cache and bandwidth, so real-world benefits would be lower, but I'd still imagine you'd get a good performance boost over running them sequentially, particularly given the relatively high bandwidth from LPDDR5X.

For FP32 and non-tensor FP16 code, while they can operate somewhat concurrently, the limitation of the dispatch unit only being able to dispatch one warp per cycle, combined with the very quick execution time of the FP16 instructions, means the benefit to running them together is small. There still can be register/memory/bandwidth benefits to using FP16 code, though.

RT cores can operate concurrently with everything else, as per Nvidia's whitepaper. Not much more to say here.

This is all based on what I can understand from Nvidia's whitepapers and other sources online, but it's not something that Nvidia have ever fully documented, publicly at least. So if anyone has any corrections or anything else to add it would be much appreciated.
This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.
 
Yeah, I could see that being the case

I thought it was simply a matter of them not caring; what are these anti-competitive reasons?
I don't know if there's reason to believe it beyond a joke, but I've seen people say that the reason some games are so big is so you don't have room left to install the competition.
 
What if 'SP' is the abbreviation of the next console 🤯.
Maybe this project started development around the same time as, and alongside, the potential "Switch Pro" that ended up not materializing because of COVID and chip shortages. So "Switch Pro Red" or "Switch Pro Mario", and the codename stuck.
 

However, Nvidia has recently allocated most of the volume for its advanced process chips to TSMC, leading to a halt in orders for Samsung Foundry.
Sounds like good news for TeamTSMC4N.
 
I don't know if there's reason to believe it beyond a joke
For me, I really don't see it as anything other than a joke; the idea that publishers would be that adversarial about things just seems too outlandish. Like, surely these publishers have to realize that it could just as easily happen the other way around? As in, the competition locks you out with no chance for coexistence because your game won't fit in the remaining storage.
 
This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.
4N, a big L2 cache and the OFA from Mrs. Wong, and everyone is happy 😄
 
For me, I really don't see it as anything other than a joke; the idea that publishers would be that adversarial about things just seems too outlandish. Like, surely these publishers have to realize that it could just as easily happen the other way around? As in, the competition locks you out with no chance for coexistence because your game won't fit in the remaining storage.
Part of why AAA publishers keep going for GaaS over single-player games is that a core goal of these products is to make money by monopolizing players' time.

So yes, they would in fact be that adversarial. Especially a company like Activision, which was one of the earliest companies to aggressively push loot boxes (starting with CoD: Advanced Warfare and later Black Ops 3).
 


I forgot about that leak; the Sonic game ended up being real. Maybe there really is/was an Odyssey 2 in development... though if it was real, I can't see them just scrapping the project if a lot of the work was already done at that time. A new game expanding on the open concept of Bowser's Fury is personally what I'm hoping for, but Odyssey 2 would still be very exciting and I'm sure a great experience.

That was a survey made by Sega. It has no relation to any game that may or may not be/have been in development at Nintendo.
 
Series S has the same CPU as PS5 and Series X: two Zen 2 clusters, each with 4 cores. I believe the A78C will end up clocked at 2.0-2.5 GHz, and yes, I know it must be the same clock in both modes. Of course that's on TSMC 4N; Samsung 8nm shouldn't be a debate anymore.

I would be very surprised if the Switch 2's CPU clock isn't below 2 GHz.
 
SilverStar would very clearly be Super Luigi Odyssey, given silver is used for second best. It was cancelled when, during development, Luigi couldn't throw his hat very far no matter how many takes they filmed.
 
Nintendo will open a second US store, in San Francisco, in 2025:


I see this as related to Switch 2 in some ways.

I see Nintendo pushing more movies, stores and theme parks as a way to increase Nintendo's mindshare among consumers, to lessen the risk of having Wii U moments in the future. The greater the mindshare Nintendo has among gamers and the general public, the lower the risk of releasing products that don't sell well enough.

And I think they are pushing hardest in the US because the US is their largest single market, and the dollar exchange rate makes it even more lucrative due to the very weak yen.

But I also think they will start to push for stores in Europe in the future, and maybe a theme park in Europe as well.
 
This is probably a bridge too far. You could clearly make a jump beyond the PS5 at the same size - not a leap but a jump. Or you could make the PS5 somewhat smaller. I don't think the tech exists to put PS5 levels of power in a Wii U shaped box, much less do it for less than 500 dollars.

You don't think? I wonder - a Zen4/RDNA3 APU made on N4P with a vapour chamber seems like it could be made very small, and the cooling solution could be cut down significantly. Though that might push it beyond 500 bones.

This is an absolute classic Thraktor post, thanks a ton for making it! It is in-depth posts like this that make this thread such a one-of-a-kind experience.

From this, it seems like tensor operations used in matrix multiplication should be worthwhile to run concurrently, of course under the assumption that memory can be supplied sufficiently fast. What the ultimate performance hit for the ALUs will turn out to be remains a question, but the hope would be that this type of overlapping can hide a (hopefully large) portion of the computational requirement of DLSS, which in practice would give us a frame time boost vs. the non-concurrent implementation of DLSS (where DLSS happens as a separate operation and the FP32 warps are stalled).

In terms of latency, doing DLSS of the current frame simultaneously with the rendering of the next frame induces a frame's worth of input lag (33 ms for 30 fps, 17 ms for 60 fps). That's of course not ideal, but it seems like it could be a worthwhile trade-off, assuming that we can claw back a good chunk of the frame time buffer for non-DLSS tasks.

Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
 
My 2 cents on the next 3D Super Mario game: what if it's a sequel to BOTH Galaxy AND Odyssey?
Like "Super Mario Odyssey into the Galaxy" or something.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
Even turning the majority of those super-easy-to-get ones into moon fragments or silver moons or something instead would've helped a lot in differentiating them from the more involved ones.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.

I liked it fine - how many moons were there? 830? I got them all (I have the star on my save file).

For some, they just want to get to the ending of the main story (which can be done with far fewer moons).

For others, they want some longevity to the game itself after completing the main story, with extra challenges increasing replayability (i.e. me).

Personally I love grabbing moon after moon (sometimes seconds apart); it's a constant dopamine hit.

Looks like they struck a good balance; it tries to appeal to those with different goals in mind.
 
Super Mario Odyssey had one flaw that I hope can be fixed.

Too many damn "moons". They weren't special. They even had welfare moons. Let's get back to the perfectly fine 120.
Not a flaw. Moons weren't supposed to be a big hype get; they were supposed to be a satisfying end to a bite-sized gameplay segment you could enjoy in a quick handheld play moment.

Want the end of handheld game design? Say goodbye to the lucrative hybrid form factor.
 
Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
If I understand what you're getting at properly, I think you're right. So if it was a 30 fps game but the concurrent DLSS (and whatever further post-processing) work took less than 16.6 ms, the output wouldn't need to be 33.3 ms behind just because that's what each frame has been given.

30 fps frame output (on a 60 Hz screen) could go from
BBCCDDEE
to
ABBCCDDE

and maybe the difference could be even less on a variable refresh rate screen? Though I don't feel I have a solid enough understanding of how all the timing stuff works there to say with certainty.
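To illustrate that argument (purely this post's reasoning, with made-up numbers, assuming a fixed 60 Hz scanout and that only DLSS plus post-processing run after the base render):

```python
import math

# Sketch of the vblank-slot argument above. Illustrative numbers only, not measurements.
VBLANK_MS = 1000 / 60          # 16.7 ms display refresh
FRAME_MS = 1000 / 30           # 33.3 ms game frame

def extra_presentation_delay(dlss_plus_post_ms):
    # The finished image can go out at the first vblank after DLSS/post completes.
    slots_late = math.ceil(dlss_plus_post_ms / VBLANK_MS)
    return slots_late * VBLANK_MS

print(f"{extra_presentation_delay(10):.1f} ms late")   # < one vblank of work -> 16.7 ms (ABBCCDDE)
print(f"{extra_presentation_delay(20):.1f} ms late")   # > one vblank of work -> 33.3 ms (a full frame)
```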
Not a flaw. Moons weren't supposed to be a big hype get; they were supposed to be a satisfying end to a bite-sized gameplay segment you could enjoy in a quick handheld play moment.
In my day we called those "blue coins"!
 
You don't think? I wonder - a Zen4/RDNA3 APU made on N4P with a vapour chamber seems like it could be made very small, and the cooling solution could be cut down significantly. Though that might push it beyond 500 bones.



Wait, would the latency actually be one frame? Or would it just be the time taken to do the DLSS, since that would not change from frame to frame?
The thing is that when you overlap DLSS, you get the following situation: on the CPU, the updates for frame 1 (including, importantly, the user input) have finished, and now the GPU will render the base image before DLSS using the ALUs (non-tensor compute hardware); let's call that stage A. Then, frame 1 needs to go through DLSS, which we call stage B, and at the end frame 1 needs to have post-processing applied to the post-DLSS image, called stage C. By the time frame 1 enters stage B, frame 2 has performed its CPU processes (and gathered input based on what is on the screen) and can enter stage A. Once stage B of frame 1 finishes, stage C can start for frame 1. Only once stage C is finished do we output frame 1 to the screen.

What this means is that we can only output frame 1 once stage C is finished. The benefit we extract from overlapping DLSS is that instead of having stage A + B + C take the total time of one frame (17 ms for 60 fps, 33 ms for 30 fps), we can now have stage A + C take up almost all of the frame time without stage B reducing the available time for native rendering and for post-processing. In practice, this likely means that there is more time to produce a high quality base image before DLSS. However, the consequence is that the full rendering of a frame, which includes stages A, B, and C, does not finish within the boundary of a frame. This means that we "miss" outputting the current frame, and have to wait one frame before we can output the current frame. Because of overlapping of the stages, this does not compound, but it does mean that each frame's display happens one frame later than the moment the CPU gathered user input for it, and therefore we experience one frame's worth of input lag. Essentially, the CPU has gathered input for frame 2 while the player is viewing frame 1, and therefore their input is mismatched by one frame.

This is my understanding of it, at least. If my understanding was wrong or someone wants to add additional nuance to it, please reply (it helps the discussion and everyone's understanding of the process)!
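As a timeline sketch of that description (the stage costs below are made-up placeholders, purely to show where the extra frame of latency comes from):

```python
# Timeline sketch of the overlapped pipeline described above: stage A (base render)
# fills frame N's own slot, while stage B (DLSS) and stage C (post-processing) for
# frame N run during frame N+1's slot, concurrently with frame N+1's stage A.
# All durations are illustrative assumptions, not measurements.
FRAME_MS = 33.3                       # 30 fps frame budget
B_MS, C_MS = 4.0, 3.0                 # hypothetical DLSS + post-processing cost

for n in range(3):
    input_ms = n * FRAME_MS                     # CPU samples input for frame n
    a_done = (n + 1) * FRAME_MS                 # stage A fills frame n's own slot
    bc_done = a_done + B_MS + C_MS              # B + C run early in frame n+1's slot
    presented = (n + 2) * FRAME_MS              # shown one frame boundary later
    print(f"frame {n}: input at {input_ms:5.1f} ms, A done {a_done:5.1f}, "
          f"B+C done {bc_done:5.1f}, presented {presented:5.1f} "
          f"(input-to-display ≈ {presented - input_ms:.1f} ms)")
# Without overlapping, frame n would be presented at (n + 1) * FRAME_MS instead,
# so the overlap costs roughly one extra frame (~33 ms at 30 fps) of input lag.
```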
 
Maybe this project started development around the same time as, and alongside, the potential "Switch Pro" that ended up not materializing because of COVID and chip shortages. So "Switch Pro Red" or "Switch Pro Mario", and the codename stuck.
Maybe SPR = Super? A la Super Switch?
 
I liked it fine - how many moons were there? 830? I got them all (I have the star on my save file).

For some, they just want to get to the ending of the main story (which can be done with far fewer moons).

For others, they want some longevity to the game itself after completing the main story, with extra challenges increasing replayability (i.e. me).

Personally I love grabbing moon after moon (sometimes seconds apart); it's a constant dopamine hit.

Looks like they struck a good balance; it tries to appeal to those with different goals in mind.
The problem is that they don't feel as rewarding as the old stars (they're literally lying around every corner). They could've added other collectibles instead of devaluing the moons.
 
One pretty major issue with the idea of DLSS concurrency is that any tensor core operation outside of the DLSS step would be unusable without incredible timing with regards to programming.

I assume the cost of DLSS will be like 4 ms which is substantial but not killer.

If you could get Super Resolution and Ray Reconstruction to work with this concurrency idea, it becomes interesting, but Ray Reconstruction appears to be much more expensive than Super Resolution, and I assume concurrency will add plenty of frame-time cost from cache misses, so it's hard to imagine Super Resolution and Ray Reconstruction completing within 16.66 ms with constant cache misses.

This also depends on whether NVIDIA is willing to spend the resources to make a shittier, faster version of Ray Reconstruction that is designed around much more limited ray tracing.
 
The problem is that they don't feel as rewarding as the old stars (they're literally lying around every corner). They could've added other collectibles instead of devaluing the moons.
But then you can't beat the game while doing only the moons you want
 
Playing Ghost of Tsushima on Steam Deck has only increased my appetite for Switch 2's potential. Maybe it's odd that I'm thinking about Switch 2 while playing a Sony exclusive title on a handheld PC... but I can't help myself. Comparison is inevitable.

Some thoughts:
  • This is one of the prettiest games I've ever played, and it is a 'mere' PS4 game running at PS4 equivalent settings (dynamic 1080p with FSR2, 30 FPS, medium-low settings). Asset quality is what you'd expect - but the art direction - the way light bounces, how grass is shaded, how the world presents itself - stunning. And this is despite me cranking the settings way down - it's very scalable.
  • HDR is a must for Switch 2. It just looks stunning and gives an immediate upgrade to visuals regardless of resolution. I expect it in both modes.
  • I'm playing with a 4K framebuffer, so the Deck is always outputting a 2160p 60 Hz HDR signal to the TV. This is what I expect for Switch 2. The game is running at a dynamic upscaled 1080p, and the image quality is outstanding. Increasing the resolution past this point results in noticeable but negligible differences in sharpness. The distinction between 1440p and 4K is especially not noticeable when I'm 8 feet away from my 65'' TV.
  • 30 FPS should be serviceable for most people, yes even on an OLED TV, as long as the game is responsive, the FPS is locked, and/or there is a good motion blur solution. I've had no issues playing this game for hours this way. (I know that some people physically cannot handle 30 frames, this is where I hope developers offer performance modes or support for 40 FPS in some way)
  • I expect many demanding multiplatform ports to target 1080p 30 FPS after DLSS, and that would be a fantastic outcome. Aiming for higher is good of course. I don't expect the same degree of diminished visuals as with the Switch's 'impossible ports', which sacrificed resolution/framerate/detail all at once.
  • Fuck it, if some games need to get 720p/900p docked after DLSS just to make it onto the Switch 2, I'll take it, as long as the image itself is still well anti-aliased and packed with detail.
  • At living room distance (which is what Nintendo would care about most for docked mode) - the Switch 2 should provide consistently high quality visuals in docked mode, to the point where many complaints about image quality and performance should disappear. We'll see the inevitable complaints about '1080p in 2025' but there will be disinterest in the spec wars, considering how good games should look.
I've enjoyed using my Steam Deck as a mini console and have finished up recent games like Lies of P on it, I'm willing to accept all the cutbacks to have a portable device. It's not a very seamless hybrid device though, so the Switch 2 will end up being my go-to for recent releases if third parties step up their game.
 
Based on testing, Ray Reconstruction costs about 1 ms more frame time (on a 4070) than the denoisers needed for just ray-traced reflections.

And a 4070 has 4x the core count of the Switch 2, slightly more than 4x the bandwidth, and most likely nearly twice the Switch 2's clock speed in docked mode and around 3.5x its clock speed in handheld mode.

So we're talking about Ray Reconstruction costing 1 ms more than RT-reflections-only denoisers on a card that is something like 6x more powerful. Assuming RT reflection denoisers cost around 0.5 to 1 ms on a 4070, we're looking at an estimated Ray Reconstruction cost of 1.5 to 2 ms on a 4070. And this quickly scales to unusable on the Switch 2.

Note that DLSS Super Resolution 1440p costs 0.66 ms on a 3070 and 0.37 ms on a 4080 (4070 frametimes are not given) so Super Resolution 1440p probably costs around 0.5 ms on a 4070.
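Scaling this post's own rough numbers (the ~6x relative-power figure and the per-feature costs are all the post's estimates, not measurements) gives a feel for the budget:

```python
# Back-of-envelope scaling of this post's rough figures; every input is an estimate.
RR_COST_4070_MS = (1.5, 2.0)      # estimated Ray Reconstruction cost on a 4070
SR_COST_4070_MS = 0.5             # estimated DLSS Super Resolution 1440p cost on a 4070
RELATIVE_POWER = 6                # the post's "~6x more powerful than Switch 2" guess

rr_low, rr_high = (ms * RELATIVE_POWER for ms in RR_COST_4070_MS)
sr = SR_COST_4070_MS * RELATIVE_POWER
print(f"Ray Reconstruction on Switch 2: ~{rr_low:.0f}-{rr_high:.0f} ms per frame")
print(f"DLSS SR 1440p on Switch 2:      ~{sr:.1f} ms per frame")
print("Frame budgets: 16.7 ms at 60 fps, 33.3 ms at 30 fps")
```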

So, will NVIDIA make a more limited version of ray reconstruction to help the Switch 2 as well as like 3060 owners? Uhhhhh, not sure, but would be nice.
 
Makes me wonder if Nintendo still has some games ready for the Switch 2 pipeline, but the delays were mostly because of 3D Mario.

Like, I'm quite impressed that Nintendo now has the capability of withholding games, compared to the Wii U era, in which we waited so long for games to arrive.

 
Makes me wonder if Nintendo still has some games ready for the Switch 2 pipeline, but the delays were mostly because of 3D Mario.

Like, I'm quite impressed that Nintendo now has the capability of withholding games, compared to the Wii U era, in which we waited so long for games to arrive.


They already held back FE Engage and XB3 before, which was a little bit beyond my expectations.
 


So regarding Hellblade 2 and how it runs on Xbox Series S and the Deck...

Assuming this game launched on Switch 2, what would be the expected performance we could get, considering that the resolution and lack of Lumen on Series S are the biggest downgrades, and upscaling via DLSS plus RT would, in theory, be better on Switch 2?


Well, I tried to play around with the RTX 3050 Mobile 4 GB and underclock it like I did for Control, but it's not possible. 4 GB, even with a 720p target resolution and DLSS Ultra Performance, eats up all the VRAM. I know that the game needs 6 GB of VRAM at its minimum spec, but I thought, why not try it out 😅.



The difference between the simulated "T239" speed and the GPU at full speed is quite marginal.

[Screenshots: underclocked + voltage limit (lowest possible), sub-20 fps vs. default, ~20 fps]
 
I expect many demanding multiplatform ports to target 1080p 30 FPS after DLSS, and that would be a fantastic outcome. I don't expect the same degree of diminished visuals as with the Switch's 'impossible ports', which sacrificed resolution/framerate/detail all at once.
This seems to come up a lot, but I doubt it. Unless DLSS is much more costly than we've been guesstimating, it's decreasing the input resolution that will have a much bigger effect on system resources, especially if it only needs to be done 30 times each second. Unless they're really pushed as far as doing something like sub-360p input in docked mode, in which case the advantages of upscaling to 1440p rather than 1080p would be smaller.

Though I'm tempted to play with DLSSTweaks and see what some stuff like 360p->1440p comes out like.
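A quick pixel-count comparison behind that argument (treating base render cost as scaling roughly with input resolution and DLSS cost as scaling mostly with output resolution, which is a rule of thumb rather than a measurement):

```python
# Pixel counts for common DLSS input and output resolutions.
def pixels(w, h):
    return w * h

inputs = {"360p": (640, 360), "540p": (960, 540), "720p": (1280, 720)}
outputs = {"1080p": (1920, 1080), "1440p": (2560, 1440)}

for name, (w, h) in inputs.items():
    print(f"input  {name}: {pixels(w, h) / 1e6:.2f} MP")
for name, (w, h) in outputs.items():
    print(f"output {name}: {pixels(w, h) / 1e6:.2f} MP")

print(f"{1 - pixels(960, 540) / pixels(1280, 720):.0%} fewer base-render pixels going 720p -> 540p")
print(f"{pixels(2560, 1440) / pixels(1920, 1080) - 1:.0%} more DLSS-output pixels going 1080p -> 1440p")
```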
 
This seems to come up a lot, but I doubt it. Unless DLSS is much more costly than we've been guesstimating, it's decreasing the input resolution that will have a much bigger effect on system resources, especially if it only needs to be done 30 times each second. Unless they're really pushed as far as doing something like sub-360p input in docked mode, in which case the advantages of upscaling to 1440p rather than 1080p would be smaller.

Though I'm tempted to play with DLSSTweaks and see what some stuff like 360p->1440p comes out like.

To me it's less a question of the technical capability of the system and more one of priority for developers, i.e. the minimum they might be willing to ship to deliver the port on time. If dynamic DLSS is in play, they may allow the input to vary as it needs to, and keep 1080p output as a 'safe' resolution that can guarantee a nice image. Especially since it meets the minimum target of handheld mode being 1080p.

Preferably they aim as high as they can; I don't really know how much work it takes to optimize for one resolution vs. the other.
 
Well, I tried to play around with the RTX 3050 Mobile 4 GB and underclock it like I did for Control, but it's not possible. 4 GB, even with a 720p target resolution and DLSS Ultra Performance, eats up all the VRAM. I know that the game needs 6 GB of VRAM at its minimum spec, but I thought, why not try it out 😅.
Yeah that is almost entirely VRAM bottleneck city right there.

I've watched some more RTX 2050 gaming benchmarks and mentally deducted ~25% of the performance, and honestly most of them look and run pretty well. Compared to the Switch, the improvement in visuals is astronomical.
 

