
StarTopic Future Nintendo Hardware & Technology Speculation & Discussion |ST| (Read the staff posts before commenting!)

I'm not sure you're even reading what I'm saying, because you're responding to a post where I literally point out that BotW's content is not particularly representative of Nintendo's output as a whole.

Also like, I'm not sure what the deal is with the sudden focus on development time. Making video games is a creative process with idiosyncratic time costs at the best of times, and as you likely remember, the last few years have not been "the best of times". Troubled development can happen regardless of what sort of hardware power or fidelity is being targeted.

And again, all consoles are better utilized as time goes on. Just look at PS5/XS, which still frequently aren't even the lead platform for first party games.
That's not really the hardware being "better utilized" so much as it finally being used in the first place (games made for last gen are limited for obvious reasons), and the way things are going, this year will be a good one for those two. Nintendo will definitely have technical marvels from their darling studios, because they need the generational jump and the early good press in any case, but that doesn't change the fact that, outside of BotW's dual-team setup, if they keep drip-feeding budgets to their titles like they currently do (and many unironically want them to), nobody should be surprised to see them noticeably behind the curve once again, let alone be asking for a better SoC than Drake. As for the thread, many here already think Drake will be too dated by 2024, but in such a scenario you'd be lucky to even get it to render what the 2013 consoles already did.
 
May I ask for a somewhat recent update to the current standings of Switch 2 leaks? I was fairly caught up until everything that happened last month and I just feel very lost now. I used to like to read up what’s happened here but I feel so behind. Maybe someone has posted something like this, I’ll happily take a link then to read up.
I can only respond with my own perspective, and others may disagree.

I. Nate and friends declared in a podcast that the release of a certain hardware, originally scheduled for early 2023, was canceled.
  • It is unclear what this hardware was (Pro? Drake?), and whether the release plan or the hardware itself was canceled.
  • The dev kit withdrawal, 2024 release, and a few other talking points are surrounded by weasel words such as “I wonder”.
II. A Nikkei opinion piece and a Bloomberg report publicized that Nintendo intends to increase production for FY03/2024.
  • Both disclosed that the info came from the supply chain. Nikkei suggested 20 million units, and Bloomberg simply more than last year.
  • I wonder about the motivation of these suppliers talking to the press. There are articles such as this one on Economic Daily News pumping the stocks of Nintendo suppliers, published immediately after the reports above.
III. Bailey (IGN), Robinson (VGC), and Dring (GI.biz) stated that Nintendo has no major releases after TotK, hence bowing out of E3.
  • To take these statements as facts, one has to believe that they all independently verified Nintendo's entire release schedule, all independently deemed none of the titles worthy, and all independently concluded that this was the reason for the E3 absence.
  • Not to mention the liberal use of weasel words such as “huge game”, “significant game”, and “light schedule”.
  • Also note in the same IGN article, Bailey reported that Microsoft won’t have a floor presence due to marketing budget cuts.
  • It is easier for me to believe that the source of these E3 apologist reports was ESA/ReedPop covering their asses, than to believe that Nintendo has no games or Microsoft has no money.
 
Oh okay. So the idea is the sparsity algorithm starts with a dense matrix, and doesn't just cut values near 0 but modifies the matrix such that it's possible to cut 50% of the values without changing the output too much. But obviously there's still some loss.
Basically it's an optimization: reducing the processing significantly while dropping quality only a little.

But that means the question remains: does Nvidia use sparsity acceleration for DLSS, or do they skip it and prefer accuracy?

EDIT: found this in Nvidia's whitepaper for GA102:


I am completely lost. Is it 2x or up to 2x? Is it supposed to mean that desktop Ampere supports 2:4 sparsity, but some other Ampere GPUs don't support that but still support sparsity in some manner? Is the second part meant to simply do a comparison with Turing and isn't supposed to be related to the first? WHY DOES NOTHING MAKE SENSE?
I went into the TensorRT documentation and a different Nvidia whitepaper to look into this more, and here's what I think is going on to the best of my understanding. I'll separate this post into two sections on sparsity workflow and the theoretical 2x speedup.

Sparsity workflow

Each layer of the neural network has to have 0s in the weights acting on 2 out of each set of 4 channels to use 2:4 sparsity.
For each output channel and for each spatial pixel in the kernel weights, every four input channels must have at least two zeros. In other words, assuming that the kernel weights have the shape [K, C, R, S] and C % 4 == 0, then the requirement is verified using the following algorithm:

Python:
import numpy

# Check that, for every output channel k and kernel position (r, s), each group
# of 4 consecutive input channels has at most 2 nonzero weights.
hasSparseWeights = True
for k in range(0, K):
    for r in range(0, R):
        for s in range(0, S):
            for c_packed in range(0, C // 4):
                if numpy.count_nonzero(weights[k, c_packed*4:(c_packed+1)*4, r, s]) > 2:
                    hasSparseWeights = False
When you train a dense neural network, this doesn't happen naturally, but some of the weights may be very small. You can prune/clip the smallest 2 out of every 4 weights to zero and force the network to have structured sparsity, but you will lose accuracy in the process. However, there is a tool called Automatic SParsity (ASP) that allows you to do a second training step after pruning to further optimize the weights and reduce the accuracy loss. I think this is the part that you are talking about with the extra optimization step.
Forcing kernel weights to have structured sparsity patterns can lead to accuracy loss. To recover lost accuracy with further fine-tuning, refer to the Automatic SParsity tool in PyTorch.
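Just to make the pruning step concrete, here's a minimal numpy sketch of my own (not the actual ASP implementation) that zeroes out the two smallest-magnitude weights in each group of four input channels:
Python:
import numpy as np

def prune_2_4(weights):
    # Zero the 2 smallest-magnitude values in every group of 4 input channels,
    # assuming kernel weights of shape [K, C, R, S] with C % 4 == 0.
    # Illustrative only; the real ASP tool also keeps pruning masks around for fine-tuning.
    K, C, R, S = weights.shape
    pruned = weights.copy()
    blocks = pruned.reshape(K, C // 4, 4, R, S)       # view input channels as groups of 4
    smallest = np.argsort(np.abs(blocks), axis=2)[:, :, :2, :, :]   # 2 smallest |w| per group
    np.put_along_axis(blocks, smallest, 0.0, axis=2)  # zero them out
    return blocks.reshape(K, C, R, S)

A layer pruned like this should pass the hasSparseWeights check above; recovering the accuracy it loses is exactly what the ASP fine-tuning step is for.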
The readme at the Github repo for ASP shows the standard way to use this for a pretrained dense model. Basically, you train the dense model, prune 2 of every 4 weights, then do the second training step with the sparse weights until the difference in accuracy is negligible:
Python:
model = define_model(..., pretrained=True) # define model architecture and load parameter tensors with trained values (by reading a trained checkpoint)
criterion = ... # compare ground truth with model prediction; use the same criterion as used to generate the dense trained model
optimizer = ... # optimize model parameters; use the same optimizer as used to generate the dense trained model
lr_scheduler = ... # learning rate scheduler; use the same schedule as used to generate the dense trained model

from apex.contrib.sparsity import ASP
ASP.prune_trained_model(model, optimizer) # prune the trained model

x, y = DataLoader(args)
for epoch in range(epochs): # train the pruned model for the same number of epochs as used to generate the dense trained model
    optimizer.zero_grad()  # clear gradients accumulated by the previous step
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    lr_scheduler.step()

torch.save(...) # saves the pruned checkpoint with sparsity masks
There's a nice figure in the whitepaper that diagrams what this multistep training process would look like:
[Figure from the whitepaper: dense training, then 2:4 pruning, then fine-tuning with the sparsity pattern fixed]


2x speedup

My best guess for why it's "up to 2x" performance is that, in some cases, structured sparsity is not faster than normal performance, even though there's twice as much arithmetic throughput. TensorRT will tell you which layers fit these criteria.
At the end of the TensorRT logs when the TensorRT engine is built, TensorRT reports which layers contain weights that meet the structured sparsity requirement, and in which layers TensorRT selects tactics that make use of the structured sparsity. In some cases, tactics with structured sparsity can be slower than normal tactics and TensorRT will choose normal tactics in these cases. The following output shows an example of TensorRT logs showing information about sparsity:
Code:
[03/23/2021-00:14:05] [I] [TRT] (Sparsity) Layers eligible for sparse math: conv1, conv2, conv3
[03/23/2021-00:14:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: conv2, conv3
With a normal matrix, you can store all of the values in contiguous arrays, which are very fast to access, but with sparse matrices, you have to come up with some compressed representation that stores where each data point is located in the matrix. From the whitepaper:
An example of a matrix that satisfies 2:4 sparsity pattern requirement is shown in Figure 1. With this pattern, only the 2 nonzero values in each group of 4 values need to be stored. Metadata to decode compressed format is stored separately, using 2-bits to encode the position of each nonzero value within the group of 4 values. For example, metadata for the first row of matrix in Figure 1 is [[0, 3], [1, 2]]. Metadata information is needed to fetch corresponding values from the second matrix when performing matrix multiplication. Note that for a group of 4 values having more than 2 zeros, the compressed format will still store 2 values to maintain a consistent format.
[Figure 1 from the whitepaper: a 2:4 structured-sparse matrix and its compressed values-plus-metadata representation]
There is an extra mapping step that you have to do with this metadata to get the sparse data in the right position for the accumulator in the tensor cores. For small matrices, this mapping step and the associated overhead from the sparse representation is significant, so the speedup over dense matrices is not fully realized.
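To make that storage scheme a bit more concrete, here's a rough sketch of my own (not Nvidia's actual in-hardware format) that compresses a 2:4-sparse row into its two kept values per group plus the 2-bit position metadata:
Python:
import numpy as np

def compress_2_4_row(row):
    # Compress a 1D row that satisfies the 2:4 pattern into (values, metadata):
    # for each group of 4, keep 2 values and record their positions (0-3), i.e. the
    # 2-bit-per-value metadata described above. Illustrative only, not the hardware format.
    assert len(row) % 4 == 0
    values, metadata = [], []
    for g in range(0, len(row), 4):
        group = list(row[g:g + 4])
        kept = [i for i in range(4) if group[i] != 0]
        while len(kept) < 2:  # groups with more than 2 zeros still store 2 values, per the whitepaper
            kept.append(next(i for i in range(4) if i not in kept))
        kept = kept[:2]
        values.extend(group[i] for i in kept)
        metadata.append(kept)
    return np.array(values), metadata

# Mirrors the whitepaper's first-row example: metadata [[0, 3], [1, 2]]
print(compress_2_4_row([7, 0, 0, 3, 0, 5, 2, 0]))  # (array([7, 3, 5, 2]), [[0, 3], [1, 2]])

The mapping step is then essentially the hardware using that metadata to fetch the matching values from the dense operand, which is where the overhead on small matrices comes from.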



For a GEMM operation, Nvidia uses MxNxK notation to denote multiplying an MxK matrix by a KxN matrix. As K gets larger, the arithmetic work becomes the main bottleneck, and you asymptotically approach the theoretical 2x speedup. I believe M is the number of output channels, N is the size of the input image times the batch size, and K is proportional to the kernel size and the number of channels in the layer, e.g. for a 3x3 kernel size and 6 input channels, K would be 54 (3 * 3 * 6). Essentially, every row in the MxK matrix is a learned filter that acts on the input data in the columns of the KxN matrix.
Speedups that 2:4 sparse matrix multiplications achieve over dense multiplications depend on several factors, such as arithmetic intensity and GEMM dimensions. Figure 3 shows speedups achieved over a sampling of GEMM dimensions (the cuSPARSELt and cuBLAS libraries were used for the sparse and dense GEMMs, respectively). As larger GEMMs tend to have higher arithmetic intensity, they get closer to the 2× speedup afforded by Sparse Tensor Cores.
[Figure 3 from the whitepaper: sparse-over-dense GEMM speedup as a function of GEMM dimensions]

Since DLSS runs in real time in a couple milliseconds, it is likely on the lower end of this plot. For reference, the Facebook neural supersampling paper had only 128 channels in its deepest layer and used 3x3 kernels, which I believe would correspond to a GEMM-K of 1152 (128 * 3 * 3).
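To make those dimensions concrete, here's a quick sketch of how I'd map a convolution layer onto GEMM dimensions under that interpretation (the output-channel counts and output resolutions below are just placeholders, not DLSS's actual network):
Python:
def conv_gemm_dims(out_channels, in_channels, kernel_h, kernel_w, out_h, out_w, batch=1):
    # Rough im2col-style mapping of a convolution onto an MxNxK GEMM:
    # M = output channels, N = output pixels * batch, K = kernel area * input channels.
    # Illustrative only; real implementations tile and pad these dimensions.
    return out_channels, out_h * out_w * batch, kernel_h * kernel_w * in_channels

# 3x3 kernel over 6 input channels -> K = 54 (output channel count and resolution are placeholders)
print(conv_gemm_dims(16, 6, 3, 3, 540, 960))     # (16, 518400, 54)
# Facebook supersampling paper's deepest layer: 128 channels, 3x3 kernel -> K = 1152
print(conv_gemm_dims(128, 128, 3, 3, 540, 960))  # (128, 518400, 1152)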
 
III. Bailey (IGN), Robinson (VGC), and Dring (GI.biz) stated that Nintendo has no major releases after TotK, hence bowing out of E3.
  • To take these statements as facts, one has to believe that they all independently verified Nintendo's entire release schedule, all independently deemed none of the titles worthy, and all independently concluded that this was the reason for the E3 absence.
  • Not to mention the liberal use of weasel words such as “huge game”, “significant game”, and “light schedule”.
  • Also note in the same IGN article, Bailey reported that Microsoft won’t have a floor presence due to marketing budget cuts.
  • It is easier for me to believe that the source of these E3 apologist reports was ESA/ReedPop covering their asses, than to believe that Nintendo has no games or Microsoft has no money.
I agree that this information coming from the ESA makes the most sense. Nintendo always has something major in production and there's no good reason for the ESA to be in the know about what they're working on anymore. Multiple individuals reaching exactly the same conclusion like that doesn't make much sense unless they're all getting the same basic information from the same source.
 
@LiC correcting me on tensor core numbers sent me down an Nvidia documentation rabbit hole.

The long and the short of it: Orin's DLA is, basically, a battery of additional tensor cores that live outside the GPU. TensorRT sees both the DLA and the GPU, and Nvidia documentation lists the combined TOPS numbers in most places. You gotta really dig to find the separate GPU/DLA TOPS.
 
Oh, looks like vouchers are available in Brazil as well! Now that's surprising, we're always excluded from everything.

Is it likely that Xenoblade 2 will ever get discounted again? It didn't get one in the last sale. Maybe I should use a voucher to buy it, but I could wait for a sale if one could still happen; I still haven't played XC1.

EDIT: I only now realized I posted this here instead of the general thread, sorry!
 
In theory, if Nvidia/Nintendo wanted better FP16 performance (as it seems AMD/Microsoft went for) they could increase that number. Which would also improve DLSS times.
Well, when looking at the Thraktor code, it seems the tensor cores are standard desktop Ampere - EXCEPT for the FP16, where it's double rate.

Do we know if DLSS uses FP16? It would make sense for them to improve the thing that DLSS actually uses.
 
I went into the TensorRT documentation and a different Nvidia whitepaper to look into this more, and here's what I think is going on to the best of my understanding. I'll separate this post into two sections on sparsity workflow and the theoretical 2x speedup.
[...]
Thanks for the incredible explanation! That's a lot of work you just did. I've learned so much today thanks to you.
 
Well, when looking at the Thraktor code, it seems the tensor cores are standard desktop Ampere - EXCEPT for the FP16, where it's double rate.

Do we know if DLSS uses FP16? It would make sense for them to improve the thing that DLSS actually uses.
I don't know whether DLSS uses half-precision instructions, but the tensor core change likely didn't have anything to do with that since Orin is not going to have much (or any) real-world use for DLSS, and the change was evidently not carried forward to Drake, which will.
 
I don't know whether DLSS uses half-precision instructions, but the tensor core change likely didn't have anything to do with that since Orin is not going to have much (or any) real-world use for DLSS, and the change was evidently not carried forward to Drake, which will.
I think I have made some weird calculation error because I've just realized my results would mean that for FP16, Orin is QUADRUPLE rate.

Edit: redid the calculations. Maybe Orin ACTUALLY IS quadruple rate.
 
I think I have made some weird calculation error because I've just realized my results would mean that for FP16, Orin is QUADRUPLE rate.
Nvidia says that Orin AGX has 170 gpu TOPS, 5.2 TFLOPS of compute and 16 SMs. That is 4x what you would expect from Ampere.

I haven't been able to determine what the hell is going on there, but again, because Nvidia will tend to use ambiguous terms where it makes their product look good, it's possible that they're referring to some different subset of instructions than the TOPS number referred to in their desktop GPUs. This is one of several things that made estimating DLSS 2 performance on Drake make me throw my hands in the air.
 
The Tensor core rabbit hole has taken a hold of me, it's too late to save me now.

There's something weird I'm not sure I understand. In Thraktor's code, we have the following:
case Instr::HMMA_16832SP_F32_F16:
return isGA10F ? 2048.0 : 4096.0;
Which would imply that for FP16 with FP32 accumulate and with sparsity, Drake has 2.048 TOPS per SM per GHz, and Orin 4.096.
BUT
If we look at the GA102 whitepaper, we know that the RTX 3080 has 119 TFLOPS for FP16 under those same conditions. Considering it has 68 SMs, that makes 1.75 per SM. Considering that, according to the paper, this is measured at boost clock, and that the boost clock of the 3080 is 1710 MHz, we get the TOPS per SM per GHz, which is... 1.0234. So basically 1.024, and a quarter of 4.096.

Could this mean that, on this specific metric, Orin is actually quadruple and not double rate?
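For anyone who wants to check my arithmetic, here it is in code form (119 TFLOPS, 68 SMs and 1710 MHz are the whitepaper figures quoted above; 2.048 and 4.096 are the per-SM, per-GHz values implied by the source code):
Python:
tflops = 119.0          # RTX 3080, GA102 whitepaper: FP16 w/ FP32 accumulate, with sparsity
sms = 68
boost_ghz = 1.710

per_sm_per_ghz = tflops / sms / boost_ghz
print(per_sm_per_ghz)           # ~1.0234 TOPS per SM per GHz
print(2.048 / per_sm_per_ghz)   # ~2: the GA10F (Drake) constant is double the whitepaper figure
print(4.096 / per_sm_per_ghz)   # ~4: the non-GA10F constant is quadruple it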
 
Nvidia says that Orin AGX has 170 gpu TOPS, 5.2 TFLOPS of compute and 16 SMs. That is 4x what you would expect from Ampere.

I haven't been able to determine what the hell is going on there, but again, because Nvidia will tend to use ambiguous terms where it makes their product look good, it's possible that they're referring to some different subset of instructions than the TOPS number referred to in their desktop GPUs. This is one of several things that made estimating DLSS 2 performance on Drake make me throw my hands in the air.
Considering the numbers are in the code and not some marketing stuff...
The only thing that's made to be viewed by a "larger audience" is the GA102 specsheet, which makes it look kinda bad compared to the other products.

Edit: Or maybe HMMA_16832SP_F32_F16 is not FP16 with FP32 accumulate and with sparsity, although that seems unlikely considering the name is pretty transparent.
 
So I found this picture:
[Image: Nvidia slide with per-architecture tensor core throughput figures, showing A100 at double the FP16 rate]


This indicates that for FP16, A100 is double rate. If we take the previous info at face value, that would mean Drake is standard Ampere except FP16 is double rate, A100 is double rate everything, and Orin is double rate everything and quadruple rate FP16.

conspiracy-theory.gif
 
The Tensor core rabbit hole has taken a hold of me, it's too late to save me now.

There's something weird I'm not sure I understand. In Thraktor's code, we have the following:
case Instr::HMMA_16832SP_F32_F16:
return isGA10F ? 2048.0 : 4096.0;
Which would imply that for FP16 with FP32 accumulate and with sparsity, Drake has 2.048 TOPS per SM per GHz, and Orin 4.096.
BUT
If we look at the GA102 whitepaper, we know that the RTX 3080 has 119 TFLOPS for FP16 under those same conditions. Considering it has 68 SMs, that makes 1.75 per SM. Considering that, according to the paper, this is measured at boost clock, and that the boost clock of the 3080 is 1710 MHz, we get the TOPS per SM per GHz, which is... 1.0234. So basically 1.024, and a quarter of 4.096.

Could this mean that, on this specific metric, Orin is actually quadruple and not double rate?
Hidden content is only available for registered users. Sharing it outside of Famiboards is subject to moderation.
 
* Hidden text: cannot be quoted. *
It seems the numbers line up for GA100, but not for GA102. In what you showed we clearly see the 2x as you said, but on Nvidia's specsheets we see a 4x.
So
1) This completely rules out the possibility of a confusion regarding what HMMA_16832SP_F32_F16 means. The whitepapers use the same terminology between them and show a different ratio than the source code, which also uses the same terminology between its entries.
2) I... don't know. I just don't know what to make of that. Honestly, I think at this point the most likely option is an error in Nvidia's whitepapers.
 
Nvidia says that Orin AGX has 170 gpu TOPS, 5.2 TFLOPS of compute and 16 SMs. That is 4x what you would expect from Ampere.

I haven't been able to determine what the hell is going on there, but again, because Nvidia will tend to use ambiguous terms where it makes their product look good, it's possible that they're referring to some different subset of instructions than the TOPS number referred to in their desktop GPUs. This is one of several things that made estimating DLSS 2 performance on Drake make me throw my hands in the air.
I did more calculations and it would seem, unless I made some error, that this particular case isn't actually anomalous, since the 170 figure is INT8 TOPS: 170 / 16 SMs / 1.3 GHz = 8.17 TOPS per SM per GHz, and for Ampere, 476 INT8 TOPS / 68 SMs / 1.710 GHz = 4.09, exactly half.
Not only does it show the 2x difference expected from the source code, but it also lines up with the source code on the absolute values, which indicates those numbers should be 8.19 and 4.10 respectively, which is the case within margin of error.
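Same sanity check in code form (just re-deriving the figures above, nothing new):
Python:
orin  = 170 / 16 / 1.3      # 170 INT8 sparse TOPS, 16 SMs, ~1.3 GHz -> ~8.17 TOPS per SM per GHz
ga102 = 476 / 68 / 1.710    # RTX 3080: 476 INT8 sparse TOPS, 68 SMs, 1.71 GHz -> ~4.09
print(orin, ga102, orin / ga102)   # ratio ~2x, in line with the ~8.19 and ~4.10 implied by the source code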

It seems the weirdness doesn't affect anything other than FP16. The INT8 numbers seem coherent between the source code and the documentation, if I didn't mess up my maths.


I tried to find the source of that weirdness, see if we can isolate it. When using the slide I showed earlier, we can see that that slide and the GA100 whitepaper are incoherent, as according to the slide we should get 312 TFLOPS in FP16 Sparse, which is the case in neither FP32 accumulate nor FP16 accumulate. This is important because the slide is incoherent with the GA100 whitepaper, and the source code is incoherent with the GA102 whitepaper. The two incoherencies have no shared document.

So basically, there cannot be a single error.

I was intending to use the large number of documents to isolate the source of the inconsistency, turns out there is not one source of inconsistency.

Also, the slide is incoherent with the source code, and can be coherent with the GA102 whitepaper if "FP16 FMA" means FP16 with FP32 accumulation (I have no clue what FMA is). The GA100 whitepaper is coherent with the code.


TL;DR:
1) The issue only concerns FP16; I couldn't find an inconsistency for INT8.
2) There is no single error we can point to. We can't just axe one document and have the rest make sense.
3)
| | Slide | Code | GA102 | GA100 |
| Slide | X | Incoherent | Maybe coherent | Incoherent |
| Code | Incoherent | X | Incoherent | Coherent |
| GA102 | Maybe coherent | Incoherent | X | Not applicable |
| GA100 | Incoherent | Coherent | Not applicable | X |
 
Looks like Drive OS 6.0.6 was released on January 31st (although it doesn't show up under "latest" yet because it's only the NVONLINE version?). Still has not added back the mention of GA10F in the documentation.
 
I think I'm gonna take one for the team and just buy an OLED. Knowing my luck, if the next Switch suddenly gets announced next week, you're welcome.
Or it will backfire, with the next Nintendo report saying that the current OLED model is selling amazingly and there's currently no need for new hardware.
 
From your own article: "which is ten percent higher than the amount of Switch Lite repeat buyers". 30% of Switch Lite buyers already own a Switch.

Considering that the Lite sells significantly less than the OLED, we may conclude that the Lite had done less to increase the appeal and userbase of the Switch than the OLED. Once again, price alone is not the dealbreaker (as long as it's not something ridiculous).
You left a model out. Might wanna go re-read the post you responded to.
I'll preface this by saying that I agree with the main point that Nintendo won't have big profit margins at launch. If Reggie's interview is to be believed, the 3DS cost them roughly $220 at launch, and NCL was fine selling at that price but not at a loss ($199). So there was never really a time when they were too greedy with launch profit margins. And with how much they make from software, the gains are not worth the risk.

With that said, the 3DS problem wasn't the price. All the price drop did was frontload sales from those who were interested but would usually wait for discounts in year 3. But price only takes you so far, as shown by the 2DS, which went as low as $80 bundled with flagship games and yet saw sales keep going down from the year 1 and 2 peak.

If they had made a more attractive product, they could have succeeded at $250 or maybe more. Just like the $400 PS4 did far better than the $300~350 Wii U.

Between, for example, a $350 Drake and a $450 Drake with barely any difference besides the profit margin, sure, the latter is far riskier. But if with that extra $100 they can make a significantly more compelling product (e.g. Switch vs Switch Lite), going cheaper doesn't mean safer.
Seems like a lower price was possible, but Reggie seemed to think only prices in multiples of 50 were acceptable... and then the price got lowered to one that wasn't a multiple of 50 anyway. Womp womp. And I'm pretty sure $30 over cost is one of the highest (if not THE highest) profits Nintendo's ever taken on a hardware product at launch (except maybe Wii, but that has been in debate for about a decade now and I don't wish to rehash said debate).

The inverse of your suggestion is also true: a highly desired product will only get you so far. Since we’re talking prices in the launch year (and not the subsequent price drops in year 3 and beyond), that point is proven by PS3, highly anticipated as it was until it suddenly wasn't.
But outside a small handful of cases, the most popular hardware products in this industry’s history all share one thing in common with one another (other than a really impressive 3rd-party software catalog), and that’s an acceptable mass-market price that does not engage in cynical profiteering or recovering from over-engineering at launch. PS1, PS2, DS, Game Boy, Wii, Switch… even the PS4, with an understandably lower MSRP at launch than its predecessor.

But my point is that, if we have an SoC that is taped out and off to production, the overall design of much of this new device is done and the only things left to change are batteries, storage, controller parts and maybe RAM (though more RAM can only do so much if the SoC is done).

And based on pricing for those 3-4 components, a $100 cost difference on the remaining changeable parts would likely border on the obscene. That assumes a $300-350 price point can cover the cost of the outlined SoC specs (as I do assume, because a lot of the reasoning for a higher cost seems based merely on a vague feeling that better parts are more expensive, despite 6 years passing since the last launch MSRP was determined), and that a reasonable upgrade to those remaining components is already possible thanks to falling costs and achievable efficiencies in part production over the past 6 years, as the data seems to indicate.
You could add to the Joy-Cons, but what is left to add to them that would cause such a ballooning cost that you need another $100 to cover it? And would consumers accept individual controller purchases being higher than they already are?
At an extra $100, you could pump up internal storage to 1TB and still likely have plenty of that extra cash to spare.
Batteries have already naturally increased in density (to the tune of anywhere between 1000-1500mAh) to offer a longer life at a similar price for the same or similar physical dimensions; any more and you're just throwing money at something consumers would already be more than satisfied with, for a device that is roughly in the same TDP envelope.
And again, RAM can be increased after the fact, but there is an obvious crucial limitation that can't be, so there's an upper limit where you get no performance gains without changes to the SoC itself, which... yeah, again, already done.

So really, the only reason to go for a $400-450 new bit of kit is to pocket more money, because the alternatives aren't really that great.
 
@Paul_Subsonic I wonder if, in your slide, they are just counting 'FMA operations' as the instruction count.
A Fused Multiply-Add is typically counted as 2 floating point ops, even though it's a single instruction. But perhaps when it's referred to specifically as an 'FMA operation', it's counted as one 'FMA op'? Not sure.

For the other thing, I guess that Drake retains the full-speed accumulation to 32-bit that desktop Ampere lacks, although I'm not sure how much use that would be for games.
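To put rough numbers on that guess (purely illustrative; the per-SM rate below is made up), a slide counting 'FMA operations' and a whitepaper counting FLOPS would differ by exactly 2x for identical hardware:
Python:
# A fused multiply-add (d = a * b + c) is one instruction but is usually counted as
# 2 floating-point operations, so the same hardware rate can be quoted two ways.
fma_per_sm_per_clock = 512                        # made-up per-SM rate, purely illustrative
flops_per_sm_per_clock = 2 * fma_per_sm_per_clock
print(fma_per_sm_per_clock, flops_per_sm_per_clock)   # 512 "FMA ops" vs 1024 FLOPS for the same hardware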
 
@Paul_Subsonic I wonder if, in your slide, they are just counting 'FMA operations' as the instruction count.
A Fused Multiply-Add is typically counted as 2 floating point ops, even though it's a single instruction. But perhaps when it's referred to specifically as an 'FMA operation', it's counted as one 'FMA op'? Not sure.

For the other thing, I guess that Drake retains the full-speed accumulation to 32-bit that desktop Ampere lacks, although I'm not sure how much use that would be for games.
That is particularly interesting, and may be the KEY (I'm not exaggerating, what you just said may actually be the answer to all our problems).

If we take your info, it opens the door for the slide to be in accordance with the GA102, GA100 and Turing whitepapers at the same time, if we consider the FP16 figures to use FP16 accumulate and not FP32. That accordance, however, comes at the cost of coherency with the source code, but that may not be such a problem, as we'll soon see.
Indeed, if we take what I just said and update the coherency table, we get:
| | Slide | Code | GA102 | GA100 |
| Slide | X | Incoherent | Coherent | Coherent |
| Code | Incoherent | X | Incoherent | Coherent |
| GA102 | Coherent | Incoherent | X | Not applicable |
| GA100 | Coherent | Coherent | Not applicable | X |

This is interesting because this new table lets us point to a single source of error: the code. Remove the code, and every other document makes sense.

Furthermore, you talk about full-speed accumulation to 32 bits being present on Orin and A100 but not on desktop Ampere; the source code, however, goes against that.
Indeed, the ratio between FP16 with FP16 accumulate (no sparsity) and FP16 with FP32 accumulate (with sparsity) is the same 2x for GA100 and GA102, even though it should be 4x for GA102, because there's a 2x difference from sparsity and another 2x difference from the lack of full-speed accumulation.

I'm a complete newbie to these subjects so I don't know if the lack of full speed accumulation on desktop is some well known and established thing, but if it is then we have a definitive answer.

TL;DR :
The source code is very sus, and if we have sufficient proof that Desktop Ampere lacks full speed accumulation to 32 bits, then we can rule out the source code as the impostor.
 
The inverse of your suggestion is also true: a highly desired product will only get you so far. Since we’re talking prices in the launch year (and not the subsequent price drops in year 3 and beyond), that point is proven by PS3, highly anticipated as it was until it suddenly wasn't.
Sure, a desired product also needs to be priced at something people are willing to pay. Going 50% above the most expensive console at the time, and twice the launch price of the most expensive console of the previous generation, was a huge risk. It's not 2006 and we're not talking about $600 either. We have seen enough to know that $400~500 is an acceptable mass-market price, at least during their first 2~3 years.

But my point is that, if we have an SoC that is taped out and off to production, the overall design of much of this new device is done and the only things left to change are batteries, storage, controller parts and maybe RAM (though more RAM can only do so much if the SoC is done).
But what if they're aiming for higher specs than the equivalent of a 2023/24 Switch? Suddenly the SoC, RAM, cooling and battery might all be more expensive this time, even after accounting for inflation.

Not saying they will, but they definitely could.

So really, the only reason to go for a $400-450 new bit of kit is to pocket more money, because the alternatives aren't really that great.
Just because you can't think of great alternatives that add value with $50~100, it doesn't mean there aren't any. Especially for a company that is always looking for new experiences and is willing to take risks there.

But even without something new and unexpected, there are things Nintendo could consider worth it. E.g.: the OLED model's improvements at launch, faster storage, higher specs for better AAA support, Labo-VR-but-better, cameras for AR, better motion controls, etc.
 
I'm a complete newbie to these subjects so I don't know if the lack of full speed accumulation on desktop is some well known and established thing, but if it is then we have a definitive answer.

TL;DR :
The source code is very sus, and if we have sufficient proof that Desktop Ampere lacks full speed accumulation to 32 bits, then we can rule out the source code as the impostor.

All I know is the numbers in the whitepapers; those show that GA102 does FP32 accumulate slower than FP16 accumulate.
The only code I've seen posted was for defining the performance of Tegra 234/239; are you talking about something different?
 
All I know is the numbers in the whitepapers; those show that GA102 does FP32 accumulate slower than FP16 accumulate.
The only code I've seen posted was for defining the performance of Tegra 234/239; are you talking about something different?
I am talking about the @Thraktor code, which was defining the performance of GA10F and GA10B, so I think we're talking about the same thing.
@LiC reposted earlier today the info in the code regarding GA10F, GA10B and also showed GA10X and GA100 info. In that comment, the term used was "source code" so I'm using that terminology too. I think it's from the leaked source code, although I may be wrong. I wasn't on the forum at the time that code was first posted.

Edit: I found exactly why there is an error. I don't have the time to make a post about it as I want to fact-check and provide sufficient proof, but it would seem it's not an "error" per se; instead, the performance in the source code is that of the Quadro version of the cards, not the GeForce cards.
 Post about that coming when I can.
 
That is particularly interesting, and may be the KEY (I'm not exaggerating, what you just said may actually be the answer to all our problems).
[...]
TL;DR:
The source code is very sus, and if we have sufficient proof that Desktop Ampere lacks full speed accumulation to 32 bits, then we can rule out the source code as the impostor.
I am talking about the @Thraktor code, which was defining the performance of GA10F and GA10B, so I think we're talking about the same thing.
@LiC reposted earlier today the info in the code regarding GA10F, GA10B and also showed GA10X and GA100 info. In that comment, the term used was "source code" so I'm using that terminology too. I think it's from the leaked source code, although I may be wrong. I wasn't on the forum at the time that code was first posted.

Edit: I found exactly why there is an error. I don't have the time to make a post about it as I want to fact-check and provide sufficient proof, but it would seem it's not an "error" per se; instead, the performance in the source code is that of the Quadro version of the cards, not the GeForce cards.
 Post about that coming when I can.
I think it's much more likely that the whitepapers aren't showing raw core efficiency numbers, which is what the source code is showing, rather than the source code being "wrong" in any way. The source code is about GPUs only, not real cards, so that could play into it, but that's not a different type of card -- if anything, it's just apples to oranges comparing an abstract GPU implementation with a physical product.

I also feel like we're way out in the weeds past where any of this is potentially useful to us.
 
Edit: I found exactly why there is an error. I don't have the time to make a post about it as I want to fact-check and provide sufficient proof, but it would seem it's not an "error" per se; instead, the performance in the source code is that of the Quadro version of the cards, not the GeForce cards.
 Post about that coming when I can.
I see what you mean now, and I think you may be right. Table 2 of the GA102 whitepaper shows the gaming cards and Table 3 shows the professional cards (A6000 actually - I believe Ampere cards retired the Quadro name). I can see there that the gaming cards don’t have the full speed tensor F16 to F32 accumulation, while the A6000 does, like you are saying. I also think that Stinky Horse is correct about the factor of 2 coming from counting FMA as two operations. With those two considerations in mind, it seems like all the numbers match up between the source code and the white papers.
 
I see what you mean now, and I think you may be right. Table 2 of the GA102 whitepaper shows the gaming cards and Table 3 shows the professional cards (A6000 actually - I believe Ampere cards retired the Quadro name). I can see there that the gaming cards don’t have the full speed tensor F16 to F32 accumulation, while the A6000 does, like you are saying. I also think that Stinky Horse is correct about the factor of 2 coming from counting FMA as two operations. With those two considerations in mind, it seems like all the numbers match up between the source code and the white papers.
I kind of fell off, but where does Drake fit in here?
 
Nate just dropped his 2023 prediction video for Nintendo


Pretty much as Andy and IGN have said, then: a light second half of 2023 followed by a strong 2024 with the next device, I believe.

Praying for 5nm and +12Gb and being ready for "disappointment" :)

PS: The forecast still confuses me, but at this point I'm jumping off #team2023 next week
 
So......Drake was scrapped and Nintendo will now be using binned Orin Chips and the next Switch will be the size of Steam Deck. It all makes perfect sense. No more need for speculation.
That's not quite true, since according to the illegal Nvidia leaks, Drake's (T239's) quite different from Orin (T234).
1) Orin uses the Cortex-A78AE for the CPU. And Nvidia's Linux commits mention that T239 has eight CPU cores in a cluster. That means that Drake probably uses the Cortex-A78C for the CPU since the Cortex-A78C is the only CPU in the Cortex-A78 family that supports up to 8 CPU cores per cluster. (The Cortex-A78AE, like the Cortex-A78, only supports up to 4 CPU cores per cluster.)
2) Drake's GPU is probably closer to consumer Ampere GPUs architecturally, considering the Tensor cores on Drake's GPU perform very similarly, if not identically, to the Tensor cores on consumer Ampere GPUs, whereas the Tensor cores on Orin's GPU have 2x the performance of the Tensor cores on Drake's GPU and consumer Ampere GPUs in terms of Half-Precision Matrix Multiply and Accumulate (HMMA), Integer Matrix Multiply and Accumulate (IMMA), and Binary Matrix Multiply and Accumulate (BMMA) instructions.
3) All of the automotive hardware features would still be present, just disabled, if Nintendo were going to use binned Orin SoCs. But that's not the case: when T239's header file is compared with T23x's (T234's) header file, almost all of the automotive hardware features present on Orin are absent on Drake (e.g. NvMedia Camera Serial Interface (NVCSI), Image Signal Processor (ISP), Video Ingest (VI), Nvidia's JPEG decoder/encoder (NVJPG), etc.). (Drake interestingly does inherit Orin's Optical Flow Accelerator (OFA) and Video Imaging Compositor (VIC). I do wonder how Orin's (and Drake's) OFA compares to Ada Lovelace's OFA.) And Drake has one hardware feature that Orin doesn't have, the File Decompression Engine (FDE), which according to Nvidia's senior design verification engineer is for video games.
 
Pretty much as Andy and IGN have said, then: a light second half of 2023 followed by a strong 2024 with the next device, I believe.

Praying for 5nm and +12Gb and being ready for "disappointment" :)

PS: The forecast still confuses me, but at this point I'm jumping off #team2023 next week
Why disappointment?
 
This new hardware will be really good. The power upgrade won't be the only upgrade; Ninty loves to add gimmicks to their hardware.
 