I've been seeing a lot of posts about the cost of DLSS lately, but the discussion is mostly based on what's been shown empirically. I wanted to write a short explainer post that covers what a convolutional autoencoder is and shows how to calculate the number of operations you would have to run.
Part I: operations
The basic idea of a convolutional autoencoder is that you have two steps: convolution and convolution transpose. Both of these steps simply slide a series of filters across an image; the difference is that one of them downsamples and the other one upsamples. The amount of downsampling/upsampling that these filters cause is controlled by the stride size, which is the number of pixels the filter moves at each step as it slides across the image. For example, with a stride size of 1, a convolution filter will move 1 pixel at a time; with a stride size of 2, it will move 2 pixels at a time.
In the convolution step, you progressively compress and downsample the image while increasing the number of learned features in each layer. Essentially, this forces the neural network to select which features are important to preserve for reconstruction.
Convolution with a stride size of 2 downsamples a 1920 x 1080 image to 960 x 540.
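Here's that downsampling arithmetic as a tiny sketch, in case you want to plug in other strides or resolutions. The 3 x 3 filter size and padding of 1 are assumptions on my part, chosen so the 2x downsampling works out exactly:

```python
def conv_output_size(size, kernel=3, stride=2, pad=1):
    """Spatial output size of a strided convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

print(conv_output_size(1920), conv_output_size(1080))                      # 960 540
print(conv_output_size(1920, stride=1), conv_output_size(1080, stride=1))  # 1920 1080
```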
In the convolution transpose step (sometimes erroneously called "deconvolution"), you progressively reconstruct and upsample the output image using all of the learned features.
Convolution transpose with a stride size of 2 upsamples a 960 x 540 image to 1920 x 1080.
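And the matching formula for the upsampling direction (same assumed filter size and padding; the output_padding term is just what's needed to land exactly on 2x the input size):

```python
def conv_transpose_output_size(size, kernel=3, stride=2, pad=1, out_pad=1):
    """Spatial output size of a strided convolution transpose."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad

print(conv_transpose_output_size(960), conv_transpose_output_size(540))  # 1920 1080
```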
The number of filters used in one layer corresponds to the number of learned features/channels in the next layer. We know that the output has exactly 3 channels since it's an RGB image, but we can pick any number of channels that we want for the hidden layers. A good rule of thumb with a stride size of 2 is to double the number of channels in each convolution layer and roughly halve the number of channels in each convolution transpose layer. However, this is not a strict rule (as I'll talk about in a sec).
Part II: a simple network
The simplest kind of neural network you can make has three layers: the input layer, the output layer, and a single "hidden" layer. The simplest convolutional autoencoder has a single convolution step (from the input layer to the hidden layer) and a single convolution transpose step (from the hidden layer to the output layer). Here's a diagram of our simple network:
You can see all the important features in the diagram. For the input, I am assuming 6 channels for simplicity: 3 channels from the current frame projected to the output resolution, and 3 channels from the previous composite frame. (This may not be what is used in practice.) In the convolution step, I use a stride size of 2 and double the number of channels, which gives us a 960 x 540 x 12 tensor. For the convolution transpose step, I use a stride size of 2 and 3 output channels, corresponding to a 1920 x 1080 image with 3 RGB channels.
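If it helps to see the network as code, here's a minimal PyTorch sketch of the same thing. The 6 -> 12 -> 3 channel counts, 3 x 3 filters, and stride of 2 come from the description above; the padding/output_padding values and the ReLU in the middle are my own choices to make the sketch runnable, not something I know about DLSS:

```python
import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolution: 1920 x 1080 x 6  ->  960 x 540 x 12
        self.encode = nn.Conv2d(6, 12, kernel_size=3, stride=2, padding=1)
        # Convolution transpose: 960 x 540 x 12  ->  1920 x 1080 x 3
        self.decode = nn.ConvTranspose2d(12, 3, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)

    def forward(self, x):
        # A ReLU between the two steps is a typical choice, not a known DLSS detail.
        return self.decode(torch.relu(self.encode(x)))

x = torch.randn(1, 6, 1080, 1920)    # batch, channels, height, width
print(SimpleAutoencoder()(x).shape)  # torch.Size([1, 3, 1080, 1920])
```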
When you run one of these networks on the tensor cores, the convolutions are implicitly cast to matrix multiplications. That's what I'm showing in the bottom half of the diagram. For example, in the convolution step, the 1920 x 1080 x 6 tensor is cast to a (960 * 540) x (3 * 3 * 6) matrix, where the 960 and 540 come from using a stride size of 2, 3 is the filter size, and 6 is the number of input channels. We have 12 filters, each of which is a 3 x 3 x 6 tensor; all of these filters together can be rewritten as a 54 x 12 matrix. The computational cost of this step is the cost of multiplying a 518400 x 54 matrix by a 54 x 12 matrix, which is roughly 336 million operations. The end result is downsampling, with an increase in the number of channels.
In the convolution transpose step, the 960 x 540 x 12 tensor is cast to a (1920 * 1080) x (3 * 3 * 12) matrix. The 1920 and 1080 come from using a stride size of 2 with the convolution transpose operation, 3 is the filter size, and 12 is the number of channels in the hidden layer. We have 3 filters this time (corresponding to the 3 output channels), each of which is a 3 x 3 x 12 tensor. We can rewrite all of these filters as a 108 x 3 matrix. The computational cost is that of multiplying a 2073600 x 108 matrix by a 108 x 3 matrix, which is roughly 672 million operations. The end result is upsampling, with a decrease in the number of channels.
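Here's the arithmetic for both steps as a quick script, in case you want to check it or plug in your own filter sizes and channel counts (I'm counting one multiply-accumulate as one operation, which is how the figures above were computed):

```python
def matmul_macs(m, k, n):
    """Multiply-accumulates needed to multiply an (m x k) matrix by a (k x n) matrix."""
    return m * k * n

conv_macs = matmul_macs(960 * 540, 3 * 3 * 6, 12)      # convolution step
deconv_macs = matmul_macs(1920 * 1080, 3 * 3 * 12, 3)  # convolution transpose step

print(f"{conv_macs:,}")    # 335,923,200  (~336 million)
print(f"{deconv_macs:,}")  # 671,846,400  (~672 million)
```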
Now, just to talk about sparsity for a second: sparsity arises when the learned filters include values that are at or near zero. This is not always guaranteed, but it tends to happen in the hidden layers when networks become very deep, since the different filters should ideally be learning different features. The convolution transpose operation also shows a lot of sparsity when written as a matrix operation (a large fraction of the values in that 2073600 x 108 matrix are zero). I don't know the low-level hardware details of how the tensor cores handle this, but the basic idea behind sparsity acceleration is that you skip these multiplications by zero, since they contribute little or nothing to the output.
Part III: takeaways and speculation
Understanding how these operations work gives us important insight into what we see empirically from DLSS. There are several key takeaways:
Filtering is independent of resolution. Once you cast a 3840 x 2160 x 6 tensor to a matrix, you can multiply it by the exact same 54 x 12 filter matrix that you used for the 1920 x 1080 x 6 tensor. This is one of the main features of convolutional neural networks.
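To make that concrete, here's a small sketch: the same 6-in/12-out, 3 x 3, stride-2 convolution (i.e., the same 54 x 12 filter matrix) applied to a 1080p-sized input and a 4K-sized input. The weights are random; the point is only that the filters don't change with resolution, just the size of the output:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(6, 12, kernel_size=3, stride=2, padding=1)

print(conv(torch.randn(1, 6, 1080, 1920)).shape)  # torch.Size([1, 12, 540, 960])
print(conv(torch.randn(1, 6, 2160, 3840)).shape)  # torch.Size([1, 12, 1080, 1920])
```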
Projection to output resolution and warping with motion vectors is likely not handled with AI. If we warped with motion vectors at input resolution, we would lose accuracy (this is shown empirically in the Facebook neural supersampling paper). One solution is to manually project the samples in the current frame to the output resolution, then warp those samples with the motion vectors. This projected image at the output resolution would be the input passed to the neural network. This effectively means that the cost of DLSS should** (more on this later) increase slightly with input resolution, because there would be more samples to handle before running the neural network.
Another solution would be to include motion vectors as separate channels passed to the neural network directly. This method is wasteful because it significantly increases the amount of data being passed to the neural network, and it's ineffective because, with 3x3 filters, it is very difficult for the network to move a sample more than a few pixels (roughly the 3x3 filter footprint) per downsampling step.
We can increase the quality of the output by adding more hidden layers. For example, we could create a network with 3 hidden layers. This network would downsample twice, i.e. from 1920 x 1080 to 960 x 540 to 480 x 270, and upsample twice to reconstruct the output. In each successive downsampling step, we would learn more features, which could then be used for reconstruction. We are only limited by two things. First, adding more hidden layers increases computational cost. Second, training becomes more difficult with more layers. With modern algorithms, it is possible to train neural networks that are quite deep, so the first issue is the more important one.
We can increase the quality of the output by adding more channels to the hidden layer. The number 12 that I picked for the hidden layer was arbitrary. We could set the number to 18, 24, or 36 and compare the results. Doubling the number of channels in a hidden layer also doubles the computational cost of that layer.
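To put rough numbers on those last two points, here's the same per-layer counting as in Part II applied to a deeper variant and a wider variant. The channel counts are illustrative (just following the doubling rule of thumb), not anything published about DLSS:

```python
def layer_macs(out_h, out_w, in_ch, out_ch, k=3):
    """MACs for one conv / conv transpose layer: output pixels x (k * k * in_ch) x out_ch."""
    return out_h * out_w * (k * k * in_ch) * out_ch

# The single-hidden-layer network above: 6 -> 12 -> 3 channels.
base = layer_macs(540, 960, 6, 12) + layer_macs(1080, 1920, 12, 3)

# A deeper variant that downsamples twice (6 -> 12 -> 24, then back up to 3).
deep = (layer_macs(540, 960, 6, 12) + layer_macs(270, 480, 12, 24)
        + layer_macs(540, 960, 24, 12) + layer_macs(1080, 1920, 12, 3))

# The original network with the hidden layer widened from 12 to 24 channels.
wide = layer_macs(540, 960, 6, 24) + layer_macs(1080, 1920, 24, 3)

print(f"{base:,}")  # 1,007,769,600
print(f"{deep:,}")  # 2,687,385,600
print(f"{wide:,}")  # 2,015,539,200  (doubling the hidden channels doubles the cost)
```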
The different modes of DLSS may use different architectures. I have seen people in this thread say that, for example, Ultra Performance mode is more expensive than Performance mode because it is 9x upscaling instead of 4x upscaling. However, as we established above, filtering is independent of resolution. We could use the exact same architecture regardless of what scaling factor we are using. In theory, the cost of the neural network (after the initial projection to the output resolution and warping with motion vectors) should be identical as long as the architectures are the same.
However, it is possible that DLSS uses different architectures (i.e., an architecture with more hidden layers or more channels in the hidden layer) for different modes. The deeper architecture necessary for Ultra Performance mode may be extraneous for Quality mode. This may explain** why the cost of DLSS is roughly constant; in Quality mode, for example, the slight increase in the cost of having to warp more samples at a higher input resolution is offset by the decrease in cost afforded by using a shallower network.