Bill Dally's talk transcript at ScaledML 2018, with video and slides, speaking about Scaling ML with NVIDIA GPUs.

Bill Dally, Chief Scientist at NVIDIA and Professor at Stanford University

** **

Summary of the talk: NVIDIA GPUs for Scaling Machine Learning

** **

The current algorithms and the models that we use today were around in the 1980s. Why did it take from the 1980s, when all of these algorithms were developed, until about 2012 for things to really start taking off? The answer is: It really takes three things to make this work. You need the algorithms and the models, shown here below. You need a lot of data. We had that around 2005, when we started finding labeled datasets that were large enough to make this stuff really work. But it wasn't until AlexNet, in 2012, that we got the missing ingredient: hardware fast enough to actually train these models in a reasonable amount of time, where "reasonable" is two weeks.

** **

As the hardware gets more capable, people use bigger models. In fact, today, hardware is pacing the revolution in deep learning. The progress we're making is limited by the speed of the available hardware.

** **

There are a couple different dimensions to this. One is: As we tackle more ambitious problems, the demands of training go up. It takes 7 ExaFLOPS to train Microsoft ResNet on ImageNet. If I do DeepSpeech 2, it takes about three times that: 20 ExaFLOPS. To train the Google Neural Machine Translation program is 100 ExaFLOPS. The numbers keep going up. As people tackle bigger problems, they need more compute performance.

** **

Now, as they get more compute performance, there are a lot of different dimensions of scaling. Probably the most interesting dimension of scaling is across applications. Every day, I see new applications of machine learning that are enabled by having hardware that's fast enough to do them.

** **

On the Internet and the cloud, almost every interaction you have with the web has got some machine learning in it somewhere. If you upload images, those things pass through piles of image networks that filter for objectionable content, copyrighted content. They find the faces in the image. They figure out whose faces they are. They figure out what ads to serve you based on what's in the image.

** **

In medicine and biology, we're achieving better-than-human performance at many image-based tasks: whether it be a photo of a lesion on the skin and diagnosing it as potentially cancerous or benign, reading x-rays, reading other medical images and making diagnoses. It's also the case where people are taking piles of medical records and using that to assist physicians in making diagnoses by giving them the benefit of big data.

** **

All sorts of entertainment is using this. At Nvidia, with graphics as one of our main businesses, we're now applying deep learning in many ways to graphics. Content creation: We're able to eliminate piles of artists’ time from producing video games and movies by being able to do things, like create animated characters, make their faces move in a realistic way, using deep networks to train that, and using audio to create the facial animation.

** **

We're able to basically make ray tracing be usable for real-time graphics for the first time. If we want to render at 60 frames per second, we can cast a few rays per pixel. If we do that, we produce an image that looks really grainy. It looks noisy. We can then take that noisy, grainy image, feed it through a deep network, and it looks like a motion-picture-quality image coming out the other side. So the next generation of video games will start to use ray tracing for the first time.

** **

Those are just typical examples of what Nvidia is currently working on.

** **

What Dimensions Can We Actually Scale?

** **

Moore's law is dead. We're no longer getting anything out of process technology. We used to go from one generation of a chip design process to another, and we would get about 3x in a figure of merit, which is kind of how many operations you can perform per unit of energy. Now we get maybe 1.1x. So, we get a 10% performance improvement, rather than three times.

** **

So process technology is no longer a dimension we can scale.

** **

Another dimension you can scale: run a bigger model, and if the GPUs aren't getting better fast enough, you just use more GPUs. One thing you'll notice, in addition to more GPUs, is that it requires increasing the batch size. But there's been a lot of research recently showing that if you scale the learning rate along with the batch size, you can train with really enormous batch sizes without affecting accuracy.
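As a concrete illustration, here is a minimal sketch of the linear scaling rule from the large-batch training literature; the function name and the numbers are illustrative, not from the talk.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: grow the learning rate in proportion
    to the batch size (a common large-batch training heuristic)."""
    return base_lr * batch / base_batch

# Example: a recipe tuned at batch size 256 with learning rate 0.1,
# scaled up to a batch size of 8192.
print(scaled_lr(0.1, 256, 8192))  # 3.2
```

In practice such recipes also use a warmup period for the first few epochs, since a very large learning rate applied cold can diverge.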

** **

We're increasing the dataset size by a factor of 2^8, so 256:1 over the size of this. We're getting this straight-line (on a log scale) improvement in accuracy. We've not yet hit the point of irreducible error. As we add more data, we're still getting more accurate models.

What this means is that, if I want to get more accuracy and increase the data by a factor of 256, then I'll increase the model by a factor of 64 to have the capacity needed to learn. The net increase in overall work to be done winds up being 2^14, or about 16K. So, I need 16,000 times as much compute performance to get that bit of accuracy, which is represented by this increase in data.
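The arithmetic above can be written out directly:

```python
# Back-of-envelope scaling arithmetic from the talk:
data_factor = 2 ** 8    # 256x more training data
model_factor = 2 ** 6   # 64x more model capacity to absorb it

# Total work scales as (samples processed) x (cost per sample),
# so the two factors multiply.
work_factor = data_factor * model_factor
print(work_factor)  # 16384, i.e. 2^14, roughly 16K
```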

** **

First of all, Volta is 120 half-precision teraflops as opposed to 45 for the TPU, so it's almost three times the performance of the TPU at what the TPU was designed to do.

** **

So, how is it we can build a programmable engine and have it be a better special purpose engine than the thing that's built to be only special purpose? The way to do that is to realize that there's a number of ways that we can specialize an engine. You can specialize an engine by saying, "Here's the thing I wanna do. I'm gonna hardwire that into one piece of hardware, and that's all it's ever gonna do." Perhaps, if you do a really good job of that and the algorithm doesn't change out from under you, that would be the speed of light, the best that you could possibly do on that.

** **

But then the question is: What if I back off a little bit from that and say, "What is the inner loop of this algorithm? Let me hardwire the inner loop. We’ll leave the rest of this programmable. What is that gonna cost me?" The inner loop of most of the deep-learning algorithms, whether they're recurrent networks, convolutional networks, or multilayer perceptrons, is matrix and matrix multiply. That's at the core of everything we do.

** **

So, what we decided to do was to build a special instruction called HMMA. The marketing people call it Tensor Core (that's why they get paid a lot of money). What it does is it takes two 4 x 4 FP16 matrices, multiplies them together, and sums the result into an FP32 matrix. It's 128 floating-point operations: 64 multiplies at FP16, and 64 adds at FP32. That's enough work that the cost of fetching the instruction, decoding it, and fetching the operands is in the noise. Here's the data on that.
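A rough functional model of what one HMMA operation computes, using NumPy purely to illustrate the numerics (not the hardware data path or register layout):

```python
import numpy as np

def hmma(a, b, c):
    """Functional sketch of the Tensor Core (HMMA) operation:
    multiply two 4x4 FP16 matrices and accumulate into an FP32 matrix.
    This models the numerics only, not the hardware implementation."""
    assert a.shape == b.shape == c.shape == (4, 4)
    # Products are formed from FP16 inputs; accumulation happens in FP32.
    prod = a.astype(np.float32) @ b.astype(np.float32)
    return c + prod

a = np.ones((4, 4), dtype=np.float16)
b = np.ones((4, 4), dtype=np.float16)
c = np.zeros((4, 4), dtype=np.float32)
d = hmma(a, b, c)
print(d[0, 0])  # 4.0: each output element is a 4-term dot product
```

Accumulating in FP32 is what keeps long chains of FP16 products from losing precision as layer sizes grow.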

** **

If you go back to sort of our original Kepler architecture, first of all, we didn't have half precision, so everything was 32-bit. But if we had had half precision, a half-precision floating-point multiply-add would take about 1.5 picojoules. Fetching and decoding the instruction is about 30 picojoules. So, the overhead there is about 20x. If you're trying to build things up out of just multiply/add, the overhead of programmability would be pretty severe. Of course, the GPU does better than this because we amortize that one instruction fetch over a warp of 32 threads, but just to keep this simple…

** **

On Pascal, we introduced a four-element dot product instruction, a half-precision dot product that summed to 32 bits. So, we at least got eight arithmetic operations. We're up to 6 picojoules of work, and now our overhead is only 5x.

** **

On Volta, with the half-precision matrix multiply-accumulate instruction, that instruction is 110 picojoules of math. In fact, it's actually less energy in the math than if you did those operations separately, because we were able to fuse together certain parts of the instruction. Some of the rounding doesn't have to be done multiple times, and the like. So, it's an extremely efficient way of doing that basic operation.

** **

Even at that level, the cost of fetching and decoding the instruction—all of the overhead—is 27%. So, if you're limited by matrix multiply—and, by the way, that's what a TPU does; it does matrix multiply—you're not gonna do better than 27% over a GPU, in terms of energy consumed. So, that is all you have to gain in performance per watt, and what you give up for it is all the ability to write programmable layers. You get a lot of future-proofing for 27%. So, the solution for data centers, for both inference and training, is basically to use GPUs in the data centers.

** **

Pursuing Acceleration

** **

If you are gonna build an accelerator, you should at least do it right. Firstly, it should have a native Winograd transform. If you're doing convolutional neural networks, and you're doing a 3 x 3 convolution, normally that would require nine multiplies: you'd have to multiply every activation in an input channel by the nine elements of that kernel to do the convolution. The Winograd transform produces the same outputs with substantially fewer multiplies, roughly a 2.25x reduction for 3 x 3 kernels.
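To illustrate the idea, here is the classic one-dimensional F(2,3) Winograd kernel: computing two outputs of a 3-tap filter directly costs six multiplies, but only four with the transform. This is the textbook algorithm, not NVIDIA's implementation.

```python
def conv_direct(d, g):
    # Two outputs of a 3-tap sliding-window filter: 6 multiplies.
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]

def conv_winograd_f23(d, g):
    # Winograd F(2,3): the same two outputs with only 4 multiplies.
    # The filter transform below is amortized across the whole image
    # in practice, since the kernel g is reused everywhere.
    G0 = g[0]
    G1 = (g[0] + g[1] + g[2]) / 2
    G2 = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * G0
    m2 = (d[1] + d[2]) * G1
    m3 = (d[2] - d[1]) * G2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -1.0]
print(conv_direct(d, g), conv_winograd_f23(d, g))  # identical results
```

The 2-D version used for 3 x 3 convolutions, F(2x2, 3x3), applies the same trick in both dimensions: 16 multiplies instead of 36 for a 2 x 2 tile of outputs, which is where the 2.25x figure comes from.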

** **

The other reason why this is better than what Google has to do with inference is it supports sparsity. We sort of did the original work on sparsity, and therefore we realized early on that neural networks are inherently sparse. Typically, for recurrent networks and multilayer perceptrons, you only need about 10% of the weights, so 90% of the weights aren't needed. Typically, especially if you're using ReLU non-linear functions, you only need about 30% of the activations. The other ones are all zero. So, you're typically only doing 3% of the multiplies. The other 97% of the multiplies, one of the two operands (or both), is zero.
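The 3% figure falls straight out of those two densities:

```python
# If only ~10% of weights and ~30% of activations are nonzero,
# the fraction of multiplies where BOTH operands are nonzero is
# the product of the two densities (treating them as independent):
weight_density = 0.10
activation_density = 0.30
useful_multiplies = weight_density * activation_density
print(useful_multiplies)  # roughly 0.03: only ~3% of multiplies matter
```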

** **

We put support in here to keep all of the activations, and particularly the weights, in sparse form and decompress them on the fly when we're using them. That makes the memory that much more effective and reduces the memory bandwidth. And if either of the inputs of a multiply is zero, we don't actually send the zero, because toggling all the lines to zero, and then toggling them all back for the next number, burns a lot of energy.

** **

We hold the lines in their previous state, and we send a separate wire that says, "I am a zero." The multiplier basically does nothing and sends an output line saying, "I'm a zero," and also freezes all of its wires. So, nothing toggles. No power is dissipated, and it's a huge advantage in energy savings.

** **

So, if you compare the efficiency, in terms of performance per watt (which is really what matters in these applications), the CPUs are off scale to the right somewhere. FPGAs are just bad ASICs, right? If you look at an FPGA and you don't use the hardwired parts of it, which are really ASICs, but you use the LUTs, the lookup tables that you use to build programmable stuff are almost exactly 100 times the area and 100 times the energy per operation of actually building the real ASIC.

** **

So, if what you need is an ASIC and instead you use an FPGA, you're down by a factor of 100. If you did Pascal, where all we had was the dot product instruction, you have that sort of 5x overhead of doing all the instructions, fetching stuff around the dot products.

** **

But when you're down to Volta, you're only 27% worse than doing a hardwired engine, and that's actually pretty good. If you do the hardwired engine, you do a little bit better, and then that's pretty much as good as you can get. We constantly look for things, like the Winograd transform, like the exploiting sparsity and the like, to do better.

** **

So, in fact, that's what I'm gonna talk about for the rest of this talk, what I spend most of my time on. I get up in the morning and I'm actually kind of excited about what I do because it's fun. And I ask myself this question: How do we continue to scale? How am I gonna make the next one three times faster than this one, given that I'm gonna get 10% from the process?

** **

There are a couple of ways of doing this. The easy one, like I said before, is to train with more GPUs. That actually doesn't require any work on my part. This is a slide you've seen before. Now, it turns out that if you do wanna train with more GPUs—and these people are basically using up to 1,024 GPUs—what typically happens is you start training with a couple of GPUs, and it takes a certain amount of time to do the computation. Then you have to take all of the gradients: I compute my weight gradients, and you compute your weight gradients for the same weights. You have to send 'em all up to a parameter server, or you can distribute that parameter server across the individual workers if you want. It has to sum those gradients and then send them back down.

** **

What happens if I double the number of compute nodes that I'm using here? My computation time gets cut in half, but I still have to send—from every compute node—the same number of weight gradients. So my communication time stays, at best, constant. It may actually go up because there's more aggregate communication going on. At some point this communication time dominates. So along with Song Han and a number of the graduate students here at Stanford, we looked at this last summer, and what we realized is that just like you don't need most weights—you can throw away 90% of the weights—it turns out you can avoid communicating 99.9% of the weight gradients during this phase.
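A toy model of why that scaling saturates; all names and numbers here are illustrative, not measurements.

```python
def step_time(nodes, work, bandwidth_per_node, grad_bytes):
    """Toy model of one data-parallel training step.
    Computation splits across the nodes, but each node still pushes
    a full copy of the weight gradients, so communication does not
    shrink as nodes are added."""
    compute = work / nodes
    communicate = grad_bytes / bandwidth_per_node  # constant per node
    return compute + communicate

# Doubling the node count halves compute but leaves communication
# flat, so the speedup flattens once communication dominates.
for n in (1, 2, 4, 8, 16):
    print(n, step_time(n, work=16.0, bandwidth_per_node=1.0, grad_bytes=2.0))
```

In this toy run the step time goes 18, 10, 6, 4, 3: each doubling of nodes buys less and less, because the constant communication term takes over.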

** **

So we actually send only 0.1% of these weight gradients. The way it works is as follows. I compute all of the weight gradients on a given node. I figure out what the top 0.1% of those are, and I send that top 0.1% to the parameter server. I don't throw away the other ones. I do several very important things to avoid losing accuracy and to avoid extending my training time. The first thing is that I accumulate locally: I keep these gradients, and I keep summing up my own gradient rather than throwing it away each time. I add the gradient from the next batch to the gradient accumulated since the last time I communicated.

** **

The next thing I do is get the momentum right. It turns out that when I add a gradient to the current weight—view this as sort of the current state of the weight, and this as my current gradient—I don't add the blue to the weight. I add the blue plus the momentum term to the weight. The momentum term makes my training more stable: once I start moving a weight in a given direction, I keep moving it in that direction. What happens if I simply don't communicate the weight gradient and wait to communicate it later? I lose that momentum term. So what we do is accumulate those momentum terms, and we have a correction factor which gives you the exact movement of the weight you would have had, had you moved it each time with the momentum at that point in time.
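A hedged sketch of the whole scheme (top-0.1% selection, local accumulation, and a momentum correction), in the spirit of the Deep Gradient Compression work; the function names and details here are simplified for illustration, not the actual implementation.

```python
import numpy as np

def sparse_gradient_step(grad, residual, velocity, momentum=0.9, k_frac=0.001):
    """Simplified sketch of top-k gradient exchange.

    grad:     this batch's weight gradient
    residual: gradient accumulated locally since last communication
    velocity: locally accumulated momentum
    """
    # Accumulate momentum and gradient locally instead of discarding them.
    velocity = momentum * velocity + grad
    residual = residual + velocity

    # Pick the top 0.1% of entries by magnitude to communicate.
    k = max(1, int(k_frac * residual.size))
    idx = np.argsort(np.abs(residual))[-k:]

    sent = np.zeros_like(residual)
    sent[idx] = residual[idx]

    # Clear what was sent; keep accumulating everything else
    # (with its momentum) for later communication rounds.
    residual[idx] = 0.0
    velocity[idx] = 0.0
    return sent, residual, velocity

rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
sent, res, vel = sparse_gradient_step(g, np.zeros(10_000), np.zeros(10_000))
print(np.count_nonzero(sent))  # 10 entries communicated out of 10,000
```

The key point the transcript makes is visible in the last two lines of the function: unsent gradients are not dropped, they stay in `residual` (with their momentum) until they eventually win the top-k selection.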

GANs Development at Nvidia

** **

Lastly, I thought I would talk about scaling of applications and something that I think we've done at Nvidia research recently which is very cool: progressive GANs. It's another way of showing that the demand for these networks is going up. Anybody who's played with GANs realizes that you have this problem. Here's a typical GAN for image synthesis: a generator network into which I feed some latent variable—typically just a random number. It produces something like an image of a face, and then a discriminator network takes that image of a face and tries to determine if it's real or fake. That real-or-fake judgment is used as a loss function for the generator.
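A toy sketch of that structure, with simple affine stand-ins for the real deep networks; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """Toy generator: maps a random latent z to a 'sample'.
    Here it is just an affine map; a real one is a deep network."""
    a, b = theta
    return a * z + b

def discriminator(x, phi):
    """Toy discriminator: estimated probability that x is real."""
    w, c = phi
    return 1.0 / (1.0 + np.exp(-(w * x + c)))

# The generator never sees real data. Its entire training signal is
# the discriminator's real-or-fake judgment of its output:
z = rng.standard_normal(8)
fake = generator(z, theta=(1.0, 0.0))
p_real = discriminator(fake, phi=(0.5, 0.0))

# Non-saturating generator loss: push p_real toward 1.
g_loss = -np.log(p_real).mean()
print(fake.shape, g_loss > 0)
```

In a real GAN the discriminator is trained in alternation on real and generated batches, while the generator is updated only through the gradient of `g_loss`, which is exactly the "one-bit feedback" structure the talk describes.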

** **

So realize that the generator network never actually sees a face. It learns how to produce faces entirely from this one-bit feedback, real or fake. When you start out, you initialize the generator with all random weights, and the discriminator with all random weights. It's kind of like you just hired two new people. You put them on the job, and neither has any idea what they're doing. So they just start doing random things, and it takes forever for them to converge. In fact, usually they don't.

** **

So up until we did this, people could not actually produce sharp 1024 x 1024 images. They would look like crud. They would be really blurry, low resolution, because you’re trying to learn too much at once.

** **

You start training at 4 x 4. Then you do 8 x 8, and it's looking like a higher-resolution face. After you've mastered the basics—the 4 x 4 or the 8 x 8 or the 16 x 16—you work up a factor of two at a time, and then you can produce a 1024 x 1024 image. This actually takes a long time to train, because we're not training one network; we're training this progression of networks from 4 x 4 to 1024 x 1024. But to get a given result, it's much faster than trying to train from scratch, because there's a curriculum aspect to it where we learn the simple things first before going on to the more complex.

** **

So this is a fun video. I ask you to look at the hours of training time here. Around a day in, we're up to 64 x 64. A couple of days and we're at 128 x 128. In three days we jump to 512 x 512, and then we go to a thousand. It takes longer to train as the images get bigger. This whole thing wound up training in a little more than two weeks. Two weeks is kind of the threshold. This went viral when the arXiv paper got posted on the web, and I think it had nothing to do with how cool it is to train these GANs progressively. I think it had everything to do with the fact that the researchers in Helsinki who did this chose to use celebrity images.

** **

Now, one thing you can do, given that, with celebrity images is you can take that latent variable after you’ve learned this network and just sort of vary the latent variable around, and it’s a smooth interpolation between the different faces that the network has learned to hallucinate. So, yeah, that’s actually kind of weird. Anyway, to wrap up, it’s really fun working in deep learning, and in particular the hardware for deep learning, because it’s basically revolutionizing almost every aspect of human life—transportation, healthcare, education, graphics—and this whole revolution has been enabled by hardware and it’s being paced by hardware.