Anima Anandkumar, Principal Scientist at Amazon Web Services, Bren Professor at Caltech

Scalability and Amazon SageMaker, machine learning service

Today, we are seeing different forms of scaling. There is scale in terms of large models, and we now have very large deep learning models. There is scale in terms of infrastructure, such as how we do distributed training, sometimes over many machines or many GPU cores.

So, what I want to show in this talk is scale in a different form: dimensions. How can we also scale in terms of thinking of data, and thinking of computations, in many dimensions? That’s where tensors come into play. Indeed, the term tensor is a lot more popular now, given the popularity of TensorFlow.

But I want to go back to its algebraic roots. Think about how the current linear algebra operations can be taken further with tensor algebra. All of the work that I present today is open source. I’ll also present some of the software frameworks to run these and, towards the end, show how SageMaker has some of these algorithms.

If you have one dimension, that’s a vector. If you have two dimensions, that’s a matrix. Anything further is a tensor. So, a tensor really is a generalization of a matrix. We’ve used linear algebra extensively in our machine learning algorithms. If you think about any standard neural network architecture, such as a fully connected layer, they all do matrix operations, some form of matrix multiplication. Is that all, or can we make these architectures richer? That’s where we’ll see that tensors are a natural way to further extend neural network architectures, as well as to encode the data we encounter in different applications in better and richer forms.
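As a small illustration (not from the talk), the vector, matrix, tensor progression can be seen as arrays with one, two, and three axes. The sketch uses NumPy, and the sizes are arbitrary placeholders.

```python
import numpy as np

# Sizes are arbitrary; only the number of axes matters here.
vector = np.zeros(4)           # one dimension
matrix = np.zeros((4, 3))      # two dimensions
tensor = np.zeros((4, 3, 2))   # three dimensions: beyond a matrix

# ndim reports how many axes (dimensions) each array has.
orders = [a.ndim for a in (vector, matrix, tensor)]
print(orders)
```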

The reason tensors are a natural framework for machine learning is that modern data has so many dimensions. Every dimension in the data can naturally be a new dimension in our tensor, in our data structures. That’s not that hard to visualize.

If you think of an image, there are two dimensions, the width and the height of the image. We also have multiple channels like red, green, and blue. That gives you the third dimension, so in that sense, it’s a three-dimensional object. If you think about video, now time is the fourth dimension. So, you’ll have a four-dimensional object.
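A minimal sketch of the image and video case, assuming NumPy arrays and made-up sizes. Putting time on the leading axis is one common layout, not the only one.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
height, width, channels = 32, 48, 3
num_frames = 10

# An RGB image: height x width x channels.
image = np.zeros((height, width, channels))

# A video: a stack of images, with time as the new leading axis.
video = np.stack([image] * num_frames, axis=0)

print(image.ndim, video.ndim)
print(video.shape)
```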

Now, we are going to similarly think of data with multiple modalities. You not only have images, but you also have text. An example application is visual question answering. The input is both an image and a question and based on that, you have to come up with an answer.

In this case, what even are the dimensions of the data? That depends on how we encode the text. Do we encode the text in terms of characters? Do we encode the text in terms of vocabulary? How we represent our input text will determine how many input dimensions are present in this application.

As you see, in real applications you will get data from different input sources and they form different dimensions, so a natural way to have this as input to, let’s say, a neural network architecture would be thinking in terms of a tensor.

You can also use tensors to represent relationships in the data: not only the raw input data by itself, but how different data dimensions are related to one another.

The formulation is simple. If X is a random vector here, you’re looking at the cross-correlation between every two entries of X, as well as each entry’s correlation with itself. All such correlations can be put together in a matrix.
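The pairwise case can be sketched as an empirical second-order moment matrix. This is a minimal example on synthetic data. The names `X`, `M2`, and the sample sizes are assumptions for illustration, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 1000, 4

# Synthetic samples of a random vector X, one sample per row.
X = rng.standard_normal((n_samples, dim))

# M2[i, j] estimates E[x_i * x_j].
# Off-diagonal entries are cross-correlations between two entries of X;
# diagonal entries are each entry's correlation with itself.
M2 = X.T @ X / n_samples

print(M2.shape)
```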

This is very familiar to us as a way to represent pairwise correlations. If your data is Gaussian, all you care about are pairwise correlations. That is enough to fit your data. But then, we know that most models will not be Gaussian. Gaussian will not be a good enough assumption. Hence, we can go to higher order correlations, and that’s where tensors are also useful.
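Going one order higher, triple-wise correlations no longer fit in a matrix and are naturally stored in a three-dimensional array. The sketch below is an assumption-laden illustration: it uses synthetic exponential (non-Gaussian) data, and the names `Xc` and `M3` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 5000, 4

# Synthetic non-Gaussian data: exponential samples, one sample per row.
X = rng.exponential(scale=1.0, size=(n_samples, dim))
Xc = X - X.mean(axis=0)  # center each coordinate

# M3[i, j, k] estimates E[x_i * x_j * x_k] for the centered data.
M3 = np.einsum('ni,nj,nk->ijk', Xc, Xc, Xc) / n_samples

print(M3.shape)
```

The result is symmetric under any permutation of its three indices, just as the pairwise matrix is symmetric under swapping its two.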

Ideally, we want to do something more with correlations. If we just compute the correlations by themselves, that may tell you a little bit, but ideally, there’s some machine learning downstream based on these correlations. We’ll see later in the talk how these correlations can be exploited to get some insights about the data. One example is topic modeling, where we want to extract topics from documents in an unsupervised way. We’ll see how these correlations will help us do that.

Now that we have seen that tensors can incorporate data in multiple dimensions as well as correlations of any order, the next question is what kind of operations we do on tensors. With matrices, we know the primitive is matrix multiplication. So, we can multiply two matrices. Can we do the same with tensors? How do we do operations on tensors? You can see visually that there is a nice generalization of matrix products to more dimensions.

More generally, you can transform any given tensor into a new tensor through these contractions. The vectors could, in general, be matrices, and you can multiply them along different dimensions. You can contract them in different ways. That’s what gives us a very rich dictionary of operations. With a matrix, you can only multiply along two directions. With higher order tensors, we can now multiply along multiple dimensions. You can also multiply them with higher order objects. Not only can you multiply with vectors, you can multiply with matrices, and you can multiply with tensors. Those are very nice extensions of the operations we can do in linear algebra to tensor algebraic operations.
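A few of these contractions can be written with NumPy’s `einsum`. This is only a sketch on random data. The names `T`, `u`, `v`, `A`, `B`, `C` and their sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))  # a third-order tensor
u = rng.standard_normal(3)
v = rng.standard_normal(4)
A = rng.standard_normal((5, 2))

# Contract the first dimension with a vector: the result is a 4 x 5 matrix.
Tu = np.einsum('ijk,i->jk', T, u)

# Contract the first two dimensions with vectors: the result is a length-5 vector.
Tuv = np.einsum('ijk,i,j->k', T, u, v)

# Multiply the third dimension by a matrix: the result is a 3 x 4 x 2 tensor.
TA = np.einsum('ijk,kl->ijl', T, A)

# Ordinary matrix multiplication is the two-dimensional special case.
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
BC = np.einsum('ij,jk->ik', B, C)

print(Tu.shape, Tuv.shape, TA.shape, BC.shape)
```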

The tensor train formulation approach.

There are other ways you can use tensors. As I said, time is now another dimension. You have some observations at each time point, and now you have to think about how to do forecasting in time series efficiently. This is a very challenging problem because you can have long-term dependencies and there can be a lot of chaotic behavior. Even a small change can lead to very different behavior from then on. How do we model something like this?

Intuitively, we know that higher order correlations will be present, especially in such highly chaotic time series. Tensors seem like a natural framework to incorporate higher order correlations for better forecasting. That’s what we explored by tensorizing recurrent neural networks and LSTMs. What a standard recurrent neural network does is maintain a state and propagate it forward. There’s a new input and there’s an output that you want to predict, but it’s a state that you’re propagating forward. You can think of it as a first order model because the state is all that matters to propagate forward. Everything that you want to know about the past is incorporated into this state.
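The first order behavior can be sketched as a plain recurrent update in NumPy. This is a generic textbook-style cell with a `tanh` nonlinearity, not the architecture from the talk. The function name `rnn_step`, the weight names, and all sizes are assumptions.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One first-order update: the new state depends only on the previous state and the input."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

rng = np.random.default_rng(0)
hidden_size, input_size, num_steps = 8, 3, 5

# Small random weights, for illustration only (nothing is trained here).
W_h = 0.1 * rng.standard_normal((hidden_size, hidden_size))
W_x = 0.1 * rng.standard_normal((hidden_size, input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # initial state
inputs = rng.standard_normal((num_steps, input_size))
for x in inputs:
    # Everything remembered about the past is carried in h.
    h = rnn_step(h, x, W_h, W_x, b)

print(h.shape)
```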

Then, you can also think of a higher order model where you also take the past states and incorporate them together. In fact, you can look at the correlations between the different states in the past, and put them together and try to make a decision for forecasting based on all these correlations between the past states. Of course, this would blow up the dimension, so you don’t want to do it as it is.

What we explored was a new form in this context called the tensor train formulation. What it does is try to maintain a low rank approximation of this high order tensor so that you’re not blowing up the dimensions. You still have a reasonable number of parameters in your recurrent neural network. This tensor train goes forward as a chain and tries to find low rank approximations while trying to maintain the polynomials of higher order correlations. You can incorporate this tensor train layer into your recurrent neural network, and you can do the same with LSTMs. In the LSTM cells, you can do the tensorization.
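To give a feel for why a tensor train keeps the parameter count reasonable, here is a minimal sketch of the generic tensor train parameterization: a chain of small three-dimensional cores whose contraction yields the full high order tensor. It is not the tensorized recurrent layer from the talk. The helper `tt_to_full`, the order, the mode size, and the rank are all assumptions chosen for illustration.

```python
import numpy as np

def tt_to_full(cores):
    """Contract TT cores of shape (r_prev, n, r_next) into the full tensor."""
    full = cores[0]
    for core in cores[1:]:
        # Join the trailing rank axis of the chain with the leading rank axis of the next core.
        full = np.tensordot(full, core, axes=([-1], [0]))
    # Boundary ranks are 1, so drop the first and last axes.
    return full[0, ..., 0]

rng = np.random.default_rng(0)
order, n, rank = 4, 10, 3  # hypothetical order, mode size, and TT rank

# Ranks along the chain: 1, r, r, r, 1.
ranks = [1] + [rank] * (order - 1) + [1]
cores = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(order)]

full = tt_to_full(cores)
tt_params = sum(core.size for core in cores)

print(full.shape)
print(tt_params, full.size)
```

With these sizes the chain stores 240 numbers while the full tensor has 10,000 entries, and the gap widens quickly as the order grows.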

What we found by doing this was that it can give much better long-term forecasting compared to standard LSTMs. One example was the traffic data set. This is one of the benchmarks for forecasting where you want to predict the traffic in the next two hours, the next four hours, and so on. You want to predict way into the future, in this case a few hours, and you can see that for the LSTM, the error increases especially when it’s beyond, let’s say, this 12-hour mark. The error increases a lot. But for the tensorized LSTM, the error seems to grow much more gracefully as we increase the window for forecasting.

The SageMaker service potential.

SageMaker is a new platform that allows you to take machine learning from concept to production very seamlessly. There is no set-up required for infrastructure. If you want to do multi-machine training, you can do that out of the box. If you want to change instances, you can do that seamlessly. If you want to host multiple models, do A/B testing, deploy them in different forms, and do model management, you can do that very easily. This does a lot of the heavy lifting when it comes to DevOps so you can focus on the machine learning aspect, which is where a lot of new innovation needs to be done. That’s what SageMaker is very useful for.