Ilya Sutskever’s talk transcript at ScaledML 2018


Chase Hart | April 19th, 2018


Ilya Sutskever, Co-Founder and Research Director of OpenAI

Meta-Learning & Self Play

The Reinforcement Learning Problem

I’ll begin with a slightly higher-level introduction by telling you about the reinforcement learning problem. The reinforcement learning framework simply says: we have an agent in some environment, and we want to find a policy for this agent that maximizes its reward.

In this formulation, the environment gives the agent the observations and the rewards. But in the real world, the agent needs to figure out its own rewards from the observations. The observations come in, and you have a little neural network, or hopefully a big neural network, that does some processing and produces an action.

I’ll explain in this slide the way the vast majority of reinforcement learning algorithms work: you try something random, and if it works better than expected, you do it again. There is some math around it, but that’s basically the core of it. Everything else is slightly clever ways of making better use of this randomness.
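That core loop, try something random and reinforce it if it beat expectations, can be sketched in a few lines. Here is a minimal, illustrative policy-gradient learner on a two-armed bandit; the problem, the learning rates, and all the numbers are my own assumptions, not anything from the talk.

```python
import math
import random

random.seed(0)

# Toy environment: arm 1 pays more on average than arm 0.
def reward(action):
    return random.gauss(1.0 if action == 1 else 0.0, 0.5)

theta = [0.0, 0.0]   # policy logits, one per action
baseline = 0.0       # running estimate of "expected" reward
lr, beta = 0.1, 0.9

def sample_action():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    probs = [p / s for p in z]
    return (0 if random.random() < probs[0] else 1), probs

for step in range(2000):
    a, probs = sample_action()
    r = reward(a)                       # try something (partly) random
    advantage = r - baseline            # was it better than expected?
    for i in range(2):                  # if so, make that action more likely
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += lr * advantage * grad
    baseline = beta * baseline + (1 - beta) * r

a, probs = sample_action()
print(round(probs[1], 2))  # the policy now favors the better arm
```

The `advantage = r - baseline` line is the whole trick: actions that beat the running expectation get pushed up, everything else gets pushed down.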

Now, the potential of RL. The reinforcement learning algorithms that we have right now can solve some problems, but there are also a lot of things they can’t solve.

A really good algorithm would combine the whole spectrum of ideas from machine learning: supervised and unsupervised learning, representation learning, reasoning, and inference, both at training time and at test time. All of those ideas would be put together in the right way to create a system that figures out how the world works. But the algorithms we have today are still nowhere near their full potential.

Hindsight Experience Replay

I want to discuss some ways in which we can improve the reinforcement learning algorithms. As discussed earlier, the way reinforcement learning algorithms work is by trying something random, and if we succeed, if we do better than expected, then we should do it again.

But what happens when we try lots of random things and nothing works? The question is, can we somehow find a way to learn from failure?

The idea is the following: we aim to achieve one thing, but unless we are really good, we will probably fail and achieve something else instead. So why not treat whatever we did achieve as if it were the thing we were trying to achieve all along, and learn from that?

What we need is to have some kind of parameterization of goals so that every time we try to achieve one goal, if we fail, we may achieve a different goal.

We have our robot that tries to achieve the goal but fails; instead, it achieves some other outcome. We treat this outcome as the goal we intended all along, and now we are learning from every experience.
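The relabeling idea above is mechanical enough to show in code. Here is a minimal sketch of hindsight relabeling on a toy grid episode; the episode data, the goal coordinates, and the sparse 0/1 reward are illustrative assumptions, not details from the talk.

```python
# A transition is (state, action, goal, reward). The agent tried to reach
# goal (5, 5) but the episode ended at (2, 3) -- a failure, zero reward.
goal = (5, 5)
episode = [
    ((0, 0), "right", goal, 0.0),
    ((1, 0), "up",    goal, 0.0),
    ((1, 1), "right", goal, 0.0),
    ((2, 1), "up",    goal, 0.0),
    ((2, 2), "up",    goal, 0.0),
    ((2, 3), "stay",  goal, 0.0),   # final state actually reached: (2, 3)
]
achieved = (2, 3)

def relabel(episode, achieved_goal):
    """Hindsight relabeling: swap in the achieved outcome as the goal and
    recompute the sparse reward as if it was the goal all along."""
    out = []
    for state, action, _, _ in episode:
        r = 1.0 if state == achieved_goal else 0.0
        out.append((state, action, achieved_goal, r))
    return out

hindsight = relabel(episode, achieved)
print(sum(r for *_, r in hindsight))  # the "failure" now carries a success signal
```

The relabeled episode goes back into the replay buffer alongside the original, so even total failures produce useful gradient signal.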

Dynamics Randomization for Sim2Real

One of the areas of research we worked on is robotics, and the reason for that is that robotics gives us all these cool problems, makes us grapple with reality. The constraints that the robot imposes on us are good constraints that we would like our algorithm to be able to meet. One of the things that would be very nice when it comes to dealing with robots is if we could somehow train a system in simulation and take its knowledge outside of the simulation onto a real robot. Here, I’m going to show you a really simple way of doing that.

The idea is the following: we know there is going to be some difference between the physics of the simulation and the physics of the real robot, and this difference can be hard to pinpoint. One thing we can do to help the algorithm is to randomize many dimensions of the simulation. We can randomize gravity, friction, torques, lots of things.

Now, when we run the algorithm in the real world, it doesn’t know the values of the dimensions we randomized; it needs to figure them out on the fly. The policy is a recurrent neural network, and as it interacts with the environment, it figures out how the world really is. It works, in some cases in an interesting way, as I’ll show you in the video.
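The randomization itself is just sampling a fresh physics configuration per training episode. Here is a minimal sketch; the parameter names and ranges are my own illustrative choices, not values from the actual experiments.

```python
import random

random.seed(0)

def sample_dynamics():
    """Draw one randomized physics configuration for a training episode.
    All ranges here are illustrative assumptions."""
    return {
        "gravity":  random.uniform(8.0, 11.6),   # m/s^2
        "friction": random.uniform(0.5, 1.5),    # surface-friction multiplier
        "torque":   random.uniform(0.8, 1.2),    # actuator-strength multiplier
        "latency":  random.uniform(0.0, 0.04),   # seconds of action delay
    }

# Every training episode sees a different world, so a recurrent policy is
# forced to infer the hidden parameters from its own observation history.
for episode in range(3):
    physics = sample_dynamics()
    print(episode, {k: round(v, 3) for k, v in physics.items()})
```

Because no single fixed simulator is ever seen twice, the only policy that wins is one that identifies the current dynamics on the fly, which is exactly the closed-loop behavior the video shows.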

In the video, we train the algorithm in simulation without the randomization, and we are trying to bring the hockey puck to its goal, the red dot, and it’s not succeeding. It’s confused; it’s shaking, because the world behaves in a way it doesn’t expect.

Now, if you train the algorithm with these randomizations, then the recurrent neural network is able to, in effect, do system identification: it guesses all the unknown little coefficients it has learned to identify just from the observations. So you can see that it’s very clearly a closed-loop policy. This is a very simple way of doing Sim2Real, and it might even scale to more difficult setups.

Learning a Hierarchy of Actions with Meta-Learning

Though no one has made it work well yet, it would be very helpful to do reinforcement learning with a hierarchical approach.

If we have a distribution over tasks, then basically what we want is to train low-level controllers such that they make it possible to solve tasks from our distribution quickly.

That approach basically works on this domain, where we have this little ant crawling around, and it learned three sub-policies. This made the big problem quite easy to learn, because credit assignment was easy, and so on. The general problem is still unsolved, and this is not the solution, but it’s an interesting demonstration that perhaps will lead to one.
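To make the division of labor concrete, here is a toy sketch of the hierarchy idea: a few shared low-level sub-policies (here just fixed movement directions) and a high-level chooser that adapts quickly per task by picking the sub-policy that makes progress. Everything here, names, tasks, and numbers, is an illustrative assumption, not the actual algorithm from the work.

```python
# Shared low-level sub-policies: each one, when run, moves the agent in a
# fixed direction per step (a stand-in for learned locomotion skills).
SUB_POLICIES = {"north": (0, 1), "east": (1, 0), "west": (-1, 0)}

def solve_task(goal, steps=10):
    """High level: quickly identify which sub-policy helps on this task by
    rolling each one out for a few steps and keeping the best."""
    best, best_dist = None, float("inf")
    for name, (dx, dy) in SUB_POLICIES.items():
        x, y = 0, 0
        for _ in range(steps):          # low level: just run the sub-policy
            x, y = x + dx, y + dy
        dist = abs(goal[0] - x) + abs(goal[1] - y)
        if dist < best_dist:
            best, best_dist = name, dist
    return best

print(solve_task((10, 0)))
print(solve_task((0, 10)))
```

The point of the hierarchy is visible even in this toy: the high level only has to make one choice per task instead of thousands of per-step decisions, which is why credit assignment gets so much easier.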

Evolved Policy Gradients

I also want to show this cool result where we basically thought: okay, it would be cool if you could evolve a cost function that makes it possible to solve RL problems quickly. As we usually do in these situations, we have a distribution over tasks, and we literally evolve the cost function; the fitness of a cost function is the speed with which it lets us solve problems drawn from that distribution.

This is going to be a slightly longer video of a single learning trial. Once the cost function has been learned, then learning is extremely fast. This is also a continual learning trial. It’s constantly updating its parameters and it’s trying to achieve this green half sphere. It’s a little jittery, but after a while, it will succeed.

The learned cost function allows for extremely rapid learning, but it also encodes a lot of information about the distribution of tasks. In that sense, this result is not magic, because you need your training task distribution to be equal to the test task distribution.
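The outer loop, evolve the cost function; inner loop, learn fast under it, can be sketched in miniature. Below, the "cost function" is a single evolved scaling parameter `w`, the inner learner runs a few gradient steps on `w * (x - target)^2`, and fitness is how close the learner gets across a task distribution. All of this is an illustrative toy, not the method's actual parameterization.

```python
import random

random.seed(0)

def inner_learn(w, target, steps=5, lr=0.1):
    """Inner loop: a learner minimizes the evolved loss w*(x-target)^2
    by gradient descent for a few steps. Returns remaining distance."""
    x = 0.0
    for _ in range(steps):
        grad = 2 * w * (x - target)     # gradient of the evolved loss
        x -= lr * grad
    return abs(x - target)

def fitness(w, tasks):
    """Outer objective: a cost function is fit if it makes the inner
    learner fast across the whole task distribution."""
    return -sum(inner_learn(w, t) for t in tasks)

tasks = [random.uniform(-2, 2) for _ in range(20)]   # task distribution

w = 0.1
for gen in range(200):                  # simple (1+1) evolution strategy
    cand = w + random.gauss(0, 0.3)
    if fitness(cand, tasks) > fitness(w, tasks):
        w = cand

print(round(w, 2))
```

Note how the evolved `w` bakes in knowledge of the inner learner's step size and horizon; that is the toy version of the caveat above, where the learned cost function encodes the task distribution it was evolved on.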

Self Play

Self-play is something which is really interesting. It’s mysterious. It’s an old idea that has existed for many years, going back as far as the ‘60s. The first really cool result in self-play is from 1992, by Gerald Tesauro, who used a cluster of 386 computers to train a neural network with Q-learning to play backgammon through self-play. The neural network learned to defeat the world champion and discovered superior strategies that backgammon experts weren’t aware of.

Then there’s the AlphaGo Zero result, which was able to beat all humans at Go from pure self-play. There have also been our results on the Dota 2 1v1 bot, which was likewise trained with self-play and was able to beat the world champion in 1v1.

So self-play is beginning to work, but what is exciting about it? Self-play has the property that even a very simple environment, if you run self-play in it, can potentially produce behaviors of unbounded complexity. Self-play also gives you a way of converting compute into data, which is great because data is really hard to get, but compute is easier to get.

Another very nice thing about self-play is that it provides a very natural curriculum: because your opponent is always matched to you, you always win about 50% of the time. If your opponent is really good and you’re really good, it’s still difficult. It doesn’t matter how good or bad the system is; it is always challenged at the right level. That means you have a very smooth path from agents that do not do much to agents that can potentially do a lot.

This is why I find the idea of self-play exciting, and appealing, and worth thinking about.

One thing that we tried to do last summer is to see if we could use self-play in a physical environment. The hope here was basically to learn a martial art.

The cool thing about this is that no supervised learning was used; this is basically creativity. You simply say: hey, here’s an environment, can you please figure out what to do here? And they discovered something which I think is a legitimate martial art.

So this one is interesting: this is transfer learning. You take one of the sumo wrestlers and you take its opponent away. Now it’s standing in the ring alone, and you apply big random forces to it. As you can see, it does a decent job of maintaining its balance, and the reason is that it’s already used to someone pushing on it. That’s why it’s stable.

So, one thing I was hoping we would get out of this is that you put these robots in self-play, they learn all these physical skills, and then you can fine-tune them to solve some useful task. This hasn’t been done yet: closing the loop of taking an agent out of the self-play environment and fine-tuning it on some task we otherwise cannot solve at all.

Now, I want to say one last thing about self-play, and this is more speculative. Can such a self-play system lead us all the way from where we are right now to AGI?

The high-level view is that a self-play environment gives you a perfect curriculum and an ability to convert compute into data. If you set it up just right, there is basically no bound on how far you can go, on how complex an agent can become inside a self-play environment.

AI Alignment: Learning from human feedback

Here, the question that we’re trying to address is really simple: As we train progressively more powerful AI systems, it will be important to communicate to them goals of greater subtlety and intricacy. How can we do that?

In this work, we investigate one approach, which is to have humans judge the behavior of an algorithm in a way that is really sample-efficient. Here we have a video where human judges provide one bit of information at a time, telling the agent which of two behaviors is more desirable. They do this for a while, and after about 500 such interactions, something amazing happens: it learns to backflip.

How does it happen? In a sense, this is a form of model-based RL, but you’re modeling the reward rather than the environment.

The way it really works is that the human judges provide feedback to the system, and all those bits of feedback are distilled into a model of the reward using a comparison loss. You try to come up with a single reward function that respects all the human feedback given to it. Then you run your RL on this cost function.
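The reward-modeling step can be sketched concretely. Below, a scalar reward model is fit to pairwise preferences with a logistic (Bradley-Terry style) comparison loss, so that states the judge prefers get higher predicted reward. The 1-D states, the "bigger is better" hidden preference, and all hyperparameters are illustrative assumptions, not the actual setup.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Reward model: r(s) = w * s, for 1-D states s.
w = 0.0

# Simulated human feedback: (s_a, s_b, label), label=1 means the judge
# preferred s_a. Hidden ground-truth preference: larger state is better.
data = []
for _ in range(500):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((a, b, 1 if a > b else 0))

lr = 0.5
for epoch in range(50):
    for a, b, label in data:
        # P(a preferred) is modeled from the reward difference
        p = sigmoid(w * a - w * b)
        grad = (p - label) * (a - b)    # gradient of the logistic loss
        w -= lr * grad

# The learned reward now ranks states the way the judge does; RL then
# optimizes this learned reward instead of a hand-written one.
print(w > 0)
```

Each human comparison is just one bit, but a single reward function that must be consistent with all 500 bits ends up carrying far more information, which is what makes the approach so sample-efficient.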

You can also communicate non-standard goals. This is a racing game, but we’ve asked the human judges to communicate the goal that the white car, the one that’s driving, should stay behind the red car; it shouldn’t overtake it. And that’s what it learned to do.

What’s going to happen in the future, most likely, is that as these systems get more powerful, we will hopefully solve the technical problem of communicating whatever goals we want to them. It is the choice of those goals that will be hard: a political problem that we will all, I guess, enjoy facing, or not so much.

You can find Ilya Sutskever’s presentation here and watch the video of his talk.