March 11, 2016

Why AlphaGo Matters

  1. AlphaGo in Context
  2. A Simplified Explanation of How AlphaGo works
    1. Convolutional Neural Networks in AlphaGo
    2. A Brief Note On Reinforcement Learning in Go
    3. Combining Several Neural Networks
  3. AlphaGo’s Limits and Where We Go From Here

An extraordinary thing has been happening over the past few days. DeepMind’s AlphaGo has achieved a huge milestone in AI, beating Lee Sedol in the first two of five matches, roughly a decade earlier than most experts predicted a computer would beat a top professional Go player. For most of recent history, software was laughably bad at Go. In the last decade, algorithmic improvements had barely brought high-performance systems up to the level of a casual hobbyist. Now, suddenly, in late 2015 and early 2016, AlphaGo has beaten European champion Fan Hui and has impressively won its first two matches against Lee Sedol, one of the best players (if not the best) of the last decade.

If you’ve been around me recently, you’ve probably gotten an earful about how cool this is. This post is an attempt to communicate why I’m so impressed by AlphaGo. I’ll avoid diving too deep into the technical details and instead try to provide some big-picture intuition about what AlphaGo is doing. I’m worse than a novice at Go, so I won’t go into any details about the game itself.

AlphaGo in Context

I think it’s fairly easy for a casual follower of tech news to miss what a milestone this is. Others are drawing connections to the victories of Deep Blue at Chess and Watson at Jeopardy. While initially exciting, those mostly led to pretty widespread disappointment. Viewed cynically, they’re less achievements in Artificial Intelligence and more testaments to what big organizations can accomplish if they can assemble a large team and write a lot of code for manual evaluation.

Deep Blue, for instance, relied heavily on hand-coded evaluation. For an analogy, I’ll take a recent nerdy achievement: BotHack, a Clojure bot, became the first bot to conquer a major milestone in the game of NetHack without human intervention. The kind of code that made this possible? Well, here’s a typical commit from the project:

(defn hits-hard? [m]
  ;; previously: (#{"winged gargoyle" "Olog-hai"} (typename m))
  (#{"winged gargoyle" "Olog-hai" "salamander"} (typename m)))

That is, add a hard-coded enemy type, "salamander", to the set of monsters that hit hard. This kind of minute tweaking is the kind of logic that systems like Watson and Deep Blue are full of, and the kind of change IBM’s engineers were allowed to make between matches when Deep Blue faced Kasparov. Deep Blue had a heavily tuned opening book of moves to rely on. From there it crunched through game states by brute force, evaluating all or most possible moves six or more turns ahead of the current position. In some phases of the game, like the endgame, even more comprehensive evaluation was possible, since it could readily enumerate and evaluate every possible continuation through to the end of the game.
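
To make the contrast concrete, here’s a toy sketch, in Python, of that general recipe: depth-limited minimax search on top of a hand-coded evaluation function. This is nothing like Deep Blue’s actual code, and the state interface and evaluation function are placeholders of my own. The point is that all of the game knowledge lives in evaluate(), which humans have to write and tune by hand.

# Sketch only: `state` is assumed to expose legal_moves(), play(move),
# and is_terminal(). None of this is Deep Blue's real code.

def evaluate(state):
    # Stand-in for the enormous hand-tuned evaluation function: material
    # counts, piece positions, king safety, and thousands of special cases.
    return 0.0

def minimax(state, depth, maximizing):
    # Score a position by brute-force look-ahead, `depth` plies deep.
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    scores = [minimax(state.play(move), depth - 1, not maximizing)
              for move in state.legal_moves()]
    return max(scores) if maximizing else min(scores)

def best_move(state, depth=6):
    # Pick the move whose resulting position scores best for us.
    return max(state.legal_moves(),
               key=lambda move: minimax(state.play(move), depth - 1, False))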

Now, I’m less concerned with trying to be critical of previous efforts, and certainly not of BotHack, which is a fun project that accomplished what it set out to do! I merely want to set up how categorically different AlphaGo is. For starters, Go’s game state is vastly more complex than Chess’s. There are more possible Go positions than atoms in the universe, so an exhaustive tree search is simply not tractable. Instead, AlphaGo takes machine learning approaches that are in many ways idealized versions of what humans likely do when they play Go.

This approach is not limited to Go, however. AlphaGo relies on deep learning and reinforcement learning techniques that are general purpose: useful for image classification, learning to drive cars, and a host of other tasks. You could take AlphaGo apart, make a few tweaks, put it back together, and use it for a different game or even a different problem entirely, all just by adjusting its machine learning models. AlphaGo itself is not dissimilar from the earlier system DeepMind trained to play Atari games.

This difference is critical. It’s a huge break from previous game-playing milestones.

A Simplified Explanation of How AlphaGo works

There’s a Nature paper describing how AlphaGo works (though it’s a bit short on the finer details of the neural networks’ structure). I’ll do what I can here to translate it into simpler terms. As I mentioned before, AlphaGo’s strength doesn’t come from hand-coded decision processes. It relies on several machine learning components; in particular, it uses deep learning and reinforcement learning in conjunction to build policy networks. A policy network looks at an agent’s current state and assigns probabilities to the possible actions, reflecting how likely each action is to lead to a reward.

A policy network can be trained in different ways; the breakdown of AlphaGo’s networks is as follows:

  • A policy network trained by supervised learning. That is, it learned by seeing examples of what expert Go players would do.

  • A policy network trained by reinforcement learning. This network learned by playing games and learning to make moves that led to more wins.

  • A fast (but less accurate) rollout policy used in tree search (more on this later) that can quickly select a likely expert move for any given position.

  • A value network that would predict the winner of the game based on the game state.
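
To give a sense of what these pieces look like, here’s a heavily shrunken, hypothetical sketch of a policy network and a value network in PyTorch. The real networks are far bigger (more layers, more filters, many more input planes), and every hyperparameter below is a placeholder of my own. The point is just the shape of the inputs and outputs: image-like board planes in, a probability per board point out for the policy, a single score out for the value.

import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # points per side

class PolicyNet(nn.Module):
    # Board feature planes in, a probability for each of the 361 points out.
    def __init__(self, planes=3, filters=64, layers=4):
        super().__init__()
        convs = [nn.Conv2d(planes, filters, 3, padding=1)]
        convs += [nn.Conv2d(filters, filters, 3, padding=1) for _ in range(layers - 1)]
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(filters, 1, 1)  # one logit per board point

    def forward(self, x):                     # x: (batch, planes, 19, 19)
        for conv in self.convs:
            x = F.relu(conv(x))
        logits = self.head(x).flatten(1)      # (batch, 361)
        return F.softmax(logits, dim=1)

class ValueNet(nn.Module):
    # Board feature planes in, one "who is winning?" score in [-1, 1] out.
    def __init__(self, planes=3, filters=64):
        super().__init__()
        self.conv = nn.Conv2d(planes, filters, 3, padding=1)
        self.fc = nn.Linear(filters * BOARD * BOARD, 1)

    def forward(self, x):
        x = F.relu(self.conv(x)).flatten(1)
        return torch.tanh(self.fc(x)).squeeze(1)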

Three of these are deep convolutional networks: the two policy networks and the value network. They are all intertwined in AlphaGo’s learning process:

  1. The supervised learning policy network learns how to select an expert move from any position.

  2. The reinforcement learning (RL) policy starts from the capabilities of the supervised learning (SL) policy network, and then learns how to beat previous versions of itself.

  3. The value network learns, from the self-play dataset generated by the RL policy, which board positions favor which player.
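
Here’s a rough, hypothetical sketch of what those three training signals look like, reusing the toy PolicyNet / ValueNet above. The real pipeline involved particular optimizers, enormous datasets, and distributed self-play infrastructure; this only shows the kind of loss each stage minimizes.

import torch
import torch.nn.functional as F

def sl_loss(policy, boards, expert_moves):
    # 1. Supervised learning: make the expert's recorded move more probable.
    probs = policy(boards)                               # (batch, 361)
    return F.nll_loss(torch.log(probs), expert_moves)    # expert_moves: (batch,)

def rl_loss(policy, boards, moves_played, outcomes):
    # 2. Reinforcement learning (a REINFORCE-style update): push probability
    #    toward moves from games the current policy went on to win.
    probs = policy(boards)
    chosen = probs.gather(1, moves_played.unsqueeze(1)).squeeze(1)
    return -(outcomes * torch.log(chosen)).mean()        # outcomes: +1 win, -1 loss

def value_loss(value_net, boards, outcomes):
    # 3. Value regression: predict the final game outcome from the position.
    return F.mse_loss(value_net(boards), outcomes)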

Convolutional Neural Networks in AlphaGo

I just can’t emphasize enough how important it is that AlphaGo uses convnets. These learning systems start by treating the Go board as if it were an image with one pixel per board position. So this:

Figure 1. 19x19 professional Go board.

Becomes this:

Figure 2. Pixel values in google doc form.
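
Here’s a minimal sketch of that encoding step. I’ve made up a three-plane version (my stones, the opponent’s stones, empty points); AlphaGo’s actual input stacks many more feature planes on top of the raw stone positions (liberties, turns since a move was played, and so on).

import numpy as np

BOARD = 19

def encode(board, to_play):
    # board: (19, 19) array with 0 = empty, 1 = black, 2 = white.
    # Returns image-like planes: my stones, opponent stones, empty points.
    planes = np.zeros((3, BOARD, BOARD), dtype=np.float32)
    opponent = 3 - to_play
    planes[0] = (board == to_play)
    planes[1] = (board == opponent)
    planes[2] = (board == 0)
    return planes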

When convolutional neural networks are trained to classify images, the convolutional layers end up learning several intermediate representations that function essentially as filters. AlexNet, which moved ImageNet classification accuracy forward from previous computer vision systems by leaps and bounds, produced classifications like this (the following images are from the same source):

Figure 3. Three images classified correctly by AlexNet, one misclassified.

The intermediate representations it learned in order to make these classifications can be visualized like this (at least, for the lower layers):

Figure 4. Edge and texture detectors learned by AlexNet.

These kinds of features combine to form shapes in the next layers:

Figure 5. Lower-level features combine into more complex shapes.

The network can then learn representations that work equally well across various translations of an image:

Figure 6. Classes from features.
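
If "filter" sounds mysterious, here’s a tiny, self-contained example of what a single one does. The vertical-edge kernel below is hand-written; the whole point of a convnet is that it learns hundreds of kernels like this from data instead of having someone specify them.

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; large outputs mean the patch looks
    # like the kernel's pattern, which is what the detectors in Figure 4 do.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                    # dark left half, bright right half
print(conv2d(image, vertical_edge))   # responds strongly along the boundary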

People can do all kinds of fun things with this beyond classification once the network is trained. There are analogies being made to art or dreams when neural networks are used to generate images after being trained to label them. My favorite application of something like this is a paper on training deep networks to recognize and then design chairs (image from the same source):

Figure 7. Neural networks generating images of chairs.

I want to stress that these different networks in AlphaGo are learning to reason about Go visually, at several different layers of abstraction, using these same mechanisms. AlphaGo recognizes strong board positions by first recognizing visual features in the board; it’s connecting moves to the shapes it detects. Now, we can’t see inside AlphaGo unless DeepMind decides to share some visualizations of its intermediate representations. I hope they do, as I bet they’d offer a lot of insight into both the game of Go and how AlphaGo specifically is reasoning about it.

This isn’t to say that the engineers developing systems like Deep Blue or Watson gained no insight from the experience, but it’s insight of a different class. Human beings had to recognize and solve specific problems as they optimized game play, use rules and heuristics to simplify searches, find and guard against strange edge-case behaviors when these things were combined, and so on. All of these require a person to explicitly recognize the problem, design a solution, and program it in. AlphaGo learns the intermediate representations by which it reasons about Go board states; it has not been programmed to do so directly. It’s very possible it will invent new ways of playing Go that human beings will learn from.

And all this by doing things that are fairly analogous to what a human does:

  • watching what expert players do and learning to play like them.

  • imagining what might happen if it makes certain moves.

  • learning which moves lead to good outcomes and which moves lead to poor outcomes.

  • evaluating its strength on the board and the quality of its moves with generalized visual features.

I don’t really have anything more than a passing familiarity with Go’s rules, but I do see Go experts in various forums questioning what they know of the game based on AlphaGo’s play.

A Brief Note On Reinforcement Learning in Go

This is another aspect of AlphaGo that is pretty cool. AlphaGo plays games against forks of itself distributed through time: many, many games, far more than a human player could realistically play. Imagine if a professional player of any game or sport could spawn copies of themselves at various stages of their career, go head to head with those copies, and learn from the experience. That’s a basic summary of the RL policy network’s training regime. The fact that it produces a strong Go player is not particularly surprising.
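
In pseudo-Python, that regime looks roughly like the sketch below. Everything here is a placeholder of my own (play_game and update stand in for a Go engine and the policy-gradient step); the detail worth noticing is the pool of frozen past versions that the current network keeps playing against.

import copy
import random

def train_by_self_play(policy, play_game, update, games=10000, snapshot_every=500):
    # play_game(p1, p2) -> (positions, moves, outcome); update(...) applies an
    # RL step like rl_loss above. Both are assumed helpers, not real APIs.
    past_selves = [copy.deepcopy(policy)]
    for game in range(games):
        opponent = random.choice(past_selves)          # a fork of itself from the past
        positions, moves, outcome = play_game(policy, opponent)
        update(policy, positions, moves, outcome)
        if game % snapshot_every == 0:
            past_selves.append(copy.deepcopy(policy))  # freeze a new past self
    return policy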

Combining Several Neural Networks

The different learning systems were all combined in a Monte Carlo Tree Search (MCTS) procedure that does the following as AlphaGo considers moves (this is an oversimplified account, but it gives the gist):

  1. Each considered move is evaluated once by the SL policy network. An initial evaluation is made and remembered for that move as a prior probability. This particular evaluation is never repeated for any move.

  2. Each considered move is evaluated by the value network and the fast rollout policy network (which is faster than the SL policy network but much less accurate). The fast rollout policy plays out moves to simulate the remainder of the game.

  3. These results are all weighted together, and the move with the highest evaluation is chosen.

So the tree search itself was explicitly programmed, but only as a means of weighing multiple lines of reasoning against one another. The evaluations each of those lines of reasoning provided? All learned from observation and experience.
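
Here’s a single-pass sketch of that evaluation, with placeholder helpers of my own standing in for the trained networks and the Go engine. The real search runs many thousands of simulations, keeps visit counts, and backs values up a tree; the weighting below is just the gist of how the learned pieces contribute.

def choose_move(state, legal_moves, sl_policy, value_net, fast_rollout, lam=0.5, c=1.0):
    priors = sl_policy(state)            # step 1: prior probability per move, computed once
    scores = {}
    for move in legal_moves:
        nxt = state.play(move)
        v = value_net(nxt)               # step 2a: learned judgment of the resulting position
        r = fast_rollout(nxt)            # step 2b: quick play-out to the end of the game
        scores[move] = (1 - lam) * v + lam * r + c * priors[move]   # step 3: weigh them together
    return max(scores, key=scores.get)   # pick the highest-scoring move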

AlphaGo’s Limits and Where We Go From Here

AlphaGo is not a completely human-analogous system by any means, although it takes a large step in that direction. The areas where AlphaGo most lags behind human learners are:

  1. The vast amounts of data it takes to train its deep neural networks. AlphaGo’s SL policy, for instance, was trained on 30 million moves!

  2. An inability to transfer learning about concepts from one domain to another.

On (1): a whole toolbox of techniques has been developed to make neural networks learn more efficiently and to avoid pitfalls like overfitting. I’ll cover many of these in future technical posts, but the takeaway is that their learning requires a lot of hacks and isn’t nearly as efficient as human learning for most tasks.
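
One small, concrete example of that kind of trick: a Go position is equivalent under rotation and reflection, so every recorded expert move can be turned into eight training examples (the move labels have to be transformed the same way). A sketch, reusing the plane encoding from earlier:

import numpy as np

def symmetries(planes):
    # planes: (channels, 19, 19) input planes as in the encoding sketch above.
    # Returns the 8 rotations/reflections; apply the same transform to labels.
    views = []
    for k in range(4):
        rotated = np.rot90(planes, k, axes=(1, 2))
        views.append(rotated)
        views.append(np.flip(rotated, axis=2))   # mirror image
    return views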

(2) is mostly outside the domain of DeepMind’s work here, but I bring it up mainly to highlight a limit of AI and machine learning at present. However good AlphaGo becomes at Go, its architecture will never allow it to read a book about Go and then reason about what to do based on what the board looks like. That’s a superpower still reserved for human learning systems.

So where does AlphaGo’s victory leave us? Here I’ve dropped it into a graphic I shamelessly stole from a presentation by Michael Littman:

Figure 8. The Future of AI, given AlphaGo’s Victory

It’s definitely a huge step forward, and categorically unlike other AI milestones in games because of the learning component. Like others, though, I am concerned about the hype around deep learning, the concentration of deep learning research in private organizations whose primary income is advertising, and so on. Pushing the hype too far could lead to another AI winter, and there are real concerns that pro-neural-network research trends could shut out new methodologies that improve on deep learning or are critical of it, the same way neural network researchers were largely shot down in the 90s.

I’m hoping it will lead to a renaissance in reinforcement learning, a field where I think most of the important work of the next decade will be done. I think there’s an emerging multimodal framework:

  • reinforcement learning for shaping AI agent behavior.

  • convolutional neural networks for visual perception.

  • recurrent neural networks for speech and language.

DeepMind is combining reinforcement learning and convnets in very interesting ways. Others, working on automatic image captioning, are coming up with ways to transfer information from the distributed representations in convnets to those in recurrent neural networks, allowing them to link concepts between visual perception and language. I’m most eager to see how these will all be combined in novel ways over the next few years.

Tags: AlphaGo Deep learning neural networks machine learning AI