Volodymyr Mnih - Playing Atari with Deep Reinforcement Learning (2013)

Created: March 9, 2016 / Updated: February 6, 2021 / Status: finished / 3 min read (~504 words)

  • Reinforcement learning algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed
  • Actions and their resulting rewards can be thousands of timesteps apart
  • Most deep learning algorithms assume the data samples to be independent, while in reinforcement learning we typically encounter sequences of highly correlated states
  • In reinforcement learning, the data distribution changes as the algorithm learns new behaviors
  • The paper presents a convolutional neural network that is trained using a variant of the Q-learning algorithm, with stochastic gradient descent to update the weights
  • The challenge is to learn control policies from raw video data
  • The goal is to create a single neural network agent that is able to successfully learn to play as many Atari 2600 games as possible

  • $\mathcal{E}$: The environment
  • $a_t$: An action at time $t$
  • $\mathcal{A} = \{1, ..., K\}$: A set of legal game actions
  • $x_t \in \mathbb{R}^d$: An image from the emulator at time $t$
  • $r_t$: A reward representing the change in game score at time $t$
  • $s_t = x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t$: A sequence of actions and observations used to learn game strategies that depend upon these sequences
  • Q-network: A neural network function approximator with weights $\theta$
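  • Putting this notation together, the paper trains the Q-network by minimizing a sequence of loss functions $L_i(\theta_i)$, with the previous iteration's parameters $\theta_{i-1}$ held fixed, $\gamma$ the discount factor and $\rho(s, a)$ the behavior distribution over sequences and actions:

    $$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right], \quad \text{where} \quad y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$$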

  • Use of experience replay
    • Store the agent's experiences at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$ in a data set $\mathcal{D} = e_1, ..., e_N$
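    • During training, Q-learning minibatch updates are applied to samples of experience drawn at random from $\mathcal{D}$

A minimal sketch of such a replay memory (class and method names are my own, not taken from the paper):

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity=100_000):
        # Oldest experiences are evicted first once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random minibatch used for the Q-learning updates
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```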

  • Preprocessing done to reduce the input dimensionality
    • The 128-color palette is converted to a gray-scale representation
    • Frames are down-sampled from 210 x 160 pixels to 110 x 84 pixels
    • The final input is obtained by cropping an 84 x 84 pixel region that roughly captures the playing area
      • This cropping is done in order to use the GPU implementation of 2D convolutions which expects square inputs
  • The input to the neural network is an 84 x 84 x 4 image (the last 4 preprocessed 84 x 84 frames stacked); see the sketch after this list
  • The first hidden layer convolves 16 8 x 8 filters with stride 4 and applies a rectifier nonlinearity
  • The second hidden layer convolves 32 4 x 4 filters with stride 2, again followed by a rectifier nonlinearity
  • The final hidden layer is fully-connected and consists of 256 rectifier units
  • The output layer is a fully-connected linear layer with a single output for each valid action
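A minimal sketch of the preprocessing and architecture described above, written here in PyTorch with OpenCV for the image operations (the authors used a different framework, and the exact crop offset is not given in the paper, so the value below is illustrative):

```python
import cv2
import numpy as np
import torch
import torch.nn as nn


def preprocess(frame):
    """210x160 RGB emulator frame -> 84x84 gray-scale crop of the playing area."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # drop color information
    small = cv2.resize(gray, (84, 110))              # down-sample to 110x84 (cv2 takes width, height)
    cropped = small[18:102, :]                       # crop an 84x84 region; the offset 18 is illustrative
    return cropped.astype(np.float32) / 255.0


class DQN(nn.Module):
    """Q-network with the layer sizes listed above (a sketch, not the authors' code)."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 stacked 84x84 frames -> 16 maps of 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 maps of 9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # fully-connected layer of 256 rectifier units
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # linear output: one Q-value per valid action
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (N, 4, 84, 84)
        return self.net(x)
```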

  • Tested on Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders.
  • No modification to the network architecture, learning algorithm or hyperparameters between games
  • Trained on 10 million frames (about 46h at 60 frames/second)
  • The agent sees and selects actions on every $k^{th}$ frame instead of every frame and its last action is repeated on skipped frames
  • k = 4 was used for all games except Space Invaders, where k = 3 was used because the lasers blink at a period that makes them invisible when only every 4th frame is seen (see the wrapper sketch after this list)
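A small wrapper illustrating this frame-skipping scheme, assuming a Gym-style `step(action)` interface (not part of the paper's code):

```python
class FrameSkip:
    """Repeat the chosen action for k emulator frames and return the last observation."""

    def __init__(self, env, k=4):
        self.env = env  # assumed to expose step(action) -> (obs, reward, done, info)
        self.k = k

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            # The last action is repeated on every skipped frame
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```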

  • Mari/o: Conceptual use of the replay system