Background
- yonachn
- Dec 31, 2016
Reinforcement Learning: A reinforcement learning agent attempts to learn a policy π : S → A by maximizing the reward it receives in return for performing specific actions. A policy is a mapping from the state space of the problem, S, to a probability distribution over the actions, A, available at a given state. At each time step t, the agent chooses an action a(t) ∈ A available at the current state s(t) ∈ S, and observes the reward r(t) received for taking this action and the resulting state s(t+1) ∈ S. The expected return after taking an action from a specific state is captured by the action-value function: Qπ(s,a) = E[R(t) | s(t) = s, a(t) = a, π], where R(t) is the discounted return from time t. The action-value function obeys the Bellman equation, the basic recursive update equation of any reinforcement learning problem: Qπ(s(t),a(t)) = E[r(t) + γ max_a′ Qπ(s(t+1), a′)].
Deep-Q-Networks: The DQN algorithm (Mnih et al. (2015)) uses a Convolutional Neural Network (CNN; Krizhevsky, Sutskever, and Hinton (2012)) to approximate the optimal Q function, and from it learn the optimal policy. This is done by optimizing the weights of the network so as to minimize the Temporal Difference (TD) error of the optimal Bellman equation. DQN is an offline algorithm; therefore, it does not use the current state and action to update the network weights. Instead, it samples a minibatch of tuples [s(t), a(t), r(t), s(t+1), γ] from an Experience Replay (Lin (1993)), a buffer storing the agent's experiences at each time step to enable later training updates of the DQN network. DQN maintains two separate Q-networks: a current Q-network with parameters θ and a target Q-network with parameters θ-target. Once every fixed number of training steps, DQN sets θ-target to θ, avoiding frequent updates of the target network and therefore noisy learning.
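To make the update concrete, here is a minimal sketch of a DQN training step in PyTorch, assuming a small fully connected Q-network in place of the CNN used by Mnih et al. (2015). The state dimension, number of actions, buffer size, learning rate, and the terminal-state flag in the stored tuple are illustrative placeholders, not the original paper's settings.

```python
# Minimal DQN-style update sketch: current and target Q-networks,
# an experience replay buffer, and a TD-error minimization step.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # placeholder problem sizes

def make_q_net():
    # Q-network: maps a state to one Q-value per action.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_q_net()                        # current network, parameters theta
target_net = make_q_net()                   # target network, parameters theta-target
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)               # experience replay buffer

def store(s, a, r, s_next, done):
    # Store the experience tuple [s(t), a(t), r(t), s(t+1)] plus a terminal flag.
    replay.append((s, a, r, s_next, done))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # sample a minibatch
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next, r = s.float(), s_next.float(), r.float()

    # Q(s(t), a(t)) under the current network.
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Bellman target r(t) + gamma * max_a' Q(s(t+1), a'),
    # computed with the frozen target network.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + GAMMA * q_next * (1.0 - done.float())

    loss = F.mse_loss(q_sa, target)                       # TD error to minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Copy theta into theta-target.
    target_net.load_state_dict(q_net.state_dict())

# Example: fill the buffer with random transitions and run one update.
for _ in range(64):
    store([random.random()] * STATE_DIM, random.randrange(N_ACTIONS),
          random.random(), [random.random()] * STATE_DIM, False)
train_step()
sync_target()
```

In a full training loop, sync_target() would be called only once every fixed number of steps, matching the θ-target ← θ update described above.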
More material can be found on Tom Zahavy's website.