Background
- yonachn
- Dec 31, 2016
Reinforcement Learning: A reinforcement learning agent attempts to learn a policy π : S → A by maximizing the reward it receives in return for performing specific actions. A policy is a mapping from the state space of the problem, S, to a probability distribution over the actions, A, available at a given state. At each time step t, the agent chooses an action a(t) ∈ A available at the current state s(t) ∈ S, and observes the reward r(t) received for taking this action and the resulting state s(t+1) ∈ S. The expected return after taking an action from a specific state is captured by the action-value function: Qπ(s,a) = E[R(t) | s(t) = s, a(t) = a, π], where R(t) is the discounted return from time t. The action-value function obeys the Bellman equation, the basic recursive update equation of any reinforcement learning problem: Qπ(s(t),a(t)) = E[r(t) + γ max_a′ Qπ(s(t+1), a′)].
Deep-Q-Networks: The DQN algorithm (Mnih et al. (2015)) uses a Convolutional Neural Network (CNN; Krizhevsky, Sutskever, and Hinton (2012)) to approximate the optimal Q function, and from it learn the optimal policy. This is done by optimizing the weights of the network so as to minimize the Temporal Difference (TD) error of the optimal Bellman equation. DQN is an offline algorithm; therefore, it does not use the current state and action to update the network weights. Instead, it samples a minibatch of tuples [s(t), a(t), r(t), s(t+1), γ] from an Experience Replay (Lin (1993)), a buffer storing the agent's experiences at each time step to enable later training updates of the DQN network. DQN maintains two separate Q-networks: a current Q-network with parameters θ and a target Q-network with parameters θ-target. Once every fixed number of training steps, DQN sets θ-target to θ, avoiding frequent updates of the target network and therefore noisy learning.
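To make the update concrete, here is a minimal sketch of a DQN training step in PyTorch, assuming a small fully connected Q-network in place of the CNN used by Mnih et al. (2015). The state dimension, number of actions, buffer size, learning rate, and the terminal-state flag in the stored tuple are illustrative placeholders, not the original paper's settings.

```python
# Minimal DQN-style update sketch: current and target Q-networks,
# an experience replay buffer, and a TD-error minimization step.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # placeholder problem sizes

def make_q_net():
    # Q-network: maps a state to one Q-value per action.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_q_net()                        # current network, parameters theta
target_net = make_q_net()                   # target network, parameters theta-target
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)               # experience replay buffer

def store(s, a, r, s_next, done):
    # Store the experience tuple [s(t), a(t), r(t), s(t+1)] plus a terminal flag.
    replay.append((s, a, r, s_next, done))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # sample a minibatch
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next, r = s.float(), s_next.float(), r.float()

    # Q(s(t), a(t)) under the current network.
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Bellman target r(t) + gamma * max_a' Q(s(t+1), a'),
    # computed with the frozen target network.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + GAMMA * q_next * (1.0 - done.float())

    loss = F.mse_loss(q_sa, target)                       # TD error to minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Copy theta into theta-target.
    target_net.load_state_dict(q_net.state_dict())

# Example: fill the buffer with random transitions and run one update.
for _ in range(64):
    store([random.random()] * STATE_DIM, random.randrange(N_ACTIONS),
          random.random(), [random.random()] * STATE_DIM, False)
train_step()
sync_target()
```

In a full training loop, sync_target() would be called only once every fixed number of steps, matching the θ-target ← θ update described above.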
More material can be found on Tom Zahavy's website.