Deep Reinforcement Learning

This post is written in parallel with some of my ongoing livestreams about implementing deep Q learning with PyTorch. Find the latest one below.


Reinforcement learning (RL) is a technique for creating a self-learning agent (often represented by a neural network) that finds the best way to interact with an environment. The interaction comes in the form of a set of actions, and for every action the environment may confer a reward on the agent (e.g. +1 for winning a game and -1 for losing).

Video games are often a good abstraction for studying RL. The discipline itself is relatively old, but it saw a resurgence in 2015 when it was shown to outperform humans on Atari games.

One of the core algorithms of RL is Q learning. The goal of Q learning is to construct a function that, given the agent's current state, predicts the total reward the agent can expect from taking each of the available actions (this is called the Q value).

The function may be a simple table, or it may be a complicated neural network (often referred to as a Deep Q Network, or DQN for short). Once this function is learned, the best way for the agent to act is to simply pick the action with the largest predicted Q value.

The function is learned using the Bellman equation, which asserts that the Q value of the current state and action equals the immediate reward plus the discounted maximum Q value attainable from the next state: Q(s, a) = r + γ · max over a′ of Q(s′, a′), where γ is a discount factor between 0 and 1. This recursive relationship allows the Q function to be learned starting from a random initialization.

There is one more intricacy to consider: exploration. Since the agent doesn't start with any knowledge of how to act, it needs a way to keep exploring the environment even as it learns. One simple (and reasonably effective) approach is to start off acting completely randomly to generate training data, then slowly decrease the rate of random actions over time, relying more and more on the learned function instead (exploitation). This is called epsilon-greedy.
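To make the Bellman update and epsilon-greedy exploration concrete, here's a minimal tabular Q learning sketch on a hypothetical five-state corridor where the agent earns +1 for reaching the rightmost state (everything here is illustrative, not the Cart Pole setup):

```python
import random

random.seed(0)  # for reproducibility

# A tiny deterministic "corridor" environment (hypothetical, for illustration):
# states 0..4, actions 0 (left) and 1 (right); reaching state 4 gives reward +1.
def env_step(state, action):
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
alpha, gamma, epsilon = 0.1, 0.9, 1.0

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < epsilon:
            action = random.choice((0, 1))
        else:
            action = max((0, 1), key=lambda a: Q[(state, a)])
        next_state, reward, done = env_step(state, action)
        # Bellman update: pull Q(s, a) toward r + gamma * max_a' Q(s', a')
        target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in (0, 1)))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
    epsilon = max(0.05, epsilon * 0.99)  # decay exploration over time
```

After training, the greedy policy derived from the table walks right toward the reward. A DQN replaces the table with a network but keeps exactly this update rule.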

OpenAI's gym and the Cart Pole Environment

The OpenAI gym is an API built to make environment simulation and interaction for reinforcement learning simple. It also contains a number of built-in environments (e.g. Atari games, classic control problems, etc.).

One such classic control problem is Cart Pole, in which a cart carrying an inverted pendulum needs to be controlled such that the pendulum stays upright. The reward mechanics are described on the gym page for this environment.
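A minimal interaction loop looks something like the following. The `run_episode` helper is hypothetical, and it assumes the classic gym step API (`obs, reward, done, info = env.step(action)`); newer gym/gymnasium releases return five values from `step` and an `(obs, info)` pair from `reset`.

```python
# Assumes the classic gym API: reset() -> obs, step(a) -> (obs, reward, done, info).
def run_episode(env, policy):
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)  # pick an action from the current observation
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward

# With gym installed, a random-policy rollout on Cart Pole would look like:
#   import gym
#   env = gym.make("CartPole-v0")
#   run_episode(env, lambda obs: env.action_space.sample())
```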


Results of Applying DQN to the Cart Pole Problem

As part of the first stream in the series mentioned above, I put together a model for solving the Cart Pole environment and got it training with a set of parameters that felt right from past experience. Usually a learning rate somewhere between 1e-3 and 1e-4 tends to work well. I set the epsilon decay factor so that epsilon would hit its minimum of 5% by about half a million steps.
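Assuming epsilon is decayed multiplicatively once per environment step (the exact schedule depends on the training loop), the decay with the parameters above can be sketched as:

```python
def epsilon_at(step, start=1.0, decay=0.99999, floor=0.05):
    # Multiplicative decay per step, clipped at the minimum epsilon.
    return max(floor, start * decay ** step)
```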

One thing I haven't mentioned so far is the target model. DQN tends to be very unstable unless you keep two copies of the same model: one that is updated every batch, and a target copy that is updated only rarely. How often that update happens becomes another hyperparameter. I set it to 5000 epochs (which equals 50,000 steps in the environment).
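The target-model mechanism can be sketched with plain parameter dictionaries; in PyTorch the copy step is typically done with `target.load_state_dict(online.state_dict())`:

```python
def maybe_sync_target(online_params, target_params, step, update_every=5000):
    # Hard update: copy the online weights into the target network
    # every `update_every` steps; otherwise leave the target untouched.
    if step % update_every == 0:
        target_params.clear()
        target_params.update(online_params)
```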

It's generally useful to log as many of DQN's changing variables as possible. For now I'm only logging epsilon, the loss, and the reward observed in occasional test runs. Note that as we train we are also simulating the environment using the epsilon-greedy policy. The observations and other relevant data are stored in a fixed-size buffer (100,000 transitions in this case), and it is sometimes useful to record the average reward over recently finished episodes.
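The fixed-size buffer can be sketched with a `deque`; this is a simplified version, since a real implementation would typically convert sampled transitions to tensors for batching:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Once full, the oldest transitions fall off the front automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random minibatch for the next training step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```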

Below we can see replays of test runs at various points during training, with the reward for each run noted below it. There's a moment at about 400k steps where the system seems to have reached the max reward of 50, but that performance quickly degrades and then takes a while to recover. This suggests that a lower learning rate, or a learning rate schedule, might help.

It's interesting to note that, other than "the loss isn't very large", we learn very little about the model's performance from the loss plot. This is very common in reinforcement learning.

Parameter                            Value
Learning Rate                        0.0001
Optimizer                            Adam
Batch Size                           100
Minimum Epsilon                      0.05
Epsilon Decay Factor                 0.99999
Steps before Training                10
Epochs before Target Model Update    5000

Hyperparameter Sweep

Hyperparameters are a particular pain point in reinforcement learning, even more so than in the rest of deep learning, since it can take a long time before any signs of progress show up. Using a hyperparameter sweep in W&B, we can test how well the parameters in the above example were picked.
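A sweep over these parameters can be described in a W&B sweep configuration along these lines (the parameter names and the metric are hypothetical and must match whatever the training script accepts and logs):

```yaml
method: random
metric:
  name: test_reward        # hypothetical metric logged by the training script
  goal: maximize
parameters:
  learning_rate:
    values: [0.001, 0.0001, 0.00001]
  epsilon_decay:
    values: [0.9999, 0.99999, 0.999999]
  min_epsilon:
    values: [0.01, 0.05, 0.1]
```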

The epsilon decay factor seems to be the most important parameter here. Increasing the learning rate also appears to negatively impact the test score. The minimum epsilon appears to be more forgiving, but it's important to note that this can be very environment-specific: in a game where a single wrong move can mean the end, a large minimum epsilon can effectively cap performance.


Reinforcement learning is a very interesting idea, and in the past few years it has become even more powerful through the use of deep learning and modern hardware. DQN makes for a relatively pain-free starting point for beginners, who can focus on the simpler environments in OpenAI's gym.