Lunar lander

02 Jul 2017

Project Overview

The general field of reinforcement learning concerns agents that learn which actions to take in an environment in order to maximize their cumulative, long-term reward. Many reinforcement learning algorithms are widely applicable and generalizable to a multitude of environments and conditions, from self-driving cars to video games. For example, Mnih et al. apply Deep Q-Network (DQN) learning to a variety of Atari games with human-like (or better) outcomes.

This project involved the development and analysis of a reinforcement learning agent for OpenAI Gym's LunarLander-v2 environment. It applies the same algorithm, DQN, to control a simulated lunar lander module, with the goal of landing safely on a landing pad. The successful completion of the challenge, as defined by OpenAI, emphasizes the robustness and generalizability of reinforcement learning methods and algorithms to novel problems. Lastly, the analysis of agent performance discusses the training and configuration parameters and their effect on outcomes.

Problem Statement

The problem is well-defined from OpenAI’s environment description:

Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.

A successful solution is expected to consistently land safely on the target area with both legs touching the ground, with an average cumulative reward of 200 over 100 trials. Although the challenge did not stipulate a limit on episodes to success, this project had a self-imposed limit of 2,000 episodes. The environment has an 8-dimensional continuous state space and a discrete action space with four actions, and the goal is to safely land a lunar lander module on a landing pad. Put simply, the agent must learn how to control four "levers" to safely land the module. The task is episodic; the agent gets multiple tries and restarts, although the starting conditions are randomly initialized within certain bounds.

While the output (actions) is discrete, the input (state space) is infinitely enumerable across 8 continuous dimensions. The high dimensionality and continuous state space pose especially challenging instances of delayed reward, credit assignment, and the exploration-exploitation tradeoff. Standard, tabular Q-learning cannot be applied successfully without some amount of discretization. The large, complex state space, as well as the environment's similarities to video games, made DQN a natural choice of reinforcement learning algorithm. For example, video games have a complex, high-dimensional state space (screen pixels) and a discrete input space (a video game controller). The theory behind DQN is discussed briefly in the following sections.

Metrics

Due to the nature of reinforcement learning problems, metrics are readily available in the form of cumulative reward. OpenAI's prescribed solution requires that a successful learner achieve an average of 200 points across 100 episodes. In addition to reward, the number of episodes needed to reach the success state is also of interest; put simply, it describes how quickly a reinforcement learning agent converges to a near-optimal policy. Unfortunately, OpenAI has since deprecated its web application leaderboard, but a wiki-driven leaderboard is available for a smaller set of comparative benchmarks.
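
As a concrete reference, the success criterion used throughout this report can be expressed as a 100-episode moving-average check. The sketch below is illustrative; the function and variable names are my own, not taken from the project's source.

```python
import numpy as np

# Minimal sketch of the "solved" criterion: the mean reward over the most
# recent 100 episodes must reach 200 points.
def is_solved(episode_rewards, window=100, threshold=200.0):
    if len(episode_rewards) < window:
        return False
    return np.mean(episode_rewards[-window:]) >= threshold
```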

The scope of the study also includes optimization and examination of hyperparameters, including learning rates and discount factors. These variables are compared against both reward and episodes to success.

Analysis

Data Exploration

Given that the defined problem is a reinforcement learning environment, no data is "given" as in supervised or unsupervised machine learning problems. Rather, the environment provides an interface to test and probe. Manual exploration of the environment was conducted with human control; across 50 episodes, all 50 crashed or landed away from the target area. Exploration also suggested that the problem would be a good candidate for control theory applications, e.g. PID control. Generally speaking, the best course of action for the learning agent is to learn how to control the lander in a stable manner, mimicking or approximating PID control, and then to improve on that to produce safe landings near the target area.

In addition to the insights drawn about the input space in the previous section, manual exploration and source code analysis showed that the input dimensions map to horizontal and vertical position (x, y), horizontal and vertical velocity (vx, vy), orientation (theta), angular velocity (omega), and the ground-contact state of each landing leg (left and right). The output space, or actions, is synchronous per time tick (the agent can only pick one action): fire the main thruster, fire the left or right side thruster, or idle. Given that the state space is continuous, it is considered infinitely large. While discretization of the state-action space is possible, the Q function is instead approximated with a deep neural network, as discussed in the next section. Because of this approximation, there is no concept of traditional state-action pairs as in tabular Q-learning. As mentioned in the previous section, the Deep Q Network algorithm was chosen specifically for the state size and complexity.
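
For illustration, a single observation can be unpacked into these eight components. The field order follows the environment's source code; the variable names below are my own.

```python
import gym

# Illustrative unpacking of one 8-dimensional LunarLander-v2 observation.
env = gym.make('LunarLander-v2')
state = env.reset()
x, y, vx, vy, theta, omega, left_leg, right_leg = state
```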

Using a randomized agent with no learning, 1,000 episodes were conducted to see how it would perform. This extracts high-level information about the environment and input data; chiefly, that failures are very common and that randomized exploration alone will likely not produce any successful landings. It also gives upper and lower bounds against which to evaluate whether agents are actually improving.

Rewards Histogram 1,000 episodes conducted by a randomized agent. All episodes resulted in failure, either from a crash landing (most common) or from time expiration.
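
A minimal sketch of the randomized baseline described above, assuming the gym API of the time (env.reset() returning only the observation):

```python
import gym
import numpy as np

# Run 1,000 episodes with uniformly random actions and record the
# cumulative reward of each episode, to establish a naive baseline.
env = gym.make('LunarLander-v2')
rewards = []
for episode in range(1000):
    env.reset()
    done, total = False, 0.0
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        total += reward
    rewards.append(total)

print('mean %.1f / min %.1f / max %.1f'
      % (np.mean(rewards), np.min(rewards), np.max(rewards)))
```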

Theory

The main algorithm used was a model-free, Deep Q Network (DQN) learning agent. DQN builds off of Q-learning algorithms by using a Deep Neural Network (DNN) for approximating the state-action value function, Q(s, a). The DNN has input nodes mapping to the state space, and yields output nodes mapping to a vector of action values. In the process of experience replay, sequential observations are recorded in a memory bank and are sampled to train the online DNN's weights. Experience replay was a key innovation in DQN and allows the neural networks to be effectively trained on previous experiences, without an overwhelming focus on recency. A parametrically identical target network lags behind the online network and is used to predict values in a consistent manner, as the online network is constantly updating. The target network is intermittently and systematically updated to match the online network weights. The target value predicted from the target DNN with network weights θ⁻ is:

Y_t = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta^-)

… and is equivalent to the reward (R) at a terminal step. Gamma (γ) is the discount factor, and the max value is taken over the target DNN's action values. As an approximation of the Bellman equation, Y_t represents the estimated Q-value. The implementation details of the neural network are discussed in a later section, but the learning rate (α) is treated as an artifact of DNN training. Using a DNN for Q function approximation is of particular relevance to the Lunar Lander problem, due to the extremely large state space. The high fan-out from inputs and high fan-in into outputs make it a good candidate to capture complexity, and the ability of neural networks to effectively approximate continuous mappings is well established (Funahashi).
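
To make the target computation concrete, the following is a hedged sketch of how Y_t might be assembled for a sampled minibatch. Here online_net and target_net stand in for the online and target networks (e.g., Keras models with a predict method); all names are illustrative rather than taken from the project's source.

```python
import numpy as np

def compute_targets(online_net, target_net, states, actions, rewards,
                    next_states, terminals, gamma=0.99):
    # Start from the online network's current predictions so that only the
    # value of the action actually taken is replaced in the regression target.
    targets = online_net.predict(states)
    next_q = target_net.predict(next_states)   # Q(s', a; theta^-)
    for i in range(len(states)):
        if terminals[i]:
            targets[i, actions[i]] = rewards[i]                    # Y_t = R
        else:
            targets[i, actions[i]] = rewards[i] + gamma * np.max(next_q[i])
    return targets
```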

This algorithm was implemented using Python 2.7 and Keras, with both experience replay and a target network, as prescribed by Mnih et al.

Benchmark

As mentioned in previous sections, performance is well-measured by episodes to success and cumulative reward. As of this writing, the leaderboard has only one successful entry, which took 499,474 episodes to succeed.

Methodology

Data Preprocessing

Because the states were produced from a noise-free, deterministic environment simulation, little preprocessing of the input state parameters was needed. In tuning the neural networks, there was no need for regularization or normalization; however, the inputs were reshaped and flattened for compatibility with the neural networks. Additionally, some reshaping and serialization was done to represent experiences in the replay memory deque.
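
A small sketch of the reshaping and replay-memory bookkeeping described above; the memory capacity shown here is an assumption rather than a value reported in the text.

```python
import random
from collections import deque

import numpy as np

# Experiences are serialized as (s, a, r, s', done) tuples in a bounded
# deque; states are reshaped to (1, 8) row vectors for the Keras model.
memory = deque(maxlen=100000)

def remember(state, action, reward, next_state, done):
    memory.append((np.reshape(state, (1, 8)), action, reward,
                   np.reshape(next_state, (1, 8)), done))

def sample_minibatch(batch_size=32):
    return random.sample(memory, batch_size)
```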

Implementation

The DQN algorithm was implemented in Python 2.7, using Keras as the main machine learning library. For readers unfamiliar with neural networks and Q-learning, the Matiisen post cited below is a recommended introduction. The pseudocode is shown below:

initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
  select an action a
    with probability ε select a random action
    otherwise select a = argmax_a' Q(s, a')
  carry out action a
  observe reward r and new state s'
  store experience <s, a, r, s'> in replay memory D
  sample random transitions <ss, aa, rr, ss'> from replay memory D
  calculate target for each minibatch transition
    if ss' is terminal state then tt = rr
    otherwise tt = rr + γ max_aa' Q(ss', aa')
  train the Q network using (tt - Q(ss, aa))² as loss
  s = s'
until terminated
*decay ε

The Deep Q Network (DQN) learning algorithm. (Matiisen 2015)

For the most part, the implementation followed the base DQN algorithm, but some minor changes were made. Rather than a traditional ring buffer, a simple collections deque was used for the experience replay buffer. This affected performance only, trading slightly slower execution for a simpler implementation. With regards to exploration of the state-action space, epsilon decay was implemented to exploit more optimal policies in later episodes. This was carried out per episode, and is the only algorithmic deviation from the original DQN model (shown with an asterisk above). The neural networks followed the prescriptions from the Mnih et al. paper, implementing a minibatch size of 32. Given the lower-dimensional input (vs. video pixels), the networks were created with 2 dense hidden layers (128 x 64) with ReLU activation functions. After successful completion of the challenge, this topology proved to be a sufficient approximator of the Q function. Given the large fan-out to hidden layers and high fan-in to output nodes, the nodes are able to capture and approximate a complex Q function.

Neural Network The Deep Q neural network topology, consisting of 8 input nodes, 128-node hidden layer, a 64-node hidden layer, and 4 output nodes.
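
A minimal sketch of this topology in Keras, assuming the Sequential API and the mean-squared TD error loss implied by the pseudocode; the default learning rate below is one of the values explored in the grid search that follows.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Q-network: 8 state inputs -> 128 ReLU -> 64 ReLU -> 4 linear outputs
# (one estimated Q-value per discrete action).
def build_q_network(learning_rate=0.0001):
    model = Sequential()
    model.add(Dense(128, input_dim=8, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(4, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model

# The target network is a second, identical model whose weights are
# periodically synced from the online network:
#   target_net.set_weights(online_net.get_weights())
```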

Refinement

The critical hyperparameters included:

  • discount (γ)
  • learning rate (α)
  • epsilon decay (ε-decay)
  • training frequency, e.g. when to update the target network

An initial grid search was conducted over epsilon decay (ε-decay) and training frequency; however, ε-decay was found to be better controlled manually to balance the exploration-exploitation dilemma. Given the large state space and the sparse rewards concentrated at terminal states, ensuring that the agent adequately explored state-action combinations was paramount. Reducing the training frequency, i.e., how often the target network was updated, decreased the time needed per training episode but proved too detrimental to overall learning performance. Additional configuration hyperparameters were held constant, including batch size, memory bank size, and, perhaps most importantly, the DNN structure and its activation and loss functions. The DNN structure was determined through generate-and-test methods, and was chosen qualitatively from wall-clock execution times and performance on the learning problem. The two key hyperparameters analyzed in depth were the discount (γ) and the learning rate (α).
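
For reference, a sketch of the ε-greedy selection and the per-episode decay schedule discussed above; the start, floor, and decay constants here are illustrative assumptions, not the manually tuned values.

```python
import random

import numpy as np

EPSILON_START, EPSILON_MIN, EPSILON_DECAY = 1.0, 0.01, 0.995  # assumed values

def select_action(online_net, state, epsilon):
    # Explore with probability epsilon, otherwise act greedily on Q-values.
    if random.random() < epsilon:
        return random.randrange(4)
    q_values = online_net.predict(np.reshape(state, (1, 8)))
    return int(np.argmax(q_values[0]))

def decay_epsilon(epsilon):
    # Applied once per episode (the asterisked deviation in the pseudocode).
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```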

Grid Search Hyperparameter tuning. A 3x3 grid search was conducted on α and γ.

Learning Performance Episodic learning curves are also presented for learners at fixed α = 0.0001 (the value for the optimal agent).

|                  | γ = 0.9           | γ = 0.99          | γ = 0.999         |
|------------------|-------------------|-------------------|-------------------|
| α = 0.01         | DNF, -375         | DNF, -1.8k        | DNF, -2.4k        |
| α = 0.001        | DNF, -131         | 1119, 202         | DNF, -263         |
| α = 0.0001       | DNF, -101         | DNF, 156          | 1298, 200         |

Grid search on hyperparameters, with episodes taken to reach a 100-episode running average reward of 200 (DNF = did not finish). ε-decay was held constant for all experimental trials. Training was bounded to 1,600 episodes, and the resulting DQN learners were run in 100-episode trials to calculate the mean reward.

Both α and γ had dramatic effects on the number of episodes, successful or failed completion, and the mean reward of trained agents. Ranges for γ were selected on a logarithmic scale to elucidate the differences between myopic (roughly 10 steps ahead, from the 1/(1-γ) rule) and far-sighted learners. Notably, small values of γ produced agents that spent too long hovering, as the reward from successful landings was not realized due to heavy discounting. Much larger values of γ were not explored due to the finite time horizon of 1,000 steps; however, it is the author's conjecture that a moderate value of γ encourages the agent to favor faster landings. The learning rate, α, had a profound effect on both immediate divergence and long-term stability. Lower values kept the DNN from settling into local optima below the 200-reward benchmark, while larger values resulted in unstable divergence.
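
For concreteness, the 1/(1-γ) rule of thumb gives the approximate planning horizons behind the three grid-search values:

\frac{1}{1-0.9} = 10 \text{ steps}, \qquad \frac{1}{1-0.99} = 100 \text{ steps}, \qquad \frac{1}{1-0.999} = 1000 \text{ steps}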

Results

Model Evaluation and Validation

Learning Performance Learning performance for the DQN agent, as produced from OpenAI. The DQN agent solved the LunarLander-v2 problem in 1,020 episodes. The DNN contained two hidden layers (128, 64) of ReLU nodes, with linear activation on the output. The target network weights were updated only at the end of each episode (as opposed to every time step), and the online network was trained on minibatches of 32 sampled experiences, using an Adam optimizer and a learning rate of 0.001. Training was halted at 1,120 episodes, with a best 100-episode average of 202.7 +/- 6.97.

Learning Performance Learning performance on an expanded trial of 2,000 episodes. The red line demarcates the episode where the agent first lands successfully. The reward curve is smoothed over a 5-episode window.

Learning Performance Trained performance for a tuned agent. After training, the agent was run for 100 consecutive trials with no learning (left), during which at least 2 catastrophic crashes were observed. The histogram (right) shows the reward distribution among 1,000 trial presentations.

Justification

The final model definitively solves the reinforcement learning problem. As expected, the DQN algorithm generalized very well to the LunarLander-v2 problem; however, it remains to be seen if the same parameters (e.g. DNN topology) would generalize to the original Atari games. By virtue of the randomized restarts, the agent has been shown to be very robust to changing initial conditions. A fully trained agent running on-policy produced a very high success rate and a consistently high score.

The agent successfully completes the challenge posed by OpenAI, and improves on the benchmark in the wiki by 2 orders of magnitude (episodes to success).

Conclusion

Comparison Comparison of agents with different discount factors. The agent on the left (γ=0.99) never completes the reinforcement learning problem, while the agent on the right (γ=0.999) learns how to land safely. This visualization emphasizes the difference in learning that happens with varying discount, and is discussed in detail in the following section.

Histogram A histogram of rewards over training trials. The three distinct peaks show three “milestones” of learning. The leftmost corresponds to the agent learning how to hover. The middle peak represents the agent learning how to land. And the final peak shows when the agent is able to land on target.

Reflection

The project successfully completes the original scope, having produced a working reinforcement learning agent and provided a study and comparison of hyperparameters. DQN was applied effectively to this specific problem and produced successful results. The success underscores the generality of the algorithm and its applicability to other problems.

The influence of the discount factor, γ, produces very interesting differences between agents, and highlights the importance and challenge of future credit assignment. An agent that discounted future rewards too heavily was unable to credit actions to successes far enough in the future; simply put, it was too myopic. As a result, such agents learned how to hover, but never learned how to land. This contrasts well with the successful agent on the right, which was able to assign credit well into the future. With γ=0.999, the agent assigned credit up to roughly 1,000 steps into the future, matching the simulation's step limit. This agent was able to learn how to land. The comparative analysis of these two agent types illustrates the importance of the discount factor as a hyperparameter. Unsurprisingly, the selection of an appropriate discount factor was the most challenging aspect of the project. An exhaustive grid search for an optimal discount was difficult to conduct due to long training times; however, an order-of-magnitude, "good enough" value was found: γ=0.999.

Improvement

While DQN successfully completed the OpenAI Gym challenge, there are still many areas of improvement, as both training time and episodes-to-solve were very high (>1,000). Given more time and resources, the agent could have been tuned via a more exhaustive grid search. As mentioned previously, design and tuning of the DNN was particularly limited in scope. Additionally, it is possible that exposing the network to a larger batch size could have prevented overfitting on outliers. Likewise, the agent could have implemented clipping to bound temporal-difference errors between two values, also limiting the effect of outliers. DQN could also have been enhanced with prioritized experience replay, using a priority queue instead of a deque for the memory bank and experience replay sampling. Lastly, implementation of the more advanced Double Q-learning (DDQN) learner would likely have increased the accuracy of the reinforcement learning agent, as DQN (and Q-learning in general) tends to overestimate action values due to the max operator. Despite these areas of improvement, DQN proved both effective and efficient in tackling a challenging problem with an enormous state-action space and sparse rewards.
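
For reference, the Double DQN variant mitigates this overestimation by decoupling action selection from evaluation: the argmax is taken with the online weights θ, while the chosen action is evaluated with the target weights θ⁻.

Y_t^{\text{DDQN}} = R_{t+1} + \gamma \, Q\left(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta), \; \theta^-\right)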

Due to proprietary, privacy, or academic concerns, the source code is not publicly available, but I am happy to provide it upon request.
