📚 Techniques to Improve the Performance of a DQN Agent
Reinforcement learning challenges and how to solve them
Deep reinforcement learning is not just about replacing a Q-table with a neural network. There are additional techniques you need to implement to improve the performance of the agent. Without them, it can be difficult or even impossible to create a well-performing RL agent.
If you aren't familiar with deep Q networks (DQN), I can recommend this post. The image below summarizes the process: a Q-table is replaced by a neural network that approximates the Q-value of every state-action pair. One reason to use a neural network instead of a Q-table is that a Q-table doesn't scale well. Another reason is that a Q-table cannot handle continuous states or actions.
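To make the contrast concrete, here is a minimal sketch of the tabular Q-learning update that a DQN replaces. The environment sizes and the single transition are hypothetical, chosen just for illustration:

```python
import numpy as np

# tabular Q-learning update for a toy problem with 5 states and 2 actions
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

# one hypothetical transition: (state=0, action=1, reward=1.0, next_state=2)
s, a, r, s_next = 0, 1, 1.0, 2
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# with a DQN, the table lookup Q[s, a] becomes a neural-network forward pass,
# which is what lets the method scale to large or continuous state spaces
```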
Besides the successes like Go, StarCraft and Dota, there are major challenges in reinforcement learning. Here’s a good blog post that describes them in detail. To summarize:
- A reinforcement learning agent needs many samples. This isn’t a problem in gaming, where the agent can play the game again and again. But when dealing with real life scenarios, it’s a big issue.
- There can be easier methods to accomplish good performance, like with Monte Carlo Tree Search (games) or trajectory optimization (robotics).
- Rewards can be shaped or delayed, which affects the behavior of the agent. For example, when the agent only receives a reward at the end of the game, it is hard to determine which specific actions caused the reward. And when you create a reward function with artificial rewards, the agent can start behaving unpredictably.
- An agent can have generalization issues. Every Atari agent can only play the game it was trained on. And even within the same game there are generalization issues: if you train an agent against a perfect player, there is no guarantee that it will play well against a mediocre player.
- Last but not least: the behavior of the agent can be unstable and hard to reproduce. Because of the large number of hyperparameters, and the absence of a ground truth, even the random seed can make the difference between a well-performing and a badly-performing agent. A failure rate of 30% is considered acceptable, which is pretty high!
In the sections below, I will describe six techniques that can improve the performance of a deep Q agent. Not all of the problems explained above can be solved, though.
“Supervised learning wants to work, reinforcement learning must be forced to work.” — Andrej Karpathy
Prioritized Experience Replay
This first technique, experience replay, is easy to implement. The idea is simple, and I haven't encountered a deep reinforcement learning system that doesn't make use of it. It works as follows: instead of updating the weights of the neural network directly after each experience, you randomly sample a batch from past experiences and update the weights using that batch. The experience replay buffer is the memory where you store the most recent transitions (a transition consists of a state, action, reward and next state). Usually the replay buffer has a fixed size.
An improved version of experience replay is prioritized experience replay, in which you replay important transitions more frequently. The importance of a transition is measured by the magnitude of its TD error.
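A minimal replay buffer can be sketched in a few lines. The capacity and transitions below are toy values for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer storing (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # a uniform random sample breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, 1.0, t + 1, False)  # toy transitions

batch = buffer.sample(2)  # only the 3 most recent transitions are still in memory
```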
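The prioritized sampling step can be sketched as follows. The priority exponent `alpha` and the TD errors are illustrative values; in the paper, a priority of (|TD error| + ε) raised to a power α determines the replay probability:

```python
import random

def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    # priority p_i = (|TD error| + eps) ** alpha; alpha=0 recovers uniform sampling
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

td_errors = [0.1, 2.0, 0.5]  # hypothetical TD errors of three stored transitions
probs = priority_probabilities(td_errors)

# the transition with the largest TD error gets the highest replay probability
idx = random.choices(range(len(td_errors)), weights=probs, k=1)[0]
```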
The problem it solves
Autocorrelation is a problem in deep reinforcement learning: when you train an agent on consecutive samples, those samples are correlated, because there is a relationship between consecutive Q-values. Taking a random sample from the experience replay buffer removes this correlation.
Another advantage of experience replay is that previous experiences are used efficiently. The agent learns multiple times from the same experience and this speeds up the learning process.
Literature
- Revisiting Fundamentals of Experience Replay
This paper explains experience replay thoroughly with examples and analysis.
- Prioritized Experience Replay
Instead of sampling randomly from the replay buffer, you can choose to replay important transitions more frequently. This technique makes better use of the previous experiences and learns more efficiently.
Double Deep Q Network
The target of a DQN is:

Y_t = R_{t+1} + γ · max_a Q(S_{t+1}, a; θ_t)
The highest Q-value of the next state is selected, and the same network is used for action selection and evaluation. This can lead to an overestimation of action values (more about this in the next paragraph). Instead of using one neural network for action selection and evaluation, it's possible to use two neural networks. One network is called the online network, and the other one the target network. The weights from the online network are slowly copied to the target network, which stabilizes learning. For a Double DQN, the target is defined as:

Y_t = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)
In this case, you select the action with the highest Q-value using the online network (weights θ), and the target network (weights θ-) is used to estimate its value. Because the target network slowly updates the Q-values, these values aren’t overestimated like when just using the online network.
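The difference between the two targets can be sketched with numpy. The Q-value vectors below are made-up numbers, chosen so that the online and target networks disagree:

```python
import numpy as np

def dqn_target(reward, q_next_target, gamma=0.99):
    # standard DQN target: one network both selects and evaluates the action
    return reward + gamma * np.max(q_next_target)

def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action ...
    best_action = np.argmax(q_next_online)
    # ... and the target network evaluates it
    return reward + gamma * q_next_target[best_action]

q_next_online = np.array([1.0, 3.0, 2.0])  # online network favors action 1
q_next_target = np.array([2.5, 1.0, 2.0])  # target network's estimates

y_dqn = dqn_target(1.0, q_next_target)                         # 1 + 0.99 * 2.5
y_double = double_dqn_target(1.0, q_next_online, q_next_target)  # 1 + 0.99 * 1.0
```

Because the max over noisy estimates is biased upward, the Double DQN target is never larger than the plain DQN target for the same inputs.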
The problem it solves
A problem in reinforcement learning is overestimation of the action values, which can cause learning to fail. In tabular Q-learning, the Q-values converge to their true values; the downside of a Q-table is that it does not scale. For more complex problems we need to approximate the Q-values, for example with a DQN. This approximation introduces noise on the output, due to generalization. The noise can cause systematic overestimation of the Q-values, and in some situations this leads to a suboptimal policy. Double DQN solves this by separating action selection from action evaluation, which leads to more stable learning and improved results.
The results are interesting, and for some games a necessity:
The top row shows the value estimates. You can see the peaks in the DQN graphs, and performance drops when the overestimation starts. The Double DQN avoids this and performs well and stably.
Literature
- Double Q-learning
The original paper on Double Q-learning (no neural networks involved).
- Human-level control through deep reinforcement learning
This paper explains the benefits of using two neural networks in a DQN. It doesn't mention Double Q-learning, but makes use of the online and target network.
- Deep Reinforcement Learning with Double Q-learning
This paper applies double Q-learning to DQNs: it uses the two networks from the paper above and applies the new target as described in this section.
- Clipped Double Q-learning
An improvement of Double Q-learning that uses the minimum value of both networks to avoid overestimation.
Dueling Network Architectures
The next technique that can improve a DQN is called the dueling architecture. The first part of a normal DQN, learning features with a (convolutional) neural network, stays the same. But instead of calculating the Q-values right away, the dueling network has two separate streams of fully connected layers: one stream estimates the state value, and the other the advantage of each action.
The advantage of an action is equal to the Q-value minus the state value:

A(s, a) = Q(s, a) − V(s)
So the higher the advantage, the better to choose the related action in that state.
The final step is to combine the two streams and output the Q-values for each action. The combining step is a forward mapping that calculates the Q-values:

Q(s, a) = V(s) + (A(s, a) − mean_a′ A(s, a′))

You probably expected that the Q-values could be computed by simply adding the state value and the advantage. This doesn't work, and the paper explains in depth why in Section 3.
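The combining step can be sketched in a few lines of numpy. The state value and advantages below are made-up numbers, not outputs of a trained network:

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    # Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))
    # subtracting the mean advantage makes V and A identifiable,
    # which plain addition Q = V + A does not
    return state_value + (advantages - advantages.mean())

V = 5.0                          # hypothetical state value
A = np.array([1.0, -1.0, 0.0])   # hypothetical advantages (mean happens to be 0)
Q = dueling_q_values(V, A)       # [6.0, 4.0, 5.0]
```

A useful side effect of the mean subtraction is that the average Q-value over the actions equals the state value V.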
In the next image, on the bottom you can see the dueling architecture, where the two streams are created and combined:
The problem it solves
Because the dueling architecture learns the values of states, it can determine which states are valuable without having to learn the effect of each action in each state.
Sometimes there are states where it doesn't matter which action you take, because all actions have similar values there. In such problems, the dueling architecture identifies the correct action more quickly during policy evaluation.
The dueling architecture combined with double DQN and prioritized replay gave new state-of-the-art results on the Atari 2600 testbed.
Literature
- Dueling Network Architectures for Deep Reinforcement Learning
The original paper where dueling architecture is introduced.
Actor-Critic
So far, the methods discussed calculate the values of state-action pairs and use these action values directly to determine the best policy: for every state, pick the action with the highest action value. Another way to approach RL problems is to represent the policy directly; in this case, a neural network outputs a probability for each action. These approaches are called value-based and policy-based methods, respectively.
Actor-critic uses a combination of value and policy based methods. The policy is represented independent of the value function. In actor-critic, there are two neural networks:
- A policy network, the actor. This network selects the actions.
- And a deep Q network for the action values, the critic. The deep Q network is trained normally, and learns from the agent’s experiences.
The policy network relies on the action values estimated by the deep Q network. It changes the policy based on these action values. Learning happens on-policy.
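One update step of this interplay can be sketched with a softmax policy over logits. Everything here is a toy illustration: a single state, three actions, hand-picked critic estimates, and a made-up learning rate:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

# actor: one logit per action for a single hypothetical state
logits = np.zeros(3)
# critic: hypothetical action-value estimates for that state
q_values = np.array([0.0, 1.0, -0.5])

probs = softmax(logits)
action = 1                                        # suppose the actor sampled action 1
advantage = q_values[action] - probs @ q_values   # critic feedback: Q(s,a) - V(s)

# policy-gradient step: raise the log-probability of actions with positive advantage
grad_log_pi = -probs.copy()
grad_log_pi[action] += 1.0   # gradient of log softmax w.r.t. the logits
logits += 0.1 * advantage * grad_log_pi

new_probs = softmax(logits)  # the actor now prefers the well-rated action more
```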
Actor-critic methods aren't new; they have existed for over 40 years. But improvements have been made, and in 2016 a paper was published that introduced a new and interesting algorithm. Asynchronous methods introduce parallel actor-learners to stabilize learning. The best-performing method in the paper was the one that combined asynchrony with actor-critic. The algorithm is called Asynchronous Advantage Actor-Critic (A3C) and outperformed many other Atari agents while being trained on a single multi-core CPU. You might wonder what the 'advantage' part of the algorithm does: the key idea is that the critic uses the advantage instead of the raw action values.
Fun fact: a couple of years later, researchers discovered that Advantage Actor-Critic (A2C) performed better than A3C. A2C is the synchronous version of A3C. The advantage part seems to be more important than asynchronous learning!
The problem it solves
Normal policy-based methods, like REINFORCE algorithms, are slow. Why? Because they have to estimate the value of each action by going through multiple episodes, summing the future discounted rewards for each action, and normalizing them. In actor-critic, the critic gives directions to the actor, so the actor can learn the policy faster. Besides, for policy-based methods convergence is only guaranteed in very limited settings.
Other advantages of actor-critic are that less computation is needed because the policy is stored explicitly, and that actor-critic methods can learn the optimal probabilities of selecting the various actions.
Literature
- Actor-Critic Algorithms
A paper from 1999 that analyzes a class of actor-critic algorithms.
- Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Advantage Actor-Critic (A3C) is introduced in this paper.
- Advantage Actor-Critic (A2C)
This post explains why A2C is favored over A3C.
- Soft Actor-Critic
Another actor-critic variant: it tries to act as unpredictably as possible while still collecting as much reward as possible, which encourages exploration. Soft Actor-Critic has proven to be really efficient!
Noisy Networks
NoisyNet adds perturbations to the network weights to drive exploration. It’s a general method for exploration in deep reinforcement learning.
In NoisyNet, noise is added to the parameters of the neural network. The noise causes uncertainty, which introduces more variability in the decisions made by the policy; more variability in the decisions means potentially more exploratory actions.
The image above represents a noisy linear layer. Mu (μ) and sigma (σ) are the learnable parameters of the network. The epsilons (εw and εb) are noise variables that are added to the weight vector w and the bias b.
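A noisy linear layer can be sketched as follows. The layer sizes and the sigma value are illustrative; in a real NoisyNet both mu and sigma are trained by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    # sample fresh noise; the effective weights are mu + sigma * epsilon
    w_eps = rng.standard_normal(w_mu.shape)
    b_eps = rng.standard_normal(b_mu.shape)
    w = w_mu + w_sigma * w_eps
    b = b_mu + b_sigma * b_eps
    return x @ w + b

x = np.ones(4)  # toy input
w_mu, w_sigma = np.zeros((4, 2)), np.full((4, 2), 0.1)  # learnable mu and sigma
b_mu, b_sigma = np.zeros(2), np.full(2, 0.1)

out1 = noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng)
out2 = noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng)
# two forward passes give different outputs, which is what drives exploration
```

When sigma shrinks to zero, the layer reduces to an ordinary linear layer, so the network can learn to stop exploring where it no longer helps.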
The problem it solves
Most methods for exploration, like epsilon-greedy, rely on randomness. In large state-action spaces and with function approximation, as in neural networks, there are no convergence guarantees. There are ways to explore the environment more efficiently, like rewarding the agent when it discovers parts of the environment it hasn't visited before. But these methods have their drawbacks too: the rewards for exploration can trigger unpredictable behavior and are usually not data efficient.
NoisyNet is a simple approach in which the weights of the network are used to drive exploration. It’s an easy addition that you can combine with other techniques from this post. NoisyNet improved scores on multiple Atari games, when used in combination with A3C, DQN and Dueling agents.
Literature
- Noisy Networks for Exploration
The original paper in which NoisyNet is introduced.
Distributional RL
The final technique I will discuss is distributional reinforcement learning. Instead of using the expectation of the return, it’s possible to use the full distribution of the random return received by a reinforcement learning agent.
How does it work? A distributional reinforcement learning agent tries to learn the complete value distribution instead of a single value. The agent wants to minimize the difference between the predicted distribution and the true distribution. A common metric for the difference between two distributions is the Kullback-Leibler divergence (other metrics are possible), and this metric is implemented in the loss function.
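This can be sketched with a small categorical distribution over returns, in the spirit of a C51-style agent. The support atoms and the two distributions below are made-up numbers for illustration:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two discrete return distributions;
    # eps avoids log(0) for zero-probability atoms
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

# hypothetical support of three return "atoms"
atoms = np.array([-1.0, 0.0, 1.0])
true_dist = np.array([0.1, 0.2, 0.7])   # target distribution over returns
predicted = np.array([0.3, 0.4, 0.3])   # agent's current predicted distribution

loss = kl_divergence(true_dist, predicted)  # this is what the agent minimizes
expected_return = atoms @ predicted         # the usual scalar value is just the mean
```

Note how the expectation collapses the whole distribution into one number; the distributional agent keeps the full shape, including any multimodality.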
The problem it solves
The common approach in reinforcement learning is to model the expectation of the return: the value, a single number for every state-action combination. For some problems this isn't the right approach. Approximating the full distribution mitigates the effects of learning from a non-stationary policy, and using distributions stabilizes learning, because multimodality in the value distributions is preserved.
In layman's terms this makes sense: the agent receives more knowledge from the environment (distributions instead of single values). A downside could be that it takes longer to learn a distribution than a single value.
The results of distributional reinforcement learning were remarkable. DQN, Double DQN, Dueling architecture and Prioritized Replay were outperformed by a large margin in a number of games.
Literature
- A Distributional Perspective on Reinforcement Learning
This paper explains why value distributions are important.
Conclusion
When you have a reinforcement learning use case and the performance isn't as high as expected, you should definitely try some of the techniques from this post. Most of them are easy to implement, especially when you have already defined your environment (states, actions and rewards) and have a working DQN agent. All of these techniques improved performance on multiple Atari games, although some work better in combination with others, like noisy networks.
There is a paper that combines all the improvements from this blog post. It's called Rainbow, and the results are interesting, as you can see in the image below:
Techniques to Improve the Performance of a DQN Agent was originally published in Towards Data Science on Medium.