Techniques to Improve the Performance of a DQN Agent



A robot playing games. Image by Dall-E 2.

Reinforcement learning challenges and how to solve them

Deep reinforcement learning is not just about replacing a Q-table with a neural network: there are additional techniques you need to implement to improve the performance of the agent. Without them, it can be difficult or even impossible to create a well-performing RL agent.

If you aren’t familiar with deep Q networks (DQN), I can recommend this post. The image below summarizes the process: a Q-table is replaced by a neural network that approximates the Q-value of every state-action pair. The main reason to use a neural network instead of a Q-table is that a Q-table doesn’t scale well. Another reason is that a Q-table cannot handle continuous states or actions.

Relation between Q-learning and deep Q-learning: the table is replaced by a neural network, where the input layer contains information about the state, and the outputs are Q-values for every action. Image by author.

Besides successes like Go, StarCraft and Dota, there are major challenges in reinforcement learning. Here’s a good blog post that describes them in detail. To summarize:

  • A reinforcement learning agent needs many samples. This isn’t a problem in gaming, where the agent can play the game again and again, but it is a big issue in real-life scenarios.
  • There are often simpler methods that achieve good performance, such as Monte Carlo Tree Search (games) or trajectory optimization (robotics).
  • Rewards can be shaped or delayed, which affects the behavior of the agent. For example, when the agent only receives a reward at the end of the game, it is hard to determine which specific actions caused that reward. And when a reward function adds artificial rewards, the agent can start behaving unpredictably.
  • An agent can have generalization issues. Every Atari agent can only play the game it was trained on. And even within the same game there are generalization issues: if you train an agent against a perfect player, it’s not guaranteed that it will play well against a mediocre player.
  • Last but not least: the behavior of the agent can be unstable and hard to reproduce. Because of the large number of hyperparameters and the lack of a ground truth, even the random seed can make the difference between a good- and a bad-performing agent. A failure rate of 30% is considered acceptable, which is pretty high!

In the upcoming sections, I will describe six techniques that can improve the performance of a deep Q agent, although not all of the problems explained above can be solved.

“Supervised learning wants to work, reinforcement learning must be forced to work.” — Andrej Karpathy

Prioritized Experience Replay

This first technique, experience replay, is easy to implement. The idea is simple, and I haven’t encountered a deep reinforcement learning system that doesn’t make use of it. It works as follows: instead of updating the weights of the neural network directly after a single experience, with experience replay you randomly sample a batch of past experiences and update the weights using this batch. The experience replay buffer is the memory where you store the most recent transitions (a transition consists of state, action, reward and next state). Usually the replay buffer has a fixed size.

Instead of directly learning from one experience, the agent adds a sample from the experience replay buffer to the experience and learns from this batch. Image by author.

An improved version of experience replay is prioritized experience replay, where important transitions are replayed more frequently. The importance of a transition is measured by the magnitude of its TD error.
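
To make this concrete, here is a minimal sketch of a replay buffer with proportional prioritization in plain Python/NumPy. The class name and hyperparameters (capacity, alpha) are illustrative choices, and the importance-sampling weights and the sum-tree data structure from the paper are omitted for brevity.

```python
import random
from collections import namedtuple

import numpy as np

# Convenient container for one transition (s, a, r, s', done).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class PrioritizedReplayBuffer:
    """Fixed-size buffer with proportional prioritization (sketch).

    The probability of sampling transition i is proportional to
    (|TD error_i| + eps) ** alpha. A plain list is O(n) per sample;
    the paper uses a sum-tree for efficiency, which is omitted here.
    """

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.buffer = []
        self.priorities = []
        self.pos = 0  # next write position (ring buffer)

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
            self.priorities.append(priority)
        else:
            self.buffer[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size=32):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        batch = [self.buffer[i] for i in indices]
        # Return indices so priorities can be refreshed after the learning step.
        return batch, indices

    def update_priorities(self, indices, td_errors):
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```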

The problem it solves

Autocorrelation is a problem in deep reinforcement learning: when you train an agent on consecutive samples, those samples are strongly correlated, because there is a relationship between consecutive states and their Q-values. Sampling randomly from the experience replay buffer breaks this correlation.

Another advantage of experience replay is that previous experiences are used efficiently. The agent learns multiple times from the same experience and this speeds up the learning process.

Literature

  • Prioritized Experience Replay (Schaul et al., 2015)

Double Deep Q Network

The target of a DQN is:

y = r + γ · max_a Q(s′, a; θ)

The highest Q-value of the next state is selected, and the same network is used for both action selection and evaluation. This can lead to an overestimation of action values (more about this in the next paragraph). Instead of using one neural network for selection and evaluation, it’s possible to use two: one network is called the online network, and the other the target network. The weights of the online network are slowly copied to the target network, which stabilizes learning. For a Double DQN, the target is defined as:

y = r + γ · Q(s′, argmax_a Q(s′, a; θ); θ⁻)

In this case, you select the action with the highest Q-value using the online network (weights θ), and the target network (weights θ-) is used to estimate its value. Because the target network slowly updates the Q-values, these values aren’t overestimated like when just using the online network.

Schematic view of double Q-learning. Two networks are used, the online network for action selection, and the target network for evaluation. The weights of the online network are slowly copied to the target network once in a while. Image by author.
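
As a rough sketch of how the target computation changes, assume two PyTorch networks online_net and target_net (both names are placeholders) that map a batch of states to one Q-value per action:

```python
import torch


def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    rewards and dones are float tensors of shape (batch,).
    Action selection uses the online network; evaluation uses the target network.
    """
    with torch.no_grad():
        # Selection: argmax over actions with the online network.
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: value of those actions according to the target network.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_q
    return targets


def soft_update(online_net, target_net, tau=0.005):
    """Slowly copy the online weights into the target network (Polyak averaging).

    Periodically hard-copying the weights, as in the original papers, works as well.
    """
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.data.copy_(tau * op.data + (1.0 - tau) * tp.data)
```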

The problem it solves

A problem in reinforcement learning is overestimation of the action values, which can cause learning to fail. In tabular Q-learning, the Q-values converge to their true values; the downside of a Q-table is that it doesn’t scale. For more complex problems we need to approximate the Q-values, for example with a DQN. This approximation introduces noise in the estimated Q-values, because of generalization, and that noise can cause systematic overestimation of the Q-values. In some situations this leads to a suboptimal policy. Double DQN solves this by separating action selection from action evaluation, which leads to more stable learning and better results.

The results are interesting, and for some games a necessity:

Source: Deep Reinforcement Learning with Double Q-learning

The top row shows the value estimates. You can see the peaks in the DQN graphs, and performance drops when the overestimation starts. The Double DQN solves this and performs well and stably.

Literature

  • Deep Reinforcement Learning with Double Q-learning (van Hasselt et al., 2015)

Dueling Network Architectures

The next technique that can improve a DQN is the dueling architecture. The first part of a normal DQN, learning the features with a (convolutional) neural network, stays the same. But instead of calculating the Q-values right away, the dueling network has two separate streams of fully connected layers: one stream estimates the state value, and the other the advantage of each action.

The advantage of an action is equal to the Q-value minus the state value:

A(s, a) = Q(s, a) − V(s)

So the higher the advantage, the better it is to choose that action in that state.

The final step is to combine the two streams and output the Q-values for each action. The combining step is a forward mapping that calculates the Q-values by subtracting the mean advantage:

Q(s, a) = V(s) + ( A(s, a) − (1/|A|) · Σ_a′ A(s, a′) )

You probably expected that the Q-values could be combined by simply adding the state value and the advantage. This doesn’t work, because V and A cannot be recovered uniquely from their sum; the paper explains this identifiability issue in depth in section 3.
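
Here is a minimal sketch of such a dueling head in PyTorch, assuming the (convolutional) feature extractor already exists and produces a flat feature vector; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Dueling head on top of a shared feature extractor (sketch).

    One stream estimates the state value V(s), the other the advantages A(s, a);
    they are combined with the mean-subtracted aggregation from the paper.
    """

    def __init__(self, feature_dim, n_actions, hidden=128):
        super().__init__()
        self.value_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        value = self.value_stream(features)          # shape: (batch, 1)
        advantage = self.advantage_stream(features)  # shape: (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```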

In the next image, on the bottom you can see the dueling architecture, where the two streams are created and combined:

Source: Dueling Network Architectures for Deep Reinforcement Learning

The problem it solves

Because the dueling architecture learns the values of states, it can determine which states are valuable. This is an advantage because there is no need to learn the effect of each action for each state.

Sometimes there are states where the choice of action hardly matters, because all actions have similar values there. In such problems, the dueling architecture identifies the correct action more quickly during policy evaluation.

The dueling architecture combined with double DQN and prioritized replay gave new state-of-the-art results on the Atari 2600 testbed.

Literature

  • Dueling Network Architectures for Deep Reinforcement Learning (Wang et al., 2016)

Actor-Critic

So far, the methods discussed calculate the values of state-action pairs and use these action values directly to determine the best policy: in every state, the best action is the one with the highest action value. Another way to approach RL problems is to represent the policy directly; in that case, a neural network outputs a probability for each action. These two families are called value-based and policy-based methods, respectively.

Actor-critic uses a combination of value-based and policy-based methods. The policy is represented independently of the value function. In actor-critic, there are two neural networks:

  1. A policy network, the actor. This network selects the actions.
  2. And a deep Q network for the action values, the critic. The deep Q network is trained normally, and learns from the agent’s experiences.

The policy network relies on the action values estimated by the deep Q network. It changes the policy based on these action values. Learning happens on-policy.

The architecture of actor-critic. Source: Reinforcement Learning: An Introduction
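
Below is a rough, simplified sketch of a single actor-critic update in PyTorch. It uses the common variant in which the critic estimates the state value and the TD error serves as the advantage (the A2C/A3C formulation); a Q-value critic as described above works similarly. The names actor, critic and optimizer are placeholders, and the optimizer is assumed to hold the parameters of both networks.

```python
import torch
import torch.nn.functional as F


def actor_critic_step(actor, critic, optimizer, state, action, reward,
                      next_state, done, gamma=0.99):
    """One on-policy actor-critic update for a single (unbatched) transition.

    The TD error r + gamma * V(s') - V(s) is used as the advantage signal.
    """
    value = critic(state)  # estimated V(s), shape (1,)
    with torch.no_grad():
        next_value = critic(next_state) * (1.0 - done)
        td_target = reward + gamma * next_value
    advantage = (td_target - value).detach()

    # Critic loss: move V(s) toward the TD target.
    critic_loss = F.mse_loss(value, td_target)

    # Actor loss: increase the log-probability of actions with positive advantage.
    logits = actor(state)
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    actor_loss = -log_prob * advantage

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```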

Actor-critic methods aren’t new; they have existed for over 40 years. But improvements keep being made, and in 2016 a paper was published that introduced a new and interesting algorithm. Asynchronous methods introduce parallel actor-learners to stabilize learning. The best-performing method from the paper was the one that combined the asynchronous setup with actor-critic. The algorithm is called Asynchronous Advantage Actor-Critic (A3C) and outperformed many other Atari agents while being trained on a single multi-core CPU. You might wonder what the ‘advantage’ part of the algorithm does: the key idea is that the advantage is used for the critic instead of the raw action values.

Fun fact: a couple of years later, researchers discovered that Advantage Actor-Critic (A2C), the synchronous version of A3C, performed better than A3C. The advantage part seems to matter more than the asynchronous learning!

The problem it solves

Plain policy-based methods, such as REINFORCE, are slow. Why? Because they have to estimate the value of each action by running many episodes, summing the future discounted rewards for each action, and normalizing them. In actor-critic, the critic gives directions to the actor, which means the actor can learn the policy faster. Besides, for pure policy-based methods, convergence is only guaranteed in very limited settings.

Other advantages of actor-critic are: less computation is needed because the policy is explicitly stored, and actor-critic methods can learn the optimal probabilities of selecting various actions.

The actor and the critic playing a game together, they want to beat the o player. Image by author.

Literature

  • Asynchronous Methods for Deep Reinforcement Learning (Mnih et al., 2016)
  • Reinforcement Learning: An Introduction (Sutton & Barto)

Noisy Networks

NoisyNet adds perturbations to the network weights to drive exploration. It’s a general method for exploration in deep reinforcement learning.

In NoisyNet, noise is added to the parameters of the neural network. The noise introduces uncertainty and therefore more variability in the decisions made by the policy, and more variability in the decisions means potentially more exploratory actions.

Source: Noisy Networks for Exploration

The image above represents a noisy linear layer. Mu and sigma are the learnable parameters of the network. The epsilons (one for the weights, one for the bias) are noise variables that are added to the weight vector w and the bias b.
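
As an illustration, here is a minimal noisy linear layer in PyTorch using independent Gaussian noise per weight; the paper also proposes a more efficient factorised-noise variant, which is omitted here:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Module):
    """Linear layer with learnable noise parameters (independent Gaussian noise).

    A minimal sketch of the idea in 'Noisy Networks for Exploration'.
    """

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        bound = 1.0 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        # Sample fresh noise on every forward pass; the perturbed weights
        # drive exploration instead of an epsilon-greedy schedule.
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```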

The problem it solves

Most methods for exploration, like epsilon-greedy, rely on randomness. In large state-action spaces, and with function approximators such as neural networks, there are no convergence guarantees. There are ways to explore the environment more efficiently, for example by rewarding the agent when it discovers parts of the environment it hasn’t visited before. These methods have their drawbacks too, because artificial exploration rewards can trigger unpredictable behavior and are usually not data efficient.

NoisyNet is a simple approach in which the weights of the network are used to drive exploration. It’s an easy addition that you can combine with other techniques from this post. NoisyNet improved scores on multiple Atari games, when used in combination with A3C, DQN and Dueling agents.

Literature

  • Noisy Networks for Exploration (Fortunato et al., 2017)

Distributional RL

The final technique I will discuss is distributional reinforcement learning. Instead of using the expectation of the return, it’s possible to use the full distribution of the random return received by a reinforcement learning agent.

How does it work? A distributional reinforcement learning agent tries to learn the complete value distribution instead of a single value. The agent minimizes the difference between the predicted distribution and the target distribution. The metric used for measuring the difference between distributions is the Kullback-Leibler divergence (other metrics are possible), and it is implemented in the loss function.
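
Here is a rough sketch of the categorical (C51-style) flavour of this idea in PyTorch: the network outputs a probability distribution over a fixed set of return values ("atoms") per action, Q-values are recovered as the expectation over that support, and the loss is the cross-entropy against a projected target distribution (the Bellman projection step itself is omitted). The support bounds and number of atoms are illustrative.

```python
import torch
import torch.nn.functional as F

# Fixed support of return values z_i shared by all state-action pairs.
n_atoms, v_min, v_max = 51, -10.0, 10.0
support = torch.linspace(v_min, v_max, n_atoms)


def expected_q(logits):
    """Collapse a value distribution back to Q-values: Q(s, a) = sum_i z_i * p_i.

    logits has shape (batch, n_actions, n_atoms).
    """
    probs = F.softmax(logits, dim=-1)
    return (probs * support).sum(dim=-1)  # shape: (batch, n_actions)


def distributional_loss(pred_logits, target_probs):
    """Cross-entropy between the target distribution and the prediction.

    target_probs is the Bellman-updated distribution projected onto the support;
    minimizing this cross-entropy is equivalent (up to a constant) to minimizing
    the KL divergence mentioned above.
    """
    log_probs = F.log_softmax(pred_logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()
```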

The problem it solves

The common approach in reinforcement learning is to model the expectation of the return: the value, a single number for every state-action combination. For some problems this isn’t the right approach. When learning from a non-stationary policy, approximating the full distribution mitigates the resulting instability. Using distributions also stabilizes learning, because multimodality in the value distributions is preserved.

This makes sense in layman’s terms: the agent receives more knowledge from the environment (distributions instead of single values). A downside could be that it takes longer to learn a distribution than a single value.

The results of distributional reinforcement learning were remarkable. DQN, Double DQN, Dueling architecture and Prioritized Replay were outperformed by a large margin in a number of games.

Literature

  • A Distributional Perspective on Reinforcement Learning (Bellemare et al., 2017)

Conclusion

When you have a reinforcement learning use case and the performance isn’t as high as expected, you should definitely try some of the techniques from this post. Most of them are easy to implement, especially when you have already defined your environment (states, actions and rewards) and have a working DQN agent. All of these techniques improved performance on multiple Atari games, although some work better in combination with others, like noisy networks.

There is a paper that combines all the improvements from this blog post. The resulting agent is called Rainbow, and the results are interesting, as you can see in the image below:

Source: Rainbow: Combining Improvements in Deep Reinforcement Learning

Techniques to Improve the Performance of a DQN Agent was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
