# DRL Algorithms
------
## Deep Q Network (DQN)
I studied `Deep Q Network` with the famous paper [Human-level control through deep reinforcement learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) from DeepMind and the `deep q network code` from the [github of asrivat1](https://github.com/asrivat1/DeepLearningVideoGames).
After studying DQN, I made my own code.
------
## Double Deep Q Network (DDQN)
I studied `Double Deep Q Network` with the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461)
> The main idea of this algorithm comes from `Double Q Learning` (van Hasselt, 2010).
> This algorithm uses two sets of weights, θ and θ'.
>
> For each update, one set of weights (θ) is used to determine the greedy policy and the other (θ') is used to determine its value.
>
> The paper says we can decouple the `selection` from the `evaluation` with this method.
>
> This makes it less likely to select overestimated values.
As a result, the difference between DQN and DDQN at each update is as follows.
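With θ' playing the role of the target network, the two update targets are

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta'_t)$$

$$Y_t^{\text{DDQN}} = R_{t+1} + \gamma \, Q\big(S_{t+1}, \operatorname*{argmax}_{a} Q(S_{t+1}, a; \theta_t); \theta'_t\big)$$

so DDQN selects the action with the online weights θ but evaluates it with θ'.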
------
## Prioritized Experience Replay (PER)
The idea of prioritized experience replay is to replay transitions with a large TD-error more frequently, instead of sampling uniformly from the replay memory. However, greedy TD-error prioritization has several issues.
- Transitions that have a low TD-error on their first visit may not be replayed for a long time.
- It is sensitive to noise spikes, which can be exacerbated by bootstrapping.
- Greedy prioritization focuses on a small subset of the experience because errors shrink slowly, and this lack of diversity makes the system prone to over-fitting.
To overcome these issues, a stochastic sampling method that interpolates between `pure greedy prioritization` and `uniform random sampling` is used.
To guarantee a non-zero probability even for the lowest-priority transition, the `probability of sampling transition` i is defined as
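$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$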
- p_i > 0 is the priority of transition i.
- The exponent alpha determines how much prioritization is used, with alpha = 0 corresponding to the uniform case.
To determine p_i, there are two ways; the formulas for both are shown after the list below.
1. Proportional Prioritization
- epsilon is a small positive constant that prevents the edge-case of the transitions not being revisited once their error is zero.
2. Rank-based Prioritization
- rank(i) is the rank of transition i when the replay memory is sorted according to |delta_i|.
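The two variants define the priority as

$$p_i = |\delta_i| + \epsilon \quad \text{(proportional)} \qquad\qquad p_i = \frac{1}{\text{rank}(i)} \quad \text{(rank-based)}$$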
The algorithm for prioritized experience replay is as follows.
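As a rough illustration, here is a minimal Python sketch of a proportional replay buffer. It uses plain numpy sampling instead of the sum-tree from the paper, and the class and method names are illustrative, not taken from this repository:

```python
import numpy as np

class ProportionalReplayBuffer:
    """Minimal proportional PER buffer (no sum-tree, O(N) sampling)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how much prioritization is used (0 = uniform)
        self.eps = eps          # keeps priorities strictly positive
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so that
        # they are replayed at least once.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.data)]
        probs = prios ** self.alpha
        probs /= probs.sum()                 # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Proportional prioritization: p_i = |delta_i| + eps
        self.priorities[idx] = np.abs(td_errors) + self.eps
```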
------
## Dueling Deep Q Network (Dueling DQN)
The image at the top is the single-stream DQN and the image at the bottom is the dueling DQN. The dueling network has two streams to separately estimate the (scalar) state value and the advantages for each action. After that, the two streams are combined to get the final Q-values as output.
The equation of Q-values is as follows.
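Here θ denotes the parameters of the shared convolutional layers, while α and β are the parameters of the advantage and value streams:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)$$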
V(s; θ, β) provides an estimate of the value function, and A(s, a; θ, α) is the output of the advantage stream. The average of the advantages is subtracted from the advantage function to obtain a relative measure of the importance of each action.
The estimates V(s; θ, β) and A(s, a; θ, α) are computed automatically without any extra supervision or algorithmic modifications. Therefore, it is not difficult to implement this algorithm! 😄
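A minimal sketch of this two-stream head, assuming PyTorch (the class name, hidden size, and framework choice are mine, not necessarily what this repository uses):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combines a value stream and an advantage stream into Q-values."""

    def __init__(self, in_features, n_actions, hidden=256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)          # V(s; θ, β): shape (batch, 1)
        a = self.advantage(features)      # A(s, a; θ, α): shape (batch, n_actions)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```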
------
## Deep Recurrent Q Network (DRQN)
DRQN convolves three times over a single-channel image of the game screen. The resulting activations are processed through time by an LSTM layer (the last two timesteps are shown here). The LSTM outputs become Q-values after passing through a fully-connected layer.
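A minimal sketch of that pipeline, again assuming PyTorch and 84×84 single-channel frames (the layer sizes follow the common DQN convolutional stack and are only an assumption, not the repository's exact code):

```python
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch of the conv -> LSTM -> fully-connected DRQN architecture."""

    def __init__(self, n_actions, lstm_size=512):
        super().__init__()
        # Three convolutions over a single-channel 84x84 game screen.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, lstm_size, batch_first=True)
        self.q_head = nn.Linear(lstm_size, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84)
        b, t = frames.shape[:2]
        x = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        x = x.reshape(b, t, -1)            # flatten conv features per timestep
        x, hidden = self.lstm(x, hidden)   # process the sequence through time
        return self.q_head(x), hidden      # Q-values for every timestep
```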