For example, suppose we need to choose the main page of a pizzeria website that best attracts clients
Probability of winning $\theta = 0.05$
$\theta_1 = 0.02$
$\theta_2 = 0.01$ (min)
$\theta_3 = 0.05$
$\theta_4 = 0.10$ (max)
We don't know the true probabilities, but we want to come up with a strategy that maximizes the payoff (reward)
Given possible actions $x_1, \dots, x_n$
At each iteration $t$, for the action $x^t_i$ that we perform, we get a response
$$ y^t_i \sim q(y|x^t_i, \Theta),$$which brings us a reward
$$r_i^t = r(y^t_i)$$There is an optimal action $x_{i^*}$ (sometimes $x_{i^*_t}$, when it varies with $t$) such that
$$\forall i:\ \mathbb{E}(r_{i^*_t}^t) \geq \mathbb{E}(r^t_i) $$Question: How to evaluate different strategies?
The usual quality measure of a multi-armed bandit algorithm $a$ is the regret
$$ R(a) = \sum\limits_{t=1}^T \left(\mathbb{E}(r_{i^*_t}^t) - \mathbb{E}(r_{i^a_t}^t)\right)$$Under synthetic conditions (when we know the probabilities), we can consider
$$ \mathbb{E}\left( R \right) = \int\limits_{\Theta} R(a)\,d\Theta $$Note that bandits do not take delayed effects into account, for example the effect of clickbait in advertising:
SHOCK! Cats want to enslave us
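Returning to the basic bandit setting, here is a minimal simulation sketch of an $\varepsilon$-greedy strategy on the four page variants above, accumulating the expected regret against the best arm (the value of $\varepsilon$ and the variable names are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# True click probabilities of the four page variants (unknown to the agent).
theta = np.array([0.02, 0.01, 0.05, 0.10])
T, epsilon = 10_000, 0.1

counts = np.zeros(len(theta))   # how many times each arm was pulled
values = np.zeros(len(theta))   # running mean reward of each arm
regret = 0.0

for t in range(T):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        i = rng.integers(len(theta))
    else:
        i = int(np.argmax(values))
    r = rng.random() < theta[i]                # Bernoulli reward (a click or not)
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]   # incremental mean update
    regret += theta.max() - theta[i]           # expected regret of this choice

print(f"total expected regret after {T} steps: {regret:.1f}")
```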
Trajectory — $[s_0, a_0, s_1, a_1, s_2, \dots, a_{T-1}, s_T]$
As a strategy model, we simply take a matrix $\pi$ of dimension $|S| \times |A|$
$$\pi(a | s) = \pi_{s,a}$$After selecting the best (elite) trajectories, we obtain a set of pairs
$$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \dots, (s_H, a_H)]$$and maximize the likelihood
$$ \pi_{s,a} = \frac{\sum\limits_{s_t, a_t \in \text{Elite}} [s_t = s][a_t = a]}{\sum\limits_{s_t, a_t \in \text{Elite}} [s_t = s]}$$
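A minimal sketch of this maximum-likelihood update (the function name, variable names and the smoothing constant are illustrative): count state-action occurrences over the elite pairs and normalize each state's row.

```python
import numpy as np

def update_policy(elite, n_states, n_actions, smoothing=1e-6):
    """Maximum-likelihood tabular policy pi[s, a] from elite (state, action) pairs."""
    counts = np.zeros((n_states, n_actions))
    for s, a in elite:
        counts[s, a] += 1
    pi = counts + smoothing                    # tiny smoothing keeps unvisited states uniform
    return pi / pi.sum(axis=1, keepdims=True)  # normalize each state's row

# Example: 3 states, 2 actions, a few elite pairs.
elite = [(0, 1), (0, 1), (1, 0), (2, 1)]
print(update_policy(elite, n_states=3, n_actions=2))
```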
The problem is that there can be a lot of states. Possible solutions:
Expected cumulative reward: $$ \mathbb{E}_{s_0 \sim P(s_0)\,} \mathbb{E}_{a_0 \sim \pi(a|s_0)\,} \mathbb{E}_{s_1, r_0 \sim P(s^\prime,r|s, a)} \dots \left[r_0 + r_1 + \dots + r_T \right]$$
On-Policy Value Function: $$ V_{\pi} (s) = \mathbb{E}_\pi [R_t|s_t = s] = \mathbb{E}_\pi \left[\sum\limits_{k=0}^\infty \gamma^k r_{k+t+1} | s_t = s\right]$$
On-Policy Action-Value Function: $$ Q_{\pi} (s, a) = \mathbb{E}_\pi [R_t|s_t = s, a_t = a] = \mathbb{E}_\pi \left[\sum\limits_{k=0}^\infty \gamma^k r_{k+t+1} | s_t = s, a_t = a \right]$$
where $\pi$ is the strategy followed by the agent, $\gamma \in [0, 1]$
Question: What is the meaning of the difference $Q_{\pi} (s, a) - V_{\pi} (s)$?
Often in RL, we don't need to describe how good an action is in an absolute sense, but only how much better it is than others on average. We make this concept precise with the advantage function: $$ A_{\pi} (s, a) = Q_{\pi} (s, a) - V_{\pi} (s) $$
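For concreteness, a small illustrative sketch of the discounted return $R_t = \sum_k \gamma^k r_{t+k+1}$ that both $V_\pi$ and $Q_\pi$ average over, together with a naive Monte Carlo estimate of $V_\pi(s)$ (the reward sequences are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Monte Carlo estimate of V_pi(s): average the returns of episodes that start in s.
episodes = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0, 3.0]]   # made-up reward sequences from s
v_estimate = sum(discounted_return(ep) for ep in episodes) / len(episodes)
print(v_estimate)
```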
For the Action-Value function $Q$
$$ Q_{\pi} (s,a) = \mathbb{E}_\pi \left[r_t + \gamma \sum\limits_{a^\prime} \pi(a^\prime|s^\prime) Q_{\pi}(s^\prime,a^\prime) \mid s_t = s, a_t = a \right]$$Similarly, for the optimal state value, taking the probabilistic transitions into account:
$$V(s) = \max\limits_a [r(s, a) + \gamma \mathbb{E}_{s^\prime \sim P(s^\prime | s, a)} V(s^ \prime)]$$Iterative formula for recalculating the state values (value iteration):
$$ \begin{align*} \forall s \ V_0(s) &= 0 \\ V_{i+1}(s) &= \max\limits_a [r(s, a) + \gamma \mathbb{E}_{s^\prime \sim P(s^\prime | s, a)} V_{i}(s^\prime)] \end{align*} $$Note: To use this formula in practice, we need to know the transition probabilities $P(s^\prime| s, a)$
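Assuming the transition probabilities are indeed known and stored as an array `P[s, a, s']` together with rewards `R[s, a]` (array names are illustrative), the iterative recalculation can be sketched as follows:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=100):
    """P: (S, A, S) transition probabilities, R: (S, A) immediate rewards."""
    n_states = P.shape[0]
    V = np.zeros(n_states)            # V_0(s) = 0 for all s
    for _ in range(n_iters):
        # Q_i(s, a) = r(s, a) + gamma * E_{s' ~ P(s'|s,a)}[V_i(s')]
        Q = R + gamma * (P @ V)       # shape (S, A)
        V = Q.max(axis=1)             # V_{i+1}(s) = max_a Q_i(s, a)
    greedy_policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)
    return V, greedy_policy
```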
The strategy is then defined as follows:
$$\pi(s) = \arg\max\limits_a Q(s, a)$$Again, due to the stochasticity of transitions,
$$Q(s, a) = \mathbb{E}_{s^\prime} [r(s, a) + \gamma V(s^\prime)]$$It is possible to estimate the expected value without an explicit distribution, using the Monte Carlo method and averaging over the outcomes:
$$Q(s_t, a_t) \leftarrow \alpha \left(r_t+\gamma \max\limits_a Q(s_{t+1}, a) \right) + (1-\alpha) Q(s_{t}, a_{t})$$In theory, we could solve the system of Bellman equations, obtain $Q^*(s,a)$, and be done.
In practice, we have lots of problems:
Let's consider the optimal action-value function:
$$Q^*(s,a) = \max\limits_\pi \mathbb{E}_\pi [R_t|s_t = s, a_t = a] $$There is a similar Bellman equation for $Q^*(s,a)$, and if we solved it,
our strategy would be obvious:
$$\pi^*(s) = \arg\max\limits_a Q^*(s,a)$$
The greedy strategy $\pi$ with respect to the solution $Q^*(s,a)$ of the system of Bellman equations ("choose the action that maximizes $Q^*$") is optimal.
We arbitrarily initialise the value function $V_{\pi}(s)$ and the strategy $\pi$. Then we repeatedly evaluate the current strategy and improve it greedily until the strategy stops changing; a sketch follows.
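A minimal sketch of this evaluate-and-improve loop under known transition probabilities (the array names `P`, `R` and the sweep counts are illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, n_sweeps=50):
    """P: (S, A, S) transition probabilities, R: (S, A) immediate rewards."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                 # arbitrary initialisation of V_pi
    pi = np.zeros(n_states, dtype=int)     # arbitrary initial strategy
    for _ in range(100):                   # improvement rounds
        # 1) Policy evaluation: V_pi(s) = r(s, pi(s)) + gamma * E_{s'}[V_pi(s')]
        for _ in range(n_sweeps):
            V = R[np.arange(n_states), pi] + gamma * (P[np.arange(n_states), pi] @ V)
        # 2) Policy improvement: act greedily with respect to the current V
        new_pi = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_pi, pi):     # strategy stopped changing: done
            break
        pi = new_pi
    return V, pi
```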
SARSA (State-Action-Reward-State-Action) computes the Q-value based on the action actually taken by the current policy, so it is considered an on-policy learning algorithm. Named after the quintuple $(s,a,r,s',a')$, its update formula is as follows:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma Q(s',a') - Q(s,a)\right] $$Q-Learning is an off-policy learning algorithm that computes the Q-value based on the maximum Q-value attainable in the next state. The formula is as follows:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a^\prime} Q(s',a^\prime) - Q(s,a)\right] $$
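Side by side, the two tabular updates differ only in which next-state value they bootstrap from; a minimal sketch with a dictionary-based Q-table (the function and variable names are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)] defaults to 0.0

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action a_next actually chosen by the current policy.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best action available in the next state.
    best_next = max(Q[s_next, b] for b in actions)
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])

# Example transition (states and actions here are arbitrary integers).
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```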
The environment is an Atari game emulator, and each frame is $210\times 160\text{ pix}, \ 128\text{ colors}$. The games include:
Pong
Breakout
Space Invaders
Seaquest
Beam Rider
Source: V. Mnih et al. (DeepMind). Playing Atari with Deep Reinforcement Learning. 2013
Dueling Network is a reinforcement learning architecture. The idea behind Dueling Networks is that in many settings, it could be beneficial to separate the representation of state values and the advantages of each action in a given state.
The architecture of Dueling Networks consists of two streams. One stream is the value function stream, which estimates the value function $V(s)$; the other is the advantage function stream, which estimates the advantage function $A(s, a)$. The two streams are combined to produce the estimate of $Q(s, a)$.
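A minimal PyTorch-style sketch of such a two-stream head (the layer sizes and class name are illustrative, not from the source). Following the dueling architecture, the streams are combined as $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a^\prime} A(s,a^\prime)$; subtracting the mean advantage keeps $V$ and $A$ identifiable:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Two-stream head: a state-value stream V(s) and an advantage stream A(s, a)."""
    def __init__(self, n_features, n_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, x):
        v = self.value(x)                            # shape (batch, 1)
        a = self.advantage(x)                        # shape (batch, n_actions)
        # Combine: Q = V + A - mean(A), so that V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Example: features from a convolutional encoder of an Atari frame.
q_values = DuelingHead(n_features=512, n_actions=6)(torch.randn(1, 512))
```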