Up to this point, we have either reconstructed a function from a training set $(X, Y)$ (supervised learning)
or searched for structure in a set of objects $X$ (unsupervised learning)
But how does learning work in the real world?
Usually we perform an action, observe the result, and gradually learn from it
For example, we need to choose the main page of a pizzeria website that attracts the most clients
Two fundamental difficulties in RL
Exploration vs. Exploitation
Credit assignment
What approaches are there?
Multi-armed bandits — a special case of reinforcement learning with only one state,
i.e., without the credit-assignment difficulty
There is also a purely statistical (non-RL) approach:
A/B testing — many users are shown potentially bad options;
another problem is that the required group size depends on
the true probability values (e.g., conversion rates), which you do not know
Bernoulli's one-armed bandit
Probability of winning $\theta = 0.05$
Multi-Armed Bandits
$\theta_1 = 0.02$
$\theta_2 = 0.01$ (min)
$\theta_3 = 0.05$
$\theta_4 = 0.1$ (max)
We do not know the true probabilities, but we want to come up with a strategy that maximizes the payoff (reward)
Mathematical statement of the problem
Given possible actions $x_1, \dots, x_n$
At each iteration $t$, for the performed action $x^t_i$,
we get a response
$$ y^t_i \sim q(y|x^t_i, \Theta),$$
which brings us a reward
$$r_i^t = r(y^t_i)$$
There is an optimal action $x_{i^*}$ (sometimes time-dependent: $x_{i^*_t}$)
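As an illustration, here is a minimal Python sketch of an $\varepsilon$-greedy strategy for the Bernoulli bandit above. The environment simulation, the horizon, and the value of $\varepsilon$ are assumptions made for the example; the true $\theta$ values are used only to generate rewards and are never seen by the agent.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = [0.02, 0.01, 0.05, 0.10]   # true (hidden) win probabilities from the example
n_arms, T, eps = len(theta), 100_000, 0.1

counts = np.zeros(n_arms)          # how many times each arm was pulled
values = np.zeros(n_arms)          # running estimate of each arm's mean reward

total_reward = 0.0
for t in range(T):
    # exploration vs. exploitation: random arm with prob. eps, best estimate otherwise
    i = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(values))
    r = float(rng.random() < theta[i])          # Bernoulli reward y ~ q(y | x_i, theta)
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]    # incremental mean update
    total_reward += r

print(values)        # estimates approach theta for frequently pulled arms
print(total_reward)
```

Raising $\varepsilon$ explores more (better estimates of all arms) at the cost of pulling suboptimal arms more often, which is exactly the exploration vs. exploitation trade-off.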
Split the state space into sections and treat each section as a discrete state
Obtain action probabilities from a machine learning model $\pi_\theta (a \mid s)$:
a linear model, a neural network, a random forest
(often these probabilities need to be refined further); see the sketch below
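A minimal sketch of this idea under assumed dimensions: a 1-D state space split into 10 sections and a linear model over them whose softmax output plays the role of $\pi_\theta(a \mid s)$. All names and sizes here are illustrative only.

```python
import numpy as np

n_bins, n_actions = 10, 4
W = np.random.randn(n_bins, n_actions) * 0.01      # parameters theta of the policy (linear model)

def discretize(s, low=-1.0, high=1.0):
    """Split a 1-D state space into n_bins sections and return the section index."""
    return int(np.clip((s - low) / (high - low) * n_bins, 0, n_bins - 1))

def pi(s):
    """pi_theta(a | s): softmax over the scores of the discretized state."""
    z = W[discretize(s)]
    p = np.exp(z - z.max())
    return p / p.sum()

a = np.random.choice(n_actions, p=pi(0.3))          # sample an action from the policy
print(pi(0.3))                                      # action probabilities for this state
```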
Example
As a strategy model, we take just a neural network $\pi_\theta$
Initialize it with random weights
At each iteration, after selecting the best trajectories,
we obtain a set of pairs $$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \dots, (s_H, a_H )]$$
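A hedged sketch of one cross-entropy-method iteration along these lines, assuming a Gymnasium-style environment with vector observations and a PyTorch policy network `policy` that outputs one logit per action. The number of episodes, the elite percentile, and the helper name `cem_iteration` are assumptions of the sketch.

```python
import numpy as np
import torch
import torch.nn as nn

def cem_iteration(env, policy, optimizer, n_episodes=100, percentile=70):
    """One iteration: roll out episodes, keep elite trajectories, fit the policy to them."""
    trajectories, returns = [], []
    for _ in range(n_episodes):
        s, _ = env.reset()
        pairs, total, done = [], 0.0, False
        while not done:
            with torch.no_grad():
                probs = torch.softmax(policy(torch.as_tensor(s, dtype=torch.float32)), dim=-1)
            a = int(torch.multinomial(probs, 1))
            s_next, r, terminated, truncated, _ = env.step(a)
            pairs.append((s, a))
            total += r
            s, done = s_next, terminated or truncated
        trajectories.append(pairs)
        returns.append(total)

    # keep only the elite trajectories and fit the policy to their (s, a) pairs
    threshold = np.percentile(returns, percentile)
    elite = [p for traj, ret in zip(trajectories, returns) if ret >= threshold for p in traj]
    states = torch.as_tensor(np.array([s for s, _ in elite]), dtype=torch.float32)
    actions = torch.as_tensor([a for _, a in elite])
    loss = nn.functional.cross_entropy(policy(states), actions)   # maximize log pi(a|s) on elites
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return np.mean(returns)
```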
To compare strategies, we introduce the value functions
$$ V_{\pi} (s) = \mathbb{E}_{\pi} \left[ \sum_{t \geq 0} \gamma^t r_t \;\middle|\; s_0 = s \right], \qquad Q_{\pi} (s, a) = \mathbb{E}_{\pi} \left[ \sum_{t \geq 0} \gamma^t r_t \;\middle|\; s_0 = s,\ a_0 = a \right], $$
where $\pi$ is the strategy followed by the agent and $\gamma \in [0, 1]$ is the discount factor
Question: What is the meaning of the difference $Q_{\pi} (s, a) - V_{\pi} (s)$?
Often in RL, we don't need to describe how good an action is in an absolute sense,
but only how much better it is than others on average.
We make this concept precise with the advantage function:
$$ A_{\pi} (s, a) = Q_{\pi} (s, a) - V_{\pi} (s) $$
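For intuition, a small worked example with assumed numbers: in a state $s$ with two actions, suppose the strategy $\pi$ picks each action with probability $1/2$, so that $V_{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q_{\pi}(s, a)\right]$. Then
$$ Q_{\pi}(s, a_1) = 1, \quad Q_{\pi}(s, a_2) = 3, \quad V_{\pi}(s) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 3 = 2, $$
$$ A_{\pi}(s, a_1) = 1 - 2 = -1, \qquad A_{\pi}(s, a_2) = 3 - 2 = +1, $$
so a positive advantage marks an action that is better than the strategy's average behaviour in $s$.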
There is a similar Bellman equation for $Q^*(s,a)$,
$$ Q^*(s, a) = \mathbb{E}_{s^\prime} \left[ r(s, a) + \gamma \max\limits_{a^\prime} Q^*(s^\prime, a^\prime) \right], $$
and if we solved it, our strategy would be obvious:
$$\pi^*(s) = \arg\max\limits_a Q^*(s,a)$$
$\quad$ The greedy strategy $\pi$ with respect to a solution $Q^*(s,a)$ of the Bellman equations,
"choose the action that maximizes $Q^*$", is optimal.
We arbitrarily initialise the function $V_{\pi}(s)$ and the strategy $\pi$. Then we repeat (see the sketch after these steps):
Initialise $s$
For each agent step:
Select $a$ by strategy $\pi$
Do action $a$, get result $r$ and next state $s^\prime$
Update function $V(s)$ by formula
$V(s) = V(s) + \alpha \left[r + \gamma V(s^\prime) - V(s)\right]$
Go to the next step by assigning $s := s^\prime$
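A minimal sketch of these steps for a tabular value function, assuming a Gymnasium-style environment with discrete (hashable) states and a given strategy `pi(s)`; the function name and hyperparameters are illustrative.

```python
from collections import defaultdict

def td0_evaluation(env, pi, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)                      # arbitrary (zero) initialisation of V
    for _ in range(n_episodes):
        s, _ = env.reset()                      # initialise s
        done = False
        while not done:
            a = pi(s)                           # select a by strategy pi
            s_next, r, terminated, truncated, _ = env.step(a)
            # do not bootstrap from V(s') beyond a terminal state
            V[s] += alpha * (r + gamma * V[s_next] * (not terminated) - V[s])
            s = s_next                          # s := s'
            done = terminated or truncated
    return V
```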
SARSA and Q-Learning in Temporal Difference (TD) Training
SARSA
SARSA (State-Action-Reward-State-Action) computes the Q-value based
on the action chosen by the current policy and is an on-policy learning algorithm.
Named after the quintuple $(s, a, r, s^\prime, a^\prime)$, its update formula is:
$$ Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma\, Q(s^\prime, a^\prime) - Q(s, a) \right] $$
Q-Learning is an off-policy learning algorithm that computes
the Q-value based on the maximum value attainable in the next state.
Its update formula is:
$$ Q(s, a) := Q(s, a) + \alpha \left[ r + \gamma \max\limits_{a^\prime} Q(s^\prime, a^\prime) - Q(s, a) \right] $$
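A minimal tabular sketch of both updates, assuming `Q` is a $|S| \times |A|$ NumPy array indexed by integer states and actions; the $\varepsilon$-greedy behaviour policy is included only for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a' actually chosen by the current policy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in the next state."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def epsilon_greedy(Q, s, eps, rng):
    """Behaviour policy used while learning: explore with probability eps."""
    return int(rng.integers(Q.shape[1])) if rng.random() < eps else int(np.argmax(Q[s]))
```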
The environment is an Atari game emulator, and each frame is
$210\times 160\text{ pix}, \ 128\text{ colors}$
Pong
Breakout
Space Invaders
Seaquest
Beam Rider
States s: 4 consecutive frames compressed to $84 \times 84$
Actions a: from 4 to 18, depending on the game
Rewards r: the change in the in-game SCORE
Value function $Q(s, a; w)$: ConvNN with input $s$ and $|A|$ outputs
Source: V. Mnih et al. (DeepMind). Playing Atari with Deep Reinforcement Learning. 2013
DQN Method
Saves trajectories $(s_t, a_t, r_t)_{t=1}^T$ in a replay memory
for repeated experience replay; this breaks correlations between consecutive samples and helps prevent overfitting
Computes the target $y_t$, an approximation of the optimal value function,
with the current network parameters ${\color{orange} w_t}$ held fixed:
$$ y_t = r_t + \gamma \max\limits_{a^\prime} Q(s_{t+1}, a^\prime; {\color{orange} w_t}) $$
Evaluates loss function for training the neural network model $Q(s, a; w)$:
$$ \mathscr{L} (w) = (Q(s_t, a_t; w) - y_t)^2 $$
with stochastic gradient descent (SGD), using mini-batches of size 32:
$$ w_{t+1} = w_{t} - \eta \left(Q(s_t, a_t; w_t) - y_t\right) \nabla_w Q(s_t, a_t; w_t) $$
It is also better to use two different models for evaluating the max and for updating
(Double Q-learning, Hado van Hasselt, 2010)
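A hedged PyTorch sketch of one such update on a replay mini-batch. The function name, the use of a separate `target_net` to hold the fixed parameters $w_t$, and the batch layout (tensors `s, a, r, s_next, done`) are assumptions of the sketch, not the original implementation.

```python
import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on L(w) = (Q(s_t, a_t; w) - y_t)^2 for a mini-batch from replay memory."""
    s, a, r, s_next, done = batch              # done is a float tensor of 0/1 flags

    # y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; w_t), with w_t (target_net) held fixed
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; w)
    loss = F.mse_loss(q, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the parameters of `target_net` are periodically synchronized with those of `q_net`, which is the role of the second model mentioned above.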
Network architecture for the value function
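The architecture figure is not reproduced here; as an illustration, a PyTorch Q-network in this spirit (input of 4 stacked $84 \times 84$ frames, one output per action), with layer sizes that only approximately follow Mnih et al. (2013).

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """Q(s, a; w): ConvNN with input s (4 stacked 84x84 frames) and |A| outputs."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32x9x9
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),          # one Q-value per action
        )

    def forward(self, s):
        return self.head(self.features(s / 255.0))   # scale raw pixel values to [0, 1]
```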
Dueling Networks
The Dueling Network is a reinforcement learning architecture.
The idea behind it is that in many settings
it is beneficial to represent the value of a state and
the advantage of each action in that state separately.
Architecture
The architecture of Dueling Networks consists of two streams.
One stream is the value stream, which estimates the value function $V(s)$;
the other is the advantage stream, which estimates the advantage function $A(s, a)$.
These two streams are combined to produce the estimate
$$ Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|} \sum\limits_{a^\prime} A(s, a^\prime), $$
where subtracting the mean advantage keeps the decomposition of $Q$ into $V$ and $A$ identifiable.
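A hedged PyTorch sketch of such a two-stream head; the hidden size and the module name are illustrative, and the input `h` denotes the shared representation produced by whatever trunk (e.g., the ConvNN above) precedes the head.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Two-stream head: V(s) and A(s, a) combined into Q(s, a)."""
    def __init__(self, in_dim, n_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, h):
        v = self.value(h)                         # V(s), shape (B, 1)
        a = self.advantage(h)                     # A(s, a), shape (B, |A|)
        # subtracting the mean advantage keeps V and A identifiable (Wang et al., 2016)
        return v + a - a.mean(dim=1, keepdim=True)
```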
The problem of goal alignment
Summary
We got acquainted with the reinforcement learning problem statement
Briefly touched on the multi-armed bandit model
Got acquainted with the cross-entropy method, TD-learning, and Q-learning