We are trying to train just the policy
$$\pi(a \mid s, \theta) = Pr[a_t = a \mid s_t = s, \theta].$$The state probabilities $Pr_\pi[s]$ under policy $\pi$ are very hard to calculate, but we can simply play using policy $\pi$, and this gives us correct samples from that distribution.
The REINFORCE algorithm is a policy-based method used in reinforcement learning. Unlike value-based methods, which learn a value function and derive a policy from it, REINFORCE learns a stochastic policy directly.
$$\nabla_\theta J(\theta) \propto \sum\limits_s Pr_\pi[s] \sum\limits_a Q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta) = \mathbb{E}_{\pi} \left[ \sum\limits_a Q_\pi(S_t,a)\, \nabla_\theta \pi(a \mid S_t, \theta)\right]$$So the SGD updates are
$$\theta_{t+1} = \theta_{t} + \alpha \sum\limits_a \hat Q_\pi(S_t,a,\theta) \nabla_\theta \pi(a|S_t, \theta),$$where $\hat Q_\pi(S_t,a,\theta)$ is some learned approximation of the action-value function $Q$.
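As an illustration, here is a minimal sketch of the classic Monte Carlo variant of REINFORCE for a tabular softmax policy, where the sampled return $G_t$ stands in for $\hat Q_\pi$. The environment interface (`env.reset()`, `env.step()` returning `(next_state, reward, done)`) and the hyperparameters are assumptions for the example, not part of the original text.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(a|s, theta): softmax over per-(state, action) preferences theta[s]."""
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode(env, theta, alpha=0.1, gamma=0.99):
    """Play one episode with pi(.|., theta), then apply REINFORCE updates.

    Uses the sampled return G_t in place of Q_pi(S_t, A_t):
        theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta)
    """
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)   # assumed Gym-like interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    # Walk backwards so G accumulates the discounted return from step t.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        s, a = states[t], actions[t]
        p = softmax_policy(theta, s)
        # grad log pi(a|s) for a softmax policy: one_hot(a) - pi(.|s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```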
TD-Gammon is a computer backgammon program that uses artificial intelligence and machine learning techniques to play the game at a highly competitive level. It was developed by Gerald Tesauro at IBM.
The core of MCTS is the rule for selecting the next node. The UCB1 formula looks the same as in multi-armed bandits:
$$ UCB1 = \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}}, $$where $w_i$ is the total reward (number of wins) accumulated in node $i$, $n_i$ is the number of simulations that passed through node $i$, $t$ is the total number of simulations of the parent node, and $c$ is the exploration constant.
The algorithm prefers actions with higher estimated rewards and higher uncertainties.
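A small sketch of how this selection rule might look in code; the node attributes `wins` and `visits` and the default exploration constant are illustrative assumptions:

```python
import math

def ucb1_select(children, t, c=1.4):
    """Pick the child with the highest UCB1 score.

    children: nodes with .wins (total reward w_i) and .visits (count n_i);
    t is the parent's visit count. Unvisited children get an infinite score
    so that each action is tried at least once.
    """
    def score(node):
        if node.visits == 0:
            return float("inf")
        exploit = node.wins / node.visits                 # w_i / n_i
        explore = c * math.sqrt(math.log(t) / node.visits)
        return exploit + explore

    return max(children, key=score)
```

In a full MCTS loop this selection is applied repeatedly from the root until a leaf is reached, after which the tree is expanded, the position is evaluated, and the statistics are backed up along the path.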
AlphaGo vs. Lee Sedol
9-15 March 2016
The forecast was 1 : 4; the result was 4 : 1.
The development of AlphaGo was a significant achievement in the field of Artificial Intelligence.
AlphaZero is a more general and simpler variant of the AlphaGo algorithm, capable of playing Chess, Shogi, and Go.
We minimize
$$ Loss = (z-v)^2 - \pi^\top \log p + c \|\theta\|^2, $$where $p$ is the vector of move probabilities produced by the neural network, $\pi$ is the MCTS-improved vector of move probabilities, $z \in \mathbb{R}$ is the outcome of the self-play game (the value target) and $v \in \mathbb{R}$ is the value estimate from the network, $\theta$ are the network weights, and $c$ is the regularization coefficient.
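A sketch of this loss in PyTorch, assuming a network that outputs raw policy logits and a scalar value per position; the argument names are placeholders rather than the original implementation:

```python
import torch

def alphazero_loss(p_logits, v, pi_target, z, params, c=1e-4):
    """AlphaZero-style loss for one batch (illustrative sketch).

    p_logits : raw policy outputs of the network, shape (B, n_moves)
    v        : value predictions of the network, shape (B,)
    pi_target: MCTS-improved move probabilities, shape (B, n_moves)
    z        : self-play game outcomes, shape (B,)
    params   : iterable of network parameters theta (for the L2 term)
    """
    value_loss = ((z - v) ** 2).mean()                    # (z - v)^2
    log_p = torch.log_softmax(p_logits, dim=1)
    policy_loss = -(pi_target * log_p).sum(dim=1).mean()  # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)          # c * ||theta||^2
    return value_loss + policy_loss + l2
```

In practice the $c \|\theta\|^2$ term is usually handled by the optimizer's weight decay rather than added to the loss explicitly.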