- Vanilla Neural Networks: fixed-size input → fixed-size output
- Image Captioning: image → sequence of words
- Sentiment Classification: sequence of words → sentiment
- Machine Translation: sequence of words → sequence of words
- Video Classification (on frame level): sequence of frames → sequence of labels
J. Ba, V. Mnih, K. Kavukcuoglu. Multiple Object Recognition with Visual Attention
We process the sequence of vectors $x$ with one and the same function, with the same parameters $W$, at every time step:

$$ h_t = f_W(h_{t-1}, x_t) $$

$f_W$ — a function parameterized by $W$
$x_t$ — next input vector
$h_t$ — hidden state
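Read as code, this recurrence is just a loop that applies one step function with fixed parameters to the whole sequence. A minimal sketch (`rnn_forward` and `step_fn` are placeholder names, not from the notebook):

```python
def rnn_forward(step_fn, x_seq, h0):
    """Apply the same step function f_W to every element of the sequence."""
    h = h0
    hidden_states = []
    for x_t in x_seq:           # x_seq holds the input vectors x_1, ..., x_n
        h = step_fn(h, x_t)     # h_t = f_W(h_{t-1}, x_t), same parameters each step
        hidden_states.append(h)
    return hidden_states
```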
Question: What function can we take as $f_W$?
As the function $f_W$ we take a linear transformation followed by a component-wise non-linearity (here $\tanh$):

$$ \begin{align*} h_t &= \tanh ({\color{orange}W_{hh}} h_{t-1} + {\color{orange}W_{xh}} x_t) \\ y_t &= {\color{orange}W_{hy}} h_t \end{align*} $$

The entire dictionary is the four characters [h, e, l, o], and the word "hello" is the training example:
A softmax is also applied to the output-layer values $y_t$ to turn them into probabilities, from which the loss is computed.
numpy implementation by Karpathy
Let's get to grips with the code in the Jupyter notebook!
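Before diving into the notebook, here is a minimal numpy sketch of one forward step with the softmax loss, in the spirit of Karpathy's implementation but not his actual code (weight shapes, names and initialization are assumptions for the example):

```python
import numpy as np

vocab_size, hidden_size = 4, 8                        # dictionary [h, e, l, o]
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(vocab_size, hidden_size) * 0.01

def step(h_prev, x_t, target_idx):
    """One time step: update the hidden state, emit scores, softmax cross-entropy loss."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t                                  # unnormalized scores over the characters
    p_t = np.exp(y_t) / np.sum(np.exp(y_t))           # softmax
    loss = -np.log(p_t[target_idx])                   # cross-entropy for the true next character
    return h_t, p_t, loss

# "hello": input 'h' (index 0), target 'e' (index 1)
x = np.zeros(vocab_size); x[0] = 1.0                  # one-hot encoding of 'h'
h, p, loss = step(np.zeros(hidden_size), x, target_idx=1)
```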
$\odot$ — component-wise product
The network should be able to remember context for a long time. Which context? The network learns that itself. For this, a vector \(c_t\) is introduced: the cell state of the network at time \(t\).
\(c_t^\prime = \tanh({\color{orange}W_{xc}}x_t + {\color{orange}W_{hc}}h_{t-1} + {\color{orange}b_{c^\prime}})\) | candidate cell state |
\(i_t = \sigma({\color{orange}W_{xi}}x_t + {\color{orange}W_{hi}}h_{t-1} + {\color{orange}b_{i}})\) | input gate |
\(f_t = \sigma({\color{orange}W_{xf}}x_t + {\color{orange}W_{hf}}h_{t-1} + {\color{orange}b_{f}})\) | forget gate |
\(o_t = \sigma({\color{orange}W_{xo}}x_t + {\color{orange}W_{ho}}h_{t-1} + {\color{orange}b_{o}})\) | output gate |
\(c_t = f_t \odot c_{t-1} + i_t \odot c_t^\prime\) | cell state |
\(h_t = o_t \odot \tanh(c_t)\) | block output |
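A minimal numpy sketch of one LSTM step that follows these formulas directly (the weight dictionary, its keys and the initialization are assumptions made for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; keys 'xc', 'hc', 'xi', ... mirror W_{xc}, W_{hc}, W_{xi}, ... above."""
    c_cand = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # candidate cell state
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])      # input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])      # forget gate
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])      # output gate
    c_t = f_t * c_prev + i_t * c_cand                             # ⊙ is elementwise '*'
    h_t = o_t * np.tanh(c_t)                                      # block output
    return h_t, c_t

n_in, n_hid = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(0.0, 0.01, (n_hid, n_in if k.startswith('x') else n_hid))
     for k in ('xc', 'hc', 'xi', 'hi', 'xf', 'hf', 'xo', 'ho')}
b = {k: np.zeros(n_hid) for k in ('c', 'i', 'f', 'o')}
h_t, c_t = lstm_step(np.zeros(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```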
In the GRU only \( h_t \) is used; no separate vector \( c_t \) is introduced. An update gate replaces the input and forget gates, and a reset gate determines how much of the memory from the previous step to carry forward (see the equations below).
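For comparison with the LSTM block, the standard GRU equations written in the same notation (a reference formulation; the sign convention for the update gate \(z_t\) varies between papers):

\(r_t = \sigma({\color{orange}W_{xr}}x_t + {\color{orange}W_{hr}}h_{t-1} + {\color{orange}b_{r}})\) | reset gate |
\(z_t = \sigma({\color{orange}W_{xz}}x_t + {\color{orange}W_{hz}}h_{t-1} + {\color{orange}b_{z}})\) | update gate |
\(h_t^\prime = \tanh({\color{orange}W_{xh}}x_t + {\color{orange}W_{hh}}(r_t \odot h_{t-1}) + {\color{orange}b_{h^\prime}})\) | candidate state |
\(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot h_t^\prime\) | block output |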
As the function \( f_W \) we again take a linear transformation followed by a component-wise non-linearity (\(\tanh\)):

$$ \begin{align*} h_t &= \tanh ({\color{orange}W_{hh}} h_{t-1} + {\color{orange}W_{xh}} x_t) \\ y_t &= {\color{orange}W_{hy}} h_t \end{align*} $$

\(X = (x_1, \dots, x_n)\) — input sequence
\(Y = (y_1, \dots, y_m)\) — output sequence
\(\color{green}c \equiv h_n\) encodes all the information about \(X\) needed to synthesize \(Y\)
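A minimal sketch of this encoder-decoder scheme: the encoder compresses \(X\) into \(c = h_n\), and the decoder unrolls from \(c\) to produce \(Y\). The step functions, the greedy feedback loop and `max_len` are placeholders for the example, not a specific library API:

```python
def encode(x_seq, h0, enc_step):
    """Run the encoder RNN over X = (x_1, ..., x_n); return c = h_n."""
    h = h0
    for x_t in x_seq:
        h = enc_step(h, x_t)
    return h                                  # c ≡ h_n, the context vector

def decode(c, start_token, dec_step, readout, max_len):
    """Unroll the decoder from the context c, feeding its own outputs back in (greedy)."""
    h, y_t, outputs = c, start_token, []
    for _ in range(max_len):
        h = dec_step(h, y_t)
        y_t = readout(h)                      # next output symbol y_t
        outputs.append(y_t)
    return outputs                            # Y = (y_1, ..., y_m)
```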
What else can you look at?