LSTM overview
Formulae
Note that some of the notation below differs from that used in the figure above.
Three gates
input $\boldsymbol{i}_t$, forget $\boldsymbol{f}_t$, output $\boldsymbol{o}_t $
The basic form of a gate signal is:
$$ \sigma(\boldsymbol{W}_{*}\boldsymbol{h}_{t - 1} + \boldsymbol{U}_{*}x_t + \boldsymbol{b}_{*}) $$
that is, activation function(recurrent weight × past output + input weight × current input + bias).
$$
\begin{aligned}
\boldsymbol{i}_t &= \sigma(\boldsymbol{W}_i\boldsymbol{h}_{t - 1} + \boldsymbol{U}_{i}x_t + \boldsymbol{b}_i) \\
\boldsymbol{f}_t &= \sigma(\boldsymbol{W}_f\boldsymbol{h}_{t - 1} + \boldsymbol{U}_{f}x_t + \boldsymbol{b}_f) \\
\boldsymbol{o}_t &= \sigma(\boldsymbol{W}_o\boldsymbol{h}_{t - 1} + \boldsymbol{U}_{o}x_t + \boldsymbol{b}_o)
\end{aligned}
$$
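As a rough illustration of the gate form above, the following is a minimal sketch in plain Python. The helper names (`sigmoid`, `matvec`, `gate`) and the toy sizes and weights are assumptions made for this example only; they do not come from any particular library.

```python
import math

def sigmoid(z):
    """Logistic function, maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate(W, U, b, h_prev, x):
    """sigma(W h_{t-1} + U x_t + b), applied elementwise."""
    Wh = matvec(W, h_prev)
    Ux = matvec(U, x)
    return [sigmoid(wh + ux + bi) for wh, ux, bi in zip(Wh, Ux, b)]

# Hypothetical toy setup: hidden size 2, input size 3.
h_prev = [0.5, -0.5]
x_t = [1.0, 0.0, -1.0]
W_i = [[0.1, 0.2], [0.3, 0.4]]              # recurrent weights
U_i = [[0.1, 0.0, -0.1], [0.2, 0.1, 0.0]]   # input weights
b_i = [0.0, 0.0]                            # bias

i_t = gate(W_i, U_i, b_i, h_prev, x_t)      # input gate, one value per hidden unit
```

The forget and output gates would reuse the same `gate` helper with their own `W`, `U`, and `b`.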
Activation function explained (briefly)
Because the sigmoid output satisfies $\sigma(\cdot) \in (0, 1)$, each gate controls how much of the information it receives is passed on to the next step of the computation.
$$
\begin{aligned}
\sigma \to 1 &\Rightarrow \text{info is fully "preserved"} \\
\sigma \to 0 &\Rightarrow \text{info is completely "discarded"}
\end{aligned}
$$
In contrast, $\tanh$ is used because its output satisfies $\tanh(\cdot) \in (-1, +1)$: it rescales an incoming signal into that range while keeping its sign.
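To make the two ranges concrete, here is a small sketch that evaluates both activations on a few arbitrary pre-activation values. The sample values are assumptions chosen for illustration, and the two-branch `sigmoid` is just one common way to avoid overflow in `math.exp`.

```python
import math

def sigmoid(z):
    """Logistic function; the two branches avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

samples = [-10.0, -1.0, 0.0, 1.0, 10.0]       # arbitrary pre-activation values
gate_values = [sigmoid(z) for z in samples]   # each lies in (0, 1)
squashed = [math.tanh(z) for z in samples]    # each lies in (-1, 1)

for z, g, s in zip(samples, gate_values, squashed):
    print(f"z={z:6.1f}  sigmoid={g:.4f}  tanh={s:.4f}")
```

Large negative inputs push the sigmoid toward 0 (gate closed) and large positive inputs push it toward 1 (gate open), while $\tanh$ stays centred on 0.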
Cell state
- candidate cell state $\tilde{\boldsymbol{c}}_{t}$
$$ \tilde{\boldsymbol{c}}_{t} = \tanh(\boldsymbol{W}_c\boldsymbol{h}_{t - 1} + \boldsymbol{U}_{c}x_t + \boldsymbol{b}_c) $$
- cell state $\boldsymbol{c}_t$
$$ \boldsymbol{c}_t = \boldsymbol{i}_t \odot \tilde{\boldsymbol{c}}_{t} + \boldsymbol{f}_t \odot \boldsymbol{c}_{t - 1} $$
that is, "how much of the new candidate is taken in" + "how much of the past cell state is inherited rather than forgotten".
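The cell state update can be sketched as follows. The gate values here are hypothetical and set to exactly 0 or 1 only to show the two extremes; real sigmoid outputs fall strictly between them. The helper names `hadamard` and `update_cell` are made up for this example.

```python
def hadamard(a, b):
    """Elementwise (Hadamard) product of two equal-length vectors."""
    return [p * q for p, q in zip(a, b)]

def update_cell(i_t, f_t, c_tilde, c_prev):
    """c_t = i_t * c_tilde + f_t * c_prev, all products elementwise."""
    taken_in = hadamard(i_t, c_tilde)    # share of the new candidate
    inherited = hadamard(f_t, c_prev)    # share of the old cell state
    return [a + b for a, b in zip(taken_in, inherited)]

# Hypothetical values for a cell with two hidden units.
c_prev = [2.0, 2.0]
c_tilde = [0.5, 0.5]
i_t = [0.0, 1.0]   # unit 0 ignores the candidate, unit 1 accepts it fully
f_t = [1.0, 0.0]   # unit 0 keeps the old state, unit 1 forgets it

c_t = update_cell(i_t, f_t, c_tilde, c_prev)
print(c_t)  # [2.0, 0.5]
```

Unit 0 carries its previous state through unchanged, while unit 1 is overwritten by the candidate.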
Output
The final output $\boldsymbol{h}_t$ ($z_{t, j}$ in the figure) is used both as the next step's "past output" (the recurrent connection) and as the "new input" to the next node.
$$ \boldsymbol{h}_t = \boldsymbol{o}_t \odot \tanh(\boldsymbol{c}_t)$$
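Putting the formulae of this section together, one LSTM time step might look like the sketch below. It is written in plain Python for readability, not speed. The function names (`affine`, `lstm_step`, `toy_weights`), the `params` dictionary layout, and all weight values are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, h_prev, x):
    """W h_{t-1} + U x_t + b for matrices stored as lists of rows."""
    return [
        sum(w * h for w, h in zip(w_row, h_prev))
        + sum(u * xi for u, xi in zip(u_row, x))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM time step; params maps 'i', 'f', 'o', 'c' to (W, U, b)."""
    i_t = [sigmoid(z) for z in affine(*params["i"], h_prev, x)]        # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], h_prev, x)]        # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], h_prev, x)]        # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], h_prev, x)]  # candidate
    c_t = [i * ct + f * cp for i, ct, f, cp in zip(i_t, c_tilde, f_t, c_prev)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def toy_weights():
    """Hypothetical parameters: hidden size 2, input size 1, all weights 0.1."""
    W = [[0.1, 0.1], [0.1, 0.1]]
    U = [[0.1], [0.1]]
    b = [0.0, 0.0]
    return (W, U, b)

params = {name: toy_weights() for name in ("i", "f", "o", "c")}

# Run three time steps, feeding h and c back in as the "past" values.
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(params, h, c, x_t)
```

Note that `h` plays both roles described above: it is passed back into `lstm_step` as `h_prev`, and it is also what a following layer would receive as its input.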