# [Review] Sequence to Sequence Learning with Neural Networks

## Abstract

Deep neural networks (DNNs) work well when large labeled training sets are available, but they cannot directly map sequences to sequences.
In this paper, the authors present a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure.
They evaluate it on the WMT’14 English-to-French translation task.
Finally, they found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
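
The reversal trick is purely a preprocessing step, so it can be illustrated in a few lines. The sketch below is a minimal illustration, not the authors' code; the function name `reverse_source` and the toy sentence pair are made up for this example.

```python
def reverse_source(pairs):
    """Reverse the token order of every source sentence; targets stay as they are."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]


# Toy pair: the source "a b c" is fed to the encoder as "c b a",
# while the target is still produced in its normal order.
pairs = [(["a", "b", "c"], ["α", "β", "γ"])]
print(reverse_source(pairs))
```

After the reversal, the first source word ends up right next to the first target word, which is where the additional short-term dependencies come from.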

## Content

1. Introduction
2. The model
3. Experiments
    1. Dataset details
    2. Decoding and Rescoring
    3. Reversing the Source Sentences
    4. Training details
    5. Parallelization
    6. Experimental Results
4. Related Work
5. Conclusion

## 1. Introduction

Using LSTMs, they built a model for machine translation.
An encoder first reads the source sentence to build a "thought" vector, a sequence of numbers that represents the sentence meaning; a decoder then processes this vector to emit a translation.
This is often referred to as the encoder-decoder architecture.

## Thought Vectors

This vector has been called, by various people, an "embedding", a "representational vector" or a "latent vector". But Geoff Hinton, in a stroke of marketing genius, gave it the name "thought vector".

## 2. The model

Their model is based on the one introduced by Alex Graves, with the following components:

1. Two different LSTMs (encoder, decoder)
2. Four LSTM layers (deep RNN)
3. Feeding reversed source sentences
4. Beam search at the output layer

To understand their architecture in more depth, let's have a look at his paper.

* Title: Generating Sequences With Recurrent Neural Networks
* Author: Alex Graves
* Date: 5/6/2014

## Abstract

The paper shows how Long Short-Term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text and online handwriting.

## RNN Prediction Architecture


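* Input: $X = (x_1, x_2 , ... , x_T)$
* Output: $Y = (y_1, y_2 , ... , y_T)$
* Hidden: $h^n = (h^n_1, h^n_2, ... , h^n_T)$ for layer $n$
* Parameterization: each output vector $y_t$ parameterizes the predictive distribution $Pr(x_{t+1}|y_t)$
* At the first step ($t=1$), $x_1$ is a null (all-zero) vector, so the network emits its prediction for $x_2$ with no prior information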

## Skip Connections (from Input to all hidden layers and all hidden layers to outputs)

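Input to Hidden skip connections

$$
h^n_t = H(W^{x h^n}x_t + W^{h^{n-1}h^n}h^{n-1}_{t} + W^{h^n h^n}h^{n}_{t-1} + b^n_h)
$$

Here $H$ is the hidden layer function (an LSTM in the paper). For the first layer ($n=1$) there is no layer below, so the $h^{n-1}_t$ term is dropped.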
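
To make the data flow of the input-to-hidden skip connections concrete, here is a minimal NumPy sketch of one time step. It assumes a plain `tanh` as a stand-in for $H$ (the paper uses LSTM cells), and the function name, argument layout and toy sizes are illustrative choices, not taken from the paper.

```python
import numpy as np


def stacked_rnn_step(x_t, h_prev, W_in, W_up, W_rec, b):
    """One time step of an N-layer RNN with input-to-hidden skip connections.

    x_t:    input vector at time t, shape (D,)
    h_prev: list of N hidden vectors from time t-1, each of shape (H,)
    W_in:   list of N matrices (H, D); every layer sees x_t directly (the skip)
    W_up:   list of N matrices (H, H) from the layer below; W_up[0] is unused
    W_rec:  list of N recurrent matrices (H, H)
    b:      list of N bias vectors (H,)
    """
    h_t = []
    for n in range(len(h_prev)):
        pre = W_in[n] @ x_t + W_rec[n] @ h_prev[n] + b[n]
        if n > 0:
            # Layers above the first also receive the layer below at the same time step.
            pre = pre + W_up[n] @ h_t[n - 1]
        h_t.append(np.tanh(pre))  # tanh is an assumption standing in for H
    return h_t


rng = np.random.default_rng(0)
D, H, N = 3, 4, 2  # toy sizes, chosen only for illustration
x_t = rng.normal(size=D)
h_prev = [np.zeros(H) for _ in range(N)]
W_in = [rng.normal(size=(H, D)) for _ in range(N)]
W_up = [None] + [rng.normal(size=(H, H)) for _ in range(N - 1)]
W_rec = [rng.normal(size=(H, H)) for _ in range(N)]
b = [np.zeros(H) for _ in range(N)]

h_t = stacked_rnn_step(x_t, h_prev, W_in, W_up, W_rec, b)
print([h.shape for h in h_t])
```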


Hidden to Output skip connections

* Cost function: negative log-likelihood
* Output layer ($y^k_t$): softmax

$$
\begin{aligned}
\hat y_t &= b_y + \sum ^N_{n=1} W^{h^ny}h^n_t\\
y_t &= Y(\hat y_t)\\
Pr(x) &= \prod^T_{t=1} Pr(x_{t+1}|y_t)\\
L(x) &= - \sum ^T_{t=1} \log Pr(x_{t+1}|y_t)
\end{aligned}
$$
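
The output layer and the sequence loss can be sketched in NumPy as follows. This is a minimal illustration under the assumption that the output function $Y$ is a softmax over $K$ classes; the function name `sequence_nll` and the array shapes are choices made for this example.

```python
import numpy as np


def sequence_nll(h_layers, W_out, b_y, targets):
    """Negative log-likelihood of a sequence under hidden-to-output skip connections.

    h_layers: list of N arrays (T, H), the hidden states of each layer
    W_out:    list of N arrays (K, H), one output matrix per layer
    b_y:      output bias, shape (K,)
    targets:  integer array of length T, the index of x_{t+1} at each step
    """
    # Every hidden layer contributes directly to the output pre-activations.
    logits = b_y + sum(h @ W.T for h, W in zip(h_layers, W_out))  # (T, K)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()


rng = np.random.default_rng(0)
T, H, K, N = 5, 3, 4, 2  # toy sizes for illustration only
h_layers = [rng.normal(size=(T, H)) for _ in range(N)]
W_out = [rng.normal(size=(K, H)) for _ in range(N)]
targets = rng.integers(0, K, size=T)
print(sequence_nll(h_layers, W_out, np.zeros(K), targets))
```

With all weights set to zero the prediction is uniform, so the loss reduces to $T \log K$, which is a handy sanity check.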


## Prediction Network

$$
Pr(x_{t+1} = k|y_t) = y^k_t = \frac{\exp(\hat y^k_t)}{\sum^K_{k'=1} \exp(\hat y^{k'}_t)}
$$

Hence

$$
L(x) = - \sum ^T_{t=1} \log y^{x_{t+1}}_t \quad \Rightarrow \quad \frac{\partial L(x)}{\partial \hat y^k_t} = y^k_t - \delta_{k, x_{t+1}}
$$
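
The gradient of the loss with respect to the softmax pre-activations (softmax output minus a one-hot target) can be checked numerically for a single time step. The following sketch compares the closed form against central finite differences; the helper names and the example vector `z` are made up for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()


def nll(z, target):
    """Negative log-probability of the target class for one time step."""
    return -np.log(softmax(z)[target])


def analytic_grad(z, target):
    """Closed form: softmax output minus the one-hot target."""
    g = softmax(z)
    g[target] -= 1.0
    return g


def numeric_grad(z, target, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        d = np.zeros_like(z)
        d[k] = eps
        g[k] = (nll(z + d, target) - nll(z - d, target)) / (2 * eps)
    return g


z = np.array([0.5, -1.0, 2.0])
print(analytic_grad(z, target=2))
print(numeric_grad(z, target=2))
```

The two printed vectors should agree up to finite-difference error.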


The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time, and the network can then be trained with stochastic gradient descent.

## 3.1 Dataset Details

They used the WMT’14 English-to-French MT task. The authors explain the choice as follows:

"We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT."

* SMT: Statistical Machine Translation

## 3.2 Decoding and Rescoring

Training maximizes the average log probability of a correct translation $T$ given the source sentence $S$, taken over the training set $\mathcal{S}$.

$$
\frac{1}{|\mathcal{S}|}\sum_{(T,S) \in \mathcal{S}} \log p(T|S)
$$

Once training finishes, a translation is produced by finding the most likely translation according to the model.

$$
\hat T = \arg\max_T \; p(T|S)
$$

The most likely translation is searched for with a left-to-right beam search.
The beam size was tested as well: a beam of size 2 already provides most of the benefit of beam search, and the system performs well even with a beam of size 1.
* What is Beam Search
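
Beam search keeps a small number of partial translations (the beam), extends each of them by one word at every step, and retains only the highest-scoring candidates. A hypothesis that ends with the end-of-sentence symbol is moved out of the beam into the set of complete hypotheses. The sketch below is a minimal illustration of that idea, not the authors' decoder: the `step_log_probs` callback and the tiny lookup-table "model" are hypothetical stand-ins for the LSTM.

```python
import math


def beam_search(step_log_probs, beam_size, eos, max_len):
    """Left-to-right beam search.

    step_log_probs(prefix) returns a dict mapping each possible next token
    to its log probability given the prefix (a hypothetical interface).
    """
    beams = [((), 0.0)]  # (prefix, cumulative log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))  # complete hypothesis leaves the beam
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


def toy_model(prefix):
    """Made-up next-token distribution in which the greedy first choice is a trap."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


for size in (1, 2):
    tokens, score = beam_search(toy_model, size, "<eos>", max_len=5)
    print(size, tokens, round(math.exp(score), 2))
```

In this toy example a beam of size 1 commits to "a" and ends with a sequence of probability 0.3, while a beam of size 2 keeps "b" alive and finds the sequence with probability 0.36.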

## 3.4 Training Details

* Layers: 4
* Cells per layer: 1,000
* Input vocabulary: 160,000 words
* Output vocabulary: 80,000 words (softmax over 80,000 words)
* Number of parameters: 384M
* Weight initialisation: uniform distribution between -0.08 and 0.08
* Batch size: 128
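
The 384M figure can be roughly reproduced with back-of-the-envelope arithmetic. The breakdown below is an assumption made for this note, not a table from the paper: it assumes 1,000-dimensional word embeddings on both sides, four gates per LSTM layer with one input and one recurrent matrix each, and it ignores biases.

```python
cells = 1000          # cells per layer (also the assumed embedding size)
layers = 4
src_vocab = 160_000
tgt_vocab = 80_000

# Each LSTM layer: 4 gates, each with an input matrix and a recurrent matrix.
per_layer = 4 * (cells * cells + cells * cells)
lstm_weights = 2 * layers * per_layer            # encoder + decoder

embeddings = (src_vocab + tgt_vocab) * cells     # input word embeddings on both sides
softmax_weights = cells * tgt_vocab              # output projection

total = lstm_weights + embeddings + softmax_weights
print(f"LSTM: {lstm_weights / 1e6:.0f}M, embeddings: {embeddings / 1e6:.0f}M, "
      f"softmax: {softmax_weights / 1e6:.0f}M, total: {total / 1e6:.0f}M")
```

Under these assumptions the vocabulary-dependent matrices dominate: only about a sixth of the parameters sit inside the LSTM layers themselves.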

Gradient clipping: for each training batch, compute $s = \|g\|_2$, where $g$ is the gradient divided by 128 (the batch size). If $s > 5$, set

$$
g = \frac{5g}{s}
$$
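
The clipping rule rescales the gradient so that its norm never exceeds the threshold while its direction is kept. A minimal NumPy sketch, with a made-up function name and the gradient flattened into one vector for simplicity:

```python
import numpy as np


def clip_gradient(grad_sum, batch_size=128, threshold=5.0):
    """Scale the batch-averaged gradient down if its L2 norm exceeds the threshold."""
    g = grad_sum / batch_size
    s = np.linalg.norm(g)
    if s > threshold:
        g = threshold * g / s
    return g


large = np.full(4, 1280.0)          # averaged gradient has norm 20, so it gets clipped
small = np.array([128.0, 0.0])      # averaged gradient has norm 1, left unchanged
print(clip_gradient(large), np.linalg.norm(clip_gradient(large)))
print(clip_gradient(small))
```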


## 3.5 Parallelisation

A C++ implementation of the deep LSTM with the configuration from the previous section processed approximately 1,700 words per second on a single GPU. This was too slow for the authors' purposes, so they parallelized the model across an 8-GPU machine. Each of the 4 LSTM layers was executed on a different GPU and communicated its activations to the next GPU / layer as soon as they were computed. The remaining 4 GPUs were used to parallelize the softmax, so each of them was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation achieved a speed of 6,300 words per second (counting both English and French words) with a minibatch size of 128. Training took about ten days with this implementation.
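
Splitting the softmax across 4 GPUs amounts to splitting the 1000 × 80000 output matrix column-wise into four 1000 × 20000 blocks, multiplying each block separately, and concatenating the results. The NumPy sketch below only illustrates that the sharded product equals the full one; it runs on the CPU in a single process, uses toy sizes, and the function name is made up.

```python
import numpy as np


def sharded_logits(h, W, num_shards=4):
    """Compute h @ W by splitting W column-wise, as if each shard lived on its own device.

    h: (batch, hidden), W: (hidden, vocab); vocab must be divisible by num_shards.
    """
    shards = np.split(W, num_shards, axis=1)
    return np.concatenate([h @ W_shard for W_shard in shards], axis=1)


rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))    # toy stand-in for a (128, 1000) batch of top-layer states
W = rng.normal(size=(8, 20))   # toy stand-in for the (1000, 80000) softmax matrix
print(np.allclose(sharded_logits(h, W), h @ W))
print(80_000 // 4)             # vocabulary columns handled by each softmax GPU
```

Going from 1,700 to 6,300 words per second is a speedup of roughly 3.7× from 8 GPUs.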

See sec 3.3.

## 4. Related Work

RNNLM (recurrent neural network language model)

* T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.

NNLM

* M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent neural networks. In EMNLP, 2013.
* J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network joint models for statistical machine translation. In ACL, 2014.

End-to-End Training

* K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In ICLR, 2014.

## 5. Conclusion

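They conclude that this simple approach is promising: given enough training data, it will likely do well on other challenging sequence-to-sequence problems, which leaves plenty of room for follow-up work.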