Sequence to Sequence Learning with Neural Networks
This document discusses sequence to sequence learning with neural networks. It summarizes a seminal paper that introduced a simple approach using LSTM neural networks to map sequences to sequences. The approach uses two LSTMs - an encoder LSTM to map the input sequence to a fixed-dimensional vector, and a decoder LSTM to decode the target sequence from that vector. The paper achieved results close to the state of the art on English to French machine translation, showing the potential of simple neural models for sequence learning tasks.
Introduction to Sequence to Sequence Learning with Neural Networks, presented by Quang Nguyen.
ML algorithms typically assume i.i.d. data. Sequential data, as in machine translation, has strongly correlated successive points, which challenges conventional modeling.
Introduces using LSTMs for sequence to sequence mapping and highlights achievements in translation tasks.
RNNs utilize feedback loops for time steps, but struggle with long-term dependencies, which LSTMs address effectively. LSTMs benefit from flexibility in mapping sequences, improving long-term dependency issues in RNNs. Discusses prior research on LSTM, including encoder-decoder structures that improve translation and sequence generation.
Describes a method for training LSTMs using a simple architecture featuring an encoder and decoder.
Details a comprehensive experiment with significant datasets, parameters, and configuration that outperformed previous systems.
Explains methods for parallelizing the computational process to improve model training efficiency. Highlights the importance of fixed-dimensional vectors in managing sequence input and output effectively.
Showcases BLEU scores across models and discusses improvements in performance on translation tasks.
Reinforces the effectiveness of LSTM in machine translation tasks, summarizing findings and acknowledging future research requirements.
Sequence to Sequence Learning with Neural Networks
1.
Sequence to Sequence Learning
with Neural Networks
2017.04.23
Presented by Quang Nguyen
Vietnam Development Center (VDC)
Ilya Sutskever, Oriol Vinyals, Quoc V. Le - Google
2.
Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data
But many interesting data types are not i.i.d.
In particular, the successive points in sequential data are strongly correlated
Sequence (To Sequence) Learning
Sequence Learning is the study of ML
algorithms designed for sequential data
Limitation of DNNs
Can only map fixed-size vectors to fixed-size vectors
Sequential data has unknown, variable length
DNN Issue with Sequence Learning
5.
Paper Abstract
WHAT IS THE PROBLEM?
Mapping variable-length sequences to sequences, which standard DNNs cannot do
WHY IMPORTANT?
The same approach can be applied to many other sequence learning problems
HOW TO SOLVE?
Apply a simple DNN approach to map sequences to sequences
Use two main steps:
- A deep LSTM maps the input sequence to a vector of fixed dimensionality
- Another deep LSTM decodes the target sequence from that vector
ACHIEVEMENTS
- Close to the winning result of the WMT’14 EN -> FR SMT task (BLEU: 34.8 vs. 37)
- With further improvements it can beat that result
6.
Just a neural network with a feedback loop
The previous time step's hidden layer and final output are fed back into the network
Recurrent Neural Network
[Figure: an RNN unrolled over time. Starting from <start>, each step reads one character of "h e l l o" and predicts the next, ending at <end>; the hidden state is passed from one time step to the next.]
ISSUES:
- One-to-one input-output mapping (see the sketch below)
- Trouble with "long-term dependencies"
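To make the one-to-one mapping concrete, here is a minimal sketch (not from the paper; PyTorch is assumed, and the tiny character vocabulary is illustrative): an RNN that reads one character per step and predicts the next one, exactly the pattern in the figure above.

# Minimal sketch: a character-level RNN with a strict one-to-one
# input/output mapping -- one prediction per input symbol.
import torch
import torch.nn as nn

chars = ["<start>", "h", "e", "l", "o", "<end>"]
stoi = {c: i for i, c in enumerate(chars)}

embed = nn.Embedding(len(chars), 16)             # character embeddings
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, len(chars))                 # next-character logits

# Input "<start> h e l l o", targets "h e l l o <end>" (shifted by one step).
inp = torch.tensor([[stoi[c] for c in ["<start>", "h", "e", "l", "l", "o"]]])
tgt = torch.tensor([[stoi[c] for c in ["h", "e", "l", "l", "o", "<end>"]]])

out, _ = rnn(embed(inp))                         # one hidden state per time step
logits = head(out)                               # one prediction per input symbol
loss = nn.functional.cross_entropy(logits.view(-1, len(chars)), tgt.view(-1))
print(loss.item())

Because every input position must produce exactly one output, this setup cannot by itself map a sentence to a translation of a different length.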
7.
Same modelling power
Overcomes the issues of RNNs with "long-term dependencies"
RNNs overwrite the hidden state
LSTMs add to the hidden state
Long Short-Term Memory (LSTM)
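The two update rules on this slide can be written side by side. A minimal sketch using the standard LSTM gate equations (the dimensions are illustrative assumptions, not the paper's configuration):

# Contrast of the state updates: an RNN overwrites its hidden state,
# an LSTM updates its cell state additively through gates.
import torch

d = 8                                            # state dimensionality (illustrative)
x, h, c = torch.randn(d), torch.zeros(d), torch.zeros(d)
Wx, Wh = torch.randn(4 * d, d), torch.randn(4 * d, d)   # gate weights packed as i, f, g, o
b = torch.zeros(4 * d)

# Vanilla RNN: the hidden state is simply overwritten at every step.
h_rnn = torch.tanh(Wx[:d] @ x + Wh[:d] @ h)

# LSTM: gated, additive update of the cell state, which is what eases
# learning of long-term dependencies.
z = Wx @ x + Wh @ h + b
i, f = torch.sigmoid(z[:d]), torch.sigmoid(z[d:2 * d])
g, o = torch.tanh(z[2 * d:3 * d]), torch.sigmoid(z[3 * d:])
c = f * c + i * g                                # add new content instead of overwriting
h = o * torch.tanh(c)
print(h_rnn.shape, h.shape, c.shape)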
8.
RNNs have a one-to-one mapping between the inputs and the outputs
LSTMs can map one to many (one image to many words in a caption), many to many (translation), or many to one (classifying a voice)
Long Short-Term Memory (LSTM)
9.
Kalchbrenner and Blunsom were the first to map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model)
Related Work
source -> target translation through a sentence vector: generalization and generation
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
10.
Proposed Approach
Related Work
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
11.
Source Sentence Model using CSM
Example of CSM
Related Work
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
Does not preserve word ordering
12.
K. Cho et al. propose a novel neural network model called the RNN Encoder-Decoder, which consists of two recurrent neural networks (RNNs).
D. Bahdanau et al. later improve on K. Cho's work
Encoder:
Input: A variable-length sequence x
Output: A fixed-length vector representation c
Decoder:
Input: A given fixed-length vector representation c
Output: A variable-length sequence y
Related Work
K.Cho, B.Merrienboer, C.Gulcehre, F.Bougares, H.Schwenk,and Y.Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078, 2014.
D.Bahdanau, K.Cho, and Y.Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
13.
Trained jointly to maximize the conditional log-likelihood (written out after the references below)
Usage:
Generate an output sequence given an input sequence
Score a given pair of input and output sequences
Related Work
K.Cho, B.Merrienboer, C.Gulcehre, F.Bougares, H.Schwenk,and Y.Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078, 2014.
D.Bahdanau, K.Cho, and Y.Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
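For reference, the joint training objective mentioned above can be written out; this is the usual formulation in the cited encoder-decoder papers, with (x_n, y_n) a source/target sentence pair and c the encoder's fixed-length vector:

% Conditional log-likelihood objective of the RNN Encoder-Decoder
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\!\left(y_n \mid x_n\right),
\qquad
p_{\theta}(y \mid x) \;=\; \prod_{t=1}^{|y|} p_{\theta}\!\left(y_t \mid c,\, y_1, \dots, y_{t-1}\right)

Generation samples (or beam-searches) each y_t from these conditionals; scoring simply evaluates the product for a given pair.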
14.
A. Graves introduces a novel approach: use LSTMs to generate complex sequences with long-range structure, simply by predicting one data point at a time.
Deep recurrent LSTM net with skip connections
Inputs arrive one at a time; outputs determine the predictive distribution over the next input
Trained by minimising log-loss
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
Predicting handwriting
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
17.
Kalchbrenner and Blunsom were the first to map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model).
K. Cho et al. propose the RNN Encoder-Decoder for rescoring pairs of source/target sentences.
A. Graves uses LSTMs to generate complex sequences simply by predicting one data point at a time.
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
18.
The LSTM first reads the input sequence,
and then produces the output sequence (a minimal sketch follows the figure below)
Paper’s main ideas
[Figure: the model unrolled over time. It reads the input "A B C <eos>", then generates the output "W X Y Z <eos>", with each generated token fed back in as the next input.]
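A minimal sketch of this idea (assuming PyTorch; the vocabulary, token ids, and sizes are illustrative, not the paper's 4-layer, 1000-cell configuration): one LSTM encodes the source into its final state, and a second LSTM, initialized with that state, is trained to produce the target one token at a time.

# Minimal encoder-decoder sketch of the paper's main idea (teacher forcing).
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hid = 10, 10, 16, 32
enc_emb, dec_emb = nn.Embedding(src_vocab, emb), nn.Embedding(tgt_vocab, emb)
encoder = nn.LSTM(emb, hid, batch_first=True)
decoder = nn.LSTM(emb, hid, batch_first=True)
proj = nn.Linear(hid, tgt_vocab)

src = torch.tensor([[1, 2, 3, 0]])               # "A B C <eos>"
tgt_in = torch.tensor([[0, 4, 5, 6, 7]])         # "<eos> W X Y Z" (decoder input)
tgt_out = torch.tensor([[4, 5, 6, 7, 0]])        # "W X Y Z <eos>" (targets)

_, state = encoder(enc_emb(src))                 # fixed-dimensional summary (h, c)
dec_out, _ = decoder(dec_emb(tgt_in), state)     # decoder conditioned on that summary
logits = proj(dec_out)
loss = nn.functional.cross_entropy(logits.view(-1, tgt_vocab), tgt_out.view(-1))
print(loss.item())

At inference time, decoding would start from <eos> and feed each predicted token back in until another <eos> is produced, as in the figure above.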
19.
A simple, even unoriginal, model can achieve good results with larger and deeper neural networks
Two different LSTMs: encoder & decoder
4-layer (deep) LSTMs
Reversal of input sentences (illustrated after this slide)
First serious attempt to directly produce translations
As opposed to rescoring an SMT system's outputs
Proof that the naive approach works
Paper's Contribution
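The input-reversal trick noted above is simple: the source sentence is fed to the encoder in reverse order, so the first source words end up close to the first target words they correspond to. A tiny illustration with a made-up sentence pair (whether the end-of-sentence token is reversed too is an implementation detail; here it stays at the end):

# Illustration of the input-reversal trick; the sentence pair is made up.
src = ["the", "cat", "sat", "<eos>"]
tgt = ["le", "chat", "s'est", "assis", "<eos>"]   # target is left unchanged
reversed_src = list(reversed(src[:-1])) + ["<eos>"]
print(reversed_src)                               # ['sat', 'cat', 'the', '<eos>']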
20.
WMT’14 English to French
348M French words
304M English words
Large Experiment
160k-word input vocabulary
80k-word output vocabulary
4 hidden layers x 1000 LSTM cells (deep LSTMs outperform shallow LSTMs)
384M parameters (a rough parameter-count check follows below)
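As a back-of-the-envelope check on that figure, the breakdown below is an assumption consistent with the slide (1000-dimensional word embeddings, an 80k-word softmax, and 8 LSTM layers of 1000 cells across encoder and decoder), with biases ignored; it is not quoted from the paper.

# Rough parameter count for the configuration on this slide (biases ignored).
emb_dim, cells, layers = 1000, 1000, 8           # 4 encoder + 4 decoder layers
in_vocab, out_vocab = 160_000, 80_000

input_embeddings = in_vocab * emb_dim            # 160M
output_embeddings = out_vocab * emb_dim          #  80M
softmax = out_vocab * cells                      #  80M
# each LSTM layer has 4 gates, each with an input and a recurrent weight matrix
recurrent = layers * 4 * (cells * cells + cells * cells)   # 64M
total = input_embeddings + output_embeddings + softmax + recurrent
print(total / 1e6)                               # 384.0 (million parameters)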
21.
Large Experiment
[Figure: the unrolled encoder-decoder, annotated with a 160k-word input vocabulary, an 80k-word output vocabulary, 4 layers x 1000 cells, and 384M parameters.]
22.
Large Experiment
[Figure: the same unrolled model, with the left half labelled "LSTMs for Encoder" and the right half "LSTMs for Decoder"; 160k-word input vocabulary, 80k-word output vocabulary, 4 layers x 1000 cells, 384M parameters.]
23.
Large Experiment
[Figure: the unrolled model with its GPU assignment: 01 GPU for each of the 4 LSTM layers and 04 GPUs for the output layer.]
24.
Parallelization
[Figure: the same GPU assignment repeated: 01 GPU per LSTM layer, 04 GPUs for the output softmax.]
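The figure's labels suggest the split used in the paper: each LSTM layer runs on its own GPU, and four more GPUs share the 80k-way softmax. A minimal, CPU-runnable sketch of the softmax split (in the real setup each block's matrix multiply would live on a separate device; here everything stays on the CPU so the snippet is self-contained):

# Splitting a large output softmax into column blocks, one per "GPU".
import torch

hidden, vocab, shards = 1000, 80_000, 4
W = torch.randn(vocab, hidden)                   # full softmax weight matrix
blocks = torch.chunk(W, shards, dim=0)           # 4 blocks of 20k output words each

h = torch.randn(hidden)                          # top-layer LSTM output for one position
partial = [blk @ h for blk in blocks]            # one 20k x 1000 matrix-vector multiply per block
logits = torch.cat(partial)                      # gather all 80k logits
probs = torch.softmax(logits, dim=0)
print(probs.shape)                               # torch.Size([80000])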
31.
Fixed-Dimensionality Vectors
[Figure: the unrolled model again, annotated with the 160k-word input vocabulary, the 80k-word output vocabulary, 4 layers x 1000 cells, and 384M parameters; the encoder's final state is the fixed-dimensional vector handed to the decoder.]
Corpus: WMT’14 English -> French
304M English words, 348M French words
70K test words
NMT surpasses the phrase-based SMT baseline for the first time
BLEU Evaluation
Method                         BLEU
Bahdanau et al.                28.45
Phrase-based SMT baseline      33.3
Implemented LSTMs              34.8
Winner of WMT’14               37
Luong et al. improved neural machine translation with a rare-word technique
First to surpass the best result achieved on the WMT’14 contest task
Follow-up for Rare Words Issue
M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing
the rare word problem in neural machine translation. In ACL.
Method                         BLEU
Bahdanau et al.                28.45
Phrase-based SMT baseline      33.3
Implemented LSTMs              34.8
Winner of WMT’14               37
NMT + rare-word technique      37.5
A general, simple LSTM-based approach to sequence learning problems
Works well for machine translation, outperforming the standard SMT-based system
Good with long sentences
Strong evidence for DNNs:
Large dataset + very big neural network = success
Conclusions
40.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
References
41.
M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. In ACL, 2015.
https://www.microsoft.com/en-us/research/video/nips-oral-session-4-ilya-sutskever/?from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fvideo%2F%3Fid%3D239083
https://deeplearning4j.org/lstm.html
References