Sequence to Sequence Learning with Neural Networks
This document discusses sequence to sequence learning with neural networks. It summarizes a seminal paper that introduced a simple approach using LSTM neural networks to map sequences to sequences. The approach uses two LSTMs - an encoder LSTM to map the input sequence to a fixed-dimensional vector, and a decoder LSTM to decode the target sequence from that vector. The paper achieved results close to the state of the art on English to French machine translation, showing the potential of simple neural models for sequence learning tasks.
Introduction to Sequence to Sequence Learning with Neural Networks, presented by Quang Nguyen.
ML algorithms typically assume i.i.d. data. Sequential data, as in machine translation, has strongly correlated successive points, which challenges conventional modeling.
Introduces using LSTMs for sequence to sequence mapping and highlights achievements in translation tasks.
RNNs utilize feedback loops for time steps, but struggle with long-term dependencies, which LSTMs address effectively. LSTMs benefit from flexibility in mapping sequences, improving long-term dependency issues in RNNs. Discusses prior research on LSTM, including encoder-decoder structures that improve translation and sequence generation.
Describes a method for training LSTMs using a simple architecture featuring an encoder and decoder.
Details a comprehensive experiment with significant datasets, parameters, and configuration that outperformed previous systems.
Explains methods for parallelizing the computational process to improve model training efficiency. Highlights the importance of fixed-dimensional vectors in managing sequence input and output effectively.
Showcases BLEU scores across models and discusses improvements in performance on translation tasks.
Reinforces the effectiveness of LSTM in machine translation tasks, summarizing findings and acknowledging future research requirements.
Sequence to Sequence Learning with Neural Networks
1.
Sequence to Sequence Learning
with Neural Networks
2017.04.23
Presented by Quang Nguyen
Vietnam Development Center (VDC)
Ilya Sutskever, Oriol Vinyals, Quoc V. Le - Google
2.
Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data
But many interesting data types are not i.i.d.
In particular, the successive points in sequential data are strongly correlated
Sequence (To Sequence) Learning
Sequence Learning is the study of ML
algorithms designed for sequential data
Limitation of DNNs
Can only map fixed-size vectors to fixed-size vectors
Sequential data has unknown, variable length
DNN Issue with Sequence Learning
5.
Paper Abstract
WHAT IS THE PROBLEM?
Mapping variable-length sequences to sequences, which standard DNNs cannot do
WHY IMPORTANT?
The same approach can be applied to many other sequence learning problems
HOW TO SOLVE?
Apply a simple DNN approach to map sequences to sequences
Use two main steps:
- A deep LSTM maps the input sequence to a vector of fixed dimensionality
- Another deep LSTM decodes the target sequence from that vector
ACHIEVEMENTS
- Close to the winning result of the WMT’14 EN -> FR SMT task (BLEU: 34.8 vs. 37)
- With further improvements it can beat that result
6.
Just a neural network with a feedback loop
The previous time step's hidden layer and final output are fed back into the network
Recurrent Neural Network
[Figure: an RNN unrolled over time. Starting from <start>, each step reads one character of "h e l l o" and predicts the next, ending at <end>; the hidden state is passed from one time step to the next.]
ISSUES:
- One-to-one input-output mapping (see the sketch below)
- Trouble with "long-term dependencies"
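To make the one-to-one mapping concrete, here is a minimal sketch (not from the paper; PyTorch is assumed, and the tiny character vocabulary is illustrative): an RNN that reads one character per step and predicts the next one, exactly the pattern in the figure above.

# Minimal sketch: a character-level RNN with a strict one-to-one
# input/output mapping -- one prediction per input symbol.
import torch
import torch.nn as nn

chars = ["<start>", "h", "e", "l", "o", "<end>"]
stoi = {c: i for i, c in enumerate(chars)}

embed = nn.Embedding(len(chars), 16)             # character embeddings
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, len(chars))                 # next-character logits

# Input "<start> h e l l o", targets "h e l l o <end>" (shifted by one step).
inp = torch.tensor([[stoi[c] for c in ["<start>", "h", "e", "l", "l", "o"]]])
tgt = torch.tensor([[stoi[c] for c in ["h", "e", "l", "l", "o", "<end>"]]])

out, _ = rnn(embed(inp))                         # one hidden state per time step
logits = head(out)                               # one prediction per input symbol
loss = nn.functional.cross_entropy(logits.view(-1, len(chars)), tgt.view(-1))
print(loss.item())

Because every input position must produce exactly one output, this setup cannot by itself map a sentence to a translation of a different length.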
7.
Same modelling power
Overcomes the issues of RNNs with "long-term dependencies"
RNNs overwrite the hidden state
LSTMs add to the hidden state
Long Short-Term Memory (LSTM)
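The two update rules on this slide can be written side by side. A minimal sketch using the standard LSTM gate equations (the dimensions are illustrative assumptions, not the paper's configuration):

# Contrast of the state updates: an RNN overwrites its hidden state,
# an LSTM updates its cell state additively through gates.
import torch

d = 8                                            # state dimensionality (illustrative)
x, h, c = torch.randn(d), torch.zeros(d), torch.zeros(d)
Wx, Wh = torch.randn(4 * d, d), torch.randn(4 * d, d)   # gate weights packed as i, f, g, o
b = torch.zeros(4 * d)

# Vanilla RNN: the hidden state is simply overwritten at every step.
h_rnn = torch.tanh(Wx[:d] @ x + Wh[:d] @ h)

# LSTM: gated, additive update of the cell state, which is what eases
# learning of long-term dependencies.
z = Wx @ x + Wh @ h + b
i, f = torch.sigmoid(z[:d]), torch.sigmoid(z[d:2 * d])
g, o = torch.tanh(z[2 * d:3 * d]), torch.sigmoid(z[3 * d:])
c = f * c + i * g                                # add new content instead of overwriting
h = o * torch.tanh(c)
print(h_rnn.shape, h.shape, c.shape)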
8.
RNNs have a one-to-one mapping between the inputs and the outputs
LSTMs can map one to many (one image to many words in a caption), many to many (translation), or many to one (classifying a voice)
Long Short-Term Memory (LSTM)
9.
Kalchbrenner and Blunsom were the first to map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model)
Related Work
source -> target translation through a sentence vector: generalization and generation
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
10.
Proposed Approach
Related Work
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
11.
Source Sentence Model using CSM
Example of CSM
Related Work
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013
Does not preserve word ordering
12.
K. Cho et al. propose a novel neural network model called the RNN Encoder-Decoder, which consists of two recurrent neural networks (RNNs).
D. Bahdanau et al. later improve on K. Cho's work
Encoder:
Input: A variable-length sequence x
Output: A fixed-length vector representation c
Decoder:
Input: A given fixed-length vector representation c
Output: A variable-length sequence y
Related Work
K.Cho, B.Merrienboer, C.Gulcehre, F.Bougares, H.Schwenk,and Y.Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078, 2014.
D.Bahdanau, K.Cho, and Y.Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
13.
Trained jointly to maximize the conditional log-likelihood (written out after the references below)
Usage:
Generate an output sequence given an input sequence
Score a given pair of input and output sequences
Related Work
K.Cho, B.Merrienboer, C.Gulcehre, F.Bougares, H.Schwenk,and Y.Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078, 2014.
D.Bahdanau, K.Cho, and Y.Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
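For reference, the joint training objective mentioned above can be written out; this is the usual formulation in the cited encoder-decoder papers, with (x_n, y_n) a source/target sentence pair and c the encoder's fixed-length vector:

% Conditional log-likelihood objective of the RNN Encoder-Decoder
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\!\left(y_n \mid x_n\right),
\qquad
p_{\theta}(y \mid x) \;=\; \prod_{t=1}^{|y|} p_{\theta}\!\left(y_t \mid c,\, y_1, \dots, y_{t-1}\right)

Generation samples (or beam-searches) each y_t from these conditionals; scoring simply evaluates the product for a given pair.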
14.
A. Graves introduces a novel approach: use LSTMs to generate complex sequences with long-range structure, simply by predicting one data point at a time.
Deep recurrent LSTM net with skip connections
Inputs arrive one at a time; outputs determine the predictive distribution over the next input
Trained by minimising log-loss
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
Predicting handwriting
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
17.
Kalchbrenner and Blunsom were the first to map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model).
K. Cho et al. propose the RNN Encoder-Decoder for rescoring pairs of source/target sentences.
A. Graves uses LSTMs to generate complex sequences simply by predicting one data point at a time.
Related Work
A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
18.
The LSTM first reads the input sequence,
and then produces the output sequence (a minimal sketch follows the figure below)
Paper’s main ideas
[Figure: the model unrolled over time. It reads the input "A B C <eos>", then generates the output "W X Y Z <eos>", with each generated token fed back in as the next input.]
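A minimal sketch of this idea (assuming PyTorch; the vocabulary, token ids, and sizes are illustrative, not the paper's 4-layer, 1000-cell configuration): one LSTM encodes the source into its final state, and a second LSTM, initialized with that state, is trained to produce the target one token at a time.

# Minimal encoder-decoder sketch of the paper's main idea (teacher forcing).
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hid = 10, 10, 16, 32
enc_emb, dec_emb = nn.Embedding(src_vocab, emb), nn.Embedding(tgt_vocab, emb)
encoder = nn.LSTM(emb, hid, batch_first=True)
decoder = nn.LSTM(emb, hid, batch_first=True)
proj = nn.Linear(hid, tgt_vocab)

src = torch.tensor([[1, 2, 3, 0]])               # "A B C <eos>"
tgt_in = torch.tensor([[0, 4, 5, 6, 7]])         # "<eos> W X Y Z" (decoder input)
tgt_out = torch.tensor([[4, 5, 6, 7, 0]])        # "W X Y Z <eos>" (targets)

_, state = encoder(enc_emb(src))                 # fixed-dimensional summary (h, c)
dec_out, _ = decoder(dec_emb(tgt_in), state)     # decoder conditioned on that summary
logits = proj(dec_out)
loss = nn.functional.cross_entropy(logits.view(-1, tgt_vocab), tgt_out.view(-1))
print(loss.item())

At inference time, decoding would start from <eos> and feed each predicted token back in until another <eos> is produced, as in the figure above.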
19.
A simple, even unoriginal, model can achieve good results with larger and deeper neural networks
Two different LSTMs: encoder & decoder
4-layer (deep) LSTMs
Reversal of input sentences (illustrated after this slide)
First serious attempt to directly produce translations
As opposed to rescoring an SMT system's outputs
Proof that the naive approach works
Paper's Contribution
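The input-reversal trick noted above is simple: the source sentence is fed to the encoder in reverse order, so the first source words end up close to the first target words they correspond to. A tiny illustration with a made-up sentence pair (whether the end-of-sentence token is reversed too is an implementation detail; here it stays at the end):

# Illustration of the input-reversal trick; the sentence pair is made up.
src = ["the", "cat", "sat", "<eos>"]
tgt = ["le", "chat", "s'est", "assis", "<eos>"]   # target is left unchanged
reversed_src = list(reversed(src[:-1])) + ["<eos>"]
print(reversed_src)                               # ['sat', 'cat', 'the', '<eos>']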
20.
WMT’14 English to French
348M French words
304M English words
Large Experiment
160k-word input vocabulary
80k-word output vocabulary
4 hidden layers x 1000 LSTM cells (deep LSTMs outperform shallow LSTMs)
384M parameters (a rough parameter-count check follows below)
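As a back-of-the-envelope check on that figure, the breakdown below is an assumption consistent with the slide (1000-dimensional word embeddings, an 80k-word softmax, and 8 LSTM layers of 1000 cells across encoder and decoder), with biases ignored; it is not quoted from the paper.

# Rough parameter count for the configuration on this slide (biases ignored).
emb_dim, cells, layers = 1000, 1000, 8           # 4 encoder + 4 decoder layers
in_vocab, out_vocab = 160_000, 80_000

input_embeddings = in_vocab * emb_dim            # 160M
output_embeddings = out_vocab * emb_dim          #  80M
softmax = out_vocab * cells                      #  80M
# each LSTM layer has 4 gates, each with an input and a recurrent weight matrix
recurrent = layers * 4 * (cells * cells + cells * cells)   # 64M
total = input_embeddings + output_embeddings + softmax + recurrent
print(total / 1e6)                               # 384.0 (million parameters)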
21.
Large Experiment
[Figure: the unrolled encoder-decoder, annotated with a 160k-word input vocabulary, an 80k-word output vocabulary, 4 layers x 1000 cells, and 384M parameters.]
22.
Large Experiment
[Figure: the same unrolled model, with the left half labelled "LSTMs for Encoder" and the right half "LSTMs for Decoder"; 160k-word input vocabulary, 80k-word output vocabulary, 4 layers x 1000 cells, 384M parameters.]
23.
Large Experiment
[Figure: the unrolled model with its GPU assignment: 01 GPU for each of the 4 LSTM layers and 04 GPUs for the output layer.]
24.
Parallelization
[Figure: the same GPU assignment repeated: 01 GPU per LSTM layer, 04 GPUs for the output softmax.]
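The figure's labels suggest the split used in the paper: each LSTM layer runs on its own GPU, and four more GPUs share the 80k-way softmax. A minimal, CPU-runnable sketch of the softmax split (in the real setup each block's matrix multiply would live on a separate device; here everything stays on the CPU so the snippet is self-contained):

# Splitting a large output softmax into column blocks, one per "GPU".
import torch

hidden, vocab, shards = 1000, 80_000, 4
W = torch.randn(vocab, hidden)                   # full softmax weight matrix
blocks = torch.chunk(W, shards, dim=0)           # 4 blocks of 20k output words each

h = torch.randn(hidden)                          # top-layer LSTM output for one position
partial = [blk @ h for blk in blocks]            # one 20k x 1000 matrix-vector multiply per block
logits = torch.cat(partial)                      # gather all 80k logits
probs = torch.softmax(logits, dim=0)
print(probs.shape)                               # torch.Size([80000])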
31.
Fixed-Dimensionality Vectors
[Figure: the unrolled model again, annotated with the 160k-word input vocabulary, the 80k-word output vocabulary, 4 layers x 1000 cells, and 384M parameters; the encoder's final state is the fixed-dimensional vector handed to the decoder.]
Corpus: WMT’14 English -> French
304M English words, 348M French words
70K test words
NMT surpasses the phrase-based SMT baseline for the first time
BLEU Evaluation
Method                         BLEU
Bahdanau et al.                28.45
Phrase-based SMT baseline      33.3
Implemented LSTMs              34.8
Winner of WMT’14               37
Luong et al. improved neural machine translation with a rare-word technique
First to surpass the best result achieved on the WMT’14 contest task
Follow-up for Rare Words Issue
M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing
the rare word problem in neural machine translation. In ACL.
Method                         BLEU
Bahdanau et al.                28.45
Phrase-based SMT baseline      33.3
Implemented LSTMs              34.8
Winner of WMT’14               37
NMT + rare-word technique      37.5
A general, simple LSTM-based approach to sequence learning problems
Works well for machine translation, outperforming the standard SMT-based system
Good with long sentences
Strong evidence for DNNs:
Large dataset + very big neural network = success
Conclusions
40.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
References
41.
M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. In ACL, 2015.
https://www.microsoft.com/en-us/research/video/nips-oral-session-4-ilya-sutskever/?from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fvideo%2F%3Fid%3D239083
https://deeplearning4j.org/lstm.html
References