Sequence to Sequence Learning with Neural Networks
2017.04.23
Presented by Quang Nguyen
Vietnam Development Center (VDC)
Ilya Sutskever, Oriol Vinyals, Quoc V. Le - Google
Sequence (To Sequence) Learning
- Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data
- But many interesting data types are not i.i.d.
- In particular, successive points in sequential data are strongly correlated
Sequence (To Sequence) Learning
- Applications of sequence learning
  - Machine Translation
  - Question Answering (http://time.com/4624067/amazon-echo-alexa-ces-2017)
  - Image Caption Generation
  - Speech Recognition
  - Handwriting Synthesis
  - Etc.
DNN Issue with Sequence Learning
- Sequence learning is the study of ML algorithms designed for sequential data
- Limitations of DNNs:
  - Can only map fixed-size vectors to fixed-size vectors
  - Sequential data has variable, a priori unknown length
Paper Abstract
- What is the problem? Why is it important?
  - Mapping sequences to sequences; the same approach can be applied to many other sequence learning problems
- How is it solved?
  - Apply a simple DNN approach to map sequences to sequences, in two main steps:
    - A deep LSTM maps the input sequence to a vector of fixed dimensionality
    - Another deep LSTM decodes the target sequence from that vector
- Achievements
  - Close to the winning result of the EN -> FR SMT task (BLEU: 34.8 vs. 37.0)
  - With further improvements (e.g., rescoring), it can beat the SMT baseline
Recurrent Neural Network
- Just a neural network with a feedback loop
- The previous time step's hidden state and final output are fed back into the network (a minimal code sketch follows this slide)
[Figure: an RNN unrolled over time, reading <start> h e l l o and predicting h e l l o <end> one character per step]
Issues:
- One-to-one input-output mapping
- Trouble with "long-term dependencies"
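A minimal sketch of this feedback loop in plain numpy (illustrative only; the weight names, sizes, and the character task are assumptions, not taken from the paper or the slides):

```python
import numpy as np

# Illustrative sizes and randomly initialized weights (hypothetical values).
vocab_size, hidden_size = 50, 64
W_xh = 0.01 * np.random.randn(hidden_size, vocab_size)   # input  -> hidden
W_hh = 0.01 * np.random.randn(hidden_size, hidden_size)  # hidden -> hidden (the feedback loop)
W_hy = 0.01 * np.random.randn(vocab_size, hidden_size)   # hidden -> output

def rnn_step(x_onehot, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    logits = W_hy @ h          # unnormalized scores for the next character
    return h, logits

# Unroll over a short sequence of one-hot encoded characters.
h = np.zeros(hidden_size)
for idx in [7, 4, 11, 11, 14]:                 # e.g. indices for "h e l l o"
    h, logits = rnn_step(np.eye(vocab_size)[idx], h)
```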
Long Short-Term Memory (LSTM)
- Same modelling power as an RNN
- Overcomes the RNN's issues with "long-term dependencies":
  - RNNs overwrite the hidden state at every step
  - LSTMs add to the hidden state (see the cell equations below)
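For reference, the standard LSTM cell update (not spelled out on the slide) makes the additive behaviour explicit; this is the common formulation with input, forget, and output gates:

```latex
% LSTM cell at time step t (\sigma = logistic sigmoid, \odot = element-wise product)
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{additive update, not an overwrite} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```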
Long Short-Term Memory (LSTM)
- RNNs have a one-to-one mapping between the inputs and the outputs
- LSTMs can map one to many (one image to many words in a caption), many to many (translation), or many to one (classifying a voice)
Related Work
- Kalchbrenner and Blunsom first map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model)
[Figure: source -> target translation through an intermediate vector, used for both generalization and generation]
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
Related Work
- Proposed approach
[Figure: overview of Kalchbrenner and Blunsom's proposed model]
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
Related Work
- Source sentence model using the CSM
- Example of the CSM
[Figure: CSM example; note that the CSM does not preserve word ordering]
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
Related Work
- K. Cho et al. propose a neural network model called the RNN Encoder-Decoder, which consists of two recurrent neural networks (RNNs); D. Bahdanau et al. later improve on this work
- Encoder:
  - Input: a variable-length sequence x
  - Output: a fixed-length vector representation c
- Decoder:
  - Input: the fixed-length vector representation c
  - Output: a variable-length sequence y
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Related Work
- Encoder and decoder are trained jointly to maximize the conditional log-likelihood (written out below)
- Usage:
  - Generate an output sequence given an input sequence
  - Score a given pair of input and output sequences
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
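The conditional log-likelihood objective referred to here can be written as follows (the training-set symbol D is my notation; S is a source sentence, T its target translation, and c the encoder's fixed-length vector):

```latex
% Training: maximize the average conditional log-likelihood over source/target pairs (S, T);
% the decoder factorizes p(T | S) one output symbol at a time, conditioned on the vector c.
\max_\theta \; \frac{1}{|\mathcal{D}|} \sum_{(S,T)\in\mathcal{D}} \log p_\theta(T \mid S),
\qquad
p_\theta(T \mid S) = \prod_{t=1}^{|T|} p_\theta(y_t \mid c,\, y_1, \dots, y_{t-1})
```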
Related Work
- A. Graves introduces a novel approach: use LSTMs to generate complex sequences with long-range structure, simply by predicting one data point at a time
  - Deep recurrent LSTM network with skip connections
  - Inputs arrive one at a time; the outputs determine a predictive distribution over the next input
  - Trained by minimising log-loss
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[Figure from A. Graves, "Generating sequences with recurrent neural networks"]
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Related Work
- Handwriting prediction
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Related Work
- Kalchbrenner and Blunsom first map the entire input sentence to a vector for translation, using a CSM (Convolutional Sentence Model)
- K. Cho et al. propose the RNN Encoder-Decoder for rescoring pairs of source/target sentences
- A. Graves uses LSTMs to generate complex sequences simply by predicting one data point at a time
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Paper's main ideas
- The first (encoder) LSTM reads the input sequence
- A second (decoder) LSTM then produces the output sequence (a code sketch follows this slide)
[Figure: the encoder reads "A B C <eos>"; the decoder then emits "W X Y Z <eos>" one token per time step, with each emitted token fed back in as the decoder's next input]
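A minimal PyTorch sketch of the two-LSTM idea (an assumed toy configuration, not the paper's 4-layer, 1000-cell implementation; the class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM reads the source; decoder LSTM, seeded with the encoder's
    final state, emits the target one token at a time (toy sizes)."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # Encoder: the whole source sentence is compressed into the final (h, c) state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: generates the target conditioned only on that fixed-size state
        # (teacher forcing: tgt_in_ids is the gold target shifted right).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), state)
        return self.proj(dec_out)          # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 5))       # batch of 2 source sentences ("A B C ... <eos>")
tgt_in = torch.randint(0, 1000, (2, 6))    # decoder inputs ("<eos> W X Y Z ...")
logits = model(src, tgt_in)                # shape (2, 6, 1000)
```

At test time the decoder is instead run step by step, feeding each predicted token back in as its next input; the paper uses a simple left-to-right beam-search decoder for this.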
Paper's Contribution
- A simple, not especially novel model, scaled to larger and deeper neural networks, can achieve good results
- Two different LSTMs: an encoder and a decoder
- 4-layer deep LSTMs
- Reversal of the input sentences (illustrated after this list)
- First serious attempt to directly produce translations
  - As opposed to rescoring another system's outputs
  - Proof that the naive approach works
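The input-reversal trick mentioned above, shown on a made-up token list (the example sentence is illustrative, not from the paper):

```python
# Reverse only the source sentence; the target order is unchanged.
src = ["je", "suis", "etudiant"]          # hypothetical source tokens
tgt = ["i", "am", "a", "student"]         # hypothetical target tokens
src_reversed = list(reversed(src))        # ["etudiant", "suis", "je"]
# The first target word "i" is now much closer to its source counterpart "je",
# introducing the short-term dependencies that the paper credits for the improvement.
```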
Large Experiment
- WMT'14 English to French
  - 348M French words, 304M English words
- 160K-word input vocabulary
- 80K-word output vocabulary
- 4 hidden layers x 1000 LSTM cells (deep LSTMs outperform shallow LSTMs)
- 384M parameters
Large Experiment
[Figure: the 4-layer x 1000-cell encoder-decoder LSTM (384M parameters); the encoder LSTMs read the 160k-word input vocabulary, the decoder LSTMs produce the 80k-word output vocabulary]
Parallelization
[Figure: the model is split across GPUs - each of the 4 LSTM layers runs on its own GPU, and 4 additional GPUs parallelize the output softmax]
Fixed-Dimensionality Vectors
[Figure: the same encoder-decoder diagram, highlighting that the entire input sentence is squeezed into a single fixed-dimensionality vector between the 160k-word encoder side and the 80k-word decoder side]
Fixed-Dimensionality Vectors
- The learned sentence representations are sensitive to word order
- They are insensitive to active vs. passive voice
BLEU Evaluation
- Corpus: WMT'14 English -> French
  - 304M English words, 348M French words
  - ~70K test words
- First time NMT surpasses the phrase-based SMT baseline

Method                      BLEU
Bahdanau et al.             28.45
Phrase-based SMT baseline   33.3
Implemented LSTMs           34.8
Winner of WMT'14            37.0
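For readers who want to compute a BLEU-style score themselves, here is a small sketch using NLTK's corpus_bleu (assuming the nltk package is installed; the sentences are illustrative, and scores on toy data are not comparable to the table above):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is a token list; each entry in `references` is a list of
# acceptable reference token lists for the corresponding hypothesis.
references = [[["the", "cat", "sits", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {100 * score:.1f}")
```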
Pretty Good Performance on Long Sentences
[Figure: translation quality as a function of sentence length]
Bad Performance on Rare Words
[Figure: translation quality as a function of word rarity]
Follow-up for the Rare-Word Issue
- Luong et al. improve neural machine translation with a rare-word technique
- First to surpass the best result achieved on the WMT'14 contest task

Method                      BLEU
Bahdanau et al.             28.45
Phrase-based SMT baseline   33.3
Implemented LSTMs           34.8
Winner of WMT'14            37.0
NMT + rare-word technique   37.5

M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. In ACL, 2015.
Bonus Slide
Conclusions
- A general, simple LSTM-based approach to sequence learning problems
- Works well for machine translation
  - Outperforms the standard SMT-based system
  - Good with long sentences
- Strong evidence for DNNs: Large Dataset + Very Big Neural Network = Success
References
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
- A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
References
- M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. In ACL, 2015.
- https://www.microsoft.com/en-us/research/video/nips-oral-session-4-ilya-sutskever/?from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fvideo%2F%3Fid%3D239083
- https://deeplearning4j.org/lstm.html
Thank you!