Encoder-Decoder Models
Jindřich Libovický, Jindřich Helcl
March 03, 2022
NPFL116 Compendium of Neural Machine Translation
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Model Concept
Conceptual Scheme of the Model
I am the walrus.
↓
Encoder
↓
intermediate representation
↓
Decoder
↓
Ich bin der Walros.
Neural model with a sequence of discrete
symbols as an input that generates another
sequence of discrete symbols as an output.
• pre-process source sentence
(tokenize, split into smaller units)
• convert input into vocabulary indices
• run the encoder to get an intermediate
representation (vector/matrix)
• run the decoder
• postprocess the output (detokenize)
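A minimal sketch of the whole pipeline in Python-style pseudocode; every function name here (tokenize, lookup_ids, encoder, decoder, detokenize) is an illustrative placeholder, not a real API:

def translate(source_sentence):
    tokens = tokenize(source_sentence)        # pre-process: tokenize / split into smaller units
    ids = lookup_ids(tokens)                  # convert tokens to vocabulary indices
    representation = encoder(ids)             # intermediate representation (vector/matrix)
    output_ids = decoder(representation)      # generate target-side indices
    return detokenize(output_ids)             # post-process the output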
Language Models and Decoders
What is a Language Model
LM = an estimator of a sentence probability given a language
• From now on: sentence = sequence of words 𝑤1, … , 𝑤𝑛
• Factorize the probability by word
i.e., no grammar, no hierarchical structure
Pr(w_1, …, w_n) = Pr(w_1) ⋅ Pr(w_2 | w_1) ⋅ Pr(w_3 | w_2, w_1) ⋅ ⋯ = ∏_{i=1}^{n} Pr(w_i | w_{i−1}, …, w_1)
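A small illustrative sketch of the factorization: the sentence log-probability is a sum of conditional log-probabilities (cond_prob is a hypothetical LM interface returning Pr(w | history)):

import math

def sentence_log_prob(words, cond_prob):
    log_p = 0.0
    for i, w in enumerate(words):
        log_p += math.log(cond_prob(w, words[:i]))   # log Pr(w_i | w_1, ..., w_{i-1})
    return log_p

# e.g. with a toy model that assigns 0.1 to every word:
# sentence_log_prob("i am the walrus".split(), lambda w, history: 0.1) == 4 * math.log(0.1)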
What is it good for?
• Substitute for grammar: tells what a good sentence in a language looks like
• Used in ASR and statistical MT to select more probable outputs
• Being able to predict the next word = proxy for knowing the language
• language modeling is the training objective for word2vec
• BERT is a masked language model
• Neural decoder is a conditional language model.
𝑛-gram vs. Neural LMs
𝑛-gram
cool from 1990 to 2013
• Limited history = Markov assumption
• Transparent: estimated from 𝑛-gram counts in a corpus
P(w_i | w_{i−1}, w_{i−2}, …, w_{i−n}) ≈ ∑_{j=0}^{n} λ_j ⋅ c(w_{i−j}, …, w_{i−1}, w_i) / c(w_{i−j}, …, w_{i−1})
Neural
cool since 2013
• Conditioned on an RNN state which gathers potentially unlimited history
• Trained by back-propagation to maximize probability of the
training data
• Opaque, but works better (as usual with deep learning)
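For the count-based estimate above, a small sketch of the interpolation (counts is an assumed dictionary from word tuples to corpus counts, with counts[()] holding the total number of tokens; lambdas are the weights λ_0 … λ_n):

def interpolated_prob(w, history, counts, lambdas):
    p = 0.0
    for j, lam in enumerate(lambdas):                  # j = 0 ... n
        if j > len(history):
            break
        context = tuple(history[len(history) - j:])    # the last j words of the history
        denominator = counts.get(context, 0)           # counts[()] = total number of tokens
        if denominator > 0:
            p += lam * counts.get(context + (w,), 0) / denominator
    return p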
Reminder: Recurrent Neural Networks
RNN = pipeline for information
In every step some information goes in
and some information goes out.
Technically: A “for” loop applying the
same function 𝐴 on input vectors 𝑥𝑖
At training time unrolled in time:
technically just a very deep network
Image on the right: Chris Olah. Understanding LSTM Networks. A blog post: http://colah.github.io/posts/2015-08-Understanding-LSTMs
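A minimal sketch of that "for loop" in numpy (shapes and names are illustrative): a vanilla RNN cell whose parameters are shared across all time steps.

import numpy as np

def rnn(inputs, h0, W_h, W_x, b):
    h, states = h0, []
    for x in inputs:                          # unrolled in time during training
        h = np.tanh(W_h @ h + W_x @ x + b)    # the same function A applied at every step
        states.append(h)
    return states                             # one hidden state per input vector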
Sequence Labeling
• Assign a label to each word in a sentence.
• Tasks formulated as sequence labeling:
• Part-of-Speech Tagging
• Named Entity Recognition
• Filling missing punctuation
MLP = Multilayer perceptron, n× layer: σ(Wx + b)
Softmax for K classes with logits z = (z_1, …, z_K):
softmax(z)_i = exp(z_i) / ∑_{j=1}^{K} exp(z_j)
𝑤𝑖
↓
lookup index in the vocabulary
↓
Embedding Lookup
↓
ℎ𝑖−1 → RNN → ℎ𝑖
↓
MLP
↓
Softmax
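A sketch of one labeling step with these pieces put together (all names are illustrative, not from the slides): embedding lookup, recurrent update, an MLP layer, and a softmax over the K labels.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                   # subtract the max for numerical stability
    return e / e.sum()

def tag_step(word_id, h_prev, embeddings, rnn_cell, W_mlp, b_mlp, W_out, b_out):
    x = embeddings[word_id]                   # embedding lookup
    h = rnn_cell(h_prev, x)                   # h_i from h_{i-1} and x_i
    z = np.tanh(W_mlp @ h + b_mlp)            # MLP layer σ(Wx + b)
    return h, softmax(W_out @ z + b_out)      # distribution over the K labels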
Detour: Why is softmax a good choice
Output layer with softmax (with parameters W, b) gives a categorical distribution:

P_y = softmax(x^T W + b),   i.e.   Pr(y = i | x) = exp(x^T W + b)_i / ∑_j exp(x^T W + b)_j

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth
distribution T = 1(y*) = (0, 0, …, 1, 0, …, 0):

L(P_y, y*) = H(T, P_y) = −E_{i∼T} log P(i) = −∑_i T(i) log P(i) = −log P(y*)
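In code, the one-hot target makes the cross-entropy collapse to a single term (a tiny sketch):

import numpy as np

def cross_entropy(predicted_distribution, gold_index):
    # -Σ_i T(i) log P(i) with a one-hot T is just -log P(y*)
    return -np.log(predicted_distribution[gold_index])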
Derivative of Cross-Entropy
Let l = x^T W + b, and let l_{y*} be the logit of the correct class.

∂L(P_y, y*) / ∂l = −∂/∂l log( exp l_{y*} / ∑_j exp l_j )
                 = −∂/∂l ( l_{y*} − log ∑_j exp l_j )
                 = −1_{y*} + exp l / ∑_j exp l_j
                 = P_y − 1_{y*}

The gradient-descent update therefore moves the logits in the direction 1_{y*} − P_y.
Interpretation: reinforce the correct logit, suppress the rest.
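The result in code, as a small numpy sketch: the gradient with respect to the logits is the predicted distribution minus the one-hot target.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy_grad_wrt_logits(logits, gold_index):
    grad = softmax(logits)                    # P_y
    grad[gold_index] -= 1.0                   # P_y - 1_{y*}
    return grad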
Language Model as Sequence Labeling
input symbol
one-hot vectors
embedding lookup
RNN cell
(more layers)
classifier
normalization
distribution for
the next symbol
<s>
embed
RNN
MLP
softmax
𝑃(𝑤1|<s>)
𝑤1
embed
RNN
MLP
softmax
𝑃(𝑤1| …)
𝑤2
embed
RNN
MLP
softmax
𝑃(𝑤2| …)
⋯
Sampling from a Language Model
embed
RNN
MLP
softmax
Pr(𝑤1|<s>)
sample
embed
RNN
MLP
softmax
Pr(𝑤1| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤2| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤3| …)
sample
<s>
⋯
Sampling from a Language Model: Pseudocode
last_w = "<s>"
state = initial_state
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = rnn(state, last_w_embedding)
    logits = output_projection(state)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                            # softmax: logits -> distribution
    last_w = vocabulary[np.random.multinomial(1, probs).argmax()]   # sample the next word
    yield last_w
Training
Training objective: negative log-likelihood:

NLL = − ∑_{i=1}^{n} log Pr(w_i | w_{i−1}, …, w_1)
I.e., maximize probability of the correct word.
• Cross-entropy between the predicted distribution and one-hot “true” distribution
• Error from word is backpropagated into the rest of network unrolled in time
• Prone to exposure bias: during training the model only sees well-behaved sequences, so it can break
when we sample something weird at inference time
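A small sketch of this objective, assuming we already have the per-step softmax outputs produced with teacher forcing (gold prefixes as inputs) and the indices of the correct words:

import numpy as np

def sequence_nll(step_distributions, gold_ids):
    # sum of per-step cross-entropies = -Σ_i log Pr(w_i | w_1 ... w_{i-1})
    return -sum(np.log(dist[gold]) for dist, gold in zip(step_distributions, gold_ids))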
Generating from a Language Model
(Example from GPT-2, a Transformer-based English language model, screenshot from
https://transformer.huggingface.co/doc/gpt2-large)
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.
OpenAI Blog., 2019
Cool, but where is the source language?
Conditioning the Language Model &
Attention
Conditional Language Model
Formally it is simple: condition the distribution of
• the target sequence y = (y_1, …, y_{T_y}) on
• the source sequence x = (x_1, …, x_{T_x})

Pr(y_1, …, y_n | x) = ∏_{i=1}^{n} Pr(y_i | y_{i−1}, …, y_1, x)
We need an encoder to get a representation of x!
What about just continuing an RNN…
Sequence-to-Sequence Model
𝑥1
embed
RNN
𝑥2
embed
RNN
𝑥3
embed
RNN
embed
RNN
MLP
softmax
Pr(𝑤1|<s>)
sample
embed
RNN
MLP
softmax
Pr(𝑤1| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤2| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤3| …)
sample
<s>
⋯
• The interface between encoder and decoder is a single vector,
regardless of the sentence length.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014
Seq2Seq: Pseudocode
state = np.zeros(rnn_size)
for w in input_words:
    input_embedding = source_embeddings[w]
    state = enc_cell(state, input_embedding)      # encoder: fold the source sentence into one vector

last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)     # decoder continues from the last encoder state
    logits = output_projection(state)
    last_w = vocabulary[np.argmax(logits)]
    yield last_w
Vanilla Seq2Seq: Information Bottleneck
Ich habe den Walros gesehen <s> I saw the walrus
I saw the walrus </s>
⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩
A bottleneck that all information needs to run through.
A single vector must represent the entire source sentence.
This is the main weakness and the reason for introducing attention.
The Attention Model
• Motivation: it would be nice to have a variable-length input representation
• An RNN returns one state per word…
• …what if we could retrieve just the information from the words we need when generating each word?
Attention = probabilistic retrieval of encoder states for
estimating probability of target words.
Query = hidden states of the decoder
Values = encoder hidden states
Sequence-to-Sequence Model With Attention
𝑥1
embed
RNN
RNN
ℎ1
𝑥2
embed
RNN
RNN
ℎ2
𝑥3
embed
RNN
RNN
ℎ3
<s>
embed
RNN
𝑠0
context
=
∑
⋅𝛼0,1
⋅𝛼0,2
⋅𝛼0,3
MLP
Softmax
Pr(𝑤1|<s>)
sample
• Encoder = bidirectional RNN
states ℎ𝑖 ≈ retrieved
values
• Decoder step starts as usual
state 𝑠0 ≈ retrieval query
• Decoder state s_0 is used to
compute the distribution
over the encoder states
• Weighted average of encoder
states = context vector
• Decoder state & context
concatenated
MLP + Softmax predicts
next word
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.
In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings, 2015
Attention Model in Equations (1)
Inputs:
• previous decoder state s_{i−1}
• encoder states h_j = [→h_j ; ←h_j] (concatenated forward and backward RNN states), for all j = 1 … T_x

Attention energies:       e_{ij} = v_a^T tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:   α_{ij} = exp(e_{ij}) / ∑_{k=1}^{T_x} exp(e_{ik})

Context vector:           c_i = ∑_{j=1}^{T_x} α_{ij} h_j
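A numpy sketch of these equations (shapes and variable names are illustrative: s_prev is s_{i−1}, H stacks the encoder states h_j row-wise, and W_a, U_a, b_a, v_a are the attention parameters):

import numpy as np

def attention(s_prev, H, W_a, U_a, b_a, v_a):
    energies = np.tanh(W_a @ s_prev + H @ U_a.T + b_a) @ v_a   # e_{ij}, one per source position
    alphas = np.exp(energies - energies.max())
    alphas /= alphas.sum()                                     # α_{ij}: distribution over source positions
    context = alphas @ H                                       # c_i: weighted average of encoder states
    return context, alphas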
Attention Model in Equations (2)
Output projection:

t_i = MLP(s_{i−1} ⊕ v_{y_{i−1}} ⊕ c_i)

…the attention is mixed with the hidden state (different in different models)

Output distribution:

p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp(W_o t_i + b_o)_k

(usual trick: use the transposed target embedding matrix as W_o)

• Different versions of attentive decoders exist
• Alternative: keep the context vector as an input for the next step
• Multilayer RNNs: attention between/after layers
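A small sketch of the transposed-embeddings trick, assuming the projected state t_i has the embedding dimension and E is the target embedding matrix of shape [vocab_size, emb_size]:

import numpy as np

def output_distribution(t_i, E, b):
    logits = E @ t_i + b                      # one logit per vocabulary entry, no separate W_o
    e = np.exp(logits - logits.max())
    return e / e.sum()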
Workings of the Attentive Seq2Seq model
[Figure: the bidirectional encoder turns "Ich habe den Walros gesehen" into states h_1 … h_5; at every step, the decoder states s_0 … s_4 (over the inputs "<s> I saw the walrus") attend to all of h_1 … h_5 while generating "I saw the walrus </s>".]
Seq2Seq with attention: Pseudocode (1)
fw_states = []
state = np.zeros(rnn_size)
for w in input_words:
    input_embedding = source_embeddings[w]
    state = fw_enc_cell(state, input_embedding)   # forward encoder RNN
    fw_states.append(state)

bw_states = []
state = np.zeros(rnn_size)
for w in reversed(input_words):
    input_embedding = source_embeddings[w]
    state = bw_enc_cell(state, input_embedding)   # backward encoder RNN
    bw_states.append(state)

enc_states = [np.concatenate([fw, bw])            # concatenate the two directions position-wise
              for fw, bw in zip(fw_states, reversed(bw_states))]
Seq2Seq with attention: Pseudocode (2)
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)
    alphas = attention(state, enc_states)                       # distribution over source positions
    context = sum(a * h for a, h in zip(alphas, enc_states))    # context vector
    logits = output_projection(np.concatenate([state, context, last_w_embedding]))
    last_w = vocabulary[np.argmax(logits)]
    yield last_w
Attention Visualization (1)
[Figure: attention matrices between English source and French output sentences, e.g. "The agreement on the European Economic Area was signed in August 1992." → "L'accord sur la zone économique européenne a été signé en août 1992." and "It should be noted that the marine environment is the least known of environments." → "Il convient de noter que l'environnement marin est le moins connu de l'environnement."]
Image source: Bahdanau et al. (2015), Fig. 3
Attention Visualization (2)
Image source: Koehn and Knowles (2017), Fig. 8
Attention vs. Alignment
Differences between attention model and word alignment used for phrase table generation:
attention (NMT)   |  alignment (SMT)
------------------+------------------
probabilistic     |  discrete
declarative       |  imperative
LM generates      |  LM discriminates
Training Seq2Seq Model
Optimize negative log-likelihood of parallel data, backpropagation does
the rest.
If you choose the right optimizer, learning rate, and model hyper-parameters, prepare the data, do
back-translation, monolingual pre-training, …
Confusion: decoder inputs vs. outputs
inputs y[:-1] <s> 𝑦1 𝑦2 𝑦3 𝑦4
↓ ↓ ↓ ↓ ↓
Decoder
↓ ↓ ↓ ↓ ↓
outputs y[1:] 𝑦1 𝑦2 𝑦3 𝑦4 </s>
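A tiny sketch of the shift (y is a hypothetical target token list without the special symbols):

y = ["I", "saw", "the", "walrus"]
full = ["<s>"] + y + ["</s>"]
decoder_inputs  = full[:-1]    # <s> I saw the walrus
decoder_outputs = full[1:]     # I saw the walrus </s>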
Inference
Getting output
• Encoder-decoder is a conditional language model
• For a pair x and y, we can compute:

  Pr(y | x) = ∏_{i=1}^{T_y} Pr(y_i | y_{<i}, x)

• When decoding we want to get

  y* = argmax_{y′} Pr(y′ | x)

☠ Enumerating all y′ is computationally intractable ☠
Greedy Decoding
In each step, take the most probable word.

y*_i = argmax_{y_i} Pr(y_i | y*_{i−1}, …, <s>)

last_w = "<s>"
state = initial_state
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)
    logits = output_projection(state)
    last_w = vocabulary[np.argmax(logits)]    # take the argmax instead of sampling
    yield last_w
What if…
Suppose the decoder has generated the prefix "This is a". The most probable next word is
"platypus" (25%), narrowly ahead of "rather" (24%). After "platypus", the best continuation is
"random end . </s>" at roughly 30% per step; after "rather", it is "good sentence . </s>" at
roughly 60% per step. Greedy decoding commits to "platypus" and scores about
0.25 · 0.3⁴ ≈ 0.002, while the path through "rather" scores about 0.24 · 0.6⁴ ≈ 0.031.
⚠ Greedy decoding can easily miss the best option. ⚠
Beam Search
Keep a small number k of hypotheses (typically 4–20).
1. Begin with a single empty hypothesis in the beam.
2. In each time step:
2.1 Extend all hypotheses in the beam by all (or only the most probable) words from the output
distribution (we call these candidate hypotheses)
2.2 Score the candidate hypotheses
2.3 Keep only 𝑘 best of them.
3. Finish if all 𝑘-best hypotheses end with </s>
4. Sort the hypotheses by their score and output the best one.
Beam Search: Example
[Figure: a beam-search tree expanding partial hypotheses from <s> (candidate words include Hello, Hi, Hey, world, there, !, …), keeping only the best partial hypotheses at every step.]
Beam Search: Pseudocode
beam = [(["<s>"], initial_state, 1.0)]            # (hypothesis, decoder state, score)
while any(hyp[-1] != "</s>" for hyp, _, _ in beam):
    candidates = []
    for hyp, state, score in beam:
        if hyp[-1] == "</s>":                     # keep finished hypotheses unchanged
            candidates.append((hyp, state, score))
            continue
        distribution, new_state = decoder_step(hyp[-1], state, encoder_states)
        for i, prob in enumerate(distribution):
            candidates.append((hyp + [vocabulary[i]], new_state, score * prob))
    beam = take_best(k, candidates)               # keep the k highest-scoring candidates
Implementation issues
• Multiplying too many small numbers → float underflow
  ⇒ compute in the log domain and add logarithms instead
• Sentences can have different lengths:
  This is a good long sentence . </s>
  0.7 × 0.6 × 0.9 × 0.1 × 0.4 × 0.4 × 0.8 × 0.9 ≈ 0.004
  This </s>
  0.7 × 0.01 = 0.007
  ⇒ use the geometric mean instead of the raw product of probabilities
• Sorting candidates is expensive, asymptotically |V| log |V|:
  the k-best can be found in linear time, |V| ∼ 10⁴–10⁵
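A small sketch of the first two fixes combined: score hypotheses by the mean log-probability (the log of the geometric mean), so the comparison above comes out in favour of the longer sentence.

import math

def hypothesis_score(token_log_probs):
    # average log-probability = log of the geometric mean of the token probabilities
    return sum(token_log_probs) / len(token_log_probs)

short_hyp = [math.log(p) for p in (0.7, 0.01)]
long_hyp = [math.log(p) for p in (0.7, 0.6, 0.9, 0.1, 0.4, 0.4, 0.8, 0.9)]
print(hypothesis_score(long_hyp) > hypothesis_score(short_hyp))   # True: the long sentence wins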
Final Remarks
Brief history of the architectures
• 2013 First encoder-decoder model (Kalchbrenner and Blunsom, 2013)
• 2014 First really usable encoder-decoder model (Sutskever et al., 2014)
• 2014/2015 Added attention (crucial innovation in NLP) (Bahdanau et al., 2015)
• 2016/2017 WMT winners used RNN-based neural systems (Sennrich et al., 2016)
• 2017 Transformers invented (outperformed RNN) (Vaswani et al., 2017)
The development of architectures still goes on…
Document context, non-autoregressive models, multilingual models, …
Encoder-Decoder Models
Summary
• Encoder-decoder architecture = major paradigm in MT
• Encoder-decoder architecture = conditional language model
• Attention = way of conditioning the decoder on the encoder
• Attention = probabilistic vector retrieval
• We model probability, but need heuristics to get a good sentence
from the model
http://ufal.mff.cuni.cz/courses/npfl116
References I
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann
LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
2015.
Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, pages 1700–1709, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation,
pages 28–39, Vancouver, Canada, August 2017. Association for Computational Linguistics.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog., 2019.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on
Machine Translation: Volume 2, Shared Task Papers, pages 371–376, Berlin, Germany, August 2016. Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/W16-2323.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27,
pages 3104–3112, Montreal, Canada, December 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.