Encoder-Decoder Models
Jindřich Libovický, Jindřich Helcl
March 03, 2022
NPFL116 Compendium of Neural Machine Translation
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Model Concept
Conceptual Scheme of the Model
I am the walrus.
↓
Encoder
↓
intermediate representation
↓
Decoder
↓
Ich bin der Walros.
Neural model with a sequence of discrete
symbols as an input that generates another
sequence of discrete symbols as an output.
• pre-process source sentence
(tokenize, split into smaller units)
• convert input into vocabulary indices
• run the encoder to get an intermediate
representation (vector/matrix)
• run the decoder
• postprocess the output (detokenize)
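A minimal sketch of the whole pipeline in Python-style pseudocode; every function name here (tokenize, lookup_ids, encoder, decoder, detokenize) is an illustrative placeholder, not a real API:

def translate(source_sentence):
    tokens = tokenize(source_sentence)        # pre-process: tokenize / split into smaller units
    ids = lookup_ids(tokens)                  # convert tokens to vocabulary indices
    representation = encoder(ids)             # intermediate representation (vector/matrix)
    output_ids = decoder(representation)      # generate target-side indices
    return detokenize(output_ids)             # post-process the output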
Language Models and Decoders
What is a Language Model
LM = an estimator of a sentence probability given a language
• From now on: sentence = sequence of words 𝑤1, … , 𝑤𝑛
• Factorize the probability by word
i.e., no grammar, no hierarchical structure
Pr(w_1, …, w_n) = Pr(w_1) ⋅ Pr(w_2 | w_1) ⋅ Pr(w_3 | w_2, w_1) ⋅ ⋯ = ∏_{i=1}^{n} Pr(w_i | w_{i−1}, …, w_1)
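A small illustrative sketch of the factorization: the sentence log-probability is a sum of conditional log-probabilities (cond_prob is a hypothetical LM interface returning Pr(w | history)):

import math

def sentence_log_prob(words, cond_prob):
    log_p = 0.0
    for i, w in enumerate(words):
        log_p += math.log(cond_prob(w, words[:i]))   # log Pr(w_i | w_1, ..., w_{i-1})
    return log_p

# e.g. with a toy model that assigns 0.1 to every word:
# sentence_log_prob("i am the walrus".split(), lambda w, history: 0.1) == 4 * math.log(0.1)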
What is it good for?
• Substitute for grammar: tells what a good sentence in a language looks like
• Used in ASR and statistical MT to select more probable outputs
• Being able to predict the next word = proxy for knowing the language
• language modeling is the training objective for word2vec
• BERT is a masked language model
• Neural decoder is a conditional language model.
𝑛-gram vs. Neural LMs
𝑛-gram
cool from 1990 to 2013
• Limited history = Markov assumption
• Transparent: estimated from 𝑛-gram counts in a corpus
P(w_i | w_{i−1}, w_{i−2}, …, w_{i−n}) ≈ ∑_{j=0}^{n} λ_j ⋅ c(w_{i−j}, …, w_{i−1}, w_i) / c(w_{i−j}, …, w_{i−1})
Neural
cool since 2013
• Conditioned on an RNN state which gathers potentially unlimited history
• Trained by back-propagation to maximize probability of the
training data
• Opaque, but works better (as usual with deep learning)
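For the count-based estimate above, a small sketch of the interpolation (counts is an assumed dictionary from word tuples to corpus counts, with counts[()] holding the total number of tokens; lambdas are the weights λ_0 … λ_n):

def interpolated_prob(w, history, counts, lambdas):
    p = 0.0
    for j, lam in enumerate(lambdas):                  # j = 0 ... n
        if j > len(history):
            break
        context = tuple(history[len(history) - j:])    # the last j words of the history
        denominator = counts.get(context, 0)           # counts[()] = total number of tokens
        if denominator > 0:
            p += lam * counts.get(context + (w,), 0) / denominator
    return p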
Reminder: Recurrent Neural Networks
RNN = pipeline for information
In every step some information goes in
and some information goes out.
Technically: A “for” loop applying the
same function 𝐴 on input vectors 𝑥𝑖
At training time unrolled in time:
technically just a very deep network
Image on the right: Chris Olah. Understanding LSTM Networks. A blog post: http://colah.github.io/posts/2015-08-Understanding-LSTMs
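A minimal sketch of that "for loop" in numpy (shapes and names are illustrative): a vanilla RNN cell whose parameters are shared across all time steps.

import numpy as np

def rnn(inputs, h0, W_h, W_x, b):
    h, states = h0, []
    for x in inputs:                          # unrolled in time during training
        h = np.tanh(W_h @ h + W_x @ x + b)    # the same function A applied at every step
        states.append(h)
    return states                             # one hidden state per input vector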
Sequence Labeling
• Assign a label to each word in a sentence.
• Tasks formulated as sequence labeling:
• Part-of-Speech Tagging
• Named Entity Recognition
• Filling missing punctuation
MLP = Multilayer perceptron, n× layer: σ(Wx + b)
Softmax for K classes with logits z = (z_1, …, z_K):
softmax(z)_i = exp(z_i) / ∑_{j=1}^{K} exp(z_j)
𝑤𝑖
↓
lookup index in the vocabulary
↓
Embedding Lookup
↓
ℎ𝑖−1 → RNN → ℎ𝑖
↓
MLP
↓
Softmax
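A sketch of one labeling step with these pieces put together (all names are illustrative, not from the slides): embedding lookup, recurrent update, an MLP layer, and a softmax over the K labels.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                   # subtract the max for numerical stability
    return e / e.sum()

def tag_step(word_id, h_prev, embeddings, rnn_cell, W_mlp, b_mlp, W_out, b_out):
    x = embeddings[word_id]                   # embedding lookup
    h = rnn_cell(h_prev, x)                   # h_i from h_{i-1} and x_i
    z = np.tanh(W_mlp @ h + b_mlp)            # MLP layer σ(Wx + b)
    return h, softmax(W_out @ z + b_out)      # distribution over the K labels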
Detour: Why is softmax a good choice
Output layer with softmax (with parameters W, b) gives a categorical distribution:

P_y = softmax(x^T W + b),   i.e.   Pr(y = i | x) = exp(x^T W + b)_i / ∑_j exp(x^T W + b)_j

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth
distribution T = 1(y*) = (0, 0, …, 1, 0, …, 0):

L(P_y, y*) = H(T, P_y) = −E_{i∼T} log P(i) = −∑_i T(i) log P(i) = −log P(y*)
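In code, the one-hot target makes the cross-entropy collapse to a single term (a tiny sketch):

import numpy as np

def cross_entropy(predicted_distribution, gold_index):
    # -Σ_i T(i) log P(i) with a one-hot T is just -log P(y*)
    return -np.log(predicted_distribution[gold_index])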
Derivative of Cross-Entropy
Let l = x^T W + b, and let l_{y*} be the logit of the correct class.

∂L(P_y, y*) / ∂l = −∂/∂l log( exp l_{y*} / ∑_j exp l_j )
                 = −∂/∂l ( l_{y*} − log ∑_j exp l_j )
                 = −1_{y*} + exp l / ∑_j exp l_j
                 = P_y − 1_{y*}

The gradient-descent update therefore moves the logits in the direction 1_{y*} − P_y.
Interpretation: reinforce the correct logit, suppress the rest.
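The result in code, as a small numpy sketch: the gradient with respect to the logits is the predicted distribution minus the one-hot target.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy_grad_wrt_logits(logits, gold_index):
    grad = softmax(logits)                    # P_y
    grad[gold_index] -= 1.0                   # P_y - 1_{y*}
    return grad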
Language Model as Sequence Labeling
input symbol
one-hot vectors
embedding lookup
RNN cell
(more layers)
classifier
normalization
distribution for
the next symbol
<s>
embed
RNN
MLP
softmax
𝑃(𝑤1|<s>)
𝑤1
embed
RNN
MLP
softmax
𝑃(𝑤1| …)
𝑤2
embed
RNN
MLP
softmax
𝑃(𝑤2| …)
⋯
Sampling from a Language Model
embed
RNN
MLP
softmax
Pr(𝑤1|<s>)
sample
embed
RNN
MLP
softmax
Pr(𝑤1| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤2| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤3| …)
sample
<s>
⋯
Sampling from a Language Model: Pseudocode
last_w = "<s>"
state = initial_state
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = rnn(state, last_w_embedding)
    logits = output_projection(state)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                            # softmax: logits -> distribution
    last_w = vocabulary[np.random.multinomial(1, probs).argmax()]   # sample the next word
    yield last_w
Training
Training objective: negative log-likelihood:

NLL = − ∑_{i=1}^{n} log Pr(w_i | w_{i−1}, …, w_1)
I.e., maximize probability of the correct word.
• Cross-entropy between the predicted distribution and one-hot “true” distribution
• Error from word is backpropagated into the rest of network unrolled in time
• Prone to exposure bias: during training the model only sees well-behaved sequences, so it can break
when we sample something weird at inference time
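A small sketch of this objective, assuming we already have the per-step softmax outputs produced with teacher forcing (gold prefixes as inputs) and the indices of the correct words:

import numpy as np

def sequence_nll(step_distributions, gold_ids):
    # sum of per-step cross-entropies = -Σ_i log Pr(w_i | w_1 ... w_{i-1})
    return -sum(np.log(dist[gold]) for dist, gold in zip(step_distributions, gold_ids))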
Generating from a Language Model
(Example from GPT-2, a Transformer-based English language model, screenshot from
https://transformer.huggingface.co/doc/gpt2-large)
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.
OpenAI Blog., 2019
Cool, but where is the source language?
Conditioning the Language Model &
Attention
Conditional Language Model
Formally it is simple: condition the distribution of
• the target sequence y = (y_1, …, y_{T_y}) on
• the source sequence x = (x_1, …, x_{T_x})

Pr(y_1, …, y_n | x) = ∏_{i=1}^{n} Pr(y_i | y_{i−1}, …, y_1, x)
We need an encoder to get a representation of x!
What about just continuing an RNN…
Sequence-to-Sequence Model
𝑥1
embed
RNN
𝑥2
embed
RNN
𝑥3
embed
RNN
embed
RNN
MLP
softmax
Pr(𝑤1|<s>)
sample
embed
RNN
MLP
softmax
Pr(𝑤1| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤2| …)
sample
embed
RNN
MLP
softmax
Pr(𝑤3| …)
sample
<s>
⋯
• The interface between encoder and decoder is a single vector,
regardless of the sentence length.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014
Seq2Seq: Pseudocode
state = np.zeros(rnn_size)
for w in input_words:
    input_embedding = source_embeddings[w]
    state = enc_cell(state, input_embedding)      # encoder: fold the source sentence into one vector

last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)     # decoder continues from the last encoder state
    logits = output_projection(state)
    last_w = vocabulary[np.argmax(logits)]
    yield last_w
Vanilla Seq2Seq: Information Bottleneck
Ich habe den Walros gesehen <s> I saw the walrus
I saw the walrus </s>
⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩ RNN ⟩⟩⟩
A bottleneck that all information needs to run through.
A single vector must represent the entire source sentence.
This is the main weakness and the reason for introducing attention.
The Attention Model
• Motivation: it would be nice to have a variable-length input representation
• An RNN returns one state per word…
• …what if we could retrieve just the information from the words we need when generating each word?
Attention = probabilistic retrieval of encoder states for
estimating probability of target words.
Query = hidden states of the decoder
Values = encoder hidden states
Sequence-to-Sequence Model With Attention
𝑥1
embed
RNN
RNN
ℎ1
𝑥2
embed
RNN
RNN
ℎ2
𝑥3
embed
RNN
RNN
ℎ3
<s>
embed
RNN
𝑠0
context
=
∑
⋅𝛼0,1
⋅𝛼0,2
⋅𝛼0,3
MLP
Softmax
Pr(𝑤1|<s>)
sample
• Encoder = bidirectional RNN
states ℎ𝑖 ≈ retrieved
values
• Decoder step starts as usual
state 𝑠0 ≈ retrieval query
• Decoder state s_0 is used to
compute the distribution
over the encoder states
• Weighted average of encoder
states = context vector
• Decoder state & context
concatenated
MLP + Softmax predicts
next word
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.
In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings, 2015
Attention Model in Equations (1)
Inputs:
• previous decoder state s_{i−1}
• encoder states h_j = [→h_j ; ←h_j] (concatenated forward and backward RNN states), for all j = 1 … T_x

Attention energies:       e_{ij} = v_a^T tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:   α_{ij} = exp(e_{ij}) / ∑_{k=1}^{T_x} exp(e_{ik})

Context vector:           c_i = ∑_{j=1}^{T_x} α_{ij} h_j
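A numpy sketch of these equations (shapes and variable names are illustrative: s_prev is s_{i−1}, H stacks the encoder states h_j row-wise, and W_a, U_a, b_a, v_a are the attention parameters):

import numpy as np

def attention(s_prev, H, W_a, U_a, b_a, v_a):
    energies = np.tanh(W_a @ s_prev + H @ U_a.T + b_a) @ v_a   # e_{ij}, one per source position
    alphas = np.exp(energies - energies.max())
    alphas /= alphas.sum()                                     # α_{ij}: distribution over source positions
    context = alphas @ H                                       # c_i: weighted average of encoder states
    return context, alphas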
Attention Model in Equations (2)
Output projection:

t_i = MLP(s_{i−1} ⊕ v_{y_{i−1}} ⊕ c_i)

…the attention is mixed with the hidden state (different in different models)

Output distribution:

p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp(W_o t_i + b_o)_k

(usual trick: use the transposed target embedding matrix as W_o)

• Different versions of attentive decoders exist
• Alternative: keep the context vector as an input for the next step
• Multilayer RNNs: attention between/after layers
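A small sketch of the transposed-embeddings trick, assuming the projected state t_i has the embedding dimension and E is the target embedding matrix of shape [vocab_size, emb_size]:

import numpy as np

def output_distribution(t_i, E, b):
    logits = E @ t_i + b                      # one logit per vocabulary entry, no separate W_o
    e = np.exp(logits - logits.max())
    return e / e.sum()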
Workings of the Attentive Seq2Seq model
[Figure: the bidirectional encoder turns "Ich habe den Walros gesehen" into states h_1 … h_5; at every step, the decoder states s_0 … s_4 (over the inputs "<s> I saw the walrus") attend to all of h_1 … h_5 while generating "I saw the walrus </s>".]
Seq2Seq with attention: Pseudocode (1)
fw_states = []
state = np.zeros(rnn_size)
for w in input_words:
    input_embedding = source_embeddings[w]
    state = fw_enc_cell(state, input_embedding)   # forward encoder RNN
    fw_states.append(state)

bw_states = []
state = np.zeros(rnn_size)
for w in reversed(input_words):
    input_embedding = source_embeddings[w]
    state = bw_enc_cell(state, input_embedding)   # backward encoder RNN
    bw_states.append(state)

enc_states = [np.concatenate([fw, bw])            # concatenate the two directions position-wise
              for fw, bw in zip(fw_states, reversed(bw_states))]
Seq2Seq with attention: Pseudocode (2)
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)
    alphas = attention(state, enc_states)                       # distribution over source positions
    context = sum(a * h for a, h in zip(alphas, enc_states))    # context vector
    logits = output_projection(np.concatenate([state, context, last_w_embedding]))
    last_w = vocabulary[np.argmax(logits)]
    yield last_w
Attention Visualization (1)
[Figure: attention matrices between English source and French output sentences, e.g. "The agreement on the European Economic Area was signed in August 1992." → "L'accord sur la zone économique européenne a été signé en août 1992." and "It should be noted that the marine environment is the least known of environments." → "Il convient de noter que l'environnement marin est le moins connu de l'environnement."]
Image source: Bahdanau et al. (2015), Fig. 3
Attention Visualization (2)
Image source: Koehn and Knowles (2017), Fig. 8
Attention vs. Alignment
Differences between attention model and word alignment used for phrase table generation:
attention (NMT)   |  alignment (SMT)
------------------+------------------
probabilistic     |  discrete
declarative       |  imperative
LM generates      |  LM discriminates
Training Seq2Seq Model
Optimize negative log-likelihood of parallel data, backpropagation does
the rest.
If you choose the right optimizer, learning rate, and model hyper-parameters, prepare the data, do
back-translation, monolingual pre-training, …
Confusion: decoder inputs vs. outputs
inputs y[:-1] <s> 𝑦1 𝑦2 𝑦3 𝑦4
↓ ↓ ↓ ↓ ↓
Decoder
↓ ↓ ↓ ↓ ↓
outputs y[1:] 𝑦1 𝑦2 𝑦3 𝑦4 </s>
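A tiny sketch of the shift (y is a hypothetical target token list without the special symbols):

y = ["I", "saw", "the", "walrus"]
full = ["<s>"] + y + ["</s>"]
decoder_inputs  = full[:-1]    # <s> I saw the walrus
decoder_outputs = full[1:]     # I saw the walrus </s>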
Inference
Getting output
• Encoder-decoder is a conditional language model
• For a pair x and y, we can compute:

  Pr(y | x) = ∏_{i=1}^{T_y} Pr(y_i | y_{<i}, x)

• When decoding we want to get

  y* = argmax_{y′} Pr(y′ | x)

☠ Enumerating all y′ is computationally intractable ☠
Greedy Decoding
In each step, take the most probable word.

y*_i = argmax_{y_i} Pr(y_i | y*_{i−1}, …, <s>)

last_w = "<s>"
state = initial_state
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state = dec_cell(state, last_w_embedding)
    logits = output_projection(state)
    last_w = vocabulary[np.argmax(logits)]    # take the argmax instead of sampling
    yield last_w
What if…
Suppose the decoder has generated the prefix "This is a". The most probable next word is
"platypus" (25%), narrowly ahead of "rather" (24%). After "platypus", the best continuation is
"random end . </s>" at roughly 30% per step; after "rather", it is "good sentence . </s>" at
roughly 60% per step. Greedy decoding commits to "platypus" and scores about
0.25 · 0.3⁴ ≈ 0.002, while the path through "rather" scores about 0.24 · 0.6⁴ ≈ 0.031.
⚠ Greedy decoding can easily miss the best option. ⚠
Beam Search
Keep a small number k of hypotheses (typically 4–20).
1. Begin with a single empty hypothesis in the beam.
2. In each time step:
2.1 Extend all hypotheses in the beam by all (or only the most probable) words from the output
distribution (we call these candidate hypotheses)
2.2 Score the candidate hypotheses
2.3 Keep only 𝑘 best of them.
3. Finish if all 𝑘-best hypotheses end with </s>
4. Sort the hypotheses by their score and output the best one.
Beam Search: Example
[Figure: a beam-search tree expanding partial hypotheses from <s> (candidate words include Hello, Hi, Hey, world, there, !, …), keeping only the best partial hypotheses at every step.]
Beam Search: Pseudocode
beam = [(["<s>"], initial_state, 1.0)]            # (hypothesis, decoder state, score)
while any(hyp[-1] != "</s>" for hyp, _, _ in beam):
    candidates = []
    for hyp, state, score in beam:
        if hyp[-1] == "</s>":                     # keep finished hypotheses unchanged
            candidates.append((hyp, state, score))
            continue
        distribution, new_state = decoder_step(hyp[-1], state, encoder_states)
        for i, prob in enumerate(distribution):
            candidates.append((hyp + [vocabulary[i]], new_state, score * prob))
    beam = take_best(k, candidates)               # keep the k highest-scoring candidates
Implementation issues
• Multiplying too many small numbers → float underflow
  ⇒ compute in the log domain and add logarithms instead
• Sentences can have different lengths:
  This is a good long sentence . </s>
  0.7 × 0.6 × 0.9 × 0.1 × 0.4 × 0.4 × 0.8 × 0.9 ≈ 0.004
  This </s>
  0.7 × 0.01 = 0.007
  ⇒ use the geometric mean instead of the raw product of probabilities
• Sorting candidates is expensive, asymptotically |V| log |V|:
  the k-best can be found in linear time, |V| ∼ 10⁴–10⁵
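A small sketch of the first two fixes combined: score hypotheses by the mean log-probability (the log of the geometric mean), so the comparison above comes out in favour of the longer sentence.

import math

def hypothesis_score(token_log_probs):
    # average log-probability = log of the geometric mean of the token probabilities
    return sum(token_log_probs) / len(token_log_probs)

short_hyp = [math.log(p) for p in (0.7, 0.01)]
long_hyp = [math.log(p) for p in (0.7, 0.6, 0.9, 0.1, 0.4, 0.4, 0.8, 0.9)]
print(hypothesis_score(long_hyp) > hypothesis_score(short_hyp))   # True: the long sentence wins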
Final Remarks
Brief history of the architectures
• 2013 First encoder-decoder model (Kalchbrenner and Blunsom, 2013)
• 2014 First really usable encoder-decoder model (Sutskever et al., 2014)
• 2014/2015 Added attention (crucial innovation in NLP) (Bahdanau et al., 2015)
• 2016/2017 WMT winners used RNN-based neural systems (Sennrich et al., 2016)
• 2017 Transformers invented (outperformed RNN) (Vaswani et al., 2017)
The development of architectures still goes on…
Document context, non-autoregressive models, multilingual models, …
Encoder-Decoder Models
Summary
• Encoder-decoder architecture = major paradigm in MT
• Encoder-decoder architecture = conditional language model
• Attention = way of conditioning the decoder on the encoder
• Attention = probabilistic vector retrieval
• We model probability, but need heuristics to get a good sentence
from the model
http://ufal.mff.cuni.cz/courses/npfl116
References I
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann
LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
2015.
Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, pages 1700–1709, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation,
pages 28–39, Vancouver, Canada, August 2017. Association for Computational Linguistics.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog., 2019.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on
Machine Translation: Volume 2, Shared Task Papers, pages 371–376, Berlin, Germany, August 2016. Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/W16-2323.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27,
pages 3104–3112, Montreal, Canada, December 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.