RNNs and
LSTMs
Simple Recurrent Networks
(RNNs or Elman Nets)
Modeling Time in Neural Networks
Language is inherently temporal
Yet the simple NLP classifiers we've seen (for example
for sentiment analysis) mostly ignore time
• (Feedforward neural LMs (and the transformers we'll
see later) use a "moving window" approach to time.)
Here we introduce a deep learning architecture with a
different way of representing time
• RNNs and their variants like LSTMs
Recurrent Neural Networks (RNNs)
Any network that contains a cycle within its network
connections.
The value of some unit is directly, or indirectly,
dependent on its own earlier outputs as an input.
Simple Recurrent Nets (Elman nets)
xt
yt
ht
The hidden layer has a recurrence as part of its input
The activation value ht depends on xt but also ht-1!
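In equations, one standard formulation of the simple RNN (g and f are activation functions; W, U, and V are the weight matrices named in the training slides below):

$$
h_t = g(U h_{t-1} + W x_t) \qquad y_t = f(V h_t)
$$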
Forward inference in simple RNNs
Very similar to the feedforward networks we've seen!
Simple recurrent neural network illustrated as a
feedforward network
Inference has to be incremental
Computing h at time t requires that we have first computed h at the
previous time step!
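A minimal sketch of this incremental computation in Python/numpy, with toy sizes and random weights (all names and dimensions here are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                     # toy input and hidden sizes
W = rng.normal(size=(d_h, d_in))     # input -> hidden weights
U = rng.normal(size=(d_h, d_h))      # previous hidden -> hidden weights
xs = rng.normal(size=(5, d_in))      # a length-5 input sequence

h = np.zeros(d_h)                    # h_0
for x_t in xs:
    # h_t depends on x_t and on h_{t-1}, so the steps must run in order
    h = np.tanh(W @ x_t + U @ h)
    print(h)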
Training in simple RNNs
Just like feedforward training:
• training set,
• a loss function,
• backpropagation
Weights that need to be updated:
• W, the weights from the input layer to the hidden layer,
• U, the weights from the previous hidden layer to the current hidden layer,
• V, the weights from the hidden layer to the output layer.
Training in simple RNNs: unrolling in time
Unlike feedforward networks:
1. To compute the loss for the output at time t we need the hidden layer from
time t − 1.
2. The hidden layer at time t influences the output at time t and the hidden
layer at time t+1 (and hence the output and loss at t+1).
So: to measure the error accruing to ht, we
• need to know its influence on both the current output and the ones that
follow.
Unrolling in time (2)
We unroll a recurrent network into a feedforward
computational graph, eliminating the recurrence:
1. Given an input sequence,
2. Generate an unrolled feedforward network specific to that input,
3. Use the graph to train the weights directly via ordinary backprop (or
to do forward inference)
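A minimal sketch (PyTorch, toy sizes, a made-up squared-error loss purely for illustration) of how writing the recurrence as an ordinary loop produces an unrolled computational graph that ordinary backprop can train:

import torch

torch.manual_seed(0)
d_in, d_h = 3, 4
W = torch.randn(d_h, d_in, requires_grad=True)   # input -> hidden
U = torch.randn(d_h, d_h, requires_grad=True)    # hidden -> hidden
V = torch.randn(1, d_h, requires_grad=True)      # hidden -> (scalar) output

xs = torch.randn(5, d_in)                        # a length-5 input sequence
targets = torch.randn(5)                         # toy targets

h = torch.zeros(d_h)
loss = torch.tensor(0.0)
for x_t, y_t in zip(xs, targets):
    h = torch.tanh(W @ x_t + U @ h)              # one unrolled time step
    y_hat = V @ h
    loss = loss + (y_hat - y_t).pow(2).sum()     # toy loss, for illustration only

loss.backward()                                  # ordinary backprop through the unrolled graph
# W.grad, U.grad, V.grad now hold gradients accumulated across all time steps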
RNNs and
LSTMs
Simple Recurrent Networks
(RNNs or Elman Nets)
RNNs and
LSTMs
RNNs as Language Models
Reminder: Language Modeling
The size of the conditioning context for different LMs
The n-gram LM:
Context size is the n − 1 prior words we condition on.
The feedforward LM:
Context is the window size.
The RNN LM:
No fixed context size; ht-1 represents the entire history
FFN LMs vs RNN LMs
Forward inference in the RNN LM
Given input X of N tokens represented as one-hot vectors
Use embedding matrix to get the embedding for current token xt
Combine …
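One standard way to write the combination step (consistent with the shapes listed on the next slide; E is the embedding matrix and g an activation such as tanh):

$$
e_t = E\,x_t \qquad h_t = g(U h_{t-1} + W e_t) \qquad \hat{y}_t = \mathrm{softmax}(V h_t)
$$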
Shapes
e_t : d × 1    W : d × d    U : d × d    h_{t−1} : d × 1    h_t : d × 1    V : |V| × d    ŷ_t : |V| × 1
Computing the probability that the next word is word k
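In symbols (standard formulation): the model's estimate is the k-th component of the output distribution,

$$
P(w_{t+1} = k \mid w_1,\dots,w_t) = \hat{y}_t[k]
$$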
Training RNN LM
• Self-supervision
• take a corpus of text as training material
• at each time step t
• ask the model to predict the next word.
• Why called self-supervised: we don't need human labels;
the text is its own supervision signal
• We train the model to
• minimize the error
• in predicting the true next word in the training sequence,
• using cross-entropy as the loss function.
Cross-entropy loss
The difference between:
• a predicted probability distribution
• the correct distribution.
CE loss for LMs is simpler!
• the correct distribution yt is a one-hot vector over the vocabulary
• where the entry for the actual next word is 1, and all the other entries are 0.
• So the CE loss for LMs is determined only by the probability of the next word.
• So at time t, the CE loss is:
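In symbols, the negative log probability the model assigns to the true next word in the training sequence:

$$
L_{CE}(\hat{y}_t, y_t) = -\log \hat{y}_t[w_{t+1}]
$$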
Teacher forcing
We always give the model the correct history to predict the next word (rather
than feeding the model its possibly buggy guess from the prior time step).
This is called teacher forcing (in training we force the context to be correct,
based on the gold words)
What teacher forcing looks like:
• At word position t
• the model takes as input the correct word wt together with ht−1, computes a
probability distribution over possible next words
• That gives the loss for the next token wt+1
• Then we move on to the next word, ignore what the model predicted for the
next word, and instead use the correct word wt+1 along with the prior encoded
history to estimate the probability of token wt+2.
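A minimal sketch of teacher-forced loss computation for an RNN LM (Python/numpy; all dimensions, weights, and gold word ids are toy values invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
d, V_size = 8, 20                        # toy embedding size and vocabulary size
E = rng.normal(size=(d, V_size))         # embedding matrix
W = rng.normal(size=(d, d))              # input -> hidden
U = rng.normal(size=(d, d))              # hidden -> hidden
V_out = rng.normal(size=(V_size, d))     # hidden -> vocabulary logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

gold = [3, 7, 1, 12, 5]                  # gold word ids for one training sequence
h = np.zeros(d)
loss = 0.0
for t in range(len(gold) - 1):
    x_t = E[:, gold[t]]                  # teacher forcing: always feed the gold word w_t
    h = np.tanh(W @ x_t + U @ h)
    p = softmax(V_out @ h)               # distribution over the next word
    loss += -np.log(p[gold[t + 1]])      # CE loss: -log prob of the true next word w_{t+1}
print(loss / (len(gold) - 1))            # average per-token loss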
Weight tying
The input embedding matrix E and the final layer matrix V are similar:
• The columns of E represent the word embeddings for each
word in the vocab. E is [d x |V|]
• The final layer matrix V gives a score (logit) for each
word in the vocab. V is [|V| x d]
Instead of having separate E and V, we just tie them together,
using Eᵀ instead of V:
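With tying, the output step reuses the embedding matrix (a standard formulation):

$$
\hat{y}_t = \mathrm{softmax}(E^{\top} h_t)
$$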
RNNs and
LSTMs
RNNs as Language Models
RNNs and
LSTMs
RNNs for Sequences
RNNs for sequence labeling
Assign a label to each element of a sequence
Part-of-speech tagging
RNNs for sequence classification
Text classification
Instead of taking the last state, we could use some pooling function of all
the output states, such as mean pooling
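For example, with mean pooling (W_c here is a hypothetical classifier weight matrix, introduced only for this sketch):

$$
h_{mean} = \frac{1}{n}\sum_{i=1}^{n} h_i, \qquad y = \mathrm{softmax}(W_c\, h_{mean})
$$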
Autoregressive generation
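A minimal sketch of the autoregressive loop (Python/numpy, random untrained weights, so the output is gibberish; the point is only the sample-and-feed-back structure, in contrast with teacher forcing above):

import numpy as np

rng = np.random.default_rng(0)
d, V_size = 8, 20                        # toy sizes
E = rng.normal(size=(d, V_size))         # embeddings
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
V_out = rng.normal(size=(V_size, d))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(d)
w = 0                                    # hypothetical start-of-sequence token id
for t in range(10):
    h = np.tanh(W @ E[:, w] + U @ h)     # condition on the word generated so far
    p = softmax(V_out @ h)
    w = rng.choice(V_size, p=p)          # sample the next word ...
    print(w)                             # ... and feed it back in at the next step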
Stacked RNNs
Bidirectional RNNs
Bidirectional RNNs for classification
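In the standard setup, a bidirectional RNN runs one RNN left-to-right and another right-to-left and concatenates their hidden states at each position; for classification it is common to concatenate the final state of the forward pass with the final state of the backward pass and feed that combined vector to the classifier.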
RNNs and
LSTMs
RNNs for Sequences
RNNs and
LSTMs
The LSTM
Motivating the LSTM: dealing with distance
• It's hard to assign probabilities accurately when context is very far away:
• The flights the airline was canceling were full.
• Hidden layers are being forced to do two things:
• Provide information useful for the current decision,
• Update and carry forward information required for future decisions.
• Another problem: during backprop, we have to repeatedly multiply
gradients back through time, across many h's
• The "vanishing gradient" problem
The LSTM: Long short-term memory network
LSTMs divide the context management problem into two
subproblems:
• removing information no longer needed from the context,
• adding information likely to be needed for later decision making.
• LSTMs add:
• an explicit context layer
• neural circuits with gates to control the flow of information
Forget gate
Deletes information from the context that is no longer needed.
Regular passing of information
Add gate
Selecting information to add to the current context.
Add this to the modified context vector to get our new context vector.
Output gate
Decide what information is required for the current hidden state (as opposed to what
information needs to be preserved for future decisions).
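Putting the three gates together, one standard formulation of the LSTM update (bias terms omitted; σ is the sigmoid and ⊙ element-wise multiplication):

$$
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t) & k_t &= c_{t-1} \odot f_t \\
i_t &= \sigma(U_i h_{t-1} + W_i x_t) & g_t &= \tanh(U_g h_{t-1} + W_g x_t) \\
j_t &= g_t \odot i_t & c_t &= j_t + k_t \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t) & h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$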
The LSTM
Units: comparing the FFN, SRN, and LSTM computational units
RNNs and
LSTMs
The LSTM
RNNs and
LSTMs
The LSTM Encoder-Decoder
Architecture
Four architectures for NLP tasks with RNNs
3 components of an encoder-decoder
1. An encoder that accepts an input sequence, x1:n, and
generates a corresponding sequence of contextualized
representations, h1:n.
2. A context vector, c, which is a function of h1:n, and
conveys the essence of the input to the decoder.
3. A decoder, which accepts c as input and generates an
arbitrary-length sequence of hidden states h1:m, from which
a corresponding sequence of output states y1:m can be
obtained
Encoder-decoder
Encoder-decoder for translation
Regular language modeling
Encoder-decoder for translation
Let x be the source text plus a separator token <s>, and
y the target
Let x = The green witch arrived <s>
Let y = llegó la bruja verde
Encoder-decoder simplified
Encoder-decoder showing context
Encoder-decoder equations
g is a stand-in for some flavor of RNN
ŷt−1 is the embedding for the output sampled from the softmax at the previous step
ŷt is a vector of probabilities over the vocabulary, representing the probability of each
word occurring at time t. To generate text, we sample from this distribution ŷt.
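One common way to write these (superscripts e and d mark encoder and decoder states; in the simplest setup the context c is the encoder's final hidden state):

$$
c = h^{e}_{n}, \qquad h^{d}_{0} = c, \qquad h^{d}_{t} = g(\hat{y}_{t-1}, h^{d}_{t-1}, c), \qquad z_t = f(h^{d}_{t}), \qquad \hat{y}_t = \mathrm{softmax}(z_t)
$$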
Training the encoder-decoder with teacher forcing
RNNs and
LSTMs
The LSTM Encoder-Decoder
Architecture
RNNs and
LSTMs
LSTM Attention
Problem with passing context c only from end
Requiring the context c to be only the encoder’s final hidden state
forces all the information from the entire source sentence to pass
through this representational bottleneck.
Solution: attention
Instead of being taken from the last hidden state, the context is a
weighted average of all the hidden states of the encoder.
This weighted average is also informed by part of the decoder state:
the state of the decoder right before the current token i.
Attention
How to compute c?
We'll create a score that tells us how much to focus on each encoder
state, how relevant each encoder state is to the decoder state:
We'll normalize them with a softmax to create weights αij, which tell us
the relevance of encoder hidden state h^e_j to the decoder hidden state h^d_{i−1}.
And then use this to help create a weighted average:
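Written out, using dot-product attention as the simplest scoring choice:

$$
\mathrm{score}(h^{d}_{i-1}, h^{e}_{j}) = h^{d}_{i-1} \cdot h^{e}_{j}, \qquad
\alpha_{ij} = \mathrm{softmax}_j\big(\mathrm{score}(h^{d}_{i-1}, h^{e}_{j})\big), \qquad
c_i = \sum_j \alpha_{ij}\, h^{e}_{j}
$$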
Encoder-decoder with attention, focusing on the
computation of c
RNNs and
LSTMs
LSTM Attention


Editor's Notes

• #2 Recall that we applied feedforward networks to language modeling by having them look only at a fixed-size window of words, and then sliding this window over the input, making independent predictions along the way.
  • #6 ht−1 from prior time step is multiplied by U and then added to the current time step's feedforward Wxt
  • #15 RNNs don’t have the limited context problem that n-gram models have, or the fixed context that feedforward language models have, since the hidden state can in principle represent information about all of the preceding words all the way back to the beginning of the sequence. Here's an FFN language model and an RNN language model, showing that the RNN language model uses ht−1, the hidden state from the previous time step, as a representation of the past context.
• #20 So at time t the CE loss is the negative log probability the model assigns to the next word in the training sequence.
  • #33 Assigning a high probability to was following airline is straightforward since airline provides a strong local context for the singular agreement. However, assigning an appropriate probability to were is quite difficult, not only because the plural flights is quite distant, but also because the singular noun airline is closer in the intervening context. Ideally, a network should be able to retain the distant information about plural flights until it is needed, while still processing the intermediate parts of the sequence correctly.
• #35 The forget gate computes a weighted sum of the previous state's hidden layer and the current input and passes that through a sigmoid. This mask is then multiplied element-wise by the context vector to remove the information from the context that is no longer required.