Generative Models
Introduction
Spectrum of Low-Labeled Learning
Supervised Learning
⬣ Train Input: 𝑋, 𝑌
⬣ Learning output: 𝑓 ∶ 𝑋 → 𝑌, 𝑃(𝑦|𝑥)
⬣ e.g. classification (example classes: Sheep, Dog, Cat, Lion, Giraffe)
Unsupervised Learning
⬣ Input: 𝑋
⬣ Learning output: 𝑃(𝑥)
⬣ Example: Clustering, density estimation, etc.
Less Labels
Unsupervised Learning
Supervised Learning:
⬣ Classification: x → y, y discrete
⬣ Regression: x → y, y continuous
Unsupervised Learning:
⬣ Clustering: x → c, c discrete
⬣ Dimensionality Reduction: x → z, z continuous
⬣ Density Estimation: x → p(x), p(x) on the simplex
What to Learn?
Traditional unsupervised learning methods, and their deep learning counterparts (same goals, approached from a neural network/learning perspective):
⬣ Modeling P(x): density estimation → Deep Generative Models
⬣ Comparing/Grouping: clustering → Metric learning & clustering
⬣ Representation Learning: Principal Component Analysis → Almost all deep learning!
Discriminative models model 𝑷(𝒚|𝒙)
⬣ Example: Model this via neural network, SVM, etc.
Generative models model 𝑷(𝒙)
Generative Models
Discriminative vs. Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Discriminative models model 𝑷(𝒚|𝒙)
⬣ Example: Model this via neural network, SVM, etc.
Generative models model 𝑷(𝒙)
⬣ We can parameterize our model as 𝑷(𝒙; 𝜽) and use maximum likelihood to optimize the parameters given an unlabeled dataset:
$$\theta^* = \arg\max_\theta \sum_i \log P(x_i; \theta)$$
⬣ They are called generative because they can often generate samples
⬣ Example: Multivariate Gaussian with estimated parameters 𝝁, 𝝈 (see the sketch below)
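To make the Gaussian example concrete, here is a minimal NumPy sketch (not from the slides; the toy data and variable names are illustrative) that fits a multivariate Gaussian by maximum likelihood, whose solution is just the sample mean and covariance, and then generates new samples from the fitted model.

```python
import numpy as np

# Toy unlabeled dataset: 1000 points in 2 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(1000, 2))

# Maximum likelihood estimates for a multivariate Gaussian are simply the
# sample mean and sample covariance.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# "Generative" part: draw new samples from the fitted model P(x; mu, sigma).
new_samples = rng.multivariate_normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat, new_samples, sep="\n")
```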
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Discriminative vs. Generative Models
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
PixelRNN &
PixelCNN
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Factorizing P(x)
We can use chain rule to decompose the joint distribution
⬣ Factorizes joint distribution into a product of conditional distributions
⬣ Similar to Bayesian Network (factorizing a joint distribution)
⬣ Similar to language models!
⬣ Requires some ordering of variables (edges in a probabilistic graphical
model)
⬣ We can estimate this conditional distribution as a neural network
Oord et al., Pixel Recurrent Neural Networks
$$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Modeling Language as a Sequence
$$p(w_1, w_2, \ldots, w_T) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_T \mid w_{T-1}, \ldots, w_1) = \prod_t p(\underbrace{w_t}_{\text{next word}} \mid \underbrace{w_{t-1}, \ldots, w_1}_{\text{history}})$$
Language Models as an RNN
[Figure: an RNN unrolled over time, producing one output per input token]
⬣ Language modeling involves estimating a probability distribution over
sequences of words.
$$p(w_1, w_2, \ldots, w_T) = \prod_t p(\underbrace{w_t}_{\text{next word}} \mid \underbrace{w_{t-1}, \ldots, w_1}_{\text{history}})$$
⬣ RNNs are a family of neural architectures for modeling sequences.
Factorized Models for Images
$$p(x) = p(x_1) \prod_{i=2}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Oord et al., Pixel Recurrent Neural Networks
$$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Factorized Models for Images
$$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \prod_{i=4}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Training:
⬣ We can train similar to language models:
Teacher/student forcing
⬣ Maximum likelihood approach
Downsides:
⬣ Slow sequential generation process (see the sampling sketch below)
⬣ Only considers few context pixels
Oord et al., Pixel Recurrent Neural Networks
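To see why generation is sequential and slow, here is a hedged sketch (assuming a hypothetical `model` that maps a partially generated image to per-pixel logits; the interface is illustrative, not the paper's): each pixel requires a full forward pass, in raster order.

```python
import torch

def sample_image(model, height=28, width=28, num_values=256):
    """Autoregressive sampling: one pixel at a time, in raster-scan order.

    `model` is assumed (hypothetically) to map a partially generated image of
    shape (1, 1, H, W) to logits of shape (1, num_values, H, W).
    """
    img = torch.zeros(1, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = model(img)                              # conditions on pixels sampled so far
            probs = torch.softmax(logits[0, :, i, j], dim=-1)
            value = torch.multinomial(probs, num_samples=1).item()
            img[0, 0, i, j] = value / (num_values - 1)       # write the sampled pixel back
    return img  # H * W forward passes in total, hence slow generation
```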
Pixel CNN
Oord et al., Conditional Image Generation with PixelCNN Decoders
⬣ Idea: Represent conditional
distribution as a convolution
layer!
⬣ Considers larger context
(receptive field)
⬣ Practically can be implemented by applying a mask, zeroing out “future” pixels (see the sketch after this list)
⬣ Faster training but still slow
generation
⬣ Limited to smaller images
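A minimal sketch of the masking idea from the bullets above (a simplified take, not the exact PixelCNN implementation from the paper): a standard Conv2d whose kernel is multiplied by a mask that zeroes out weights on "future" pixels in raster order; mask type "A" also hides the center pixel (first layer), type "B" keeps it (later layers).

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d with weights masked so each output only sees 'past' pixels."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                        # (out_ch, in_ch, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0     # same row: center/future columns
        mask[:, :, kH // 2 + 1:, :] = 0                            # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                              # zero out "future" weights
        return super().forward(x)

# Usage sketch: first layer uses mask "A", later layers mask "B".
layer = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))
```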
Example Results: Image Completion (PixelRNN)
Oord et al., Conditional Image Generation with PixelCNN Decoders
Example Images (PixelCNN)
Oord et al., Conditional Image Generation with PixelCNN Decoders
Generative
Adversarial
Networks
(GANs)
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Implicit Models
Implicit generative models do not actually learn an explicit model for 𝒑 𝒙
Instead, learn to generate samples from 𝒑 𝒙
⬣ Learn good feature representations
⬣ Perform data augmentation
⬣ Learn world models (a simulator!) for reinforcement learning
How?
⬣ Learn to sample from a neural network output
⬣ Adversarial training that uses one network’s predictions to train
the other (dynamic loss function!)
⬣ Lots of tricks to make the optimization more stable
Learning to Sample
We would like to sample from 𝒑 𝒙 using a neural network
Idea:
⬣ Sample from a simple distribution (Gaussian)
⬣ Transform the sample to 𝒑 𝒙
[Diagram: samples from $N(\mu, \sigma)$ → Neural Network → samples from $p(x)$]
Generating Images
⬣ Input can be a vector with (independent) Gaussian random numbers
⬣ We can use a CNN to generate images! (see the generator sketch below)
[Diagram: vector of random numbers from $N(\mu, \sigma)$ → Neural Network (Generator) → sample from $p(x)$]
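As a hedged illustration of this slide's idea (noise vector in, image out), a small DCGAN-style generator built from transposed convolutions might look like the sketch below; the layer widths and output size are illustrative, not from any specific paper.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a vector of Gaussian noise z to a 32x32 grayscale image."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),     # 4x4 -> 8x8
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),      # 8x8 -> 16x16
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),       # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))  # reshape noise to a 1x1 "image"

z = torch.randn(16, 100)         # vector of Gaussian random numbers
fake_images = Generator()(z)     # -> (16, 1, 32, 32)
```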
Adversarial Networks
⬣ Goal: We would like to generate realistic images. How can we drive the
network to learn how to do this?
⬣ Idea: Have another network try to distinguish a real image from a generated
(fake) image
⬣ Why? The discriminator’s signal tells the generator how well it is doing at generation
[Diagram: vector of random numbers from $N(\mu, \sigma)$ → Generator network → generated sample from $p(x)$; real or generated image → Discriminator → “Real or Fake?”]
Generative Adversarial Networks (GANs)
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Question: What loss functions can we use (for each network)?
⬣ Generator: Update weights to improve
realism of generated images
⬣ Discriminator: Update weights to better
discriminate
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
Mini-max Two Player Game
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
⬣ The full mini-max objective is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]$$
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
Mini-max Two Player Game
(Annotations on the objective: the generator minimizes the second term; $z$ is sampled from the fake/generated distribution, and the term measures how well the discriminator does, outputting 0 for fake.)
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
⬣ The full mini-max objective is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]$$
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
Mini-max Two Player Game
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
(Annotations on the objective: the discriminator maximizes both terms; the first term samples from real data and measures how well the discriminator does there, outputting 1 for real, while the second samples from fake and measures how well it does there, outputting 0 for fake.)
Discriminator Perspective
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
⬣ The discriminator wants to maximize this: $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
⬣ $D(x)$ is pushed up (to 1) because $x$ is a real image
⬣ $1 - D(G(z))$ is also pushed up to 1 (so that $D(G(z))$ is pushed down to 0)
⬣ In other words, the discriminator wants to classify real images as real (1) and fake images as fake (0)
Generator Perspective
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
⬣ The generator wants to minimize this: $\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
⬣ $1 - D(G(z))$ is pushed down to 0 (so that $D(G(z))$ is pushed up to 1)
⬣ This means the generator is fooling the discriminator, i.e. succeeding at generating images the discriminator can’t distinguish from real
Generative Adversarial Networks (GANs)
Generator Loss Discriminator Loss
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Converting to Max-Max Game
The generator part of the objective, $\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$, does not have good gradient properties
⬣ High gradient when $D(G(z))$ is high (that is, the discriminator is wrong/fooled)
⬣ We want it to improve when samples are bad (discriminator is right), but there the gradient is small
Alternative objective: have the generator instead maximize
$$\mathbb{E}_{z \sim p(z)}[\log D(G(z))]$$
Plot from CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Final Algorithm Goodfellow, NeurIPS 2016 Generative Adversarial Nets
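The final algorithm referenced above (an image in the original slides) alternates discriminator and generator updates. Below is a minimal, hedged PyTorch sketch of one training step using the non-saturating generator loss from the previous slide; `G` and `D` (with `D` returning a single logit per image), the optimizers, and the real mini-batch are assumed to exist, and the details are illustrative rather than Goodfellow's exact procedure.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                      # don't backprop into G on this step
    d_loss = F.binary_cross_entropy_with_logits(D(real), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # --- Generator step: non-saturating loss, maximize log D(G(z)) ---
    z = torch.randn(batch, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
```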
Generative Adversarial Networks (GANs)
At the end, we have:
⬣ An implicit generative model!
⬣ Features from discriminator
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Early Results
⬣ Low-resolution
images but look
decent!
⬣ Last column are
nearest
neighbor
matches in
dataset
Goodfellow, NeurIPS 2016 Generative Adversarial Nets
Difficulty in Training
Goodfellow, NeurIPS 2016 Generative Adversarial Nets
GANs are very difficult to train
due to the mini-max objective
Advancements include:
⬣ More stable architectures
⬣ Regularization methods to
improve optimization
⬣ Progressive growing/training
and scaling
DCGAN
Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Regularization
Kodali et al., On Convergence and Stability of GANs (also known as How to Train your DRAGAN)
Training GANs is difficult due to:
⬣ Minimax objective – For example, what if generator learns to
memorize training data (no variety) or only generates part of the
distribution?
⬣ Mode collapse – Capturing only some modes of distribution
Several theoretically-motivated regularization methods
⬣ Simple example: Add noise to real samples!
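That simple regularizer can be as small as the following hedged sketch (instance noise; `sigma` here is an assumed hyperparameter, typically annealed toward zero over training):

```python
import torch

def add_instance_noise(real: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Perturb real samples with Gaussian noise before the discriminator sees them."""
    return real + sigma * torch.randn_like(real)

# Usage inside the discriminator update:
#   d_loss_real = loss_fn(D(add_instance_noise(real_batch)), ones)
```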
Example Generated Images - BigGAN
Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis
Failure Examples - BigGAN
Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis
Video Generation
https://www.youtube.com/watch?v=PCBTZh41Ris
Summary
Generative Adversarial Networks (GANs)
can produce amazing images!
Several drawbacks
⬣ High-fidelity generation is computationally heavy to train
⬣ Training can be unstable
⬣ No explicit model for distribution
Large number of extensions:
⬣ GANs conditioned on labels or other
information
⬣ Adversarial losses for other
applications
Variational
Autoencoders
(VAEs)
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Reminder: Autoencoders
Encoder Decoder
Low dimensional embedding
Minimize the difference (with MSE)
Linear layers with reduced
dimension or Conv-2d
layers with stride
Linear layers with increasing
dimension or Conv-2d layers
with bilinear upsampling
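As a reminder in code, a minimal (hypothetical) convolutional autoencoder in the spirit of this slide (strided convolutions to reduce dimension, bilinear upsampling to expand it, MSE reconstruction loss) might look like this sketch; layer widths are illustrative.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: strided convs reduce spatial size, then a linear bottleneck.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),
        )
        # Decoder: linear layer back up, then bilinear upsampling + convs.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),             # 7x7 -> 14x14
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 1, 3, padding=1),                         # 14x14 -> 28x28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(8, 1, 28, 28)
model = AutoEncoder()
loss = nn.functional.mse_loss(model(x), x)   # minimize the reconstruction difference
```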
Formalizing the Generative Model
What is this?
Hidden/Latent variables
Factors of variation that
produce an image:
(digit, orientation, scale, etc.)
$$P(X) = \int P(X \mid Z; \theta)\, P(Z)\, dZ$$
⬣ We cannot maximize this likelihood due to the integral
⬣ Instead we maximize a variational lower bound (VLB) that we can
compute
Kingma & Welling, Auto-Encoding Variational Bayes
𝑍
Variational Autoencoder: Decoder
⬣ We can combine the probabilistic view, sampling, autoencoders, and approximate
optimization
⬣ Just as before, sample 𝑍 from simpler distribution
⬣ We can also output parameters of a probability
distribution!
⬣ Example: 𝜇, 𝜎 of Gaussian distribution
⬣ For multi-dimensional version output diagonal
covariance
⬣ How can we maximize $P(X) = \int P(X \mid Z; \theta)\, P(Z)\, dZ$?
[Diagram: $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Variational Autoencoder: Encoder
⬣ We can combine the probabilistic view, sampling, autoencoders, and approximate
optimization
⬣ Given an image, estimate 𝑍
⬣ Again, output parameters of a
distribution
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z$]
Putting Them Together
We can tie the encoder and decoder together into a probabilistic autoencoder
⬣ Given data $X$, estimate $\mu_z, \sigma_z$ and sample $Z$ from $N(\mu_z, \sigma_z)$
⬣ Given $Z$, estimate $\mu_x, \sigma_x$ and sample the reconstruction from $N(\mu_x, \sigma_x)$
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Maximizing Likelihood
⬣ How can we optimize the parameters of the two networks?
Now equipped with our encoder and decoder networks, let’s work
out the (log) data likelihood:
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
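The derivation stepped through on the next few slides (the equations themselves are images in the original deck) is the standard decomposition of the log-likelihood from Kingma & Welling; it is reconstructed here for reference:

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x \mid z)\right] - KL\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{variational lower bound (ELBO), tractable: maximize this}} + \underbrace{KL\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)}_{\ge\, 0,\ \text{intractable}}$$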
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
KL-Divergence
Aside: KL divergence (a distance measure for distributions) is always ≥ 0:
$$KL(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$
(the last equality is just the definition of expectation)
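A tiny numerical sketch (purely illustrative) checking the non-negativity claim for two discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))   # KL(p || q) = H(p, q) - H(p)
print(kl_pq)                        # ~0.085, >= 0; equals 0 only when p == q
```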
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
The expectation w.r.t. $z$ (using the encoder network) lets us write nice KL terms
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
⬣ The decoder network gives $p_\theta(x \mid z)$, so we can compute an estimate of this term through sampling (the sampling is made differentiable via the reparameterization trick; see the paper).
⬣ This KL term (between the encoder’s Gaussian and the $z$ prior) has a nice closed-form solution (shown after this slide’s bullets).
⬣ $p_\theta(z \mid x)$ is intractable (as we saw earlier), so we can’t compute this KL term, but we know KL divergence is always ≥ 0.
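For reference, the closed-form KL mentioned above, between the encoder's diagonal Gaussian and the standard normal prior, is the standard expression (summing over latent dimensions $j$):

$$KL\big(N(\mu, \operatorname{diag}(\sigma^2)) \,\|\, N(0, I)\big) = \frac{1}{2} \sum_j \left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)$$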
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z$]
Putting it all together: maximizing the likelihood lower bound
Make approximate
posterior distribution
close to prior
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Putting it all together: maximizing the likelihood lower bound
Sample $Z$ from $Q(Z \mid X; \phi) = N(\mu_z, \sigma_z)$
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Putting it all together: maximizing the likelihood lower bound
Sample the reconstruction $\hat{X}$ from $P(X \mid Z; \theta) = N(\mu_x, \sigma_x)$
Maximize likelihood of
original input being
reconstructed
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
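Putting the forward pass above into code, a minimal hedged sketch of the VAE training objective (reconstruction term plus the closed-form KL, with the reparameterization trick that the next slides cover) might look like this; the `encoder` and `decoder` modules are assumed, and using MSE corresponds to a fixed-variance Gaussian likelihood:

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    # Encoder outputs the parameters of Q(Z|X): mean and log-variance.
    mu_z, logvar_z = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = torch.randn_like(mu_z)
    z = mu_z + torch.exp(0.5 * logvar_z) * eps

    # Decoder reconstructs X from Z; maximize likelihood of the input
    # (MSE reconstruction = Gaussian likelihood with fixed variance).
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # Closed-form KL between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar_z - mu_z.pow(2) - logvar_z.exp())

    return recon + kl   # negative ELBO; minimize this
```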
Reparameterization Trick: Problem
Tutorial on Variational Autoencoders
https://arxiv.org/abs/1606.05908
http://gokererdogan.github.io/2016/07/01/reparameterization-trick/
⬣ Problem with respect to the VLB:
updating 𝜙
⬣ 𝑍~𝑄(𝑍|𝑋; 𝜙) : need to differentiate
through the sampling process w.r.t 𝜙
(encoder is probabilistic)
Reparameterization Trick: Solution
⬣ Solution: make the randomness
independent of encoder output, making
the encoder deterministic
⬣ Gaussian distribution example:
⬣ Previously: encoder output =
random variable 𝑧~𝑁(𝜇, 𝜎)
⬣ Now encoder output = distribution
parameter [𝜇, 𝜎]
⬣ $z = \mu + \sigma \cdot \epsilon$, with $\epsilon \sim N(0,1)$ (see the gradient-flow sketch after the links below)
Tutorial on Variational Autoencoders
https://arxiv.org/abs/1606.05908
http://gokererdogan.github.io/2016/07/01/reparameterization-trick/
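A minimal sketch of the bullet above: with the reparameterized sample, gradients flow through μ and σ while the randomness comes from an external ε.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
sigma = torch.tensor([1.2], requires_grad=True)

eps = torch.randn(1)          # randomness lives outside the computation graph
z = mu + eps * sigma          # deterministic function of (mu, sigma) given eps

z.sum().backward()
print(mu.grad, sigma.grad)    # dz/dmu = 1, dz/dsigma = eps
```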
Interpretability of Latent Vector
Kingma & Welling, Auto-Encoding Variational Bayes
𝒛𝟏
𝒛𝟐
Summary
Variational Autoencoders (VAEs) provide a
principled way to perform approximate
maximum likelihood optimization
⬣ Requires some assumptions (e.g.
Gaussian distributions)
Samples are often not as competitive as
GANs
Latent features (learned in an unsupervised
way!) often good for downstream tasks:
⬣ Example: World models for reinforcement
learning (Ha et al., 2018)
Ha & Schmidhuber, World Models, 2018
