Generative Models
Introduction
Spectrum of Low-Labeled Learning
Supervised Learning
⬣ Train Input: 𝑋, 𝑌
⬣ Learning output: 𝑓 ∶ 𝑋 → 𝑌, 𝑃(𝑦|𝑥)
⬣ e.g. classification (example classes: Sheep, Dog, Cat, Lion, Giraffe)
Unsupervised Learning
⬣ Input: 𝑋
⬣ Learning output: 𝑃(𝑥)
⬣ Example: Clustering, density estimation, etc.
Less Labels
Unsupervised Learning
Supervised Learning:
⬣ Classification: x → y, y discrete
⬣ Regression: x → y, y continuous
Unsupervised Learning:
⬣ Clustering: x → c, c discrete
⬣ Dimensionality Reduction: x → z, z continuous
⬣ Density Estimation: x → p(x), p(x) on the simplex
What to Learn?
Traditional unsupervised learning methods, and their deep learning counterparts (same goals, approached from a neural network/learning perspective):
⬣ Modeling P(x): density estimation → Deep Generative Models
⬣ Comparing/Grouping: clustering → Metric learning & clustering
⬣ Representation Learning: Principal Component Analysis → Almost all deep learning!
Discriminative models model 𝑷(𝒚|𝒙)
⬣ Example: Model this via neural network, SVM, etc.
Generative models model 𝑷(𝒙)
Generative Models
Discriminative vs. Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Discriminative models model 𝑷(𝒚|𝒙)
⬣ Example: Model this via neural network, SVM, etc.
Generative models model 𝑷(𝒙)
⬣ We can parameterize our model as 𝑷(𝒙; 𝜽) and use maximum likelihood to optimize the parameters given an unlabeled dataset:
$$\theta^* = \arg\max_\theta \sum_i \log P(x_i; \theta)$$
⬣ They are called generative because they can often generate samples
⬣ Example: Multivariate Gaussian with estimated parameters 𝝁, 𝝈 (see the sketch below)
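To make the Gaussian example concrete, here is a minimal NumPy sketch (not from the slides; the toy data and variable names are illustrative) that fits a multivariate Gaussian by maximum likelihood, whose solution is just the sample mean and covariance, and then generates new samples from the fitted model.

```python
import numpy as np

# Toy unlabeled dataset: 1000 points in 2 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(1000, 2))

# Maximum likelihood estimates for a multivariate Gaussian are simply the
# sample mean and sample covariance.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# "Generative" part: draw new samples from the fitted model P(x; mu, sigma).
new_samples = rng.multivariate_normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat, new_samples, sep="\n")
```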
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Discriminative vs. Generative Models
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
PixelRNN &
PixelCNN
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Factorizing P(x)
We can use chain rule to decompose the joint distribution
⬣ Factorizes joint distribution into a product of conditional distributions
⬣ Similar to Bayesian Network (factorizing a joint distribution)
⬣ Similar to language models!
⬣ Requires some ordering of variables (edges in a probabilistic graphical
model)
⬣ We can estimate this conditional distribution as a neural network
Oord et al., Pixel Recurrent Neural Networks
$$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Modeling Language as a Sequence
$$p(w_1, w_2, \ldots, w_T) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_T \mid w_{T-1}, \ldots, w_1) = \prod_t p(\underbrace{w_t}_{\text{next word}} \mid \underbrace{w_{t-1}, \ldots, w_1}_{\text{history}})$$
Language Models as an RNN
[Figure: an RNN unrolled over time, producing one output per input token]
⬣ Language modeling involves estimating a probability distribution over
sequences of words.
$$p(w_1, w_2, \ldots, w_T) = \prod_t p(\underbrace{w_t}_{\text{next word}} \mid \underbrace{w_{t-1}, \ldots, w_1}_{\text{history}})$$
⬣ RNNs are a family of neural architectures for modeling sequences.
Factorized Models for Images
$$p(x) = p(x_1) \prod_{i=2}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Oord et al., Pixel Recurrent Neural Networks
$$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Factorized Models for Images
$$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \prod_{i=4}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
Training:
⬣ We can train similar to language models:
Teacher/student forcing
⬣ Maximum likelihood approach
Downsides:
⬣ Slow sequential generation process (see the sampling sketch below)
⬣ Only considers few context pixels
Oord et al., Pixel Recurrent Neural Networks
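To see why generation is sequential and slow, here is a hedged sketch (assuming a hypothetical `model` that maps a partially generated image to per-pixel logits; the interface is illustrative, not the paper's): each pixel requires a full forward pass, in raster order.

```python
import torch

def sample_image(model, height=28, width=28, num_values=256):
    """Autoregressive sampling: one pixel at a time, in raster-scan order.

    `model` is assumed (hypothetically) to map a partially generated image of
    shape (1, 1, H, W) to logits of shape (1, num_values, H, W).
    """
    img = torch.zeros(1, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = model(img)                              # conditions on pixels sampled so far
            probs = torch.softmax(logits[0, :, i, j], dim=-1)
            value = torch.multinomial(probs, num_samples=1).item()
            img[0, 0, i, j] = value / (num_values - 1)       # write the sampled pixel back
    return img  # H * W forward passes in total, hence slow generation
```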
Pixel CNN
Oord et al., Conditional Image Generation with PixelCNN Decoders
⬣ Idea: Represent conditional
distribution as a convolution
layer!
⬣ Considers larger context
(receptive field)
⬣ Practically can be implemented by applying a mask, zeroing out “future” pixels (see the sketch after this list)
⬣ Faster training but still slow
generation
⬣ Limited to smaller images
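A minimal sketch of the masking idea from the bullets above (a simplified take, not the exact PixelCNN implementation from the paper): a standard Conv2d whose kernel is multiplied by a mask that zeroes out weights on "future" pixels in raster order; mask type "A" also hides the center pixel (first layer), type "B" keeps it (later layers).

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d with weights masked so each output only sees 'past' pixels."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                        # (out_ch, in_ch, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0     # same row: center/future columns
        mask[:, :, kH // 2 + 1:, :] = 0                            # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                              # zero out "future" weights
        return super().forward(x)

# Usage sketch: first layer uses mask "A", later layers mask "B".
layer = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))
```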
Example Results: Image Completion (PixelRNN)
Oord et al., Conditional Image Generation with PixelCNN Decoders
Example Images (PixelCNN)
Oord et al., Conditional Image Generation with PixelCNN Decoders
Generative
Adversarial
Networks
(GANs)
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Implicit Models
Implicit generative models do not actually learn an explicit model for 𝒑 𝒙
Instead, learn to generate samples from 𝒑 𝒙
⬣ Learn good feature representations
⬣ Perform data augmentation
⬣ Learn world models (a simulator!) for reinforcement learning
How?
⬣ Learn to sample from a neural network output
⬣ Adversarial training that uses one network’s predictions to train
the other (dynamic loss function!)
⬣ Lots of tricks to make the optimization more stable
Learning to Sample
We would like to sample from 𝒑 𝒙 using a neural network
Idea:
⬣ Sample from a simple distribution (Gaussian)
⬣ Transform the sample to 𝒑 𝒙
[Diagram: samples from $N(\mu, \sigma)$ → Neural Network → samples from $p(x)$]
Generating Images
⬣ Input can be a vector with (independent) Gaussian random numbers
⬣ We can use a CNN to generate images! (see the generator sketch below)
[Diagram: vector of random numbers from $N(\mu, \sigma)$ → Neural Network (Generator) → sample from $p(x)$]
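As a hedged illustration of this slide's idea (noise vector in, image out), a small DCGAN-style generator built from transposed convolutions might look like the sketch below; the layer widths and output size are illustrative, not from any specific paper.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a vector of Gaussian noise z to a 32x32 grayscale image."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),     # 4x4 -> 8x8
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),      # 8x8 -> 16x16
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),       # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))  # reshape noise to a 1x1 "image"

z = torch.randn(16, 100)         # vector of Gaussian random numbers
fake_images = Generator()(z)     # -> (16, 1, 32, 32)
```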
Adversarial Networks
⬣ Goal: We would like to generate realistic images. How can we drive the
network to learn how to do this?
⬣ Idea: Have another network try to distinguish a real image from a generated
(fake) image
⬣ Why? The discriminator’s signal tells the generator how well it is doing at generation
[Diagram: vector of random numbers from $N(\mu, \sigma)$ → Generator network → generated sample from $p(x)$; real or generated image → Discriminator → “Real or Fake?”]
Generative Adversarial Networks (GANs)
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Question: What loss functions can we use (for each network)?
⬣ Generator: Update weights to improve
realism of generated images
⬣ Discriminator: Update weights to better
discriminate
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
Mini-max Two Player Game
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
⬣ The full mini-max objective is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]$$
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
Mini-max Two Player Game
(Annotations on the objective: the generator minimizes the second term; $z$ is sampled from the fake/generated distribution, and the term measures how well the discriminator does, outputting 0 for fake.)
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
⬣ Since we have two networks competing, this is a mini-max two player game
⬣ Ties to game theory
⬣ Not clear what (even local) Nash equilibria are for this game
⬣ The full mini-max objective is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]$$
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
Mini-max Two Player Game
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
(Annotations on the objective: the discriminator maximizes both terms; the first term samples from real data and measures how well the discriminator does there, outputting 1 for real, while the second samples from fake and measures how well it does there, outputting 0 for fake.)
Discriminator Perspective
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
⬣ The discriminator wants to maximize this: $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
⬣ $D(x)$ is pushed up (to 1) because $x$ is a real image
⬣ $1 - D(G(z))$ is also pushed up to 1 (so that $D(G(z))$ is pushed down to 0)
⬣ In other words, the discriminator wants to classify real images as real (1) and fake images as fake (0)
Generator Perspective
⬣ where $D(x)$ is the probability ([0,1]) the discriminator outputs that the image is real
⬣ $x$ is a real image and $G(z)$ is a generated image
⬣ The generator wants to minimize this: $\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
⬣ $1 - D(G(z))$ is pushed down to 0 (so that $D(G(z))$ is pushed up to 1)
⬣ This means the generator is fooling the discriminator, i.e. succeeding at generating images the discriminator can’t distinguish from real
Generative Adversarial Networks (GANs)
Generator Loss Discriminator Loss
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Converting to Max-Max Game
The generator part of the objective, $\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$, does not have good gradient properties
⬣ High gradient when $D(G(z))$ is high (that is, the discriminator is wrong/fooled)
⬣ We want it to improve when samples are bad (discriminator is right), but there the gradient is small
Alternative objective: have the generator instead maximize
$$\mathbb{E}_{z \sim p(z)}[\log D(G(z))]$$
Plot from CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Final Algorithm Goodfellow, NeurIPS 2016 Generative Adversarial Nets
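The final algorithm referenced above (an image in the original slides) alternates discriminator and generator updates. Below is a minimal, hedged PyTorch sketch of one training step using the non-saturating generator loss from the previous slide; `G` and `D` (with `D` returning a single logit per image), the optimizers, and the real mini-batch are assumed to exist, and the details are illustrative rather than Goodfellow's exact procedure.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                      # don't backprop into G on this step
    d_loss = F.binary_cross_entropy_with_logits(D(real), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # --- Generator step: non-saturating loss, maximize log D(G(z)) ---
    z = torch.randn(batch, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
```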
Generative Adversarial Networks (GANs)
At the end, we have:
⬣ An implicit generative model!
⬣ Features from discriminator
Vector of
Random
Numbers
Generator Discriminator
Cross-entropy
(Real or Fake?)
We know the answer (self-supervised)
Mini-batch of real + fake data
Early Results
⬣ Low-resolution
images but look
decent!
⬣ Last column are
nearest
neighbor
matches in
dataset
Goodfellow, NeurIPS 2016 Generative Adversarial Nets
Difficulty in Training
Goodfellow, NeurIPS 2016 Generative Adversarial Nets
GANs are very difficult to train
due to the mini-max objective
Advancements include:
⬣ More stable architectures
⬣ Regularization methods to
improve optimization
⬣ Progressive growing/training
and scaling
DCGAN
Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Regularization
Kodali et al., On Convergence and Stability of GANs (also known as How to Train your DRAGAN)
Training GANs is difficult due to:
⬣ Minimax objective – For example, what if generator learns to
memorize training data (no variety) or only generates part of the
distribution?
⬣ Mode collapse – Capturing only some modes of distribution
Several theoretically-motivated regularization methods
⬣ Simple example: Add noise to real samples!
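That simple regularizer can be as small as the following hedged sketch (instance noise; `sigma` here is an assumed hyperparameter, typically annealed toward zero over training):

```python
import torch

def add_instance_noise(real: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Perturb real samples with Gaussian noise before the discriminator sees them."""
    return real + sigma * torch.randn_like(real)

# Usage inside the discriminator update:
#   d_loss_real = loss_fn(D(add_instance_noise(real_batch)), ones)
```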
Example Generated Images - BigGAN
Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis
Failure Examples - BigGAN
Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis
Video Generation
https://www.youtube.com/watch?v=PCBTZh41Ris
Summary
Generative Adversarial Networks (GANs)
can produce amazing images!
Several drawbacks
⬣ High-fidelity generation is computationally heavy to train
⬣ Training can be unstable
⬣ No explicit model for distribution
Large number of extensions:
⬣ GANs conditioned on labels or other
information
⬣ Adversarial losses for other
applications
Variational
Autoencoders
(VAEs)
Generative Models
Goodfellow, NeurIPS 2016 Tutorial: Generative Adversarial Networks
Reminder: Autoencoders
Encoder Decoder
Low dimensional embedding
Minimize the difference (with MSE)
Linear layers with reduced
dimension or Conv-2d
layers with stride
Linear layers with increasing
dimension or Conv-2d layers
with bilinear upsampling
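As a reminder in code, a minimal (hypothetical) convolutional autoencoder in the spirit of this slide (strided convolutions to reduce dimension, bilinear upsampling to expand it, MSE reconstruction loss) might look like this sketch; layer widths are illustrative.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: strided convs reduce spatial size, then a linear bottleneck.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),
        )
        # Decoder: linear layer back up, then bilinear upsampling + convs.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),             # 7x7 -> 14x14
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 1, 3, padding=1),                         # 14x14 -> 28x28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(8, 1, 28, 28)
model = AutoEncoder()
loss = nn.functional.mse_loss(model(x), x)   # minimize the reconstruction difference
```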
Formalizing the Generative Model
What is this?
Hidden/Latent variables
Factors of variation that
produce an image:
(digit, orientation, scale, etc.)
$$P(X) = \int P(X \mid Z; \theta)\, P(Z)\, dZ$$
⬣ We cannot maximize this likelihood due to the integral
⬣ Instead we maximize a variational lower bound (VLB) that we can
compute
Kingma & Welling, Auto-Encoding Variational Bayes
𝑍
Variational Autoencoder: Decoder
⬣ We can combine the probabilistic view, sampling, autoencoders, and approximate
optimization
⬣ Just as before, sample 𝑍 from simpler distribution
⬣ We can also output parameters of a probability
distribution!
⬣ Example: 𝜇, 𝜎 of Gaussian distribution
⬣ For multi-dimensional version output diagonal
covariance
⬣ How can we maximize $P(X) = \int P(X \mid Z; \theta)\, P(Z)\, dZ$?
[Diagram: $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Variational Autoencoder: Encoder
⬣ We can combine the probabilistic view, sampling, autoencoders, and approximate
optimization
⬣ Given an image, estimate 𝑍
⬣ Again, output parameters of a
distribution
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z$]
Putting Them Together
We can tie the encoder and decoder together into a probabilistic autoencoder
⬣ Given data $X$, estimate $\mu_z, \sigma_z$ and sample $Z$ from $N(\mu_z, \sigma_z)$
⬣ Given $Z$, estimate $\mu_x, \sigma_x$ and sample the reconstruction from $N(\mu_x, \sigma_x)$
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Maximizing Likelihood
⬣ How can we optimize the parameters of the two networks?
Now equipped with our encoder and decoder networks, let’s work
out the (log) data likelihood:
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
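The derivation stepped through on the next few slides (the equations themselves are images in the original deck) is the standard decomposition of the log-likelihood from Kingma & Welling; it is reconstructed here for reference:

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x \mid z)\right] - KL\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{variational lower bound (ELBO), tractable: maximize this}} + \underbrace{KL\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)}_{\ge\, 0,\ \text{intractable}}$$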
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
KL-Divergence
Aside: KL divergence (a distance measure for distributions) is always ≥ 0:
$$KL(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$
(the last equality is just the definition of expectation)
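A tiny numerical sketch (purely illustrative) checking the non-negativity claim for two discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))   # KL(p || q) = H(p, q) - H(p)
print(kl_pq)                        # ~0.085, >= 0; equals 0 only when p == q
```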
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
The expectation w.r.t. $z$ (using the encoder network) lets us write nice KL terms
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
⬣ The decoder network gives $p_\theta(x \mid z)$, so we can compute an estimate of this term through sampling (the sampling is made differentiable via the reparameterization trick; see the paper).
⬣ This KL term (between the encoder’s Gaussian and the $z$ prior) has a nice closed-form solution (shown after this slide’s bullets).
⬣ $p_\theta(z \mid x)$ is intractable (as we saw earlier), so we can’t compute this KL term, but we know KL divergence is always ≥ 0.
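For reference, the closed-form KL mentioned above, between the encoder's diagonal Gaussian and the standard normal prior, is the standard expression (summing over latent dimensions $j$):

$$KL\big(N(\mu, \operatorname{diag}(\sigma^2)) \,\|\, N(0, I)\big) = \frac{1}{2} \sum_j \left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)$$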
Maximizing Likelihood
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z$]
Putting it all together: maximizing the likelihood lower bound
Make approximate
posterior distribution
close to prior
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Putting it all together: maximizing the likelihood lower bound
Sample $Z$ from $Q(Z \mid X; \phi) = N(\mu_z, \sigma_z)$
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
Forward and Backward Passes
[Diagram: $X \to$ Encoder $Q(Z \mid X; \phi) \to \mu_z, \sigma_z \to$ sample $Z \to$ Decoder $P(X \mid Z; \theta) \to \mu_x, \sigma_x$]
Putting it all together: maximizing the likelihood lower bound
Sample the reconstruction $\hat{X}$ from $P(X \mid Z; \theta) = N(\mu_x, \sigma_x)$
Maximize likelihood of
original input being
reconstructed
From CS231n, Fei-Fei Li, Justin Johnson, Serena Yeung
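Putting the forward pass above into code, a minimal hedged sketch of the VAE training objective (reconstruction term plus the closed-form KL, with the reparameterization trick that the next slides cover) might look like this; the `encoder` and `decoder` modules are assumed, and using MSE corresponds to a fixed-variance Gaussian likelihood:

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    # Encoder outputs the parameters of Q(Z|X): mean and log-variance.
    mu_z, logvar_z = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = torch.randn_like(mu_z)
    z = mu_z + torch.exp(0.5 * logvar_z) * eps

    # Decoder reconstructs X from Z; maximize likelihood of the input
    # (MSE reconstruction = Gaussian likelihood with fixed variance).
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # Closed-form KL between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar_z - mu_z.pow(2) - logvar_z.exp())

    return recon + kl   # negative ELBO; minimize this
```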
Reparameterization Trick: Problem
Tutorial on Variational Autoencoders
https://arxiv.org/abs/1606.05908
http://gokererdogan.github.io/2016/07/01/reparameterization-trick/
⬣ Problem with respect to the VLB:
updating 𝜙
⬣ 𝑍~𝑄(𝑍|𝑋; 𝜙) : need to differentiate
through the sampling process w.r.t 𝜙
(encoder is probabilistic)
Reparameterization Trick: Solution
⬣ Solution: make the randomness
independent of encoder output, making
the encoder deterministic
⬣ Gaussian distribution example:
⬣ Previously: encoder output =
random variable 𝑧~𝑁(𝜇, 𝜎)
⬣ Now encoder output = distribution
parameter [𝜇, 𝜎]
⬣ $z = \mu + \sigma \cdot \epsilon$, with $\epsilon \sim N(0,1)$ (see the gradient-flow sketch after the links below)
Tutorial on Variational Autoencoders
https://arxiv.org/abs/1606.05908
http://gokererdogan.github.io/2016/07/01/reparameterization-trick/
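A minimal sketch of the bullet above: with the reparameterized sample, gradients flow through μ and σ while the randomness comes from an external ε.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
sigma = torch.tensor([1.2], requires_grad=True)

eps = torch.randn(1)          # randomness lives outside the computation graph
z = mu + eps * sigma          # deterministic function of (mu, sigma) given eps

z.sum().backward()
print(mu.grad, sigma.grad)    # dz/dmu = 1, dz/dsigma = eps
```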
Interpretability of Latent Vector
Kingma & Welling, Auto-Encoding Variational Bayes
𝒛𝟏
𝒛𝟐
Summary
Variational Autoencoders (VAEs) provide a
principled way to perform approximate
maximum likelihood optimization
⬣ Requires some assumptions (e.g.
Gaussian distributions)
Samples are often not as competitive as
GANs
Latent features (learned in an unsupervised
way!) often good for downstream tasks:
⬣ Example: World models for reinforcement
learning (Ha et al., 2018)
Ha & Schmidhuber, World Models, 2018
