Generative Adversarial Networks
Aaron Mishkin
UBC MLRG 2018W2
Generative Adversarial Networks
“Two imaginary celebrities that were dreamed up by a random
number generator.”
https://research.nvidia.com/publication/2017-10 Progressive-Growing-of
Why care about GANs?
Why spend your limited time learning about GANs:
• GANs are achieving state-of-the-art results in a large variety
of image generation tasks.
• There’s been a veritable explosion in GAN publications over
the last few years – many people are very excited!
• GANs are stimulating new theoretical interest in min-max
optimization problems and “smooth games”.
Why care about GANs: Hyper-realistic Image Generation
StyleGAN: image generation with hierarchical style transfer [3].
https://arxiv.org/abs/1812.04948
Why care about GANs: Conditional Generative Models
Conditional GANs: high-resolution image synthesis via semantic
labeling [8].
Input: Segmentation Output: Synthesized Image
https://research.nvidia.com/publication/2017-12 High-Resolution-Image-Synthesis
Why care about GANs: Image Super Resolution
SRGAN: Photo-realistic super-resolution [4].
Bicubic Interp. SRGAN Original Image
https://arxiv.org/abs/1609.04802
Why care about GANs: Publications
Approximately 500 GAN papers as of September 2018!
See https://github.com/hindupuravinash/the-gan-zoo for the exhaustive list of papers.
Generative Models
Generative Modeling
Generative Models estimate the probabilistic process that generated a set of observations D.
• D = {(x_i, y_i)}_{i=1}^n : supervised generative models learn the joint distribution p(x_i, y_i), often to compute p(y_i | x_i).
• D = {x_i}_{i=1}^n : unsupervised generative models learn the distribution of D for clustering, sampling, etc. We can:
  • directly estimate p(x_i), or
  • introduce latents y_i and estimate p(x_i, y_i).
Generative Modeling: Unsupervised Parametric Approaches
• Direct Estimation: Choose a parameterized family p(x | θ) and learn θ by maximizing the log-likelihood
\[
\theta^* = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).
\]
• Latent Variable Models: Define a joint distribution p(x, z | θ) and learn θ by maximizing the log-marginal likelihood
\[
\theta^* = \arg\max_{\theta} \sum_{i=1}^{n} \log \int_{z_i} p(x_i, z_i \mid \theta)\, dz_i.
\]
Both approaches require that p(x | θ) is easy to evaluate.
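As a concrete illustration of direct estimation (my example, not from the slides), the sketch below fits a univariate Gaussian family p(x | θ) = N(x; μ, σ²) by maximum likelihood; for this family the arg max has a closed form, so no iterative optimization is needed.

```python
import numpy as np

# Direct estimation for a Gaussian family p(x | theta) = N(x; mu, sigma^2).
# The maximizer of sum_i log p(x_i | theta) is the sample mean and the
# (biased) sample variance.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)    # observations D = {x_i}

mu_hat = data.mean()                                # arg max over mu
sigma2_hat = ((data - mu_hat) ** 2).mean()          # arg max over sigma^2

def log_likelihood(x, mu, sigma2):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

print(mu_hat, sigma2_hat, log_likelihood(data, mu_hat, sigma2_hat))
```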
Generative Modeling: Models for (Very) Complex Data
How can we learn such models for very complex data?
https://www.researchgate.net/figure/Heterogeneousness-and-diversity-of-the-CIFAR-10-entries-in-their-10-
Generative Modeling: Normalizing Flows and VAEs
Design parameterized densities with huge capacity!
• Normalizing flows: a sequence of non-linear transformations to a simple distribution pz(z):
\[
p(x \mid \theta_{0:k}) = p_z(z) \quad \text{where} \quad z = f_{\theta_k}^{-1} \circ \cdots \circ f_{\theta_1}^{-1} \circ f_{\theta_0}^{-1}(x).
\]
The f_{θ_j}^{-1} must be invertible with tractable log-det. Jacobians.
• VAEs: latent-variable models where inference networks specify parameters:
\[
p(x, y \mid \theta) = p(x \mid f_\theta(y))\, p_y(y).
\]
The marginal likelihood is maximized via the ELBO.
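To make the change-of-variables computation concrete, here is a minimal, hypothetical one-layer flow with an affine transformation; it evaluates p(x | θ) by inverting the map and adding the log-det Jacobian term whose tractability the bullet above requires.

```python
import numpy as np

# A one-layer "flow": x = f_theta(z) = a * z + b with standard normal base p_z.
# Density evaluation inverts the map and adds the log-det Jacobian of f^{-1}.
a, b = 2.0, -1.0                                      # hypothetical flow parameters

def log_prob_x(x):
    z = (x - b) / a                                   # z = f_theta^{-1}(x)
    log_pz = -0.5 * (np.log(2 * np.pi) + z ** 2)      # log N(z; 0, 1)
    log_det_jac = -np.log(np.abs(a))                  # log |d f^{-1} / dx|
    return log_pz + log_det_jac

print(log_prob_x(np.array([-1.0, 0.0, 1.0])))
```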
GANs
GANs: Density-Free Models
Generative Adversarial Networks (GANs) instead use an unrestricted generator Gθg(z) such that
\[
p(x \mid \theta_g) = p_z(\{z\}) \quad \text{where} \quad \{z\} = G_{\theta_g}^{-1}(x).
\]
• Problem: the inverse image of Gθg (z) may be huge!
• Problem: it’s likely intractable to preserve volume through
G(z; θg ).
So, we can’t evaluate p(x | θg ) and we can’t learn θg by maximum
likelihood.
GANs: Discriminators
GANs learn by comparing model samples with examples from D.
• Sampling from the generator is easy:
x̂ = Gθg (ẑ), where ẑ ∼ pz(z).
• Given a sample x̂, a discriminator tries to distinguish it from
true examples:
D(x) = Pr (x ∼ pdata) .
• The discriminator “supervises” the generator network.
GANs: Generator + Discriminator
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
GANs: Goodfellow et al. (2014)
• Let z ∈ R^m and pz(z) be a simple base distribution.
• The generator Gθg(z) : R^m → D̃ is a deep neural network.
• D̃ is the manifold of generated examples.
• The discriminator Dθd(x) : D ∪ D̃ → (0, 1) is also a deep neural network.
https://arxiv.org/abs/1511.06434
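For concreteness, a minimal sketch of the two networks as small fully-connected PyTorch models (not the DCGAN architecture linked above; the layer sizes and latent dimension are hypothetical):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 100, 28 * 28          # hypothetical sizes (flattened images)

# Generator G_{theta_g}: R^m -> generated-data manifold.
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),     # outputs scaled to [-1, 1]
)

# Discriminator D_{theta_d}: data space -> (0, 1).
D = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),         # Pr(x came from p_data)
)

z_hat = torch.randn(16, latent_dim)          # z ~ p_z(z), a simple base distribution
x_hat = G(z_hat)                             # samples from the generator
d_fake = D(x_hat)                            # discriminator's probability of "real"
```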
GANs: Saddle-Point Optimization
Saddle-Point Optimization: learn Gθg(z) and Dθd(x) jointly via the objective V(θd, θg):
\[
\min_{\theta_g} \max_{\theta_d} \;
\underbrace{\mathbb{E}_{p_{\text{data}}}\big[\log D_{\theta_d}(x)\big]}_{\text{likelihood of true data}}
+ \underbrace{\mathbb{E}_{p_z(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]}_{\text{likelihood of generated data}}
\]
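In code, the inner maximization is just logistic-regression-style training of the discriminator: V is the negated binary cross-entropy of a classifier with label 1 for data and label 0 for generated samples. A minimal sketch (my naming, assuming a discriminator D that outputs probabilities, like the one above):

```python
import torch

# Minibatch estimate of V(theta_d, theta_g): the discriminator ascends this value,
# the generator descends it. Equivalent to (negated) binary cross-entropy with
# labels 1 for real examples and 0 for generated samples.
def value_fn(D, x_real, x_fake, eps=1e-7):
    term_data = torch.log(D(x_real) + eps).mean()        # E_{p_data}[log D(x)]
    term_model = torch.log(1 - D(x_fake) + eps).mean()   # E_{p_z}[log(1 - D(G(z)))]
    return term_data + term_model
```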
GANs: Optimal Discriminators
Claim: Given Gθg defining an implicit distribution pg = p(x | θg), the optimal discriminator is
\[
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.
\]
Proof Sketch:
\[
\begin{aligned}
V(\theta_d, \theta_g) &= \int_{\mathcal{D}} p_{\text{data}}(x) \log D(x)\, dx
+ \int_{\tilde{\mathcal{D}}} p(z) \log\big(1 - D(G_{\theta_g}(z))\big)\, dz \\
&= \int_{\mathcal{D} \cup \tilde{\mathcal{D}}} p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big)\, dx
\end{aligned}
\]
Maximizing the integrand for all x is sufficient and gives the result (see bonus slides).
Previous Slide: https://commons.wikimedia.org/wiki/File:Saddle point.svg
GANs: Jensen-Shannon Divergence and Optimal Generators
Given an optimal discriminator D*(x), the generator objective is
\[
\begin{aligned}
C(\theta_g) &= \mathbb{E}_{p_{\text{data}}}\big[\log D^*_{\theta_d}(x)\big]
+ \mathbb{E}_{p_g(x)}\big[\log\big(1 - D^*_{\theta_d}(x)\big)\big] \\
&= \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right]
+ \mathbb{E}_{p_g(x)}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
&\propto \underbrace{\frac{1}{2}\,\mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right)
+ \frac{1}{2}\,\mathrm{KL}\!\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right)}_{\text{Jensen-Shannon Divergence}}
\end{aligned}
\]
C(θg) achieves its global minimum at pg = pdata given an optimal discriminator!
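A quick numeric check of this identity on discrete distributions (my own example, not from the slides): up to the additive constant −log 4, C(θg) equals twice the Jensen-Shannon divergence, and it bottoms out at −log 4 exactly when pg = pdata.

```python
import numpy as np

# C = E_{p_data}[log p_data/(p_data+p_g)] + E_{p_g}[log p_g/(p_data+p_g)]
#   = -log 4 + 2 * JSD(p_data || p_g), minimized when p_g = p_data.
def kl(p, q):
    return np.sum(p * np.log(p / q))

def c_value(p_data, p_g):
    s = p_data + p_g
    return np.sum(p_data * np.log(p_data / s)) + np.sum(p_g * np.log(p_g / s))

p_data = np.array([0.2, 0.5, 0.3])
for p_g in (np.array([0.6, 0.1, 0.3]), p_data):
    m = 0.5 * (p_data + p_g)
    jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
    print(c_value(p_data, p_g), -np.log(4) + 2 * jsd)   # equal; -log(4) when p_g = p_data
```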
GANs: Learning Generators and Discriminators
Putting these results to use in practice:
• High-capacity discriminators Dθd approximate the Jensen-Shannon divergence when close to the global maximum.
• Dθd is a "differentiable program".
• We can use Dθd to learn Gθg with our favourite gradient descent method.
https://arxiv.org/abs/1511.06434
GANs: Training Procedure
for i = 1 . . . N do
    for k = 1 . . . K do
        • Sample noise samples {z1, . . . , zm} ∼ pz(z).
        • Sample examples {x1, . . . , xm} from pdata(x).
        • Update the discriminator Dθd by ascending its stochastic gradient:
        \[
        \theta_d = \theta_d + \alpha_d \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m}
        \Big[ \log D(x_i) + \log\big(1 - D(G(z_i))\big) \Big].
        \]
    end for
    • Sample noise samples {z1, . . . , zm} ∼ pz(z).
    • Update the generator Gθg by descending its stochastic gradient:
    \[
    \theta_g = \theta_g - \alpha_g \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m}
    \log\big(1 - D(G(z_i))\big).
    \]
end for
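Below is a minimal runnable sketch of this procedure, assuming small fully-connected networks and a toy 2-D Gaussian standing in for pdata(x); the architecture, optimizer (Adam), and hyperparameters are my own choices, not those of the original papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, m = 8, 2, 64       # hypothetical sizes; m = minibatch size
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-7                               # numerical safety inside the logs

def sample_data(n):
    # toy stand-in for p_data(x): an offset 2-D Gaussian
    return 0.5 * torch.randn(n, data_dim) + torch.tensor([2.0, -1.0])

N, K = 1000, 1                           # outer iterations, discriminator steps per generator step
for i in range(N):
    for k in range(K):
        z = torch.randn(m, latent_dim)                        # z ~ p_z(z)
        x = sample_data(m)                                    # x ~ p_data(x)
        d_loss = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z)) + eps).mean())     # descend -V = ascend V
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    z = torch.randn(m, latent_dim)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()              # minimax generator loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```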
Problems (c. 2016)
Problems with GANs
• Vanishing gradients: the discriminator becomes "too good" and the generator gradient vanishes.
• Non-Convergence: the generator and discriminator oscillate
without reaching an equilibrium.
• Mode Collapse: the generator distribution collapses to a
small set of examples.
• Mode Dropping: the generator distribution doesn’t fully
cover the data distribution.
Problems: Vanishing Gradients
• The minimax objective saturates when Dθd is close to perfect:
\[
V(\theta_d, \theta_g) = \mathbb{E}_{p_{\text{data}}}\big[\log D_{\theta_d}(x)\big]
+ \mathbb{E}_{p_z(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big].
\]
• A non-saturating heuristic objective for the generator is
\[
J(G_{\theta_g}) = -\mathbb{E}_{p_z(z)}\big[\log D_{\theta_d}(G_{\theta_g}(z))\big].
\]
https://arxiv.org/abs/1701.00160
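The difference is easy to see numerically. In the sketch below (a hypothetical setup of my own), a confident discriminator assigns D(G(z)) ≈ 0 to the fakes; the minimax loss then has almost no gradient with respect to those outputs, while the heuristic loss still does.

```python
import torch

def minimax_g_loss(d_fake, eps=1e-7):
    return torch.log(1 - d_fake + eps).mean()    # E[log(1 - D(G(z)))], saturates near D = 0

def heuristic_g_loss(d_fake, eps=1e-7):
    return -torch.log(d_fake + eps).mean()       # J(G) = -E[log D(G(z))]

d_fake = torch.full((4,), 1e-3, requires_grad=True)   # discriminator is nearly sure: fake
minimax_g_loss(d_fake).backward()
print(d_fake.grad.abs().max())   # ~0.25: barely any signal for the generator
d_fake.grad = None
heuristic_g_loss(d_fake).backward()
print(d_fake.grad.abs().max())   # ~250: strong signal where the generator needs it
```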
Problems: Addressing Vanishing Gradients
Solutions:
• Change Objectives: use the non-saturating heuristic
objective, maximum-likelihood cost, etc.
• Limit Discriminator: restrict the capacity of the
discriminator.
• Schedule Learning: try to balance training Dθd and Gθg.
Problems: Non-Convergence
Simultaneous gradient descent is not guaranteed to converge for
minimax objectives.
• Goodfellow et al. only showed convergence when updates are
made in the function space [2].
• The parameterization of Dθd and Gθg results in a highly non-convex objective.
• In practice, training tends to oscillate – updates “undo” each
other.
Problems: Addressing Non-Convergence
Solutions: Lots and lots of hacks!
https://github.com/soumith/ganhacks
Problems: Mode Collapse and Mode Dropping
One Explanation: SGD may optimize the max-min objective
\[
\max_{\theta_d} \min_{\theta_g} \; \mathbb{E}_{p_{\text{data}}}\big[\log D_{\theta_d}(x)\big]
+ \mathbb{E}_{p_z(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]
\]
Intuition: the generator maps all z values to the x̂ that is most likely to fool the discriminator.
https://arxiv.org/abs/1701.00160
A Possible Solution
A Possible Solution: Alternative Divergences
There is a large variety of divergence measures for distributions:
• f-Divergences: (e.g. Jensen-Shannon, Kullback-Leibler)
\[
D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
\]
• GANs [2], f-GANs [7], and more.
• Integral Probability Metrics: (e.g. Earth Mover's Distance, Maximum Mean Discrepancy)
\[
\gamma_{\mathcal{F}}(P \,\|\, Q) = \sup_{f \in \mathcal{F}} \int f \, dP - \int f \, dQ
\]
• Wasserstein GANs [1], Fisher GANs [6], Sobolev GANs [5], and more.
A Possible Solution: Wasserstein GANs
Wasserstein GANs: Strong theory and excellent empirical results.
• “In no experiment did we see evidence of mode collapse for
the WGAN algorithm.” [1]
https://arxiv.org/abs/1701.07875
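Under the IPM view, the discriminator becomes a critic f constrained to a function class F, and the log-loss disappears. A rough sketch of the weight-clipping WGAN update from [1] is below; the function names and clipping constant are illustrative, and clipping is only a crude way to keep the critic (roughly) inside the constraint set.

```python
import torch

def critic_step(critic, opt_c, x_real, x_fake, clip=0.01):
    # Ascend E_{p_data}[f(x)] - E_{p_g}[f(G(z))]; the critic outputs raw scores, no sigmoid.
    loss = -(critic(x_real).mean() - critic(x_fake.detach()).mean())
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)          # weight clipping, as in the original WGAN
    return loss

def generator_step(critic, G, opt_g, z):
    loss = -critic(G(z)).mean()            # raise the critic's score on generated samples
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss
```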
Summary
Summary
Recap:
• GANs are a class of density-free generative models with
(mostly) unrestricted generator functions.
• Introducing adversarial discriminator networks allows GANs to learn by minimizing the Jensen-Shannon divergence.
• Concurrently learning the generator and discriminator is challenging due to:
• Vanishing Gradients,
• Non-convergence due to oscillation,
• Mode collapse and mode dropping.
• A variety of alternative objective functions are being proposed.
Acknowledgements and References
There are lots of excellent references on GANs:
• Sebastian Nowozin’s presentation at MLSS 2018.
• NIPS 2016 tutorial on GANs by Ian Goodfellow.
• A nice explanation of Wasserstein GANs by Alex Irpan.
Bonus: Optimal Discriminators Cont.
The integrand
\[
h(D(x)) = p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))
\]
is concave for D(x) ∈ (0, 1). We take the derivative and compute a stationary point in the domain:
\[
\frac{\partial h(D(x))}{\partial D(x)} = \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0
\;\Rightarrow\; D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.
\]
This maximizes the integrand over the domain of the discriminator, completing the proof.
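A one-line numerical sanity check (hypothetical densities at a single point x):

```python
import numpy as np

# With p_data(x) = 0.7 and p_g(x) = 0.3, the maximizer of
# h(D) = p_data * log D + p_g * log(1 - D) should be 0.7 / (0.7 + 0.3) = 0.7.
p_data, p_g = 0.7, 0.3
D = np.linspace(1e-4, 1 - 1e-4, 100001)
h = p_data * np.log(D) + p_g * np.log(1 - D)
print(D[np.argmax(h)], p_data / (p_data + p_g))   # both ~ 0.7
```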
References i
Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein GAN.
arXiv preprint arXiv:1701.07875, 2017.
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative adversarial networks.
arXiv preprint arXiv:1406.2661, 2014.
Tero Karras, Samuli Laine, and Timo Aila.
A style-based generator architecture for generative adversarial
networks.
arXiv preprint arXiv:1812.04948, 2018.
References ii
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew
Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes
Totz, Zehan Wang, et al.
Photo-realistic single image super-resolution using a generative
adversarial network.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4681–4690, 2017.
Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng.
Sobolev GAN.
arXiv preprint arXiv:1711.04894, 2017.
Youssef Mroueh and Tom Sercu.
Fisher GAN.
In Advances in Neural Information Processing Systems, pages 2513–2523,
2017.
References iii
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka.
f-GAN: Training generative neural samplers using variational
divergence minimization.
In Advances in Neural Information Processing Systems, pages 271–279,
2016.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz,
and Bryan Catanzaro.
High-resolution image synthesis and semantic manipulation with
conditional gans.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 8798–8807, 2018.
