InfoGAN
10/11/17
Zak Jost
Outline
• GANs
– Basics
– Architecture
– Training algorithm
– Results
– Problems
• InfoGAN
– Disentangled representations
– Basics
– Implementation
– Results
• Potential fraud applications
What’s a Generative Adversarial
Network?
Let’s break down the terms…
Generative Adversarial Network
• Most machine learning is focused on learning
Discriminative Models:
– Find the decision boundary line (i.e. on the left of line = dog, right
of line = cat)
– Given new data, what label am I?
– Are conditional probability distributions: P(y|X). The probability
of a label given the data.
• Generative Models:
– Find the distribution for the classes (i.e. the probability of dog/cat for
all images)
– Give me a new image of a cat/dog or tell me probability of cat/dog
for this X
– Are joint probability distributions: P(X,y). The joint probability of
all possible data points for each possible label.
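As a toy illustration of the P(y|X) vs. P(X,y) distinction (not from the slides; the feature, labels, and numbers below are made up), a discriminative model only needs the conditional for a given data point, while a generative model's joint distribution also lets you sample brand-new (X, y) pairs:

    import random

    # Made-up joint distribution P(X, y) over one binary feature X ("has whiskers")
    # and a label y in {"cat", "dog"}
    p_joint = {("whiskers", "cat"): 0.40, ("whiskers", "dog"): 0.05,
               ("no_whiskers", "cat"): 0.10, ("no_whiskers", "dog"): 0.45}

    # Discriminative view: P(y | X) -- what a classifier learns
    def p_label_given_x(x):
        norm = sum(v for (xi, _), v in p_joint.items() if xi == x)
        return {y: p_joint[(x, y)] / norm for y in ("cat", "dog")}

    # Generative view: sample a new (X, y) pair from the joint distribution
    def sample_example():
        pairs, weights = zip(*p_joint.items())
        return random.choices(pairs, weights=weights, k=1)[0]

    print(p_label_given_x("whiskers"))   # e.g. {'cat': 0.89, 'dog': 0.11}
    print(sample_example())              # e.g. ('no_whiskers', 'dog')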
Generative Adversarial Network
• Two player zero-sum game: the more I win, the more you
lose
• Minimax game:
– Each player chooses the action that minimizes their worst-case loss, i.e.
assumes the opponent will play its best counter-move
– This is opposed to maximizing your gain on the assumption that you will win
– AKA: maximize the minimum gain
• GANs create minimax game between two models:
generator and discriminator
Generative Adversarial Network
• Each model is a neural network
• Allows efficient updates of the parameters via the
backpropagation algorithm
Cliché “Counterfeiter” Example
(Diagram: the Generator model produces generated (fake) data; the Discriminator receives both the fake data and the real (training) data and decides: real or fake?)
How to Build a GAN
(Diagram) Noise z → Generator neural network → fake data G(z). Fake data G(z) and real data x → Discriminator neural network → probability the input is real: D(x) or D(G(z)).
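A minimal sketch of this two-network setup, assuming PyTorch and flattened MNIST-sized images (784 pixels); the layer sizes and the names G, D, latent_dim are illustrative choices, not from the slides:

    import torch
    import torch.nn as nn

    latent_dim = 100   # dimension of the noise vector z (illustrative choice)
    data_dim = 784     # e.g. a flattened 28x28 image

    # Generator: maps noise z to a fake data point G(z)
    G = nn.Sequential(
        nn.Linear(latent_dim, 256), nn.ReLU(),
        nn.Linear(256, data_dim), nn.Tanh(),   # outputs scaled to [-1, 1]
    )

    # Discriminator: maps a data point to the probability it is real, D(x)
    D = nn.Sequential(
        nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),
    )

    z = torch.randn(16, latent_dim)   # a batch of noise vectors
    fake = G(z)                       # G(z): fake data
    p_real = D(fake)                  # D(G(z)): probability the fake looks real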
Training Deep Dive
• Optimization Function
• Training algorithm
• Convergence Guarantees
Optimization Function
• D(x) is probability x is real
• G wants this small, D wants this large (i.e. Adversarial)
• Use stochastic gradient descent and back-propagate
min_G max_D V(D, G) = E_x~pdata[ log D(x) ] + E_z~pz[ log(1 − D(G(z))) ]
(first term: real data; second term: fake data)
Training Algorithm
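The slide's algorithm figure is not reproduced here; below is a minimal sketch of the alternating training procedure from the original GAN paper (k discriminator updates per generator update), assuming the illustrative PyTorch networks G, D, and latent_dim from the earlier sketch; the optimizer settings and the non-saturating generator loss are common practical choices, not prescribed by the slides:

    import torch

    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = torch.nn.BCELoss()

    def train_step(real_batch, k=1):
        batch_size = real_batch.size(0)
        ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

        # 1) Update D: push D(x) toward 1 (real) and D(G(z)) toward 0 (fake)
        for _ in range(k):
            z = torch.randn(batch_size, latent_dim)
            fake = G(z).detach()                  # don't backprop into G on this step
            loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # 2) Update G: push D(G(z)) toward 1 (the "non-saturating" generator loss)
        z = torch.randn(batch_size, latent_dim)
        loss_G = bce(D(G(z)), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
        return loss_D.item(), loss_G.item()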
Convergence Guarantees
• The original paper proves that this algorithm theoretically
converges to p_g = p_data—i.e. your generative model is a
perfect representation of your true data distribution
• In reality, we optimize the parameters of a neural network
instead of the distributions themselves, so the proof doesn’t
directly apply
GAN Results
Yellow boxes are real data samples that are the nearest matches to the last column of fake
images. This shows the generator didn’t merely memorize training examples.
Results
* From DCGAN paper:
https://arxiv.org/abs/1511.06434
Meaningful
Representations!
Reflection
• It’s worth stressing that this is an entirely unsupervised
technique—you don’t use the hand-written digit label to
learn how to write
– Labeled training data is often hard to come by, but unlabeled data
is plentiful
• This is useful because you could train a GAN in an
unsupervised fashion using a lot of images, and then
build a classifier with labeled data using the internal
layers of the GAN discriminator as input.
– You will need much less labeled training data if useful
representations have already been learned with unlabeled data
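As a rough illustration of that semi-supervised idea (not something the slides implement), assuming a trained PyTorch discriminator D like the earlier sketch, whose last hidden layer (256 units) is reused as a frozen feature extractor; all names and sizes here are illustrative:

    import torch
    import torch.nn as nn

    # Use everything except D's final classification layers as a frozen feature extractor
    feature_extractor = nn.Sequential(*list(D.children())[:-2])   # drop Linear(256, 1) + Sigmoid
    for p in feature_extractor.parameters():
        p.requires_grad = False

    # Small classifier trained on the few labeled examples available
    clf = nn.Linear(256, 10)   # 256 = size of D's last hidden layer, 10 classes
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def supervised_step(x_labeled, y_labeled):
        feats = feature_extractor(x_labeled)     # representations learned without labels
        loss = loss_fn(clf(feats), y_labeled)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()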
Problems
• Hard to train
– “Mode collapse”: Generator gives the same thing over
and over
– Lack of convergence: Discriminator and Generator
never learn
• Tangled representations
– Not clear how to change the input to the generator to
get meaningful changes in the fake examples
Solutions!
• Hard to train
– DCGAN paper gives architectural guidelines that stabilize model
training (see the sketch after this list)
– Convolutional layers; batch normalization in both networks; replace pooling
layers with strided convolutions to learn spatial up/down-sampling;
remove fully connected layers; use ReLU/Leaky ReLU activation
functions
– Many other papers address this problem
• Entangled Representations
– InfoGAN
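A minimal sketch of a DCGAN-style generator following those guidelines (strided transposed convolutions instead of pooling/upsampling layers, batch norm, ReLU, no fully connected layers), assuming PyTorch and 64x64 single-channel images; the channel counts are illustrative:

    import torch.nn as nn

    # DCGAN-style generator: z is reshaped to (latent_dim, 1, 1) and upsampled to 64x64
    dcgan_G = nn.Sequential(
        nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(256), nn.ReLU(True),                        # 4x4
        nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(128), nn.ReLU(True),                        # 8x8
        nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(64), nn.ReLU(True),                         # 16x16
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.ReLU(True),                         # 32x32
        nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1, bias=False),
        nn.Tanh(),                                                 # 64x64 output in [-1, 1]
    )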
Entangled vs Disentangled
• To get a fake output, our only knob to turn is the generator input
noise
• What if you want a particular output, like the number 7 or a face
with sunglasses?
• It’s not clear how to modify the Generator input z to get the desired
results because the representation is entangled
(Figure: entangled vs. disentangled latent space)
InfoGAN
• Learns disentangled representations in an unsupervised
manner
• This is accomplished by splitting the generator input into
two parts: the noise, z and the latent code, c
• The codes are made meaningful by maximizing the
Mutual Information between the code and the generator
output
– See appendix for Mutual Info 101
Latent Code Intuition
• You know handwritten digits will be one of ten numbers, so why not
try to encode this structure?
• You could re-assign one of the z’s to be a 10-state discrete variable
• The hope is that all of the digit information is represented by this
one variable
• Maximizing the mutual information between this structured input
and the output will ensure it has meaning
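As a rough sketch of that idea (assuming PyTorch; the split between noise and code dimensions is an illustrative choice), the generator input becomes a concatenation of continuous noise z and a one-hot 10-state code c:

    import torch
    import torch.nn.functional as F

    noise_dim, code_dim = 62, 10            # 62 + 10 = 72-dim generator input (illustrative)

    def sample_generator_input(batch_size):
        z = torch.randn(batch_size, noise_dim)                  # unstructured noise
        c_idx = torch.randint(0, code_dim, (batch_size,))       # uniform prior over 10 states
        c = F.one_hot(c_idx, num_classes=code_dim).float()      # one-hot latent code
        return torch.cat([z, c], dim=1), c_idx                  # input for G(z, c), plus the true code

    gen_input, true_code = sample_generator_input(16)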
How to Implement?
• Adds a regularizer to the GAN minimax game to maximize Mutual Information:
    min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))
• The mutual information term needs the posterior P(c|x), which is hard to compute,
so InfoGAN uses variational arguments to get a lower bound with an auxiliary
distribution Q(c|x), a parameterized neural network that approximates P(c|x):
    I(c; G(z, c)) ≥ L_I(G, Q) = E_c~P(c), x~G(z,c)[ log Q(c|x) ] + H(c)
(H(c) is the entropy of the prior over codes: easy to compute, and constant with
respect to the optimization, so not important)
• Final form:
    min_G,Q max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q)
InfoGAN Architecture
• Q(c|x) tries to recover the code from the input. In reality, it’s just a fully
connected layer attached to the end of the discriminator D.
(Diagram) Noise z and code c → Generator neural network → fake data G(z, c). Fake data G(z, c) and real data x → Discriminator neural network → probability the input is real: D(x) or D(G(z, c)). A Q neural network head on the discriminator → estimation of c.
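A minimal sketch of that shared-trunk layout, assuming PyTorch; the layer sizes are illustrative, and the Q head here produces logits for a 10-state discrete code:

    import torch
    import torch.nn as nn

    class InfoGANDiscriminator(nn.Module):
        """Shared trunk with two heads: D (real/fake) and Q (recover the code c)."""
        def __init__(self, data_dim=784, code_dim=10):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
            )
            self.d_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())   # D(x)
            self.q_head = nn.Linear(256, code_dim)                         # logits for Q(c|x)

        def forward(self, x):
            h = self.trunk(x)
            prob_real = self.d_head(h)        # probability the input is real
            code_logits = self.q_head(h)      # softmax over these = Q(c|x)
            return prob_real, code_logits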
Q Updates
• Pick a c; Pick a z; Calculate G(c,z); Calculate Q(c|x=G(c,z));
– If using a discrete code, Q network outputs softmax (i.e. prob(0) = 0.1, prob(1) =
0.05, prob(2) = 0.8, …, prob(9) = .01)
– If using a continuous code, Q network outputs sufficient statistics, like mean and
standard deviation of a normal distribution—your choice how to model it.
• Once you know the probability (or parameters of the distribution from which
you can calculate the probability), you can calculate log-likelihood:
log Q(c|X)
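A sketch of that log-likelihood term, assuming PyTorch and the illustrative networks above: for a discrete code, log Q(c|x) at the true code is just the negative cross-entropy of the Q head's softmax; for a continuous code, it is the log-density of, e.g., a Gaussian whose sufficient statistics the Q network outputs:

    import math
    import torch
    import torch.nn.functional as F

    def discrete_mi_loss(code_logits, true_code):
        # softmax(code_logits) is Q(c|x); cross-entropy at the true code = -log Q(c|x),
        # so minimizing this loss maximizes E[log Q(c|x)]
        return F.cross_entropy(code_logits, true_code)

    def continuous_mi_loss(c, mean, log_std):
        # Q outputs a Gaussian's sufficient statistics; log Q(c|x) is its log-density at c
        var = torch.exp(2 * log_std)
        log_q = -0.5 * (math.log(2 * math.pi) + torch.log(var)) - (c - mean) ** 2 / (2 * var)
        return -log_q.sum(dim=1).mean()       # minimizing = maximizing E[log Q(c|x)]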
Results
Appendix: Mutual Information 101
• I(X;Y) = H(X) – H(X|Y)
– How much is the uncertainty in X reduced if you know Y? AKA: How much
information about X is in Y?
– If X and Y are independent, H(X|Y) = H(X) => I(X;Y) = 0.
• Example: X = whether it’s raining outside, Y = whether it’s dark when you wake up
– If it rains 28% of the time and does not rain 72% of the time:
– H(X) = H(0.28, 0.72) = 0.86 bits
– Let p(x, y) be:

              Dark
    Rain      No      Yes
    No        0.70    0.02
    Yes       0.08    0.20

– H(X|Y) = 0.47 bits
– Mutual Information = 0.86 − 0.47 = 0.39 bits
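A short check of those numbers in plain Python (nothing here beyond the table above):

    import math

    def h(probs):
        """Shannon entropy in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Joint distribution p(rain, dark) from the table
    p = {("no", "no"): 0.70, ("no", "yes"): 0.02,
         ("yes", "no"): 0.08, ("yes", "yes"): 0.20}

    p_rain = {r: sum(v for (ri, d), v in p.items() if ri == r) for r in ("no", "yes")}
    p_dark = {d: sum(v for (r, di), v in p.items() if di == d) for d in ("no", "yes")}

    H_rain = h(p_rain.values())                                  # ~0.86 bits
    H_rain_given_dark = sum(
        p_dark[d] * h([p[(r, d)] / p_dark[d] for r in ("no", "yes")])
        for d in ("no", "yes")
    )                                                            # ~0.47 bits
    mutual_info = H_rain - H_rain_given_dark                     # ~0.39 bits
    print(H_rain, H_rain_given_dark, mutual_info)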
