From Autoencoder to Variational Autoencoder
1
Hao Dong
Peking University
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
From Autoencoder to Variational Autoencoder
Feature Representation
Distribution Representation
2
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
3
Vanilla Autoencoder
• What is it?
Reconstruct high-dimensional data using a neural network with a narrow bottleneck layer.
The bottleneck layer captures the compressed latent code, so a nice by-product is dimension reduction.
The low-dimensional representation can be used as the representation of the data in various applications, e.g., image retrieval, data compression …
(Figure: 𝑥 → Encoder → 𝑧 → Decoder → 𝑥̂, trained with reconstruction loss ℒ)
4
Latent code: the compressed low-dimensional representation of the input data
Vanilla Autoencoder
• How it works?
(Figure: Input 𝑥 → encoder (X → Z) → latent code 𝑧 → decoder/generator (Z → X) → Reconstructed Input 𝑥̂, with reconstruction loss ℒ)
Ideally the input and the reconstruction are identical
The encoder network is for dimension reduction, just like PCA
5
Vanilla Autoencoder
• Training
(Figure: input layer 𝑥1…𝑥6 → hidden layer 𝑎1…𝑎4 → output layer 𝑥̂1…𝑥̂6)
• The number of hidden units is usually less than the number of inputs
• Dimension reduction --- Representation learning
The distance between the input and its reconstruction can be measured by the Mean Squared Error (MSE):
\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - G(E(x_i))\right)^2
where n is the number of variables, E is the encoder and G is the decoder.
• It tries to learn an approximation to the identity function, so that the input is "compressed" into the latent features, discovering interesting structure about the data.
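A minimal PyTorch sketch of this training setup (the layer sizes, learning rate, and the stand-in data_loader are illustrative assumptions, not part of the slides):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaAutoencoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())     # E: X -> Z
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())  # G: Z -> X

    def forward(self, x):
        z = self.encoder(x)           # compressed latent code
        return self.decoder(z), z     # reconstruction and latent code

model = VanillaAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data_loader = [torch.rand(32, 784) for _ in range(10)]   # stand-in for a real MNIST loader

for x in data_loader:
    x_hat, _ = model(x)
    loss = F.mse_loss(x_hat, x)       # reconstruction loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()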
Vanilla Autoencoder
• Testing/Inferencing
(Figure: input layer 𝑥1…𝑥6 → hidden layer 𝑎1…𝑎4; the hidden activations are the extracted features)
• Autoencoder is an unsupervised learning method if we consider the latent code as the "output".
• Autoencoder is also a self-supervised (self-taught)
learning method which is a type of supervised
learning where the training labels are
determined by the input data.
• Word2Vec (from RNN lecture) is another
unsupervised, self-taught learning example.
Autoencoder for the MNIST dataset (28×28×1, 784 pixels)
(Figure: MNIST input 𝑥 → Encoder → 𝑥̂)
Vanilla Autoencoder
• Example:
• Compress MNIST (28x28x1) to the latent code with only 2 variables
Lossy
8
9
Vanilla Autoencoder
• Power of Latent Representation
• t-SNE visualization on MNIST: PCA vs. Autoencoder
(Figure panels: Autoencoder (winner) vs. PCA; from the 2006 Science paper by Hinton and Salakhutdinov)
Vanilla Autoencoder
11
• Discussion
• The hidden layer is overcomplete if it is larger than the input layer
• No compression
• No guarantee that the hidden units extract meaningful features
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
12
Denoising Autoencoder (DAE)
13
• Why?
• Avoid overfitting
• Learn robust representations
Denoising Autoencoder
• Architecture
(Figure: input 𝑥1…𝑥6 with dropout applied → hidden layer 𝑎1…𝑎4 → output layer 𝑥̂1…𝑥̂6; Encoder + Decoder)
Applying dropout between the input and the first hidden layer
• Improve the robustness
14
Denoising Autoencoder
• Feature Visualization
Visualizing the learned features: one neuron == one feature extractor
(Figure: the weights of each hidden neuron 𝑎𝑗 reshaped into an image)
15
Denoising Autoencoder
16
• Denoising Autoencoder & Dropout
Denoising autoencoder was proposed in 2008, 4 years before the dropout paper (Hinton, et al. 2012). Denoising autoencoder can be seen as applying dropout between the input and the first layer.
Denoising autoencoder can also be seen as one type of data augmentation on the input.
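A minimal sketch of this idea, reusing the VanillaAutoencoder sketch from the vanilla-autoencoder section (the corruption rate of 0.3 is an illustrative choice):

import torch
import torch.nn.functional as F

def denoising_step(model, x, optimizer, drop_prob=0.3):
    # Corrupt the input: randomly zero out entries, i.e. dropout on the input layer.
    mask = (torch.rand_like(x) > drop_prob).float()
    x_noisy = x * mask
    x_hat, _ = model(x_noisy)         # encode and decode the corrupted input
    loss = F.mse_loss(x_hat, x)       # ...but reconstruct the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()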
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
17
Sparse Autoencoder
• Why?
• Even when the number of hidden units
is large (perhaps even greater than the
number of input pixels), we can still
discover interesting structure, by
imposing other constraints on the network.
• In particular, if we impose a "sparsity"
constraint on the hidden units,
then the autoencoder will still
discover interesting structure in the data,
even if the number of hidden units is
large.
(Figure: input layer 𝑥1…𝑥6 → hidden layer 𝑎1…𝑎4 with Sigmoid activations; example outputs: 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active")
Sparse Autoencoder
• Recap: KL Divergence
Smaller == Closer
19
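For reference, the standard definition being recalled here (not spelled out on the slide):

D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log \frac{P(x)}{Q(x)} \;\ge\; 0, \qquad D_{KL}(P \,\|\, Q) = 0 \iff P = Q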
20
Sparse Autoencoder
• Sparsity Regularization
(Figure: input layer 𝑥1…𝑥6 → hidden layer 𝑎1…𝑎4 with Sigmoid activations; example outputs: 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active")
Given M data samples (batch size) and the Sigmoid activation function, the active ratio of a neuron a_j is:
\hat{\rho}_j = \frac{1}{M}\sum_{m=1}^{M} a_j^{(m)}
To make the output "sparse", we would like to enforce the following constraint, where ρ is a "sparsity parameter", such as 0.2 (20% of the neurons):
\hat{\rho}_j = \rho
The penalty term is as follows, where s is the number of activation outputs:
\mathcal{L}_\rho = \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s} \left( \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right)
The total loss:
\mathcal{L}_{total} = \mathcal{L}_{MSE} + \lambda \mathcal{L}_\rho
The number of hidden units can be greater than the number of input variables.
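A sketch of the sparsity penalty in PyTorch (the sparsity target rho = 0.2, the weight lam, and the assumption that model.encoder ends in a Sigmoid are illustrative):

import torch
import torch.nn.functional as F

def kl_sparsity_penalty(activations, rho=0.2, eps=1e-8):
    # activations: (batch, s) sigmoid outputs of the hidden layer
    rho_hat = activations.mean(dim=0)          # active ratio of each of the s neurons over the batch
    kl = rho * torch.log(rho / (rho_hat + eps)) \
       + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()                            # sum over the s hidden units

# inside a training step (model, x and lam assumed to be defined as before):
# z = model.encoder(x)                         # sigmoid hidden activations
# x_hat = model.decoder(z)
# loss = F.mse_loss(x_hat, x) + lam * kl_sparsity_penalty(z, rho=0.2)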
Sparse Autoencoder
• Sparsity Regularization: Smaller ρ == More sparse
Autoencoders for the MNIST dataset
(Figure rows: Input, Autoencoder, Sparse Autoencoder)
Sparse Autoencoder
• Different regularization loss
An ℒ1 penalty on the hidden activation output can also be used:

Method    | Hidden Activation | Reconstruction Activation | Loss Function
Method 1  | Sigmoid           | Sigmoid                   | ℒ_total = ℒ_MSE + ℒ_ρ
Method 2  | ReLU              | Softplus                  | ℒ_total = ℒ_MSE + ‖𝒂‖₁ (ℒ1 on the hidden activations 𝒂)
22
Sparse Autoencoder
• Sparse Autoencoder vs. Denoising Autoencoder
(Figure: feature extractors of the Sparse Autoencoder vs. feature extractors of the Denoising Autoencoder)
23
Sparse Autoencoder
• Autoencoder vs. Denoising Autoencoder vs. Sparse Autoencoder
Autoencoders for the MNIST dataset
(Figure rows: Input, Autoencoder, Sparse Autoencoder, Denoising Autoencoder)
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
25
Contractive Autoencoder
26
• Why?
• Denoising Autoencoder and Sparse Autoencoder overcome the overcomplete
problem via the input and hidden layers.
• Could we add an explicit term in the loss to avoid uninteresting features?
We want features that ONLY reflect variations observed in the training set
Contractive Autoencoder
27
• How?
• Penalize the representation for being too sensitive to the input
• Improve the robustness to small perturbations
• Measure the sensitivity by the Frobenius norm of the Jacobian matrix of the
encoder activations
Contractive Autoencoder
• Recap: Jacobian Matrix
For x = f(z) with z = (z_1, z_2) and x = (x_1, x_2):
J_f = \begin{pmatrix} \partial x_1 / \partial z_1 & \partial x_1 / \partial z_2 \\ \partial x_2 / \partial z_1 & \partial x_2 / \partial z_2 \end{pmatrix}, \quad
J_{f^{-1}} = \begin{pmatrix} \partial z_1 / \partial x_1 & \partial z_1 / \partial x_2 \\ \partial z_2 / \partial x_1 & \partial z_2 / \partial x_2 \end{pmatrix}, \quad
J_f\, J_{f^{-1}} = I
Example:
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = f\!\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} z_1 + z_2 \\ 2 z_1 \end{pmatrix}, \quad J_f = \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix}
\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = f^{-1}\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_2 / 2 \\ x_1 - x_2 / 2 \end{pmatrix}, \quad J_{f^{-1}} = \begin{pmatrix} 0 & 1/2 \\ 1 & -1/2 \end{pmatrix}
so that J_f\, J_{f^{-1}} = I
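A quick numeric check of this example with PyTorch's autograd (the function f below is just the toy map from the slide):

import torch
from torch.autograd.functional import jacobian

def f(z):
    # toy map from the slide: x1 = z1 + z2, x2 = 2*z1
    return torch.stack([z[0] + z[1], 2 * z[0]])

def f_inv(x):
    # its inverse: z1 = x2/2, z2 = x1 - x2/2
    return torch.stack([x[1] / 2, x[0] - x[1] / 2])

z = torch.tensor([1.0, 3.0])
Jf = jacobian(f, z)                  # tensor([[1., 1.], [2., 0.]])
Jf_inv = jacobian(f_inv, f(z))       # tensor([[0.0, 0.5], [1.0, -0.5]])
print(Jf @ Jf_inv)                   # the identity matrix, as expected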
Contractive Autoencoder
• Jacobian Matrix
29
Contractive Autoencoder
• New Loss
\mathcal{L} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} + \lambda\, \underbrace{\|J_f(x)\|_F^2}_{\text{new regularization}}, \qquad \|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2
where h(x) are the encoder activations.
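A sketch of this penalty for a single sigmoid encoder layer, where the Jacobian has a simple closed form because dh_j/dx_i = h_j (1 - h_j) W_ji (the module names encoder_linear and decoder and the weight lam are assumptions for illustration):

import torch
import torch.nn.functional as F

def contractive_penalty(h, W):
    # h: (batch, hidden) sigmoid activations, h = sigmoid(x @ W.T + b)
    # W: (hidden, in_dim) encoder weight matrix
    # Squared Frobenius norm of the encoder Jacobian, summed per sample, averaged over the batch.
    dh_sq = (h * (1 - h)) ** 2           # (batch, hidden): squared sigmoid derivative per unit
    w_sq = (W ** 2).sum(dim=1)           # (hidden,): squared row norms of W
    return (dh_sq * w_sq).sum(dim=1).mean()

# usage sketch:
# h = torch.sigmoid(encoder_linear(x))   # encoder_linear = nn.Linear(in_dim, hidden)
# x_hat = decoder(h)
# loss = F.mse_loss(x_hat, x) + lam * contractive_penalty(h, encoder_linear.weight)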
Contractive Autoencoder
31
• vs. Denoising Autoencoder
• Advantages
• CAE can better model the distribution of raw data
• Disadvantages
• DAE is easier to implement
• CAE needs second-order optimization (conjugate gradient, LBFGS)
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
32
33
Stacked Autoencoder
• Start from Autoencoder: Learn Feature From Input
(Figure: input 𝑥1…𝑥6 → hidden 1 (𝑎1) → output 𝑥̂1…𝑥̂6; Encoder + Decoder; hidden 1 is the feature extractor for the input data)
Unsupervised
Red lines/color indicate the trainable weights; black lines indicate the fixed/non-trainable weights
34
Stacked Autoencoder
• 2nd Stage: Learn 2nd Level Feature From 1st Level Feature
(Figure: input 𝑥1…𝑥6 → hidden 1 (𝑎1, fixed) → hidden 2 (𝑎2, trainable) → output 𝑥̂1…𝑥̂6; Encoder + Encoder + Decoder; hidden 2 is the feature extractor for the first feature extractor)
Unsupervised
Red lines/color indicate the trainable weights; black lines indicate the fixed/non-trainable weights
35
Stacked Autoencoder
• 3rd Stage: Learn 3rd Level Feature From 2nd Level Feature
(Figure: input 𝑥1…𝑥6 → hidden 1 (𝑎1, fixed) → hidden 2 (𝑎2, fixed) → hidden 3 (𝑎3, trainable) → output 𝑥̂1…𝑥̂6; Encoder + Encoder + Encoder + Decoder; hidden 3 is the feature extractor for the second feature extractor)
Unsupervised
Red lines/color indicate the trainable weights; black lines indicate the fixed/non-trainable weights
Stacked Autoencoder
• 4th Stage: Learn 4th Level Feature From 3rd Level Feature
(Figure: input 𝑥1…𝑥6 → hidden 1 → hidden 2 → hidden 3 (all fixed) → hidden 4 (𝑎4, trainable) → output 𝑥̂1…𝑥̂6; Encoder ×4 + Decoder; hidden 4 is the feature extractor for the third feature extractor)
Unsupervised
Red lines/color indicate the trainable weights; black lines indicate the fixed/non-trainable weights
37
Stacked Autoencoder
• Use the Learned Feature Extractor for Downstream Tasks
(Figure: input 𝑥1…𝑥6 → hidden 1 → hidden 2 → hidden 3 → hidden 4 (all fixed) → output layer (𝑎5, trainable))
Supervised
Learn to classify the input data by using the labels and the high-level features
Red lines/color indicate the trainable weights; black lines indicate the fixed/non-trainable weights
38
Stacked Autoencoder
• Fine-tuning
(Figure: input 𝑥1…𝑥6 → hidden 1 → hidden 2 → hidden 3 → hidden 4 → output (𝑎5); all weights trainable)
Supervised
Fine-tune the entire model for classification
Red lines/color indicate the trainable weights
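A minimal sketch of this greedy layer-wise procedure followed by supervised fine-tuning (the layer sizes are illustrative, and train_reconstruction / train_classifier stand for training loops like the ones sketched earlier):

import torch.nn as nn

sizes = [784, 256, 128, 64, 32]                 # input dimension and four hidden layers
encoders = [nn.Sequential(nn.Linear(i, o), nn.Sigmoid())
            for i, o in zip(sizes[:-1], sizes[1:])]

# 1) Unsupervised, greedy layer-wise pre-training: train one new encoder (with a
#    temporary decoder) at a time while the earlier encoders stay frozen.
frozen = nn.Sequential()
for enc, (i, o) in zip(encoders, zip(sizes[:-1], sizes[1:])):
    decoder = nn.Linear(o, i)                   # temporary decoder for this stage only
    # train_reconstruction(frozen, enc, decoder)   # minimize MSE(h_in, decoder(enc(h_in))), h_in = frozen(x)
    for p in enc.parameters():
        p.requires_grad_(False)                 # freeze before moving to the next stage
    frozen = nn.Sequential(*frozen, enc)

# 2) Supervised stage: add a classification head, unfreeze everything, fine-tune with labels.
classifier = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
for p in classifier.parameters():
    p.requires_grad_(True)
# train_classifier(classifier)                  # cross-entropy training on labeled data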
Stacked Autoencoder
39
• Discussion
• Advantages
• …
• Disadvantages
• …
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
• From Neural Network Perspective
• From Probability Model Perspective
40
41
Before we start
• Question?
• Are the previous autoencoders generative models?
• Recap: We want to learn a probability distribution 𝑝(𝑥) over 𝑥
o Generation (sampling): 𝐱_new ~ 𝑝(𝐱)
(NO: the compressed latent codes of autoencoders do not follow a known prior distribution, so an autoencoder cannot learn to represent the data distribution)
o Density Estimation: 𝑝(𝐱) is high if 𝐱 looks like real data
(NO)
o Unsupervised Representation Learning:
Discovering the underlying structure of the data distribution (e.g., ears, nose, eyes …)
(YES: autoencoders learn the feature representation)
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
• From Neural Network Perspective
• From Probability Model Perspective
42
43
Variational Autoencoder
• How to perform generation (sampling)?
(Figure: autoencoder with input layer 𝑥1…𝑥6 → hidden layer 𝑧1…𝑧4 → output layer 𝑥̂1…𝑥̂6; Encoder + Decoder)
Can the hidden output be a prior distribution, e.g., a Normal distribution?
(Figure: sample 𝑧1…𝑧4 from 𝑁(0, 1) → Decoder → 𝑥̂1…𝑥̂6; the Decoder (Generator) maps 𝑁(0, 1) to the data space)
p(X) = \sum_{Z} p(X|Z)\, p(Z)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
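Once such a decoder exists, generation is just sampling from the prior and decoding; a sketch (the decoder below is an untrained stand-in with an illustrative latent size):

import torch
import torch.nn as nn

latent_dim = 4                                                       # illustrative latent size
decoder = nn.Sequential(nn.Linear(latent_dim, 784), nn.Sigmoid())    # stand-in for a trained decoder

z = torch.randn(64, latent_dim)     # z ~ N(0, I): sample from the prior
x_new = decoder(z)                  # map the latent samples to the data space, shape (64, 784)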
Variational Autoencoder
• Quick Overview
(Figure: bidirectional mapping between the Data Space 𝒙 and the Latent Space 𝑁(0, 1): inference (encoder) 𝑞(𝑧|𝑥) with loss ℒ_kl, and generation (decoder) 𝑝(𝑥|𝑧) with loss ℒ_MSE)
\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{kl}
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
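A minimal PyTorch sketch of this picture, using MSE reconstruction plus the closed-form KL term derived later in the slides (layer sizes and the 784-dimensional input are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)          # q(z|x): predicted means
        self.logvar = nn.Linear(hidden, latent)      # q(z|x): predicted log-variances
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())   # p(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)         # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction='sum')                   # L_MSE
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)     # L_kl = KL(q(z|x) || N(0, I))
    return recon + kl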
Variational Autoencoder
45
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
• The neural net perspective
• A variational autoencoder consists of an encoder, a decoder, and a loss function
Variational Autoencoder
• Encoder, Decoder
46
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Loss function
(Figure: the loss function has a reconstruction term, which can be represented by MSE, and a regularization term)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
48
Variational Autoencoder
• Why KL(Q||P), not KL(P||Q)?
• Which direction of the KL divergence should we use?
• Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability: the left one
• VAE requires an approximation that rarely places high probability anywhere that the true distribution places low probability: the right one
Variational Autoencoder
• Reparameterization Trick
(Figure: input 𝑥1…𝑥6 → encoder hidden units ℎ1…ℎ6 → predicted means 𝜇1…𝜇4 and predicted standard deviations 𝛿1…𝛿4 → resampling 𝑧𝑖 ~ 𝑁(𝜇𝑖, 𝛿𝑖) → decoder → 𝑥̂1…𝑥̂6)
1. Encode the input
2. Predict the means
3. Predict the standard deviations
4. Use the predicted means and standard deviations to sample new latent variables individually
5. Reconstruct the input
49
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Reparameterization Trick
• z ~ N(μ, σ) is not differentiable
• To make sampling z differentiable:
• z = μ + σ * ϵ, where ϵ ~ N(0, 1)
50
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
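In code the trick is a single line; the randomness is isolated in ϵ, so gradients flow back to μ and σ (mu and std below are stand-ins for encoder outputs):

import torch

mu = torch.zeros(4, requires_grad=True)     # stand-in for the predicted means
std = torch.ones(4, requires_grad=True)     # stand-in for the predicted standard deviations

eps = torch.randn_like(std)                 # eps ~ N(0, 1), sampled outside the graph
z = mu + std * eps                          # z ~ N(mu, std), but differentiable w.r.t. mu and std
z.sum().backward()                          # gradients reach mu.grad and std.grad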
Variational Autoencoder
• Reparameterization Trick
51
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Loss function
52
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Where is ‘variational’?
53
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
• From Neural Network Perspective
• From Probability Model Perspective
54
Variational Autoencoder
• Problem Definition
Goal: Given 𝑋 = {𝑥1, 𝑥2, 𝑥3 … , 𝑥𝑛}, find 𝑝(𝑋) to represent 𝑋
How: It is difficult to directly model 𝑝(𝑋), so alternatively we can write
p(X) = \int_Z p(X|Z)\, p(Z)\, dZ
where 𝑍 has a prior/known distribution 𝑁(0,1), i.e., we sample 𝑋 from 𝑍
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• The probability model perspective
• P(X) is hard to model
p(X) = \int_Z p(X|Z)\, p(Z) = \int_Z p(X, Z)
• Alternatively, we learn the joint distribution of X and Z:
p(X, Z) = p(Z)\, p(X|Z)
56
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Assumption
57
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Assumption
58
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Monte Carlo?
• n might need to be extremely large before we have an accurate estimation of P(X)
59
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Monte Carlo?
• Pixel difference is different from perceptual difference
60
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
61
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
• Monte Carlo?
• VAE alters the sampling procedure
Variational Autoencoder
• Recap: Variational Inference
• VI turns inference into optimization
ideal: p(z|x) = \frac{p(x, z)}{p(x)} \propto p(x, z); in practice we use an approximation q(z|x)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Variational Inference
• VI turns inference into optimization
parameter distribution
63
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Setting up the objective
• Maximize P(X)
• Set Q(z) to be an arbitrary distribution
p(z|X) = \frac{p(X|z)\, p(z)}{p(X)}
Goal: maximize log p(X)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Setting up the objective
Goal: maximize log p(X). The identity behind the VAE objective (whose terms are annotated on the slide) is
\log p(X) - KL\big(q(z|X)\,\|\,p(z|X)\big) = \mathbb{E}_{z \sim q(z|X)}\big[\log p(X|z)\big] - KL\big(q(z|X)\,\|\,p(z)\big)
The left-hand side contains the goal (maximize log p(X)) and the ideal posterior p(z|X), which is difficult to compute; so the goal becomes: optimize the right-hand side, with a reconstruction/decoder term and a KLD term involving the encoder q(z|X).
In the network, q(z|x) is inference (the encoder) and p(x|z) is generation (the decoder), giving
\mathcal{L}_{total} = \mathcal{L}_{MSE} + \mathcal{L}_{kl}
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Setting up the objective : ELBO
ideal posterior: p(z|X) = \frac{p(X, z)}{p(X)}; encoder: q(z|X); minimizing KL(q(z|X) || p(z|X)) is equivalent to maximizing the ELBO (the annotated quantity is -ELBO)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Setting up the objective : ELBO
67
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Recap: The KL Divergence Loss
KL\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0,1)\big)
= \int \mathcal{N}(\mu, \sigma^2)\,\log\frac{\mathcal{N}(\mu, \sigma^2)}{\mathcal{N}(0,1)}\,dx
= \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \,\log\!\left( \frac{ \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} }{ \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} } \right) dx
= \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \,\log\!\left( \frac{1}{\sqrt{\sigma^2}}\, e^{\frac{1}{2}\left[ x^2 - \frac{(x-\mu)^2}{\sigma^2} \right]} \right) dx
= \frac{1}{2} \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ -\log\sigma^2 + x^2 - \frac{(x-\mu)^2}{\sigma^2} \right] dx
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Recap: The KL Divergence Loss
KL\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0,1)\big)
= \frac{1}{2} \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ -\log\sigma^2 + x^2 - \frac{(x-\mu)^2}{\sigma^2} \right] dx
= \frac{1}{2}\left( -\log\sigma^2 + \mu^2 + \sigma^2 - 1 \right)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
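A quick numeric check of this closed form against PyTorch's built-in KL divergence (the values of μ and σ are arbitrary):

import torch
from torch.distributions import Normal, kl_divergence

mu, sigma = torch.tensor(0.7), torch.tensor(1.3)

closed_form = 0.5 * (-torch.log(sigma ** 2) + mu ** 2 + sigma ** 2 - 1)
library = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0))
print(closed_form.item(), library.item())   # the two values agree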
Variational Autoencoder
• Recap: The KL Divergence Loss
70
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• Optimizing the objective
(Figure: the objective optimized over the dataset, with annotated terms: encoder, ideal posterior, reconstruction, KLD)
71
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
• VAE is a Generative Model
𝑝(𝑍|𝑋) is not 𝑁(0,1).
Can we input 𝑁(0,1) to the decoder for sampling?
YES: the goal of the KL term is to make 𝑝(𝑍|𝑋) close to 𝑁(0,1)
72
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
Variational Autoencoder
73
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
• VAE vs. Autoencoder
• VAE : distribution representation, p(z|x) is a distribution
• AE: feature representation, h = E(x) is deterministic
Variational Autoencoder
74
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
• Challenges
• Low quality images
• …
Summary: Take Home Message
• Autoencoders learn data representation in an unsupervised/ self-supervised way.
• Autoencoders learn data representations but cannot model the data distribution 𝑝(𝑋).
• Unlike the vanilla autoencoder, in the sparse autoencoder the number of hidden units can be greater than the number of input variables.
• VAE
• …
• …
• …
• …
• …
• …
75
Thanks
76
