DEEP LEARNING WITH
TENSORFLOW
Introduction To Deep Learning – Week 2 (Units 2-3)
Jon Lederman
Why Do We Need Artificial Intelligence?
• Deep learning (also known as deep structured learning or hierarchical learning)
is part of a broader family of machine learning methods based on learning data
representations, as opposed to task-specific algorithms.
• Some examples of Deep learning architectures include:
• Neural Networks
• Feed-Forward Neural Networks*
• Convolutional Neural Networks*
• Recurrent Neural Networks*
• Deep Belief Networks
• Deep Boltzmann Machines
Limitations Of Some Forms Of Simple
Machine Learning
Algorithms
• Performance of machine learning algorithms depends heavily on the
representation of data provided
• Each element of data in a representation is known as a feature.
• These algorithms cannot influence how features are defined or which ones are important
Enter Deep Learning
What is Deep Learning?
• Deep learning is a particular kind of machine learning technique that represents the
world as a nested hierarchy of concepts with each concept defined in relation to
simpler concepts and more abstract representations defined in terms of less
abstract ones.
• The deep in "deep learning" refers to the number of layers through which the data is
transformed.
• Deep Learning is an AI/machine learning approach with several key features that
distinguish it from previous AI approaches such as the perceptron:
• Representational learning – Rather than hand coding the relevant features and their
associated representations, the algorithm learns the best representations of the data
• In many cases these learned representations are not intuitive to human beings, which is one
reason deep neural networks have mysterious qualities.
• Hierarchical - More complex representations are expressed in terms of simpler
representations
Transformations of Abstract Representations
• In deep learning, each level learns to transform its input data into a slightly
more abstract and composite representation.
• In an image recognition application:
• the raw input may be a matrix of pixels
• the first representational layer may abstract the pixels and encode edges
• the second layer may compose and encode arrangements of edges
• the third layer may encode a nose and eyes
• the fourth layer may recognize that the image contains a face.
• Importantly, a deep learning process can learn which features to optimally place in
which level on its own. That is, a deep learning process learns its own feature
detectors!
Representational Learning
From Deep Learning – Goodfellow, Bengio and Courville
Input is presented at
the visible layer
(observable features).
Then a series of hidden
layers extracts
increasingly abstract
features from the
images. These layers
are called ”hidden”
because their values
are not given in the
data. Instead the
model must learn
which concepts are
useful for explaining
the relationships in the
observed data.
Deep Neural Networks as Feature Detectors
• AlexNet (Sneak preview)
• Convolutional neural network that achieved a top-5 error of 15.3%, more than 10.8
percentage points ahead of the runner up
• Think of convolutional network as:
• Feature detectors – Conv layers that detect features
• Fully connected feedforward layers – compose features detected by conv layers into more complex
representations
• Will discuss convolutional neural networks in depth later
• AlexNet has 8 layers
• 5 Convolutional Layers – Feature Detectors
• 3 Fully Connected Layers – Compose Features
AlexNet
(Layer 1 Conv1 Features)
Edge detectors and color
detectors. Note that edge
detectors are at different
angles.
AlexNet
(Layer 2 Conv2 Features)
First 30 features learned by
Conv2 layer.
AlexNet
(Conv2-Conv5 Features)
Conv3 Layer Features Conv4 Layer Features Conv5 Layer Features
AlexNet
(Fully Connected Layer Features)
Fully Connected Layer (fc6) Fully Connected Layer (fc7)
AlexNet
(Images Resembling Specific Classes Most
In Final Fully Connected Layer)
Classes Selected:
‘hen’
‘Yorkshire terrier’
‘Shetland sheepdog’
‘fountain’
‘theatre curtain’
‘geyser’
Hinton’s Family Tree Example
Examples of Deep learning Architectures
• Some examples of Deep learning architectures include:
• Neural Networks
• Feed-Forward Neural Networks*
• Convolutional Neural Networks*
• Recurrent Neural Networks*
• Deep Belief Networks
• Deep Boltzmann Machines
Biological Neurons
• Neuron structure
• Cell Body
• Axon – Pathway by which neuron sends messages to other neurons
• Dendritic Tree – Pathway by which neuron receives messages from other neurons
• Synapse – Locus where axon of one neuron contacts dendritic tree of another neuron
(locus of communication)
Biological Neurons
• How neurons communicate:
• Synapse contains vesicles of a transmitter chemical
• Different chemicals implement positive and negative weights
• When a spike arrives it causes these vesicles to migrate to the surface and be released into
the synaptic cleft
• Transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the post-
synaptic neuron
• This opens holes that allow specific ions to flow in or out and that changes their state of
depolarization
• How neurons adapt (learn):
• The effectiveness of a synapse can be changed
• Vary the number of vesicles of transmitter
• Vary the number of receptor molecules
Biological Neurons
• Each neuron receives input from other neurons
• The effect of each input line on the neuron is controlled by a synaptic weight
• Weight can be positive or negative
• The synaptic weights adapt so that the entire network learns to perform useful
computations
• Human brain has about 10^11 neurons, each with about 10^4 weights
• Brain cortex looks the same all over and can become specialized
• Provides for rapid parallel computation
• Similar to FPGA
• In fact, even a single neuron is not fully explained by neuroscience. A real neuron is much
more complex than, and possibly entirely different from, our conception of artificial
neurons. Upshot: use this analogy loosely.
Biological Neurons as Inspiration For Artificial
Neurons
Linear Neurons
Note that real neurons communicate
with spikes of activity rather than real
values.
[Figure: linear neuron. The output is a linear function of the weighted input: $y = b + \sum_i w_i x_i$, where $b$ is the bias.]
Binary Threshold Neurons
• Conceived by McCulloch-Pitts (1943)
• Compute a weighted sum of inputs
• Output a fixed size spike of activity if weighted sum exceeds threshold
[Figure: step activation. With weighted input $z = b + \sum_i w_i x_i$, the output is $y = 1$ if $z$ exceeds the threshold, else $y = 0$.]
Rectified Linear (“ReLu”) Neuron
• Computes a linear weighted sum of its inputs
• Output is a non-linear function of the weighted sum
[Figure: ReLU activation. $y = \max(0, z)$: zero below the threshold, linear above it.]
Sigmoid (Logistic) Neuron
• Real-valued output that is smooth and bounded function of total input
• Desirable properties of the derivative
$z = b + \sum_i x_i w_i$ — referred to as the "logit"
$y = \dfrac{1}{1 + e^{-z}}$ — the logistic (sigmoid) function
The sigmoid function in this context is referred to as the activation function.
Tanh Activation Function
[Plot: tanh and ReLU activation curves.]
Types of Learning
• Supervised
• Learn to predict output vector given an input vector
• Unsupervised
• Learn good internal representation of input
• Reinforcement
• Select an action to maximize a payoff
Perceptrons
• Investigated in early 1960’s as potentially promising learning mechanisms
• Grand claims re: learning algorithm
• Fell into disfavor as it was shown by Minsky and Papert (1969) that they were limited in
capability
• Many people wrongly interpreted these limitations as applying to all neural network models
• Decision unit in perceptron is binary threshold neuron
• Learning mechanism (see the sketch below):
• Pick training cases using a policy that ensures every training case keeps getting picked
• If the output unit is correct, leave its weights alone
• If the output unit incorrectly outputs a zero, add the input vector to the weight vector
• If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector
• This is guaranteed to find a set of weights that gets the right answer for all training cases, if any such set
exists
• It turns out the difficult problem is determining what set of features should be used (remember these
aren’t learned).
• Using the right set of features makes learning easy
• Using the wrong set of features makes learning impossible
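As a concrete illustration, here is a minimal sketch of the perceptron learning rule above in NumPy. The AND dataset, the epoch limit, and the bias-as-extra-feature convention are assumptions for the example; the update rule itself is the one from this slide.

import numpy as np

def train_perceptron(X, t, epochs=100):
    # Perceptron learning rule for binary targets t in {0, 1}.
    # X: (m, n) feature activations; a fixed bias feature of 1 is appended.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):        # keep picking every training case
            y = 1 if x @ w >= 0 else 0     # binary threshold decision unit
            if y == target:
                continue                   # correct output: leave weights alone
            w += x if y == 0 else -x       # wrongly 0: add input; wrongly 1: subtract
            errors += 1
        if errors == 0:                    # every training case correct: done
            break
    return w

# Example: learn logical AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
print(train_perceptron(X, t))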
Standard Paradigm For Statistical Pattern
Recognition
• Convert raw input vector into vector of feature activations
• Use hand-written programs based on common-sense or intuitive notions to define
features
• Learn how to weight each of the feature activations to get a single scalar
quantity
• The weights on the features represent how much evidence the feature gives you in
favor of or against the hypothesis that the current input is the kind of pattern that is to
be recognized
• Adding up the weighted features gives the total evidence for whether to recognize the current
input
Perceptrons
A perceptron is one example of a statistical pattern
recognition system
[Diagram: inputs → feature units (hand engineered, no learning) → learned weights → decision unit. Only the weights into the decision unit are learned.]
Limitations of Perceptrons
• Neural network research came to a halt in late ‘60s and early ‘70s largely due to
the fact that perceptrons were shown to be limited. In particular:
• Minsky and Papert's "Group Invariance Theorem" proved that a perceptron cannot
learn to discriminate patterns under transformations of the features that form a group.
• This is very bad news for perceptrons, as pattern recognition requires translation and
rotation invariance, which are both groups
• If you can choose features by hand and use enough features a perceptron is
very powerful
• Thus, for binary input vectors a separate feature unit can be chosen for each vector.
However, this results in an exponential explosion of the number of feature units
required.
What Perceptrons Tell Us About What Is
Required For Deep Learning
• Neural networks are only going to be powerful if they can learn the feature
detectors
• Not enough to learn weights on feature detectors
• The second generation of neural networks (after Perceptrons) was about how to
learn the feature detectors
• Took about 20 years to figure out how to learn the feature detectors
• This is one of the primary reasons neural networks are so powerful
• Networks without hidden units are very limited in the input-output mappings
they can learn to model
• Many layers of linear units do not help. Need to break the linearity.
• Fixed output non-linearities are not enough. Non-linearity must permeate the hidden
layers.
Requirements and Challenges for Deep
Learning
• What is required is multiple layers of adaptive non-linear hidden units.
• But, how can such an adaptive non-linear algorithm be trained?
• Need an efficient way of adapting all the weights – not just the last layer like in a
perceptron
• Learning the weights going into the hidden units is equivalent to learning features
• This is difficult because there is no guidance indicating directly what the hidden units
should do (i.e., what the feature detectors should be).
Learning Algorithm For Linear Neuron
• Multi-layer neural networks do not use the perceptron learning algorithm
• Multi-layer neural networks are not multi-layer perceptrons (“MLPs”). This is a
misnomer.
• Compare with perceptron learning algorithm:
• In perceptron, weights are always getting closer to a “good” set of weights
• This doesn’t work for more complex networks as the average of two good solutions may
not be a good solution.
• In a linear neuron, the outputs are always getting closer to the target outputs
• This is not true for perceptron learning. Target outputs may get further away even if
weights are converging to good sets of weights.
Linear Neurons
[Figure: linear neuron, $y = b + \sum_i w_i x_i$: the output is a linear function of the weighted input.]
The neuron has a real-
valued output, which is a
weighted sum of its inputs
The aim of learning is to
minimize the error
summed over all training
examples. Error may be
defined in many ways, but
often the squared error
(L2 loss) is used, which is
calculated by computing
the square of the
difference between the
desired output and actual
output.
Why Not Use Analytic Solution?
• We desire a method that real neurons use so we understand what they are doing
• Neurons are most certainly not solving equations
• Desire a method that can be generalized to multi-layer non-linear neural
networks
• Analytic solution only works for linear problems and a squared error function
• Iterative methods are less efficient but much easier to generalize
Deriving Learning Mechanism For Linear
Neurons
Define the error as the squared residuals summed over all training examples:
$$E = \tfrac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$$
Differentiate and apply the chain rule (with $y^n = b + \sum_i w_i x_i^n$):
$$\frac{\partial E}{\partial w_i} = \tfrac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)$$
How To Update Weights
The weight update in each iteration follows the negative gradient, scaled by the learning rate $\varepsilon$:
$$\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \varepsilon \sum_n x_i^n (t^n - y^n)$$
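A minimal NumPy sketch of this iterative procedure for a single linear neuron; the synthetic data, learning rate, and step count are assumptions for the example, while the update is the delta rule just derived.

import numpy as np

def train_linear_neuron(X, t, lr=0.001, steps=2000):
    # Batch gradient descent on squared error for y = Xw + b.
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        y = X @ w + b                # real-valued outputs for all m cases
        residual = t - y             # (t^n - y^n) for every training case
        w += lr * (X.T @ residual)   # delta rule: eps * sum_n x_i^n (t^n - y^n)
        b += lr * residual.sum()
    return w, b

# Example: recover a known linear relationship from noisy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([2.0, -3.0]) + 0.5 + 0.01 * rng.normal(size=100)
print(train_linear_neuron(X, t))   # close to w = [2, -3], b = 0.5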
Behavior of Iterative Learning Procedure
• Does the learning procedure eventually get the right answer?
• There may be no perfect answer. We may give the linear neuron a bunch of training examples with
respective desired answers, but there may not exist a set of weights to give the desired answers.
• However, there may be a set of weights that gives a best approximation over all training examples
and that minimizes that error measure summed over all training examples.
• If we make the learning rate small enough and learn for long enough, we can get as close as we want
to the best answer.
• How quickly do weights converge to the best answer?
• It can be quite slow if two input dimensions are highly correlated.
• Learning rate annoyances
• Choosing a learning rate that is too large will cause instabilities in learning (oscillations).
• Choosing a learning rate that is too small will cause learning to take a long time.
Error Surface of Linear Neuron
Error surface (E) is a quadratic bowl
Vertical cross sections are parabolas
Horizontal cross sections are ellipses
***For multi-layer non-linear neural networks, the
error surface is much more complicated. It will be
smooth but may have many local minima.
***From Neural Networks For Machine Learning (Coursera – Hinton)
Error Surface of Linear Neuron
Online vs. Batch Learning
• Online Learning – Weights and biases are updated for each training example.
• Gradient is taken for each single training case.
• Stochastic Gradient Descent (“SGD”) is one example of online learning as we will see
later.
• Batch Learning – Weights and biases are updated after an entire batch of
training examples.
• The gradient is summed over ALL training examples. This is equivalent to steepest
gradient descent.
• Mini-batch learning splits training set into small batches and update step is
performed for each mini-batch
• Later we will explore the pros/cons of batch learning, mini-batch learning and online learning
Online vs. Batch Learning For Linear Neuron
Batch learning is equivalent to
steepest descent on the error surface.
Trajectory is perpendicular to
contours.
If we alternate between two training examples,
the weights will move perpendicular to each respective
constraint line (zig-zag).
*From Neural Networks For Machine Learning (Coursera – Hinton)
Why Learning Can Be Slow
If ellipse is very elongated (will happen if
lines corresponding to two training
examples are almost parallel), steepest
descent can be very slow. This is due to
the fact that with an elongated ellipse,
the gradient is big in the direction in
which we don’t want to move very far
and small in direction where we would
like to move a long way. This condition
will cause the trajectory to move across the
ravine rather than along it, which is
the opposite of the desired goal.
*From Neural Networks For Machine Learning (Coursera – Hinton)
Learning The Weights Of A Logistic Neuron
$z = b + \sum_i x_i w_i$ — referred to as the "logit"
$y = \dfrac{1}{1 + e^{-z}}$ — the logistic (sigmoid) function
The sigmoid function in this context is referred to as the activation function.
The Derivatives Of A Logistic Neuron
$\dfrac{\partial z}{\partial w_i} = x_i \qquad\qquad \dfrac{dy}{dz} = y(1 - y)$
The second identity can be shown with simple calculus by applying the
quotient rule for derivatives to $y = 1/(1 + e^{-z})$.
How To Learn The Weights For Logistic
Neuron
Using the chain rule, the gradient for a logistic neuron is
$$\frac{\partial E}{\partial w_i} = \sum_n \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = -\sum_n x_i^n \, y^n (1 - y^n) \, (t^n - y^n)$$
The extra factor $y^n(1 - y^n)$ is due to the slope of the
sigmoid activation function; the rest is the
same as for the linear neuron.
Logistic Neuron Model
[Diagram: logistic neuron model]
Logistic Regression Problem Statement
Binary Classification
Problem: Given $x \in \mathbb{R}^{n_x}$, estimate $\hat{y} = P(y = 1 \mid x)$.
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$.
Because $\hat{y}$ is a probability, it should be between 0 and 1.
Our prediction is: $\hat{y} = \sigma(w^T x + b)$.
Logistic Regression Loss Function
m Training Examples
Given $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.
What is the best loss function? The squared error $\tfrac{1}{2}(\hat{y} - y)^2$ results in a non-convex error surface for logistic regression.
Instead use the cross-entropy loss function:
$$L(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
Logistic Regression Cost Function
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \,\right]$$
Why Cross-Entropy Loss?
Interpret the output as a probability: $P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$
Why Cross-Entropy Loss?
Compact expression: $\log P(y \mid x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) = -L(\hat{y}, y)$
Why Cross-Entropy Loss?
If i.i.d.: $\log P(\text{labels}) = \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$
Maximum Likelihood Estimation (MLE) – Find parameters that maximize that expression. Because we
want a quantity to minimize, we take the additive inverse (which removes the negative sign). Thus, by minimizing
the cost function via steepest gradient descent, we are performing MLE.
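A small numerical sketch of the cross-entropy loss and its MLE interpretation in NumPy; the example probabilities and labels are assumptions.

import numpy as np

def cross_entropy(y_hat, y):
    # Per-example loss L(y_hat, y) = -[y log y_hat + (1-y) log(1-y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = np.array([0.9, 0.2, 0.7])   # predicted probabilities
y = np.array([1.0, 0.0, 1.0])       # labels

losses = cross_entropy(y_hat, y)
cost = losses.mean()                # J: average loss over the m examples
log_likelihood = np.log(y_hat**y * (1 - y_hat)**(1 - y)).sum()  # i.i.d. case

# Minimizing the cost is the same as maximizing the likelihood:
print(cost, -log_likelihood / len(y))   # identical values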
Determining The Parameters Using Gradient
Descent
*From DeepLearning.ai (Coursera – Andrew Ng)
Gradient Operator
The gradient vector points in the direction of steepest ascent.
Proof: the directional derivative along a unit vector $\boldsymbol{u}$ is
$D_{\boldsymbol{u}} f = \nabla f \cdot \boldsymbol{u} = \lVert \nabla f \rVert \cos\theta$,
which, by properties of the dot product, is maximized when $\theta = 0$, i.e., when $\boldsymbol{u}$ points along $\nabla f$.
Gradient Descent
• Algo:
• Randomly initialize weights and biases
• Calculate gradients $\dfrac{\partial J}{\partial w_i}$ and $\dfrac{\partial J}{\partial b_i}$ for all weights
and biases
• Update weights and biases using the learning rate $\alpha$ and the gradients:
• $w_i := w_i - \alpha \dfrac{\partial J}{\partial w_i}$
• $b_i := b_i - \alpha \dfrac{\partial J}{\partial b_i}$
• Repeat until stopping condition
Notation:
$dw \equiv \dfrac{\partial J}{\partial w} \qquad db \equiv \dfrac{\partial J}{\partial b}$
Computation Graph
• As will be seen, the training of a neural
network involves:
• A forward pass
• A backward pass
• A computation graph organizes these forward and backward passes
• Example: $J(a, b, c) = 3(a + bc)$, decomposed as
$u = bc$
$v = a + u$
$J = 3v$
[Graph: inputs a, b, c → u = bc → v = a + u → J = 3v]
Computation Graph
Forward Pass
[Graph evaluated with a = 5, b = 3, c = 2: u = bc = 6, v = a + u = 11, J = 3v = 33]
Computation Graph
Backward Pass
[Backward pass through the same graph: $\partial J/\partial v = 3$, $\partial J/\partial u = 3$, $\partial J/\partial a = 3$, $\partial J/\partial b = c \cdot \partial J/\partial u = 6$, $\partial J/\partial c = b \cdot \partial J/\partial u = 9$]
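A tiny sketch of this computation graph in Python: the forward pass caches the intermediates and the backward pass applies the chain rule node by node. The function and variable names are illustrative.

def forward(a, b, c):
    # Forward pass: compute J = 3(a + bc), caching intermediates
    u = b * c
    v = a + u
    J = 3 * v
    return J, (u, v)

def backward(a, b, c, cache):
    # Backward pass: chain rule from J back to the inputs
    u, v = cache
    dJ_dv = 3.0            # J = 3v
    dJ_du = dJ_dv          # v = a + u
    dJ_da = dJ_dv          # v = a + u
    dJ_db = dJ_du * c      # u = bc
    dJ_dc = dJ_du * b      # u = bc
    return dJ_da, dJ_db, dJ_dc

J, cache = forward(5, 3, 2)
print(J)                         # 33
print(backward(5, 3, 2, cache))  # (3.0, 6.0, 9.0)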
Logistic Regression
Update Rules For Gradient Descent: with $z = w^T x + b$, $a = \sigma(z)$, and loss $L(a, y)$, the computation graph gives
$dz = a - y \qquad dw_j = x_j \, dz \qquad db = dz$
PseudoCode For Logistic Regression
(One Step Of Gradient Descent)
J = 0; dw1 = 0; dw2 = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[y(i) log a(i) + (1 − y(i)) log(1 − a(i))]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m; dw1 /= m; dw2 /= m; db /= m
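A runnable NumPy version of this single gradient-descent step, keeping the explicit loops so the inefficiency discussed on the next slide stays visible. The two-feature setup mirrors the pseudocode; the data is an assumption.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step_loops(X, y, w, b, lr=0.1):
    # One gradient-descent step with explicit loops (m examples, n features)
    m, n = X.shape
    J, dw, db = 0.0, np.zeros(n), 0.0
    for i in range(m):                       # loop over training examples
        z = np.dot(w, X[i]) + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
        dz = a - y[i]
        for j in range(n):                   # loop over features
            dw[j] += X[i, j] * dz
        db += dz
    J, dw, db = J / m, dw / m, db / m        # average over the m examples
    return w - lr * dw, b - lr * db, J

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                  # m = 8 examples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, J = logistic_step_loops(X, y, np.zeros(2), 0.0)
print(J)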
Code Efficiency
• Code is inefficient as it requires two for loops
• 1 loop over m training examples
• 1 loop over features (in this example only 2 features present, but in general there will be
more)
• Solution: Vectorization (see the sketch after this list)
• Vectorization leverages SIMD instructions to parallelize computations (data-level
parallelism)
• NumPy supports this, as does TensorFlow
• Read Web resources on vectorization to learn more
• Broadcasting
• Allows operations on two arrays of different shapes by implicitly expanding one of them to a compatible
shape
• Important for TensorFlow – Read Web resources on this topic
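A vectorized rewrite of the same gradient step with both loops removed; the scalar b broadcasting across all m examples is an instance of the broadcasting described above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step_vectorized(X, y, w, b, lr=0.1):
    # Same update as the looped version, with no explicit loops
    m = X.shape[0]
    Z = X @ w + b                 # b (a scalar) broadcasts across all m examples
    A = sigmoid(Z)
    J = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))
    dZ = A - y                    # (m,) vector of per-example errors
    dw = X.T @ dZ / m             # replaces the loop over features
    db = dZ.mean()                # replaces the loop over examples
    return w - lr * dw, b - lr * db, J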
Artificial Neuron Model
[Diagram: artificial neuron model]
Deep Feedforward Networks
Multilayer Neural Networks
Deep Feedforward Networks
• Network comprises a set of layers, each layer comprising a set of artificial neurons
• Each layer may be thought of computing a function such that the combination of layers
generates a composite function:
• $f^{[L]}(f^{[L-1]}(\cdots f^{[1]}(\boldsymbol{x})))$
• Final layer is called the output layer
• First layer is called the input layer
• Number of layers gives depth of the model (hence the term “deep” learning)
Deep Feedforward Networks
• Goal of deep feedforward network is to approximate some function f*(x)
• Information flows through the network through intermediate computations (acyclic
graph)
• A feedforward network defines a mapping 𝒚 = 𝑓(𝒙; 𝜽) that learns the parameters that
result in the best function approximation to f*(x)
• For example, a classifier 𝑦 = 𝑓(𝒙) maps an input vector to a category
• Training
• During training, drive f(x) to match f*(x)
• Training data provides noisy examples of f*(x) evaluated at different training points
• Each example x is accompanied by a label y=f*(x) (Supervised Learning)
Deep Feedforward Networks
(Hidden Layers)
• The training data specifies directly what the output layer must do at each point
x
• Namely, it must predict a value 𝑦 that is close to y
• In contrast, the behavior of the hidden layers is not directly specified by the
training data
• The learning algorithm must decide how to use the other layers to produce the
desired output that is close to y (i.e., to generate an approximation of f*)
• Because the training data does not show the desired output for each of these other layers,
they are called hidden layers
Operation of Artificial Neuron
Forward Pass
Vectorizing Across Layers
• We will adopt the following conventions:
• Each layer has 𝑈𝑙 neurons (units), where l is the layer number
• Each neuron has an associated weight (column) vector $\boldsymbol{w}_i^{[l]}$ of length $U_{l-1}$, where $i$ indexes
the neuron within the $l$th layer
• Each layer has an associated activation function 𝑓𝑙, for example 𝜎(𝑧)
• The input layer is layer 0
Notation Conventions
Number of units in layer l: $n^{[l]}$
Activation function for layer l: $g^{[l]}$
Linear combination for layer l: $\boldsymbol{z}^{[l]} = \boldsymbol{W}^{[l]} \boldsymbol{a}^{[l-1]} + \boldsymbol{b}^{[l]}$
Activations for layer l: $\boldsymbol{a}^{[l]} = g^{[l]}(\boldsymbol{z}^{[l]})$
Biases for layer l: $\boldsymbol{b}^{[l]}$
Weights for layer l: $\boldsymbol{W}^{[l]}$
Note: $\boldsymbol{a}^{[0]} = \boldsymbol{x}$, the input
Forward Pass
Vectorizing Across Layers
Example 2-Layer Neural Network
Notation: $a_j^{[i]}$ denotes the jth unit in the ith layer.
Example 2-Layer Neural Network
Shapes for a network with 3 inputs, 4 hidden units, and 1 output (see the sketch below):
$\boldsymbol{z}^{[1]} = \boldsymbol{W}^{[1]} \boldsymbol{x} + \boldsymbol{b}^{[1]}$: $(4,3)(3,1) + (4,1) \rightarrow (4,1)$, with $\boldsymbol{a}^{[1]}$: $(4,1)$
$\boldsymbol{z}^{[2]} = \boldsymbol{W}^{[2]} \boldsymbol{a}^{[1]} + \boldsymbol{b}^{[2]}$: $(1,4)(4,1) + (1,1) \rightarrow (1,1)$, with $\boldsymbol{a}^{[2]}$: $(1,1)$
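A minimal sketch of this 2-layer forward pass with the shapes above; the random initialization is only to make the example runnable.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)) * 0.01, np.zeros((4, 1))  # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))  # layer 2: 4 -> 1

x = rng.normal(size=(3, 1))      # one training example as a column vector
z1 = W1 @ x + b1                 # (4,3)(3,1) + (4,1) -> (4,1)
a1 = sigmoid(z1)                 # (4,1)
z2 = W2 @ a1 + b2                # (1,4)(4,1) + (1,1) -> (1,1)
a2 = sigmoid(z2)                 # (1,1): the prediction y-hat
print(a2.shape)                  # (1, 1)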
Example 4-Layer Neural Network
Example 4-Layer Neural Network
(Forward Propagation)
For $l = 1, 2, 3, 4$: $\boldsymbol{z}^{[l]} = \boldsymbol{W}^{[l]} \boldsymbol{a}^{[l-1]} + \boldsymbol{b}^{[l]}$, $\; \boldsymbol{a}^{[l]} = g^{[l]}(\boldsymbol{z}^{[l]})$, with $\boldsymbol{a}^{[0]} = \boldsymbol{x}$
Example 4-Layer Neural Network
(Forward Propagation (vectorizing across training
examples))
Forward Pass
Vectorizing Across Training Examples
• It is now easy to vectorize across training examples
• Introduce the matrix $\boldsymbol{X} = [\boldsymbol{x}^{(1)} \; \boldsymbol{x}^{(2)} \cdots \boldsymbol{x}^{(m)}]$ whose columns are the training examples; stacking $\boldsymbol{Z}^{[l]}$ and $\boldsymbol{A}^{[l]}$ column-wise the same way gives $\boldsymbol{Z}^{[l]} = \boldsymbol{W}^{[l]} \boldsymbol{A}^{[l-1]} + \boldsymbol{b}^{[l]}$ (with $\boldsymbol{b}^{[l]}$ broadcast across columns)
Updated Notation Conventions
Number of units in layer l: $n^{[l]}$
Activation function for layer l: $g^{[l]}$
Weighted sum for layer l for the ith training example: $\boldsymbol{z}^{[l](i)}$
Activations for layer l for the ith training example: $\boldsymbol{a}^{[l](i)}$
Biases for layer l: $\boldsymbol{b}^{[l]}$
Weights for layer l: $\boldsymbol{W}^{[l]}$
Note:
We use [] to indicate the layer
We use () to indicate the training example
Forward Pass
Vectorizing Across Training Examples
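A sketch of the same forward pass vectorized across a batch of m training examples, following the shape conventions above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5                                     # number of training examples
X = rng.normal(size=(3, m))               # A[0]: columns are examples
W1, b1 = rng.normal(size=(4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))

Z1 = W1 @ X + b1                          # (4,m); b1 broadcasts across columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                         # (1,m)
A2 = sigmoid(Z2)                          # one prediction per column/example
print(A2.shape)                           # (1, 5)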
A Tour Of Activation Functions
Sigmoid Function
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$ — the logistic (sigmoid) function
Tanh Activation Function
$\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
Typically better than the sigmoid function for hidden layers because its output is centered around 0,
which centers the data fed to the next layer around 0.
The exception is the output layer, where the sigmoid makes sense:
at the output, we generally want a value between 0 and 1.
Rectified Linear (“ReLu”) Neuron
• Output is non-linear
• Faster learning than sigmoid or tanh typically.
• Although the derivative is 0 below the threshold, ReLu usually works well. If this is a concern, use the
leaky ReLu (next slide).
$y = \max(0, z)$
Leaky Rectified Linear (“ReLu”) Neuron
• Addresses issue of zero derivative of ReLu when less than threshold
$y = \max(\alpha z, z)$, where $\alpha$ is a small positive number
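The four activations above as NumPy one-liners; the leak coefficient 0.01 is a common default, not a value prescribed by the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # smooth, bounded in (0, 1)

def tanh(z):
    return np.tanh(z)                          # zero-centered, bounded in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                  # zero below threshold, linear above

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)       # small slope below threshold

z = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(z), 3))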
Rules Of Thumb
• If z is very large or very small, the slope is close to 0, which slows down
learning.
• If the output is 0 or 1 (binary classification), use the sigmoid at the output layer
• Tanh is superior to the sigmoid for everything else
• For hidden units, use ReLu or Leaky ReLu (becoming the more popular choice)
Why Non-Linear Activation Functions?
• Why do we need an activation function at all?
• Composition of linear functions is still linear!
• Exercise: Prove this
• Thus, using linear activation functions limits the complexity of functions that can be
learned
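A worked version of the exercise for the two-layer case, as a quick sanity check:
$$\boldsymbol{a}^{[2]} = \boldsymbol{W}^{[2]}(\boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}) + \boldsymbol{b}^{[2]} = (\boldsymbol{W}^{[2]}\boldsymbol{W}^{[1]})\,\boldsymbol{x} + (\boldsymbol{W}^{[2]}\boldsymbol{b}^{[1]} + \boldsymbol{b}^{[2]}) = \boldsymbol{W}'\boldsymbol{x} + \boldsymbol{b}'$$
so a stack of linear layers collapses to a single linear layer, and the same argument extends by induction to any depth.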
Training vs. Inference
• Neural networks have two phases of operation:
• Training (learning) phase – Parameters (weights and biases) of the network are
learned via a supervised learning algorithm by providing training examples with
known target outputs. The target output is denoted by y.
• Backpropagation algorithm (learning algorithm for neural networks)
• Forward pass – Generates $\hat{y}$, the predicted output, from which a loss function is computed based
upon the target output y: $L(\hat{y}, y)$
• Backward pass – Computes the gradients used to determine the weights and biases
• Gradient descent - Backpropagation algorithm is used in conjunction with gradient descent to
update the weights and biases iteratively.
• Inference (Runtime)
• Once weights and biases are learned, the neural network is used to make predictions on
inputs it has not seen before.
Backpropagation Algorithm
In order to carry out steepest gradient descent, we are interested in the following expressions: $\dfrac{\partial J}{\partial w_{jk}^{[l]}}$ and $\dfrac{\partial J}{\partial b_j^{[l]}}$
Define the error term as follows:
$$\varepsilon_j^{[l]} \equiv \frac{\partial J}{\partial z_j^{[l]}}$$
The error term represents the rate of change of the cost function with respect to the
weighted input of the jth neuron in the lth layer. We will see that the backpropagation
algorithm amounts to iteratively propagating the error backwards from the last layer (L)
to the first layer, computing $\varepsilon^{[l]}$ at each iteration.
Backpropagation Algorithm
Step 1: Calculate an expression for the error term in the last layer (L):
$$\varepsilon_j^{[L]} = \frac{\partial J}{\partial z_j^{[L]}} = \sum_k \frac{\partial J}{\partial a_k^{[L]}} \frac{\partial a_k^{[L]}}{\partial z_j^{[L]}} = \frac{\partial J}{\partial a_j^{[L]}} \, \sigma'(z_j^{[L]}) \quad \text{(using chain rule; only the } k = j \text{ term survives)}$$
Plugging this into the expression for the error term, in vector form:
$$\varepsilon^{[L]} = \nabla_{\boldsymbol{a}} J \odot \sigma'(\boldsymbol{z}^{[L]}) \qquad \text{Equation 1}$$
Backpropagation Algorithm
Step 2: Calculate the propagation of the error backwards through the network:
$$\varepsilon_j^{[l]} = \frac{\partial J}{\partial z_j^{[l]}} = \sum_k \frac{\partial J}{\partial z_k^{[l+1]}} \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = \sum_k \varepsilon_k^{[l+1]} \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} \quad \text{(chain rule; definition of error)}$$
Since $z_k^{[l+1]} = \sum_j w_{kj}^{[l+1]} \sigma(z_j^{[l]}) + b_k^{[l+1]}$ (definition of weighted sum), carrying out the derivative gives $\dfrac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = w_{kj}^{[l+1]} \sigma'(z_j^{[l]})$. Substituting, in vector form:
$$\varepsilon^{[l]} = \left( (\boldsymbol{W}^{[l+1]})^T \varepsilon^{[l+1]} \right) \odot \sigma'(\boldsymbol{z}^{[l]}) \qquad \text{Equation 2}$$
Backpropagation Algorithm
Step 3: Calculate the derivative of the cost function with respect to a weight:
$$\frac{\partial J}{\partial w_{jk}^{[l]}} = \frac{\partial J}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = \varepsilon_j^{[l]} \frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} \quad \text{(chain rule; definition of error)}$$
Since $z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}$ (definition), applying the derivative gives $\dfrac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = a_k^{[l-1]}$. Substituting:
$$\frac{\partial J}{\partial w_{jk}^{[l]}} = a_k^{[l-1]} \varepsilon_j^{[l]} \qquad \text{Equation 3}$$
Backpropagation Algorithm
Step 4: Calculate the derivative of the cost function with respect to the bias:
$$\frac{\partial J}{\partial b_j^{[l]}} = \frac{\partial J}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} = \varepsilon_j^{[l]} \frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} \quad \text{(chain rule; definition of error)}$$
Since $z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}$ (definition), applying the derivative gives $\dfrac{\partial z_j^{[l]}}{\partial b_j^{[l]}} = 1$. Substituting:
$$\frac{\partial J}{\partial b_j^{[l]}} = \varepsilon_j^{[l]} \qquad \text{Equation 4}$$
The 4 Fundamental Equations Of
Backpropagation And Their Interpretation
(1) $\varepsilon^{[L]} = \nabla_{\boldsymbol{a}} J \odot \sigma'(\boldsymbol{z}^{[L]})$ — Calculate error of last layer
(2) $\varepsilon^{[l]} = ((\boldsymbol{W}^{[l+1]})^T \varepsilon^{[l+1]}) \odot \sigma'(\boldsymbol{z}^{[l]})$ — Propagate error backwards through preceding layers
(3) $\partial J / \partial w_{jk}^{[l]} = a_k^{[l-1]} \varepsilon_j^{[l]}$ — Calculate gradient of cost function with respect to weights using errors
(4) $\partial J / \partial b_j^{[l]} = \varepsilon_j^{[l]}$ — Calculate gradient of cost function with respect to biases using errors
Carrying Out The Backpropagation Algorithm
1. Input: Training Example (x) – Set the input layer activation $\boldsymbol{a}^{[0]}$ to the training example x
2. Feedforward: For each l = 1, 2, 3, … L compute $\boldsymbol{z}^{[l]} = \boldsymbol{W}^{[l]} \boldsymbol{a}^{[l-1]} + \boldsymbol{b}^{[l]}$ and $\boldsymbol{a}^{[l]} = \sigma(\boldsymbol{z}^{[l]})$.
Cache $\boldsymbol{z}^{[l]}$ for backpropagation.
3. Output Error: Calculate $\varepsilon^{[L]} = \nabla_{\boldsymbol{a}} J \odot \sigma'(\boldsymbol{z}^{[L]})$
4. Backpropagate Error: For each l = L−1, L−2, … 1 compute $\varepsilon^{[l]} = ((\boldsymbol{W}^{[l+1]})^T \varepsilon^{[l+1]}) \odot \sigma'(\boldsymbol{z}^{[l]})$
5. Output: the gradient of the cost function: $\dfrac{\partial J}{\partial w_{jk}^{[l]}} = a_k^{[l-1]} \varepsilon_j^{[l]}$ and $\dfrac{\partial J}{\partial b_j^{[l]}} = \varepsilon_j^{[l]}$
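A compact NumPy sketch of steps 1–5 for a sigmoid network with a squared-error cost (so that $\nabla_{\boldsymbol{a}} J = \boldsymbol{a}^{[L]} - \boldsymbol{y}$); the layer sizes and data are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, Ws, bs):
    # Return (dWs, dbs) for one training example, following steps 1-5
    a, activations, zs = x, [x], []
    for W, b in zip(Ws, bs):               # step 2: feedforward, caching each z
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    eps = (activations[-1] - y) * sigmoid_prime(zs[-1])   # step 3 (squared error)
    dWs = [None] * len(Ws)
    dbs = [None] * len(bs)
    dWs[-1] = eps @ activations[-2].T
    dbs[-1] = eps
    for l in range(len(Ws) - 2, -1, -1):   # step 4: propagate error backwards
        eps = (Ws[l + 1].T @ eps) * sigmoid_prime(zs[l])
        dWs[l] = eps @ activations[l].T    # step 5: dJ/dW = eps a^T
        dbs[l] = eps                       #         dJ/db = eps
    return dWs, dbs

rng = np.random.default_rng(0)
sizes = [3, 4, 1]                          # a 2-layer network: 3 -> 4 -> 1
Ws = [rng.normal(size=(m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]
dWs, dbs = backprop(rng.normal(size=(3, 1)), np.array([[1.0]]), Ws, bs)
print([dW.shape for dW in dWs])            # [(4, 3), (1, 4)]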
Backpropagation Block Diagram
cache
Forward
Propagation
Backpropagation
Layer l
Backpropagation Block Diagram
cache
Forward
Propagation
Backpropagation
. . .
. . .
Backpropagation With Gradient Descent
• For each training example x, set the input activation $\boldsymbol{a}^{[0]}(x)$ and perform the
following steps:
• Feedforward: For each l = 1, 2, 3, … L compute $\boldsymbol{z}^{[l]}(x) = \boldsymbol{w}^{[l]} \boldsymbol{a}^{[l-1]}(x) + \boldsymbol{b}^{[l]}$ and $\boldsymbol{a}^{[l]}(x) = \sigma(\boldsymbol{z}^{[l]}(x))$
• Output Error: Compute $\varepsilon^{[L]}(x) = \nabla_{\boldsymbol{a}} J \odot \sigma'(\boldsymbol{z}^{[L]}(x))$
• Backpropagate Error: For each l = L−1, L−2, … 1 compute $\varepsilon^{[l]}(x) = ((\boldsymbol{w}^{[l+1]})^T \varepsilon^{[l+1]}(x)) \odot \sigma'(\boldsymbol{z}^{[l]}(x))$
• Compute One Step Of Gradient Descent: For each l = L, L−1, L−2, … 1, update the
weights and biases according to the rules ($\alpha$ is the learning rate, m the number of training examples):
• $\boldsymbol{w}^{[l]} := \boldsymbol{w}^{[l]} - \dfrac{\alpha}{m} \sum_x \varepsilon^{[l]}(x) \, (\boldsymbol{a}^{[l-1]}(x))^T$
• $\boldsymbol{b}^{[l]} := \boldsymbol{b}^{[l]} - \dfrac{\alpha}{m} \sum_x \varepsilon^{[l]}(x)$
Learning By Processing The Training
Examples
• As we will discuss in more detail later, training examples may be processed in
different chunk sizes:
• Batch – the entire set of training examples
• Mini-batch – a subset of the training examples
• Stochastic Gradient Descent – a single training example
• In practice there will be two outer loops (sketched after this list):
• One outer loop generating mini-batches (if mini-batch is used)
• One outer loop stepping through multiple epochs of training
• Epoch – One pass through entire training set
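A sketch of the two outer loops described above. It assumes an sgd_update(Ws, bs, X_batch, y_batch) helper, a hypothetical name for a function that runs backpropagation and one gradient-descent step on a batch (e.g., built from the backprop sketch earlier).

import numpy as np

def train(Ws, bs, X, y, sgd_update, epochs=10, batch_size=32):
    # Outer loops: epochs, then mini-batches within each epoch
    m = X.shape[1]                       # examples are columns, as before
    rng = np.random.default_rng(0)
    for epoch in range(epochs):          # one epoch = one pass over training set
        perm = rng.permutation(m)        # shuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            sgd_update(Ws, bs, X[:, idx], y[:, idx])
    return Ws, bs

With batch_size equal to m this reduces to batch learning; with batch_size = 1 it is stochastic gradient descent (online learning).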
Matrix Shape
• $\boldsymbol{w}^{[l]}$: $(n^{[l]}, n^{[l-1]})$
• $\boldsymbol{b}^{[l]}$: $(n^{[l]}, 1)$
• $d\boldsymbol{w}^{[l]}$: $(n^{[l]}, n^{[l-1]})$
• $d\boldsymbol{b}^{[l]}$: $(n^{[l]}, 1)$
• $\boldsymbol{z}^{[l]}, \boldsymbol{a}^{[l]}$: $(n^{[l]}, 1)$
• For vectorization across training examples:
• $\boldsymbol{Z}^{[l]}, \boldsymbol{A}^{[l]}$: $(n^{[l]}, m)$
• $d\boldsymbol{Z}^{[l]}, d\boldsymbol{A}^{[l]}$: $(n^{[l]}, m)$
Exercises
• Draw computation graph for 2-layer network shown earlier
• Understand forward and backward pass
• The 4 fundamental equations for backpropagation derived vectorize over layers
but not over training examples
• Derive the 4 fundamental equations for backpropagation vectorized across training
examples
Initialization of Parameters
• If we initialize all weights to 0, all hidden units will compute the same value due to
symmetry, and they will receive identical updates, so they never differentiate
• Biases can be initialized to 0
• To solve this, weights should be initialized to small random values (see the sketch below)
• Small so that, for Tanh or Sigmoid units, activations land where the slope (and hence learning) is not small
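A sketch of this initialization scheme in NumPy; the 0.01 scale is the common heuristic the slide implies, not a tuned value.

import numpy as np

def init_params(layer_sizes, scale=0.01, seed=0):
    # Small random weights break symmetry; biases can start at zero
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(size=(m, n)) * scale      # small: keeps tanh/sigmoid units
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]  # in their steep region
    bs = [np.zeros((m, 1)) for m in layer_sizes[1:]]
    return Ws, bs

Ws, bs = init_params([3, 4, 1])
print([W.shape for W in Ws], [b.shape for b in bs])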
Parameters vs. Hyperparameters
• Parameters:
• 𝒘[𝑙], 𝒃[𝑙]
• Hyperparameters – determine the parameters (tuned during cross-validation):
• $\alpha$ – Learning Rate
• # of iterations
• L – Number of hidden layers
• $n^{[l]}$ – Number of hidden units in layer l
• $g^{[l]}$ – Activation function for layer l
• Many others to be seen . . .