Deep Learning
Introduction
Artificial Intelligence
Machine Learning
Deep Learning
•AI is the study of pattern recognition and of mimicking human behavior. AI-powered computers have started simulating the way the human brain works: sensation, action, interaction, perception, and cognitive abilities.
•A subset of AI that incorporates math and
statistics in such a way that allows the
application to learn from data.
•A subset of ML that uses neural networks to learn from unstructured or unlabeled data.
•A measurable attribute of
data, determined to be
valuable in the learning
process.
Feature
•A set of algorithms inspired by
neural connections in the
human brain, consisting of
thousands to millions of
connected processing nodes.
Neural Network
•Identifying to which category a
given data point belongs.
Classification
Machine Learning Vs Deep Learning
Operation: ML algorithms are given a training data set and learn how to predict similar events in the future, usually evaluated on a test set. DL is mostly based on neural networks, which are one class of ML algorithm, and handles most of the feature selection/extraction on its own.
Methods: ML - Supervised and Unsupervised. DL - Supervised and Unsupervised.
Data: ML - a few thousand samples, can train on less data. DL - typically more than a million samples, requires large data.
Accuracy: ML - lower accuracy. DL - higher accuracy.
Algorithms: ML - Linear and Logistic Regression, Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Neural Network (NN). DL - Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM).
Relationship: Machine Learning is a child of Artificial Intelligence and the parent of Deep Learning.
What's the difference between the two?
Simply explained, both machine learning and deep learning mimic the way the human brain learns. Their main difference is the type of algorithms used in each case, although deep learning is more similar to human learning because it works with neurons. Machine learning usually uses algorithms such as decision trees, while deep learning uses neural networks, which are more evolved. Both can learn in a supervised or unsupervised way.
Fundamentals
Forward Propagation
Backward Propagation
Gradient Descent
Perceptron
Let us add bias: Each perceptron also has a bias, which can be thought of as how flexible the perceptron is. It is similar to the constant b of a linear function y = ax + b: it allows us to move the line up and down to fit the prediction to the data better. Without b the line always goes through the origin (0, 0) and you may get a poorer fit. For example, a perceptron with three inputs requires four parameters: one weight for each input and one for the bias. The linear combination of the inputs then looks like w1*x1 + w2*x2 + w3*x3 + 1*b.
The simplest perceptron directly combines the inputs and computes the output based on a threshold value. For example, take x1=0, x2=1, x3=1 and set the threshold to 0. If x1+x2+x3 > 0, the output is 1, otherwise 0. In this case the perceptron computes the output as 1.
Next, let us add weights to the inputs. Weights give importance to an input. For example, assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we multiply each input by its respective weight and compare the sum with the threshold value: w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 than to x1 and x2.
Perceptrons can only represent linear relationships. A neuron additionally applies a non-linear transformation (activation function) to the weighted inputs and bias.
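Below is a minimal NumPy sketch of the weighted perceptron just described; the inputs, weights, and threshold are the illustrative values from the text, and the helper name perceptron is ours, not part of any library.

import numpy as np

def perceptron(x, w, b, threshold=0.0):
    """Weighted perceptron: fires (returns 1) when w.x + b exceeds the threshold."""
    return int(np.dot(w, x) + b > threshold)

# Illustrative values from the text: x1=0, x2=1, x3=1 with weights 2, 3, 4 and no bias.
x = np.array([0.0, 1.0, 1.0])
w = np.array([2.0, 3.0, 4.0])
print(perceptron(x, w, b=0.0))  # 1, since 2*0 + 3*1 + 4*1 = 7 > 0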
Activation function
Activation Function takes the sum of weighted input (w1*x1 + w2*x2 +
w3*x3 + 1*b) as an argument and returns the output of the neuron. In the
below equation, we have represented 1 as x0 and b as w0
Fundamentals
It is used to apply a non-linear transformation that
allows us to fit non-linear hypotheses or to
estimate complex functions. Examples include Sigmoid,
Tanh, ReLU, and many others.
Forward Propagation
Backward Propagation
Gradient Descent
Epoch
Multi-layer perceptron
The different components are:
1. X1, ..., XN: Inputs to the neuron. These can either be the actual observations from the input layer or an intermediate value from one of the hidden layers.
2. X0: Bias unit. This is a constant value added to the input of the activation function. It works similarly to an intercept term and typically has the value +1.
3. w0, w1, w2, w3, ..., wN: Weights on each input. Note that even the bias unit has a weight.
f is known as an activation function. This makes a neural network extremely flexible and imparts the capability to estimate complex non-linear
relationships in the data. It can be a Gaussian function, a logistic function, a hyperbolic function, or even a linear function in simple cases.
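As a rough illustration of the neuron described above (weighted sum of inputs plus bias passed through an activation f), here is a small sketch with the sigmoid as the activation; the input and weight values are made up for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """One neuron: linear combination w.x + b followed by a non-linear activation f."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1..x3 (illustrative)
w = np.array([0.4, 0.1, -0.2])   # weights w1..w3 (illustrative)
print(neuron(x, w, b=0.3))       # sigmoid of (0.2 - 0.12 - 0.6 + 0.3) = sigmoid(-0.22)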
Hidden Layer
The figure shows a single hidden layer (in green), but in practice an MLP can contain multiple hidden layers. In addition, another point to remember in the case of an MLP
is that all the layers are fully connected, i.e. every node in a layer (except the input and the output layer) is connected to every node in the
previous layer and the following layer.
Fundamentals
Full Batch Gradient Descent Stochastic Gradient Descent
Fundamentals
Model Parameters:
Properties that the classifier or other ML model learns on its own from the
training data during training; for example, weights
and biases, or split points in a Decision Tree.
Model Hyperparameters:
These are properties that govern the entire training process.
Hyperparameters are important since they directly control the behavior
of the training algorithm and have an important impact on the performance of the
model being trained.
They include the variables which determine the network structure (for example,
the number of hidden units) and the variables which determine how the network
is trained (for example, the learning rate).
Model hyperparameters are set before training (before optimizing the
weights and biases).
• Learning Rate
• Number of Epochs
• Hidden Layers
• Hidden Units
• Activations Functions
PARAMETERS vs HYPERPARAMETERS
Parameters are required for making predictions; hyperparameters are required for estimating the model parameters.
Parameters are estimated by optimization algorithms (Gradient Descent, Adam, Adagrad); hyperparameters are estimated by hyperparameter tuning.
Parameters are not set manually; hyperparameters are set manually.
The final parameters found after training decide how the model will perform on unseen data; the choice of hyperparameters decides how efficient the training is. In gradient descent, the learning rate decides how efficient and accurate the optimization process is in estimating the parameters.
Underfitting refers to a model that can neither model the training dataset nor generalize to new dataset. An underfit machine learning model is
not a suitable model and will be obvious as it will have poor performance on the training dataset.
Fundamentals
Overfitting means that a machine learning model cannot generalize or fit well on an unseen dataset. The model's error on the testing or validation
dataset is much greater than the error on the training dataset: the model/function corresponds too closely to one particular dataset. As a result, an overfit model
may fail to fit additional data, and this may affect the accuracy of predictions on future observations.
A model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new
dataset.
Methods to prevent Overfitting
Cross-validation: Use the initial training data to generate multiple mini train-test splits, and use these splits to tune the model. Tune hyperparameters with only the original training dataset; this allows you to keep the test dataset as a truly unseen dataset.
More training data: With more data fed into the model, it becomes unable to overfit all the samples and is forced to generalize to obtain results, which also increases accuracy.
Data augmentation: Makes a data sample look slightly different every time it is processed by the model. This makes each sample appear unique to the model and prevents the model from simply memorizing the characteristics of the dataset.
Reduce complexity or data simplification: Reduce overfitting by decreasing the complexity of the model, for example by reducing the number of parameters in a neural network or by using dropout.
Ensembling: Machine learning methods for combining predictions from multiple separate models. Boosting attempts to improve the predictive flexibility of simple models; bagging attempts to reduce the chance of overfitting complex models.
Step-by-Step Procedure of
Neural Network Operation Methodology
Visualization of steps for
Neural Network Operation
Let’s look at the step-by-step building methodology of a Neural Network (an MLP with one hidden layer, similar
to the architecture shown above). At the output layer we have only one neuron, as we are solving a binary
classification problem (predict 0 or 1). We could also have two output neurons, one for each of the two classes.
0.) We take input and output
X as an input matrix
y as an output matrix
1.) Then we initialize weights and biases with
random values (a one-time initialization; subsequent
iterations use the updated weights and biases). Let us define:
wh as a weight matrix to the hidden layer
bh as bias matrix to the hidden layer
wout as a weight matrix to the output layer
bout as bias matrix to the output layer
2.) Then we take matrix dot product of input and
weights assigned to edges between the input and
hidden layer then add biases of the hidden layer
neurons to respective inputs, this is known as
linear transformation:
hidden_layer_input= matrix_dot_product(X,wh) + bh
Yellow-filled cells represent the current active cell. Orange cells represent the inputs used to populate the values of the current cell.
Visualization of steps for
Neural Network Operation
3) Perform non-linear transformation using an
activation function (Sigmoid). Sigmoid will return
the output as 1/(1 + exp(-x)).
hiddenlayer_activations = sigmoid(hidden_layer_input)
4.) Then perform a linear transformation on
hidden layer activation (take matrix dot product
with weights and add a bias of the output layer
neuron) then apply an activation function (again
used sigmoid, but can use any activation function
depending upon task) to predict the output
output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
output = sigmoid(output_layer_input)
All the above steps are known as “Forward Propagation“
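The forward-propagation steps 0-4 can be sketched in NumPy roughly as follows; X, y, and the layer sizes are illustrative placeholders rather than values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 0.) illustrative input/output matrices (3 samples, 4 features, binary target)
X = rng.random((3, 4))
y = rng.integers(0, 2, size=(3, 1)).astype(float)

# 1.) random one-time initialization of weights and biases
hidden_units = 3
wh   = rng.random((X.shape[1], hidden_units))   # input  -> hidden weights
bh   = rng.random((1, hidden_units))            # hidden-layer bias
wout = rng.random((hidden_units, 1))            # hidden -> output weights
bout = rng.random((1, 1))                       # output-layer bias

# 2.) linear transformation at the hidden layer
hidden_layer_input = X @ wh + bh

# 3.) non-linear transformation (sigmoid activation)
hiddenlayer_activations = sigmoid(hidden_layer_input)

# 4.) linear transformation + activation at the output layer
output_layer_input = hiddenlayer_activations @ wout + bout
output = sigmoid(output_layer_input)
print(output)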
Visualization of steps for
Neural Network Operation
5.) Compare the prediction with the actual output and
calculate the gradient of the error (Actual – Predicted).
The error is the mean squared loss = ((y – output)^2)/2
E = y – output
6.) Compute the slope/gradient of the hidden and
output layer neurons (to find the slope, calculate
the derivative of the non-linear activation at each
layer for each neuron). The gradient of the sigmoid
with output x can be returned as x * (1 – x).
slope_output_layer = derivatives_sigmoid(output)
slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
7.) Then compute change factor(delta) at the
output layer, dependent on the gradient of error
multiplied by the slope of output layer activation
d_output = E * slope_output_layer
8.) At this step, the error will propagate back into
the network which means error at the hidden
layer. For this, take the dot product of the output
layer delta with the weight parameters of edges
between the hidden and output layer (wout.T).
Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)
9.) Compute change factor(delta) at hidden layer,
multiply the error at hidden layer with slope of
hidden layer activation
d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer
10.) Then update weights at the output and hidden
layer: The weights in the network can be updated
from the errors calculated for training example(s).
wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose,
d_output)*learning_rate
wh = wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate
learning_rate: the amount by which the weights are updated is controlled
by a configuration parameter called the learning rate.
11.) Finally, update biases at the output and
hidden layer: The biases in the network can be
updated from the aggregated errors at that
neuron.
bias at output_layer = bias at output_layer + row-wise sum of the delta of
the output_layer * learning_rate
bias at hidden_layer = bias at hidden_layer + row-wise sum of the delta of
the hidden_layer * learning_rate
bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
bout = bout + sum(d_output, axis=0)*learning_rate
Steps 5 to 11 are known as “Backward Propagation”. One forward and one backward
propagation iteration together are considered one training cycle.
Above, you can see that there is still a sizeable error and the output is not
close to the actual target values, because we have
completed only one training iteration. If we train
the model for many iterations, the output gets very close to the
actual outcome. After thousands of iterations the result is
close to the actual target values
([[ 0.98032096] [ 0.96845624] [ 0.04532167]]).
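Putting steps 0-11 together, here is a compact NumPy sketch of the full forward/backward cycle run for many iterations; the toy X and y are assumed for illustration and are not necessarily the exact data behind the quoted numbers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def derivatives_sigmoid(a):
    # gradient of the sigmoid expressed through its own output: a * (1 - a)
    return a * (1.0 - a)

# Illustrative toy data (binary targets), assumed for this sketch
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0]], dtype=float)

epochs, lr, hidden_units = 5000, 0.1, 3
rng = np.random.default_rng(1)
wh, bh = rng.random((4, hidden_units)), rng.random((1, hidden_units))
wout, bout = rng.random((hidden_units, 1)), rng.random((1, 1))

for _ in range(epochs):
    # forward propagation (steps 2-4)
    hiddenlayer_activations = sigmoid(X @ wh + bh)
    output = sigmoid(hiddenlayer_activations @ wout + bout)

    # backward propagation (steps 5-11)
    E = y - output
    d_output = E * derivatives_sigmoid(output)
    error_at_hidden_layer = d_output @ wout.T
    d_hiddenlayer = error_at_hidden_layer * derivatives_sigmoid(hiddenlayer_activations)

    wout += hiddenlayer_activations.T @ d_output * lr
    wh   += X.T @ d_hiddenlayer * lr
    bout += d_output.sum(axis=0, keepdims=True) * lr
    bh   += d_hiddenlayer.sum(axis=0, keepdims=True) * lr

print(output)   # after thousands of iterations the outputs approach the targets [1, 1, 0]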
Convolutional Neural Network
Convolutional Neural Network
(CNN) Architecture
The ConvNet architecture consists of three types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
INPUT layer : holds the input image as a 3-D array of pixel values.
CONV layer : computes the element-wise product between the kernel and a sub-
array of the input image of the same size as the kernel, then sums all the
resulting values; this sum becomes a single pixel value of the
output image. The process is repeated until the whole input image is
covered, and for all the kernels.
RELU layer : applies the activation function max(0, x) to all the pixel values
of the output image.
POOL layer : performs down-sampling along the width and height of the image,
reducing its spatial dimensions.
FC (Fully-Connected) layer : computes the class score for each of the
classification categories.
Advantages of Convolution Neural Network (CNN):
•CNN learns the filters automatically without them being specified explicitly. These filters help in extracting the right and relevant features from the input data.
•CNN captures the spatial features from an image. Spatial features refer to the arrangement of pixels and the relationships between them in an image. They help us in
identifying the object accurately, the location of an object, as well as its relation to other objects in the image.
•CNN also follows the concept of parameter sharing: a single filter is applied across different parts of the input to produce a feature map.
Convolutional Neural Network
Step-by-Step Process
The Convolution Layer
Consider we have an image of size 6*6.
We define a weight matrix which extracts certain features from the images
We have initialized the weight (filter) as a 3*3 matrix. This weight now
slides across the image such that all the pixels are covered at least
once, to give a convolved output. The value 429 above is obtained by
adding the values obtained from element-wise multiplication of the
weight matrix and the highlighted 3*3 part of the input image.
The 6*6 image is thereby converted into a 4*4 image. Pixel values are
reused as the weight matrix moves along the image; this is what
enables parameter sharing in a convolutional neural network. The
weights are learnt so that they extract features from the original image
that help the network make correct predictions.
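Here is a naive NumPy sketch of this sliding-window convolution (stride 1, no padding); the image and filter values are arbitrary, not the ones behind the value 429 in the example.

import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid convolution: slide the kernel over the image and sum element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # one output pixel per filter position
    return out

image  = np.arange(36, dtype=float).reshape(6, 6)   # arbitrary 6*6 image
kernel = np.ones((3, 3)) / 9.0                      # arbitrary 3*3 filter
print(convolve2d(image, kernel).shape)              # (4, 4): the 6*6 image becomes 4*4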
Convolutional Neural Network
Stride
The filter (weight matrix) moves across the entire image n pixels at a time; n is the stride.
Stride = 1
Stride = 2
The size of the output image keeps reducing as we increase the stride value.
Stride is defined as a hyperparameter that controls how the weight matrix moves across the image. If the weight matrix moves 1
pixel at a time, we call it a stride of 1.
Convolutional Neural Network
Padding
Padding the input image with zeros around the border maintains the output image size despite the convolution and stride. We can also add more than one layer of zeros around
the image in the case of larger filters or higher stride values.
The initial shape of the image is retained after we pad the image with zeros. This is known as same padding, since the output image has
the same size as the input (when no padding is added and only the valid pixels of the input image are used, it is called valid padding). The middle 4*4 pixels would be the
same. Here we have retained more information from the borders and have also preserved the size of the image.
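A short sketch of same padding for a 3*3 filter with stride 1, reusing the convolve2d helper from the previous sketch; the image values are arbitrary.

import numpy as np

image  = np.arange(36, dtype=float).reshape(6, 6)
padded = np.pad(image, pad_width=1)          # one layer of zeros around the image
kernel = np.ones((3, 3)) / 9.0

print(padded.shape)                          # (8, 8)
print(convolve2d(padded, kernel).shape)      # (6, 6): same size as the original input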
Convolutional Neural Network
Pooling
•Sometimes when the images are too large, we would need to reduce the number of trainable
parameters.
•It is then desired to periodically introduce pooling layers between subsequent convolution
layers.
•Pooling is done for the sole purpose of reducing the spatial size of the image.
•Pooling is done independently on each depth dimension, therefore the depth of the image
remains unchanged.
•The most common form of pooling layer generally applied is the max pooling.
Here the stride is 2 and the pooling size is also 2.
The max operation is applied to each depth
dimension of the convolved output.
The 4*4 convolved output becomes 2*2 after the
max pooling operation; convolving the image and then
applying max pooling reduces the number of parameters.
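A small sketch of 2*2 max pooling with stride 2 on a single-channel map (in practice the same operation runs independently on each depth slice); the feature-map values are made up.

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling on a single 2-D feature map (applied per depth slice in practice)."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

conv_out = np.array([[1, 3, 2, 1],
                     [4, 6, 5, 0],
                     [7, 2, 9, 8],
                     [3, 1, 4, 6]], dtype=float)
print(max_pool(conv_out))   # 2*2 result: [[6, 5], [7, 9]]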
Convolutional Neural Network
Output Dimensions & Output Layer
Output dimensions :
Filters / Depth: The depth of the output volume (and of the
activation map) is equal to the number of filters applied.
Stride: For a stride of one, we move across and down a single
pixel. With higher stride values, we move a larger number of pixels at
a time and hence produce smaller output volumes.
Zero padding: This helps us preserve the size of the input
image. If a single layer of zero padding is added, a single-stride filter
movement retains the size of the original image.
Formula to calculate the output dimensions.
The spatial size of the output image = ( [W-F+2P]/S)+1.
W is the input volume size
F is the size of the filter
P is the number of padding applied
S is the number of strides.
Suppose we have an input image of size 32*32*3, we apply 10 filters
of size 3*3*3, with single stride and no zero padding.
Here W=32, F=3, P=0 and S=1. The output depth will be equal to the
number of filters applied i.e. 10. The size of the output volume will be
([32-3+0]/1)+1 = 30. Therefore the output volume will be 30*30*10.
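The output-size formula above can be wrapped in a tiny helper and checked against the 32*32*3 example; the function name conv_output_size is ours, chosen for illustration.

def conv_output_size(W, F, P, S):
    """Spatial output size of a conv layer: ((W - F + 2*P) / S) + 1."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# Example from the slide: 32*32*3 input, 10 filters of size 3*3*3, stride 1, no padding.
size = conv_output_size(W=32, F=3, P=0, S=1)
print(size, size, 10)   # 30 30 10  -> output volume 30*30*10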
Output layer:
The convolution and pooling layers are only able to extract features and reduce the number of parameters relative to the original images.
However, to generate the final output we need to apply a fully connected layer that produces an output equal to the number of classes we
need. The output layer has a loss function, such as categorical cross-entropy, to compute the error in prediction. Once the forward pass is
complete, back propagation begins to update the weights and biases for error and loss reduction.
Well Known 10 Evaluation Metrics
for Classification Models
Predicted: Outcome of the model on the validation set
Actual: Values seen in the training set
Positive (P): Observation is positive
Negative (N): Observation is not positive
True Positive (TP): Observation is positive, and is predicted correctly
False Negative (FN): Observation is positive, but predicted wrongly
True Negative (TN): Observation is negative, and predicted correctly
False Positive (FP): Observation is negative, but predicted wrongly
2. Accuracy : The proportion of correct predictions over the whole dataset, i.e. the number of correct
predictions over the output size. Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. Detection rate : This metric shows the number of correct positive class predictions made as a
proportion of all of the predictions made. Detection Rate = TP / (TP + FP + FN + TN)
4. Logarithmic loss: log loss works by penalizing all
false/incorrect classifications. The classifier assigns a probability to
each class for every sample. The formula, for N samples and M classes, where y_ij indicates
whether sample i belongs to class j and p_ij is the predicted probability of that, is:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
5. Sensitivity (true positive rate): The true positive rate corresponds to the proportion of positive data points
that are correctly classified as positive, with respect to all positive data points. Sensitivity = TP / (TP + FN)
6. Specificity (true negative rate): Corresponds to the proportion of negative data points that are correctly
classified as negative, with respect to all negative data points. Specificity = TN / (TN + FP). The false positive
rate is FPR = FP / (FP + TN) = 1 – Specificity. Please note that both FPR and TPR have values in the range of 0 to 1.
1. Confusion matrix is a metric
used to quantify the performance
of a machine learning classifier.
Confusion matrices are used to
visualize important predictive
analytics like recall, specificity,
accuracy, and precision.
7. Precision
This metric is the number of correct positive results divided by the number of
positive results predicted by the classifier.
Precision = TP / (TP + FP)
8. Recall
Recall is the number of correct positive results divided by the number of all samples
that should have been identified as positive.
Recall = TP / (TP + FN)
9. F1 score : The F1 score is the harmonic mean of precision and recall. It is used to measure the accuracy of tests
and is a direct indication of the model’s performance. The range of the F1 score is between 0 and 1, with the goal being to get as
close as possible to 1. It is calculated as: F1 = 2 * (Precision * Recall) / (Precision + Recall). (A short code sketch computing these confusion-matrix metrics follows after metric 10.)
10. Receiver operating
characteristic curve (ROC) / area
under curve (AUC) score
The ROC curve is a graph that displays the
classification model’s performance at all
classification thresholds. As the name suggests,
the AUC is the entire two-dimensional area
below the ROC curve. The curve is built
from two important metrics: sensitivity and
specificity.
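The confusion-matrix metrics defined above (accuracy, detection rate, sensitivity, specificity, precision, recall, F1) can be computed from the four counts as in this sketch; the counts are made up for illustration.

def classification_metrics(tp, fp, tn, fn):
    """Metrics from the confusion-matrix counts, following the formulas above."""
    total       = tp + tn + fp + fn
    accuracy    = (tp + tn) / total
    detection   = tp / total
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, detection_rate=detection, sensitivity=sensitivity,
                specificity=specificity, precision=precision, recall=sensitivity, f1=f1)

# Illustrative counts (not from the slides)
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))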
Well Known 10 Evaluation Metrics
for Classification Models
Precision-Recall Curve
(PRC)
As the name suggests,
this curve is a direct
representation of the
precision(y-axis) and the
recall(x-axis).
This is particularly useful
for the situations where
we have an imbalanced
dataset and the number
of negatives is much
larger than the positives.
Interpretation of ROC Curves
(Receiver Operating Characteristic Curve)
ROC Curve:
1. It is the plot between the TPR(y-axis) and FPR(x-axis).
2. Consider a model that classifies a patient as having heart disease or not based on the probabilities generated for each class; we can also decide the threshold on those
probabilities.
3. For example, we want to set a threshold value of 0.4. This means that the model will classify the datapoint/patient as having heart disease if the probability of the
patient having a heart disease is greater than 0.4.
4. This will give a high recall value and reduce the number of False Negatives. Similarly, we can visualize how our model performs for different threshold values
using the ROC curve.
5. Let us generate a ROC curve for our model with k = 3.
1. At the lowest point, i.e. at (0, 0)- the threshold is set at 1.0. This means our model classifies all patients
as not having a heart disease.
2. At the highest point i.e. at (1, 1), the threshold is set at 0.0. This means our model classifies all patients
as having a heart disease.
3. The rest of the curve is the values of FPR and TPR for the threshold values between 0 and 1. At some
threshold value, we observe that for FPR close to 0, we are achieving a TPR of close to 1. This is when
the model will predict the patients having heart disease almost perfectly.
4. The area with the curve and the axes as the boundaries is called the Area Under Curve(AUC). It is this
area which is considered as a metric of a good model. With this metric ranging from 0 to 1, we should
aim for a high value of AUC. Models with a high AUC are called as models with good skill. Let us
compute the AUC score of our model and the above plot: 0.868
5. We get a value of 0.868 as the AUC which is a pretty good score! This means that the model will be
able to distinguish the patients with heart disease and those who don’t 87% of the time.
6. The diagonal line is a random model with an AUC of 0.5, a model with no skill, which is just the same as
making a random prediction. A short scikit-learn sketch for computing the ROC curve and AUC follows.
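As a rough sketch of how such a curve is computed in practice, scikit-learn's roc_curve and roc_auc_score can be used; the y_true labels and y_score probabilities below are illustrative placeholders, not the heart-disease data.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and predicted probabilities (placeholders)
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
auc_score = roc_auc_score(y_true, y_score)
print(f"AUC = {auc_score:.3f}")   # 0.5 would be a no-skill model; closer to 1 is better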
Interpretation of
Precision-Recall Curve (PRC)
As the name suggests, this curve is a direct representation of the precision(y-axis) and the recall(x-axis).
If you observe our definitions and formulae for the Precision and Recall above, you will notice that at no point are we using the True Negatives(the actual number
of people who don’t have heart disease).
This is particularly useful for the situations where we have an imbalanced dataset and the number of negatives is much larger than the positives(or when the
number of patients having no heart disease is much larger than the patients having it).
In such cases, our higher concern would be detecting the patients with heart disease as correctly as possible and would not need the TNR.
PRC Interpretation:
1. At the lowest point, i.e. at (0, 0)- the threshold is set at 1.0. This means our model
makes no distinctions between the patients who have heart disease and the patients
who don’t.
2. At the highest point i.e. at (1, 1), the threshold is set at 0.0. This means that both our
precision and recall are high and the model makes distinctions perfectly.
3. The rest of the curve is the values of Precision and Recall for the threshold values
between 0 and 1. Our aim is to make the curve as close to (1, 1) as possible- meaning a
good precision and recall.
4. Similar to ROC, the area with the curve and the axes as the boundaries is the Area
Under Curve(AUC). Consider this area as a metric of a good model. The AUC ranges
from 0 to 1. Therefore, we should aim for a high value of AUC. Let us compute the AUC
for our model and the above plot: 0.8957
5. As before, we get a good AUC of around 90%. Also, the model can achieve high
precision when recall is near 0, and can achieve a high recall by letting the precision
drop to about 50%. A short scikit-learn sketch for the precision-recall curve follows.
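Similarly, a precision-recall curve and its area can be sketched with scikit-learn's precision_recall_curve and auc; the labels and scores below are the same illustrative placeholders used in the ROC sketch.

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)   # area under the precision-recall curve
print(f"PR-AUC = {pr_auc:.3f}")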
Comparative Study on
Activation Functions
Activation Functions: Binary Step
Neural network activation functions
are a crucial component of deep
learning.
1. Activation functions determine
the output of a learning model.
2. They influence the model's accuracy.
3. They affect the computational efficiency of
training a model.
4. Activation functions have a major
effect on the ability to converge and
on the convergence speed.
1. Activation functions are mathematical equations that determine the output of a neural
network.
2. The function is attached to each neuron in the network, and determines whether it should be
activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s
prediction.
3. Activation functions also help normalize the output of each neuron to a range between 1 and 0
or between -1 and 1.
4. They must be computationally efficient because they are calculated across thousands or even
millions of neurons for each data sample.
5. Modern neural networks use a technique called backpropagation to train the model, which
places an increased computational strain on the activation function, and its derivative function.
Activation functions
Binary Step Function
A binary step function is a threshold-based activation function. If the
input value is above or below a certain threshold, the neuron is
activated and sends exactly the same signal to the next layer.
Disadvantage: The problem with a step function is that it does not
allow multi-value outputs—for example, it cannot support classifying
the inputs into one of several categories.
Activation Functions: Linear
Linear Activation Function
A linear activation function takes the form: y = mx
It takes the inputs, multiplied by the weights for each neuron, and creates an
output signal proportional to the input. In one sense, a linear function is better
than a step function because it allows multiple outputs
Limitations:
1. Not possible to use backpropagation (gradient descent) to train the model—the
derivative of the function is a constant, and has no relation to the input, X. So it’s
not possible to go back and understand which weights in the input neurons can
provide a better prediction.
2. All layers of the neural network collapse into one—with linear activation
functions, no matter how many layers in the neural network, the last layer will be a
linear function of the first layer (because a linear combination of linear functions is
still a linear function). So a linear activation function turns the neural network into
just one layer. A neural network with a linear activation function is simply a linear
regression model. It has limited power and limited ability to handle complex,
varying input data.
Activation Functions: Nonlinear
Sigmoid / Logistic
Non-linear Action Functions:
Non-linear activation functions allow the model to create complex
mappings between the network’s inputs and outputs, which are
essential for learning and modeling complex data, such as images,
video, audio, and data sets which are non-linear or have high
dimensionality.
Non-linear activation functions advantages over linear functions:
1. They allow backpropagation because they have a derivative
function which is related to the inputs.
2. They allow “stacking” of multiple layers of neurons to create a
deep neural network. Multiple hidden layers of neurons are
needed to learn complex data sets with high levels of accuracy.
Advantages
1. Smooth gradient, preventing “jumps” in output values.
2. Output values bound between 0 and 1, normalizing the output of each neuron.
3. Clear predictions—For X above 2 or below -2, tends to bring the Y value (the prediction)
to the edge of the curve, very close to 1 or 0. This enables clear predictions.
Disadvantages
1. Vanishing gradient—for very high or very low values of X, there is almost no change to the
prediction, causing a vanishing gradient problem. This can result in the network refusing to
learn further, or being too slow to reach an accurate prediction.
2. Outputs not zero centered.
3. Computationally expensive
Activation Functions: Nonlinear
TanH / Hyperbolic Tangent &
ReLU (Rectified Linear Unit) & Leaky ReLU
Advantages
Zero centered—making it easier to
model inputs that have strongly
negative, neutral, and strongly positive
values.
Otherwise like the Sigmoid function.
Disadvantages
Like the Sigmoid function
TanH / Hyperbolic Tangent
Advantages
Computationally efficient—allows the network to
converge very quickly
Non-linear, even though it looks like a linear
function; ReLU has a derivative function and
allows for backpropagation.
Disadvantages : the dying ReLU problem. When inputs
approach zero or are negative, the gradient of the function
becomes zero and the network cannot perform
backpropagation and cannot learn in that region.
Advantages
Prevents the dying ReLU problem: it has a small
positive slope in the negative region, so it does
enable backpropagation even for negative
input values. Other characteristics are like ReLU.
Disadvantages
Results not consistent—leaky ReLU does not
provide consistent predictions for negative
input values.
ReLU (Rectified Linear Unit) Leaky ReLU
Advantages
Able to handle multiple classes, whereas most other activation
functions handle only one class: it normalizes the output for each
class between 0 and 1 and divides by their sum, giving the
probability of the input value being in a specific class.
Useful for output neurons—typically Softmax is used only for the output
layer, for neural networks that need to classify inputs into multiple
categories
Swish is a new, self-gated activation function
discovered by researchers at Google. According to
their paper, it performs better than ReLU with a
similar level of computational efficiency. In
experiments on ImageNet with identical models
running ReLU and Swish, the new function achieved
top-1 classification accuracy 0.6-0.9% higher.
Swish
Softmax
Activation Functions: Nonlinear
Softmax and Swish
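For intuition, the activation functions discussed above can be written as small NumPy functions; this is a sketch, not a library implementation, and the alpha value for Leaky ReLU is an assumed default.

import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def swish(x):       return x * sigmoid(x)      # self-gated activation
def softmax(x):
    e = np.exp(x - np.max(x))                  # shift for numerical stability
    return e / e.sum()                         # outputs sum to 1 (class probabilities)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), swish(z), softmax(z), sep="\n")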
Activation Functions
Derivatives or Gradients
1. The derivative—also known as a
gradient—of an activation
function is extremely important
for training the neural network.
2. Neural networks are trained
using a process called
backpropagation—this is an
algorithm which traces back
from the output of the model,
through the different neurons
which were involved in
generating that output, back to
the original weight applied to
each neuron.
3. Backpropagation suggests an
optimal weight for each neuron
which results in the most
accurate prediction.
Derivatives or Gradients of
Activation Functions
Sigmoid TanH ReLU
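A small sketch of the derivatives shown on this slide (Sigmoid, TanH, ReLU), written in NumPy; the convention that the ReLU derivative is 0 at exactly x = 0 is an assumption.

import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)            # sigmoid'(x) = s(x) * (1 - s(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return (x > 0).astype(float)    # 0 for x <= 0, 1 for x > 0

z = np.array([-2.0, 0.5, 3.0])
print(d_sigmoid(z), d_tanh(z), d_relu(z), sep="\n")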
Derivatives or Gradients
Backpropagation,
Error Calculation with gradient descent
Backpropagation
6 Stages of Neural Network Learning
1. Initialization—initial weights are applied to all the neurons.
2. Forward propagation—the inputs from a training set are passed through the
neural network and an output is computed.
3. Error function as we are working with a training set, the correct output is
known. An error function is defined, which captures the delta between the
correct output and the actual output of the model, given the current model
weights (in other words, “how far off” is the model from the correct result).
4. Backpropagation—the objective of backpropagation is to change the weights
for the neurons, in order to bring the error function to a minimum.
5. Weight update—weights are changed to the optimal values according to the
results of the backpropagation algorithm.
6. Iterate until convergence—because the weights are updated by a small delta
step at a time, several iterations are required for the network to
learn. After each iteration, gradient descent updates the weights
towards a lower and lower global loss. The number of iterations needed
to converge depends on the learning rate, the network meta-parameters, and
the optimization method used.
Backpropagation is simply an
algorithm which performs a highly
efficient search for the optimal
weight values, using the gradient
descent technique.
Backpropagation
Step-by-Step Process
The image below is a very simple neural network model with two inputs
(i1 and i2), which can be real values between 0 and 1, two hidden neurons
(h1 and h2), and two output neurons (o1 and o2).
Biases in neural networks are extra neurons added to each layer, which
store the value of 1. This allows you to “move” or translate the activation
function so it doesn’t cross the origin, by adding a constant number.
The Forward Pass
Step-by-Step Process
Each neuron is a very simple component that executes the activation
function. There are several commonly used activation functions;
for example, this is the sigmoid function: f(x) = 1 / (1 + exp(-x))
Our simple neural network The forward pass works by:
1. Taking each of the two inputs
2. Multiplying by the first-layer weights—w1,2,3,4
3. Adding bias
4. Applying the activation function for neurons h1 and h2
5. Taking the output of h1 and h2 and multiplying by the second-
layer weights—w5,6,7,8
6. Applying the activation function for neurons o1 and o2; this is the output.
Assume that the first input i1 is 0.1, the weight going into the first neuron, w1,
is 0.27, the second input i2 is 0.2, the weight from the second input to the
first neuron, w3, is 0.57, and the first-layer bias b1 is 0.4.
The input of the first neuron h1 is combined from the two inputs, i1 and i2:
(i1 * w1) + (i2 * w3) + b1 = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541
Feeding this into the activation function of neuron h1:
f(0.541) = 1 / (1 + exp(-0.541)) = 0.632
Now, given some other weights w2 and w4 and the second input i2, you can follow
a similar calculation to get an output for the second neuron in the hidden layer, h2.
The final step is to take the outputs of neurons h1 and h2, multiply them by the
weights w5,6,7,8, and feed them to the same activation function of neurons o1 and
o2 (exactly the same calculation as above). The result is the final output of the
neural network; let’s say the final outputs are 0.735 for o1 and 0.455 for o2. We’ll
also assume that the correct output values are 0.5 for o1 and 0.5 for o2 (these values
are assumed to be known because in supervised learning each data point has a ground-truth value).
The backpropagation algorithm calculates how much the final output
values, o1 and o2, are affected by each of the weights. To do this, it
calculates partial derivatives, going back from the error function to the
neuron that carried a specific weight.
The error function
For simplicity, consider Mean Squared Error function. For the
first output, the error is the correct output value minus the
actual output of the neural network: 0.5-0.735 = -0.235
For the second output:0.5-0.455 = 0.045
Calculate the Mean Squared Error:
MSE(o1) = ½ × (-0.235)^2 = 0.0276
MSE(o2) = ½ × (0.045)^2 = 0.001
The Total Error is the sum of the two errors:
Total Error = 0.0276 + 0.001 = 0.0286
This is the number we need to minimize with Backpropagation.
Final outputs: 0.735 for o1 and 0.455
for o2. Assumed correct output
values are 0.5 for o1 and 0.5 for o2.
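A tiny script can reproduce the numbers used in this walkthrough: the h1 activation from the given inputs and weights, and the total error from the assumed outputs and targets.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden neuron h1, using the values from the text (i1=0.1, w1=0.27, i2=0.2, w3=0.57, b1=0.4)
h1_in  = 0.1 * 0.27 + 0.2 * 0.57 + 0.4   # 0.541
h1_out = sigmoid(h1_in)                  # approximately 0.632
print(round(h1_in, 3), round(h1_out, 3))

# Total error from the assumed final outputs (0.735, 0.455) and targets (0.5, 0.5)
outputs = np.array([0.735, 0.455])
targets = np.array([0.5, 0.5])
total_error = np.sum(0.5 * (targets - outputs) ** 2)
print(round(total_error, 4))             # approximately 0.0286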
Backpropagation with gradient descent
For example, weight w6, going from hidden neuron h1 to output neuron
o2, affected our model as follows: neuron h1 with weight w6 → affects
total input of neuron o2 → affects output o2 → affects total errors
Backpropagation goes in the opposite direction:
total errors → affected by output o2 → affected by total input of neuron
o2 → affected by neuron h1 with weight w6
The algorithm calculates three derivatives:
1. The derivative of total errors with respect to output o2
2. The derivative of output o2 with respect to total input of neuron o2
3. Total input of neuron o2 with respect to neuron h1 with weight w6
This gives us complete traceability from the total errors, all the way back to
the weight w6. Using the Leibniz Chain Rule, it is possible to calculate,
based on the above three derivatives, what is the optimal value of w6 that
minimizes the error function. In other words, what is the “best” weight w6
that will make the neural network most accurate? Similarly, the algorithm
calculates an optimal value for each of the 8 weights.
End result of backpropagation:
The backpropagation algorithm results in a set of optimal
weights, like this: Optimal values are: w1 = 0.355 ; w2 = 0.476 ;
w3 = 0.233 ; w4 = 0.674 ; w5 = 0.142 ; w6 = 0.967 ; w7 = 0.319 ;
w8 = 0.658.
Update the weights to these values, and start using the neural
network to make predictions for new inputs.
How Often Are the Weights Updated?
1) Updating after every sample in the training set;
2) Updating in a full batch; and
3) Updating on randomized mini-batches.
A schematic sketch of these three schedules follows.
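As a schematic sketch of the three schedules, the loop below uses hypothetical helpers compute_gradients and apply_update (here a simple linear-regression gradient, purely for illustration) to show when the update happens in each case.

import numpy as np

# Hypothetical helpers, for illustration only: compute_gradients returns the gradient of
# an MSE loss for a linear model; apply_update takes one gradient-descent step.
def compute_gradients(X, y, w):  return X.T @ (X @ w - y) / len(y)
def apply_update(w, g, lr=0.1):  return w - lr * g

rng = np.random.default_rng(0)
X, y, w = rng.random((100, 5)), rng.random(100), np.zeros(5)

# 1) Update after every sample (stochastic / online)
for i in range(len(y)):
    w = apply_update(w, compute_gradients(X[i:i+1], y[i:i+1], w))

# 2) Update once per pass over the whole training set (full batch)
w = apply_update(w, compute_gradients(X, y, w))

# 3) Update on randomized mini-batches
idx = rng.permutation(len(y))
for batch in np.array_split(idx, 10):        # ten mini-batches of ~10 samples each
    w = apply_update(w, compute_gradients(X[batch], y[batch], w))
print(w)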
Backpropagation
Step-by-Step Process and Calculation

Deeplearning for Computer Vision PPT with

  • 1.
  • 2.
    Introduction Artificial Intelligence Machine Learning Deep Learning •AI is thestudy of pattern recognition and mimicking human behavior.AI powered computers has started simulating the human brain work style sensation, actions, interaction, perception, and cognitive abilities. •A subset of AI that incorporates math and statistics in such a way that allows the application to learn from data. •A subset of ML that uses neural network to learn from unstructured or unlabeled data. •A measurable attribute of data, determined to be valuable in the learning process. Feature •A set of algorithms inspired by neural connections in the human brain, consisting of thousands to millions of connected processing nodes. Neural Network •Identifying to which category a given data point belongs. Classificati on
  • 3.
    Machine Learning VsDeep Learning Description Machine Learning Deep Learning Operation ML algorithms get train data set and learn how to predict similar events in future which is usually as test set. DL is mostly based on neural network which is one of ML algorithm. DL works mostly on feature selection. Methods Supervised and Unsupervised Supervised and Unsupervised Data A few thousands, can train on lesser data More than million, Requires large data. Accuracy Lesser accuracy High accurcy Algorithm Linear and Logistic Regression, Support Vector Machine (SEVI), Naive Bayes(NB), K-Nearest Neighborhood (KNN), Decision Tree Random Forest, Neural Network(NN) Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory(LSTM) Relationship Machine Learning is child of Artificial Intelligence and parent of Deep Learning Convolution Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory(LSTM) What's the difference between the two? Simply explained, both machine learning and deep learning mimic the way the human brain learns. Its main difference is therefore the type of algorithms used in each case, although deep learning is more similar to human learning as it works with neurons. Machine learning usually uses decision trees and deep learning neural networks, which are more evolved. In addition, both can learn in a supervised or unsupervised way.
  • 4.
  • 5.
    Fundamentals Forward Propagation Backward Propagation GradientDescent Perceptron Let us add bias: Each perceptron also has a bias which is thought of as how much flexible the perceptron is. It is similar to constant b of a linear function y = ax + b. It allows us to move the lineup and down to fit the prediction with the data better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron may have two inputs, in that case, it requires three weights. One for each input and one for the bias. Now linear representation of input will look like, w1*x1 + w2*x2 + w3*x3 + 1*b. By directly combining the input and computing the output based on a threshold value. for eg: Take x1=0, x2=1, x3=1 and setting a threshold =0. So, if x1+x2+x3>0, the output is 1 otherwise 0. You can see that in this case, the perceptron calculates the output as 1. Next, let us add weights to the inputs. Weights give importance to an input. For example, you assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 in comparison to x1 and x2. Perceptrons used for Linear. A neuron applies non-linear transformations (activation function) to the inputs and biases.
  • 6.
    Activation function Activation Functiontakes the sum of weighted input (w1*x1 + w2*x2 + w3*x3 + 1*b) as an argument and returns the output of the neuron. In the below equation, we have represented 1 as x0 and b as w0 Fundamentals It is used to make a non-linear transformation that allows us to fit nonlinear hypotheses or to estimate the complex functions. Like “Sigmoid”, “Tanh”, ReLu and many others. Forward Propagation Backward Propagation Gradient Descent Epoch Multi-layer perceptron The different components are: I. Xi, XN: Inputs to the neuron. These can either be the actual observations from the input layer or an intermediate value from one Of the hidden layers. 2. Xo: Bias unit. This is a constant value added to the input of the activation function. It works similar to an intercept term and typically has +1 value. 3. w0,w1,w2,w3…wN: Weights on each input. Note that even the bias unit has a weight. f is known as an activation function. This makes a Neural Network extremely flexible and imparts the capability to estimate complex non-linear relationships in data. It can be a gaussian function, logistic function, hyperbolic function or even a linear function in simple cases.
  • 7.
    Hidden Layer A singlehidden layer in green but in practice can contain multiple hidden layers. In addition, another point to remember in case of an MLP is that all the layers are fully connected i.e every node in a layer(except the input and the output layer) is connected to every node in the previous layer and the following layer. Fundamentals Full Batch Gradient Descent Stochastic Gradient Descent
  • 8.
    Fundamentals Model Parameters : Theproperties of training data that will learn on its own during training by the classifier or other ML model. For example, weights and biases, or split points in Decision Tree. Model Hyperparameters: They are instead properties that govern the entire training process. Hyperparameters are important since they directly control behavior of the training algo, having important impact on performance of the model under training. The variables which determines the network structure (for example, Number of Hidden Units) The variables which determine how the network is trained (for example, Learning Rate) Model hyperparameters are set before training (before optimizing the weights and bias). • Learning Rate • Number of Epochs • Hidden Layers • Hidden Units • Activations Functions PARAMETERS HYPERPARAMETER They are required for making predictions They are required for estimating the model parameters They are estimated by optimization algorithms(Gradient Descent, Adam, Adagrad) They are estimated by hyperparameter tuning They are not set manually They are set manually The final parameters found after training will decide how the model will perform on unseen data The choice of hyperparameters decide how efficient the training is. In gradient descent the learning rate decide how efficient and accurate the optimization process is in estimating the parameters
  • 9.
    Underfitting refers toa model that can neither model the training dataset nor generalize to new dataset. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training dataset. Fundamentals Overfitting is that a machine learning model can’t generalize or fit well on unseen dataset. The model's error on the testing or validation dataset is much greater than the error on training dataset. The model / function corresponds too closely to a dataset. As a result, overfitting may fail to fit additional data, and this may affect the accuracy of predicting future observations. A model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new dataset. Methods to prevent Overfitting Cross-validation: Initial training data to generate multiple mini train-test splits. Use these splits to tune the model. Tune hyperparameters with only original training dataset. This allows to keep the test dataset as a truly unseen dataset. More training data: More data into the model, it will be unable to overfit all the samples and will be forced to generalize to obtain results, also increases accuracy. Data augmentation: It makes a data sample look slightly different every time it is processed by the model. The process makes each data set appear unique to the model and prevents the model from learning the characteristics of the data sets. Reduce Complexity or Data Simplification: Reduce overfitting by decreasing the complexity of the model. Reduce the number of parameters in a Neural Networks, and using dropout on a Neural Networks. Ensembling: Machine learning methods for combining predictions from multiple separate models. Boosting attempts to improve the predictive flexibility of simple models. Bagging attempts to reduce the chance of overfitting complex models.
  • 10.
    Step-by-Step Procedure of NeuralNetwork Operation Methodology
  • 11.
    Visualization of stepsfor Neural Network Operation Let’s look at the step by step building methodology of Neural Network (MLP with one hidden layer, similar to above-shown architecture). At the output layer, we have only one neuron as we are solving a binary classification problem (predict 0 or 1). We could also have two neurons for predicting each of both classes. 0.) We take input and output X as an input matrix y as an output matrix 1.) Then we initialize weights and biases with random values (one-time initiation. Next iteration, use updated weights, and biases). Let us define: wh as a weight matrix to the hidden layer bh as bias matrix to the hidden layer wout as a weight matrix to the output layer bout as bias matrix to the output layer 2.) Then we take matrix dot product of input and weights assigned to edges between the input and hidden layer then add biases of the hidden layer neurons to respective inputs, this is known as linear transformation: hidden_layer_input= matrix_dot_product(X,wh) + bh Yellow filled cells represent current active cell. Orange cell represents the input used to populate the values of the current cell
  • 12.
    Visualization of stepsfor Neural Network Operation 3) Perform non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)). hiddenlayer_activations = sigmoid(hidden_layer_input) 4.) Then perform a linear transformation on hidden layer activation (take matrix dot product with weights and add a bias of the output layer neuron) then apply an activation function (again used sigmoid, but can use any activation function depending upon task) to predict the output output_layer_input = matrix_dot_product (hiddenlayer_activations * wout ) + bout output = sigmoid(output_layer_input) All the above steps are known as “Forward Propagation“
  • 13.
    Visualization of stepsfor Neural Network Operation 5.) Compare prediction with actual output and calculate the gradient of error (Actual – Predicted) Error is the mean square loss = ((Y-t)^2)/2 E = y – output 6.) Compute the slope/gradient of hidden and output layer neurons ( To find the slope, calculate the derivatives of non-linear activations x at each layer for each neuron). The gradient of sigmoid can be returned as x * (1 – x). slope_output_layer = derivatives_sigmoid(output) slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations) 7.) Then compute change factor(delta) at the output layer, dependent on the gradient of error multiplied by the slope of output layer activation d_output = E * slope_output_layer 8.) At this step, the error will propagate back into the network which means error at the hidden layer. For this, take the dot product of the output layer delta with the weight parameters of edges between the hidden and output layer (wout.T). Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)
  • 14.
    9.) Compute changefactor(delta) at hidden layer, multiply the error at hidden layer with slope of hidden layer activation d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer 10.) Then update weights at the output and hidden layer: The weights in the network can be updated from the errors calculated for training example(s). wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)*learning_rate wh = wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate learning_rate: The amount that weights are updated is controlled by a configuration parameter called the learning rate) 11.) Finally, update biases at the output and hidden layer: The biases in the network can be updated from the aggregated errors at that neuron. bias at output_layer =bias at output_layer + sum of delta of output_layer at row-wise * learning_rate bias at hidden_layer =bias at hidden_layer + sum of delta of output_layer at row-wise * learning_rate bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate bout = bout + sum(d_output, axis=0)*learning_rate Steps from 5 to 11 are known as “Backward Propagation “One forward and backward propagation iteration is considered as one training cycle Above, you can see that there is still a good error not close to the actual target value because we have completed only one training iteration. If we will train the model multiple times then it will be a very close actual outcome. I have completed thousands iteration and my result is close to actual target values ([[ 0.98032096] [ 0.96845624] [ 0.04532167]]).
  • 15.
  • 16.
    Convolutional Neural Network (CNN)Architecture The ConvNet architecture consists of three types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer. INPUT layer : hold the input image as a 3-D array of pixel values. CONV layer : Will compute the dot product between the kernel and sub- array of an input image same size as a kernel. Then it’ll sum all the values resulted from the dot product and this will be the single pixel value of an output image. This process is repeated until the whole input image is covered and for all the kernels. RELU layer : will apply an activation function max(0,x) on all the pixel values of an output image. POOL layer : Perform down sampling along the width and height of an image resulting in reducing the dimension of an image. FC (Fully-Connected) layer : Compute the class score for each of the classification category. Advantages of Convolution Neural Network (CNN): •CNN learns the filters automatically without mentioning it explicitly. These filters help in extracting the right and relevant features from the input data. •CNN captures the from an image. Spatial features refer to the arrangement of pixels and the relationship between them in an image. They help us in identifying the object accurately, the location of an object, as well as its relation with other objects in an image. •CNN also follows the concept of parameter sharing. A single filter is applied across different parts of an input to produce a feature map.
  • 17.
    Convolutional Neural Network Step-by-StepProcess The Convolution Layer Consider we have an image of size 6*6. We define a weight matrix which extracts certain features from the images We have initialized the weight(Filter) as a 3*3 matrix. This weight shall now run across the image such that all the pixels are covered at least once, to give a convolved output. The value 429 above, is obtained by the adding the values obtained by element wise multiplication of the weight matrix and the highlighted 3*3 part of the input image. The 6*6 image is now converted into a 4*4 image. Pixel values are used again when the weight matrix moves along the image. This basically enables parameter sharing in a convolutional neural network weights are learnt to extract features from the original image which help the network in correct prediction
  • 18.
    Convolutional Neural Network Stride Thefilter or the weight matrix, was moving across the entire image moving n pixel at a time, n is stride. Stride = 1 Stride = 2 The size of image keeps on reducing as we increase the stride value. This is defined as hyperparameter, as to how we would want the weight matrix to move across the image. If the weight matrix moves 1 pixel at a time, we call it as a stride of 1.
  • 19.
    Convolutional Neural Network Padding Paddingthe input image with zeros across maintains the output image size from stride. We can also add more than one layer of zeros around the image in case of higher stride values. The initial shape of the image is retained after we padded the image with a zero. This is known as same padding since the output image has the same size as the input, which means that we considered only the valid pixels of the input image. The middle 4*4 pixels would be the same. Here we have retained more information from the borders and have also preserved the size of the image.
  • 20.
    Convolutional Neural Network Pooling •Sometimeswhen the images are too large, we would need to reduce the number of trainable parameters. •It is then desired to periodically introduce pooling layers between subsequent convolution layers. •Pooling is done for the sole purpose of reducing the spatial size of the image. •Pooling is done independently on each depth dimension, therefore the depth of the image remains unchanged. •The most common form of pooling layer generally applied is the max pooling. Here stride as 2, while pooling size also as 2. The max operation is applied to each depth dimension of the convolved output. The 4*4 convolved output has become 2*2 after the max pooling operation. Convoluted image and applied max pooling reduce the parameters.
  • 21.
    Convolutional Neural Network OutputDimensions & Output Layer Output dimensions : Filters / Depth: The number of filters The depth of the output volume will be equal to the number of filter applied. The depth of the activation map will be equal to the number of filters. Stride: For the stride of one we move across and down a single pixel. With higher stride values, we move large number of pixels at a time and hence produce smaller output volumes. Zero padding: This helps us to preserve the size of the input image. If a single zero padding is added, a single stride filter movement would retain the size of the original image. Formula to calculate the output dimensions. The spatial size of the output image = ( [W-F+2P]/S)+1. W is the input volume size F is the size of the filter P is the number of padding applied S is the number of strides. Suppose we have an input image of size 32*32*3, we apply 10 filters of size 3*3*3, with single stride and no zero padding. Here W=32, F=3, P=0 and S=1. The output depth will be equal to the number of filters applied i.e. 10. The size of the output volume will be ([32-3+0]/1)+1 = 30. Therefore the output volume will be 30*30*10. Output layer: The convolution and pooling layers would only be able to extract features and reduce the number of parameters from the original images. However, to generate the final output we need to apply a fully connected layer to generate an output equal to the number of classes we need. The output layer has a loss function like categorical cross-entropy, to compute the error in prediction. Once the forward pass is complete the back propagation begins to update the weight and biases for error and loss reduction.
  • 22.
    Well Known 10 Evaluation Metrics for Classification Models Predicted: outcome of the model on the validation set. Actual: values seen in the training set. Positive (P): observation is positive. Negative (N): observation is not positive. True Positive (TP): observation is positive and is predicted correctly. False Negative (FN): observation is positive but predicted wrongly. True Negative (TN): observation is negative and predicted correctly. False Positive (FP): observation is negative but predicted wrongly. 1. Confusion matrix: a metric used to quantify the performance of a machine learning classifier; confusion matrices are used to visualize important predictive analytics like recall, specificity, accuracy, and precision. 2. Accuracy: the proportion of correct predictions in the total dataset, i.e. the number of correct predictions over the output size. Accuracy = (TP + TN) / (TP + TN + FP + FN). 3. Detection rate: the number of correct positive class predictions made as a proportion of all of the predictions made. Detection Rate = TP / (TP + FP + FN + TN). 4. Logarithmic loss: log loss works by penalizing false/incorrect classifications. The model assigns a probability to each class for every sample, and the loss is the negative average log-probability assigned to the true class: Log Loss = -(1/N) * ΣiΣj y_ij * log(p_ij), where y_ij indicates whether sample i belongs to class j and p_ij is the predicted probability that it does. 5. Sensitivity (true positive rate): the proportion of positive data points that are correctly classified as positive, with respect to all positive data points. Sensitivity = TP / (TP + FN). 6. Specificity (true negative rate): the proportion of negative data points that are correctly classified as negative, with respect to all negative data points. Specificity = TN / (TN + FP). The false positive rate is FPR = FP / (FP + TN) = 1 - Specificity. Please note that both FPR and TPR take values in the range 0 to 1. A small sketch computing these counts and ratios follows.
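A minimal plain-Python sketch of the confusion-matrix-based metrics above; the labels are illustrative, not data from the slides:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)          # true positive rate (recall)
specificity = TN / (TN + FP)          # true negative rate
fpr         = FP / (FP + TN)          # 1 - specificity
print(TP, TN, FP, FN, accuracy, sensitivity, specificity, fpr)
```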
  • 23.
    Well Known 10 Evaluation Metrics for Classification Models 7. Precision: the number of correct positive results divided by the number of positive results predicted by the classifier. Precision = TP / (TP + FP). 8. Recall: the number of correct positive results divided by the number of all samples that should have been identified as positive. Recall = TP / (TP + FN). 9. F1 score: the harmonic mean of precision and recall. It is used to measure the accuracy of tests and is a direct indication of the model’s performance. The F1 score ranges from 0 to 1, the goal being to get as close as possible to 1. It is calculated as F1 = 2 * (Precision * Recall) / (Precision + Recall). 10. Receiver operating characteristic (ROC) curve / area under curve (AUC) score: the ROC curve is a graph that displays the classification model’s performance at all thresholds, and, as the name suggests, the AUC is the area under that two-dimensional curve. The curve is built from two important metrics: sensitivity (TPR) and the false positive rate (1 - specificity). Precision-Recall Curve (PRC): as the name suggests, this curve is a direct plot of precision (y-axis) against recall (x-axis). It is particularly useful in situations where we have an imbalanced dataset and the number of negatives is much larger than the number of positives. A precision/recall/F1 sketch follows.
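A minimal sketch of precision, recall, and F1 using scikit-learn (assumed available); the labels are illustrative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)
```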
  • 24.
    Interpretation of ROC Curves (Receiver Operating Characteristic Curve) ROC Curve: 1. It is the plot of the TPR (y-axis) against the FPR (x-axis). 2. Consider a model that classifies a patient as having heart disease or not based on the probabilities generated for each class; we can also choose the probability threshold. 3. For example, suppose we set a threshold value of 0.4. This means the model will classify the datapoint/patient as having heart disease if the predicted probability of heart disease is greater than 0.4. 4. This lower threshold gives a high recall value and reduces the number of False Negatives, at the cost of more False Positives. Similarly, we can visualize how our model performs for different threshold values using the ROC curve. 5. Let us generate a ROC curve for our model with k = 3. 1. At the lowest point, i.e. at (0, 0), the threshold is set at 1.0; this means our model classifies all patients as not having heart disease. 2. At the highest point, i.e. at (1, 1), the threshold is set at 0.0; this means our model classifies all patients as having heart disease. 3. The rest of the curve gives the values of FPR and TPR for threshold values between 0 and 1. At some threshold value we observe that, for an FPR close to 0, we achieve a TPR close to 1; this is when the model predicts the patients having heart disease almost perfectly. 4. The area bounded by the curve and the axes is called the Area Under Curve (AUC), and it is this area which is considered a metric of a good model. With this metric ranging from 0 to 1, we should aim for a high value of AUC; models with a high AUC are said to have good skill. Let us compute the AUC score of our model and the above plot: 0.868. 5. We get a value of 0.868 as the AUC, which is a pretty good score. This means that the model will be able to distinguish the patients with heart disease from those who don’t have it about 87% of the time. 6. The diagonal line is a random model with an AUC of 0.5, a model with no skill, which is just the same as making a random prediction. A code sketch for plotting an ROC curve follows.
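A minimal ROC/AUC sketch with scikit-learn and matplotlib (both assumed available); the labels and probabilities below are illustrative, not the heart-disease model's actual output:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # predicted P(heart disease)

fpr, tpr, thresholds = roc_curve(y_true, y_score)       # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "--", label="random model (AUC = 0.5)")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate"); plt.legend()
plt.show()
```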
  • 25.
    Interpretation of Precision-Recall Curve (PRC) As the name suggests, this curve is a direct plot of precision (y-axis) against recall (x-axis). If you look at the definitions and formulae for Precision and Recall above, you will notice that at no point do they use the True Negatives (the actual number of people who don’t have heart disease). The PRC is therefore particularly useful in situations where we have an imbalanced dataset and the number of negatives is much larger than the positives (or when the number of patients having no heart disease is much larger than the number of patients having it). In such cases, our main concern is detecting the patients with heart disease as correctly as possible, and we do not need the TNR. PRC Interpretation: 1. At the lowest point, i.e. at (0, 0), the threshold is set at 1.0, so the model predicts no patient as having heart disease and the recall is 0. 2. At the highest point, i.e. at (1, 1), both precision and recall are 1, which only a model that distinguishes the two classes perfectly can achieve. 3. The rest of the curve gives the values of Precision and Recall for threshold values between 0 and 1. Our aim is to make the curve as close to (1, 1) as possible, meaning good precision and good recall. 4. Similar to ROC, the area bounded by the curve and the axes is the Area Under Curve (AUC). Consider this area a metric of a good model; the AUC ranges from 0 to 1, so we should aim for a high value. Let us compute the AUC for our model and the above plot: 0.8957. 5. As before, we get a good AUC of around 90%. Also note that the model can achieve high precision only at low recall, and achieves high recall only by letting the precision drop to around 50%. A precision-recall-curve sketch follows.
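A minimal precision-recall-curve sketch with scikit-learn and matplotlib (both assumed available); the labels and probabilities are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)          # area under the precision-recall curve

plt.plot(recall, precision, label=f"PRC (AUC = {pr_auc:.3f})")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend()
plt.show()
```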
  • 26.
  • 27.
    Activation Functions: Binary Step Neural network activation functions are a crucial component of deep learning. 1. Activation functions determine the output of a learning model, 2. the model accuracy, and 3. the computational efficiency of training a model. 4. Activation functions also have a major effect on the network's ability to converge and on the convergence speed. 1. Activation functions are mathematical equations that determine the output of a neural network. 2. The function is attached to each neuron in the network and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. 3. Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1. 4. They must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. 5. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function and its derivative. Binary Step Function A binary step function is a threshold-based activation function: if the input value is above the threshold, the neuron is activated and sends exactly the same signal (1) to the next layer regardless of how large the input is; otherwise it outputs 0. Disadvantage: a step function does not allow multi-value outputs—for example, it cannot support classifying the inputs into one of several categories. A minimal sketch follows.
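A minimal sketch of a binary step activation with NumPy (threshold value illustrative):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """1 if the input is above the threshold, otherwise 0 --
    the same output no matter how far the input is from the threshold."""
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.1, 0.3, 5.0])))   # [0 0 1 1]
```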
  • 28.
    Activation Functions: Linear Linear Activation Function A linear activation function takes the form y = mx. It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs. Limitations: 1. It is not possible to use backpropagation (gradient descent) to train the model—the derivative of the function is a constant and has no relation to the input X, so it is not possible to go back and understand which weights in the input neurons can provide a better prediction. 2. All layers of the neural network collapse into one—with linear activation functions, no matter how many layers the neural network has, the last layer is a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer; a neural network with a linear activation function is simply a linear regression model, with limited power and limited ability to handle complex, varying input data. The collapse described in limitation 2 is demonstrated in the sketch below.
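A small sketch of limitation 2: two stacked linear layers behave exactly like a single linear layer. The weight matrices here are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(4,))        # an input vector
W1 = rng.normal(size=(5, 4))      # "layer 1" weights
W2 = rng.normal(size=(3, 5))      # "layer 2" weights

two_layers = W2 @ (W1 @ x)        # two linear layers applied in sequence
one_layer  = (W2 @ W1) @ x        # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True -- the network collapsed to one layer
```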
  • 29.
    Activation Functions: Nonlinear Sigmoid / Logistic Non-linear Activation Functions: non-linear activation functions allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data such as images, video, audio, and data sets that are non-linear or have high dimensionality. Advantages of non-linear activation functions over linear ones: 1. They allow backpropagation because their derivative is related to the input. 2. They allow “stacking” of multiple layers of neurons to create a deep neural network; multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy. Sigmoid Advantages: 1. Smooth gradient, preventing “jumps” in output values. 2. Output values bound between 0 and 1, normalizing the output of each neuron. 3. Clear predictions—for X above 2 or below -2, the Y value (the prediction) is pushed to the edge of the curve, very close to 1 or 0. Disadvantages: 1. Vanishing gradient—for very high or very low values of X, there is almost no change in the prediction, causing a vanishing gradient problem; this can result in the network refusing to learn further, or being too slow to reach an accurate prediction. 2. Outputs are not zero-centered. 3. Computationally expensive. A sigmoid sketch follows.
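A minimal NumPy sketch of the sigmoid function; the sample inputs are illustrative:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(x))
# Large |x| saturates near 0 or 1 -- this flat region is where the
# vanishing-gradient problem mentioned above comes from.
```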
  • 30.
    Activation Functions: Nonlinear TanH / Hyperbolic Tangent & ReLU (Rectified Linear Unit) & Leaky ReLU TanH / Hyperbolic Tangent Advantages: Zero centered—making it easier to model inputs that have strongly negative, neutral, and strongly positive values. Otherwise like the Sigmoid function. Disadvantages: Like the Sigmoid function (vanishing gradient, computationally expensive). ReLU (Rectified Linear Unit) Advantages: Computationally efficient—allows the network to converge very quickly. Non-linear—although it looks like a linear function, ReLU has a derivative and allows for backpropagation. Disadvantages: The dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot learn through those units. Leaky ReLU Advantages: Prevents the dying ReLU problem—it has a small positive slope in the negative region, so it still enables backpropagation even for negative input values. Other characteristics are like ReLU. Disadvantages: Results not consistent—leaky ReLU does not provide consistent predictions for negative input values. Sketches of all three follow.
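Minimal NumPy sketches of the three functions discussed above; the sample inputs and the leaky-ReLU slope (alpha) are illustrative:

```python
import numpy as np

def tanh(x):
    """Zero-centered squashing into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """max(0, x): zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small slope (alpha) for negative inputs,
    so the gradient never becomes exactly zero there."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(tanh(x), relu(x), leaky_relu(x), sep="\n")
```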
  • 31.
    Activation Functions: Nonlinear Softmax and Swish Softmax Advantages: Able to handle multiple classes, where the other activation functions handle only one—it normalizes the output for each class to between 0 and 1 and divides by their sum, giving the probability of the input value belonging to a specific class. Useful for output neurons—typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories. Swish Swish is a newer, self-gated activation function discovered by researchers at Google. According to their paper, it performs better than ReLU with a similar level of computational efficiency. In experiments on ImageNet with identical models running ReLU and Swish, the new function achieved top-1 classification accuracy 0.6-0.9% higher. Sketches of both follow.
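Minimal NumPy sketches of softmax and swish; the sample logits and inputs are illustrative:

```python
import numpy as np

def softmax(z):
    """Exponentiate, then normalize so the outputs sum to 1 (class probabilities).
    Subtracting the max first is a standard numerical-stability trick."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def swish(x):
    """Self-gated activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                    # probabilities that sum to 1
print(swish(np.array([-2.0, 0.0, 2.0])))
```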
  • 32.
    Activation Functions: Derivatives or Gradients 1. The derivative—also known as the gradient—of an activation function is extremely important for training the neural network. 2. Neural networks are trained using a process called backpropagation—an algorithm which traces back from the output of the model, through the different neurons which were involved in generating that output, to the original weight applied to each neuron. 3. Backpropagation suggests an optimal weight for each neuron, which results in the most accurate prediction. (Plots: the derivatives/gradients of the Sigmoid, TanH, and ReLU functions.) A sketch of these derivatives follows.
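A minimal NumPy sketch of the derivatives (gradients) of the common activation functions, as used during backpropagation; the sample inputs are illustrative:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)  # 1 for positive inputs, 0 otherwise

x = np.array([-2.0, 0.0, 2.0])
print(d_sigmoid(x), d_tanh(x), d_relu(x), sep="\n")
```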
  • 33.
  • 34.
    Backpropagation: 6 Stages of Neural Network Learning 1. Initialization—initial weights are applied to all the neurons. 2. Forward propagation—the inputs from a training set are passed through the neural network and an output is computed. 3. Error function—because we are working with a training set, the correct output is known. An error function is defined which captures the delta between the correct output and the actual output of the model, given the current model weights (in other words, “how far off” the model is from the correct result). 4. Backpropagation—the objective of backpropagation is to change the weights of the neurons in order to bring the error function to a minimum. 5. Weight update—weights are changed to the optimal values according to the results of the backpropagation algorithm. 6. Iterate until convergence—because the weights are updated in small delta steps, several iterations are required for the network to learn. After each iteration, gradient descent updates the weights towards a smaller and smaller global loss. The number of iterations needed to converge depends on the learning rate, the network meta-parameters, and the optimization method used. Backpropagation is simply an algorithm which performs a highly efficient search for the optimal weight values, using the gradient descent technique. The six stages are sketched in code below.
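A minimal sketch of the six stages for a single linear neuron with a squared-error loss; the data, learning rate, and number of iterations are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y_true = np.array([0.5, -1.2, 2.0]), 1.0        # one training sample
w, b = rng.normal(size=3), 0.0                     # 1. initialization
lr = 0.1                                           # learning rate

for step in range(100):                            # 6. iterate until convergence
    y_pred = w @ x + b                             # 2. forward propagation
    error = 0.5 * (y_pred - y_true) ** 2           # 3. error function
    grad_w = (y_pred - y_true) * x                 # 4. backpropagation (dE/dw)
    grad_b = (y_pred - y_true)                     #    ... and dE/db
    w -= lr * grad_w                               # 5. weight update (gradient descent)
    b -= lr * grad_b

print(error)                                       # close to 0 after training
```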
  • 35.
    Backpropagation Step-by-Step Process The image below is a very simple neural network model with two inputs (i1 and i2), which can be real values between 0 and 1, two hidden neurons (h1 and h2), and two output neurons (o1 and o2). Biases in neural networks are extra neurons added to each layer which store the value 1. This allows you to “move” or translate the activation function so it doesn’t have to cross the origin, by adding a constant number.
  • 36.
    The Forward Pass Step-by-Step Process Each neuron is a very simple component which executes the activation function. There are several commonly used activation functions; for example, this is the sigmoid function: f(x) = 1 / (1 + exp(-x)). Our simple neural network The forward pass works by: 1. Taking each of the two inputs 2. Multiplying by the first-layer weights—w1, w2, w3, w4 3. Adding the bias 4. Applying the activation function for neurons h1 and h2 5. Taking the outputs of h1 and h2, multiplying by the second-layer weights—w5, w6, w7, w8 6. Applying the same activation function for neurons o1 and o2—this is the output. Assume that the first input i1 is 0.1, the weight going into the first hidden neuron, w1, is 0.27, the second input i2 is 0.2, the weight from the second input to the first hidden neuron, w3, is 0.57, and the first-layer bias b1 is 0.4. The input of the first neuron h1 is combined from the two inputs i1 and i2: (i1 * w1) + (i2 * w3) + b1 = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541. Feeding this into the activation function of neuron h1: f(0.541) = 1 / (1 + exp(-0.541)) = 0.632. Now, given the other weights w2 and w4 and the same inputs, you can follow a similar calculation to get the output of the second hidden neuron, h2. The final step is to take the outputs of neurons h1 and h2, multiply them by the weights w5, w6, w7, w8, and feed them to the same activation function at neurons o1 and o2 (exactly the same calculation as above). The result is the final output of the neural network; let’s say the final outputs are 0.735 for o1 and 0.455 for o2. We also assume that the correct output values are 0.5 for o1 and 0.5 for o2 (correct values can be assumed because in supervised learning each data point has its truth value). The h1 calculation is reproduced in the sketch below.
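A sketch reproducing the h1 calculation above; only the values given on the slide are used, and the remaining weights are left unspecified because the slide does not give them:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1, i2 = 0.1, 0.2
w1, w3 = 0.27, 0.57        # weights from i1 and i2 into hidden neuron h1
b1 = 0.4

net_h1 = i1 * w1 + i2 * w3 + b1 * 1     # 0.541
out_h1 = sigmoid(net_h1)                # ~0.632
print(round(net_h1, 3), round(out_h1, 3))
```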
  • 37.
    Backpropagation Step-by-Step Process and Calculation The backpropagation algorithm calculates how much the final output values, o1 and o2, are affected by each of the weights. To do this, it calculates partial derivatives, going back from the error function to the neuron that carried a specific weight. The error function For simplicity, consider the Mean Squared Error function. For the first output, the error is the correct output value minus the actual output of the neural network: 0.5 - 0.735 = -0.235. For the second output: 0.5 - 0.455 = 0.045. Calculate the Mean Squared Error: MSE(o1) = ½ (-0.235)^2 = 0.0276 and MSE(o2) = ½ (0.045)^2 = 0.001. The Total Error is the sum of the two errors: Total Error = 0.0276 + 0.001 = 0.0286. This is the number we need to minimize with backpropagation. Final outputs: 0.735 for o1 and 0.455 for o2; assumed correct output values: 0.5 for o1 and 0.5 for o2. Backpropagation with gradient descent For example, weight w6, going from hidden neuron h1 to output neuron o2, affected our model as follows: neuron h1 with weight w6 → affects total input of neuron o2 → affects output o2 → affects total errors. Backpropagation goes in the opposite direction: total errors → affected by output o2 → affected by total input of neuron o2 → affected by neuron h1 with weight w6. The algorithm calculates three derivatives: 1. The derivative of the total errors with respect to output o2. 2. The derivative of output o2 with respect to the total input of neuron o2. 3. The derivative of the total input of neuron o2 with respect to weight w6. This gives us complete traceability from the total errors all the way back to the weight w6. Using the Leibniz chain rule, it is possible to calculate, based on the above three derivatives, the optimal value of w6 that minimizes the error function—in other words, the “best” weight w6 that will make the neural network most accurate. Similarly, the algorithm calculates an optimal value for each of the 8 weights. End result of backpropagation: the backpropagation algorithm results in a set of optimal weights, for example: w1 = 0.355; w2 = 0.476; w3 = 0.233; w4 = 0.674; w5 = 0.142; w6 = 0.967; w7 = 0.319; w8 = 0.658. Update the weights to these values and start using the neural network to make predictions for new inputs. How often are the weights updated? 1) Updating after every sample in the training set; 2) updating in batches; and 3) updating on randomized mini-batches. The error calculation above is reproduced in the sketch below.
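A sketch reproducing the slide's error calculation for the two outputs, using the values given above:

```python
o1, o2 = 0.735, 0.455            # actual network outputs
t1, t2 = 0.5, 0.5                # assumed correct (target) outputs

mse_o1 = 0.5 * (t1 - o1) ** 2    # 0.5 * (-0.235)^2 = 0.0276
mse_o2 = 0.5 * (t2 - o2) ** 2    # 0.5 * (0.045)^2  = 0.0010
total_error = mse_o1 + mse_o2    # ~0.0286 -- the quantity backpropagation minimizes
print(round(mse_o1, 4), round(mse_o2, 4), round(total_error, 4))
```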

Editor's Notes

  • #5 Neural networks takes several inputs, processes it through multiple neurons from multiple hidden layers, and returns the result using an output layer. This result estimation process is technically known as “Forward Propagation“. Next, we compare the result with actual output. The task is to make the output to the neural network as close to the actual (desired) output. Each of these neurons is contributing some error to the final output. We try to minimize the value/ weight of neurons that are contributing more to the error and this happens while traveling back to the neurons of the neural network and finding where the error lies. This process is known as “Backward Propagation“. In order to reduce this number of iterations to minimize the error, the neural networks use a common algorithm known as “Gradient Descent”, which helps to optimize the task quickly and efficiently. Suppose you are at the top of a mountain, and you have to reach a lake which is at the lowest point of the mountain (a.k.a valley). A twist is that you are blindfolded and you have zero visibility to see where you are headed. So, what approach will you take to reach the lake? The best way is to check the ground near you and observe where the land tends to descend. This will give an idea in what direction you should take your first step. If you follow the descending path, it is very likely you would reach the lake. The basic forming unit of a neural network is a perceptron. A perceptron can be understood as anything that takes multiple inputs and produces one output. By directly combining the input and computing the output based on a threshold value. for eg: Take x1=0, x2=1, x3=1 and setting a threshold =0. So, if x1+x2+x3>0, the output is 1 otherwise 0. You can see that in this case, the perceptron calculates the output as 1. Next, let us add weights to the inputs. Weights give importance to an input. For example, you assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 in comparison to x1 and x2. Next, let us add bias: Each perceptron also has a bias which can be thought of as how much flexible the perceptron is. It is somehow similar to the constant b of a linear function y = ax + b. It allows us to move the lineup and down to fit the prediction with the data better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron may have two inputs, in that case, it requires three weights. One for each input and one for the bias. Now linear representation of input will look like, w1*x1 + w2*x2 + w3*x3 + 1*b.
  • #6 Till now, we have computed the output; this process is known as “Forward Propagation”. But what if the estimated output is far away from the actual output (high error)? In a neural network, we then update the weights and biases based on the error. This weight and bias updating process is known as “Back Propagation”. Back-propagation (BP) algorithms work by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is therefore to determine the gradient (derivative) of each node w.r.t. the final output. To get a mathematical perspective on backward propagation, refer to the section below. One round of forward and backward propagation over the training data is known as one training iteration, a.k.a. an “Epoch”.
  • #7 Full Batch Gradient Descent and Stochastic Gradient Descent Both variants of Gradient Descent perform the same work of updating the weights of the MLP using the same update rule, but they differ in the number of training samples used to update the weights and biases each time. Full Batch Gradient Descent, as the name implies, uses all the training data points to update each of the weights once, whereas Stochastic Gradient Descent uses one or more samples, but never the entire training data, to update the weights once. Let us understand this with a simple example of a dataset of 10 data points and two weights w1 and w2. Full Batch: you use all 10 data points (the entire training data), calculate the change in w1 (Δw1) and the change in w2 (Δw2), and update w1 and w2. SGD: you use the 1st data point, calculate Δw1 and Δw2, and update w1 and w2; then, when you use the 2nd data point, you work on the updated weights.
  • #14 When we train a second time, the updated weights and biases are used for forward propagation. Above, we have updated the weights and biases for the hidden and output layers using a full-batch gradient descent algorithm.
  • #17 The 6*6 image is now converted into a 4*4 image. Pixel values are reused as the weight matrix moves along the image; this enables parameter sharing in a convolutional neural network. The weight matrix behaves like a filter on the image, extracting particular information from the original image matrix. The weights are learnt such that the loss function is minimized, similar to an MLP; therefore the weights are learnt to extract features from the original image which help the network make correct predictions. When we have multiple convolutional layers, the initial layers extract more generic features, while as the network gets deeper, the features extracted by the weight matrices become more and more complex and more suited to the problem at hand.