Deep Learning
Introduction
Artificial Intelligence
Machine Learning
Deep Learning
•AI is the study of pattern recognition and of mimicking human behavior. AI-powered computers have started simulating the way the human brain works: sensation, action, interaction, perception, and cognitive abilities.
•A subset of AI that incorporates math and
statistics in such a way that allows the
application to learn from data.
•A subset of ML that uses neural networks to learn from unstructured or unlabeled data.
•A measurable attribute of
data, determined to be
valuable in the learning
process.
Feature
•A set of algorithms inspired by
neural connections in the
human brain, consisting of
thousands to millions of
connected processing nodes.
Neural Network
•Identifying to which category a
given data point belongs.
Classification
Machine Learning Vs Deep Learning
Operation: ML algorithms are given a training data set and learn how to predict similar events in the future, usually evaluated on a test set. DL is mostly based on neural networks, which are one class of ML algorithm, and handles most of the feature selection/extraction on its own.
Methods: ML - Supervised and Unsupervised. DL - Supervised and Unsupervised.
Data: ML - a few thousand samples, can train on less data. DL - typically more than a million samples, requires large data.
Accuracy: ML - lower accuracy. DL - higher accuracy.
Algorithms: ML - Linear and Logistic Regression, Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Neural Network (NN). DL - Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM).
Relationship: Machine Learning is a child of Artificial Intelligence and the parent of Deep Learning.
What's the difference between the two?
Simply explained, both machine learning and deep learning mimic the way the human brain learns. Their main difference is the type of algorithms used in each case, although deep learning is more similar to human learning because it works with neurons. Machine learning usually uses algorithms such as decision trees, while deep learning uses neural networks, which are more evolved. Both can learn in a supervised or unsupervised way.
Fundamentals
Forward Propagation
Backward Propagation
Gradient Descent
Perceptron
Let us add bias: Each perceptron also has a bias, which can be thought of as how flexible the perceptron is. It is similar to the constant b of a linear function y = ax + b: it allows us to move the line up and down to fit the prediction to the data better. Without b the line always goes through the origin (0, 0) and you may get a poorer fit. For example, a perceptron with three inputs requires four parameters: one weight for each input and one for the bias. The linear combination of the inputs then looks like w1*x1 + w2*x2 + w3*x3 + 1*b.
The simplest perceptron directly combines the inputs and computes the output based on a threshold value. For example, take x1=0, x2=1, x3=1 and set the threshold to 0. If x1+x2+x3 > 0, the output is 1, otherwise 0. In this case the perceptron computes the output as 1.
Next, let us add weights to the inputs. Weights give importance to an input. For example, assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we multiply each input by its respective weight and compare the sum with the threshold value: w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 than to x1 and x2.
Perceptrons can only represent linear relationships. A neuron additionally applies a non-linear transformation (activation function) to the weighted inputs and bias.
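Below is a minimal NumPy sketch of the weighted perceptron just described; the inputs, weights, and threshold are the illustrative values from the text, and the helper name perceptron is ours, not part of any library.

import numpy as np

def perceptron(x, w, b, threshold=0.0):
    """Weighted perceptron: fires (returns 1) when w.x + b exceeds the threshold."""
    return int(np.dot(w, x) + b > threshold)

# Illustrative values from the text: x1=0, x2=1, x3=1 with weights 2, 3, 4 and no bias.
x = np.array([0.0, 1.0, 1.0])
w = np.array([2.0, 3.0, 4.0])
print(perceptron(x, w, b=0.0))  # 1, since 2*0 + 3*1 + 4*1 = 7 > 0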
Activation function
Activation Function takes the sum of weighted input (w1*x1 + w2*x2 +
w3*x3 + 1*b) as an argument and returns the output of the neuron. In the
below equation, we have represented 1 as x0 and b as w0
Fundamentals
It is used to apply a non-linear transformation that
allows us to fit non-linear hypotheses or to
estimate complex functions. Examples include Sigmoid,
Tanh, ReLU, and many others.
Forward Propagation
Backward Propagation
Gradient Descent
Epoch
Multi-layer perceptron
The different components are:
1. X1, ..., XN: Inputs to the neuron. These can either be the actual observations from the input layer or an intermediate value from one of the hidden layers.
2. X0: Bias unit. This is a constant value added to the input of the activation function. It works similarly to an intercept term and typically has the value +1.
3. w0, w1, w2, w3, ..., wN: Weights on each input. Note that even the bias unit has a weight.
f is known as an activation function. This makes a neural network extremely flexible and imparts the capability to estimate complex non-linear
relationships in the data. It can be a Gaussian function, a logistic function, a hyperbolic function, or even a linear function in simple cases.
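As a rough illustration of the neuron described above (weighted sum of inputs plus bias passed through an activation f), here is a small sketch with the sigmoid as the activation; the input and weight values are made up for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """One neuron: linear combination w.x + b followed by a non-linear activation f."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1..x3 (illustrative)
w = np.array([0.4, 0.1, -0.2])   # weights w1..w3 (illustrative)
print(neuron(x, w, b=0.3))       # sigmoid of (0.2 - 0.12 - 0.6 + 0.3) = sigmoid(-0.22)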
Hidden Layer
The figure shows a single hidden layer (in green), but in practice an MLP can contain multiple hidden layers. In addition, another point to remember in the case of an MLP
is that all the layers are fully connected, i.e. every node in a layer (except the input and the output layer) is connected to every node in the
previous layer and the following layer.
Fundamentals
Full Batch Gradient Descent Stochastic Gradient Descent
Fundamentals
Model Parameters:
Properties that the classifier or other ML model learns on its own from the
training data during training; for example, weights
and biases, or split points in a Decision Tree.
Model Hyperparameters:
These are properties that govern the entire training process.
Hyperparameters are important since they directly control the behavior
of the training algorithm and have an important impact on the performance of the
model being trained.
They include the variables which determine the network structure (for example,
the number of hidden units) and the variables which determine how the network
is trained (for example, the learning rate).
Model hyperparameters are set before training (before optimizing the
weights and biases).
• Learning Rate
• Number of Epochs
• Hidden Layers
• Hidden Units
• Activations Functions
PARAMETERS vs HYPERPARAMETERS
Parameters are required for making predictions; hyperparameters are required for estimating the model parameters.
Parameters are estimated by optimization algorithms (Gradient Descent, Adam, Adagrad); hyperparameters are estimated by hyperparameter tuning.
Parameters are not set manually; hyperparameters are set manually.
The final parameters found after training decide how the model will perform on unseen data; the choice of hyperparameters decides how efficient the training is. In gradient descent, the learning rate decides how efficient and accurate the optimization process is in estimating the parameters.
Underfitting refers to a model that can neither model the training dataset nor generalize to new dataset. An underfit machine learning model is
not a suitable model and will be obvious as it will have poor performance on the training dataset.
Fundamentals
Overfitting means that a machine learning model cannot generalize or fit well on an unseen dataset. The model's error on the testing or validation
dataset is much greater than the error on the training dataset: the model/function corresponds too closely to one particular dataset. As a result, an overfit model
may fail to fit additional data, and this may affect the accuracy of predictions on future observations.
A model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new
dataset.
Methods to prevent Overfitting
Cross-validation: Use the initial training data to generate multiple mini train-test splits, and use these splits to tune the model. Tune hyperparameters with only the original training dataset; this allows you to keep the test dataset as a truly unseen dataset.
More training data: With more data fed into the model, it becomes unable to overfit all the samples and is forced to generalize to obtain results, which also increases accuracy.
Data augmentation: Makes a data sample look slightly different every time it is processed by the model. This makes each sample appear unique to the model and prevents the model from simply memorizing the characteristics of the dataset.
Reduce complexity or data simplification: Reduce overfitting by decreasing the complexity of the model, for example by reducing the number of parameters in a neural network or by using dropout.
Ensembling: Machine learning methods for combining predictions from multiple separate models. Boosting attempts to improve the predictive flexibility of simple models; bagging attempts to reduce the chance of overfitting complex models.
Step-by-Step Procedure of
Neural Network Operation Methodology
Visualization of steps for
Neural Network Operation
Let’s look at the step-by-step building methodology of a Neural Network (an MLP with one hidden layer, similar
to the architecture shown above). At the output layer we have only one neuron, as we are solving a binary
classification problem (predict 0 or 1). We could also have two output neurons, one for each of the two classes.
0.) We take input and output
X as an input matrix
y as an output matrix
1.) Then we initialize weights and biases with
random values (a one-time initialization; subsequent
iterations use the updated weights and biases). Let us define:
wh as a weight matrix to the hidden layer
bh as bias matrix to the hidden layer
wout as a weight matrix to the output layer
bout as bias matrix to the output layer
2.) Then we take matrix dot product of input and
weights assigned to edges between the input and
hidden layer then add biases of the hidden layer
neurons to respective inputs, this is known as
linear transformation:
hidden_layer_input= matrix_dot_product(X,wh) + bh
Yellow-filled cells represent the current active cell. Orange cells represent the inputs used to populate the values of the current cell.
Visualization of steps for
Neural Network Operation
3) Perform non-linear transformation using an
activation function (Sigmoid). Sigmoid will return
the output as 1/(1 + exp(-x)).
hiddenlayer_activations = sigmoid(hidden_layer_input)
4.) Then perform a linear transformation on
hidden layer activation (take matrix dot product
with weights and add a bias of the output layer
neuron) then apply an activation function (again
used sigmoid, but can use any activation function
depending upon task) to predict the output
output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
output = sigmoid(output_layer_input)
All the above steps are known as “Forward Propagation“
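The forward-propagation steps 0-4 can be sketched in NumPy roughly as follows; X, y, and the layer sizes are illustrative placeholders rather than values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 0.) illustrative input/output matrices (3 samples, 4 features, binary target)
X = rng.random((3, 4))
y = rng.integers(0, 2, size=(3, 1)).astype(float)

# 1.) random one-time initialization of weights and biases
hidden_units = 3
wh   = rng.random((X.shape[1], hidden_units))   # input  -> hidden weights
bh   = rng.random((1, hidden_units))            # hidden-layer bias
wout = rng.random((hidden_units, 1))            # hidden -> output weights
bout = rng.random((1, 1))                       # output-layer bias

# 2.) linear transformation at the hidden layer
hidden_layer_input = X @ wh + bh

# 3.) non-linear transformation (sigmoid activation)
hiddenlayer_activations = sigmoid(hidden_layer_input)

# 4.) linear transformation + activation at the output layer
output_layer_input = hiddenlayer_activations @ wout + bout
output = sigmoid(output_layer_input)
print(output)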
Visualization of steps for
Neural Network Operation
5.) Compare the prediction with the actual output and
calculate the gradient of the error (Actual – Predicted).
The error is the mean squared loss = ((y – output)^2)/2
E = y – output
6.) Compute the slope/gradient of the hidden and
output layer neurons (to find the slope, calculate
the derivative of the non-linear activation at each
layer for each neuron). The gradient of the sigmoid
with output x can be returned as x * (1 – x).
slope_output_layer = derivatives_sigmoid(output)
slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
7.) Then compute change factor(delta) at the
output layer, dependent on the gradient of error
multiplied by the slope of output layer activation
d_output = E * slope_output_layer
8.) At this step, the error will propagate back into
the network which means error at the hidden
layer. For this, take the dot product of the output
layer delta with the weight parameters of edges
between the hidden and output layer (wout.T).
Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)
9.) Compute change factor(delta) at hidden layer,
multiply the error at hidden layer with slope of
hidden layer activation
d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer
10.) Then update weights at the output and hidden
layer: The weights in the network can be updated
from the errors calculated for training example(s).
wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose,
d_output)*learning_rate
wh = wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate
learning_rate: the amount by which the weights are updated is controlled
by a configuration parameter called the learning rate.
11.) Finally, update biases at the output and
hidden layer: The biases in the network can be
updated from the aggregated errors at that
neuron.
bias at output_layer = bias at output_layer + row-wise sum of the delta of
the output_layer * learning_rate
bias at hidden_layer = bias at hidden_layer + row-wise sum of the delta of
the hidden_layer * learning_rate
bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
bout = bout + sum(d_output, axis=0)*learning_rate
Steps 5 to 11 are known as “Backward Propagation”. One forward and one backward
propagation iteration together are considered one training cycle.
Above, you can see that there is still a sizeable error and the output is not
close to the actual target values, because we have
completed only one training iteration. If we train
the model for many iterations, the output gets very close to the
actual outcome. After thousands of iterations the result is
close to the actual target values
([[ 0.98032096] [ 0.96845624] [ 0.04532167]]).
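Putting steps 0-11 together, here is a compact NumPy sketch of the full forward/backward cycle run for many iterations; the toy X and y are assumed for illustration and are not necessarily the exact data behind the quoted numbers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def derivatives_sigmoid(a):
    # gradient of the sigmoid expressed through its own output: a * (1 - a)
    return a * (1.0 - a)

# Illustrative toy data (binary targets), assumed for this sketch
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0]], dtype=float)

epochs, lr, hidden_units = 5000, 0.1, 3
rng = np.random.default_rng(1)
wh, bh = rng.random((4, hidden_units)), rng.random((1, hidden_units))
wout, bout = rng.random((hidden_units, 1)), rng.random((1, 1))

for _ in range(epochs):
    # forward propagation (steps 2-4)
    hiddenlayer_activations = sigmoid(X @ wh + bh)
    output = sigmoid(hiddenlayer_activations @ wout + bout)

    # backward propagation (steps 5-11)
    E = y - output
    d_output = E * derivatives_sigmoid(output)
    error_at_hidden_layer = d_output @ wout.T
    d_hiddenlayer = error_at_hidden_layer * derivatives_sigmoid(hiddenlayer_activations)

    wout += hiddenlayer_activations.T @ d_output * lr
    wh   += X.T @ d_hiddenlayer * lr
    bout += d_output.sum(axis=0, keepdims=True) * lr
    bh   += d_hiddenlayer.sum(axis=0, keepdims=True) * lr

print(output)   # after thousands of iterations the outputs approach the targets [1, 1, 0]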
Convolutional Neural Network
Convolutional Neural Network
(CNN) Architecture
The ConvNet architecture consists of three types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
INPUT layer : holds the input image as a 3-D array of pixel values.
CONV layer : computes the element-wise product between the kernel and a sub-
array of the input image of the same size as the kernel, then sums all the
resulting values; this sum becomes a single pixel value of the
output image. The process is repeated until the whole input image is
covered, and for all the kernels.
RELU layer : applies the activation function max(0, x) to all the pixel values
of the output image.
POOL layer : performs down-sampling along the width and height of the image,
reducing its spatial dimensions.
FC (Fully-Connected) layer : computes the class score for each of the
classification categories.
Advantages of Convolution Neural Network (CNN):
•CNN learns the filters automatically without them being specified explicitly. These filters help in extracting the right and relevant features from the input data.
•CNN captures the spatial features from an image. Spatial features refer to the arrangement of pixels and the relationships between them in an image. They help us in
identifying the object accurately, the location of an object, as well as its relation to other objects in the image.
•CNN also follows the concept of parameter sharing: a single filter is applied across different parts of the input to produce a feature map.
Convolutional Neural Network
Step-by-Step Process
The Convolution Layer
Consider we have an image of size 6*6.
We define a weight matrix which extracts certain features from the images
We have initialized the weight (filter) as a 3*3 matrix. This weight now
slides across the image such that all the pixels are covered at least
once, to give a convolved output. The value 429 above is obtained by
adding the values obtained from element-wise multiplication of the
weight matrix and the highlighted 3*3 part of the input image.
The 6*6 image is thereby converted into a 4*4 image. Pixel values are
reused as the weight matrix moves along the image; this is what
enables parameter sharing in a convolutional neural network. The
weights are learnt so that they extract features from the original image
that help the network make correct predictions.
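Here is a naive NumPy sketch of this sliding-window convolution (stride 1, no padding); the image and filter values are arbitrary, not the ones behind the value 429 in the example.

import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid convolution: slide the kernel over the image and sum element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # one output pixel per filter position
    return out

image  = np.arange(36, dtype=float).reshape(6, 6)   # arbitrary 6*6 image
kernel = np.ones((3, 3)) / 9.0                      # arbitrary 3*3 filter
print(convolve2d(image, kernel).shape)              # (4, 4): the 6*6 image becomes 4*4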
Convolutional Neural Network
Stride
The filter (weight matrix) moves across the entire image n pixels at a time; n is the stride.
Stride = 1
Stride = 2
The size of the output image keeps reducing as we increase the stride value.
Stride is defined as a hyperparameter that controls how the weight matrix moves across the image. If the weight matrix moves 1
pixel at a time, we call it a stride of 1.
Convolutional Neural Network
Padding
Padding the input image with zeros around the border maintains the output image size despite the convolution and stride. We can also add more than one layer of zeros around
the image in the case of larger filters or higher stride values.
The initial shape of the image is retained after we pad the image with zeros. This is known as same padding, since the output image has
the same size as the input (when no padding is added and only the valid pixels of the input image are used, it is called valid padding). The middle 4*4 pixels would be the
same. Here we have retained more information from the borders and have also preserved the size of the image.
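A short sketch of same padding for a 3*3 filter with stride 1, reusing the convolve2d helper from the previous sketch; the image values are arbitrary.

import numpy as np

image  = np.arange(36, dtype=float).reshape(6, 6)
padded = np.pad(image, pad_width=1)          # one layer of zeros around the image
kernel = np.ones((3, 3)) / 9.0

print(padded.shape)                          # (8, 8)
print(convolve2d(padded, kernel).shape)      # (6, 6): same size as the original input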
Convolutional Neural Network
Pooling
•Sometimes when the images are too large, we would need to reduce the number of trainable
parameters.
•It is then desired to periodically introduce pooling layers between subsequent convolution
layers.
•Pooling is done for the sole purpose of reducing the spatial size of the image.
•Pooling is done independently on each depth dimension, therefore the depth of the image
remains unchanged.
•The most common form of pooling layer generally applied is the max pooling.
Here the stride is 2 and the pooling size is also 2.
The max operation is applied to each depth
dimension of the convolved output.
The 4*4 convolved output becomes 2*2 after the
max pooling operation; convolving the image and then
applying max pooling reduces the number of parameters.
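A small sketch of 2*2 max pooling with stride 2 on a single-channel map (in practice the same operation runs independently on each depth slice); the feature-map values are made up.

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling on a single 2-D feature map (applied per depth slice in practice)."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

conv_out = np.array([[1, 3, 2, 1],
                     [4, 6, 5, 0],
                     [7, 2, 9, 8],
                     [3, 1, 4, 6]], dtype=float)
print(max_pool(conv_out))   # 2*2 result: [[6, 5], [7, 9]]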
Convolutional Neural Network
Output Dimensions & Output Layer
Output dimensions :
Filters / Depth: The depth of the output volume (and of the
activation map) is equal to the number of filters applied.
Stride: For a stride of one, we move across and down a single
pixel. With higher stride values, we move a larger number of pixels at
a time and hence produce smaller output volumes.
Zero padding: This helps us preserve the size of the input
image. If a single layer of zero padding is added, a single-stride filter
movement retains the size of the original image.
Formula to calculate the output dimensions.
The spatial size of the output image = ( [W-F+2P]/S)+1.
W is the input volume size
F is the size of the filter
P is the number of padding applied
S is the number of strides.
Suppose we have an input image of size 32*32*3, we apply 10 filters
of size 3*3*3, with single stride and no zero padding.
Here W=32, F=3, P=0 and S=1. The output depth will be equal to the
number of filters applied i.e. 10. The size of the output volume will be
([32-3+0]/1)+1 = 30. Therefore the output volume will be 30*30*10.
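The output-size formula above can be wrapped in a tiny helper and checked against the 32*32*3 example; the function name conv_output_size is ours, chosen for illustration.

def conv_output_size(W, F, P, S):
    """Spatial output size of a conv layer: ((W - F + 2*P) / S) + 1."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# Example from the slide: 32*32*3 input, 10 filters of size 3*3*3, stride 1, no padding.
size = conv_output_size(W=32, F=3, P=0, S=1)
print(size, size, 10)   # 30 30 10  -> output volume 30*30*10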
Output layer:
The convolution and pooling layers are only able to extract features and reduce the number of parameters relative to the original images.
However, to generate the final output we need to apply a fully connected layer that produces an output equal to the number of classes we
need. The output layer has a loss function, such as categorical cross-entropy, to compute the error in prediction. Once the forward pass is
complete, back propagation begins to update the weights and biases for error and loss reduction.
Well Known 10 Evaluation Metrics
for Classification Models
Predicted: Outcome of the model on the validation set
Actual: Values seen in the training set
Positive (P): Observation is positive
Negative (N): Observation is not positive
True Positive (TP): Observation is positive, and is predicted correctly
False Negative (FN): Observation is positive, but predicted wrongly
True Negative (TN): Observation is negative, and predicted correctly
False Positive (FP): Observation is negative, but predicted wrongly
2. Accuracy : The proportion of correct predictions over the whole dataset, i.e. the number of correct
predictions over the output size. Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. Detection rate : This metric shows the number of correct positive class predictions made as a
proportion of all of the predictions made. Detection Rate = TP / (TP + FP + FN + TN)
4. Logarithmic loss: log loss works by penalizing all
false/incorrect classifications. The classifier assigns a probability to
each class for every sample. The formula, for N samples and M classes, where y_ij indicates
whether sample i belongs to class j and p_ij is the predicted probability of that, is:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
5. Sensitivity (true positive rate): The true positive rate corresponds to the proportion of positive data points
that are correctly classified as positive, with respect to all positive data points. Sensitivity = TP / (TP + FN)
6. Specificity (true negative rate): Corresponds to the proportion of negative data points that are correctly
classified as negative, with respect to all negative data points. Specificity = TN / (TN + FP). The false positive
rate is FPR = FP / (FP + TN) = 1 – Specificity. Please note that both FPR and TPR have values in the range of 0 to 1.
1. Confusion matrix is a metric
used to quantify the performance
of a machine learning classifier.
Confusion matrices are used to
visualize important predictive
analytics like recall, specificity,
accuracy, and precision.
7. Precision
This metric is the number of correct positive results divided by the number of
positive results predicted by the classifier.
Precision = TP / (TP + FP)
8. Recall
Recall is the number of correct positive results divided by the number of all samples
that should have been identified as positive.
Recall = TP / (TP + FN)
9. F1 score : The F1 score is the harmonic mean of precision and recall. It is used to measure the accuracy of tests
and is a direct indication of the model’s performance. The range of the F1 score is between 0 and 1, with the goal being to get as
close as possible to 1. It is calculated as: F1 = 2 * (Precision * Recall) / (Precision + Recall). (A short code sketch computing these confusion-matrix metrics follows after metric 10.)
10. Receiver operating
characteristic curve (ROC) / area
under curve (AUC) score
The ROC curve is a graph that displays the
classification model’s performance at all
classification thresholds. As the name suggests,
the AUC is the entire two-dimensional area
below the ROC curve. The curve is built
from two important metrics: sensitivity and
specificity.
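The confusion-matrix metrics defined above (accuracy, detection rate, sensitivity, specificity, precision, recall, F1) can be computed from the four counts as in this sketch; the counts are made up for illustration.

def classification_metrics(tp, fp, tn, fn):
    """Metrics from the confusion-matrix counts, following the formulas above."""
    total       = tp + tn + fp + fn
    accuracy    = (tp + tn) / total
    detection   = tp / total
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, detection_rate=detection, sensitivity=sensitivity,
                specificity=specificity, precision=precision, recall=sensitivity, f1=f1)

# Illustrative counts (not from the slides)
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))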
Well Known 10 Evaluation Metrics
for Classification Models
Precision-Recall Curve
(PRC)
As the name suggests,
this curve is a direct
representation of the
precision(y-axis) and the
recall(x-axis).
This is particularly useful
for the situations where
we have an imbalanced
dataset and the number
of negatives is much
larger than the positives.
Interpretation of ROC Curves
(Receiver Operating Characteristic Curve)
ROC Curve:
1. It is the plot between the TPR(y-axis) and FPR(x-axis).
2. Consider a model that classifies a patient as having heart disease or not based on the probabilities generated for each class; we can also decide the threshold on those
probabilities.
3. For example, we want to set a threshold value of 0.4. This means that the model will classify the datapoint/patient as having heart disease if the probability of the
patient having a heart disease is greater than 0.4.
4. This will give a high recall value and reduce the number of False Negatives. Similarly, we can visualize how our model performs for different threshold values
using the ROC curve.
5. Let us generate a ROC curve for our model with k = 3.
1. At the lowest point, i.e. at (0, 0)- the threshold is set at 1.0. This means our model classifies all patients
as not having a heart disease.
2. At the highest point i.e. at (1, 1), the threshold is set at 0.0. This means our model classifies all patients
as having a heart disease.
3. The rest of the curve is the values of FPR and TPR for the threshold values between 0 and 1. At some
threshold value, we observe that for FPR close to 0, we are achieving a TPR of close to 1. This is when
the model will predict the patients having heart disease almost perfectly.
4. The area with the curve and the axes as the boundaries is called the Area Under Curve(AUC). It is this
area which is considered as a metric of a good model. With this metric ranging from 0 to 1, we should
aim for a high value of AUC. Models with a high AUC are called as models with good skill. Let us
compute the AUC score of our model and the above plot: 0.868
5. We get a value of 0.868 as the AUC which is a pretty good score! This means that the model will be
able to distinguish the patients with heart disease and those who don’t 87% of the time.
6. The diagonal line is a random model with an AUC of 0.5, a model with no skill, which is just the same as
making a random prediction. A short scikit-learn sketch for computing the ROC curve and AUC follows.
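As a rough sketch of how such a curve is computed in practice, scikit-learn's roc_curve and roc_auc_score can be used; the y_true labels and y_score probabilities below are illustrative placeholders, not the heart-disease data.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and predicted probabilities (placeholders)
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
auc_score = roc_auc_score(y_true, y_score)
print(f"AUC = {auc_score:.3f}")   # 0.5 would be a no-skill model; closer to 1 is better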
Interpretation of
Precision-Recall Curve (PRC)
As the name suggests, this curve is a direct representation of the precision(y-axis) and the recall(x-axis).
If you observe our definitions and formulae for the Precision and Recall above, you will notice that at no point are we using the True Negatives(the actual number
of people who don’t have heart disease).
This is particularly useful for the situations where we have an imbalanced dataset and the number of negatives is much larger than the positives(or when the
number of patients having no heart disease is much larger than the patients having it).
In such cases, our higher concern would be detecting the patients with heart disease as correctly as possible and would not need the TNR.
PRC Interpretation:
1. At the lowest point, i.e. at (0, 0)- the threshold is set at 1.0. This means our model
makes no distinctions between the patients who have heart disease and the patients
who don’t.
2. At the highest point i.e. at (1, 1), the threshold is set at 0.0. This means that both our
precision and recall are high and the model makes distinctions perfectly.
3. The rest of the curve is the values of Precision and Recall for the threshold values
between 0 and 1. Our aim is to make the curve as close to (1, 1) as possible- meaning a
good precision and recall.
4. Similar to ROC, the area with the curve and the axes as the boundaries is the Area
Under Curve(AUC). Consider this area as a metric of a good model. The AUC ranges
from 0 to 1. Therefore, we should aim for a high value of AUC. Let us compute the AUC
for our model and the above plot: 0.8957
5. As before, we get a good AUC of around 90%. Also, the model can achieve high
precision when recall is near 0, and can achieve a high recall by letting the precision
drop to about 50%. A short scikit-learn sketch for the precision-recall curve follows.
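Similarly, a precision-recall curve and its area can be sketched with scikit-learn's precision_recall_curve and auc; the labels and scores below are the same illustrative placeholders used in the ROC sketch.

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)   # area under the precision-recall curve
print(f"PR-AUC = {pr_auc:.3f}")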
Comparative Study on
Activation Functions
Activation Functions: Binary Step
Neural network activation functions
are a crucial component of deep
learning.
1. Activation functions determine
the output of a learning model.
2. They influence the model's accuracy.
3. They affect the computational efficiency of
training a model.
4. Activation functions have a major
effect on the ability to converge and
on the convergence speed.
1. Activation functions are mathematical equations that determine the output of a neural
network.
2. The function is attached to each neuron in the network, and determines whether it should be
activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s
prediction.
3. Activation functions also help normalize the output of each neuron to a range between 1 and 0
or between -1 and 1.
4. They must be computationally efficient because they are calculated across thousands or even
millions of neurons for each data sample.
5. Modern neural networks use a technique called backpropagation to train the model, which
places an increased computational strain on the activation function, and its derivative function.
Activation functions
Binary Step Function
A binary step function is a threshold-based activation function. If the
input value is above or below a certain threshold, the neuron is
activated and sends exactly the same signal to the next layer.
Disadvantage: The problem with a step function is that it does not
allow multi-value outputs—for example, it cannot support classifying
the inputs into one of several categories.
Activation Functions: Linear
Linear Activation Function
A linear activation function takes the form: y = mx
It takes the inputs, multiplied by the weights for each neuron, and creates an
output signal proportional to the input. In one sense, a linear function is better
than a step function because it allows multiple outputs
Limitations:
1. Not possible to use backpropagation (gradient descent) to train the model—the
derivative of the function is a constant, and has no relation to the input, X. So it’s
not possible to go back and understand which weights in the input neurons can
provide a better prediction.
2. All layers of the neural network collapse into one—with linear activation
functions, no matter how many layers in the neural network, the last layer will be a
linear function of the first layer (because a linear combination of linear functions is
still a linear function). So a linear activation function turns the neural network into
just one layer. A neural network with a linear activation function is simply a linear
regression model. It has limited power and limited ability to handle complex,
varying input data.
Activation Functions: Nonlinear
Sigmoid / Logistic
Non-linear Action Functions:
Non-linear activation functions allow the model to create complex
mappings between the network’s inputs and outputs, which are
essential for learning and modeling complex data, such as images,
video, audio, and data sets which are non-linear or have high
dimensionality.
Non-linear activation functions advantages over linear functions:
1. They allow backpropagation because they have a derivative
function which is related to the inputs.
2. They allow “stacking” of multiple layers of neurons to create a
deep neural network. Multiple hidden layers of neurons are
needed to learn complex data sets with high levels of accuracy.
Advantages
1. Smooth gradient, preventing “jumps” in output values.
2. Output values bound between 0 and 1, normalizing the output of each neuron.
3. Clear predictions—For X above 2 or below -2, tends to bring the Y value (the prediction)
to the edge of the curve, very close to 1 or 0. This enables clear predictions.
Disadvantages
1. Vanishing gradient—for very high or very low values of X, there is almost no change to the
prediction, causing a vanishing gradient problem. This can result in the network refusing to
learn further, or being too slow to reach an accurate prediction.
2. Outputs not zero centered.
3. Computationally expensive
Activation Functions: Nonlinear
TanH / Hyperbolic Tangent &
ReLU (Rectified Linear Unit) & Leaky ReLU
Advantages
Zero centered—making it easier to
model inputs that have strongly
negative, neutral, and strongly positive
values.
Otherwise like the Sigmoid function.
Disadvantages
Like the Sigmoid function
TanH / Hyperbolic Tangent
Advantages
Computationally efficient—allows the network to
converge very quickly
Non-linear, even though it looks like a linear
function; ReLU has a derivative function and
allows for backpropagation.
Disadvantages : the dying ReLU problem. When inputs
approach zero or are negative, the gradient of the function
becomes zero and the network cannot perform
backpropagation and cannot learn in that region.
Advantages
Prevents the dying ReLU problem: it has a small
positive slope in the negative region, so it does
enable backpropagation even for negative
input values. Other characteristics are like ReLU.
Disadvantages
Results not consistent—leaky ReLU does not
provide consistent predictions for negative
input values.
ReLU (Rectified Linear Unit) Leaky ReLU
Advantages
Able to handle multiple classes, whereas most other activation
functions handle only one class: it normalizes the output for each
class between 0 and 1 and divides by their sum, giving the
probability of the input value being in a specific class.
Useful for output neurons—typically Softmax is used only for the output
layer, for neural networks that need to classify inputs into multiple
categories
Swish is a new, self-gated activation function
discovered by researchers at Google. According to
their paper, it performs better than ReLU with a
similar level of computational efficiency. In
experiments on ImageNet with identical models
running ReLU and Swish, the new function achieved
top-1 classification accuracy 0.6-0.9% higher.
Swish
Softmax
Activation Functions: Nonlinear
Softmax and Swish
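For intuition, the activation functions discussed above can be written as small NumPy functions; this is a sketch, not a library implementation, and the alpha value for Leaky ReLU is an assumed default.

import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def swish(x):       return x * sigmoid(x)      # self-gated activation
def softmax(x):
    e = np.exp(x - np.max(x))                  # shift for numerical stability
    return e / e.sum()                         # outputs sum to 1 (class probabilities)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), swish(z), softmax(z), sep="\n")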
Activation Functions
Derivatives or Gradients
1. The derivative—also known as a
gradient—of an activation
function is extremely important
for training the neural network.
2. Neural networks are trained
using a process called
backpropagation—this is an
algorithm which traces back
from the output of the model,
through the different neurons
which were involved in
generating that output, back to
the original weight applied to
each neuron.
3. Backpropagation suggests an
optimal weight for each neuron
which results in the most
accurate prediction.
Derivatives or Gradients of
Activation Functions
Sigmoid TanH ReLU
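A small sketch of the derivatives shown on this slide (Sigmoid, TanH, ReLU), written in NumPy; the convention that the ReLU derivative is 0 at exactly x = 0 is an assumption.

import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)            # sigmoid'(x) = s(x) * (1 - s(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return (x > 0).astype(float)    # 0 for x <= 0, 1 for x > 0

z = np.array([-2.0, 0.5, 3.0])
print(d_sigmoid(z), d_tanh(z), d_relu(z), sep="\n")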
Derivatives or Gradients
Backpropagation,
Error Calculation with gradient descent
Backpropagation
6 Stages of Neural Network Learning
1. Initialization—initial weights are applied to all the neurons.
2. Forward propagation—the inputs from a training set are passed through the
neural network and an output is computed.
3. Error function as we are working with a training set, the correct output is
known. An error function is defined, which captures the delta between the
correct output and the actual output of the model, given the current model
weights (in other words, “how far off” is the model from the correct result).
4. Backpropagation—the objective of backpropagation is to change the weights
for the neurons, in order to bring the error function to a minimum.
5. Weight update—weights are changed to the optimal values according to the
results of the backpropagation algorithm.
6. Iterate until convergence—because the weights are updated by a small delta
step at a time, several iterations are required for the network to
learn. After each iteration, gradient descent updates the weights
towards a lower and lower global loss. The number of iterations needed
to converge depends on the learning rate, the network meta-parameters, and
the optimization method used.
Backpropagation is simply an
algorithm which performs a highly
efficient search for the optimal
weight values, using the gradient
descent technique.
Backpropagation
Step-by-Step Process
The image below is a very simple neural network model with two inputs
(i1 and i2), which can be real values between 0 and 1, two hidden neurons
(h1 and h2), and two output neurons (o1 and o2).
Biases in neural networks are extra neurons added to each layer, which
store the value of 1. This allows you to “move” or translate the activation
function so it doesn’t cross the origin, by adding a constant number.
The Forward Pass
Step-by-Step Process
Each neuron is a very simple component that executes the activation
function. There are several commonly used activation functions;
for example, this is the sigmoid function: f(x) = 1 / (1 + exp(-x))
Our simple neural network The forward pass works by:
1. Taking each of the two inputs
2. Multiplying by the first-layer weights—w1,2,3,4
3. Adding bias
4. Applying the activation function for neurons h1 and h2
5. Taking the output of h1 and h2 and multiplying by the second-
layer weights—w5,6,7,8
6. Applying the activation function for neurons o1 and o2; this is the output.
Assume that the first input i1 is 0.1, the weight going into the first neuron, w1,
is 0.27, the second input i2 is 0.2, the weight from the second input to the
first neuron, w3, is 0.57, and the first-layer bias b1 is 0.4.
The input of the first neuron h1 is combined from the two inputs, i1 and i2:
(i1 * w1) + (i2 * w3) + b1 = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541
Feeding this into the activation function of neuron h1:
f(0.541) = 1 / (1 + exp(-0.541)) = 0.632
Now, given some other weights w2 and w4 and the second input i2, you can follow
a similar calculation to get an output for the second neuron in the hidden layer, h2.
The final step is to take the outputs of neurons h1 and h2, multiply them by the
weights w5,6,7,8, and feed them to the same activation function of neurons o1 and
o2 (exactly the same calculation as above). The result is the final output of the
neural network; let’s say the final outputs are 0.735 for o1 and 0.455 for o2. We’ll
also assume that the correct output values are 0.5 for o1 and 0.5 for o2 (these values
are assumed to be known because in supervised learning each data point has a ground-truth value).
The backpropagation algorithm calculates how much the final output
values, o1 and o2, are affected by each of the weights. To do this, it
calculates partial derivatives, going back from the error function to the
neuron that carried a specific weight.
The error function
For simplicity, consider Mean Squared Error function. For the
first output, the error is the correct output value minus the
actual output of the neural network: 0.5-0.735 = -0.235
For the second output:0.5-0.455 = 0.045
Calculate the Mean Squared Error:
MSE(o1) = ½ × (-0.235)^2 = 0.0276
MSE(o2) = ½ × (0.045)^2 = 0.001
The Total Error is the sum of the two errors:
Total Error = 0.0276 + 0.001 = 0.0286
This is the number we need to minimize with Backpropagation.
Final outputs: 0.735 for o1 and 0.455
for o2. Assumed correct output
values are 0.5 for o1 and 0.5 for o2.
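A tiny script can reproduce the numbers used in this walkthrough: the h1 activation from the given inputs and weights, and the total error from the assumed outputs and targets.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden neuron h1, using the values from the text (i1=0.1, w1=0.27, i2=0.2, w3=0.57, b1=0.4)
h1_in  = 0.1 * 0.27 + 0.2 * 0.57 + 0.4   # 0.541
h1_out = sigmoid(h1_in)                  # approximately 0.632
print(round(h1_in, 3), round(h1_out, 3))

# Total error from the assumed final outputs (0.735, 0.455) and targets (0.5, 0.5)
outputs = np.array([0.735, 0.455])
targets = np.array([0.5, 0.5])
total_error = np.sum(0.5 * (targets - outputs) ** 2)
print(round(total_error, 4))             # approximately 0.0286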
Backpropagation with gradient descent
For example, weight w6, going from hidden neuron h1 to output neuron
o2, affected our model as follows: neuron h1 with weight w6 → affects
total input of neuron o2 → affects output o2 → affects total errors
Backpropagation goes in the opposite direction:
total errors → affected by output o2 → affected by total input of neuron
o2 → affected by neuron h1 with weight w6
The algorithm calculates three derivatives:
1. The derivative of total errors with respect to output o2
2. The derivative of output o2 with respect to total input of neuron o2
3. Total input of neuron o2 with respect to neuron h1 with weight w6
This gives us complete traceability from the total errors, all the way back to
the weight w6. Using the Leibniz Chain Rule, it is possible to calculate,
based on the above three derivatives, what is the optimal value of w6 that
minimizes the error function. In other words, what is the “best” weight w6
that will make the neural network most accurate? Similarly, the algorithm
calculates an optimal value for each of the 8 weights.
End result of backpropagation:
The backpropagation algorithm results in a set of optimal
weights, like this: Optimal values are: w1 = 0.355 ; w2 = 0.476 ;
w3 = 0.233 ; w4 = 0.674 ; w5 = 0.142 ; w6 = 0.967 ; w7 = 0.319 ;
w8 = 0.658.
Update the weights to these values, and start using the neural
network to make predictions for new inputs.
How Often Are the Weights Updated?
1) Updating after every sample in the training set;
2) Updating in a full batch; and
3) Updating on randomized mini-batches.
A schematic sketch of these three schedules follows.
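As a schematic sketch of the three schedules, the loop below uses hypothetical helpers compute_gradients and apply_update (here a simple linear-regression gradient, purely for illustration) to show when the update happens in each case.

import numpy as np

# Hypothetical helpers, for illustration only: compute_gradients returns the gradient of
# an MSE loss for a linear model; apply_update takes one gradient-descent step.
def compute_gradients(X, y, w):  return X.T @ (X @ w - y) / len(y)
def apply_update(w, g, lr=0.1):  return w - lr * g

rng = np.random.default_rng(0)
X, y, w = rng.random((100, 5)), rng.random(100), np.zeros(5)

# 1) Update after every sample (stochastic / online)
for i in range(len(y)):
    w = apply_update(w, compute_gradients(X[i:i+1], y[i:i+1], w))

# 2) Update once per pass over the whole training set (full batch)
w = apply_update(w, compute_gradients(X, y, w))

# 3) Update on randomized mini-batches
idx = rng.permutation(len(y))
for batch in np.array_split(idx, 10):        # ten mini-batches of ~10 samples each
    w = apply_update(w, compute_gradients(X[batch], y[batch], w))
print(w)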
Backpropagation
Step-by-Step Process and Calculation

Deeplearning for Computer Vision PPT with

  • 1.
  • 2.
    Introduction Artificial Intelligence Machine Learning Deep Learning •AI is thestudy of pattern recognition and mimicking human behavior.AI powered computers has started simulating the human brain work style sensation, actions, interaction, perception, and cognitive abilities. •A subset of AI that incorporates math and statistics in such a way that allows the application to learn from data. •A subset of ML that uses neural network to learn from unstructured or unlabeled data. •A measurable attribute of data, determined to be valuable in the learning process. Feature •A set of algorithms inspired by neural connections in the human brain, consisting of thousands to millions of connected processing nodes. Neural Network •Identifying to which category a given data point belongs. Classificati on
  • 3.
    Machine Learning VsDeep Learning Description Machine Learning Deep Learning Operation ML algorithms get train data set and learn how to predict similar events in future which is usually as test set. DL is mostly based on neural network which is one of ML algorithm. DL works mostly on feature selection. Methods Supervised and Unsupervised Supervised and Unsupervised Data A few thousands, can train on lesser data More than million, Requires large data. Accuracy Lesser accuracy High accurcy Algorithm Linear and Logistic Regression, Support Vector Machine (SEVI), Naive Bayes(NB), K-Nearest Neighborhood (KNN), Decision Tree Random Forest, Neural Network(NN) Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory(LSTM) Relationship Machine Learning is child of Artificial Intelligence and parent of Deep Learning Convolution Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory(LSTM) What's the difference between the two? Simply explained, both machine learning and deep learning mimic the way the human brain learns. Its main difference is therefore the type of algorithms used in each case, although deep learning is more similar to human learning as it works with neurons. Machine learning usually uses decision trees and deep learning neural networks, which are more evolved. In addition, both can learn in a supervised or unsupervised way.
  • 4.
  • 5.
    Fundamentals Forward Propagation Backward Propagation GradientDescent Perceptron Let us add bias: Each perceptron also has a bias which is thought of as how much flexible the perceptron is. It is similar to constant b of a linear function y = ax + b. It allows us to move the lineup and down to fit the prediction with the data better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron may have two inputs, in that case, it requires three weights. One for each input and one for the bias. Now linear representation of input will look like, w1*x1 + w2*x2 + w3*x3 + 1*b. By directly combining the input and computing the output based on a threshold value. for eg: Take x1=0, x2=1, x3=1 and setting a threshold =0. So, if x1+x2+x3>0, the output is 1 otherwise 0. You can see that in this case, the perceptron calculates the output as 1. Next, let us add weights to the inputs. Weights give importance to an input. For example, you assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 in comparison to x1 and x2. Perceptrons used for Linear. A neuron applies non-linear transformations (activation function) to the inputs and biases.
  • 6.
    Activation function Activation Functiontakes the sum of weighted input (w1*x1 + w2*x2 + w3*x3 + 1*b) as an argument and returns the output of the neuron. In the below equation, we have represented 1 as x0 and b as w0 Fundamentals It is used to make a non-linear transformation that allows us to fit nonlinear hypotheses or to estimate the complex functions. Like “Sigmoid”, “Tanh”, ReLu and many others. Forward Propagation Backward Propagation Gradient Descent Epoch Multi-layer perceptron The different components are: I. Xi, XN: Inputs to the neuron. These can either be the actual observations from the input layer or an intermediate value from one Of the hidden layers. 2. Xo: Bias unit. This is a constant value added to the input of the activation function. It works similar to an intercept term and typically has +1 value. 3. w0,w1,w2,w3…wN: Weights on each input. Note that even the bias unit has a weight. f is known as an activation function. This makes a Neural Network extremely flexible and imparts the capability to estimate complex non-linear relationships in data. It can be a gaussian function, logistic function, hyperbolic function or even a linear function in simple cases.
  • 7.
    Hidden Layer A singlehidden layer in green but in practice can contain multiple hidden layers. In addition, another point to remember in case of an MLP is that all the layers are fully connected i.e every node in a layer(except the input and the output layer) is connected to every node in the previous layer and the following layer. Fundamentals Full Batch Gradient Descent Stochastic Gradient Descent
  • 8.
    Fundamentals Model Parameters : Theproperties of training data that will learn on its own during training by the classifier or other ML model. For example, weights and biases, or split points in Decision Tree. Model Hyperparameters: They are instead properties that govern the entire training process. Hyperparameters are important since they directly control behavior of the training algo, having important impact on performance of the model under training. The variables which determines the network structure (for example, Number of Hidden Units) The variables which determine how the network is trained (for example, Learning Rate) Model hyperparameters are set before training (before optimizing the weights and bias). • Learning Rate • Number of Epochs • Hidden Layers • Hidden Units • Activations Functions PARAMETERS HYPERPARAMETER They are required for making predictions They are required for estimating the model parameters They are estimated by optimization algorithms(Gradient Descent, Adam, Adagrad) They are estimated by hyperparameter tuning They are not set manually They are set manually The final parameters found after training will decide how the model will perform on unseen data The choice of hyperparameters decide how efficient the training is. In gradient descent the learning rate decide how efficient and accurate the optimization process is in estimating the parameters
  • 9.
    Underfitting refers toa model that can neither model the training dataset nor generalize to new dataset. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training dataset. Fundamentals Overfitting is that a machine learning model can’t generalize or fit well on unseen dataset. The model's error on the testing or validation dataset is much greater than the error on training dataset. The model / function corresponds too closely to a dataset. As a result, overfitting may fail to fit additional data, and this may affect the accuracy of predicting future observations. A model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new dataset. Methods to prevent Overfitting Cross-validation: Initial training data to generate multiple mini train-test splits. Use these splits to tune the model. Tune hyperparameters with only original training dataset. This allows to keep the test dataset as a truly unseen dataset. More training data: More data into the model, it will be unable to overfit all the samples and will be forced to generalize to obtain results, also increases accuracy. Data augmentation: It makes a data sample look slightly different every time it is processed by the model. The process makes each data set appear unique to the model and prevents the model from learning the characteristics of the data sets. Reduce Complexity or Data Simplification: Reduce overfitting by decreasing the complexity of the model. Reduce the number of parameters in a Neural Networks, and using dropout on a Neural Networks. Ensembling: Machine learning methods for combining predictions from multiple separate models. Boosting attempts to improve the predictive flexibility of simple models. Bagging attempts to reduce the chance of overfitting complex models.
  • 10.
    Step-by-Step Procedure of NeuralNetwork Operation Methodology
  • 11.
    Visualization of stepsfor Neural Network Operation Let’s look at the step by step building methodology of Neural Network (MLP with one hidden layer, similar to above-shown architecture). At the output layer, we have only one neuron as we are solving a binary classification problem (predict 0 or 1). We could also have two neurons for predicting each of both classes. 0.) We take input and output X as an input matrix y as an output matrix 1.) Then we initialize weights and biases with random values (one-time initiation. Next iteration, use updated weights, and biases). Let us define: wh as a weight matrix to the hidden layer bh as bias matrix to the hidden layer wout as a weight matrix to the output layer bout as bias matrix to the output layer 2.) Then we take matrix dot product of input and weights assigned to edges between the input and hidden layer then add biases of the hidden layer neurons to respective inputs, this is known as linear transformation: hidden_layer_input= matrix_dot_product(X,wh) + bh Yellow filled cells represent current active cell. Orange cell represents the input used to populate the values of the current cell
  • 12.
    Visualization of stepsfor Neural Network Operation 3) Perform non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)). hiddenlayer_activations = sigmoid(hidden_layer_input) 4.) Then perform a linear transformation on hidden layer activation (take matrix dot product with weights and add a bias of the output layer neuron) then apply an activation function (again used sigmoid, but can use any activation function depending upon task) to predict the output output_layer_input = matrix_dot_product (hiddenlayer_activations * wout ) + bout output = sigmoid(output_layer_input) All the above steps are known as “Forward Propagation“
  • 13.
    Visualization of stepsfor Neural Network Operation 5.) Compare prediction with actual output and calculate the gradient of error (Actual – Predicted) Error is the mean square loss = ((Y-t)^2)/2 E = y – output 6.) Compute the slope/gradient of hidden and output layer neurons ( To find the slope, calculate the derivatives of non-linear activations x at each layer for each neuron). The gradient of sigmoid can be returned as x * (1 – x). slope_output_layer = derivatives_sigmoid(output) slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations) 7.) Then compute change factor(delta) at the output layer, dependent on the gradient of error multiplied by the slope of output layer activation d_output = E * slope_output_layer 8.) At this step, the error will propagate back into the network which means error at the hidden layer. For this, take the dot product of the output layer delta with the weight parameters of edges between the hidden and output layer (wout.T). Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)
  • 14.
    9.) Compute changefactor(delta) at hidden layer, multiply the error at hidden layer with slope of hidden layer activation d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer 10.) Then update weights at the output and hidden layer: The weights in the network can be updated from the errors calculated for training example(s). wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)*learning_rate wh = wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate learning_rate: The amount that weights are updated is controlled by a configuration parameter called the learning rate) 11.) Finally, update biases at the output and hidden layer: The biases in the network can be updated from the aggregated errors at that neuron. bias at output_layer =bias at output_layer + sum of delta of output_layer at row-wise * learning_rate bias at hidden_layer =bias at hidden_layer + sum of delta of output_layer at row-wise * learning_rate bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate bout = bout + sum(d_output, axis=0)*learning_rate Steps from 5 to 11 are known as “Backward Propagation “One forward and backward propagation iteration is considered as one training cycle Above, you can see that there is still a good error not close to the actual target value because we have completed only one training iteration. If we will train the model multiple times then it will be a very close actual outcome. I have completed thousands iteration and my result is close to actual target values ([[ 0.98032096] [ 0.96845624] [ 0.04532167]]).
  • 15.
  • 16.
    Convolutional Neural Network (CNN)Architecture The ConvNet architecture consists of three types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer. INPUT layer : hold the input image as a 3-D array of pixel values. CONV layer : Will compute the dot product between the kernel and sub- array of an input image same size as a kernel. Then it’ll sum all the values resulted from the dot product and this will be the single pixel value of an output image. This process is repeated until the whole input image is covered and for all the kernels. RELU layer : will apply an activation function max(0,x) on all the pixel values of an output image. POOL layer : Perform down sampling along the width and height of an image resulting in reducing the dimension of an image. FC (Fully-Connected) layer : Compute the class score for each of the classification category. Advantages of Convolution Neural Network (CNN): •CNN learns the filters automatically without mentioning it explicitly. These filters help in extracting the right and relevant features from the input data. •CNN captures the from an image. Spatial features refer to the arrangement of pixels and the relationship between them in an image. They help us in identifying the object accurately, the location of an object, as well as its relation with other objects in an image. •CNN also follows the concept of parameter sharing. A single filter is applied across different parts of an input to produce a feature map.
  • 17.
    Convolutional Neural Network Step-by-StepProcess The Convolution Layer Consider we have an image of size 6*6. We define a weight matrix which extracts certain features from the images We have initialized the weight(Filter) as a 3*3 matrix. This weight shall now run across the image such that all the pixels are covered at least once, to give a convolved output. The value 429 above, is obtained by the adding the values obtained by element wise multiplication of the weight matrix and the highlighted 3*3 part of the input image. The 6*6 image is now converted into a 4*4 image. Pixel values are used again when the weight matrix moves along the image. This basically enables parameter sharing in a convolutional neural network weights are learnt to extract features from the original image which help the network in correct prediction
  • 18.
    Convolutional Neural Network Stride Thefilter or the weight matrix, was moving across the entire image moving n pixel at a time, n is stride. Stride = 1 Stride = 2 The size of image keeps on reducing as we increase the stride value. This is defined as hyperparameter, as to how we would want the weight matrix to move across the image. If the weight matrix moves 1 pixel at a time, we call it as a stride of 1.
  • 19.
    Convolutional Neural Network Padding Paddingthe input image with zeros across maintains the output image size from stride. We can also add more than one layer of zeros around the image in case of higher stride values. The initial shape of the image is retained after we padded the image with a zero. This is known as same padding since the output image has the same size as the input, which means that we considered only the valid pixels of the input image. The middle 4*4 pixels would be the same. Here we have retained more information from the borders and have also preserved the size of the image.
  • 20.
    Convolutional Neural Network Pooling •Sometimeswhen the images are too large, we would need to reduce the number of trainable parameters. •It is then desired to periodically introduce pooling layers between subsequent convolution layers. •Pooling is done for the sole purpose of reducing the spatial size of the image. •Pooling is done independently on each depth dimension, therefore the depth of the image remains unchanged. •The most common form of pooling layer generally applied is the max pooling. Here stride as 2, while pooling size also as 2. The max operation is applied to each depth dimension of the convolved output. The 4*4 convolved output has become 2*2 after the max pooling operation. Convoluted image and applied max pooling reduce the parameters.
  • 21.
    Convolutional Neural Network OutputDimensions & Output Layer Output dimensions : Filters / Depth: The number of filters The depth of the output volume will be equal to the number of filter applied. The depth of the activation map will be equal to the number of filters. Stride: For the stride of one we move across and down a single pixel. With higher stride values, we move large number of pixels at a time and hence produce smaller output volumes. Zero padding: This helps us to preserve the size of the input image. If a single zero padding is added, a single stride filter movement would retain the size of the original image. Formula to calculate the output dimensions. The spatial size of the output image = ( [W-F+2P]/S)+1. W is the input volume size F is the size of the filter P is the number of padding applied S is the number of strides. Suppose we have an input image of size 32*32*3, we apply 10 filters of size 3*3*3, with single stride and no zero padding. Here W=32, F=3, P=0 and S=1. The output depth will be equal to the number of filters applied i.e. 10. The size of the output volume will be ([32-3+0]/1)+1 = 30. Therefore the output volume will be 30*30*10. Output layer: The convolution and pooling layers would only be able to extract features and reduce the number of parameters from the original images. However, to generate the final output we need to apply a fully connected layer to generate an output equal to the number of classes we need. The output layer has a loss function like categorical cross-entropy, to compute the error in prediction. Once the forward pass is complete the back propagation begins to update the weight and biases for error and loss reduction.
  • 22.
    Well Known 10 Evaluation Metrics for Classification Models Predicted: outcome of the model on the validation set. Actual: values seen in the training set. Positive (P): observation is positive. Negative (N): observation is not positive. True Positive (TP): observation is positive and is predicted correctly. False Negative (FN): observation is positive but predicted wrongly. True Negative (TN): observation is negative and predicted correctly. False Positive (FP): observation is negative but predicted wrongly. 1. Confusion matrix: a metric used to quantify the performance of a machine learning classifier; confusion matrices are used to visualize important predictive analytics like recall, specificity, accuracy, and precision. 2. Accuracy: the proportion of correct predictions in the total dataset, i.e. the number of correct predictions over the output size. Accuracy = (TP + TN) / (TP + TN + FP + FN). 3. Detection rate: the number of correct positive class predictions made as a proportion of all of the predictions made. Detection Rate = TP / (TP + FP + FN + TN). 4. Logarithmic loss: log loss works by penalizing false/incorrect classifications. The model assigns a probability to each class for every sample, and the loss is the negative average log-probability assigned to the true class: Log Loss = -(1/N) * ΣiΣj y_ij * log(p_ij), where y_ij indicates whether sample i belongs to class j and p_ij is the predicted probability that it does. 5. Sensitivity (true positive rate): the proportion of positive data points that are correctly classified as positive, with respect to all positive data points. Sensitivity = TP / (TP + FN). 6. Specificity (true negative rate): the proportion of negative data points that are correctly classified as negative, with respect to all negative data points. Specificity = TN / (TN + FP). The false positive rate is FPR = FP / (FP + TN) = 1 - Specificity. Please note that both FPR and TPR take values in the range 0 to 1. A small sketch computing these counts and ratios follows.
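A minimal plain-Python sketch of the confusion-matrix-based metrics above; the labels are illustrative, not data from the slides:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)          # true positive rate (recall)
specificity = TN / (TN + FP)          # true negative rate
fpr         = FP / (FP + TN)          # 1 - specificity
print(TP, TN, FP, FN, accuracy, sensitivity, specificity, fpr)
```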
  • 23.
    Well Known 10 Evaluation Metrics for Classification Models 7. Precision: the number of correct positive results divided by the number of positive results predicted by the classifier. Precision = TP / (TP + FP). 8. Recall: the number of correct positive results divided by the number of all samples that should have been identified as positive. Recall = TP / (TP + FN). 9. F1 score: the harmonic mean of precision and recall. It is used to measure the accuracy of tests and is a direct indication of the model’s performance. The F1 score ranges from 0 to 1, the goal being to get as close as possible to 1. It is calculated as F1 = 2 * (Precision * Recall) / (Precision + Recall). 10. Receiver operating characteristic (ROC) curve / area under curve (AUC) score: the ROC curve is a graph that displays the classification model’s performance at all thresholds, and, as the name suggests, the AUC is the area under that two-dimensional curve. The curve is built from two important metrics: sensitivity (TPR) and the false positive rate (1 - specificity). Precision-Recall Curve (PRC): as the name suggests, this curve is a direct plot of precision (y-axis) against recall (x-axis). It is particularly useful in situations where we have an imbalanced dataset and the number of negatives is much larger than the number of positives. A precision/recall/F1 sketch follows.
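A minimal sketch of precision, recall, and F1 using scikit-learn (assumed available); the labels are illustrative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)
```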
  • 24.
    Interpretation of ROC Curves (Receiver Operating Characteristic Curve) ROC Curve: 1. It is the plot of the TPR (y-axis) against the FPR (x-axis). 2. Consider a model that classifies a patient as having heart disease or not based on the probabilities generated for each class; we can also choose the probability threshold. 3. For example, suppose we set a threshold value of 0.4. This means the model will classify the datapoint/patient as having heart disease if the predicted probability of heart disease is greater than 0.4. 4. This lower threshold gives a high recall value and reduces the number of False Negatives, at the cost of more False Positives. Similarly, we can visualize how our model performs for different threshold values using the ROC curve. 5. Let us generate a ROC curve for our model with k = 3. 1. At the lowest point, i.e. at (0, 0), the threshold is set at 1.0; this means our model classifies all patients as not having heart disease. 2. At the highest point, i.e. at (1, 1), the threshold is set at 0.0; this means our model classifies all patients as having heart disease. 3. The rest of the curve gives the values of FPR and TPR for threshold values between 0 and 1. At some threshold value we observe that, for an FPR close to 0, we achieve a TPR close to 1; this is when the model predicts the patients having heart disease almost perfectly. 4. The area bounded by the curve and the axes is called the Area Under Curve (AUC), and it is this area which is considered a metric of a good model. With this metric ranging from 0 to 1, we should aim for a high value of AUC; models with a high AUC are said to have good skill. Let us compute the AUC score of our model and the above plot: 0.868. 5. We get a value of 0.868 as the AUC, which is a pretty good score. This means that the model will be able to distinguish the patients with heart disease from those who don’t have it about 87% of the time. 6. The diagonal line is a random model with an AUC of 0.5, a model with no skill, which is just the same as making a random prediction. A code sketch for plotting an ROC curve follows.
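A minimal ROC/AUC sketch with scikit-learn and matplotlib (both assumed available); the labels and probabilities below are illustrative, not the heart-disease model's actual output:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # predicted P(heart disease)

fpr, tpr, thresholds = roc_curve(y_true, y_score)       # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "--", label="random model (AUC = 0.5)")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate"); plt.legend()
plt.show()
```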
  • 25.
    Interpretation of Precision-Recall Curve (PRC) As the name suggests, this curve is a direct plot of precision (y-axis) against recall (x-axis). If you look at the definitions and formulae for Precision and Recall above, you will notice that at no point do they use the True Negatives (the actual number of people who don’t have heart disease). The PRC is therefore particularly useful in situations where we have an imbalanced dataset and the number of negatives is much larger than the positives (or when the number of patients having no heart disease is much larger than the number of patients having it). In such cases, our main concern is detecting the patients with heart disease as correctly as possible, and we do not need the TNR. PRC Interpretation: 1. At the lowest point, i.e. at (0, 0), the threshold is set at 1.0, so the model predicts no patient as having heart disease and the recall is 0. 2. At the highest point, i.e. at (1, 1), both precision and recall are 1, which only a model that distinguishes the two classes perfectly can achieve. 3. The rest of the curve gives the values of Precision and Recall for threshold values between 0 and 1. Our aim is to make the curve as close to (1, 1) as possible, meaning good precision and good recall. 4. Similar to ROC, the area bounded by the curve and the axes is the Area Under Curve (AUC). Consider this area a metric of a good model; the AUC ranges from 0 to 1, so we should aim for a high value. Let us compute the AUC for our model and the above plot: 0.8957. 5. As before, we get a good AUC of around 90%. Also note that the model can achieve high precision only at low recall, and achieves high recall only by letting the precision drop to around 50%. A precision-recall-curve sketch follows.
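A minimal precision-recall-curve sketch with scikit-learn and matplotlib (both assumed available); the labels and probabilities are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)          # area under the precision-recall curve

plt.plot(recall, precision, label=f"PRC (AUC = {pr_auc:.3f})")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend()
plt.show()
```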
  • 26.
  • 27.
    Activation Functions: Binary Step Neural network activation functions are a crucial component of deep learning. 1. Activation functions determine the output of a learning model, 2. the model accuracy, and 3. the computational efficiency of training a model. 4. Activation functions also have a major effect on the network's ability to converge and on the convergence speed. 1. Activation functions are mathematical equations that determine the output of a neural network. 2. The function is attached to each neuron in the network and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. 3. Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1. 4. They must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. 5. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function and its derivative. Binary Step Function A binary step function is a threshold-based activation function: if the input value is above the threshold, the neuron is activated and sends exactly the same signal (1) to the next layer regardless of how large the input is; otherwise it outputs 0. Disadvantage: a step function does not allow multi-value outputs—for example, it cannot support classifying the inputs into one of several categories. A minimal sketch follows.
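A minimal sketch of a binary step activation with NumPy (threshold value illustrative):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """1 if the input is above the threshold, otherwise 0 --
    the same output no matter how far the input is from the threshold."""
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.1, 0.3, 5.0])))   # [0 0 1 1]
```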
  • 28.
    Activation Functions: Linear Linear Activation Function A linear activation function takes the form y = mx. It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs. Limitations: 1. It is not possible to use backpropagation (gradient descent) to train the model—the derivative of the function is a constant and has no relation to the input X, so it is not possible to go back and understand which weights in the input neurons can provide a better prediction. 2. All layers of the neural network collapse into one—with linear activation functions, no matter how many layers the neural network has, the last layer is a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer; a neural network with a linear activation function is simply a linear regression model, with limited power and limited ability to handle complex, varying input data. The collapse described in limitation 2 is demonstrated in the sketch below.
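A small sketch of limitation 2: two stacked linear layers behave exactly like a single linear layer. The weight matrices here are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(4,))        # an input vector
W1 = rng.normal(size=(5, 4))      # "layer 1" weights
W2 = rng.normal(size=(3, 5))      # "layer 2" weights

two_layers = W2 @ (W1 @ x)        # two linear layers applied in sequence
one_layer  = (W2 @ W1) @ x        # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True -- the network collapsed to one layer
```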
  • 29.
    Activation Functions: Nonlinear Sigmoid / Logistic Non-linear Activation Functions: non-linear activation functions allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data such as images, video, audio, and data sets that are non-linear or have high dimensionality. Advantages of non-linear activation functions over linear ones: 1. They allow backpropagation because their derivative is related to the input. 2. They allow “stacking” of multiple layers of neurons to create a deep neural network; multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy. Sigmoid Advantages: 1. Smooth gradient, preventing “jumps” in output values. 2. Output values bound between 0 and 1, normalizing the output of each neuron. 3. Clear predictions—for X above 2 or below -2, the Y value (the prediction) is pushed to the edge of the curve, very close to 1 or 0. Disadvantages: 1. Vanishing gradient—for very high or very low values of X, there is almost no change in the prediction, causing a vanishing gradient problem; this can result in the network refusing to learn further, or being too slow to reach an accurate prediction. 2. Outputs are not zero-centered. 3. Computationally expensive. A sigmoid sketch follows.
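A minimal NumPy sketch of the sigmoid function; the sample inputs are illustrative:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(x))
# Large |x| saturates near 0 or 1 -- this flat region is where the
# vanishing-gradient problem mentioned above comes from.
```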
  • 30.
    Activation Functions: Nonlinear TanH / Hyperbolic Tangent & ReLU (Rectified Linear Unit) & Leaky ReLU TanH / Hyperbolic Tangent Advantages: Zero centered—making it easier to model inputs that have strongly negative, neutral, and strongly positive values. Otherwise like the Sigmoid function. Disadvantages: Like the Sigmoid function (vanishing gradient, computationally expensive). ReLU (Rectified Linear Unit) Advantages: Computationally efficient—allows the network to converge very quickly. Non-linear—although it looks like a linear function, ReLU has a derivative and allows for backpropagation. Disadvantages: The dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot learn through those units. Leaky ReLU Advantages: Prevents the dying ReLU problem—it has a small positive slope in the negative region, so it still enables backpropagation even for negative input values. Other characteristics are like ReLU. Disadvantages: Results not consistent—leaky ReLU does not provide consistent predictions for negative input values. Sketches of all three follow.
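Minimal NumPy sketches of the three functions discussed above; the sample inputs and the leaky-ReLU slope (alpha) are illustrative:

```python
import numpy as np

def tanh(x):
    """Zero-centered squashing into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """max(0, x): zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small slope (alpha) for negative inputs,
    so the gradient never becomes exactly zero there."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(tanh(x), relu(x), leaky_relu(x), sep="\n")
```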
  • 31.
    Activation Functions: Nonlinear Softmax and Swish Softmax Advantages: Able to handle multiple classes, where the other activation functions handle only one—it normalizes the output for each class to between 0 and 1 and divides by their sum, giving the probability of the input value belonging to a specific class. Useful for output neurons—typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories. Swish Swish is a newer, self-gated activation function discovered by researchers at Google. According to their paper, it performs better than ReLU with a similar level of computational efficiency. In experiments on ImageNet with identical models running ReLU and Swish, the new function achieved top-1 classification accuracy 0.6-0.9% higher. Sketches of both follow.
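Minimal NumPy sketches of softmax and swish; the sample logits and inputs are illustrative:

```python
import numpy as np

def softmax(z):
    """Exponentiate, then normalize so the outputs sum to 1 (class probabilities).
    Subtracting the max first is a standard numerical-stability trick."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def swish(x):
    """Self-gated activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                    # probabilities that sum to 1
print(swish(np.array([-2.0, 0.0, 2.0])))
```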
  • 32.
    Activation Functions: Derivatives or Gradients 1. The derivative—also known as the gradient—of an activation function is extremely important for training the neural network. 2. Neural networks are trained using a process called backpropagation—an algorithm which traces back from the output of the model, through the different neurons which were involved in generating that output, to the original weight applied to each neuron. 3. Backpropagation suggests an optimal weight for each neuron, which results in the most accurate prediction. (Plots: the derivatives/gradients of the Sigmoid, TanH, and ReLU functions.) A sketch of these derivatives follows.
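A minimal NumPy sketch of the derivatives (gradients) of the common activation functions, as used during backpropagation; the sample inputs are illustrative:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)  # 1 for positive inputs, 0 otherwise

x = np.array([-2.0, 0.0, 2.0])
print(d_sigmoid(x), d_tanh(x), d_relu(x), sep="\n")
```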
  • 33.
  • 34.
    Backpropagation: 6 Stages of Neural Network Learning 1. Initialization—initial weights are applied to all the neurons. 2. Forward propagation—the inputs from a training set are passed through the neural network and an output is computed. 3. Error function—because we are working with a training set, the correct output is known. An error function is defined which captures the delta between the correct output and the actual output of the model, given the current model weights (in other words, “how far off” the model is from the correct result). 4. Backpropagation—the objective of backpropagation is to change the weights of the neurons in order to bring the error function to a minimum. 5. Weight update—weights are changed to the optimal values according to the results of the backpropagation algorithm. 6. Iterate until convergence—because the weights are updated in small delta steps, several iterations are required for the network to learn. After each iteration, gradient descent updates the weights towards a smaller and smaller global loss. The number of iterations needed to converge depends on the learning rate, the network meta-parameters, and the optimization method used. Backpropagation is simply an algorithm which performs a highly efficient search for the optimal weight values, using the gradient descent technique. The six stages are sketched in code below.
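A minimal sketch of the six stages for a single linear neuron with a squared-error loss; the data, learning rate, and number of iterations are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y_true = np.array([0.5, -1.2, 2.0]), 1.0        # one training sample
w, b = rng.normal(size=3), 0.0                     # 1. initialization
lr = 0.1                                           # learning rate

for step in range(100):                            # 6. iterate until convergence
    y_pred = w @ x + b                             # 2. forward propagation
    error = 0.5 * (y_pred - y_true) ** 2           # 3. error function
    grad_w = (y_pred - y_true) * x                 # 4. backpropagation (dE/dw)
    grad_b = (y_pred - y_true)                     #    ... and dE/db
    w -= lr * grad_w                               # 5. weight update (gradient descent)
    b -= lr * grad_b

print(error)                                       # close to 0 after training
```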
  • 35.
    Backpropagation Step-by-Step Process The image below is a very simple neural network model with two inputs (i1 and i2), which can be real values between 0 and 1, two hidden neurons (h1 and h2), and two output neurons (o1 and o2). Biases in neural networks are extra neurons added to each layer which store the value 1. This allows you to “move” or translate the activation function so it doesn’t have to cross the origin, by adding a constant number.
  • 36.
    The Forward Pass Step-by-Step Process Each neuron is a very simple component which executes the activation function. There are several commonly used activation functions; for example, this is the sigmoid function: f(x) = 1 / (1 + exp(-x)). Our simple neural network The forward pass works by: 1. Taking each of the two inputs 2. Multiplying by the first-layer weights—w1, w2, w3, w4 3. Adding the bias 4. Applying the activation function for neurons h1 and h2 5. Taking the outputs of h1 and h2, multiplying by the second-layer weights—w5, w6, w7, w8 6. Applying the same activation function for neurons o1 and o2—this is the output. Assume that the first input i1 is 0.1, the weight going into the first hidden neuron, w1, is 0.27, the second input i2 is 0.2, the weight from the second input to the first hidden neuron, w3, is 0.57, and the first-layer bias b1 is 0.4. The input of the first neuron h1 is combined from the two inputs i1 and i2: (i1 * w1) + (i2 * w3) + b1 = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541. Feeding this into the activation function of neuron h1: f(0.541) = 1 / (1 + exp(-0.541)) = 0.632. Now, given the other weights w2 and w4 and the same inputs, you can follow a similar calculation to get the output of the second hidden neuron, h2. The final step is to take the outputs of neurons h1 and h2, multiply them by the weights w5, w6, w7, w8, and feed them to the same activation function at neurons o1 and o2 (exactly the same calculation as above). The result is the final output of the neural network; let’s say the final outputs are 0.735 for o1 and 0.455 for o2. We also assume that the correct output values are 0.5 for o1 and 0.5 for o2 (correct values can be assumed because in supervised learning each data point has its truth value). The h1 calculation is reproduced in the sketch below.
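A sketch reproducing the h1 calculation above; only the values given on the slide are used, and the remaining weights are left unspecified because the slide does not give them:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1, i2 = 0.1, 0.2
w1, w3 = 0.27, 0.57        # weights from i1 and i2 into hidden neuron h1
b1 = 0.4

net_h1 = i1 * w1 + i2 * w3 + b1 * 1     # 0.541
out_h1 = sigmoid(net_h1)                # ~0.632
print(round(net_h1, 3), round(out_h1, 3))
```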
  • 37.
    Backpropagation Step-by-Step Process and Calculation The backpropagation algorithm calculates how much the final output values, o1 and o2, are affected by each of the weights. To do this, it calculates partial derivatives, going back from the error function to the neuron that carried a specific weight. The error function For simplicity, consider the Mean Squared Error function. For the first output, the error is the correct output value minus the actual output of the neural network: 0.5 - 0.735 = -0.235. For the second output: 0.5 - 0.455 = 0.045. Calculate the Mean Squared Error: MSE(o1) = ½ (-0.235)^2 = 0.0276 and MSE(o2) = ½ (0.045)^2 = 0.001. The Total Error is the sum of the two errors: Total Error = 0.0276 + 0.001 = 0.0286. This is the number we need to minimize with backpropagation. Final outputs: 0.735 for o1 and 0.455 for o2; assumed correct output values: 0.5 for o1 and 0.5 for o2. Backpropagation with gradient descent For example, weight w6, going from hidden neuron h1 to output neuron o2, affected our model as follows: neuron h1 with weight w6 → affects total input of neuron o2 → affects output o2 → affects total errors. Backpropagation goes in the opposite direction: total errors → affected by output o2 → affected by total input of neuron o2 → affected by neuron h1 with weight w6. The algorithm calculates three derivatives: 1. The derivative of the total errors with respect to output o2. 2. The derivative of output o2 with respect to the total input of neuron o2. 3. The derivative of the total input of neuron o2 with respect to weight w6. This gives us complete traceability from the total errors all the way back to the weight w6. Using the Leibniz chain rule, it is possible to calculate, based on the above three derivatives, the optimal value of w6 that minimizes the error function—in other words, the “best” weight w6 that will make the neural network most accurate. Similarly, the algorithm calculates an optimal value for each of the 8 weights. End result of backpropagation: the backpropagation algorithm results in a set of optimal weights, for example: w1 = 0.355; w2 = 0.476; w3 = 0.233; w4 = 0.674; w5 = 0.142; w6 = 0.967; w7 = 0.319; w8 = 0.658. Update the weights to these values and start using the neural network to make predictions for new inputs. How often are the weights updated? 1) Updating after every sample in the training set; 2) updating in batches; and 3) updating on randomized mini-batches. The error calculation above is reproduced in the sketch below.
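A sketch reproducing the slide's error calculation for the two outputs, using the values given above:

```python
o1, o2 = 0.735, 0.455            # actual network outputs
t1, t2 = 0.5, 0.5                # assumed correct (target) outputs

mse_o1 = 0.5 * (t1 - o1) ** 2    # 0.5 * (-0.235)^2 = 0.0276
mse_o2 = 0.5 * (t2 - o2) ** 2    # 0.5 * (0.045)^2  = 0.0010
total_error = mse_o1 + mse_o2    # ~0.0286 -- the quantity backpropagation minimizes
print(round(mse_o1, 4), round(mse_o2, 4), round(total_error, 4))
```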

Editor's Notes

  • #5 Neural networks takes several inputs, processes it through multiple neurons from multiple hidden layers, and returns the result using an output layer. This result estimation process is technically known as “Forward Propagation“. Next, we compare the result with actual output. The task is to make the output to the neural network as close to the actual (desired) output. Each of these neurons is contributing some error to the final output. We try to minimize the value/ weight of neurons that are contributing more to the error and this happens while traveling back to the neurons of the neural network and finding where the error lies. This process is known as “Backward Propagation“. In order to reduce this number of iterations to minimize the error, the neural networks use a common algorithm known as “Gradient Descent”, which helps to optimize the task quickly and efficiently. Suppose you are at the top of a mountain, and you have to reach a lake which is at the lowest point of the mountain (a.k.a valley). A twist is that you are blindfolded and you have zero visibility to see where you are headed. So, what approach will you take to reach the lake? The best way is to check the ground near you and observe where the land tends to descend. This will give an idea in what direction you should take your first step. If you follow the descending path, it is very likely you would reach the lake. The basic forming unit of a neural network is a perceptron. A perceptron can be understood as anything that takes multiple inputs and produces one output. By directly combining the input and computing the output based on a threshold value. for eg: Take x1=0, x2=1, x3=1 and setting a threshold =0. So, if x1+x2+x3>0, the output is 1 otherwise 0. You can see that in this case, the perceptron calculates the output as 1. Next, let us add weights to the inputs. Weights give importance to an input. For example, you assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively. To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 + w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 in comparison to x1 and x2. Next, let us add bias: Each perceptron also has a bias which can be thought of as how much flexible the perceptron is. It is somehow similar to the constant b of a linear function y = ax + b. It allows us to move the lineup and down to fit the prediction with the data better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron may have two inputs, in that case, it requires three weights. One for each input and one for the bias. Now linear representation of input will look like, w1*x1 + w2*x2 + w3*x3 + 1*b.
  • #6 Till now, we have computed the output; this process is known as “Forward Propagation”. But what if the estimated output is far away from the actual output (high error)? In a neural network, we then update the weights and biases based on the error. This weight and bias updating process is known as “Back Propagation”. Back-propagation (BP) algorithms work by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is therefore to determine the gradient (derivative) of each node w.r.t. the final output. To get a mathematical perspective on backward propagation, refer to the section below. One round of forward and backward propagation over the training data is known as one training iteration, a.k.a. an “Epoch”.
  • #7 Full Batch Gradient Descent and Stochastic Gradient Descent Both variants of Gradient Descent perform the same work of updating the weights of the MLP using the same update rule, but they differ in the number of training samples used to update the weights and biases each time. Full Batch Gradient Descent, as the name implies, uses all the training data points to update each of the weights once, whereas Stochastic Gradient Descent uses one or more samples, but never the entire training data, to update the weights once. Let us understand this with a simple example of a dataset of 10 data points and two weights w1 and w2. Full Batch: you use all 10 data points (the entire training data), calculate the change in w1 (Δw1) and the change in w2 (Δw2), and update w1 and w2. SGD: you use the 1st data point, calculate Δw1 and Δw2, and update w1 and w2; then, when you use the 2nd data point, you work on the updated weights.
  • #14 When we train a second time, the updated weights and biases are used for forward propagation. Above, we have updated the weights and biases for the hidden and output layers using a full-batch gradient descent algorithm.
  • #17 The 6*6 image is now converted into a 4*4 image. Pixel values are reused as the weight matrix moves along the image; this enables parameter sharing in a convolutional neural network. The weight matrix behaves like a filter on the image, extracting particular information from the original image matrix. The weights are learnt such that the loss function is minimized, similar to an MLP; therefore the weights are learnt to extract features from the original image which help the network make correct predictions. When we have multiple convolutional layers, the initial layers extract more generic features, while as the network gets deeper, the features extracted by the weight matrices become more and more complex and more suited to the problem at hand.