UNIT 2
MACHINE LEARNING AND DEEP LEARNING ALGORITHMS
20ECE506T: IoT SENSOR NODES WITH ARTIFICIAL INTELLIGENCE
2.
• Machine learning (ML) is the study of computer algorithms that improve automatically
through experience and by the use of data. It is seen as a part of artificial intelligence. Machine
learning algorithms build a model based on sample data, known as "training data", in order to
make predictions or decisions without being explicitly programmed to do so.
• Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation learning.
Learning can be supervised, semi-supervised or unsupervised.
3.
• Warren McCulloch and Walter Pitts first proposed the artificial neuron in 1943.
• It is a highly simplified computational model of the behavior of a neuron in the
human brain.
Biological Neuron
The biological neuron receives input signals through its dendrites and sends them to
the soma, or cell body, which is the processing unit of the neuron. The processed
signal is then carried through the axon to other neurons. The junction where two
neurons meet is called a synapse; the strength of the synapse determines the strength
of the signal carried to other neurons.
4.
• Artificial Neuron: McCulloch-Pitts Neuron (MP Neuron)
• The McCulloch-Pitts Neuron model is also called a Threshold Logic Unit (TLU) or Linear
Threshold Unit, because the output value of the neuron depends on a threshold
value. This artificial neuron is inspired by the working of a biological neuron and is
structured to behave similarly.
The function g(x) performs a summation of all the inputs, and the function f(x) applies a
threshold value to the output returned by g(x). The value returned by f(x) is a boolean
value: if the summation of inputs is greater than or equal to the fixed threshold (b),
the neuron gets activated (i.e. the neuron fires).
5.
b: the threshold value, which is the only
parameter in the MP Neuron Model.
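The behavior of g(x) and f(x) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mp_neuron` is made up for this example, and it assumes the neuron fires when the sum reaches the threshold (sum ≥ b), which matches the OR example with x1 + x2 ≥ 1.

```python
def mp_neuron(inputs, b):
    """McCulloch-Pitts neuron: output 1 when the sum of binary inputs reaches threshold b."""
    if any(x not in (0, 1) for x in inputs):
        raise ValueError("the MP neuron accepts only binary inputs")
    g = sum(inputs)              # g(x): aggregate the inputs
    return 1 if g >= b else 0    # f(g): apply the threshold


# OR needs at least one active input (b = 1); AND needs both (b = 2).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "OR:", mp_neuron([x1, x2], b=1), "AND:", mp_neuron([x1, x2], b=2))
```

Changing only b turns the same neuron from OR into AND, which is why b is the single tunable parameter of the model.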
6.
Things we need to know about MP Neuron
• It takes only binary data, i.e. each component of the input vector x is in {0,1}.
• The task performed by the neuron is binary classification, i.e. y ∈ {0,1}.
Geometrical Interpretation of MP Neuron
For simplicity, let us consider a dataset with only two features, so the input vector
looks like x = [x1, x2], and the functions look like:
g(x) = x1 + x2
f(x) = 1 if g(x) ≥ b, else 0
The decision boundary is x1 + x2 = b. Converting it into the form of a linear equation, i.e. y = m*x + c:
x2 = −x1 + b ...(1)
y = m*x + c ...(2)
Comparing equations (1) and (2), we find that:
The slope of the line m = −1 (it is fixed for any dataset).
The y-intercept of the line c = b (the only thing we can change to tune the model).
7.
[Figure: OR function with an MP neuron. Inputs x1, x2, output y ∈ {0,1}; the neuron fires when x1 + x2 ≥ 1. In the x1-x2 plane, the line x1 + x2 = b = 1 separates (0,0) from (0,1), (1,0) and (1,1).]
Summary:
All the points on or above the line are classified as positive.
All the points below the line are classified as negative.
The MP Neuron model works only when the points are linearly separable.
The slope of the line in the MP Neuron model is fixed at −1.
The only thing we can change is the y-intercept (b).
8.
We can evaluate the performance of the model using this simple formula
Limitation of MP Neuron Model
• The model accepts data only in the form of {0,1}; we cannot feed it real values.
• It can only be used for binary classification.
• It performs well only when the data is linearly separable.
• The line equation has a fixed slope, so there is no flexibility in changing the slope of the line.
• We cannot judge which feature is more important, and we cannot give priority to any feature.
• The learning algorithm is crude: we use a brute-force approach to find the
threshold value.
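The brute-force search mentioned in the last point can be sketched as follows. This is an illustrative sketch only: it assumes that model performance is measured as accuracy (the fraction of correctly classified examples), and the helper names are made up for this example.

```python
def mp_neuron(inputs, b):
    return 1 if sum(inputs) >= b else 0


def accuracy(X, Y, b):
    """Fraction of examples the MP neuron with threshold b classifies correctly."""
    correct = sum(mp_neuron(x, b) == y for x, y in zip(X, Y))
    return correct / len(X)


def brute_force_threshold(X, Y):
    n = len(X[0])
    # With n binary inputs the sum lies in 0..n, so only thresholds 0..n+1 can differ.
    return max(range(n + 2), key=lambda b: accuracy(X, Y, b))


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y_and = [0, 0, 0, 1]                      # AND function as toy training data
best_b = brute_force_threshold(X, Y_and)
print(best_b, accuracy(X, Y_and, best_b))
```

Because the inputs are binary, only a handful of thresholds need to be tried, which is what makes exhaustive search feasible here.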
9.
Perceptron
• The perceptron model, proposed by Frank Rosenblatt in 1958 and later refined and analyzed
by Minsky and Papert (1969), is a more general computational model than the McCulloch-Pitts
neuron. It overcomes some of the limitations of the M-P neuron by introducing the concept of
numerical weights (a measure of importance) for inputs, and a mechanism for learning those weights.
• Inputs are no longer limited to boolean values as in the case of an M-P neuron; the perceptron
supports real inputs as well, which makes it more useful and general.
10.
• This is very similar to an M-P neuron, but we take a weighted sum of the inputs and set the
output to one only when the sum reaches an arbitrary threshold (theta). However,
by convention, instead of hand-coding the threshold parameter theta, we add it
as one of the inputs (a constant input x_0 = 1 with weight w_0 = -theta), which makes it
learnable by the Perceptron Learning Algorithm.
11.
• Here, w_0 is called the bias because it represents the prior (prejudice).
• A football freak may have a very low threshold and may watch any football game irrespective
of the league, club or importance of the game [theta = 0].
• On the other hand, a selective viewer may only watch a football game that is a Premier League
game, features Man United and is not a friendly [theta = 2].
• The point is, the weights and the bias will depend on the data (viewing history in this case).
• Based on the data, the model may have to give a lot of importance (high weight) to
the isManUnitedPlaying input and penalize the weights of other inputs.
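A small sketch of the perceptron decision rule for this example, under stated assumptions: the three binary inputs other than isManUnitedPlaying and the unit weights are hypothetical, chosen only to show how the bias w_0 = -theta shifts the decision. A trained model would learn the weights from viewing history.

```python
def perceptron(x, w, w0):
    """Return 1 when w0 + sum(w_i * x_i) >= 0, else 0 (w0 plays the role of -theta)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0


# Hypothetical input order: [isPremierLeague, isManUnitedPlaying, isNotFriendly]
game = [1, 0, 0]
weights = [1.0, 1.0, 1.0]   # illustrative values, not learned

print(perceptron(game, weights, w0=0.0))    # football freak, theta = 0
print(perceptron(game, weights, w0=-2.0))   # selective viewer, theta = 2
```

The same game is watched by the viewer with theta = 0 and skipped by the viewer with theta = 2, so the bias alone captures the viewer's prior.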
12.
• Perceptron vs. McCulloch-Pitts Neuron
From the equations, it is clear that even a perceptron separates the input space into two halves, positive and
negative. All the inputs that produce an output 1 lie on one side (positive half space) and all the inputs that
produce an output 0 lie on the other side (negative half space).
In other words, a single perceptron can only be used to implement linearly separable functions, just like the M-P
neuron.
Then what is the difference? Why do we claim that the perceptron is an updated version of an M-P neuron? Here,
the weights, including the threshold can be learned and the inputs can be real values.
13.
• Boolean Functions Using Perceptron
The above ‘possible solution’ was obtained by solving the linear system of equations on the left. It is clear that
the solution separates the input space into two spaces, negative and positive half spaces.
14.
• Let us fix the threshold (−w0 = 1) and try different values of w1, w2
• Say, w1 = −1, w2 = −1
• What is wrong with this line? We make an error on 3 out of the 4 inputs
• Let's try some more values of w1, w2 and note how many errors we make
w1      w2      errors
−1      −1      3
1.5     0       1
0.45    0.45    3
• We are interested in those values of w0, w1, w2 which result in 0 error
• Let us plot the error surface corresponding to different values of w0, w1, w2
[Figure: the four OR inputs (0, 0), (0, 1), (1, 0), (1, 1) in the x1-x2 plane, with the candidate lines −1 + 1.1x1 + 1.1x2 = 0, −1 + (1.5)x1 + (0)x2 = 0, −1 + (−1)x1 + (−1)x2 = 0 and −1 + (0.45)x1 + (0.45)x2 = 0]
15.
• For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values
of w1, w2
• For a given w0, w1, w2 we will compute w0 + w1 ∗ x1 + w2 ∗ x2 for all combinations
of (x1, x2) and note down how many errors we make
• For the OR function, an error occurs if (x1, x2) = (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 ≥ 0,
or if (x1, x2) is not equal to (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 < 0
• We are interested in finding an algorithm which finds the values of w1, w2 which
minimize this error
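The error count described above can be checked with a short script. This is a sketch under the convention that the perceptron outputs 1 when w0 + w1*x1 + w2*x2 ≥ 0 with w0 fixed at −1; the function name `count_errors` is made up for this example.

```python
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def count_errors(w0, w1, w2, data=OR_DATA):
    """Number of inputs on which the perceptron output differs from the target."""
    errors = 0
    for (x1, x2), target in data:
        predicted = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        errors += int(predicted != target)
    return errors


# Candidate (w1, w2) pairs taken from the lines drawn for the OR function.
for w1, w2 in [(-1, -1), (1.5, 0), (0.45, 0.45), (1.1, 1.1)]:
    print(w1, w2, count_errors(-1, w1, w2))
```

Evaluating this function over a grid of (w1, w2) values gives the error surface; a learning algorithm replaces that exhaustive evaluation.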
• Our goal is to find the w vector that can perfectly classify positive inputs and
negative inputs in our data. Here goes:
We initialize w with some random vector. We then iterate
over all the examples in the data (P ∪ N), both positive
and negative. If an input x belongs to P, the dot product w.x
should ideally be greater than or equal to 0, because that is
what the perceptron requires for a positive output.
And if x belongs to N, the dot product must be less than
0. So if we look at the if conditions in the while loop:
Case 1: When x belongs to P and its dot product w.x < 0
Case 2: When x belongs to N and its dot product w.x ≥ 0
We update our randomly initialized w only in these cases,
because Case 1 and Case 2 violate the very rule of a
perceptron; otherwise, we don't touch w at all. We add x
to w (vector addition) in Case 1 and subtract x from w in
Case 2.
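The update rule described above can be sketched as follows. This is a minimal sketch, not the original pseudocode: it sweeps over the examples in a fixed order rather than sampling them, uses a seeded random initial w so the run is repeatable, and stops after a sweep with no updates. The function name and the `max_epochs` cap are assumptions for this example.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train_perceptron(P, N, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(P[0]))]   # random initial w
    for _ in range(max_epochs):
        updated = False
        for x in P:
            if dot(w, x) < 0:                  # Case 1: positive example misclassified
                w = [wi + xi for wi, xi in zip(w, x)]
                updated = True
        for x in N:
            if dot(w, x) >= 0:                 # Case 2: negative example misclassified
                w = [wi - xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:                        # a full sweep without errors: converged
            return w
    raise RuntimeError("no convergence; the data may not be linearly separable")


# OR function; every input carries a leading 1 so that w[0] acts as the bias w0.
P = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]
N = [(1, 0, 0)]
w = train_perceptron(P, N)
print(w)
```

For linearly separable data such as OR the loop terminates; for data like XOR it would hit the epoch cap instead.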
18.
Why Would The Specified Update Rule Work?
We have already established that when x belongs to P, we want w.x ≥ 0, basic perceptron
rule. What we also mean by that is that when x belongs to P, the angle
between w and x should be _____ than 90 degrees.
Answer: The angle between w and x should be
less than 90 because the cosine of the angle is
proportional to the dot product
19.
• So whatever the w vector may be, as long as it makes an angle less than 90 degrees with the
positive example data vectors (x ∈ P) and an angle more than 90 degrees with the negative
example data vectors (x ∈ N), it is a valid solution. So ideally, it should look something like this:
20.
• So we now know that the angle between w and x should be less than 90 degrees
when x belongs to the P class and more than 90 degrees when x belongs
to the N class. Here's why the update works:
When we add x to w, which we do when x
belongs to P and w.x < 0 (Case 1), we are
essentially increasing the cos(alpha) value, which
means we are decreasing alpha, the angle
between w and x, which is what we desire. A
similar intuition works for the case when x belongs
to N and w.x ≥ 0 (Case 2).
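A quick numerical check of this argument: since (w + x).x = w.x + x.x and x.x ≥ 0, adding x can only increase the dot product with x. The vectors below are arbitrary illustrative values chosen so that Case 1 applies; the helper name `angle_deg` is made up for this example.

```python
import math


def angle_deg(u, v):
    """Angle between two vectors, from cos(alpha) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return math.degrees(math.acos(dot / norm))


w = (-1.0, 0.5)
x = (1.0, 1.0)                       # positive example with w.x = -0.5 < 0 (Case 1)
w_new = tuple(wi + xi for wi, xi in zip(w, x))

print(angle_deg(w, x))               # angle before the update
print(angle_deg(w_new, x))           # angle after adding x to w
```

A single update is not guaranteed to bring the angle below 90 degrees in general; repeated updates move w in the right direction each time.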
21.
• Here's a toy simulation of how we might end up learning a w that makes an angle less than 90
degrees with positive examples and more than 90 degrees with negative examples.
22.
• XOR Function: Can't Do!
• Now let's look at a non-linear boolean function, i.e. one where we cannot draw a line to
separate positive inputs from the negative ones.
Notice that the fourth equation contradicts the second and the third equation. The point is, there are
no perceptron solutions for data that is not linearly separable. So the key takeaway is that
a single perceptron cannot learn to separate data that is not linearly separable.
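Under the convention that the perceptron outputs 1 when w0 + w1*x1 + w2*x2 ≥ 0, XOR requires w0 < 0, w0 + w2 ≥ 0, w0 + w1 ≥ 0 and w0 + w1 + w2 < 0. Adding the second and third gives 2*w0 + w1 + w2 ≥ 0, so w0 + w1 + w2 ≥ −w0 > 0, which contradicts the fourth. The sketch below is a numerical illustration of the same fact, not a proof: it searches a coarse grid of weights (the grid range and step are arbitrary choices) and reports the smallest error count found.

```python
import itertools

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def count_errors(w0, w1, w2, data):
    return sum((1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0) != target
               for (x1, x2), target in data)


grid = [i / 4 for i in range(-20, 21)]        # -5.0 to 5.0 in steps of 0.25
best = min(count_errors(w0, w1, w2, XOR_DATA)
           for w0, w1, w2 in itertools.product(grid, repeat=3))
print("fewest errors on XOR:", best)
```

No weight combination on the grid reaches zero errors, in line with the contradiction derived above.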
23.
In the book published by Minsky and Papert in 1969, the authors implied that,
since a single artificial neuron is incapable of implementing some functions such
as the XOR logical function, larger networks also have similar limitations, and
therefore should be dropped.
Later research on three-layered perceptrons showed how to implement such
functions, saving the technique from obliteration.
24.
MLP (Multi-layer perceptrons):
[Figure: network with inputs x1, x2, four hidden perceptrons h1, h2, h3, h4 (bias = −2) and one output neuron y with weights w1, w2, w3, w4; red edge indicates w = −1, blue edge indicates w = +1]
Terminology:
• This network contains 3 layers
• The layer containing the inputs (x1, x2) is called the input layer
• The middle layer containing the 4 perceptrons is called the hidden layer
• The final layer containing one output neuron is called the output layer
• The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4
• The red and blue edges are called layer 1 weights
• w1, w2, w3, w4 are called layer 2 weights
25.
[Figure: the same network with hidden perceptrons h1 (weights −1, −1), h2 (−1, 1), h3 (1, −1) and h4 (1, 1), bias = −2, feeding the output neuron y through weights w1, w2, w3, w4; red edge indicates w = −1, blue edge indicates w = +1]
• Let w0 be the bias of the output neuron (i.e., it will fire if Σ_{i=1..4} wi hi ≥ w0)
x1   x2   XOR   h1   h2   h3   h4   Σ_{i=1..4} wi hi
0    0    0     1    0    0    0    w1
0    1    1     0    1    0    0    w2
1    0    1     0    0    1    0    w3
1    1    0     0    0    0    1    w4
• This results in the following four conditions to implement XOR: w1 < w0, w2 ≥ w0, w3 ≥ w0, w4 < w0
• Unlike before, there are no contradictions now and the system of inequalities can be satisfied
• Essentially each wi is now responsible for one of the 4 possible inputs and can be adjusted to get
the desired output for that input
[Figure: the three-input version with inputs x1, x2, x3, 8 hidden perceptrons (bias = −3) and an output neuron y with weights w1, w2, w3, w4, w5, w6, w7, w8]
• Again each of the 8 perceptrons will fire only for one of the 8 inputs
• Each of the 8 weights in the second layer is responsible for one of the 8 inputs and can be
adjusted to produce the desired output for that input
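The two-input XOR network can be sketched in Python as below. Two assumptions are made loudly here: the 0/1 inputs are re-encoded as −1/+1 before reaching the hidden layer (so that a hidden perceptron with bias −2 reaches a weighted sum of 2 only on its own input pattern), and the output weights (0, 1, 1, 0) with w0 = 1 are just one choice that satisfies w1 < w0, w2 ≥ w0, w3 ≥ w0, w4 < w0.

```python
def fires(weighted_sum, threshold):
    return 1 if weighted_sum >= threshold else 0


# Layer 1 weights of h1..h4 (red edge = -1, blue edge = +1).
HIDDEN_WEIGHTS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def xor_network(x1, x2, w=(0, 1, 1, 0), w0=1):
    # Assumption: encode 0 -> -1 and 1 -> +1 so each hidden unit matches one pattern.
    s1, s2 = 2 * x1 - 1, 2 * x2 - 1
    h = [fires(a * s1 + b * s2, 2) for a, b in HIDDEN_WEIGHTS]   # bias = -2
    return fires(sum(wi * hi for wi, hi in zip(w, h)), w0)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

Choosing different layer 2 weights turns the same network into any other boolean function of two inputs, because each wi controls the output for exactly one input pattern.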
28.
Sigmoid Neurons
• The artificial neurons we use today are slightly different from the perceptron we looked at;
the difference is the activation function. Some might say that the thresholding logic
used by a perceptron is very harsh. For example, consider the problem of deciding whether I
will watch a movie, based only on one real-valued input (x_1 = criticsRating).
If the threshold we set is 0.5 (w_0 = -0.5) and w_1 = 1, then our setup would look like
this:
What would be the decision for a movie with criticsRating =
0.51? Yes!
What would be the decision for a movie with criticsRating =
0.49? No!
Some might say that it's harsh that we would watch a movie with a
rating of 0.51 but not one with a rating of 0.49, and this is
where Sigmoid comes into the picture.
29.
There will be a sudden change in the decision (from 0 to 1) when the z value crosses the
threshold (-w_0). For most real-world applications we would expect a smoother decision
function which gradually changes from 0 to 1.
30.
• Introducing sigmoid neurons, where the output function is much smoother than the step
function, seems like a logical and obvious thing to do. A sigmoid function is a
mathematical function with a characteristic "S"-shaped curve, also called the sigmoid curve.
There are many functions that can do the job; some are shown below:
31.
• One of the simplest to work with is the logistic function.
We no longer see a sharp transition around the threshold (-w_0). Also, the output is no longer binary but a real value
between 0 and 1, which can be interpreted as a probability. So instead of a yes/no decision, we get the
probability of yes. The output here is smooth, continuous and differentiable, which is just how any learning
algorithm likes it.
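The difference between the step output and the logistic output can be seen on the movie example. This is a small sketch using the values from the example (w_0 = -0.5, w_1 = 1) and the standard logistic function 1 / (1 + e^(-z)); the function names are made up for this illustration.

```python
import math


def step_neuron(x, w0=-0.5, w1=1.0):
    """Perceptron-style hard threshold: 1 when w0 + w1*x >= 0."""
    return 1 if w0 + w1 * x >= 0 else 0


def sigmoid_neuron(x, w0=-0.5, w1=1.0):
    """Logistic output in (0, 1), interpretable as the probability of 'yes'."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))


for rating in (0.49, 0.51):
    print(rating, step_neuron(rating), round(sigmoid_neuron(rating), 4))
```

The step neuron flips from 0 to 1 between the two ratings, while the sigmoid neuron gives nearly identical outputs close to 0.5 for both.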
32.
Feed forward neural network
• A feed forward neural network is a biologically inspired classification algorithm. It
consists of a (possibly large) number of simple neuron-like processing units, organized
in layers. Every unit in a layer is connected with all the units in the previous layer.
These connections are not all equal: each connection may have a different strength
or weight. The weights on these connections encode the knowledge of a network.
Often the units in a neural network are also called nodes.
• Data enters at the inputs and passes through the network, layer by layer, until it arrives
at the outputs. During normal operation, that is when it acts as a classifier, there is no
feedback between layers. This is why they are called feed forward neural networks.
33.
The 3 inputs are shown as circles, and these do not belong to any
layer of the network (although the inputs are sometimes
considered a virtual layer with layer number 0). Any layer
that is not an output layer is a hidden layer. This network
therefore has 1 hidden layer and 1 output layer. The figure also
shows all the connections between the units in different layers.
A layer only connects to the previous layer.
The operation of this network can be divided into two phases:
1. The learning phase
2. The classification phase
The figure shows an example of a 2-layered
network with, from top to bottom, an output layer with 5 units
and a hidden layer with 4 units. The network has 3
input units.
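A forward pass through a network of this shape (3 inputs, 4 hidden units, 5 output units) can be sketched as below. The weights are random placeholders rather than trained values, and the sigmoid activation and zero biases are assumptions made for this illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Every unit computes a weighted sum over all units of the previous layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden_w, hidden_b = random_layer(3, 4, rng)    # 3 inputs -> 4 hidden units
output_w, output_b = random_layer(4, 5, rng)    # 4 hidden units -> 5 output units

pattern = [0.2, 0.7, 0.1]
hidden = layer_forward(pattern, hidden_w, hidden_b)
outputs = layer_forward(hidden, output_w, output_b)
category = outputs.index(max(outputs))          # the largest output decides the class
print(outputs, category)
```

Data flows strictly from the inputs to the outputs with no feedback, which is the defining property of a feed forward network.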
34.
• The FFNet uses a supervised learning algorithm: besides the input pattern, the neural net also
needs to know to what category the pattern belongs.
• Learning proceeds as follows: a pattern is presented at the inputs. The pattern will be
transformed in its passage through the layers of the network until it reaches the output layer.
• Each unit in the output layer belongs to a different category. The actual outputs of the
network are compared with the outputs it would ideally have produced if this pattern
were correctly classified: in that case the unit with the correct category would have
the largest output value and the output values of the other output units would be very
small. On the basis of this comparison, all the connection weights are modified slightly so
that, the next time this same pattern is presented at the inputs, the value of the output
unit that corresponds with the correct category is a little higher than it is now and, at
the same time, the output values of all the other, incorrect outputs are a little lower than they
are now.
• (The differences between the actual outputs and the idealized outputs are propagated back from
the top layer to lower layers to be used at these layers to modify connection weights. This is
why the term backpropagation network is also often used to describe this type of neural
network.)
35.
• If you perform the procedure above once for every pattern and category pair in
your data set you have performed one epoch of learning.
• The hope is that eventually, probably after many epochs, the neural net will come
to remember these pattern-category pairs. You even hope that the neural net, when
the learning phase has terminated, will be able to generalize and has learned
to classify correctly any unknown pattern presented to it.
• Because real-life data often contains noise as well as partly contradictory
information, these hopes can be fulfilled only partly.
• For learning you need to select three different objects together: an FFNet
(the classifier), a PatternList (the inputs) and a Categories object (the correct outputs).
How long will the learning phase take?
In general, this question is hard to answer. How much computing time the learning phase takes
depends on the size of the neural network, the number of patterns to be learned, the number
of epochs, the tolerance of the minimizer and the speed of your computer.
36.
In the classification phase, the weights of the network are fixed.
• A pattern, presented at the inputs, will be transformed from layer to layer until
it reaches the output layer. Now classification can occur by selecting the
category associated with the output unit that has the largest output value. For
classification we only need to select an FFNet and a PatternList together and
choose To Categories....
37.
Back propagation:
• It was first introduced in the 1960s and was popularized roughly two decades later by David
Rumelhart, Geoffrey Hinton, and Ronald Williams in their famous 1986 paper. In this paper, they
spoke about the various neural networks.
• Today, back propagation is the standard way to train neural networks. With this approach, we
fine-tune the weights of a neural net based on the error rate obtained in the previous run.
Applied correctly, this technique reduces error rates and makes the model more reliable.
• Back propagation trains a neural network using the chain rule. In simple terms,
after each forward pass through the network, the algorithm performs a backward pass to
adjust the model's parameters (weights and biases).
• A typical supervised learning algorithm attempts to find a function that maps input data to the
right output. Back propagation works with a multi-layered neural network and learns internal
representations of input to output mapping.
38.
How does backpropagation work?
Let us take a look at how back propagation works. The example
network has four layers: an input layer, hidden layer I, hidden layer II
and a final output layer.
• So, the three main types of layer are:
• Input layer
• Hidden layer
• Output layer
• Each layer plays its own part in producing the desired
result. The remaining details needed to summarize the
algorithm are discussed below.
This image summarizes the functioning of the backpropagation approach.
1. The input layer receives x
2. The input is modeled using weights w
3. Each hidden layer calculates its output, and the data is ready at the output layer
4. The difference between the actual output and the desired output is known as the error
5. Go back to the hidden layers and adjust the weights so that this error is
reduced in future runs
6. This process is repeated till we get the desired output. The training phase is
done with supervision. Once the model is stable, it is used in production.
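The six steps above can be sketched for a tiny network with one hidden layer. This is an illustrative sketch under several assumptions that are not stated in the text: sigmoid activations, a squared-error loss 0.5 * (output − target)^2, plain gradient descent with learning rate 0.5, arbitrary starting weights and a single training example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    h, y = forward(x, W1, b1, W2, b2)           # steps 1-3: forward pass
    error = y - target                          # step 4: error at the output
    delta_out = error * y * (1 - y)             # chain rule through the output sigmoid
    # step 5: propagate the error back to the hidden units (uses W2 before its update)
    delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(len(h))]
    for j in range(len(h)):
        W2[j] -= lr * delta_out * h[j]
        b1[j] -= lr * delta_hid[j]
        for i in range(len(x)):
            W1[j][i] -= lr * delta_hid[j] * x[i]
    b2 -= lr * delta_out
    return 0.5 * error ** 2, b2


W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]  # arbitrary starting weights
W2, b2 = [0.2, -0.1], 0.0
losses = []
for _ in range(200):                            # step 6: repeat
    loss, b2 = train_step([1.0, 0.0], 1.0, W1, b1, W2, b2)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss on the training example shrinks as the weights are adjusted; a real training run would loop over many examples and monitor held-out data as well.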
39.
Loss function
• A loss function maps one or more variables to a real number that represents the cost
associated with those values. In back propagation, the loss function
measures the difference between the network output and its expected output.
Why do we need back propagation?
Back propagation has many advantages, some of the important ones are listed
below-
• Back propagation is fast, simple and easy to implement
• There are no parameters to be tuned
• Prior knowledge about the network is not needed thus becoming a flexible
method
• This approach works very well in most cases
• The model need not learn the features of the function
40.
Types of backpropagation
There are two types of back propagation networks.
• Static back propagation
• Recurrent back propagation
Static back propagation
• In this network, mapping of a static input generates static output. Static
classification problems like optical character recognition will be a suitable domain
for static back propagation.
Recurrent back propagation
• In recurrent back propagation, activations are fed forward until a certain threshold is met.
After the threshold, the error is calculated and propagated backward.
• The difference between these two approaches is that the mapping is immediate in static back
propagation, whereas in recurrent back propagation it is not static.
41.
WHAT IS PRINCIPAL COMPONENT ANALYSIS?
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller
one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the
trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data
sets are easier to explore and visualize, and they make analyzing data much easier and faster for
machine learning algorithms, which have no extraneous variables to process.
So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.
42.
STEP BY STEP EXPLANATION OF PCA
STEP 1: STANDARDIZATION
• The aim of this step is to standardize the range of the continuous initial variables so that each one of them
contributes equally to the analysis.
• More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite
sensitive regarding the variances of the initial variables. That is, if there are large differences between the ranges
of initial variables, those variables with larger ranges will dominate over those with small ranges (For example, a
variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will
lead to biased results. So, transforming the data to comparable scales can prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value
of each variable
Once the standardization is done, all the variables will be transformed to the same scale.
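Standardization as described above, z = (value − mean) / standard deviation, can be sketched with the Python standard library. The two toy columns are made up to mirror the 0 to 100 versus 0 to 1 example, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which variant is meant.

```python
import statistics


def standardize(column):
    """Subtract the mean and divide by the standard deviation, value by value."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)          # population standard deviation (assumption)
    return [(value - mean) / std for value in column]


wide = [0, 25, 50, 75, 100]                  # variable ranging between 0 and 100
narrow = [0.0, 0.25, 0.5, 0.75, 1.0]         # variable ranging between 0 and 1

z_wide = standardize(wide)
z_narrow = standardize(narrow)
print(z_wide)
print(z_narrow)
```

After standardization both columns have the same scale, so neither dominates the analysis because of its original range.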
43.
STEP 2: COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data set are varying
from the mean with respect to each other, or in other words, to see if there is any
relationship between them. Because sometimes, variables are highly correlated in such a
way that they contain redundant information. So, in order to identify these correlations, we
compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions)
that has as entries the covariances associated with all possible pairs of the initial variables.
For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix
is a 3×3 matrix of this form:
Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
Since the covariance of a variable with itself is its variance
(Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom
right) we actually have the variances of each initial variable.
And since the covariance is commutative (Cov(a,b)=Cov(b,a)),
the entries of the covariance matrix are symmetric with respect
to the main diagonal, which means that the upper and the lower
triangular portions are equal.
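The covariance matrix and its two properties (variances on the diagonal, symmetry) can be illustrated with a small pure-Python sketch. The toy data and the use of the sample covariance (dividing by n − 1) are assumptions made for this example.

```python
def covariance_matrix(rows):
    """p x p covariance matrix of a data set given as n rows of p variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance (divide by n - 1); this choice is an assumption.
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


# Toy data with variables x, y, z; y is exactly 2 * x, so x and y are highly correlated.
rows = [(1, 2, 0), (2, 4, 1), (3, 6, 0), (4, 8, 1)]
cov = covariance_matrix(rows)
for line in cov:
    print(line)
```

The large Cov(x,y) entry relative to the variances signals the redundancy between x and y that PCA later exploits.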
44.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE
COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues are the linear algebra concepts that we need to
compute from the covariance matrix in order to determine the principal
components of the data. Before getting to the explanation of these concepts,
let's first understand what we mean by principal components.
• Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables. These combinations are done
in such a way that the new variables (i.e., principal components) are
uncorrelated and most of the information within the initial variables is squeezed
or compressed into the first components. So, the idea is 10-dimensional data
gives you 10 principal components, but PCA tries to put maximum possible
information in the first component, then maximum remaining information in the
second and so on, until having something like shown in the scree plot below.
46.
• Organizing information in principal components this way allows you to reduce
dimensionality without losing much information, by discarding the components
with low information and considering the remaining components as your new variables.
• An important thing to realize here is that, the principal components are less interpretable
and don’t have any real meaning since they are constructed as linear combinations of the
initial variables.
• Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance, that is to say, the lines that capture most
information of the data. The relationship between variance and information here, is that, the
larger the variance carried by a line, the larger the dispersion of the data points along it, and
the larger the dispersion along a line, the more the information it has. To put all this simply,
just think of principal components as new axes that provide the best angle to see and
evaluate the data, so that the differences between the observations are better visible.
47.
STEP 4: FEATURE VECTOR
• As we saw in the previous step, computing the eigenvectors and ordering them
by their eigenvalues in descending order allows us to find the principal
components in order of significance. In this step, we choose
whether to keep all these components or discard those of lesser significance (of
low eigenvalues), and form with the remaining ones a matrix of vectors that we
call the feature vector.
• So, the feature vector is simply a matrix that has as columns the eigenvectors of
the components that we decide to keep. This makes it the first step towards
dimensionality reduction, because if we choose to keep only p eigenvectors
(components) out of n, the final data set will have only p dimensions.
48.
LAST STEP: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS
AXES
• In the previous steps, apart from standardization, we do not make any changes
on the data, we just select the principal components and form the feature vector,
but the input data set remains always in terms of the original axes (i.e, in terms
of the initial variables).
• In this step, which is the last one, the aim is to use the feature vector formed
using the eigenvectors of the covariance matrix, to reorient the data from the
original axes to the ones represented by the principal components (hence the
name Principal Components Analysis). This can be done by multiplying the
transpose of the feature vector by the transpose of the standardized original data set.
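Steps 3 to 5 can be sketched end to end with NumPy. This is a sketch under stated assumptions: the synthetic data set (three variables, two of them strongly correlated) and the choice to keep k = 2 components are made up for illustration, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Synthetic data: the second variable is almost a multiple of the first.
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# Steps 1-2: standardize, then compute the covariance matrix (variables in columns).
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov = np.cov(standardized, rowvar=False)

# Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]                  # sort in descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: feature vector = matrix whose columns are the eigenvectors we keep.
k = 2
feature_vector = eigenvectors[:, :k]

# Last step: FeatureVector^T x StandardizedData^T, transposed back to rows = samples.
final_data = (feature_vector.T @ standardized.T).T
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(final_data.shape, round(float(explained), 3))
```

The ratio `explained` is the share of total variance retained by the kept components, which is one common way to decide how many components to keep.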
49.
Convolutional neural networks:
• Convolutional neural networks, also called ConvNets, were first introduced in
the 1980s by Yann LeCun, a postdoctoral computer science researcher. LeCun
had built on the work done by Kunihiko Fukushima, a Japanese scientist who, a
few years earlier, had invented the neocognitron, a very basic image
recognition neural network.
50.
How do CNNs work?
• Convolutional neural networks are composed of multiple layers of artificial
neurons. Artificial neurons, a rough imitation of their biological counterparts, are
mathematical functions that calculate the weighted sum of multiple inputs and
output an activation value.
51.
• The behavior of each neuron is defined by its weights. When fed with the pixel
values, the artificial neurons of a CNN pick out various visual features.
• When you input an image into a ConvNet, each of its layers generates several
activation maps. Activation maps highlight the relevant features of the image.
Each of the neurons takes a patch of pixels as input, multiplies their color values
by its weights, sums them up, and runs them through the activation function.
• The first (or bottom) layer of the CNN usually detects basic features such as
horizontal, vertical, and diagonal edges. The output of the first layer is fed as input
of the next layer, which extracts more complex features, such as corners and
combinations of edges. As you move deeper into the convolutional neural
network, the layers start detecting higher-level features such as objects, faces, and
more.
52.
• The operation of multiplying pixel values by weights and summing them is called
“convolution” (hence the name convolutional neural network). A CNN is usually
composed of several convolution layers, but it also contains other components.
The final layer of a CNN is a classification layer, which takes the output of the
final convolution layer as input (remember, the higher convolution layers detect
complex objects).
• Based on the activation map of the final convolution layer, the classification layer
outputs a set of confidence scores (values between 0 and 1) that specify how likely
the image is to belong to a “class.” For instance, if you have a ConvNet that
detects cats, dogs, and horses, the output of the final layer is the probability that
the input image contains each of those animals.
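The multiply-and-sum operation can be sketched in plain Python. This is a minimal illustration with no padding and a stride of 1; the 4×4 image and the 2×2 vertical-edge kernel are made-up values, whereas in a trained CNN the kernel weights are learned.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output is a weighted sum of one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Toy grayscale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
vertical_edge = [[-1, 1],
                 [-1, 1]]

activation_map = conv2d(image, vertical_edge)
for row in activation_map:
    print(row)
```

The activation map is non-zero only where the patch straddles the dark-to-bright boundary, which is how a first-layer neuron highlights a vertical edge.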
53.
Training the convolutional neural network
• One of the great challenges of developing CNNs is adjusting the weights of the
individual neurons to extract the right features from images. The process of
adjusting these weights is called “training” the neural network.
• In the beginning, the CNN starts off with random weights. During training, the
developers provide the neural network with a large dataset of images annotated
with their corresponding classes (cat, dog, horse, etc.). The ConvNet processes
each image with its random values and then compares its output with the image’s
correct label. If the network’s output does not match the label—which is likely
the case at the beginning of the training process—it makes a small adjustment to
the weights of its neurons so that the next time it sees the same image, its output
will be a bit closer to the correct answer.
54.
• The corrections are made through a technique called backpropagation (or
backprop). Essentially, backpropagation optimizes the tuning process and makes
it easier for the network to decide which units to adjust instead of making
random corrections.
• Every run of the entire training dataset is called an “epoch.” The ConvNet goes
through several epochs during training, adjusting its weights in small amounts.
After each epoch, the neural network becomes a bit better at classifying the
training images. As the CNN improves, the adjustments it makes to the weights
become smaller and smaller. At some point, the network “converges,” which
means it essentially becomes as good as it can.
• After training the CNN, the developers use a test dataset to verify its accuracy.
The test dataset is a set of labeled images that were not part of the training
process. Each image is run through the ConvNet, and the output is compared to
the actual label of the image. Essentially, the test dataset evaluates how good the
neural network has become at classifying images it has not seen before.
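The training procedure described above (random starting weights, small corrections after each example, repeated epochs, then evaluation on unseen data) can be sketched on a toy problem. This is only an illustration under stated assumptions: it trains a single sigmoid neuron rather than a CNN, the dataset is synthetic, and names such as make_dataset and learning_rate are hypothetical.

```python
import math
import random

random.seed(0)

def make_dataset(n):
    """Hypothetical toy data: the label is 1 when x1 + x2 > 1, else 0."""
    data = []
    for _ in range(n):
        x1, x2 = random.random(), random.random()
        data.append(((x1, x2), 1 if x1 + x2 > 1 else 0))
    return data

def predict(weights, bias, x):
    """Single sigmoid neuron: probability that x belongs to class 1."""
    z = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(weights, bias, data):
    correct = sum((predict(weights, bias, x) > 0.5) == (y == 1) for x, y in data)
    return correct / len(data)

train_set = make_dataset(200)
test_set = make_dataset(100)   # labeled examples never used for training

# Start from random weights.
weights = [random.uniform(-1, 1), random.uniform(-1, 1)]
bias = 0.0
learning_rate = 0.5

for epoch in range(100):       # one epoch = one full pass over train_set
    for x, y in train_set:
        p = predict(weights, bias, x)
        error = p - y          # gradient of the cross-entropy loss w.r.t. z
        # Small adjustment to each weight in the direction that reduces the loss.
        weights[0] -= learning_rate * error * x[0]
        weights[1] -= learning_rate * error * x[1]
        bias -= learning_rate * error

train_acc = accuracy(weights, bias, train_set)
test_acc = accuracy(weights, bias, test_set)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

In a real CNN the same loop applies, but backpropagation carries the error signal through many layers instead of a single neuron.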
55.
Evolution of CNN Architectures:
LeNet, AlexNet, ZFNet, GoogLeNet, VGG and ResNet
• It all started with LeNet in 1998 and eventually, after nearly 15 years, led to groundbreaking models winning
the ImageNet Large Scale Visual Recognition Challenge: AlexNet in 2012, GoogLeNet in 2014,
ResNet in 2015, and an ensemble of previous models in 2016. In the last 2 years, no significant progress has been
made and the new models are ensembles of previous groundbreaking models.
• A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural network, designed
to recognize visual patterns directly from pixel images with minimal preprocessing. The ImageNet project is a large
visual database designed for use in visual object recognition software research. The ImageNet project runs an
annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software
programs compete to correctly classify and detect objects and scenes.
56.
LeNet in 1998
• LeNet is a 7-level convolutional network introduced by LeCun et al. in 1998 that classifies digits. It was used by several
banks to recognize hand-written numbers on cheques digitized as 32x32 pixel greyscale input images.
Processing higher-resolution images requires larger and more numerous convolutional layers, so this
technique is constrained by the availability of computing resources.
AlexNet in 2012
• AlexNet is considered to be the first model to spark widespread interest in CNNs, when it won the ImageNet challenge in
2012. AlexNet is a deep CNN trained on ImageNet that outperformed all other entries that year. It was a major
improvement: AlexNet reached a 15.3% top-5 test error rate, while the next best entry got only 26.2%. Compared to modern
architectures, a relatively simple layout was used in this paper.
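The top-5 error rate mentioned above counts a prediction as wrong only when the true label is not among the five highest-scoring classes. A minimal sketch of that metric follows; the scores and labels are made-up values for a hypothetical 6-class problem, and top_k_error is an illustrative helper, not part of any contest tooling.

```python
def top_k_error(score_rows, true_labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for scores, label in zip(score_rows, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)

# Hypothetical classifier outputs: one row of 6 class scores per image.
score_rows = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.05, 0.10, 0.40, 0.30, 0.10, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
]
true_labels = [1, 5, 3, 0]

top5 = top_k_error(score_rows, true_labels, k=5)
top1 = top_k_error(score_rows, true_labels, k=1)
print(f"top-5 error: {top5:.2f}, top-1 error: {top1:.2f}")
```

Top-5 error is always less than or equal to top-1 error, because the correct class only needs to appear somewhere among the five best guesses.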
ZFNet in 2013
• ZFNet is a modified version of AlexNet that gives better accuracy.
• One major difference in the approaches was that ZFNet used 7x7 filters in its first convolutional layer whereas AlexNet used
11x11 filters. The intuition behind this is that bigger filters lose a lot of pixel
information, which can be retained by using smaller filter sizes in the earlier conv layers. The number
of filters increases as we go deeper. This network also used ReLUs for activation and was trained
using mini-batch stochastic gradient descent.
57.
VGG in 2014
• The idea of VGG was submitted in 2013 and it became a runner-up in the ImageNet contest in 2014. It is
widely used as a simple architecture compared to AlexNet and ZFNet.
• VGG Net used 3x3 filters compared to 11x11 filters in AlexNet and 7x7 in ZFNet. The authors' intuition
is that two consecutive 3x3 filters give an effective receptive field of 5x5, and three consecutive
3x3 filters give a receptive field of 7x7, while requiring far fewer parameters
to be trained in the network.
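The receptive-field and parameter argument above can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions, ignores biases, and uses the same number of input and output channels in every layer; the channel count of 64 is an arbitrary value chosen for illustration.

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked stride-1 convolutions with equal kernel size."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count for a stack of conv layers with `channels` inputs and outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels

C = 64  # hypothetical channel count

for layers, big_kernel in [(2, 5), (3, 7)]:
    rf = receptive_field(3, layers)
    stacked = conv_weights(3, C, layers)
    single = conv_weights(big_kernel, C)
    print(f"{layers} x (3x3): receptive field {rf}x{rf}, {stacked} weights; "
          f"one {big_kernel}x{big_kernel}: {single} weights")
```

Under these assumptions the stacked 3x3 layers cover the same area as the larger filter with fewer weights, and they also insert an extra non-linearity between the layers.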
GoogLeNet in 2014
• In 2014, several great models were developed, like VGG, but the winner of the ImageNet contest was
GoogLeNet.
• GoogLeNet proposed the inception module, which applies filters of several sizes (1x1, 3x3, 5x5) and pooling to the same
input in parallel and concatenates the results, forming a mini module that is repeated throughout the network.
• GoogLeNet uses 9 inception modules and replaces the large fully connected layers with average pooling that goes
from 7x7x1024 to 1x1x1024. This saves a lot of parameters.
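The average pooling step from 7x7x1024 to 1x1x1024 can be sketched as follows. The feature map is filled with random values purely for illustration, and the 1000-class classifier used in the parameter comparison is an assumption (an ImageNet-style output layer), not a figure taken from the text above.

```python
import random

random.seed(0)
H, W, C = 7, 7, 1024

# Hypothetical feature map indexed as feature_map[row][col][channel].
feature_map = [[[random.random() for _ in range(C)] for _ in range(W)]
               for _ in range(H)]

def global_average_pool(fmap):
    """Average each channel over all spatial positions: H x W x C -> C values."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

pooled = global_average_pool(feature_map)
print(len(pooled))  # one value per channel, i.e. a 1x1x1024 output

num_classes = 1000                          # assumption: ImageNet-style classifier
weights_flatten = H * W * C * num_classes   # classifier fed the flattened 7x7x1024 map
weights_pooled = C * num_classes            # classifier fed the pooled 1x1x1024 vector
print(weights_flatten, weights_pooled)
```

Pooling itself has no trainable weights, so the saving comes entirely from the smaller input handed to the classifier that follows.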
58.
ResNet in 2015
• There are 152 layers in the Microsoft ResNet. The authors showed empirically that with residual connections, adding more layers
keeps decreasing the error rate, in contrast to “plain nets” where adding a few layers resulted in higher
training and test errors. It took two to three weeks to train it on an 8 GPU machine. One intuitive reason why
residual blocks improve classification is that each block has a direct path (skip connection) from its input to its output. Together,
these skip connections form a gradient highway along which the computed gradients can directly reach the weights in the first
layers, making their updates more effective.
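The gradient highway intuition can be illustrated with a deliberately simplified toy model. Each layer here is a scalar function F(x) = w * x, a plain layer outputs F(x), and a residual layer outputs x + F(x). The values of w and depth are hypothetical, and real residual blocks contain convolutions and non-linearities, so this is only a sketch of the idea.

```python
def plain_layer(x, w):
    return w * x          # toy layer F(x) = w * x

def residual_layer(x, w):
    return x + w * x      # skip connection adds the input back: x + F(x)

def forward(x, layer, w, depth):
    """Pass x through `depth` copies of the given layer."""
    for _ in range(depth):
        x = layer(x, w)
    return x

def input_gradient(layer, w, depth, x=1.0, eps=1e-6):
    """Central-difference estimate of d(output)/d(input) through the stack."""
    up = forward(x + eps, layer, w, depth)
    down = forward(x - eps, layer, w, depth)
    return (up - down) / (2 * eps)

w, depth = 0.01, 20       # hypothetical values: weak layers, 20 of them stacked

plain_grad = input_gradient(plain_layer, w, depth)
residual_grad = input_gradient(residual_layer, w, depth)
print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {residual_grad:.3e}")
```

In the plain stack the gradient is a product of 20 small factors and shrinks toward zero, while in the residual stack each factor is 1 + w, so the signal reaching the first layer stays at a usable size.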