Perceptron and Activation Function in deep learning

Introduction to Perceptron
Dr. E. Sudheer Kumar

Biological Networks
1. The majority of neurons encode their
outputs or activations as a series of brief
electrical pulses (i.e. spikes or action
potentials).
2. Dendrites are the receptive zones that
receive activation from other neurons.
3. The cell body (soma) of the neuron
processes the incoming activations and
converts them into output activations.
4. Axons are transmission lines that send
activation to other neurons.
5. Synapses allow weighted transmission of
signals (using neurotransmitters) between
axons and dendrites to build up large neural
networks.

Types of Networks
• Feedforward networks
• These compute a series of
transformations
• Typically, the first layer is the
input and the last layer is the
output.
• Recurrent networks
• These have directed cycles in their
connection graph. They can have
complicated dynamics.
• More biologically realistic.
[Figure: feedforward network with input units, hidden units, and output units]

Different Network Topologies
• Single layer feed-forward networks
• Input layer projecting into the output layer
[Figure: single layer network with an input layer and an output layer]

• Multi-layer feed-forward networks
• One or more hidden layers. Each layer receives input only from
previous layers.
[Figure: 2-layer (1-hidden-layer) fully connected network with input, hidden, and output layers]

• Multi-layer feed-forward networks
[Figure: multi-layer network with an input layer, several hidden layers, and an output layer]

• Recurrent networks
• A network with feedback, where some of its inputs are connected to
some of its outputs (discrete time).
[Figure: recurrent network with an input layer, an output layer, and feedback connections]

MOTIVATION
• Neural networks mimic the way our brains solve problems: by
taking in inputs, processing them, and generating an output. Like us,
they learn to recognize patterns, but they do this by training on
labelled datasets. Before we get into the learning part, let’s take a
look at the most basic of artificial neurons: the perceptron, and how
it processes inputs and produces an output.

THE PERCEPTRON
• Perceptrons were developed way back in the 1950s-60s by the scientist Frank Rosenblatt, inspired by earlier
work from Warren McCulloch and Walter Pitts. While today we use other models of artificial neurons, they
follow the general principles set by the perceptron.
Model of an artificial neuron
• As you can see, the network of nodes sends signals in one direction. This is called a feed-forward
network.
• The figure depicts a neuron connected to n other neurons, which thus receives n inputs (x1, x2, ..., xn). This
configuration is called a Perceptron.

Perceptron Example
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0

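[Figure: input nodes X1, X2, X3, each connected with weight 0.3 to an output node Y (the "black box") with threshold t = 0.4]

$$Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0), \quad \text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$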
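The example perceptron can be checked against the table with a few lines of Python. This is a minimal sketch: the weights (0.3 each) and threshold (0.4) come from the slide, while the function and variable names are illustrative choices.

```python
# Weights and threshold as given on the slide: all weights 0.3, t = 0.4.
WEIGHTS = (0.3, 0.3, 0.3)
THRESHOLD = 0.4

def indicator(condition):
    """I(z): 1 if the condition z is true, 0 otherwise."""
    return 1 if condition else 0

def perceptron(x1, x2, x3):
    """Y = I(0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4 > 0)."""
    weighted_sum = WEIGHTS[0] * x1 + WEIGHTS[1] * x2 + WEIGHTS[2] * x3
    return indicator(weighted_sum - THRESHOLD > 0)

# Rows of the example table: (X1, X2, X3, Y).
TABLE = [
    (1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
    (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0),
]

predictions = [perceptron(x1, x2, x3) for x1, x2, x3, _ in TABLE]
for (x1, x2, x3, y), predicted in zip(TABLE, predictions):
    print(x1, x2, x3, "expected", y, "predicted", predicted)
```

With these weights the neuron fires exactly when at least two of the three inputs are 1, which is the pattern in the table.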

TRAINING IN PERCEPTRONS
• How would you teach a child to recognize a bus?
• You show her examples, telling her, “This is a bus. That is not a bus,”
until the child learns the concept of what a bus is. Furthermore, if the
child sees new objects that she hasn’t seen before, we could expect
her to recognize correctly whether the new object is a bus or not.
• This is exactly the idea behind the perceptron.

• Input vectors from a training set are presented to the perceptron one
after the other and weights are modified according to the following
equation,
• For all inputs i,
W(i) = W(i) + a*g’(sum of all inputs)*(T-A)*P(i),
where g’ is the derivative of the activation function, and a is the
learning rate
• Here, W is the weight vector, P is the input vector, T is the correct
output that the perceptron should have produced, and A is the output
actually given by the perceptron.

• When an entire pass through all of the input training vectors is
completed without an error, the perceptron has learnt!
• At this time, if an input vector P (already in the training set) is given to
the perceptron, it will output the correct value. If P is not in the
training set, the network will respond with an output similar to other
training vectors close to P.
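The training procedure above can be sketched as follows. Several things here are assumptions rather than part of the slides: the activation is a step function, so g’ is taken as 1 (the step function has no useful derivative); the bias is handled as an extra weight on a constant input of 1; and the names `train_perceptron`, `learning_rate` and `max_epochs` are illustrative. The training set is the table from the Perceptron Example slide.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is positive."""
    return 1 if z > 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Train on (inputs, target) pairs; return (weights, bias, epochs_used).

    Assumption: g' is taken as 1, so the update is
    W(i) = W(i) + a * (T - A) * P(i).
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for inputs, target in samples:
            z = sum(w * p for w, p in zip(weights, inputs)) + bias
            delta = target - step(z)              # (T - A)
            if delta != 0:
                mistakes += 1
                weights = [w + learning_rate * delta * p
                           for w, p in zip(weights, inputs)]
                bias += learning_rate * delta     # bias input is always 1
        if mistakes == 0:                         # a full pass without error
            return weights, bias, epoch
    return weights, bias, None                    # no error-free pass found

# Training set from the Perceptron Example table: ((X1, X2, X3), Y).
SAMPLES = [
    ((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
    ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 1, 1), 1), ((0, 0, 0), 0),
]

weights, bias, epochs = train_perceptron(SAMPLES)
learned = [step(sum(w * p for w, p in zip(weights, x)) + bias) for x, _ in SAMPLES]
print("weights:", weights, "bias:", bias, "epochs:", epochs)
```

Training stops at the first pass through the data that produces no mistakes, which matches the stopping condition described on the slide.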

WHAT IS THE PERCEPTRON
ACTUALLY DOING?
• The perceptron computes a weighted sum of its inputs and separates
input vectors into 2 categories, those that cause it to fire and those that
don’t. That is, for two inputs it is drawing the line:
w1x1 + w2x2 = t,
where t is the threshold.

WHAT IS THE PERCEPTRON
ACTUALLY DOING?
• To make things a little simpler for training later, let’s make a small
readjustment to the above formula. Let’s move the threshold to the
other side and replace it with what’s known as the neuron’s bias,
b = -t. The perceptron then fires when:
w1x1 + w2x2 + b > 0

LIMITATION OF PERCEPTRONS
• Not every set of inputs can be divided by a line like this. Those that
can be are called linearly separable. If the vectors are not linearly
separable, learning will never reach a point where all vectors are
classified properly.
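A small experiment illustrates the limitation. The sketch below assumes a step activation and the simplified update W(i) = W(i) + a*(T-A)*P(i); the helper name `converges` and the epoch limit are illustrative. AND is linearly separable, while XOR is the usual example of a problem that is not.

```python
def converges(samples, max_epochs=1000, learning_rate=1.0):
    """Return True if some full pass over the samples makes no mistakes."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            delta = target - output
            if delta != 0:
                mistakes += 1
                weights[0] += learning_rate * delta * x1
                weights[1] += learning_rate * delta * x2
                bias += learning_rate * delta
        if mistakes == 0:
            return True
    return False

AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = converges(AND_SAMPLES)
xor_converged = converges(XOR_SAMPLES)
print("AND learned:", and_converged)
print("XOR learned:", xor_converged)
```

No single line separates the XOR outputs, so the loop for XOR never completes an error-free pass no matter how large `max_epochs` is.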

Multi-Layer Perceptrons (MLPs)
• To deal with non-linearly separable problems we can use non-monotonic
activation functions. More conveniently, we can instead extend the simple
Perceptron to a Multi-Layer Perceptron, which includes at least one hidden layer
of neurons with non-linear activation functions f(x) (such as sigmoids).
• Note that if the activation on the hidden layer were linear, the network would
be equivalent to a single layer network, and wouldn’t be able to cope with non-
linearly separable problems.
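The note about linear hidden layers can be verified numerically. In this sketch the weight matrices `W1` and `W2` and the input `x` are arbitrary illustrative values, and biases are omitted for brevity: two stacked linear layers give the same output as one layer whose weight matrix is the product of the two.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a]

W1 = [[1.0, -2.0], [0.5, 3.0]]   # input -> hidden weights (hypothetical)
W2 = [[2.0, 1.0]]                # hidden -> output weights (hypothetical)
x = [0.25, -1.5]

# Two layers with a linear (identity) activation on the hidden layer.
two_layer = matvec(W2, matvec(W1, x))

# One layer using the combined weight matrix W2 * W1.
single_layer = matvec(matmul(W2, W1), x)

print(two_layer, single_layer)
```

Because the two results coincide for every input, the hidden layer adds no expressive power unless its activation is non-linear.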

MULTI-LAYERED NEURAL NETWORKS
• Once a training sample is given as an input to the network, each
output node of the single-layered neural network (also
called a Perceptron) takes a weighted sum of all the inputs, passes
it through an activation function, and comes up with an output.
The weights are then corrected using the following equation,
For all inputs i,
W(i) = W(i) + a*g’(sum of all inputs)*(T-A)*P(i),
where a is the learning rate and g’ is the derivative of the activation
function.

• This process is repeated by feeding the whole training set several times until the
network responds with a correct output for all the samples. Such training
succeeds only when the inputs are linearly separable. This is where multi-layered
neural networks come into the picture.

• Each input from the input layer is fed to each node in the hidden
layer, and from there to each node in the output layer. We should
note that there can be any number of nodes per layer and there are
often multiple hidden layers to pass through before ultimately
reaching the output layer.
• But to train this network we need a learning algorithm which should
be able to tune not only the weights between the output layer and
the hidden layer but also the weights between the hidden layer and
the input layer.

Gradient Descent and
Back Propagation

Learning by Error Minimization
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights in such a way that the output of ANN is consistent
with class labels of training examples
• Error function:
$$E = \sum_i \left[ Y_i - f(w_i, X_i) \right]^2$$
• Find the weights w_i that minimize the above error function
• e.g., gradient descent, backpropagation algorithm
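As an illustration of learning by error minimization, the sketch below runs plain gradient descent on the sum-of-squares error. The model f is assumed to be a single linear unit f(w, x) = w0 + w1*x, and the data, learning rate and iteration count are made-up values chosen so the example converges; none of them come from the slides.

```python
def f(w, x):
    """Model output: a linear unit with weights w = (w0, w1)."""
    return w[0] + w[1] * x

def error(w, data):
    """E = sum_i (Y_i - f(w, X_i))^2."""
    return sum((y - f(w, x)) ** 2 for x, y in data)

def gradient(w, data):
    """Partial derivatives of E with respect to w0 and w1."""
    g0 = sum(-2 * (y - f(w, x)) for x, y in data)
    g1 = sum(-2 * (y - f(w, x)) * x for x, y in data)
    return g0, g1

# Hypothetical training examples generated from y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w = (0.0, 0.0)          # initial weights
rate = 0.01             # learning rate (assumed)
initial_error = error(w, data)

for _ in range(5000):
    g0, g1 = gradient(w, data)
    # Move each weight a small step against its gradient.
    w = (w[0] - rate * g0, w[1] - rate * g1)

final_error = error(w, data)
print("weights:", w, "error:", final_error)
```

Each step moves the weights downhill on the error surface; the learning rate has to be small enough for the steps not to overshoot.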

Practical Considerations for Gradient Descent Learning
There are a number of important practical/implementational considerations
that must be taken into account when training neural networks:
1. Do we need to pre-process the training data? If so, how?
2. How many hidden units do we need?
3. Are some activation functions better than others?
4. How do we choose the initial weights from which we start the training?
5. Should we have different learning rates for the different layers?
6. How do we choose the learning rates?
7. Do we change the weights after each training pattern, or after the whole set?
8. How do we avoid flat spots in the error function?
9. How do we avoid local minima in the error function?
10. When do we stop training?
In general, the answers to these questions are highly problem dependent.

BACK PROPAGATION (BACKWARD
PROPAGATION OF ERRORS)
• Backpropagation is a common method for training a neural network.
• The goal of backpropagation is to optimize the weights so that the
neural network can learn how to correctly map arbitrary inputs to
outputs.
• To tune the weights between the hidden layer and the input layer, we
need to know the error at the hidden layer, but we know the error only
at the output layer (we know the correct output from the training
sample and we also know the output predicted by the network).
• So, the method that was suggested was to take the errors at the output
layer and proportionally propagate them backwards to the hidden
layer.

BACK PROPAGATION
• For a particular neuron i in the output layer,
for all j { Wj,i = Wj,i + a*g’(sum of all inputs)*(T-A)*P(j) }
• This equation tunes the weights between the output layer and the
hidden layer.
• For a particular neuron j in the hidden layer, we propagate the error
backwards from the output layer, thus
Error = Wj,1 * E1 + Wj,2 * E2 + ...
summed over all the neurons in the output layer.

BACK PROPAGATION BY EXAMPLE
• Consider a neural network with two inputs, two hidden neurons, two
output neurons. Additionally, the hidden and output neurons will
include a bias.

In order to have some numbers to work with, here are the initial
weights, the biases, and training inputs/outputs: given inputs 0.05 and
0.10, we want the neural network to output 0.01 and 0.99.
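The two update rules can be put together for this 2-2-2 network. The inputs (0.05, 0.10) and targets (0.01, 0.99) are from the slide; the initial weights and biases appear only in a figure that is not reproduced in the text, so the values below are placeholder assumptions, as are the sigmoid activation, the learning rate of 0.5, the 1/2 factor in the error, and the choice to leave the biases fixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_output, b_output):
    """Return (hidden activations, output activations) for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b_hidden)
              for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b_output)
              for row in w_output]
    return hidden, output

def total_error(output, target):
    """Squared error over the output neurons (1/2 factor is an assumption)."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

def train_step(x, target, w_hidden, b_hidden, w_output, b_output, rate):
    """One backpropagation update of all weights; biases stay fixed."""
    hidden, output = forward(x, w_hidden, b_hidden, w_output, b_output)
    # Output layer: g'(sum) * (T - A), with g'(sum) = A * (1 - A) for a sigmoid.
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(output, target)]
    # Hidden layer: error propagated back through the (old) output weights.
    delta_hid = [h * (1 - h) * sum(w_output[k][j] * delta_out[k]
                                   for k in range(len(delta_out)))
                 for j, h in enumerate(hidden)]
    new_w_output = [[w + rate * delta_out[k] * hidden[j]
                     for j, w in enumerate(row)]
                    for k, row in enumerate(w_output)]
    new_w_hidden = [[w + rate * delta_hid[j] * x[i]
                     for i, w in enumerate(row)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_output

x, target = [0.05, 0.10], [0.01, 0.99]
# Placeholder initial values (assumed, not taken from the slide text).
w_hidden = [[0.15, 0.20], [0.25, 0.30]]
w_output = [[0.40, 0.45], [0.50, 0.55]]
b_hidden, b_output = 0.35, 0.60

initial_error = total_error(
    forward(x, w_hidden, b_hidden, w_output, b_output)[1], target)
for _ in range(10000):
    w_hidden, w_output = train_step(
        x, target, w_hidden, b_hidden, w_output, b_output, rate=0.5)
final_error = total_error(
    forward(x, w_hidden, b_hidden, w_output, b_output)[1], target)
print("error before:", initial_error, "after:", final_error)
```

Note that the hidden-layer deltas are computed from the output weights before those weights are updated, so both layers are adjusted using the same forward pass.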

ACTIVATION FUNCTION
• A function that transforms the values or states for the decision of the
output neuron is known as an activation function.
• What does an artificial neuron do? Simply, it calculates a “weighted
sum” of its input, adds a bias and then decides whether it should be
“fired” or not.
• So consider a neuron:
Y = Σ (weight * input) + bias

ACTIVATION FUNCTION
• The value of Y can be anything ranging from -inf to +inf.
• The neuron really doesn’t know the bounds of the value.
• So how do we decide whether the neuron should fire or not?
• We add “activation functions” for this purpose.
• An activation function checks the Y value produced by a neuron and
decides whether outside connections should consider this neuron as
“fired” or not. Or rather, let’s say, “activated” or not.

ACTIVATION FUNCTION
• If we do not apply an activation function, then the output signal would
simply be a linear function of the input. A linear function is just a
polynomial of degree one.
• Linear equations are easy to solve, but they are limited in their
complexity and have less power to learn complex functional mappings
from data.
• A neural network without an activation function would simply be
a linear regression model, which has limited power and does not
perform well most of the time.
• We want our Neural Network to not just learn and compute a linear
function but something more complicated than that.

ACTIVATION FUNCTION
• Also, without an activation function our neural network would not be
able to learn and model other complicated kinds of data such as
images, video, audio, and speech.
• That is why we use artificial neural network techniques such as deep
learning to make sense of complicated, high-dimensional, non-linear
big datasets, where the model has many hidden layers and a very
complicated architecture that helps us extract knowledge
from such datasets.

ACTIVATION FUNCTION
• Activation functions are really important for an Artificial Neural
Network to learn and make sense of something really complicated.
They introduce non-linear properties to our network.
• Their main purpose is to convert the input signal of a node in an ANN
to an output signal. That output signal is then used as an input in the
next layer in the stack.

WHY DO WE NEED NON-LINEARITIES?
• We need to apply an activation function f(x) to make the
network more powerful, so that it can learn something complex
from data and represent arbitrary non-linear functional mappings
between inputs and outputs.
• Hence, using a non-linear activation, we are able to generate
non-linear mappings from inputs to outputs.

TYPES OF ACTIVATION FUNCTIONS
Step function
• Activation function A = “activated” if Y > threshold, else not.
• Alternatively, A = 1 if Y > threshold, 0 otherwise.
• What we just did is a “step function”.
• DRAWBACK: Suppose you are creating a binary classifier, something which should say
“yes” or “no” (activate or not activate). A step function could do that for you! That’s exactly
what it does: say 1 or 0. Now think about the use case where you want multiple such
neurons to be connected to bring in more classes: class1, class2, class3, etc. What happens
if more than one neuron is “activated”? All of those neurons output a 1 (from the step function),
so which class do you decide on? That is hard to resolve.
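The drawback is easy to reproduce. In this sketch the threshold of 0 and the three weighted sums are made-up values for illustration; the point is only that a binary output cannot rank the neurons that fired.

```python
def step_activation(y, threshold=0.0):
    """A = 1 if Y > threshold, 0 otherwise."""
    return 1 if y > threshold else 0

# Hypothetical weighted sums (Y values) of three class neurons.
class_scores = {"class1": 2.3, "class2": 0.7, "class3": -1.1}

activations = {name: step_activation(y) for name, y in class_scores.items()}
fired = [name for name, a in activations.items() if a == 1]

print(activations)
print("neurons that fired:", fired)
```

Both class1 and class2 output 1, even though class1 had the much larger weighted sum, so the step function alone gives no basis for choosing between them.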

Linear function
• A = cx
• A straight-line function where the activation is proportional to the input (which is
the weighted sum from the neuron).
• This gives a range of activations, so it is not a binary activation. We
can connect a few neurons together and, if more than one fires, we
can take the max and decide based on that. So that is fine. Then what is
the problem with this?
• For A = cx, the derivative with respect to x is c. That means the gradient has no
relationship with x. It is a constant gradient, and the descent is going to be
on a constant gradient. If there is an error in prediction, the changes made by
back propagation are constant and do not depend on the change in input.

Sigmoid function
This looks smooth and “step function like”. What
are the benefits of this? It is nonlinear in nature.
Combinations of this function are also nonlinear!
Great, now we can stack layers. What about non-binary
activations? Yes, that too! It gives an
analog activation, unlike the step function. It has a
smooth gradient too.
And if you notice, between X values of -2 and 2 the curve is very steep. This means that any small change in
the value of X in that region will cause the value of Y to change significantly, so the function has a tendency
to bring the Y values to either end of the curve.

• Another advantage of this activation function is, unlike linear function, the
output of the activation function is always going to be in range (0,1)
compared to (-inf, inf) of linear function.
• So we have our activations bound in a range. It won’t blow up the
activations then. This is great.
• Sigmoid functions are one of the most widely used activation functions
today. Then what are the problems with this?
• If you notice, towards either end of the sigmoid function, the Y values
respond very little to changes in X. What does that mean? The gradient
in that region is going to be small. This gives rise to the problem of “vanishing
gradients”. So what happens when the activations reach the
near-horizontal part of the curve on either side?
• The network refuses to learn further or learns drastically slowly. There are ways
to work around this problem, and sigmoid is still very popular in
classification problems.
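The sigmoid and its gradient can be sketched directly. The formula A = 1 / (1 + e^-x) is the standard logistic sigmoid; it is stated here as an assumption because the slide shows the function only as a plot. The sample points are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: output always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

points = (0.0, 2.0, 5.0, 10.0)
gradients = [sigmoid_gradient(x) for x in points]

for x, g in zip(points, gradients):
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.6f}  gradient = {g:.6f}")
```

The gradient peaks at 0.25 when x = 0 and shrinks rapidly as x moves toward either end of the curve, which is the vanishing-gradient behaviour described above.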

Tanh Function
•Another activation function is the tanh function.
This looks very similar to sigmoid. In fact, it is a
scaled sigmoid function!

• This has characteristics similar to the sigmoid that we discussed.
• It is nonlinear in nature, so we can stack layers. It is bounded to the
range (-1, 1), so there is no worry of activations blowing up.
• One point to mention is that the gradient is stronger for tanh than for
sigmoid (its derivatives are steeper).
• Deciding between sigmoid and tanh will depend on your
requirement of gradient strength. Like sigmoid, tanh also has the
vanishing gradient problem.
• Tanh is also a very popular and widely used activation function,
especially for time series data.
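The “scaled sigmoid” relationship can be written as tanh(x) = 2 * sigmoid(2x) - 1, a standard identity. The sketch below checks it numerically and compares the two gradients at x = 0; the helper names and sample points are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scaled_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_difference = max(abs(math.tanh(x) - scaled_sigmoid(x)) for x in points)

print("largest difference between tanh and scaled sigmoid:", max_difference)
print("gradient at 0: tanh =", tanh_gradient(0.0),
      "sigmoid =", sigmoid_gradient(0.0))
```

At x = 0 the tanh gradient is 1 while the sigmoid gradient is 0.25, which is what “stronger gradient” refers to.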

ReLu
• Later comes the ReLu function,
A(x) = max(0, x)
It gives an output x if x is positive and 0 otherwise.

• At first look, this would seem to have the same problems as the linear
function, as it is linear on the positive axis.
• However, ReLu is nonlinear in nature, and combinations of ReLu are
also nonlinear.
• Any continuous function can be approximated with combinations of ReLu.
• Great, so this means we can stack layers. It is not bounded, though.
• The range of ReLu is [0, inf). This means it can blow up the activation.

• Another point to discuss here is the sparsity of the activation.
Imagine a big neural network with a lot of neurons. Using a sigmoid
or tanh will cause almost all neurons to fire in an analog way.
• Now imagine a network with randomly initialized (or normalized) weights
where almost 50% of the network yields 0 activation because of the
characteristic of ReLu (output 0 for negative values of x).
• This means fewer neurons are firing (sparse activation) and the
network is lighter. ReLu seems to be awesome! Yes it is, but nothing
is flawless, not even ReLu.

• Because of the horizontal line in ReLu (for negative X), the gradient
can go to 0.
• For activations in that region of ReLu, the gradient will be 0, because of
which the weights will not get adjusted during descent.
• That means the neurons which go into that state will stop
responding to variations in error/input (simply because the gradient is 0,
nothing changes). This is called the dying ReLu problem.
• This problem can cause several neurons to just die and not respond,
making a substantial part of the network passive.

• There are variations of ReLu that mitigate this issue by simply turning
the horizontal line into a non-horizontal component.
• For example, y = 0.01x for x < 0 makes it a slightly inclined line
rather than a horizontal line. This is leaky ReLu. There are other
variations too.
• The main idea is to let the gradient be non-zero so that the neuron can
eventually recover during training.
• ReLu is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. That is a good
point to consider when we are designing deep neural nets.
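ReLu, leaky ReLu and their gradients can be sketched in a few lines. The slope of 0.01 is the value from the slide; the function names are illustrative, and the gradient at exactly x = 0 is taken to be the negative-side value by convention, which is an assumption.

```python
def relu(x):
    """A(x) = max(0, x)."""
    return max(0.0, x)

def relu_gradient(x):
    """1 for positive x, 0 otherwise (this flat part causes dying ReLu)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    """x for positive x, slope * x otherwise."""
    return x if x > 0 else slope * x

def leaky_relu_gradient(x, slope=0.01):
    """1 for positive x, a small non-zero slope otherwise."""
    return 1.0 if x > 0 else slope

for x in (-3.0, -0.5, 2.5):
    print(x, relu(x), relu_gradient(x), leaky_relu(x), leaky_relu_gradient(x))
```

For negative inputs the ReLu gradient is exactly 0, so no weight update flows through that neuron, while the leaky variant keeps a small gradient of 0.01. Both need only a comparison and a multiplication, with no exponentials.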

NOW WHICH ONE DO WE USE?
• Does that mean we just use ReLu for everything we do? Or sigmoid or
tanh? Well, yes and no.
• When you know the function you are trying to approximate has certain
characteristics, you can choose an activation function which will
approximate the function faster leading to faster training process.
• For example, a sigmoid works well for a classifier, because approximating
a classifier function as combinations of sigmoids is easier than with, say,
ReLu. This leads to a faster training process and convergence.
• You can use your own custom functions too! If you don’t know the nature
of the function you are trying to learn, then maybe you can start with
ReLu, and then work backwards. ReLu works most of the time as a general
approximator!
