22PCOAM16
MACHINE LEARNING
UNIT II NOTES & QB
B.TECH
III YEAR – V SEM (R22)
(2025-2026)
Prepared
By
Dr. M.Gokilavani
Department of Emerging Technologies
(Special Batch)
GURU NANAK INSTITUTIONS TECHNICAL CAMPUS (AUTONOMOUS)
SCHOOL OF ENGINEERING & TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (ARTIFICIAL INTELLIGENCE & MACHINE
LEARNING)
COURSE STRUCTURE
(Applicable for the Batch admitted from 2022-2023)
MACHINE LEARNING
B.Tech. III Year I Sem. L T P C
3 0 0 3
Course Objectives:
 To introduce students to the basic concepts and techniques of Machine Learning.
 To have a thorough understanding of the Supervised and Unsupervised learning techniques
 To study the various probability-based learning techniques
Course Outcomes:
 Distinguish between supervised, unsupervised and semi-supervised learning
 Understand algorithms for building classifiers applied on datasets of non-linearly separable
classes
 Understand the principles of evolutionary computing algorithms
 Design an ensembler to increase the classification accuracy
UNIT - I
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron – Design a
Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task – Concept
Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the Candidate
Elimination Algorithm – Linear Discriminants: – Perceptron – Linear Separability – Linear Regression.
UNIT - II
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error – Multi-layer
Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial
Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and
Basis Functions – Support Vector Machines
UNIT - III
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression
Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine Classifiers – Basic
Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means
Algorithms
UNIT - IV
Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis – Factor
Analysis – Independent Component Analysis – Locally Linear Embedding – Isomap – Least Squares
Optimization
Evolutionary Learning – Genetic algorithms – Genetic Offspring: - Genetic Operators – Using Genetic
Algorithms
UNIT - V
Reinforcement Learning – Overview – Getting Lost Example
Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain Monte Carlo
– Graphical Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models –
TrackingMethods
TEXT BOOKS:
1. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", Second Edition,
Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
UNIT-II
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error – Multi-layer
Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial
Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and
Basis Functions – Support Vector Machines.
INTRODUCTION:
1.MULTI-LAYER PERCEPTRON:
 A multi-layer perceptron (MLP) is a class of feed-forward artificial neural networks.
 An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an
output layer.
 Except for the input nodes, each node is a neuron that uses a nonlinear activation function
(e.g., the sigmoid activation function).
 An MLP utilizes a supervised learning technique called back propagation for training.
 MLPs are useful in research for their ability to solve problems stochastically.
 Multi-Layer Perceptron (MLP) is an artificial neural network widely used for solving
classification and regression tasks.
Neuron:
• A neuron can have any number of inputs from one to n, where n is the total number of inputs.
• The inputs may be represented therefore as x1, x2, and x3… xn. And the corresponding
weights for the inputs as w1, w2, w3… wn.
• Output: a = x1w1 + x2w2 + x3w3 + ... + xnwn.
• This weighted sum is then passed through an activation function to produce the neuron's output; a bias weight w0, attached to an input fixed at -1, is usually included.
Fig: Structure of Neuron in MLP
Key Components of Multi-Layer Perceptron (MLP):
Fig: Multi-Layer Perceptron (MLP)
 Input Layer: Each neuron (or node) in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.
 Hidden Layers: An MLP can have any number of hidden layers, with each layer containing
any number of nodes. These layers process the information received from the input layer.
 Output Layer: The output layer generates the final prediction or result. If there are multiple
outputs, the output layer will have a corresponding number of neurons.
Every connection in the diagram is a representation of the fully connected nature of an MLP. This
means that every node in one layer connects to every node in the next layer. As the data moves
through the network, each layer transforms it until the final output is generated in the output layer.
Architecture of MLP:
Fig: Architecture of MLP
2. MULTI-LAYER PERCEPTRON IN PRACTICE:
Working of Multi-Layer Perceptrons:
 The key mechanisms are forward propagation, the loss function, back propagation, and
optimization.
Step 1: Forward Propagation
Step 2: Loss Function
Step 3: Back propagation
Step 4: Optimization
Step 1: Forward Propagation
 In forward propagation, the data flows from the input layer to the output layer, passing
through any hidden layers.
 Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z = Σi (wi · xi) + b
Where,
o xi is the input feature.
o wi is the corresponding weight.
o b is the bias term.
2. Activation Function: The weighted sum z is passed through an activation function to
introduce non-linearity. Common activation functions include the sigmoid, tanh, and ReLU functions (see the sketch below).
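The following is a minimal NumPy sketch of the weighted-sum and activation step described above; the input values, weights and bias are made-up numbers for illustration only.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # ReLU activation: passes positive values, zeroes out negatives
    return np.maximum(0.0, z)

# Illustrative values (not from the notes): 3 input features, one neuron
x = np.array([0.5, -1.2, 3.0])   # input features x_i
w = np.array([0.4, 0.1, -0.6])   # weights w_i
b = 0.2                          # bias term b

z = np.dot(w, x) + b             # weighted sum: z = sum_i(w_i * x_i) + b
print("z =", z, "sigmoid(z) =", sigmoid(z), "relu(z) =", relu(z))
```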
Step 2: Loss Function
 Once the network generates an output, the next step is to calculate the loss using a loss
function.
 In supervised learning, this compares the predicted output to the actual label.
 For a classification problem, the commonly used binary cross-entropy loss function is:
L = −(1/N) Σi [ yi · log(ŷi) + (1 − yi) · log(1 − ŷi) ]
Where,
yi is the actual label.
ŷi is the predicted label.
N is the number of samples.
 For regression problems, the mean squared error (MSE) is often used:
MSE = (1/N) Σi (yi − ŷi)²
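A small NumPy sketch of the two loss functions above; the labels and predictions are illustrative values, not results from these notes.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -(1/N) * sum[y*log(y_hat) + (1-y)*log(1-y_hat)]
    y_pred = np.clip(y_pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # MSE = (1/N) * sum((y - y_hat)^2)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative labels and predictions (not from the notes)
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
print("BCE:", binary_cross_entropy(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
```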
Step 3: Back Propagation
 The goal of training an MLP is to minimize the loss function by adjusting the network’s weights
and biases.
 This is achieved through back propagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight
and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss:
w = w − η · (∂L/∂w)
Where:
w is the weight.
η is the learning rate.
∂L/∂w is the gradient of the loss function with respect to the weight.
Step 4: Optimization
 MLPs rely on optimization algorithms to iteratively refine the weights and biases during
training. Popular optimization methods include:
1. Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or
a small batch of data:
w = w − η · (∂L/∂w)
2. Adam Optimizer: An extension of SGD that incorporates momentum and adaptive
learning rates for more efficient training:
mt = β1 · mt−1 + (1 − β1) · gt
vt = β2 · vt−1 + (1 − β2) · gt²
Here, gt represents the gradient at time t, and β1, β2 are decay rates.
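A minimal sketch of a single SGD update and a single Adam update, assuming the standard bias-corrected form of Adam; the weight and gradient values are made up for illustration.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # w <- w - eta * dL/dw
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m_t = beta1*m_{t-1} + (1-beta1)*g_t ; v_t = beta2*v_{t-1} + (1-beta2)*g_t^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative single update step (gradient values are made up)
w = np.array([0.5, -0.3])
grad = np.array([0.1, -0.2])
print("SGD :", sgd_step(w, grad))
w2, m, v = adam_step(w, grad, m=np.zeros(2), v=np.zeros(2), t=1)
print("Adam:", w2)
```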
3. TYPES OF PASS:
There are two types of Pass:
1. Forward pass or Going Forward
2. Backward pass or Going Backward
i. GOING FORWARDS:
 Compute the “functional signal”: feed-forward propagation of the input pattern signals through the
network.
 This means working out what the outputs are for the given inputs and the current weights.
 An MLP normally has two layers of nodes, but nothing stops us making one with 3, 4, or 20 layers of nodes.
 Recall (forward) algorithm through the network computing the activations of one layer of
neurons and using those as the inputs for the next layer.
 Then use these inputs and the first level of weights to calculate the activations of the hidden
layer, and then use those activations and the next set of weights to calculate the activations of
the output layer.
 Given the outputs of the network, we can compare them to the targets and compute the error.
o Biases: Need to include a bias input to each neuron.
 Give each neuron an extra input that is permanently set to -1, and adjust its weight as
part of the training.
 Thus, each neuron in the network (whether in a hidden layer or the output layer) has one extra input,
with a fixed value of -1.
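A minimal sketch of the recall (forward) pass just described, assuming sigmoid activations and the convention of a bias input fixed at -1; the layer sizes and weight values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs, v, w):
    # inputs: (n_features,); v: hidden weights (n_features+1, n_hidden)
    # w: output weights (n_hidden+1, n_outputs). The extra row is the bias
    # weight, paired with an input that is permanently set to -1.
    x = np.append(inputs, -1.0)              # add the fixed bias input
    hidden = sigmoid(x @ v)                  # activations of the hidden layer
    h = np.append(hidden, -1.0)              # bias input for the next layer
    outputs = sigmoid(h @ w)                 # activations of the output layer
    return hidden, outputs

# Illustrative network: 2 inputs, 3 hidden neurons, 1 output (values made up)
rng = np.random.default_rng(0)
v = rng.normal(scale=0.5, size=(3, 3))       # (2 inputs + bias) x 3 hidden
w = rng.normal(scale=0.5, size=(4, 1))       # (3 hidden + bias) x 1 output
hidden, out = forward(np.array([0.7, 0.1]), v, w)
print("hidden:", hidden, "output:", out)
```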
ii. GOING BACKWARDS: BACK PROPAGATION ERROR:
 Going backward also called Back Propagation of ERROR.
 Compute the “error signals”: propagate the error backwards through the network, starting at the output
units (where the error is the difference between the actual and desired output).
 Updating the weights according to the error, which is a function of the difference between the
outputs and the targets.
 In back-propagation of error, the errors are sent backwards through the network.
 It is a form of gradient descent.
Gradient Descent:
• Gradient Descent is known as one of the most commonly used optimization algorithms to train
machine learning models by means of minimizing errors between actual and expected results.
• Further, gradient descent is also used to train Neural Networks.
• In the Perceptron, we changed the weights so that the neurons produced their target outputs.
• Choose an error function for each output neuron k:
Ek = yk − tk
Where,
Ek is the error at output node k
yk is the actual output
tk is the target (expected) output
or, summed over all output nodes, E = Σk=1..N (yk − tk),
Where,
 N is the number of output nodes.
• However, suppose that we make two errors.
• In the first, the target is bigger than the output, while in the second the output is bigger
than the target.
• If these two errors are the same size, then adding them up could give 0, which means that
the error value suggests that no error was made.
• In the Perceptron we simply tried to make the error as small as possible; since there was only one set of weights in the network,
this was sufficient to train the network.
• Need to make all errors have the same sign.
• In few different ways, but the one that will turn out to be best is the sum-of-squares error
function, which calculates the difference between y and t for each node, squares them, and adds
them all together:
E = ½ Σk=1..N (yk − tk)²
Where,
E is the error
yk is the actual output
tk is the expected (target) output
Fig: Gradient Descent Graph
Activation Function for the Passes:
We will concentrate on networks with two layers of weights, but this is easily generalised to more layers.
The quantity a (also written u) is known as the activation, g is the activation function, and the biases are treated as extra weights on an input fixed at -1.
4. EXAMPLES OF USING THE MLP:
OVERVIEW:
Step 1: Import Required Modules and Load Dataset
• First, we import necessary libraries such as TensorFlow, Numpy, and Matplotlib for
visualizing the data. We also load the MNIST dataset.
Step 2: Load and Normalize Image Data
• Next, we normalize the image data by dividing by 255 (since pixel values range from 0 to
255), which helps in faster convergence during training.
Output:
Step 3: Visualizing Data
• To understand the data better, we plot the first 100 training samples, each representing a
digit.
Output:
Step 4: Building the Neural Network Model
Here, we build a Sequential neural network model. The model consists of:
 Flatten Layer: Reshapes 2D input (28×28 pixels) into a 1D array of 784 elements.
 Dense Layers: Fully connected layers with 256 and 128 neurons, both using the ReLU
activation function.
 Output Layer: The final layer with 10 neurons representing the 10 classes of digits (0-9)
with sigmoid activation.
Step 5: Compiling the Model
Once the model is defined, we compile it by specifying:
 Optimizer: Adam, for efficient weight updates.
 Loss Function: Sparse categorical cross entropy, which is suitable for multi-class
classification.
 Metrics: Accuracy, to evaluate model performance.
Step 6: Training the Model
• We train the model on the training data using 10 epochs and a batch size of 2000.
• We also use 20% of the training data for validation to monitor the model’s performance on
unseen data during training.
Output:
Step 7: Evaluating the Model
• After training, we evaluate the model on the test dataset to determine its performance.
Output:
We obtained an accuracy of about 92% by calling model.evaluate() on the test samples.
The model is learning effectively on the training set, but the validation accuracy and loss level off,
which might indicate that the model is starting to overfit (where it performs well on training data
but not as well on unseen data).
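One possible implementation of Steps 1 to 7 using tensorflow.keras, with the layer sizes, optimizer, loss, epochs, batch size and validation split taken from the descriptions above; the exact accuracy will vary between runs.

```python
# Step 1: import required modules and load the dataset
import matplotlib.pyplot as plt
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Step 2: normalise pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Step 3: visualise the first 100 training digits
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for ax, img in zip(axes.flat, x_train[:100]):
    ax.imshow(img, cmap="gray")
    ax.axis("off")
plt.show()

# Step 4: build the Sequential model (Flatten -> 256 -> 128 -> 10)
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    # The notes use a sigmoid output; softmax is the more usual choice here.
    keras.layers.Dense(10, activation="sigmoid"),
])

# Step 5: compile with Adam and sparse categorical cross-entropy
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Step 6: train for 10 epochs, batch size 2000, 20% validation split
history = model.fit(x_train, y_train, epochs=10, batch_size=2000,
                    validation_split=0.2)

# Step 7: evaluate on the unseen test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print("Test accuracy:", test_acc)
```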
Advantages of Multi-Layer Perceptron
 Versatility: MLPs can be applied to a variety of problems, both classification and regression.
 Non-linearity: Thanks to activation functions, MLPs can model complex, non-linear
relationships in data.
 Parallel Computation: With the help of GPUs, MLPs can be trained quickly by taking
advantage of parallel computing.
Disadvantages of Multi-Layer Perceptron
 Computationally Expensive: MLPs can be slow to train, especially on large datasets with
many layers.
 Prone to overfitting: Without proper regularization techniques, MLPs can overfit the training
data, leading to poor generalization.
 Sensitivity to Data Scaling: MLPs require properly normalized or scaled data for optimal
performance.
5. DERIVING BACK-PROPAGATION:
Overview:
• Back propagation is also known as "Backward Propagation of Errors" and it is a method used
to train neural network.
• Its goal is to reduce the difference between the model’s predicted output and the actual
output by adjusting the weights and biases in the network.
• Back propagation is a technique used in deep learning to train artificial neural networks
particularly feed-forward networks.
• It works iteratively to adjust weights and bias to minimize the cost function.
• In each epoch the model adapts these parameters reducing loss by following the error
gradient.
• Back propagation often uses optimization algorithms like gradient descent or stochastic
gradient descent.
• The algorithm computes the gradient using the chain rule from calculus allowing it to
effectively navigate complex layers in the neural network to minimize the cost function.
Fig: Back propagation Error Adjusting Weight by Biases
Working of Back Propagation Algorithm:
The Back propagation algorithm involves two main steps:
• Forward Pass
• Backward Pass.
i. FORWARD PASS:
• In forward pass the input data is fed into the input layer.
• These inputs combined with their respective weights are passed to hidden layers.
• For example in a network with two hidden layers (h1 and h2 as shown in Fig. (a)) the output
from h1 serves as the input to h2.
• Before applying an activation function, a bias is added to the weighted inputs.
• Each hidden layer applies an activation function like ReLU (Rectified Linear Unit) which
returns the input if it’s positive and zero otherwise.
• This adds non-linearity allowing the model to learn complex relationships in the data.
• Finally the outputs from the last hidden layer are passed to the output layer where an activation
function such as softmax converts the weighted outputs into probabilities for classification.
Example:
Assume the neurons use the sigmoid activation function for the forward and backward pass. The
target output is 0.5, and the learning rate is 1.
To find the outputs of y3, y4 and y5
In Forward Propagation,
Step 1: Initial Calculation
• The weighted sum at each node is calculated using:
aj = Σi (wi,j · xi)
Where,
• aj is the weighted sum of all the inputs and weights at node j
• wi,j represents the weight associated with the ith input to the jth neuron
• xi represents the value of the ith input
Step 2: Sigmoid Function
• The sigmoid function, σ(a) = 1 / (1 + e^(−a)), returns a value between 0 and 1, introducing non-linearity into the
model.
Step 3: Computing Outputs
• At h1 node,
• Once we have calculated the a1 value, we can proceed to find the y3 value:
• Similarly, find the values of y4 at h2 and y5 at O3.
• This gives the values of y3, y4 and y5.
Step 4: Error Calculation
The target output is 0.5 but we obtained 0.67. To calculate the error we can use the formula below:
Error = y_target − y5 = 0.5 − 0.67 = −0.17
Using this error value we will be back propagating.
ii. BACKWARD PROPAGATION:
• In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method for
error calculation is the Mean Squared Error (MSE) given by:
MSE = (Predicted Output − Actual Output)²
• Once the error is calculated the network adjusts weights using gradients which are computed with
the chain rule.
• These gradients indicate how much each weight and bias should be adjusted to minimize the error
in the next iteration.
• The backward pass continues layer by layer ensuring that the network learns and improves its
performance.
• The activation function through its derivative plays a crucial role in computing these gradients
during back propagation.
In Back propagation,
Step 1: Calculating Gradients
• The change in each weight is calculated as:
Δw = −η · (∂E/∂w), where η is the learning rate.
Step 2: Output Unit Error
• For a sigmoid output unit with squared error, the error term is δo = (y5 − y_target) · y5 · (1 − y5).
Step 3: Hidden Unit Error
• Each hidden unit's error term is its own derivative times the weighted sum of the output errors it feeds into: δh = yh · (1 − yh) · Σ (wh,o · δo).
Step 4: Weight Update
• Each weight is updated as w ← w − η · δ · (input to that weight).
FINAL OUTPUT:
• After updating the weights the forward pass is repeated yielding:
• y3=0.57
• y4=0.56
• y5=0.61
• Since y5 = 0.61 is still not the target output, the process of calculating the error and back
propagating continues until the desired output is reached.
• This process demonstrates how back propagation iteratively updates weights by minimizing
errors until the network accurately predicts the output.
Error = y_target − y5 = 0.5 − 0.61 = −0.11
• This process is said to be continued until the actual output is gained by the neural network.
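A small sketch that mirrors the structure of the worked example (two inputs, hidden nodes h1 and h2, one output, sigmoid activations, target 0.5, learning rate 1). Since the figure with the original inputs and weights is not reproduced here, the numbers below are assumed values, so the results will differ from those in the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative values (the original figure's inputs/weights are not shown here)
x = np.array([0.35, 0.9])                 # two inputs
V = np.array([[0.1, 0.4],                 # input -> hidden weights (2x2)
              [0.8, 0.6]])
w = np.array([0.3, 0.9])                  # hidden -> output weights
target, eta = 0.5, 1.0                    # target output and learning rate

# Forward pass
a_hidden = x @ V                          # weighted sums at h1, h2
y_hidden = sigmoid(a_hidden)              # outputs y3, y4
y_out = sigmoid(y_hidden @ w)             # output y5
print("output:", y_out, "error:", target - y_out)

# Backward pass (the derivative of the sigmoid is y*(1-y))
delta_out = (y_out - target) * y_out * (1 - y_out)
delta_hidden = y_hidden * (1 - y_hidden) * (w * delta_out)

# Weight updates: w <- w - eta * delta * (input to that weight)
w -= eta * delta_out * y_hidden
V -= eta * np.outer(x, delta_hidden)

print("updated hidden->output weights:", w)
print("updated input->hidden weights:\n", V)
```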
6. RADIAL BASIS FUNCTIONS AND SPLINES
i. RADIAL BASIS FUNCTION:
What is radial basis function neural networks?
• A radial basis function network is a type of artificial neural network that uses
supervised machine learning (ML) to function as a nonlinear classifier, typically built from Gaussian functions.
• Nonlinear classifiers are more expressive than simple linear classifiers that work directly on the lower-dimensional input
vectors.
What do you understand by radial basis function network?
• In the field of mathematical modelling, a radial basis function network is an artificial neural
network that uses radial basis functions as activation functions (Gaussian Function).
• The output of the network is a linear combination of radial basis functions of the inputs and
neuron parameters.
What is Kernel Function?
• Kernels play a fundamental role in transforming data into higher-dimensional spaces,
enabling algorithms to learn complex patterns and relationships.
• A kernel function is used to transform n-dimensional input into m-dimensional input,
• where m is much higher than n, while computing the dot product in the higher-dimensional space efficiently.
• The main idea of using a kernel is that a linear classifier or regression curve in the higher-dimensional space
corresponds to a non-linear classifier or regression curve in the original lower-dimensional space.
What are Radial Basis Functions?
• Radial Basis Functions (RBFs) are a special category of feed-forward neural
networks comprising three layers:
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer where RBF neurons process the data.
3. Output Layer: Produces the network’s predictions, suitable for classification or
regression tasks.
Need of Radial Basis Function:
• An MLP naturally separates the classes with hyperplanes in the input space.
• An RBF network instead separates the class distributions by localizing radial basis functions around cluster centres.
• Types of separating surface are:
1. Hyperplane: linearly separable
2. Hypersphere: spherically separable
3. Quadric: quadratically separable
Fig: Types of Separable Surface
What happens in the Hidden Layer?
• The patterns in the input space form clusters.
• If the centres of these clusters are known, then the distance from each cluster centre can be
measured.
• The most commonly used radial basis function is the Gaussian function.
• In an RBF network, r is the distance from the cluster centre (the Euclidean distance).
DIFFERENCE BETWEEN MLP AND RBF:
Types of Radial Basis Functions:
• There are several types of Radial Basis Functions (RBFs), each with its own characteristics
and mathematical formulations.
• Some common types include:
i. Gaussian Radial Basis Function
ii. Multiquadric Radial Basis Function
iii. Inverse Multiquadric Radial Basis Function
iv. Thin Plate Spline Radial Basis Function
v. Cubic Radial Basis Function
1. Gaussian Radial Basis Function: It has a bell-shaped curve and is often employed in various
applications due to its simplicity and effectiveness.
It is represented as: φ(r) = exp(−r² / (2σ²)), where σ controls the width of the bell.
2. Multiquadric Radial Basis Function: It provides a smooth interpolation and is commonly
used in applications like meshless methods and radial basis function interpolation.
It is defined as: φ(r) = √(r² + c²), where c is a shape parameter.
3. Inverse Multiquadric Radial Basis Function: This type of function is similar to the
Multiquadric RBF but takes its inverse, resulting in a different shape.
Here is the formula for this function: φ(r) = 1 / √(r² + c²).
4. Thin Plate Spline Radial Basis Function: The Thin Plate Spline RBF is defined as
φ(r) = r² log(r)
where r is the Euclidean distance between the input and the centre. This RBF is often used in
applications involving thin-plate splines, which are used for surface interpolation and
deformation.
5. Cubic Radial Basis Function: The Cubic RBF is defined as
φ(r) = r³
where r is the Euclidean distance. It has cubic polynomial behaviour and is sometimes used
in interpolation.
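A short NumPy sketch of the five radial basis functions listed above, evaluated at a few illustrative distances r; the shape parameters sigma and c are assumed values.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2*sigma^2))
    return np.exp(-r**2 / (2 * sigma**2))

def multiquadric(r, c=1.0):
    # phi(r) = sqrt(r^2 + c^2)
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = 1 / sqrt(r^2 + c^2)
    return 1.0 / np.sqrt(r**2 + c**2)

def thin_plate_spline(r):
    # phi(r) = r^2 * log(r), taken as 0 at r = 0
    return np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0)

def cubic(r):
    # phi(r) = r^3
    return r**3

r = np.linspace(0.0, 3.0, 7)   # illustrative distances from a centre
for name, phi in [("gaussian", gaussian), ("multiquadric", multiquadric),
                  ("inv. multiquadric", inverse_multiquadric),
                  ("thin plate", thin_plate_spline), ("cubic", cubic)]:
    print(name, np.round(phi(r), 3))
```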
ii. SPLINES:
• To overcome the disadvantages of linear and polynomial regression we introduced the
regression splines.
• In linear regression the dataset is treated as a single piece, but in spline regression we split
the dataset into many parts, each of which is called a bin.
• The points in which we divide the data are called knots and we use different methods in
different bins. These separate functions we use in the different bins are called piecewise step
functions.
• Splines are a way to fit a high-degree polynomial function by breaking it up into smaller
piecewise polynomial functions.
• For each polynomial, we fit a separate model and connect them all together.
• Linear regression fits only a straight line, so we turn to polynomial regression, but that can cause
the model to overfit.
• The need for a model that combines the good properties of both linear and polynomial
regression motivated spline regression.
• While this sounds complicated, by breaking the curve up into smaller polynomial sections, we
decrease the risk of overfitting.
How to break up a polynomial?
• Because a spline breaks up a polynomial into smaller pieces, we need to determine where to
break up the polynomial.
• The point where this division occurs is called a knot.
• In such an example, each point Px where two polynomial pieces meet represents a knot.
• The knots at the ends of the curves are known as boundary knots, while the knots within the
curve are known as internal knots.
TYPES OF SPLINES:
There are three types of Splines
• Cubic Splines
• Natural Splines
• Smoothing Splines
i. Cubic Splines: Cubic splines require that we connect the different polynomial
pieces smoothly.
• This means that the first and second derivatives of these functions must be continuous at the knots.
• A plot of a cubic spline shows how the first derivative remains a continuous
function.
ii. Natural Splines: Polynomial functions and other kinds of splines tend to have bad fits near the
ends of the functions.
• This variability can have huge consequences, particularly in forecasting.
• Natural splines resolve this issue by forcing the function to be linear after the
boundary knots.
iii. Smoothing Splines: Finally, consider the regularized version of a spline: the smoothing spline.
• The cost function is penalized if the variability of the coefficient is high.
• Below is a plot that shows a situation where smoothing splines are needed to get
an adequate model fit.
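A sketch of the three spline types using scipy.interpolate on synthetic noisy data: an interpolating cubic spline, a natural cubic spline (via bc_type='natural'), and a smoothing spline (UnivariateSpline with s > 0). The data and the smoothing parameter are illustrative choices.

```python
import numpy as np
from scipy.interpolate import CubicSpline, UnivariateSpline

# Synthetic noisy samples of a smooth curve (illustrative data)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 15)
y = np.sin(x) + rng.normal(scale=0.15, size=x.size)

# Cubic spline: piecewise cubics with continuous 1st and 2nd derivatives
cubic = CubicSpline(x, y)

# Natural spline: second derivative forced to zero at the boundary knots,
# so the fit behaves linearly beyond them
natural = CubicSpline(x, y, bc_type="natural")

# Smoothing spline: the parameter s penalises wiggly fits instead of
# interpolating every (noisy) point exactly
smoothing = UnivariateSpline(x, y, s=0.5)

xs = np.linspace(0, 10, 5)
print("cubic    :", np.round(cubic(xs), 3))
print("natural  :", np.round(natural(xs), 3))
print("smoothing:", np.round(smoothing(xs), 3))
```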
7. RBF NETWORK:
• Radial Basis Function (RBF) Neural Networks are a specialized type of Artificial Neural
Network (ANN) used primarily for function approximation tasks.
• A radial basis function is defined as a mathematical function that takes a real-valued input and
produces a real-valued output determined by the distance between the input and a fixed
point (the centre) in the input space.
• Known for their distinct three-layer architecture and universal approximation capabilities, RBF
Networks offer faster learning speeds and efficient performance in classification and
regression problems.
Working of RBF networks
• RBF Networks are conceptually similar to K-Nearest Neighbour (k-NN) models, though
their implementation is distinct.
• The fundamental idea is that an item's predicted target value is influenced by nearby items
with similar predictor variable values. Here’s how RBF Networks operate:
1. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector from the
training set. The network computes the Euclidean distance between the input vector and
each neuron's centre.
3. Activation Function: The Euclidean distance is transformed using a Radial Basis
Function (typically a Gaussian function) to compute the neuron’s activation value. This
value decreases exponentially as the distance increases.
4. Output Nodes: Each output node calculates a score based on a weighted sum of the
activation values from all RBF neurons. For classification, the category with the highest
score is chosen.
Key Characteristics of RBFs
 Radial Basis Functions: These are real-valued functions dependent solely on the
distance from a central point. The Gaussian function is the most commonly used type.
 Dimensionality: The network's dimensions correspond to the number of predictor
variables.
 Centre and Radius: Each RBF neuron has a centre and a radius (spread). The radius
affects how broadly each neuron influences the input space.
Architecture of RBF Networks:
Fig: Architecture of RBF Network
Training Process of a radial basis function neural network
• An RBF neural network is trained in three stages: choosing the centres, determining
the spread parameters, and training the output weights.
Step 1: Selecting the Centers
 Techniques for Centre Selection: Centres can be picked at random from the training
set of data or by applying techniques such as k-means clustering.
 K-Means Clustering: This widely used centre-selection technique groups the input data
into k clusters, and the centres of these clusters are employed as the centres for the
RBF neurons.
Step 2: Determining the Spread Parameters
 The spread parameter (σ) governs each RBF neuron's area of effect and establishes the
width of the RBF.
 Calculation: The spread parameter can be manually adjusted for each neuron or set as a
constant for all neurons. Setting σ based on the separation between the centres is a
popular method, frequently using a heuristic such as σ = dmax / √(2K): the
greatest distance between centres divided by the square root of twice the number of centres.
Step 3: Training the Output Weights
 Linear Regression: The objective of linear regression techniques, which are commonly
used to estimate the output layer weights, is to minimize the error between the anticipated
output and the actual target values.
 Pseudo-Inverse Method: One popular technique for determining the weights is to use
the pseudo-inverse of the hidden-layer output matrix.
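A minimal sketch of the three training stages described above (k-means centres, heuristic spread, pseudo-inverse output weights), using NumPy and scikit-learn's KMeans on a synthetic 1-D regression problem; all dataset details are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Step 1: choose the centres with k-means clustering
K = 10
centres = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: spread parameter sigma = d_max / sqrt(2K)
d_max = np.max(np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1))
sigma = d_max / np.sqrt(2 * K)

# Hidden-layer design matrix of Gaussian activations
def hidden_activations(X):
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
    return np.exp(-dists**2 / (2 * sigma**2))

# Step 3: output weights from the pseudo-inverse of the hidden-layer matrix
H = hidden_activations(X)
weights = np.linalg.pinv(H) @ y

# Prediction on a few new points
X_new = np.array([[1.0], [5.0], [9.0]])
print(hidden_activations(X_new) @ weights)
```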
Advantages of RBF networks:
1. Universal Approximation: RBF Networks can approximate any continuous function
with arbitrary accuracy given enough neurons.
2. Faster Learning: The training process is generally faster compared to other neural
network architectures.
3. Simple Architecture: The straightforward, three-layer architecture makes RBF
Networks easier to implement and understand.
Applications of RBF Networks:
 Classification: RBF Networks are used in pattern recognition and classification tasks,
such as speech recognition and image classification.
 Regression: These networks can model complex relationships in data for prediction
tasks.
 Function Approximation: RBF Networks are effective in approximating non-linear
functions.
8. CURSE OF DIMENSIONALITY:
• Curse of Dimensionality in Machine Learning arises when working with high-dimensional
data, leading to increased computational complexity, over fitting, and spurious correlations.
• The Curse of Dimensionality refers to the phenomenon where the efficiency and effectiveness of
algorithms deteriorate as the dimensionality of the data increases, because the amount of data needed to sample the space grows exponentially.
• In high-dimensional spaces, data points become sparse, making it challenging to discern
meaningful patterns or relationships due to the vast amount of data required to adequately
sample the space.
Dimensionality Reduction Techniques:
i. Feature Selection: Identify and select the most relevant features from the original dataset while
discarding irrelevant or redundant ones.
• This reduces the dimensionality of the data, simplifying the model and improving its
efficiency.
ii. Feature Extraction: Transform the original high-dimensional data into a lower-dimensional
space by creating new features that capture the essential information.
• Techniques such as
• Principal Component Analysis (PCA) and
• T-distributed Stochastic Neighbor Embedding (t-SNE) are commonly
used for feature extraction.
Implementation:
• Step 1: Import Necessary Libraries
• Step 2: Loading the dataset
• Step 3: Remove Constant Features
• Step 4: Splitting the data and standardizing
• Step 5: Feature Selection and Dimensionality Reduction
• Step 6: Training the classifiers (a sketch of these steps follows below)
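One possible sketch of Steps 1 to 6 using scikit-learn on a synthetic dataset; the dataset, the classifier choice (logistic regression) and the number of PCA components are assumptions, not prescribed by the notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Steps 1-2: libraries imported above; generate a synthetic high-dimensional dataset
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)

# Step 3: remove constant (zero-variance) features
X = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 4: split the data and standardise it
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Step 5: feature extraction / dimensionality reduction with PCA
pca = PCA(n_components=10).fit(X_tr)
X_tr_p, X_te_p = pca.transform(X_tr), pca.transform(X_te)

# Step 6: train a classifier on the reduced data
clf = LogisticRegression(max_iter=1000).fit(X_tr_p, y_tr)
print("Test accuracy with 10 components:", clf.score(X_te_p, y_te))
```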
Solution to Curse of Dimensionality:
• One of the ways to reduce the impact of high dimensions is to use a different measure of
distance in a space vector.
• One could explore the use of cosine similarity in place of Euclidean distance; cosine similarity
is less affected by high dimensionality (a small example follows below).
• However, the usefulness of such a measure also depends on the specific problem being solved.
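A small sketch comparing Euclidean distance and cosine similarity on two random high-dimensional vectors; the data are synthetic.

```python
import numpy as np

def cosine_similarity(a, b):
    # Depends on the angle between the vectors, not on their absolute distance
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
a, b = rng.normal(size=1000), rng.normal(size=1000)   # high-dimensional points
print("Euclidean distance :", np.linalg.norm(a - b))
print("Cosine similarity  :", cosine_similarity(a, b))
```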
9. INTERPOLATIONS AND BASIS FUNCTIONS:
• Interpolation is a method of creating new data points within the range of known data points.
• The interpolated curve is created by estimating, for any x between two known points, the corresponding y value
on the line (or curve) joining those points.
• For two known points (x1, y1) and (x2, y2), the linear interpolation formula is:
y = y1 + (x − x1) · (y2 − y1) / (x2 − x1)
Types of Interpolations:
• Linear
• Multivariate
• Nearest Neighbor
• Polynomial
• Spline
i. Interpolation in Linear Form:
• Linear interpolation creates a continuous function out of discrete data.
• It’s a foundational building block for the gradient descent algorithm, which is used in the
training of just about every machine learning technique.
• Interpolation of a data set Linear interpolation on a set of data points (x0, y0), (x1, y1), ...,
(xn, yn) is defined as the concatenation of linear interpolants between each pair of data
points.
• This results in a continuous curve.
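A short sketch of linear interpolation with NumPy's np.interp, together with the two-point formula given earlier; the data points are illustrative.

```python
import numpy as np

# Known data points (illustrative)
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([0.0, 2.0, 1.0, 3.0])

# Interpolate at new x values lying between the known points
x_new = np.array([0.5, 1.5, 2.5])
print(np.interp(x_new, x_known, y_known))     # -> [1.0, 1.5, 2.0]

# The same value by the two-point formula: y = y1 + (x - x1)*(y2 - y1)/(x2 - x1)
x, (x1, y1), (x2, y2) = 0.5, (0.0, 0.0), (1.0, 2.0)
print(y1 + (x - x1) * (y2 - y1) / (x2 - x1))  # -> 1.0
```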
ii. Multivariate Interpolation:
• In numerical analysis, multivariate interpolation is interpolation on functions of more than one
variable; when the variables are spatial coordinates, it is also known as spatial interpolation.
• The function to be interpolated is known at given points (xi,yi,zi,….) and the interpolation
problem consists of yielding values at arbitrary points(x,y,z,…).
iii. Nearest Neighbour Interpolation:
• Nearest-neighbor interpolation is a simple method of multivariate interpolation in one or
more dimensions.
• Interpolation is the problem of approximating the value of a function for a non-given point in
some space when given the value of that function in points around (neighboring) that point.
• The nearest neighbor algorithm selects the value of the nearest point and does not consider the
values of neighboring points at all, yielding a piecewise-constant interpolant.
• The algorithm is very simple to implement and is commonly used in real-time 3D rendering
to select color values for a textured surface.
iv. Polynomial Interpolation:
• Polynomial interpolation is a method of estimating values between known data points.
• When graphical data contains a gap, but data is available on either side of the gap or at a few
specific points within the gap, an estimate of values within the gap can be made by
interpolation.
v. Spline Interpolation:
• Spline interpolation is a method of interpolation where the interpolating function is a
piecewise-defined polynomial called a spline.
• Unlike polynomial interpolation, which uses a single polynomial to fit all the data points,
spline interpolation divides the data into smaller segments and fits a separate polynomial to
each segment.
• This approach results in a smoother interpolating function that can better capture the local
behavior of the data.
• The most common type of spline interpolation is cubic spline interpolation, which uses cubic
polynomials for each segment and ensures continuity of the first and second derivatives at
the endpoints of each segment.
• Spline interpolation is particularly useful for smoothing noisy data or interpolating functions
with complex shapes.
Applications of Interpolation:
• Image Processing
• Computer Graphics
• Numerical Analysis
• Signal Processing
• Mathematical Modeling
• Geographic Information Systems (GIS)
• Audio Processing
10. SUPPORT VECTOR MACHINES:
• Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems.
• Primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future.
• This best decision boundary is called a hyper plane.
• SVM chooses the extreme points/vectors that help in creating the hyper plane.
• These extreme cases are called support vectors, and hence the algorithm is termed the Support
Vector Machine.
• Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyper plane:
Working of SVM algorithm:
• The key idea behind the SVM algorithm is to find the hyper plane that best separates two
classes by maximizing the margin between them.
• This margin is the distance from the hyper plane to the nearest data points (support vectors) on
each side.
• The best hyper plane, also known as the “hard margin,” is the one that maximizes the distance
between the hyper plane and the nearest data points from both classes.
• This ensures a clear separation between the classes. So, in the figure, we choose L2
as the hard-margin hyperplane.
• The SVM algorithm has the characteristics to ignore the outlier and finds the best hyper plane
that maximizes the margin.
• SVM is robust to outliers.
• A soft margin allows for some misclassifications or violations of the margin to improve
generalization.
• The SVM optimizes the following equation to balance margin maximization and penalty
minimization:
Objective Function = (1 / margin) + λ · Σ penalty
How does SVM classify the data?
• The penalty used for violations is often the hinge loss, L = max(0, 1 − y·f(x)), which has the following behavior
(a small numerical sketch follows below):
o If a data point is correctly classified and lies outside the margin, there is no penalty (loss =
0).
o If a point is incorrectly classified or violates the margin, the hinge loss increases
proportionally to the distance of the violation.
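A small numerical sketch of the hinge loss behaviour described above, assuming labels y in {-1, +1} and made-up decision scores f(x).

```python
import numpy as np

def hinge_loss(y, score):
    # loss = max(0, 1 - y * f(x)): zero when the point is correctly
    # classified and beyond the margin, growing linearly with violations
    return np.maximum(0.0, 1.0 - y * score)

y      = np.array([+1, +1, -1, -1])       # true labels
scores = np.array([2.0, 0.3, -1.5, 0.8])  # signed scores f(x) (illustrative)
print(hinge_loss(y, scores))              # -> [0.  0.7 0.  1.8]
```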
Types of SVM:
There are two types of SVM:
• Linear SVM
• Non Linear SVM
i. Linear SVM: When the data is perfectly linearly separable only then we can use Linear SVM.
Perfectly linearly separable means that the data points can be classified into 2 classes by using
a single straight line (if 2D).
Fig: Linear SVM
ii. Non-Linear SVM: When the data is not linearly separable we use a Non-Linear SVM: when the data points
cannot be separated into 2 classes by a straight line (if 2D), we use advanced techniques
such as kernel tricks to classify them.
 In most real-world applications we do not find linearly separable data points hence we use
kernel trick to solve them.
Fig: Non-Linear SVM
What to do if data are not linearly separable?
• When data is not linearly separable (i.e., it can’t be divided by a straight line), SVM uses a
technique called kernels to map the data into a higher-dimensional space where it becomes
separable.
• This transformation helps SVM find a decision boundary even for non-linear data.
Kernel in SVM:
• A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space.
• This allows SVM to work efficiently with non-linear data by implicitly performing the mapping.
• For example, consider data points that are not linearly separable.
• By applying a kernel function, SVM transforms the data points into a higher-dimensional space
where they become linearly separable.
Types of Kernel in SVM:
• Linear Kernel: For linear separability.
• Polynomial Kernel: Maps data into a polynomial space.
• Radial Basis Function (RBF) Kernel: Transforms data into a space based on distances between
data points.
Why do we need to use support vector machines?
• SVMs are used in applications like handwriting recognition, intrusion detection, face detection,
email classification, gene classification, and in web pages.
• This is one of the reasons we use SVMs in machine learning.
• It can handle both classification and regression on linear and non- linear data.
What is generalization error in terms of the SVM?
• Generalization error is generally the out-of-sample error which is the measure of how accurately
a model can predict values for previously unseen data.
Why SVM is an example of a large margin classifier?
• SVM is a type of classifier which classifies positive and negative examples, here blue and red
data points.
• As shown in the image, the largest margin is found in order to avoid overfitting, i.e., the optimal
hyperplane is at the maximum distance from the positive and negative examples (equidistant
from the boundary lines).
• To satisfy this constraint, and also to classify the data points accurately, the margin is maximized,
that is why this is called the large margin classifier.
Mathematical Computation: SVM
 Consider a binary classification problem with two classes, labelled as +1 and -1.
 We have a training dataset consisting of input feature vectors X and their corresponding class
labels Y.
 The equation for the linear hyperplane can be written as:
wᵀx + b = 0
Where:
 w is the normal vector to the hyperplane (the direction perpendicular to it).
 b is the offset or bias term, representing the distance of the hyperplane from the origin along
the normal vector w.
STEP 1: Distance from a Data Point to the Hyperplane
 The distance between a data point xᵢ and the decision boundary can be calculated as:
dᵢ = (wᵀxᵢ + b) / ||w||
 Where ||w|| represents the Euclidean norm of the weight (normal) vector w.
STEP 2: Linear SVM Classifier
 The class is predicted from the sign of the distance to the hyperplane:
ŷ = 1 if wᵀx + b ≥ 0, and ŷ = 0 otherwise,
where ŷ is the predicted label of a data point.
STEP 3: Optimization Problem for SVM
 For a linearly separable dataset, the goal is to find the hyper plane that maximizes the margin
between the two classes while ensuring that all data points are correctly classified.
 This leads to the following optimization problem:
minimize over w, b: (1/2) ||w||²
Subject to the constraint:
yᵢ (wᵀxᵢ + b) ≥ 1 for i = 1, …, m
Where:
 yᵢ is the class label (+1 or -1) for each training instance.
 xᵢ is the feature vector for the i-th training instance.
 m is the total number of training instances.
STEP 4: Soft Margin Linear SVM Classifier
 In the presence of outliers or non-separable data, the SVM allows some misclassification by
introducing slack variables ζᵢ. The optimization problem is modified as:
minimize over w, b, ζ: (1/2) ||w||² + C Σᵢ ζᵢ
Subject to the constraints:
yᵢ (wᵀxᵢ + b) ≥ 1 − ζᵢ and ζᵢ ≥ 0 for i = 1, …, m
Where,
 C is a regularization parameter that controls the trade-off between margin maximization and
the penalty for misclassifications.
 ζᵢ are slack variables that represent the degree of violation of the margin by each data point.
Step 5: Dual Problem for SVM
 The dual problem involves maximizing the Lagrange multipliers associated with the support
vectors. This transformation allows solving the SVM optimization using kernel functions for
non-linear classification.
Implementation of SVM:
• Predict whether a cancer is benign or malignant.
• Using historical data about patients diagnosed with cancer, together with the given independent attributes,
enables doctors to differentiate malignant cases from benign ones (a sketch of the steps below follows the list).
• Load the breast cancer dataset from sklearn.datasets
• Separate input features and target variables.
• Build and train the SVM classifiers using RBF kernel.
• Plot the scatter plot of the input features.
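One possible implementation of the listed steps with scikit-learn; the kernel parameters (C, gamma), the train/test split, and the choice of the first two features for the scatter plot are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the breast cancer dataset and separate features/target
data = load_breast_cancer()
X, y = data.data, data.target            # y: 0 = malignant, 1 = benign

# Train/test split and scaling (SVMs are sensitive to feature scale)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Build and train an SVM classifier with the RBF kernel
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))

# Scatter plot of the first two input features, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title("Breast cancer dataset (first two features)")
plt.show()
```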
Advantages of SVM:
• Effective on datasets with multiple features, like financial or medical data.
• Effective in cases where number of features is greater than the number of data points.
• Uses a subset of training points in the decision function called support vectors which makes it
memory efficient.
• Different kernel functions can be specified for the decision function.
Disadvantages of SVM:
• If the number of features is a lot bigger than the number of data points, avoiding over-fitting
when choosing kernel functions and regularization term is crucial.
• SVMs don't directly provide probability estimates. Those are calculated using an expensive
five-fold cross-validation.
• Works best on small sample sets because of its high training time.
Cross-Validation:
• Cross-validation is a resampling method that uses different portions of the data to test and
train a model on different iterations.
• It is mainly used in settings where the goal is prediction, and one wants to estimate how
accurately a predictive model will perform in practice.
22PCOAM16 _ML_ Unit 2 Full unit notes.pdf

  • 1.
    22PCOAM16 MACHINE LEARNING UNIT IINOTES & QB B.TECH III YEAR – V SEM (R22) (2025-2026) Prepared By Dr. M.Gokilavani Department of Emerging Technologies (Special Batch)
  • 2.
    GURU NANAK INSTITUTIONSTECHNICAL CAMPUS (AUTONOMOUS) SCHOOL OF ENGINEERING & TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (ARTIFICIAL INTELLIGENCE & MACHINE LEARNING) COURSE STRUCTURE (Applicable for the Batch admitted from 2022-2023) MACHINE LEARNING B.Tech. III Year I Sem. L T P C 3 0 0 3 Course Objectives:  To introduce students to the basic concepts and techniques of Machine Learning.  To have a thorough understanding of the Supervised and Unsupervised learning techniques  To study the various probability-based learning techniques Course Outcomes:  Distinguish between, supervised, unsupervised and semi-supervised learning  Understand algorithms for building classifiers applied on datasets of non-linearly separable classes  Understand the principles of evolutionary computing algorithms  Design an ensembler to increase the classification accuracy UNIT - I Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron – Design a Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task – Concept Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the Candidate Elimination Algorithm – Linear Discriminants: – Perceptron – Linear Separability – Linear Regression. UNIT - II Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error – Multi-layer Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and Basis Functions – Support Vector Machines UNIT - III Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine Classifiers – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means Algorithms UNIT - IV Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis – Factor Analysis – Independent Component Analysis – Locally Linear Embedding – Isomap – Least Squares Optimization
  • 3.
    Evolutionary Learning –Genetic algorithms – Genetic Offspring: - Genetic Operators – Using Genetic Algorithms UNIT - V Reinforcement Learning – Overview – Getting Lost Example Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain Monte Carlo – Graphical Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models – TrackingMethods TEXT BOOKS: 1. Stephen Marsland, ―Machine Learning – An Algorithmic Perspective, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014. REFERENCE BOOKS: 1. Flutter for Beginners: An introductory guide to building cross-platform mobile applications with Flutter and Dart 2, Packt Publishing Limited. 2. Rap Payne, Beginning App Development with Flutter: Create Cross-Platform Mobile Apps, 1st edition, Apress. 3. Frank Zammetti, Practical Flutter: Improve your Mobile Development with Google’s Latest Open-Source SDK, 1st edition, Apress.
  • 4.
    UNIT-II Multi-layer Perceptron– GoingForwards – Going Backwards: Back Propagation Error – Multi-layer Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and Basis Functions – Support Vector Machines. INTRODUCTION: 1.MULTI-LAYER PERCEPTRON:  Multi-layer perceptron MLP is a class of feed forward artificial neural networks.  An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer.  Except for the input nodes, each node is a neuron that uses a nonlinear activation function (Sigmoid Activation Function).  MLP utilizes a supervised learning technique called back propagation for training.  MLP useful in research for their ability to solve problems stochastically.  Multi-Layer Perceptron (MLP) is an artificial neural network widely used for solving classification and regression tasks. Neuron: • A neuron can have any number of inputs from one to n, where n is the total number of inputs. • The inputs may be represented therefore as x1, x2, and x3… xn. And the corresponding weights for the inputs as w1, w2, w3… wn. • Output a = x1w1+x2w2+x3w3... +xnwn. • Process output Activation function weight w0 -1 +1. Fig: Structure of Neuron in MLP Key Components of Multi-Layer Perceptron (MLP): Fig: Multi-Layer Perceptron (MLP)
  • 5.
     Input Layer:Each neuron (or node) in this layer corresponds to an input feature. For instance, if you have three input features, the input layer will have three neurons.  Hidden Layers: An MLP can have any number of hidden layers, with each layer containing any number of nodes. These layers process the information received from the input layer.  Output Layer: The output layer generates the final prediction or result. If there are multiple outputs, the output layer will have a corresponding number of neurons. Every connection in the diagram is a representation of the fully connected nature of an MLP. This means that every node in one layer connects to every node in the next layer. As the data moves through the network, each layer transforms it until the final output is generated in the output layer. Architecture of MLP: Fig: Architecture of MLP 2. MULTI-LAYER PERCEPTRON IN PRACTICE: Working of Multi-Layer Perceptron’s:  The key mechanisms such as forward propagation, loss function, back propagation, and optimization. Step 1: Forward Propagation Step 2: Loss Function Step 3: Back propagation Step 4: Optimization Step 1: Forward Propagation  In forward propagation, the data flows from the input layer to the output layer, passing through any hidden layers.  Each neuron in the hidden layers processes the input as follows: 1. Weighted Sum: The neuron computes the weighted sum of the inputs: Where, o xi is the input feature. o wi is the corresponding weight. o b is the bias term. 2. Activation Function: The weighted sum z is passed through an activation function to introduce non-linearity. Common activation functions include:
  • 6.
    Step 2: LossFunction  Once the network generates an output, the next step is to calculate the loss using a loss function.  In supervised learning, this compares the predicted output to the actual label.  For a classification problem, the commonly used binary cross-entropy loss function is: Where, Yi is the actual label. y^i is the predicted label. N is the number of samples.  For regression problems, the mean squared error (MSE) is often used: Step 3: Back Propagation  The goal of training an MLP is to minimize the loss function by adjusting the network’s weights and biases.  This is achieved through back propagation: 1. Gradient Calculation: The gradients of the loss function with respect to each weight and bias are calculated using the chain rule of calculus. 2. Error Propagation: The error is propagated back through the network, layer by layer. 3. Gradient Descent: The network updates the weights and biases by moving in the opposite direction of the gradient to reduce the loss: w=w–η⋅∂w∂L Where: w is the weight. η is the learning rate. ∂L∂w is the gradient of the loss function with respect to the weight. Step 4: Optimization  MLPs rely on optimization algorithms to iteratively refine the weights and biases during training. Popular optimization methods include: 1. Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or a small batch of data: w=w–η⋅∂w∂L 2. Adam Optimizer: An extension of SGD that incorporates momentum and adaptive learning rates for more efficient training: mt= β1mt−1+ (1–β1) ⋅gt vt = β2vt−1+(1–β2)⋅gt2 Here, gt represents the gradient at time t, and β1, β2are decay rates. 3. TYPES OF PASS: There are two types of Pass: 1. Forward pass or Going Forward 2. Backward pass or Going Backward i. GOING FORWARDS:  Compute “Functional Signal", Feed forward Propagation of input pattern signals through network.  Working out what the outputs are for the given inputs and the current weights.  MLP with two layers of nodes, make one with 3, or 4, or 20 layers of nodes.
  • 7.
     Recall (forward)algorithm through the network computing the activations of one layer of neurons and using those as the inputs for the next layer.  Then use these inputs and the first level of weights to calculate the activations of the hidden layer, and then use those activations and the next set of weights to calculate the activations of the output layer.  The outputs of the network, can compare them to the targets and compute the error. o Biases: Need to include a bias input to each neuron.  Give an extra input that is permanently set to -1, and adjusting the weights to each neuron as part of the training.  Thus, each neuron in the network (whether it is a hidden layer or the output) has 1 extra input, with fixed value. ii. GOING BACKWARDS: BACK PROPAGATION ERROR:  Going backward also called Back Propagation of ERROR.  Compute, “Error Signals”, Propagates the error backwards through network staring at output unit (where the error is the difference between actual and desire Output.  Updating the weights according to the error, which is a function of the difference between the outputs and the targets.  In back-propagation of error, the errors are sent backwards through the network.  It is a form of gradient descent. Gradient Descent: • Gradient Descent is known as one of the most commonly used optimization algorithms to train machine learning models by means of minimizing errors between actual and expected results. • Further, gradient descent is also used to train Neural Networks. • In the Perceptron, changed the weights so that the neurons targets achieved. • Choose an error function for each neuron K: Ek = YK − tk (or) Where, K is Error EK is Actual Output YK is Expected Output Where,  N is the number of output nodes. • However, suppose that we make two errors. • In the first, the target is bigger than the output, while in the second the output is bigger than the target. • If these two errors are the same size, then if add them up could get 0, which means that the error value suggests that no error was made. • Tried to make it as small as possible. Since there was only one set of weights in the network, this was sufficient to train the network. • Need to make all errors have the same sign. • In few different ways, but the one that will turn out to be best is the sum-of-squares error function, which calculates the difference between y and t for each node, squares them, and adds them all together:
  • 8.
    Where, E is Error YKis Actual Output TK is Expected Output Fig: Gradient Descent Graph Activation Function for Pass: We will concentrate on two layers, but could easily generalized two layers a /u Known as activation, g activation function and biases set extra weight. 4. EXAMPLES OF USING THE MLP: OVERVIEW: Step 1: Import Required Modules and Load Dataset • First, we import necessary libraries such as TensorFlow, Numpy, and Matplotlib for visualizing the data. We also load the MNIST dataset. Step 2: Load and Normalize Image Data • Next, we normalize the image data by dividing by 255 (since pixel values range from 0 to 255), which helps in faster convergence during training.
  • 9.
    Output: Step 3: VisualizingData • To understand the data better, we plot the first 100 training samples, each representing a digit. Output: Step 4: Building the Neural Network Model Here, we build a Sequential neural network model. The model consists of:  Flatten Layer: Reshapes 2D input (28×28 pixels) into a 1D array of 784 elements.  Dense Layers: Fully connected layers with 256 and 128 neurons, both using the ReLU activation function.
  • 10.
     Output Layer:The final layer with 10 neurons representing the 10 classes of digits (0-9) with sigmoid activation. Step 5: Compiling the Model Once the model is defined, we compile it by specifying:  Optimizer: Adam, for efficient weight updates.  Loss Function: Sparse categorical cross entropy, which is suitable for multi-class classification.  Metrics: Accuracy, to evaluate model performance. Step 6: Training the Model • We train the model on the training data using 10 epochs and a batch size of 2000. • We also use 20% of the training data for validation to monitor the model’s performance on unseen data during training. Output: Step 7: Evaluating the Model • After training, we evaluate the model on the test dataset to determine its performance. Output:
  • 11.
    We got theaccuracy of our model 92% by using model. Evaluate () on the test samples. The model is learning effectively on the training set, but the validation accuracy and loss levels off, which might indicate that the model is starting to over fit (where it performs well on training data but not as well on unseen data). Advantages of Multi-Layer Perceptron  Versatility: MLPs can be applied to a variety of problems, both classification and regression.  Non-linearity: Thanks to activation functions, MLPs can model complex, non-linear relationships in data.  Parallel Computation: With the help of GPUs, MLPs can be trained quickly by taking advantage of parallel computing. Disadvantages of Multi-Layer Perceptron  Computationally Expensive: MLPs can be slow to train, especially on large datasets with many layers.  Prone to over fitting: Without proper regularization techniques, MLPs can over fit the training data, leading to poor generalization.  Sensitivity to Data Scaling: MLPs require properly normalized or scaled data for optimal performance. 5. DERIVING BACK-PROPAGATION: Overview: • Back propagation is also known as "Backward Propagation of Errors" and it is a method used to train neural network. • Its goal is to reduce the difference between the model’s predicted output and the actual output by adjusting the weights and biases in the network. • Back propagation is a technique used in deep learning to train artificial neural networks particularly feed-forward networks. • It works iteratively to adjust weights and bias to minimize the cost function. • In each epoch the model adapts these parameters reducing loss by following the error gradient. • Back propagation often uses optimization algorithms like gradient descent or stochastic gradient descent. • The algorithm computes the gradient using the chain rule from calculus allowing it to effectively navigate complex layers in the neural network to minimize the cost function.
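As a minimal illustration of the gradient-descent update referred to above, the sketch below minimizes a made-up one-parameter cost E(w) = (w − 3)²; the cost function and learning rate are purely illustrative.

```python
# Gradient descent on an illustrative cost E(w) = (w - 3)^2, dE/dw = 2(w - 3).
w, eta = 0.0, 0.1                      # initial weight and learning rate
for epoch in range(25):
    grad = 2 * (w - 3)                 # gradient of the cost at the current w
    w = w - eta * grad                 # update rule: w <- w - eta * dE/dw
print(w)                               # approaches the minimiser w = 3
```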
Fig: Back Propagation of Error – Adjusting Weights and Biases
Working of the Back Propagation Algorithm:
The back propagation algorithm involves two main steps:
• Forward Pass
• Backward Pass
i. FORWARD PASS:
• In the forward pass the input data is fed into the input layer.
• These inputs, combined with their respective weights, are passed to the hidden layers.
• For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)) the output from h1 serves as the input to h2.
• Before applying an activation function, a bias is added to the weighted inputs.
• Each hidden layer applies an activation function such as ReLU (Rectified Linear Unit), which returns the input if it is positive and zero otherwise.
• This adds non-linearity, allowing the model to learn complex relationships in the data.
• Finally, the outputs from the last hidden layer are passed to the output layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.
Example: Assume the neurons use the sigmoid activation function for the forward and backward pass. The target output is 0.5, and the learning rate is 1. Find the outputs of y3, y4 and y5.
In Forward Propagation,
Step 1: Initial Calculation
• The weighted sum at each node is calculated as: aj = Σi (wi,j × xi)
Where,
• aj is the weighted sum of all the inputs and weights at node j
• wi,j is the weight on the connection from input i to node j
• xi is the value of the i-th input
Step 2: Sigmoid Function
• The sigmoid function, g(a) = 1 / (1 + e^(−a)), returns a value between 0 and 1, introducing non-linearity into the model.
Step 3: Computing Outputs
• At the h1 node, once we have calculated the weighted sum a1, we apply the sigmoid function to obtain the output y3.
• Similarly, find the values of y4 at h2 and y5 at O3.
• The computed values of y3, y4 and y5 are shown above.
Step 4: Error Calculation
• Our target output is 0.5, but we obtained 0.67. To calculate the error we can use: Error = ytarget − y5
• Using this error value we back propagate.
ii. BACKWARD PROPAGATION:
• In the backward pass the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common error measure is the Mean Squared Error (MSE), given by: MSE = (Predicted Output − Actual Output)²
• Once the error is calculated, the network adjusts the weights using gradients, which are computed with the chain rule.
• These gradients indicate how much each weight and bias should be adjusted to minimize the error in the next iteration.
• The backward pass continues layer by layer, ensuring that the network learns and improves its performance.
• The activation function, through its derivative, plays a crucial role in computing these gradients during back propagation.
In Back Propagation,
Step 1: Calculating Gradients
• The change in each weight is proportional to the error gradient: Δwij = η × δj × yi, where δj is the error term of the node the weight feeds into and yi is the output of the node it comes from.
Step 2: Output Unit Error
Step 3: Hidden Unit Error
Step 4: Weight Update
FINAL OUTPUT:
• After updating the weights, the forward pass is repeated, yielding:
• y3 = 0.57
• y4 = 0.56
• y5 = 0.61
• A small numeric sketch of one such forward/backward update is given below.
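The sketch below runs one forward pass, one error calculation and one gradient-descent weight update for a 2-2-1 sigmoid network, mirroring the steps above. The figure with the original weights is not reproduced here, so the inputs and initial weights are hypothetical and the resulting numbers will differ from the worked example; only the procedure is the same (target 0.5, learning rate 1).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical inputs and initial weights for a 2-2-1 sigmoid network
# (the original figure's values are not reproduced here).
x = np.array([0.35, 0.7])            # inputs
W1 = np.array([[0.2, 0.3],            # weights input -> hidden nodes h1, h2
               [0.2, 0.3]])
W2 = np.array([0.3, 0.9])             # weights hidden -> output node
target, eta = 0.5, 1.0                # target output and learning rate

# Forward pass: weighted sums a_j = sum_i w_ij * x_i, then the sigmoid.
a_hidden = x @ W1
y_hidden = sigmoid(a_hidden)          # y3, y4
a_out = y_hidden @ W2
y_out = sigmoid(a_out)                # y5

# Error, as in the worked example.
error = target - y_out

# Backward pass: output delta, hidden deltas, then gradient-descent updates.
delta_out = -error * y_out * (1 - y_out)              # from the squared error 1/2*(t - y)^2
delta_hidden = delta_out * W2 * y_hidden * (1 - y_hidden)

W2 = W2 - eta * delta_out * y_hidden
W1 = W1 - eta * np.outer(x, delta_hidden)

print("output y5:", y_out, "error:", error)
```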
• Since y5 = 0.61 is still not the target output, the process of calculating the error and back propagating continues until the desired output is reached.
• This process demonstrates how back propagation iteratively updates the weights, reducing the error until the network accurately predicts the output.
Error = ytarget − y5 = 0.5 − 0.61 = −0.11
• This process continues until the target output is produced by the neural network.
6. RADIAL BASIS FUNCTIONS AND SPLINES
i. RADIAL BASIS FUNCTIONS:
What is a radial basis function neural network?
• A radial basis function network is a type of artificial neural network that uses supervised machine learning (ML) to function as a nonlinear classifier (typically with Gaussian basis functions).
• Nonlinear classifiers use more sophisticated functions to go further than simple linear classifiers that work directly on the lower-dimensional input vectors.
What do you understand by a radial basis function network?
• In mathematical modelling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions (e.g., the Gaussian function).
• The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.
What is a Kernel Function?
• Kernels play a fundamental role in transforming data into higher-dimensional spaces, enabling algorithms to learn complex patterns and relationships.
• A kernel function is used to transform an n-dimensional input into an m-dimensional space, where m is much higher than n, and to compute the dot product in that higher-dimensional space efficiently.
• The main idea of using a kernel is: a linear classifier or regression curve in the higher-dimensional space corresponds to a non-linear classifier or regression curve in the original lower-dimensional space.
What are Radial Basis Functions?
• Radial Basis Function (RBF) networks are a special category of feed-forward neural networks comprising three layers:
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer where RBF neurons process the data.
3. Output Layer: Produces the network's predictions, suitable for classification or regression tasks.
Need for Radial Basis Functions:
• An MLP naturally separates the classes with hyper planes in the input space.
• An RBF network instead separates the class distributions by localizing radial basis functions around cluster centres.
• The types of separating surface are:
1. Hyper plane (linearly separable)
2. Hyper sphere (spherically separable)
3. Quadrics (quadratically separable)
Fig: Types of Separable Surfaces
What happens in the Hidden Layer?
• The patterns in the input space form clusters.
• If the centres of these clusters are known, then the distance from each cluster centre can be measured.
• The most commonly used radial basis function is the Gaussian function.
• In an RBF network, r is the distance from the cluster centre (the Euclidean distance).
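A minimal sketch of a single hidden RBF neuron is shown below: it turns the Euclidean distance r between the input and the cluster centre into a Gaussian activation. The spread σ is an assumed parameter, and the input and centre values are illustrative.

```python
import numpy as np

def gaussian_rbf(x, centre, sigma=1.0):
    """Gaussian radial basis activation: phi(r) = exp(-r^2 / (2*sigma^2)),
    where r is the Euclidean distance from the input x to the neuron's centre."""
    r = np.linalg.norm(x - centre)
    return np.exp(-r**2 / (2 * sigma**2))

print(gaussian_rbf(np.array([1.0, 2.0]), np.array([0.0, 0.0]), sigma=1.5))
```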
DIFFERENCE BETWEEN MLP AND RBF:
(See the comparison table in the figure.)
Types of Radial Basis Function:
• There are several types of Radial Basis Functions (RBFs), each with its own characteristics and mathematical formulation.
• Some common types include:
i. Gaussian Radial Basis Function
ii. Multiquadric Radial Basis Function
iii. Inverse Multiquadric Radial Basis Function
iv. Thin Plate Spline Radial Basis Function
v. Cubic Radial Basis Function
1. Gaussian Radial Basis Function: It has a bell-shaped curve and is often employed in various applications due to its simplicity and effectiveness. It is commonly written as: ϕ(r) = exp(−r² / (2σ²))
2. Multiquadric Radial Basis Function: It provides a smooth interpolation and is commonly used in applications such as meshless methods and radial basis function interpolation. It is commonly written as: ϕ(r) = √(r² + σ²)
3. Inverse Multiquadric Radial Basis Function: This function is similar to the multiquadric RBF but with the expression in the denominator, resulting in a different shape: ϕ(r) = 1 / √(r² + σ²)
4. Thin Plate Spline Radial Basis Function: The thin plate spline RBF is defined as ϕ(r) = r² log(r), where r is the Euclidean distance between the input and the centre. This RBF is often used in applications involving thin-plate splines, which are used for surface interpolation and deformation.
5. Cubic Radial Basis Function: The cubic RBF is defined as ϕ(r) = r³, where r is the Euclidean distance. It has cubic polynomial behaviour and is sometimes used in interpolation.
ii. SPLINES:
• To overcome the disadvantages of linear and polynomial regression we introduce regression splines.
• In linear regression the dataset is treated as a whole, but in spline regression we split the dataset into several parts, called bins.
• The points at which we divide the data are called knots, and we fit a different function in each bin. These separate functions are called piecewise step functions.
• Splines are a way to fit a high-degree polynomial function by breaking it up into smaller piecewise polynomial functions.
• For each piece we fit a separate model and connect them all together.
• Linear regression fits only a straight line; polynomial regression can capture curvature, but a single high-degree polynomial can over fit the data.
• The need for a model that combines the good properties of both linear and polynomial regression motivated spline regression.
• While this sounds complicated, by breaking each section into smaller polynomials we decrease the risk of over fitting (a short code sketch follows).
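A minimal sketch of spline regression with explicitly chosen interior knots is shown below, using SciPy's LSQUnivariateSpline; the data and the knot positions are made up for illustration.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Noisy data (illustrative) and three interior knots splitting it into bins.
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * np.random.randn(50)
knots = [2.5, 5.0, 7.5]                     # interior knots; the ends act as boundary knots

# Fit a separate cubic polynomial in each bin, joined smoothly at the knots.
spline = LSQUnivariateSpline(x, y, knots, k=3)
print(spline(np.array([1.0, 4.0, 9.0])))    # evaluate the fitted spline
```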
How do we break up a polynomial?
• Because a spline breaks up a polynomial into smaller pieces, we need to determine where to break it up.
• The point where this division occurs is called a knot.
• In the example above, each Px represents a knot.
• The knots at the ends of the curve are known as boundary knots, while the knots within the curve are known as internal knots.
TYPES OF SPLINES:
There are three types of splines:
• Cubic Splines
• Natural Splines
• Smoothing Splines
i. Cubic Splines: Cubic splines require that the different polynomial pieces connect smoothly.
• This means that the first and second derivatives of these functions must be continuous at the knots.
• A plot of a cubic spline shows that the first derivative is a continuous function.
ii. Natural Splines: Polynomial functions and other kinds of splines tend to fit badly near the ends of the data range.
• This variability can have huge consequences, particularly in forecasting.
• Natural splines resolve this issue by forcing the function to be linear beyond the boundary knots.
iii. Smoothing Splines: Finally, consider the regularized version of a spline: the smoothing spline.
• The cost function is penalized if the variability of the fitted function is high.
• Smoothing splines are needed when noisy data would otherwise lead to an inadequate model fit.
7. RBF NETWORK:
• Radial Basis Function (RBF) neural networks are a specialized type of Artificial Neural Network (ANN) used primarily for function approximation tasks.
• A radial basis function is a mathematical function that takes a real-valued input and produces a real-valued output determined by the distance between the input and a fixed point (centre) in the space.
• Known for their distinct three-layer architecture and universal approximation capabilities, RBF networks offer fast learning and efficient performance in classification and regression problems.
What are Radial Basis Functions?
• RBF networks consist of the same three layers described above: an input layer, a hidden layer of RBF neurons, and an output layer.
Working of RBF networks:
• RBF networks are conceptually similar to K-Nearest Neighbour (k-NN) models, though their implementation is distinct.
• The fundamental idea is that an item's predicted target value is influenced by nearby items with similar predictor variable values. Here is how RBF networks operate:
1. Input Vector: The network receives an n-dimensional input vector that needs classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector from the training set. The network computes the Euclidean distance between the input vector and each neuron's centre.
3. Activation Function: The Euclidean distance is transformed using a radial basis function (typically a Gaussian) to compute the neuron's activation value. This value decreases exponentially as the distance increases.
4. Output Nodes: Each output node calculates a score based on a weighted sum of the activation values from all RBF neurons. For classification, the category with the highest score is chosen.
Key Characteristics of RBFs:
 Radial Basis Functions: These are real-valued functions that depend solely on the distance from a central point. The Gaussian function is the most commonly used type.
 Dimensionality: The network's input dimension corresponds to the number of predictor variables.
 Centre and Radius: Each RBF neuron has a centre and a radius (spread). The radius affects how broadly each neuron influences the input space.
Architecture of RBF Networks:
Fig: Architecture of an RBF Network
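The working steps above, together with the training procedure described in the next subsection (k-means centres, a spread heuristic, and least-squares output weights), can be sketched as follows. The dataset, the number of centres and the decision threshold are illustrative assumptions, not part of the notes above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Illustrative two-class dataset.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Step 1: choose the centres with k-means clustering.
k = 10
centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: spread heuristic sigma = d_max / sqrt(2k).
d_max = max(np.linalg.norm(c1 - c2) for c1 in centres for c2 in centres)
sigma = d_max / np.sqrt(2 * k)

# Hidden layer: Gaussian activations of the distances to each centre.
def rbf_features(X):
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-d**2 / (2 * sigma**2))

# Step 3: train the output weights by least squares (pseudo-inverse).
Phi = rbf_features(X)
w = np.linalg.pinv(Phi) @ y

# Prediction: weighted sum of activations, thresholded for classification.
pred = (rbf_features(X) @ w > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```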
Training Process of a Radial Basis Function Neural Network
• An RBF neural network is trained in three stages: choosing the centres, determining the spread parameters, and training the output weights.
Step 1: Selecting the Centres
 Techniques for Centre Selection: Centres can be picked at random from the training data or by applying techniques such as k-means clustering.
 K-Means Clustering: This widely used centre-selection technique groups the input data into k clusters, and the centres of these clusters are used as the centres of the RBF neurons.
Step 2: Determining the Spread Parameters
 The spread parameter (σ) governs each RBF neuron's area of influence and establishes the width of the RBF.
 Calculation: The spread can be adjusted manually for each neuron or set as a constant for all neurons. A popular heuristic sets σ from the separation between the centres, for example σ = dmax / √(2k), where dmax is the greatest distance between centres and k is the number of centres.
Step 3: Training the Output Weights
 Linear Regression: Linear regression techniques are commonly used to estimate the output-layer weights by minimizing the error between the predicted outputs and the actual target values.
 Pseudo-Inverse Method: One popular technique for computing the weights is to use the pseudo-inverse of the matrix of hidden-layer outputs.
Advantages of RBF networks:
1. Universal Approximation: RBF networks can approximate any continuous function with arbitrary accuracy given enough neurons.
2. Faster Learning: The training process is generally faster compared to other neural network architectures.
3. Simple Architecture: The straightforward three-layer architecture makes RBF networks easier to implement and understand.
Applications of RBF Networks:
 Classification: RBF networks are used in pattern recognition and classification tasks, such as speech recognition and image classification.
 Regression: These networks can model complex relationships in data for prediction tasks.
 Function Approximation: RBF networks are effective at approximating non-linear functions.
8. CURSE OF DIMENSIONALITY:
• The curse of dimensionality in machine learning arises when working with high-dimensional data, leading to increased computational complexity, over fitting, and spurious correlations.
• The curse of dimensionality refers to the phenomenon where the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases, because the volume of the space grows exponentially with the number of dimensions.
• In high-dimensional spaces, data points become sparse, making it challenging to discern meaningful patterns or relationships, because a vast amount of data is required to adequately sample the space.
Dimensionality Reduction Techniques:
i. Feature Selection: Identify and select the most relevant features from the original dataset while discarding irrelevant or redundant ones.
• This reduces the dimensionality of the data, simplifying the model and improving its efficiency.
ii. Feature Extraction: Transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information.
• Techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for feature extraction.
Implementation:
• Step 1: Import Necessary Libraries
• Step 2: Loading the dataset
• Step 3: Remove Constant Features
• Step 4: Splitting the data and standardizing
• Step 5: Feature Selection and Dimensionality Reduction
• Step 6: Training the classifiers
Solution to the Curse of Dimensionality:
• One way to reduce the impact of high dimensions is to use a different measure of distance in the vector space.
• One could explore the use of cosine similarity in place of Euclidean distance; cosine similarity can be less affected by high dimensionality.
• However, the usefulness of such a method is specific to the problem being solved.
9. INTERPOLATIONS AND BASIS FUNCTIONS:
• Interpolation is a method of creating new data points within the range of known data points.
• Given two known points (x1, y1) and (x2, y2), the value at an intermediate point x is estimated along the line joining them.
• The (linear) interpolation formula is: y = y1 + (x − x1) × (y2 − y1) / (x2 − x1)
Types of Interpolation:
• Linear
• Multivariate
• Nearest Neighbour
• Polynomial
• Spline
i. Interpolation in Linear Form:
• Linear interpolation creates a continuous function out of discrete data.
• It is a foundational building block for the gradient descent algorithm, which is used in training just about every machine learning technique.
• Linear interpolation on a set of data points (x0, y0), (x1, y1), ..., (xn, yn) is defined as the concatenation of linear interpolants between each pair of consecutive data points.
• This results in a continuous, piecewise-linear curve (see the code sketch after this list of interpolation types).
ii. Multivariate Interpolation:
• In numerical analysis, multivariate interpolation is interpolation on functions of more than one variable; when the variables are spatial coordinates it is also known as spatial interpolation.
• The function to be interpolated is known at given points (xi, yi, zi, ...) and the interpolation problem consists of yielding values at arbitrary points (x, y, z, ...).
iii. Nearest Neighbour Interpolation:
• Nearest-neighbour interpolation is a simple method of multivariate interpolation in one or more dimensions.
• Interpolation is the problem of approximating the value of a function at a point not in the sample, given the value of that function at points around (neighbouring) that point.
• The nearest-neighbour algorithm selects the value of the nearest point and does not consider the values of the other neighbouring points at all, yielding a piecewise-constant interpolant.
• The algorithm is very simple to implement and is commonly used in real-time 3D rendering to select colour values for a textured surface.
iv. Polynomial Interpolation:
• Polynomial interpolation is a method of estimating values between known data points using a single polynomial that passes through them.
• When graphical data contains a gap, but data is available on either side of the gap or at a few specific points within the gap, an estimate of the values within the gap can be made by interpolation.
v. Spline Interpolation:
• Spline interpolation is a method of interpolation where the interpolating function is a piecewise-defined polynomial called a spline.
• Unlike polynomial interpolation, which uses a single polynomial to fit all the data points, spline interpolation divides the data into smaller segments and fits a separate polynomial to each segment.
• This approach results in a smoother interpolating function that can better capture the local behaviour of the data.
• The most common type is cubic spline interpolation, which uses cubic polynomials for each segment and ensures continuity of the first and second derivatives at the endpoints of each segment.
• Spline interpolation is particularly useful for smoothing noisy data or interpolating functions with complex shapes.
Applications of Interpolation:
• Image Processing
• Computer Graphics
• Numerical Analysis
• Signal Processing
• Mathematical Modeling
• Geographic Information Systems (GIS)
• Audio Processing
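A minimal sketch comparing linear, nearest-neighbour and cubic spline interpolation on the same made-up data points is given below, using NumPy and SciPy; the data values are purely illustrative.

```python
import numpy as np
from scipy.interpolate import interp1d, CubicSpline

# Known data points (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
x_new = np.linspace(0, 4, 9)

linear = np.interp(x_new, x, y)            # piecewise-linear interpolant
nearest = interp1d(x, y, kind="nearest")   # piecewise-constant interpolant
spline = CubicSpline(x, y)                 # smooth piecewise-cubic interpolant

print(linear)
print(nearest(x_new))
print(spline(x_new))
```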
10. SUPPORT VECTOR MACHINES:
• A Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for both classification and regression problems.
• Primarily, it is used for classification problems in machine learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that we can easily place a new data point in the correct category in the future.
• This best decision boundary is called a hyper plane.
• SVM chooses the extreme points/vectors that help in creating the hyper plane.
• These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
• Consider the diagram in which two different categories are classified using a decision boundary (hyper plane).
Working of the SVM algorithm:
• The key idea behind the SVM algorithm is to find the hyper plane that best separates two classes by maximizing the margin between them.
• This margin is the distance from the hyper plane to the nearest data points (support vectors) on each side.
• The best hyper plane, also known as the "hard margin", is the one that maximizes the distance between the hyper plane and the nearest data points from both classes.
• This ensures a clear separation between the classes; in the figure, the line labelled L2 is chosen as the hard margin.
• The SVM algorithm can ignore outliers while finding the hyper plane that maximizes the margin, so SVM is robust to outliers.
• A soft margin allows some misclassifications or violations of the margin in order to improve generalization.
• The SVM optimizes the following objective to balance margin maximization and penalty minimization: Objective Function = (1 / margin) + λ Σ penalty
How does SVM classify the data?
• The penalty used for violations is often the hinge loss, which behaves as follows:
o If a data point is correctly classified and lies outside the margin, there is no penalty (loss = 0).
o If a point is incorrectly classified or violates the margin, the hinge loss increases proportionally to the distance of the violation.
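A small sketch of this hinge-loss behaviour is given below, assuming labels in {−1, +1} and a decision value f(x) = w·x + b; the example values are illustrative.

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss: 0 when the point is correctly classified outside the margin
    (y * f >= 1); otherwise it grows linearly with the size of the violation."""
    return np.maximum(0.0, 1.0 - y * f)

print(hinge_loss(+1, 2.0))   # 0.0 -> correct and outside the margin
print(hinge_loss(+1, 0.5))   # 0.5 -> correct but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0 -> misclassified
```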
Types of SVM: There are two types of SVM:
• Linear SVM
• Non-Linear SVM
i. Linear SVM: Linear SVM can be used only when the data is perfectly linearly separable. Perfectly linearly separable means that the data points can be classified into two classes by a single straight line (in 2D).
Fig: Linear SVM
ii. Non-Linear SVM: When the data is not linearly separable, we use a Non-Linear SVM: when the data points cannot be separated into two classes by a straight line (in 2D), we use advanced techniques such as kernel tricks to classify them.
 In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them (a short comparison sketch follows).
Fig: Non-Linear SVM
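The comparison sketch below fits a linear SVM and an RBF-kernel SVM to a toy, non-linearly separable dataset (concentric circles) with scikit-learn; the dataset and the default hyper-parameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)       # kernel trick handles the non-linearity

print("linear SVM accuracy:", linear_svm.score(X, y))
print("RBF SVM accuracy:", rbf_svm.score(X, y))
```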
What do we do if the data are not linearly separable?
• When data is not linearly separable (i.e., it cannot be divided by a straight line), SVM uses a technique called kernels to map the data into a higher-dimensional space where it becomes separable.
• This transformation helps SVM find a decision boundary even for non-linear data.
Kernel in SVM:
• A kernel is a function that maps data points into a higher-dimensional space without explicitly computing the coordinates in that space.
• This allows SVM to work efficiently with non-linear data by implicitly performing the mapping.
• For example, consider data points that are not linearly separable.
• By applying a kernel function, SVM effectively transforms the data points into a higher-dimensional space where they become linearly separable.
Types of Kernel in SVM:
• Linear Kernel: For linearly separable data.
• Polynomial Kernel: Maps data into a polynomial feature space.
• Radial Basis Function (RBF) Kernel: Transforms data into a space based on the distances between data points.
Why do we need to use support vector machines?
• SVMs are used in applications such as handwriting recognition, intrusion detection, face detection, email classification, gene classification, and web-page classification.
• This is one of the reasons we use SVMs in machine learning: they can handle both classification and regression on linear and non-linear data.
What is generalization error in terms of the SVM?
• Generalization error is the out-of-sample error: a measure of how accurately a model can predict values for previously unseen data.
Why is SVM an example of a large margin classifier?
• An SVM classifies positive and negative examples (here, blue and red data points).
• As shown in the image, the largest margin is found in order to avoid over fitting, i.e. the optimal hyper plane is at the maximum distance from the positive and negative examples (equidistant from the boundary lines).
• To satisfy this constraint, and also to classify the data points accurately, the margin is maximized; that is why SVM is called a large margin classifier.
Mathematical Computation: SVM
 Consider a binary classification problem with two classes, labelled +1 and −1.
 We have a training dataset consisting of input feature vectors X and their corresponding class labels Y.
 The equation of the linear hyper plane can be written as: w · x + b = 0
Where:
 w is the normal vector to the hyper plane (the direction perpendicular to it).
 b is the offset or bias term, representing the distance of the hyper plane from the origin along the normal vector w.
STEP 1: Distance from a Data Point to the Hyper Plane
 The distance between a data point xi and the decision boundary can be calculated as: di = (w · xi + b) / ||w||
 where ||w|| is the Euclidean norm of the weight (normal) vector w.
STEP 2: Linear SVM Classifier
 The predicted label ŷ of a data point is obtained from the sign of the decision value: ŷ = +1 if w · x + b ≥ 0, and ŷ = −1 otherwise.
STEP 3: Optimization Problem for SVM
 For a linearly separable dataset, the goal is to find the hyper plane that maximizes the margin between the two classes while ensuring that all data points are correctly classified.
 This leads to the following optimization problem: minimize (1/2) ||w||²
Subject to the constraint: yi (w · xi + b) ≥ 1 for i = 1, ..., M
Where:
 yi is the class label (+1 or −1) for the i-th training instance.
 xi is the feature vector of the i-th training instance.
 M is the total number of training instances.
STEP 4: Soft Margin Linear SVM Classifier
 In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζi. The optimization problem is modified as: minimize (1/2) ||w||² + C Σ ζi
Subject to the constraints: yi (w · xi + b) ≥ 1 − ζi and ζi ≥ 0 for all i
Where,
 C is a regularization parameter that controls the trade-off between margin maximization and the penalty for misclassifications.
 ζi are slack variables that represent the degree to which each data point violates the margin.
Step 5: Dual Problem for SVM
 The dual problem involves maximizing the Lagrange multipliers associated with the support vectors. This transformation allows the SVM optimization to be solved with kernel functions for non-linear classification.
Implementation of SVM:
• Predict whether a tumour is benign or malignant.
• Using historical data about patients diagnosed with cancer, doctors can differentiate malignant cases from benign ones given the independent attributes.
• Load the breast cancer dataset from sklearn.datasets.
• Separate the input features and the target variable.
• Build and train an SVM classifier using the RBF kernel.
• Plot a scatter plot of the input features (a code sketch of these steps is given below).
Advantages of SVM:
• Effective on datasets with many features, such as financial or medical data.
• Effective in cases where the number of features is greater than the number of data points.
• Uses a subset of the training points in the decision function (the support vectors), which makes it memory efficient.
• Different kernel functions can be specified for the decision function.
Disadvantages of SVM:
• If the number of features is much larger than the number of data points, avoiding over fitting when choosing the kernel function and regularization term is crucial.
• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
• Works best on smaller datasets because of its high training time.
A Note on Cross-Validation:
• Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.
• It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (for example, when tuning an SVM's kernel and regularization parameters).
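A minimal sketch of the implementation steps listed above is given below (breast cancer dataset from sklearn.datasets, RBF-kernel SVM, scatter plot of the first two features); the train/test split and accuracy check are added here for completeness and are not part of the listed steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Load the breast cancer dataset and separate features and target.
data = load_breast_cancer()
X, y = data.data, data.target                 # 0 = malignant, 1 = benign

# Build and train an SVM classifier with the RBF kernel.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Scatter plot of the first two input features, coloured by class.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.show()
```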