ELH-4.2: MACHINE LEARNING
UNIT – II: Artificial Neural Networks (ANNs) and Bayesian Learning
Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics, Kuvempu University, Jnanasahyadri, Shankaraghatta, 2023-24
For more notes visit: https://sites.google.com/view/chandrakanthats/teaching
INTRODUCTION
Biological Neural Networks (BNNs)
 Biological neural networks refer to the network of neurons (nerve cells) in biological
organisms that process and transmit information.
 These networks are the foundation of the brain and the nervous system in animals.
Structure of Neurons
Key Concepts
1. Neurons: The basic unit of the nervous system, a neuron is a specialized cell that transmits
electrical and chemical signals. A typical neuron has dendrites (inputs), a cell body (soma),
and an axon (output).
2. Synapses: The junctions where neurons communicate with each other through
neurotransmitters. Synapses can be excitatory or inhibitory, influencing the likelihood of
the receiving neuron firing an action potential.
3. Action Potential: A brief electrical charge that travels down the axon of a neuron when it
is activated. This is the fundamental mechanism for neural communication.
4. Plasticity: The ability of the neural network to change and adapt over time through
experience. Synaptic plasticity, for example, refers to the strengthening or weakening of
synapses based on activity levels.
5. Neurotransmitters: Chemical messengers released by neurons to transmit signals across
synapses. Examples include dopamine, serotonin, and acetylcholine.
Functioning of Biological Neural Networks
 Information Processing: Neurons receive signals through dendrites, process these signals
in the soma, and transmit them via the axon to other neurons.
 Signal Integration: Neurons integrate multiple synaptic inputs to determine whether to
generate an action potential.
 Learning and Memory: Changes in synaptic strength, known as synaptic plasticity,
underlie learning and memory processes.
Artificial Neural Networks (ANNs)
 Artificial neural networks are computational models designed to simulate the way
biological neural networks process information.
 ANNs are a cornerstone of machine learning and artificial intelligence.
Structure of Artificial Neurons (Perceptrons)
Key Concepts
1. Artificial Neurons (Perceptrons): The basic units of ANNs that mimic the behavior of
biological neurons. They receive inputs, apply weights, and produce an output through an
activation function.
Multilayer Perceptron (MLP)
2. Layers:
o Input Layer: The layer that receives the initial data.
o Hidden Layers: Intermediate layers where the actual processing is done through
neurons.
o Output Layer: The layer that produces the final output.
3. Weights and Biases: Parameters that determine the strength of the connection between
neurons. Weights are adjusted during training to minimize the error in predictions.
4. Activation Function: A non-linear function applied to the neuron’s output to enable the
network to learn complex patterns. Common activation functions include the step, sign, and
sigmoid functions.
5. Training: The process of learning by adjusting weights using algorithms such as
backpropagation and gradient descent to minimize error.
6. Learning Rate: A hyperparameter that controls the adjustment of weights during training.
Functioning of Artificial Neural Networks
 Feedforward: Data flows from the input layer to the output layer, passing through hidden
layers.
 Backpropagation: The training process where the error is propagated backward to update
the weights.
 Learning: The network learns by adjusting weights to minimize the difference between
the predicted output and the actual output.
Neural Network Representations (Mapping between BNN&ANN)
Artificial Neural Networks (ANNs) are designed to simulate the way Biological Neural Networks
(BNNs) process information. Understanding the similarities and differences between BNNs and
ANNs helps in appreciating the computational models used in machine learning.
Biological Neural Network (BNN) → Artificial Neural Network (ANN)
Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output
Mapping
1. Neurons to Artificial Neurons:
o Biological Neurons: Receive input signals through dendrites, integrate them in the
soma, and transmit through the axon.
o Artificial Neurons: Receive inputs, compute a weighted sum plus a bias, apply an
activation function, and produce an output.
2. Synaptic Connections to Weights:
o Synapses: Chemical junctions where neurotransmitters transmit signals, with the
strength of transmission modifiable.
o Weights: Numerical values that determine the strength of connections between
artificial neurons, adjusted during training.
3. Action Potential to Activation Function:
o Action Potential: A spike in electrical activity when a neuron's threshold is
reached.
o Activation Function: A mathematical function that decides whether a neuron
should be activated, introducing non-linearity (e.g., sigmoid function).
4. Plasticity to Training:
o Plasticity: The brain's ability to change and adapt by altering synaptic strengths.
o Training: The process of adjusting weights and biases in an ANN to minimize
prediction error, simulating learning.
Comparison of BNNs and ANNs
 Inputs: BNN uses dendrites receiving signals from other neurons; ANN takes inputs xi from the previous layer or external data.
 Integration: BNN integrates signals in the soma and generates electrical activity; ANN computes the weighted sum z = Σ wi·xi + b.
 Transmission: BNN transmits the electrical signal along the axon to other neurons; ANN applies the activation function ϕ(z).
 Output: BNN releases neurotransmitters at synapses via axon terminals; ANN passes the output to the next layer or uses it as the final prediction.
 Synaptic Connections: BNN uses chemical junctions where neurotransmitters transmit signals; ANN uses weights wi that determine connection strength.
 Action Potential: BNN produces an electrical spike when the neuron's threshold is reached; ANN introduces non-linearity through the activation function.
 Plasticity: BNN adapts by altering synaptic strengths; ANN training adjusts weights and biases.
 Learning Mechanism: BNN learning is driven by experience-based synaptic changes; ANN learning is driven by algorithmic weight updates (backpropagation, gradient descent).
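As a minimal illustration of the ANN side of this comparison, the sketch below computes the weighted sum and applies a sigmoid activation for a single artificial neuron; all input, weight, and bias values are made up for illustration.

import math

def artificial_neuron(inputs, weights, bias):
    # Integration: weighted sum z = sum(wi * xi) + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Transmission: apply a non-linear activation function, here the sigmoid
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only
output = artificial_neuron(inputs=[0.5, 0.2], weights=[0.8, -0.4], bias=0.1)
print(output)   # a value between 0 and 1, passed on to the next layer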
Characteristics of Artificial Neural Networks
ANNs exhibit several key characteristics that contribute to their versatility and effectiveness in
various applications. Here are some of the main characteristics:
1. Non-linearity
 ANNs can model complex non-linear relationships between inputs and outputs.
 This allows them to capture intricate patterns in data that linear models cannot.
2. Adaptivity
 ANNs can adapt to new data through learning.
 This makes them flexible and capable of improving their performance over time as they
are exposed to more data.
3. Parallel Processing
 The computations within an ANN are distributed across multiple neurons.
 This parallelism enables efficient processing of large-scale data and complex models.
4. Fault Tolerance
 ANNs can tolerate errors or noise in the input data and still produce reasonable outputs.
 This robustness makes them reliable for real-world applications where data can be
imperfect.
5. Generalization
 ANNs can generalize from the training data to unseen data.
 This ability to perform well on new, unseen data is crucial for tasks like classification and
prediction.
6. Learning Capability
 ANNs learn from examples through training algorithms such as backpropagation.
 This enables them to solve a wide range of problems by adjusting weights based on training
data.
7. Distributed Memory
 Knowledge in ANNs is stored in the weights and connections between neurons.
 This distributed representation means that no single part of the network solely determines
the output, enhancing fault tolerance.
8. Data-Driven Approach
 ANNs rely on large amounts of data to learn and improve.
 The data-driven nature allows them to uncover patterns and insights that are not explicitly
programmed.
9. Scalability
 ANNs can be scaled up to handle very large datasets and complex models by adding more
neurons and layers.
 This scalability is essential for tackling big data and intricate tasks.
10. Dynamic Nature
 ANNs can be designed to handle dynamic, time-varying inputs.
 This makes them suitable for tasks like time-series forecasting and real-time processing.
11. Layered Structure
 ANNs consist of multiple layers (input, hidden, output), each performing different
transformations on the data.
 This layered architecture allows for hierarchical feature extraction and representation.
12. Connectivity
 The neurons within an ANN are interconnected, forming a network.
 The pattern of these connections influences the network's ability to learn and represent
complex functions.
Appropriate Problems for Neural Network Learning
ANNs are particularly well-suited to a variety of problems where the training data is complex,
noisy, or high-dimensional, especially those characterized by the following features:
1. Instances are represented by many attribute-value pairs
ANNs handle high-dimensional data well, where each instance is described by multiple features.
Examples:
 Image Recognition: Each pixel in an image represents an attribute.
 Financial Data Analysis: Each financial metric (e.g., price, volume, ratios) is an attribute.
2. The target function output may be discrete-valued, real-valued, or a vector of several real-
or discrete-valued attributes
ANNs can predict different types of outputs, including categorical, continuous, or a combination.
Examples:
 Classification: Identifying whether an email is spam or not (discrete-valued).
 Regression: Predicting house prices (real-valued).
 Multi-output Prediction: Predicting both the genre and the sentiment of a movie review
(vector of discrete values).
3. The training examples may contain errors
ANNs are robust to noisy data and can generalize well even when training data is imperfect.
Examples:
 Speech Recognition: Handling background noise in audio recordings.
 Medical Diagnosis: Dealing with imperfect and noisy medical records.
4. Long training times are acceptable
Training ANNs, especially deep networks, can be time-consuming, but they often yield high
performance once trained.
Examples:
 Image Classification with Deep Learning: Training deep CNNs can take hours to days.
 Natural Language Processing: Training language models like GPT can take several days
to weeks.
5. Fast evaluation of the learned target function may be required
Once trained, ANNs can make predictions quickly, making them suitable for real-time
applications.
Examples:
 Autonomous Vehicles: Making split-second decisions based on sensor data.
 Real-time Fraud Detection: Quickly identifying fraudulent transactions.
6. The ability of humans to understand the learned target function is not important
ANNs are often considered "black boxes" due to their complexity, and interpretability might not
be crucial in some applications.
Examples:
 Product Recommendations: The exact reasoning behind recommendations may not need
to be explained to users, as long as the recommendations are effective.
 Game Playing AI: How an AI like AlphaGo makes decisions may not need to be
understood by players or developers.
Types of Architectures of Artificial Neural Networks
ANNs have different architectures based on the arrangement of neurons, how they are
interconnected, and the composition of their layers.
Here are the main types of ANN architectures:
I. Feedforward Architecture
In feedforward networks, the signal flows in one direction, from input to output, without any
feedback loops. This type can be further divided into:
a) Single-Layer Perceptron
 Consists of only one layer of output neurons.
 The simplest form of neural network.
 Suitable for linearly separable problems.
b) Multilayer Perceptron (MLP)
 Contains one or more hidden layers between the input and output layers.
 Can solve non-linear problems.
 Utilizes activation functions like Sigmoid, Tanh.
c) Radial Basis Function (RBF) Network
 Uses radial basis functions as activation functions.
 Typically has one hidden layer.
 Suitable for function approximation and classification.
II. Feedback Architecture
In feedback networks, signals can flow in both directions, incorporating loops. This allows the
network to maintain a state or memory of previous inputs. Key types include:
a) Recurrent Neural Network (RNN)
 RNNs have connections that loop back, allowing them to maintain state or memory of
previous time steps.
 This enables them to process sequential data like time series, natural language, and speech.
 In these networks, the outputs of the neurons are used as feedback inputs for other neurons.
b) Fully Recurrent Network
 This refers to a feedback architecture in which every neuron is connected to every other
neuron, creating a densely interconnected network.
 Here all nodes are connected to all other nodes and each node works as both input and
output.
Perceptrons
 A perceptron is a type of artificial neural network and one of the simplest forms of a neural
network model.
 It was first introduced by Frank Rosenblatt in 1957.
 It is a simple linear classifier that can be used for supervised learning tasks, particularly
binary classification problems.
Structure
A perceptron consists of a single layer of neurons (also known as nodes) connected to inputs and
an output.
Components of a Perceptron
1. Inputs (x):
 The perceptron receives one or more input values, represented as a vector
x= (x1, x2,...,xn)
 These inputs are the features or attributes of the data points being processed.
2. Weights (w):
 Each input xix_ixi is associated with a weight wiw_iwi, forming a weight vector
w=(w1, w2,...,wn).
 The weights determine the influence or strength of each input on the perceptron's output.
3. Bias (b):
 The bias term b acts as an additional parameter that helps the perceptron model more
complex decision boundaries.
 It allows the activation function to shift left or right, enabling better fitting of the data.
4. Summation:
 The perceptron computes the weighted sum of the inputs plus the bias.
 This can be expressed as w⋅x+b, where w⋅x is the dot product of the weight and input
vectors.
z = Σ (i = 1 to n) wi·xi + b = w⋅x + b
5. Activation Function:
 The weighted sum is passed through an activation function to produce the output.
 Typically, a step function is used, which outputs 1 if the weighted sum is greater than or
equal to a specified threshold (often 0), and 0 otherwise.
 Mathematically, the activation function f can be defined as:
Output = f(z) = 1 if z ≥ 0, and 0 otherwise
Perceptron Learning Algorithm/Training Rule
 The Perceptron Learning Algorithm (also known as the Perceptron Training Rule) is a
simple algorithm used to train a single-layer perceptron for binary classification tasks.
 It adjusts the weights and bias of the perceptron based on the error between its predictions
and the actual target values.
 The goal is to find the weights and bias that allow the perceptron to correctly classify the
input data.
Example: AND Gate
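The worked table for this example is given as a figure in the original notes. As a substitute, here is a minimal sketch of the perceptron training rule applied to the AND gate, assuming a step activation, a learning rate of 0.1, and zero initial weights (all assumed values, not taken from the original figure).

# Perceptron training rule on the AND gate (illustrative sketch)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]   # inputs
t = [0, 0, 0, 1]                        # AND targets
w = [0.0, 0.0]                          # weights (assumed initial values)
b = 0.0                                 # bias
eta = 0.1                               # learning rate (assumed)

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b   # weighted sum
    return 1 if z >= 0 else 0           # step activation

for epoch in range(10):                 # a few passes over the data
    for x, target in zip(X, t):
        o = predict(x)
        error = target - o
        # Perceptron rule: wi <- wi + eta * (t - o) * xi,  b <- b + eta * (t - o)
        w[0] += eta * error * x[0]
        w[1] += eta * error * x[1]
        b += eta * error

print(w, b, [predict(x) for x in X])    # should reproduce the AND truth table [0, 0, 0, 1]

Because AND is linearly separable, the rule converges after a few epochs to weights and a bias that implement the gate.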
Advantages of Perceptron Training Rule
1. Simplicity: Perceptrons are easy to understand and implement, making them a good
starting point for beginners in neural network theory.
2. Convergence: When the data is linearly separable (i.e., it's possible to draw a straight line
to separate two classes in a binary classification problem), the Perceptron Learning Rule
guarantees convergence. It will find a solution if one exists.
3. Efficiency: In cases where the data is linearly separable, Perceptrons can converge quickly
with a fixed learning rate.
4. Interpretability: Since Perceptrons are based on simple linear combinations and
thresholding, the learned model's parameters (weights) are interpretable. We can understand
the contribution of each input feature to the decision.
Disadvantages of Perceptron Training Rule
1. Limited to Linearly Separable Data: The perceptron can only solve problems where the
data is linearly separable. It cannot handle cases where a linear decision boundary is
insufficient, such as XOR problems.
2. Single-Layer Architecture: Perceptrons are limited to single-layer networks, which means
they cannot model more complex relationships that require multiple layers of neurons, as
seen in deep neural networks.
3. Noisy Data Handling: Perceptrons are sensitive to noise in the data. If the data is not
perfectly linearly separable, the learning rule may not converge or may converge to an
incorrect solution.
4. Binary Classification Limitation: The standard perceptron is designed for binary
classification tasks. While it can be extended to handle multi-class classification (e.g.,
through the use of multiple perceptrons or a one-vs-all strategy), it is inherently limited to
binary outputs in its basic form.
Types of Classification Problems
Classification problems that can be solved using neural networks fall into two broad
categories:
I) Linearly Separable Problems
 These are problems where a single linear decision boundary (a straight line in 2D, a plane
in 3D, etc.) can separate the data into different classes.
Examples:
 AND: Both inputs need to be 1 for the output to be 1. This is linearly separable because we
can draw a straight line to separate the (0,0), (0,1), and (1,0) inputs (which output 0) from
the (1,1) input (which outputs 1).
 OR: At least one input needs to be 1 for the output to be 1. This is also linearly separable
because a single line can separate the (0,0) input (which outputs 0) from the other inputs
(which output 1).
II) Non-Linearly Separable Problems
 These are problems where no single linear decision boundary can separate the data into
different classes. More complex boundaries are required to correctly classify the data.
Example:
 XOR: The output is 1 if the inputs are different and 0 if they are the same. The XOR
problem is not linearly separable because we cannot draw a single straight line that
separates the (0,0) and (1,1) inputs (which output 0) from the (0,1) and (1,0) inputs (which
output 1). We need a more complex boundary to separate these classes.
To address non-linearly separable problems, neural network machine learning techniques
often involve the following strategies:
1. Multi-Layer Perceptrons (MLPs)
 Hidden Layers: Introduce one or more hidden layers between the input and output layers.
These layers allow the network to learn complex, non-linear decision boundaries.
 Activation Functions: Use non-linear activation functions (e.g., ReLU, sigmoid, tanh) in
the hidden layers to enable the network to capture non-linear relationships in the data.
2. Deep Learning
 Deep Neural Networks (DNNs): Utilize neural networks with many layers (deep
architectures). Deep networks can model highly complex and abstract features from the
data.
 Convolutional Neural Networks (CNNs): Specialized for grid-like data such as images.
CNNs use convolutional layers to detect spatial hierarchies of features.
 Recurrent Neural Networks (RNNs): Suitable for sequential data. RNNs can maintain
information about previous inputs using internal states, making them effective for time-
series data and natural language processing.
Implications for Neural Networks
 Single-Layer Perceptrons: These can only solve linearly separable problems. They are
limited to problems where the classes can be separated by a straight line or hyperplane.
 Multi-Layer Perceptrons (MLPs): These can solve non-linearly separable problems by
using multiple layers of neurons. The hidden layers in MLPs allow them to create complex
decision boundaries, making them capable of handling XOR and other non-linearly
separable problems.
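As a sketch of the second point, an MLP with a single hidden layer can learn XOR. The example below assumes scikit-learn is available; the hidden layer size, solver, iteration count, and random seed are illustrative choices, not prescribed values.

from sklearn.neural_network import MLPClassifier

# XOR: not linearly separable, so a single-layer perceptron cannot learn it
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# One hidden layer with a few neurons gives the network a non-linear decision boundary
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', max_iter=5000, random_state=1)
clf.fit(X, y)
print(clf.predict(X))   # ideally [0, 1, 1, 0]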
Gradient Descent Rule for Neural Networks
• Gradient Descent is an optimization algorithm used to find the best-fitting parameters
(weights and biases) for a model by minimizing a given loss function.
• It is used to update model parameters iteratively in the direction of the steepest decrease in
the loss function, effectively searching the hypothesis space for the best parameters.
• Gradient Descent can be applied to various machine learning models, including neural
networks, to find the best model parameters that minimize the difference between predicted
and actual outcomes.
• When dealing with non-linearly separable data, Gradient Descent can still find a set of
parameters that approximates the target concept.
• However, the algorithm may not converge quickly or might not reach a global minimum
if the loss function has multiple local minima.
Analogy: Blindfolded in Rough Terrain
 Imagine we are blindfolded and standing on rough terrain. Our objective is to reach the
lowest point, which represents the global minimum of the loss function.
 Feeling the Ground: feel the ground in all directions to find the steepest descent,
analogous to calculating the gradient.
 Taking a Step: take a step in the direction where the ground is descending the fastest,
similar to updating the parameters in the opposite direction of the gradient.
 Repeating the Process: By continuously feeling the ground and taking steps, we
eventually reach the lowest point, assuming we don't get stuck in a local minimum or
saddle point.
Gradient Descent Algorithm:
 Gradient descent minimizes a cost function J(w) parametrized by model parameters w.
 The gradient (or derivative) indicates the incline or slope of the cost function. Hence, to
minimize the cost function, we move in the direction opposite to the gradient.
 Cost Function(J(w)): The cost function (or loss function) is a measure of how well the
model's predictions match the actual data. The goal of training a model is to minimize this
cost function.
 Parameters(w): Parameters are the weights in the model that we aim to optimize.
 Gradient(∇J(w)): The gradient of the cost function with respect to the parameters tells us
the direction and rate of the steepest ascent. In gradient descent, we move in the direction
opposite to the gradient to minimize the cost function.
By iteratively updating the parameters in the opposite direction of the gradient, Gradient Descent
effectively minimizes the cost function, leading to optimal model parameters.
Derivation of the Gradient Descent Rule
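The step-by-step derivation appears as figures in the original notes. In outline, for a linear unit with output od = w⋅xd on training example d with target td, the standard squared-error argument (sketched here, not reproduced from the original figures) runs:

E(w) = (1/2) Σd (td − od)²
∂E/∂wi = Σd (td − od) · ∂(td − od)/∂wi = Σd (td − od)(−xid)
Δwi = −η · ∂E/∂wi = η Σd (td − od) xid

so each weight is moved opposite to the gradient of the error, in proportion to the learning rate η.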
GRADIENT DESCENT algorithm for training a linear unit
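The boxed algorithm is given as a figure in the original notes. A minimal Python sketch of batch gradient descent for a linear unit, using the update rule derived above with a small made-up dataset (the data, learning rate, and epoch count are illustrative assumptions), is:

# Batch gradient descent for a linear unit (illustrative sketch)
# Hypothetical training data: inputs include a constant 1.0 so w[0] acts as the bias
data = [([1.0, 1.0], 2.0), ([1.0, 2.0], 3.0), ([1.0, 3.0], 4.0)]
w = [0.0, 0.0]          # initial weights (assumed)
eta = 0.05              # learning rate (assumed)

for epoch in range(500):
    delta = [0.0, 0.0]                                  # accumulated weight updates
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))        # linear unit output o = w . x
        for i in range(len(w)):
            delta[i] += eta * (t - o) * x[i]            # delta rule term: eta * (t - o) * xi
    w = [wi + di for wi, di in zip(w, delta)]           # update after a full pass (batch mode)

print(w)   # should approach w = [1.0, 1.0], i.e. the target function o = 1 + x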
Issues in Gradient Descent
Despite its simplicity and effectiveness, gradient descent faces several practical challenges:
1. Slow Convergence:
o Learning Rate Selection: If the learning rate α is too small, the algorithm will take
tiny steps and converge very slowly. If α is too large, the algorithm might overshoot
the minimum and diverge.
o Plateaus and Flat Regions: In regions where the gradient is close to zero, the
algorithm can become stuck and make slow progress.
2. Local Minima:
o Multiple Minima: The error surface (the plot of the objective function) might have
multiple local minima. Gradient descent can get trapped in one of these local
minima instead of finding the global minimum.
o Saddle Points: These are points where the gradient is zero but which are neither
minima nor maxima: a saddle point is a minimum along some directions and a
maximum along others, like a mountain pass. The algorithm can slow down or stall
at saddle points.
Variants of Gradient Descent
Several variants of gradient descent have been developed to address these issues:
1. Stochastic Gradient Descent (SGD): Instead of using the entire dataset to compute the
gradient, SGD uses a single randomly chosen data point (or a small batch). This introduces
noise into the gradient estimates, which can help escape local minima but may also lead to
more fluctuation in the descent path.
2. Mini-Batch Gradient Descent: This is a compromise between batch gradient descent and
SGD. It uses a small, randomly chosen subset of the data (mini-batch) to compute the
gradient. It balances the efficiency of SGD and the stability of batch gradient descent.
3. Momentum: This method adds a fraction of the previous update to the current update,
which helps accelerate convergence, especially in the presence of plateaus and flat regions.
4. Adaptive Learning Rate Methods: Methods like Adam adjust the learning rate
dynamically based on the gradients' history, which helps in handling varying gradients and
speeds up convergence.
In the figure referenced here (not reproduced), the learning rate remains high for roughly the first
350 epochs and then decreases, giving faster convergence (blue line) than the ordinary method
without the adaptive optimizer (yellow line).
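As a small illustration of the momentum variant listed above, the sketch below runs momentum updates on the one-dimensional quadratic loss J(w) = w²; the starting point, learning rate, and momentum coefficient are illustrative assumptions.

# Gradient descent with momentum on J(w) = w**2 (illustrative sketch)
w = 5.0            # starting point (assumed)
velocity = 0.0     # running average of past updates
eta = 0.1          # learning rate (assumed)
beta = 0.9         # momentum coefficient (assumed)

for step in range(200):
    grad = 2.0 * w                            # dJ/dw for J(w) = w**2
    velocity = beta * velocity - eta * grad   # add a fraction of the previous update
    w = w + velocity                          # move using the accumulated velocity

print(w)   # approximately 0, the minimum of J(w)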
Multilayer Networks and the Backpropagation Algorithm
 Multi-layer Perceptrons (MLPs) are a class of artificial neural networks that consist of
multiple layers of interconnected neurons (nodes or units).
 Unlike single-layer neural networks, MLPs have one or more hidden layers between the
input and output layers. These hidden layers allow the network to learn and model more
complex, non-linear relationships in the data.
 These networks are foundational architecture in deep learning.
 The backpropagation algorithm is a crucial component for training these networks
effectively.
 Multilayer networks trained using the backpropagation algorithm have demonstrated
remarkable success in various machine learning tasks, including image recognition, natural
language processing, and many others.
 Their ability to learn hierarchical representations of data makes them a fundamental tool in
modern artificial intelligence.
Structure of Multilayer Networks
A typical multilayer network consists of:
1. Input Layer: The first layer that receives the input features.
2. Hidden Layers: One or more layers where each neuron performs a weighted sum of its
inputs, applies an activation function, and passes the result to the next layer. The hidden
layers enable the network to learn and represent complex patterns.
3. Output Layer: The final layer that produces the network's output, which can be a
classification label, a regression value, or any other prediction.
Each connection between neurons has an associated weight, and each neuron has a bias term. The
learning process involves adjusting these weights and biases to minimize the error between the
network's predictions and the actual targets.
The Backpropagation Algorithm
 The backpropagation algorithm is a supervised learning method used to train multilayer
neural networks.
 It works by computing the gradient of the loss function with respect to the weights in
the network, and then using this gradient to update the weights in the direction that
minimizes the loss.
 The algorithm's name, "backpropagation," comes from the way it calculates gradients
of the loss function with respect to the network's parameters by working backward from
the output layer to the input layer.
 It involves two main phases: forward propagation and backward propagation
Training Process
1. Initialization: Initialize the weights and biases, typically with small random values.
2. Forward Propagation:
 Input data is passed through the network layer by layer.
 At each neuron, the weighted sum of inputs plus the bias is calculated and passed
through the activation function.
a = f(Wx + b)
 The output of each layer becomes the input to the next layer.
 This process continues until the final output layer, producing the network's prediction.
3. Backward Propagation:
 Calculate the error at the output layer (the difference between the predicted and
actual values).
L = (1/n) Σ (i = 1 to n) (yi − ŷi)²
where yi are the actual target values and ŷi are the predicted values.
 Compute the gradient of the error with respect to the output layer's weights and
biases.
 Propagate the error backwards through the network, layer by layer, computing the
gradients of the error with respect to the weights and biases of each preceding layer.
For the gradient of the loss with respect to a weight W: ∂L/∂W
For the gradient of the loss with respect to a bias b: ∂L/∂b
 Update the weights and biases using the computed gradients and a learning rate.
Wij ← Wij − η·(∂L/∂Wij) and bj ← bj − η·(∂L/∂bj)
where η is the learning rate.
4. Iteration: Repeat the forward and backward passes for multiple epochs (complete passes
through the training dataset) until the network converges (i.e., the loss stops decreasing
significantly).
In BACKPROPAGATION algorithm, we consider networks with multiple output units rather than
single units as before, so we redefine E to sum the errors over all of the network output units.
Example
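The worked numerical example appears as a figure in the original notes. As a substitute sketch, the code below trains a tiny one-hidden-layer network on the XOR data with sigmoid activations and a mean squared error loss; the layer sizes, learning rate, epoch count, and random seed are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

# Small random initial weights and zero biases (assumed sizes: 2 inputs, 4 hidden, 1 output)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
eta = 0.5                                               # learning rate (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # Forward propagation
    a1 = sigmoid(X @ W1 + b1)          # hidden layer activations
    a2 = sigmoid(a1 @ W2 + b2)         # network output
    # Backward propagation of the error (squared-error loss, sigmoid derivative a*(1-a))
    d2 = (a2 - y) * a2 * (1 - a2)      # output-layer error term
    d1 = (d2 @ W2.T) * a1 * (1 - a1)   # hidden-layer error term
    # Gradient-descent updates of weights and biases
    W2 -= eta * a1.T @ d2
    b2 -= eta * d2.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ d1
    b1 -= eta * d1.sum(axis=0, keepdims=True)

print(a2.round(2))   # outputs should approach the XOR targets [0, 1, 1, 0]
                     # (convergence depends on the random initialization)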
Backpropagation with gradient descent can effectively minimize the loss function and help the
network approach the global loss minimum, making it a powerful method for training neural
networks.
Advantages of the Backpropagation Algorithm:
1. Efficiency: Backpropagation is computationally efficient and scalable, making it suitable
for training large neural networks on big datasets.
2. Universal Approximator: Neural networks trained using backpropagation have the ability
to approximate a wide range of complex functions, making them powerful for various
machine learning tasks, including regression and classification.
3. Non-Linearity: The algorithm allows neural networks to learn non-linear relationships in
data, enabling them to capture complex patterns and representations.
4. Versatility: Backpropagation can be applied to different types of neural network
architectures, including feedforward networks, convolutional neural networks (CNNs), and
recurrent neural networks (RNNs).
5. Generalization: Backpropagation helps neural networks generalize well to unseen data,
reducing overfitting.
6. Deep Learning: Backpropagation is the foundation for training deep neural networks
(deep learning), which have achieved state-of-the-art results in various fields, including
computer vision and natural language processing.
Disadvantages of the Backpropagation Algorithm:
1. Local Minima: The algorithm can get stuck in local minima or saddle points in the loss
landscape, preventing it from finding the global minimum. However, techniques like
momentum and advanced optimizers help mitigate this issue.
2. Vanishing and Exploding Gradients: In deep neural networks, gradients can become too
small (vanishing gradients) or too large (exploding gradients), causing convergence issues.
Proper weight initialization and gradient clipping can address this problem.
3. Hyperparameter Sensitivity: Backpropagation involves tuning hyperparameters such as
the learning rate and batch size, which can be time-consuming and require trial and error.
4. Complexity: The implementation and training of deep neural networks can be complex
and computationally intensive, demanding significant computational resources.
Bayesian Learning
Introduction
 Bayesian learning is a statistical approach to machine learning that involves
updating the probability of a hypothesis as more evidence or information becomes
available.
 It relies on Bayes' theorem, a foundational concept in probability theory, to combine
prior knowledge with new data to make predictions or decisions.
 Bayesian learning provides a robust framework for updating beliefs and making
decisions in the presence of uncertainty, making it a powerful tool in various
applications, from spam detection to medical diagnosis and beyond.
Bayes' Theorem
Bayes' theorem provides a way to update the probability of a hypothesis H given new evidence E.
It is mathematically expressed as:
P(H∣E) = P(E∣H)·P(H) / P(E)
Where:
 P(H∣E) is the posterior probability of the hypothesis H given the evidence E.
 P(E∣H) is the likelihood of observing the evidence E given that the hypothesis H is true.
 P(H) is the prior probability of the hypothesis H before observing the evidence.
 P(E) is the marginal likelihood of observing the evidence E.
Example1: Medical Diagnosis
Suppose a doctor is trying to determine whether a patient has a particular disease D based on the
result of a diagnostic test T.
Given:
 The prior probability of the disease P(D)=0.01 (1% of the population has the disease).
 The probability of a positive test result given the disease P(T∣D)=0.9 (90% of those with
the disease test positive).
 The probability of a positive test result without the disease P(T∣¬D)=0.05 (5% of those
without the disease test positive).
 The overall probability of a positive test result P(T).
Solution:
1.Calculate the Marginal Likelihood P(T): P(T)=P(T∣D)⋅P(D)+P(T∣¬D)⋅P(¬D)
Where P(¬D)=1−P(D)=0.99.
P(T)=(0.9⋅0.01)+(0.05⋅0.99)
P(T)=0.009+0.0495=0.0585
2. Apply Bayes' Theorem:
P(D∣T) = P(T∣D)·P(D) / P(T)
P(D∣T) = (0.9 × 0.01) / 0.0585 = 0.009 / 0.0585 ≈ 0.1538
The posterior probability P(D∣T) is approximately 0.1538, or 15.38%. This means that even if the
test is positive, there is only a 15.38% chance that the patient actually has the disease, given the
test's characteristics and the rarity of the disease.
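The same calculation can be scripted directly; the sketch below uses the numbers from the example above.

# Bayes' theorem for the diagnostic-test example
P_D = 0.01             # prior probability of the disease
P_T_given_D = 0.90     # sensitivity: P(T | D)
P_T_given_notD = 0.05  # false positive rate: P(T | not D)

# Marginal likelihood of a positive test
P_T = P_T_given_D * P_D + P_T_given_notD * (1 - P_D)

# Posterior probability of disease given a positive test
P_D_given_T = P_T_given_D * P_D / P_T
print(round(P_T, 4), round(P_D_given_T, 4))   # 0.0585 and approximately 0.1538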
Example 2: Probability of Rain on Marie's Wedding Day Given the Weatherman's Forecast
Marie is getting married tomorrow at an outdoor ceremony in the desert. In recent years, it has
rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow.
When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't
rain, he incorrectly forecasts rain 10% of the time.
Given this information, what is the probability that it will rain on the day of Marie's wedding? Use
Bayes' theorem to calculate this probability.
Solution:
Given Information
 Probability of rain on any given day: P(R) = 5/365 ≈ 0.0137
 Probability of no rain on any given day: P(¬R) = 1 − P(R) ≈ 0.9863
 Probability that the weatherman correctly predicts rain given that it rains (P(W∣R)): 0.90
 Probability that the weatherman incorrectly predicts rain given that it doesn't rain
(P(W∣¬R)): 0.10
What We Want to Find
 Probability that it will rain given that the weatherman predicts rain (P(R∣W)).
 Using Bayes' Theorem:
P(R∣W) = P(W∣R)·P(R) / P(W)
 P(W) = P(W∣R)⋅P(R) + P(W∣¬R)⋅P(¬R)
 P(W) = (0.90 × 0.0137) + (0.10 × 0.9863) = 0.01233 + 0.09863 = 0.11096
 P(R∣W) = (0.90 × 0.0137) / 0.11096 = 0.01233 / 0.11096 ≈ 0.1111
The probability that it will rain on the day of Marie's wedding, given the weatherman's prediction,
is approximately 0.1111, or 11.11%.
Concept Learning with Bayes' Theorem
Concept learning involves inferring a general rule or pattern from specific examples. In the context
of Bayesian learning, concept learning can be understood as updating our beliefs about the
correctness of different hypotheses as we observe more examples.
Steps in Bayesian Concept Learning
1. Define the Hypothesis Space: List all possible hypotheses H that can explain the data.
2. Specify the Prior Probabilities: Assign a prior probability P(H) to each hypothesis based
on prior knowledge or assumptions.
3. Collect Evidence: Gather data or observations that will help in evaluating the hypotheses.
4. Compute the Likelihood: Calculate the likelihood P(E∣H) of the observed data for each
hypothesis.
5. Apply Bayes' Theorem: Use Bayes' theorem to update the prior probabilities and compute
the posterior probabilities P(H∣E) for each hypothesis.
6. Select the Best Hypothesis: Choose the hypothesis with the highest posterior probability
as the best explanation for the observed data.
Example: Concept Learning with Bayesian Approach
Suppose you are trying to determine whether an email is spam or not (binary classification). The
hypotheses are:
 H1: The email is spam.
 H2: The email is not spam.
1. Define the Hypothesis Space:H1 and H2.
2. Specify the Prior Probabilities:
 P(H1)=0.4 (prior probability of an email being spam).
 P(H2)=0.6 (prior probability of an email not being spam).
3. Collect Evidence: Evidence E could be the presence of certain words, phrases, or features in
the email.
4. Compute the Likelihood:
 P(E∣H1) is the likelihood of observing the evidence given the email is spam.
 P(E∣H2) is the likelihood of observing the evidence given the email is not spam.
5. Apply Bayes' Theorem: Calculate the posterior probabilities
P(H1∣E) = P(E∣H1)·P(H1) / P(E)
P(H2∣E) = P(E∣H2)·P(H2) / P(E)
6. Select the Best Hypothesis:
 Compare P(H1∣E) and P(H2∣E).
 If P(H1∣E)>P(H2∣E), classify the email as spam; otherwise, classify it as not spam.
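A small sketch of steps 5 and 6 is given below. The priors are the ones from the example; the likelihood values are hypothetical placeholders, since in practice they would come from a model of the email's words and features.

# Bayesian comparison of two hypotheses for one email (illustrative values)
P_H1, P_H2 = 0.4, 0.6        # priors: spam, not spam (from the example)
P_E_given_H1 = 0.08          # hypothetical likelihood of the evidence if spam
P_E_given_H2 = 0.01          # hypothetical likelihood of the evidence if not spam

# Marginal likelihood P(E) by the law of total probability
P_E = P_E_given_H1 * P_H1 + P_E_given_H2 * P_H2

post_H1 = P_E_given_H1 * P_H1 / P_E
post_H2 = P_E_given_H2 * P_H2 / P_E
print(round(post_H1, 3), round(post_H2, 3))
print("spam" if post_H1 > post_H2 else "not spam")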
Features of Bayesian Learning Methods
Bayesian learning methods are a set of approaches in machine learning and statistics that are based
on Bayes' theorem. Here are some key features of Bayesian learning methods:
1. Probabilistic Framework:
 Bayesian methods treat both model parameters and predictions as random variables and
express uncertainty using probability distributions.
 This allows for a more principled way to incorporate prior knowledge and update beliefs
as new data becomes available.
 Example: In medical diagnosis, Bayesian networks are used to model the probabilistic
relationships between symptoms and diseases. By updating probabilities based on
observed symptoms (data), the network can provide a probabilistic assessment of the
likelihood of different diseases.
2. Prior Knowledge:
 They allow the incorporation of prior beliefs about parameters before observing data.
 This prior information can be based on previous knowledge, domain expertise, or data
from similar tasks.
 Example: In spam email classification, Bayesian methods can incorporate prior
knowledge about typical words or patterns found in spam emails versus legitimate
emails. This prior information helps improve the accuracy of classifying new emails as
spam or not spam.
3. Posterior Inference:
 After observing data, Bayesian methods compute the posterior distribution over
parameters.
 This distribution reflects updated beliefs about the parameters given the observed data,
combining prior beliefs with the likelihood of the data.
4. Flexibility in Model Complexity:
 Bayesian methods can handle complex models and provide a framework for model
selection by comparing the marginal likelihood (evidence) of different models.
 Example: Bayesian non-parametric models, such as Gaussian Processes, allow for
flexible modeling of complex relationships without specifying the exact functional
form of the relationship between variables. This flexibility is particularly useful in
scenarios where the underlying data generating process is not well understood.
5. Regularization: Bayesian inference naturally provides regularization by integrating over the
parameter space, which helps prevent overfitting and improves generalization to unseen data.
6. Sequential Learning: They support sequential updating of beliefs as new data points become
available, making them suitable for online learning scenarios.
7. Uncertainty Quantification: Bayesian methods provide a direct way to quantify uncertainty
in predictions, which is useful in decision-making processes that require risk assessment.
8. Complex Inference: In some cases, Bayesian inference can be computationally intensive,
especially when dealing with high-dimensional data or complex models. However, advances
in computational techniques such as Markov chain Monte Carlo (MCMC) and variational
inference have helped mitigate these challenges.
Overall, Bayesian learning methods offer a powerful framework for reasoning under uncertainty
and integrating prior knowledge with observed data, making them widely applicable in various
domains such as healthcare, finance, and natural language processing.
Density estimation in Bayesian Learning
 Density estimation involves estimating the probability distribution that generated a given
dataset.
 In principle, all ML tasks can be solved if the data-generating probability distributions are
identified.
 Thus, distribution estimation is the most general approach to ML.
 However, distribution estimation is hard without prior knowledge, as in non-parametric
methods.
 In Bayesian learning, this involves using Bayes' theorem to update the probability
distribution of the data given a model and prior information.
 This technique is particularly useful in scenarios where the true distribution is unknown,
and it provides a framework for making inferences about the data.
 Bayesian density estimation provides a full distribution over possible data points, allowing
for uncertainty quantification and more robust decision-making.
Objects and Needs of Density estimation
1. Parameters: Unknown quantities in the model that need to be estimated.
2. Prior Distribution: Represents our beliefs about the parameters before observing the data.
3. Posterior Distribution: Updated beliefs about the parameters after observing the data
4. Estimating the complete probability distribution is usually computationally infeasible, so
practitioners often settle for summary statistics like the mean or mode.
5. Estimating the full density function can be very challenging, so often, the focus is on obtaining
a point estimate, such as the mean.
6. Understanding Data: Density estimation helps us gain insights into the data by providing a
probabilistic description of its structure. It allows us to identify patterns, modes, clusters, and
outliers within the data.
7. Flexibility: Ability to model complex distributions and relationships in the data
8. Data Visualization: Probability density functions can be used to create visual representations
of the data distribution, such as histograms, kernel density plots, or probability density plots.
These visualizations aid in data exploration and presentation.
9. Data Modeling: Density estimation is often a crucial step in statistical modeling. It provides
the foundation for many statistical techniques, such as Bayesian inference, maximum
likelihood estimation, and hypothesis testing.
10. Data Compression: In data compression techniques, such as those based on minimum
description length (MDL) principles and Gaussian Mixture Models (GMMs), density estimation
helps capture essential features of the data distribution, reducing data dimensionality while
retaining important information.
Gaussian (Normal) Distribution
 The Gaussian, or Normal distribution, is one of the most fundamental and widely used
probability distributions in statistics and many other fields. It is essential for understanding
a range of statistical concepts and methods.
 The Gaussian distribution describes a continuous probability distribution for a random
variable X. It is characterized by its bell-shaped curve, symmetric about the mean.
 The probability density function (pdf) of the Gaussian distribution is given by:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where:
μ is the mean of the distribution.
σ² is the variance of the distribution.
σ is the standard deviation, which is the square root of the variance.
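As a quick check of the formula, the pdf can be evaluated directly; the plain-Python sketch below avoids extra dependencies (scipy.stats.norm would give the same values).

import math

def gaussian_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Standard normal (mu = 0, sigma = 1): the density at the mean is about 0.3989
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))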
Key Properties of Gaussian (Normal) Distribution
1. Symmetry:
 The Gaussian distribution is symmetric about its mean μ
 The mean, median, and mode of the distribution are all equal and located at μ
2. Shape:
 The bell shape of the distribution is determined by the standard deviation σ
 A larger σ results in a wider, flatter curve, while a smaller σ results in a steeper, narrower
curve.
3. 68-95-99.7 Rule:
 Approximately 68% of the data falls within one standard deviation of the mean (μ±σ).
 Approximately 95% of the data falls within two standard deviations of the mean (μ±2σ).
 Approximately 99.7% of the data falls within three standard deviations of the mean
(μ±3σ).
4. Moment Generating Function:
 The moments of the Gaussian distribution (mean, variance, skewness, and kurtosis) can
be derived from its moment generating function (MGF).
5. Standard Normal Distribution:
 A special case of the Gaussian distribution where μ = 0 and σ² = 1.
 The pdf of the standard normal distribution is: φ(z) = (1/√(2π)) · exp(−z²/2)
 Any normal random variable X can be standardized to Z using Z = (X − μ)/σ.
Applications of Gaussian (Normal) Distribution
1. Statistical Inference: Many statistical tests and confidence intervals are based on the
assumption that the data follows a normal distribution.
2. Central Limit Theorem: The theorem states that the sum (or average) of a large number
of independent and identically distributed random variables tends to follow a normal
distribution, regardless of the original distribution of the variables.
3. Real-World Phenomena: Many natural and human-made phenomena are approximately
normally distributed, such as heights, test scores, and measurement errors.
4. Regression Analysis: In linear regression, the assumption of normally distributed errors
is crucial for hypothesis testing and constructing confidence intervals.
5. Finance and Economics: The Gaussian distribution is used in modeling asset returns and
in the Black-Scholes option pricing model.
Techniques for Density Estimation in Bayesian Learning
Bayesian techniques for density estimation leverage prior knowledge and observed data to infer
the underlying distribution of the data. These techniques can be broadly classified into parametric
and non-parametric methods.
I) Parametric Methods
 Assume that the data is generated from a distribution with a fixed set of parameters.
 Use Bayesian inference to estimate these parameters.
 Examples: Maximum Likelihood (ML) and Least-Squares (LS) Error Hypotheses,
Gaussian Mixture Models (GMM)
a) Maximum Likelihood (ML) Hypothesis:
 Hypothesis: The best estimate of the parameters is the one that maximizes the likelihood
of the observed data.
 Procedure:
1. Define the likelihood function based on the chosen model.
2. Find the parameter values that maximize this likelihood function.
θ_ML = argmax over θ of P(D ∣ θ)
where D represents the observed data and θ represents the model parameters.
 This approach does not incorporate prior beliefs about the parameters and focuses solely
on maximizing the likelihood function.
Application: Widely used in many statistical models, including those for density estimation.
Advantages:
 Provides consistent and efficient estimates under certain conditions.
 Directly related to the probabilistic model.
Example:
In a linear regression model with Gaussian noise, the ML estimate of the parameters θ (which
includes the intercept and slope coefficients) is found by maximizing the likelihood of the observed
data given these parameters.
Maximum Likelihood (ML) for Predicting Probabilities
In Bayesian learning, ML can be adapted to predict probabilities by focusing on estimating the
parameters of the probability distribution that generated the data.
1. Model Selection: Choose a parametric model for the probability distribution (e.g., Gaussian,
Poisson).
2. Likelihood Function: Define the likelihood function for the chosen model based on the
observed data.
3. Parameter Estimation: Use ML to estimate the parameters of the model by maximizing the
likelihood function.
4. Probability Prediction: Once the parameters are estimated, use the model to predict
probabilities for new data points.
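The following small Python sketch illustrates these four steps for an assumed Gaussian model on made-up data: the ML estimates of μ and σ² reduce to the sample mean and the (biased) sample variance, which are then used to score new points.

```python
# A hedged sketch, assuming a Gaussian model and invented observations:
# ML estimates of mu and sigma^2 are the sample mean and sample variance.
import numpy as np
from scipy.stats import norm

data = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0])   # hypothetical observations

mu_ml = data.mean()                # ML estimate of the mean
var_ml = data.var()                # ML estimate of the variance (divides by n)
sigma_ml = np.sqrt(var_ml)

# Step 4: use the fitted model to predict densities for new points
new_points = np.array([4.8, 5.5])
densities = norm.pdf(new_points, loc=mu_ml, scale=sigma_ml)
print(mu_ml, var_ml, densities)
```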
b) Least Squares (LS) Error Hypothesis
 Hypothesis: The best estimate of the parameters minimizes the sum of squared differences
between observed and predicted values.
 This approach is often used in regression analysis and corresponds to the ML estimate
when the errors are assumed to be normally distributed.
 Procedure:
1. Define the residuals as the difference between observed and predicted values.
2. Minimize the sum of squared residuals.
θ_LS = argmin over θ of Σᵢ (yᵢ − ŷᵢ)²
where yᵢ are the observed values and ŷᵢ are the predicted values given the parameters θ.
Application: Commonly used in regression problems and can be applied in density estimation
contexts where the model can be expressed in a regression framework.
Advantages:
 Simple and intuitive.
 Works well with linear models and Gaussian errors.
Example: In a simple linear regression model, the LS estimate of the parameters is
obtained by minimizing the sum of squared differences between the actual values yᵢ and the
predicted values ŷᵢ.
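As a brief illustration on invented data, the sketch below computes the LS estimate of a simple linear model with NumPy's least-squares solver:

```python
# A minimal sketch on made-up data: the intercept and slope are chosen to
# minimize the sum of squared residuals ||X theta - y||^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])           # hypothetical observations

X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares solution
intercept, slope = theta

residuals = y - X @ theta
print(intercept, slope, np.sum(residuals**2))      # fitted parameters and SSE
```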
II) Non-Parametric Methods
 Do not assume a fixed parametric form for the distribution.
 Use flexible models that can adapt to the shape of the data.
 Examples: Nearest Neighbor Density Estimation, Kernel Density Estimation (KDE),
Gaussian Processes (GP)
a) Nearest Neighbor Density Estimation
 Estimates the density at each point by considering the distance to its k-th nearest neighbor.
 The density estimate at a point is inversely proportional to the volume of the region
containing the k nearest neighbors.
p(x) ≈ k / (n · V(x))
where V(x) is the volume of the hypersphere centered at x that contains the k nearest data points and n is the total number of data points.
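A rough one-dimensional sketch of this idea (the data and the value of k are illustrative) is:

```python
# 1-D k-nearest-neighbour density sketch: the estimate at x is k / (n * V(x)),
# where V(x) is the length of the interval covering the k nearest samples.
import numpy as np

def knn_density(x, data, k=3):
    dists = np.sort(np.abs(data - x))   # distances to all samples
    r = dists[k - 1]                     # radius to the k-th nearest neighbour
    volume = 2.0 * r                     # 1-D "volume" of the interval [x-r, x+r]
    return k / (len(data) * volume)

data = np.array([1.0, 1.2, 1.4, 3.0, 3.1, 3.3, 5.0])
print(knn_density(1.3, data), knn_density(4.0, data))
```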
b) Kernel Density Estimation (KDE)
 A non-parametric way to estimate the probability density function of a random variable.
 Uses a kernel function (e.g., Gaussian) placed at each data point to estimate the density.
p(x) ≈ (1 / (n·h)) · Σᵢ K((x − xᵢ) / h)
where K is the kernel function, h is the bandwidth, n is the number of data points, and xᵢ are the observed data points.
 The bandwidth parameter controls the smoothness of the resulting density estimate.
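For concreteness, a minimal sketch of KDE with a Gaussian kernel, following the formula above on made-up data:

```python
# Kernel density estimation sketch: p(x) ~ (1 / (n h)) * sum_i K((x - x_i) / h),
# using a Gaussian kernel; the data and bandwidth are illustrative.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h=0.5):
    u = (x - data) / h
    return gaussian_kernel(u).sum() / (len(data) * h)

data = np.array([1.0, 1.3, 2.8, 3.1, 3.2])
for x in (1.0, 2.0, 3.0):
    print(x, kde(x, data, h=0.5))   # smaller h -> spikier, larger h -> smoother
```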
Minimum Description Length (MDL) Principle
 The Minimum Description Length (MDL) principle is a method for model selection and
complexity control in statistical learning.
 It is closely related to Bayesian learning and information theory.
 The core idea of MDL is to choose the model that provides the best compression of the
data, balancing model complexity and goodness of fit.
Core Concept
 The MDL principle is based on the idea that the best model for a given set of data is the
one that minimizes the total length of encoding the model and the data.
 In other words, it seeks the model that provides the most compact representation of the
data.
Description Length
The description length consists of two parts:
1. Model Description Length: The length of the code required to describe the model
parameters.
2. Data Description Length: The length of the code required to describe the data given the
model.
Formally, the total description length L can be expressed as:
L = L(M) + L(D ∣ M)
where L(M) is the model description length and L(D ∣ M) is the data description length given the model M.
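As a toy illustration only (the coding scheme below is an assumption for demonstration, not a canonical MDL code), one can compare polynomial models on synthetic data by a crude two-part description length:

```python
# Toy two-part MDL sketch: total length = model cost + data cost. The model cost
# is a fixed number of bits per parameter; the data cost is an approximate
# negative log2-likelihood of the residuals. Both choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)   # truly linear data

def description_length(degree, bits_per_param=16.0):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)
    model_cost = bits_per_param * (degree + 1)
    # Relative score; for continuous data the "bits" can even go negative.
    data_cost = 0.5 * len(x) * np.log2(2 * np.pi * np.e * sigma2)
    return model_cost + data_cost

for d in (1, 3, 7):
    print(d, description_length(d))   # the simplest adequate model should score lowest
```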
MDL and Bayesian Inference
MDL is closely related to Bayesian inference. In fact, the MDL principle can be viewed as a
practical implementation of Bayesian model selection.
 Model Description Length (Prior): In Bayesian terms, this corresponds to the prior
probability of the model. A simpler model has a higher prior probability and thus a shorter
description length.
 Data Description Length (Likelihood): This corresponds to the likelihood of the data
given the model. A model that fits the data well will have a higher likelihood, leading to a
shorter description length.
In Bayesian learning, we typically maximize the posterior probability:
P(M ∣ D) ∝ P(D ∣ M) · P(M)
In MDL, we minimize the description length:
L(M) + L(D ∣ M) = −log₂ P(M) − log₂ P(D ∣ M)
These two approaches are fundamentally similar, as both aim to balance model complexity (prior)
and data fit (likelihood).
Applications in Bayesian Learning
1. Model Selection: MDL is used to select the model that provides the best trade-off between
complexity and fit. This is particularly useful when comparing different models or
hypotheses.
2. Regularization: By penalizing complex models, MDL naturally implements
regularization, helping to avoid overfitting.
3. Clustering: In clustering algorithms like Gaussian Mixture Models (GMMs), MDL can
be used to determine the optimal number of clusters.
4. Compression: MDL can be directly applied in data compression tasks, where the goal is
to compress data efficiently.
Naive Bayes Classifier
 The Naive Bayes classifier is a simple yet powerful probabilistic classifier based on Bayes'
theorem.
 It's called "naive" because it assumes that all features are independent of each other given
the class label.
 This assumption simplifies the computation but may not always hold in practice.
Bayes' Theorem
P(C ∣ X) = ( P(X ∣ C) · P(C) ) / P(X)
where:
 P(C∣X) is the posterior probability of class C given the features X.
 P(X∣C) is the likelihood of the features X given the class C.
 P(C) is the prior probability of class C.
 P(X) is the marginal likelihood of the features X.
Naive Bayes Assumption
The naive Bayes classifier assumes that the features are conditionally independent given the class.
This simplifies the likelihood term P(X∣C) as:
P(X ∣ C) = P(X₁ ∣ C) · P(X₂ ∣ C) · … · P(Xₙ ∣ C) = Πᵢ P(Xᵢ ∣ C)
where Xᵢ are the individual features.
Classification Rule
The Naive Bayes classification rule is to choose the class C that maximizes the posterior
probability:
C* = argmax over C of P(C ∣ X)
Given the independence assumption, this can be written as:
C* = argmax over C of P(C) · Πᵢ P(Xᵢ ∣ C)
Steps to Build a Naive Bayes Classifier
1. Convert Dataset into Frequency Tables: Count the occurrences of each feature for each
class.
2. Generate Likelihood Table: Calculate the probabilities of each feature given each class.
3. Use Bayes' Theorem: Calculate the posterior probability for each class given a new
instance and choose the class with the highest probability.
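A compact sketch of these three steps on an invented weather/play dataset (loosely mirroring the classic play-tennis example) is shown below:

```python
# Sketch of the three steps on made-up (Outlook, Play) records:
# frequency tables -> likelihoods -> posterior comparison via Bayes' theorem.
from collections import Counter, defaultdict

data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

# Step 1: frequency tables
class_counts = Counter(cls for _, cls in data)
feature_counts = defaultdict(Counter)
for outlook, played in data:
    feature_counts[played][outlook] += 1

# Steps 2-3: likelihood * prior for each class; P(X) cancels when comparing classes
def score(outlook, cls):
    prior = class_counts[cls] / len(data)
    likelihood = feature_counts[cls][outlook] / class_counts[cls]
    return prior * likelihood

for cls in ("Yes", "No"):
    print(cls, score("Sunny", cls))
print("Prediction:", max(("Yes", "No"), key=lambda c: score("Sunny", c)))
```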
An Illustrative Example: Naive Bayes Classifiers
 As the calculation shows, P(Yes | Sunny) > P(No | Sunny).
 Hence, on a Sunny day, the player can play the game.
Advantages of Naive Bayes Classifier
1. Simplicity: The Naive Bayes classifier is simple to understand and implement.
2. Speed: It is computationally efficient, requiring less training time compared to other
classifiers, especially with large datasets.
3. Scalability: It performs well with a large number of features and instances.
4. Robust to Irrelevant Features: Since the model assumes independence between features,
it can perform well even when irrelevant features are present.
5. Effective with Small Data: Performs well with relatively small amounts of training data.
6. Probability Outputs: Provides probabilistic outputs which are useful in many applications
such as risk assessment and decision-making.
Disadvantages of Naive Bayes Classifier
1. Assumption of Independence: The main limitation is the naive assumption of feature
independence, which is often not true in real-world data.
2. Zero Probability: If a categorical variable's category was not observed in the training data,
the model assigns it a zero probability, which can be problematic. This can be mitigated by
techniques such as Laplace smoothing (see the sketch after this list).
3. Data Scarcity: If the training data for a particular class is scarce, the probability estimates
can be unreliable.
4. Continuous Features: Assumption of a normal distribution for continuous features can be
limiting and sometimes incorrect.
5. Sensitivity to Irrelevant Features: While robust to irrelevant features, the presence of too
many of them can still degrade performance.
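As referenced in the "Zero Probability" point above, a brief sketch of Laplace (add-one) smoothing on hypothetical counts:

```python
# Laplace (add-one) smoothing sketch: adding alpha to every count keeps an
# unseen category from collapsing the whole posterior to zero.
def smoothed_likelihood(count_feature_and_class, count_class, n_categories, alpha=1.0):
    return (count_feature_and_class + alpha) / (count_class + alpha * n_categories)

# Hypothetical counts: the category was never seen with this class in training.
print(smoothed_likelihood(0, count_class=12, n_categories=3))   # > 0 instead of 0
print(smoothed_likelihood(5, count_class=12, n_categories=3))
```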
Applications of Naive Bayes Classifier
1. Text Classification:
o Spam Filtering: Distinguishing spam emails from non-spam (ham) emails.
o Sentiment Analysis: Determining the sentiment (positive, negative, neutral) from
text data such as reviews and social media posts.
o Document Categorization: Classifying news articles, research papers, and other
documents into predefined categories.
2. Medical Diagnosis: Predicting the likelihood of a disease based on patient symptoms and
medical history.
3. Recommendation Systems: Suggesting products, movies, or other items to users based on
their past behavior and preferences.
4. Anomaly Detection: Identifying unusual patterns that do not conform to expected
behavior, such as fraud detection in financial transactions.
5. Real-Time Prediction: Due to its fast prediction capability, it is used in applications
requiring real-time responses such as recommendation engines and live spam detection.
6. Weather Prediction: Predicting weather conditions based on historical data and observed
weather patterns.
7. Predictive Maintenance: Predicting equipment failures in industries to plan maintenance
activities effectively.
8. Biological Data Classification: Classifying sequences or features in bioinformatics, such
as identifying protein functions based on sequence data.
Gaussian Naive Bayes Classifier
 In the Gaussian Naive Bayes classifier, it's assumed that the continuous features follow a
Gaussian (normal) distribution.
 Example: Predicting house prices where features like size, number of rooms, etc., are
continuous and may follow a Gaussian distribution.
 This means that for each class, the features are modeled using a Gaussian distribution with
class-specific means and variances.
 To classify a new sample, the model calculates the posterior probability for each class given
the sample’s features, using the Gaussian distribution parameters. It then assigns the
sample to the class with the highest posterior probability.
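A minimal sketch of this idea, with invented two-feature data and per-class means and standard deviations, is:

```python
# Gaussian Naive Bayes sketch on made-up data: fit a Gaussian (mean, std) per
# feature and per class, then score a new sample by prior * product of densities.
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],     # class 0
              [3.0, 4.1], [3.2, 3.9], [2.9, 4.3]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0))
    return params

def predict(x, params):
    scores = {c: prior * np.prod(norm.pdf(x, mu, sd))
              for c, (prior, mu, sd) in params.items()}
    return max(scores, key=scores.get)

params = fit(X, y)
print(predict(np.array([1.1, 2.0]), params), predict(np.array([3.1, 4.0]), params))
```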
An Illustrative Example: Gaussian Naive Bayes Classifier
Advantages of Gaussian Naive Bayes Classifier:
 Simple to implement with continuous data.
 Works well if the feature distributions approximate Gaussian.
Disadvantages of Gaussian Naive Bayes Classifier:
 The performance may degrade if the Gaussian assumption does not hold for the features.
Applications of the Gaussian Naive Bayes Classifier
1. Medical Diagnosis: Predicting the likelihood of a disease based on continuous medical
measurements (e.g., blood pressure, cholesterol levels).
2. Finance: Credit scoring where features such as income, debt, and age are continuous.
3. Weather Prediction: Forecasting weather conditions using continuous variables like
temperature, humidity, and wind speed.
4. Environmental Science: Predicting pollution levels based on continuous measurements such as
the concentration of pollutants in the air.
5. Image Processing: Classifying pixel intensities in grayscale images where the feature
distribution can be approximately Gaussian.
Comparison between Standard Naive Bayes Classifiers and Gaussian Naive Bayes Classifiers
 Data Type: Standard Naive Bayes uses discrete features (Multinomial) or binary features (Bernoulli); Gaussian Naive Bayes uses continuous features.
 Feature Distribution: In Standard Naive Bayes, Multinomial features are counts or frequencies and Bernoulli features are binary (presence/absence); in Gaussian Naive Bayes, features are assumed to follow a Gaussian (normal) distribution.
 Use Case: Standard Naive Bayes suits text classification (Multinomial) or binary/boolean classification tasks (Bernoulli); Gaussian Naive Bayes suits tasks where features are continuous and approximately normally distributed.
 Feature Independence: Both assume features are conditionally independent given the class.
 Advantages: Multinomial Naive Bayes is effective for count-based features like word counts, and Bernoulli Naive Bayes is simple for binary features; Gaussian Naive Bayes is simple and effective for continuous features if the Gaussian assumption holds.
 Disadvantages: Multinomial Naive Bayes is not suitable for non-count data, and Bernoulli Naive Bayes is limited to binary features; the performance of Gaussian Naive Bayes may degrade if feature distributions are not Gaussian.
 Example Applications: Standard Naive Bayes is used for text classification, spam detection, and document categorization; Gaussian Naive Bayes is used for predicting house prices and other classification tasks with continuous attributes.
Bayesian Belief Network (BBN)
 A Bayesian Belief Network (BBN), also known as a Bayesian network, Bayes net, or
belief network, is a powerful graphical model used to represent and analyze the
probabilistic relationships among a set of variables.
 It is particularly useful in domains where uncertainty and complex interactions between
variables are present.
Structure of Bayesian Belief Networks
BBNs are structured as directed acyclic graphs (DAGs). A DAG is a graph consisting of vertices
(nodes) connected by directed edges, with the crucial property that it contains no cycles: starting
at any vertex and following the directed edges, you can never return to that same vertex.
In a BBN:
 Nodes represent random variables, which can be observable quantities, latent variables, or
even unknown parameters.
 Edges (or directed arrows) indicate the conditional dependencies between these variables.
If a directed edge goes from node A to node B, it implies that A has a direct influence on
B.
This structure allows for the representation of joint probability distributions through the product
of conditional probabilities associated with each node, given its parent nodes. The absence of a
direct edge between two nodes signifies that they are conditionally independent of each other,
given their respective parent nodes
Key Components
1. Conditional Probability Tables (CPTs): Each node in a Bayesian network has an
associated CPT that quantifies the effect of the parent nodes on the node itself. For
example, if a node has two parent nodes, the CPT will include probabilities for all
combinations of the parent states.
2. Joint Probability Distribution: The joint probability of all variables in the network can
be expressed as:
P(X₁, X₂, …, Xₙ) = Πᵢ P(Xᵢ ∣ Parents(Xᵢ))
where Xᵢ represents the variables in the network and Parents(Xᵢ) are the parent nodes of Xᵢ.
Example 1:
 Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary but also responds to minor earthquakes.
 Harry has two neighbors, John and Mary, who have taken the responsibility of informing
Harry at work when they hear the alarm.
 John always calls Harry when he hears the alarm, but he sometimes confuses the telephone
ringing with the alarm and calls then as well.
 Mary, on the other hand, likes to listen to loud music, so she sometimes misses the alarm.
Here we would like to compute probabilities of events in this burglary alarm network.
Calculation of the occurrence of an Event:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both John and Mary call.
That is, we want the joint probability of the events:
 John calls (J)
 Mary calls (M)
 The alarm rings (A)
 Earthquake doesn't happen (~E)
 Burglary doesn't happen (~B)
Using the chain rule over the network structure:
P(J, M, A, ~B, ~E) = P(J|A) · P(M|A) · P(A|~B, ~E) · P(~B) · P(~E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
≈ 0.00063
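The same product can be checked in a few lines of Python; the CPT values below are the commonly used textbook numbers for this alarm network and are assumed here rather than read from a table in these notes:

```python
# Joint probability for the alarm example as a product of CPT entries.
# Assumed values: P(B)=0.001, P(E)=0.002, P(A|~B,~E)=0.001, P(J|A)=0.90, P(M|A)=0.70.
p_b, p_e = 0.001, 0.002
p_a_given_nb_ne = 0.001
p_j_given_a, p_m_given_a = 0.90, 0.70

joint = p_j_given_a * p_m_given_a * p_a_given_nb_ne * (1 - p_b) * (1 - p_e)
print(round(joint, 5))   # ~0.00063 = P(J, M, A, ~B, ~E)
```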
Example 2:
Example 3:
Advantages of BBNs:
1. Uncertainty Handling: BBNs handle uncertainty and probabilistic reasoning, making
them useful for scenarios where outcomes are uncertain or incomplete.
2. Interpretable Relationships: The directed acyclic graph (DAG) structure provides a clear
and interpretable way to visualize and understand relationships between variables.
3. Flexibility: BBNs can model complex dependencies and conditional independencies
among variables, allowing for nuanced and detailed representations of real-world systems.
4. Inference and Querying: BBNs support efficient probabilistic inference, enabling the
calculation of posterior probabilities given evidence, which is useful for decision-making
and predictions.
5. Learning from Data: BBNs can be learned from data, allowing them to adapt to new
information and improve accuracy over time.
6. Causal Relationships: They can model causal relationships, providing insights into how
changes in one variable may affect others.
Disadvantages of BBNs:
1. Computational Complexity: Inference in BBNs can be computationally intensive,
especially in large networks with many variables and dependencies, potentially requiring
advanced algorithms or approximations.
2. Structure Learning Challenges: Learning the structure of a BBN from data can be
challenging and computationally expensive, particularly with large or high-dimensional
datasets.
3. Data Requirements: BBNs require sufficient and high-quality data to accurately estimate
probabilities and learn network parameters or structure.
4. Assumption of Independence: BBNs assume conditional independence between variables
given their parents, which may not always accurately reflect real-world dependencies.
5. Scalability Issues: For very large networks or those with complex dependencies, BBNs may
face scalability issues and require significant resources to manage and analyze.
6. Expert Knowledge Requirement: Building and interpreting a BBN often requires domain
expertise to define the network structure and understand the implications of the modeled
dependencies.
Applications of BBNs
Bayesian Belief Networks have a wide range of applications, including:
 Medical Diagnosis: BBNs can model the relationships between diseases and symptoms,
helping to infer the likelihood of a disease given observed symptoms.
 Spam Filtering: They are used to classify emails as spam or not based on various features
extracted from the email content.
 Gene Regulatory Networks: In bioinformatics, BBNs help model the interactions
between genes and their regulatory mechanisms.
 Decision Support Systems: BBNs assist in making informed decisions under uncertainty
by modeling the relationships between different factors affecting the decision
Expectation-Maximization (EM) Algorithm
 The Expectation-Maximization (EM) algorithm is a powerful statistical technique used
for finding maximum likelihood estimates of parameters in models that involve unobserved
latent variables.
 It is particularly useful in situations where the data is incomplete or has missing values.
Algorithm:
1. Initialization: Choose initial values for the model parameters (e.g., means, variances, and
mixing weights), either randomly or from a simple heuristic such as k-means.
2. Expectation Step (E-step): Using the current parameter estimates, compute the expected
values of the latent variables, i.e., the posterior probability (responsibility) of each latent
configuration given the observed data.
3. Maximization Step (M-step): Re-estimate the model parameters by maximizing the expected
complete-data log-likelihood computed in the E-step.
4. Convergence: Repeat the E-step and M-step until the change in the log-likelihood (or in the
parameters) falls below a chosen threshold, or a maximum number of iterations is reached.
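A condensed sketch of this loop for a two-component one-dimensional Gaussian mixture on synthetic data (the initial values and stopping rule are illustrative choices) is:

```python
# E-step / M-step loop for a 2-component 1-D Gaussian mixture on synthetic data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities r[i, k] = P(component k | x_i, current parameters)
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights, means and standard deviations
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # Convergence: stop when the log-likelihood barely improves
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(pi.round(2), mu.round(2), sigma.round(2))
```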
Advantages of the EM Algorithm
1. Handles Incomplete Data: EM can effectively handle datasets with missing or incomplete
data by treating missing values as latent variables and iteratively estimating them.
2. Flexibility: It can be applied to a variety of statistical models, including Gaussian Mixture
Models, Hidden Markov Models, and more.
3. Iterative Refinement: The iterative process of E-step and M-step allows for refined
parameter estimation, improving as more data or iterations are processed.
4. Convergence: Each iteration of the EM algorithm guarantees that the likelihood of the
observed data will not decrease, leading to convergence towards a local maximum.
5. General Framework: The EM algorithm provides a general framework that can be adapted
to different models by modifying the E-step and M-step accordingly.
Disadvantages of the EM Algorithm
1. Local Optima: EM often converges to a local maximum of the likelihood function, which
may not be the global maximum.
2. Computational Complexity: For large datasets or complex models, the computation can
be intensive, especially if the model involves high-dimensional latent variables.
3. Initialization Sensitivity: The results can be sensitive to the choice of initial parameter
values. Poor initialization can lead to suboptimal solutions.
4. Slow Convergence: In some cases, especially with complex models or poor initialization,
the algorithm may converge slowly, requiring many iterations.
5. Missing-Data Assumptions: While EM can handle missing data, it relies on the
assumption that the data are missing at random (MAR) or missing completely at
random (MCAR).
Gaussian Mixture Models (GMM)
 Gaussian Mixture Models (GMM) combined with the Expectation-Maximization (EM)
algorithm is a powerful method for clustering data.
 A GMM is a function composed of several Gaussians, each identified by k ∈ {1, …, K},
where K is the number of clusters in our data set. Each Gaussian k in the mixture is described
by the following parameters:
 A mean μₖ that defines its center.
 A covariance Σₖ that defines its width. This would be equivalent to the dimensions of
an ellipsoid in a multivariate scenario.
 A mixing probability πₖ that defines how big or small the Gaussian function will be.
 GMM can model clusters of various shapes by adjusting the covariance matrix of each
Gaussian component.
 Since GMM uses probabilistic assignments, it is robust to noise and outliers in the data.
 Unlike hard clustering methods like k-means, GMM assigns probabilities to each data point
belonging to each cluster, allowing for soft clustering and uncertainty quantification.
The model is defined by the mixing weights πₖ, the means μₖ, and the covariances Σₖ, which together give the mixture density
p(x) = Σₖ πₖ · N(x ∣ μₖ, Σₖ), with Σₖ πₖ = 1.
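As a brief, hedged illustration, scikit-learn's GaussianMixture (which fits the πₖ, μₖ, Σₖ parameters with EM) can be used to obtain soft cluster assignments on synthetic data:

```python
# GMM clustering sketch with scikit-learn on synthetic two-cluster data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([6, 4], 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print("mixing weights:", gmm.weights_.round(2))
print("means:", gmm.means_.round(2))
# Soft assignment: probabilities of membership rather than a hard label
print("soft assignment of one point:", gmm.predict_proba([[3.0, 2.0]]).round(2))
```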
Advantages of GMM and EM for Clustering
 Flexible Clustering: GMM can model data with varying cluster shapes and sizes due to
the use of covariance matrices.
 Soft Clustering: Unlike hard clustering methods like k-means, GMM assigns probabilities
to data points, allowing for soft assignments to clusters.
 Handling Overlapping Clusters: GMM can effectively handle situations where clusters
overlap by using probabilistic assignments.
Applications of the EM Algorithm
1. Clustering: EM is widely used to fit GMMs for clustering data into different groups based
on the underlying distribution of features.
2. Dimensionality Reduction: EM can be used to estimate parameters in factor analysis
models, which are used to reduce dimensionality by identifying latent variables.
3. Image Processing: EM is used in image processing tasks like segmentation, where it helps
identify different regions or objects within an image.
4. Bioinformatics: EM can be applied to analyze gene expression data, particularly when
dealing with incomplete or noisy measurements.
5. Finance: EM is used to estimate parameters in financial models, such as those involving
latent variables related to market risk or credit risk.
6. Recommendation Systems: EM can be used to model user preferences and item attributes
in recommendation systems, improving predictions of user ratings or preferences.
***********

Artificial Neural Networks and Bayesian Learning

  • 1.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 1 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching ELH -4.2: MACHINE LEARNING UNIT –II Artificial Neural Networks(ANNs) and Bayesian Learning INTRODUCTION Biological Neural Networks (BNNs)  Biological neural networks refer to the network of neurons (nerve cells) in biological organisms that process and transmit information.  These networks are the foundation of the brain and the nervous system in animals. Structure of Neurons
  • 2.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 2 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Key Concepts 1. Neurons: The basic unit of the nervous system, a neuron is a specialized cell that transmits electrical and chemical signals. A typical neuron has dendrites (inputs), a cell body (soma), and an axon (output). 2. Synapses: The junctions where neurons communicate with each other through neurotransmitters. Synapses can be excitatory or inhibitory, influencing the likelihood of the receiving neuron firing an action potential. 3. Action Potential: A brief electrical charge that travels down the axon of a neuron when it is activated. This is the fundamental mechanism for neural communication. 4. Plasticity: The ability of the neural network to change and adapt over time through experience. Synaptic plasticity, for example, refers to the strengthening or weakening of synapses based on activity levels. 5. Neurotransmitters: Chemical messengers released by neurons to transmit signals across synapses. Examples include dopamine, serotonin, and acetylcholine. Functioning of Biological Neural Networks  Information Processing: Neurons receive signals through dendrites, process these signals in the soma, and transmit them via the axon to other neurons.  Signal Integration: Neurons integrate multiple synaptic inputs to determine whether to generate an action potential.  Learning and Memory: Changes in synaptic strength, known as synaptic plasticity, underlie learning and memory processes. Artificial Neural Networks (ANNs)  Artificial neural networks are computational models designed to simulate the way biological neural networks process information.  ANNs are a cornerstone of machine learning and artificial intelligence.
  • 3.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 3 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Structure of Artificial Neurons (Perceptrons) Key Concepts 1. Artificial Neurons (Perceptrons): The basic units of ANNs that mimic the behavior of biological neurons. They receive inputs, apply weights, and produce an output through an activation function. Multilayer Layer Perceptron(MLP)
  • 4.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 4 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching 2. Layers: o Input Layer: The layer that receives the initial data. o Hidden Layers: Intermediate layers where the actual processing is done through neurons. o Output Layer: The layer that produces the final output. 3. Weights and Biases: Parameters that determine the strength of the connection between neurons. Weights are adjusted during training to minimize the error in predictions. 4. Activation Function: A non-linear function applied to the neuron’s output to enable the network to learn complex patterns. Common activation functions include Step, Sign and Sigmoid function. 5. Training: The process of learning by adjusting weights using algorithms such as backpropagation and gradient descent to minimize error. 6. Learning Rate: A hyperparameter that controls the adjustment of weights during training. Functioning of Artificial Neural Networks  Feedforward: Data flows from the input layer to the output layer, passing through hidden layers.  Backpropagation: The training process where the error is propagated backward to update the weights.  Learning: The network learns by adjusting weights to minimize the difference between the predicted output and the actual output.
  • 5.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 5 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Neural Network Representations (Mapping between BNN&ANN) Artificial Neural Networks (ANNs) are designed to simulate the way Biological Neural Networks (BNNs) process information. Understanding the similarities and differences between BNNs and ANNs helps in appreciating the computational models used in machine learning. Biological Neural Network(BNN) Artificial Neural Network(ANN) Dendrites Inputs Cell nucleus Nodes Synapse Weights Axon Output Mapping 1. Neurons to Artificial Neurons: o Biological Neurons: Receive input signals through dendrites, integrate them in the soma, and transmit through the axon. o Artificial Neurons: Receive inputs, compute a weighted sum plus a bias, apply an activation function, and produce an output.
  • 6.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 6 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching 2. Synaptic Connections to Weights: o Synapses: Chemical junctions where neurotransmitters transmit signals, with the strength of transmission modifiable. o Weights: Numerical values that determine the strength of connections between artificial neurons, adjusted during training. 3. Action Potential to Activation Function: o Action Potential: A spike in electrical activity when a neuron's threshold is reached. o Activation Function: A mathematical function that decides whether a neuron should be activated, introducing non-linearity (e.g., sigmoid function). 4. Plasticity to Training: o Plasticity: The brain's ability to change and adapt by altering synaptic strengths. o Training: The process of adjusting weights and biases in an ANN to minimize prediction error, simulating learning. Comparison of BNNs and ANNs Aspect Biological Neurons (BNNs) Artificial Neurons (ANNs) Inputs Dendrites receiving signals from other neurons Inputs xi from previous layer or external data Integration Soma integrates signals and generates electrical activity Compute weighted sum z=∑wi xi+b Transmission Axon transmits electrical signal to other neurons Apply activation function ϕ(z) Output Axon terminals release neurotransmitters at synapses Output passed to next layer or used as final prediction Synaptic Connections Chemical junctions where neurotransmitters transmit signals Weights wi determine connection strength Action Potential Electrical spike when neuron's threshold is reached Activation function introduces non- linearity Plasticity Adaptation by altering synaptic strengths Training adjusts weights and biases Learning Mechanism Experience-driven synaptic changes Algorithm-driven weight updates (backpropagation, gradient descent)
  • 7.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 7 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Characteristics of Artificial Neural Networks ANNs exhibit several key characteristics that contribute to their versatility and effectiveness in various applications. Here are some of the main characteristics: 1. Non-linearity  ANNs can model complex non-linear relationships between inputs and outputs.  This allows them to capture intricate patterns in data that linear models cannot. 2. Adaptivity  ANNs can adapt to new data through learning.  This makes them flexible and capable of improving their performance over time as they are exposed to more data. 3. Parallel Processing  The computations within an ANN are distributed across multiple neurons.  This parallelism enables efficient processing of large-scale data and complex models. 4. Fault Tolerance  ANNs can tolerate errors or noise in the input data and still produce reasonable outputs.  This robustness makes them reliable for real-world applications where data can be imperfect. 5. Generalization  ANNs can generalize from the training data to unseen data.  This ability to perform well on new, unseen data is crucial for tasks like classification and prediction. 6. Learning Capability  ANNs learn from examples through training algorithms such as backpropagation.  This enables them to solve a wide range of problems by adjusting weights based on training data.
  • 8.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 8 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching 7. Distributed Memory  Knowledge in ANNs is stored in the weights and connections between neurons.  This distributed representation means that no single part of the network solely determines the output, enhancing fault tolerance. 8. Data-Driven Approach  ANNs rely on large amounts of data to learn and improve.  The data-driven nature allows them to uncover patterns and insights that are not explicitly programmed. 9. Scalability  ANNs can be scaled up to handle very large datasets and complex models by adding more neurons and layers.  This scalability is essential for tackling big data and intricate tasks. 10. Dynamic Nature  ANNs can be designed to handle dynamic, time-varying inputs.  This makes them suitable for tasks like time-series forecasting and real-time processing. 11. Layered Structure  ANNs consist of multiple layers (input, hidden, output), each performing different transformations on the data.  This layered architecture allows for hierarchical feature extraction and representation. 12. Connectivity  The neurons within an ANN are interconnected, forming a network.  The pattern of these connections influences the network's ability to learn and represent complex functions.
  • 9.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 9 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Appropriate Problems for Neural Network Learning ANNs are particularly well-suited to a variety of problems where the training data is complex, noisy, or high-dimensional, especially those characterized by the following features: 1. Instances are represented by many attribute-value pairs ANNs handle high-dimensional data well, where each instance is described by multiple features. Examples:  Image Recognition: Each pixel in an image represents an attribute.  Financial Data Analysis: Each financial metric (e.g., price, volume, ratios) is an attribute. 2. The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes ANNs can predict different types of outputs, including categorical, continuous, or a combination. Examples:  Classification: Identifying whether an email is spam or not (discrete-valued).  Regression: Predicting house prices (real-valued).  Multi-output Prediction: Predicting both the genre and the sentiment of a movie review (vector of discrete values). 3. The training examples may contain errors ANNs are robust to noisy data and can generalize well even when training data is imperfect. Examples:  Speech Recognition: Handling background noise in audio recordings.  Medical Diagnosis: Dealing with imperfect and noisy medical records.
  • 10.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 10 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching 4. Long training times are acceptable Training ANNs, especially deep networks, can be time-consuming, but they often yield high performance once trained. Examples:  Image Classification with Deep Learning: Training deep CNNs can take hours to days.  Natural Language Processing: Training language models like GPT can take several days to weeks. 5. Fast evaluation of the learned target function may be required Once trained, ANNs can make predictions quickly, making them suitable for real-time applications. Examples:  Autonomous Vehicles: Making split-second decisions based on sensor data.  Real-time Fraud Detection: Quickly identifying fraudulent transactions. 6. The ability of humans to understand the learned target function is not important ANNs are often considered "black boxes" due to their complexity, and interpretability might not be crucial in some applications. Examples:  Product Recommendations: The exact reasoning behind recommendations may not need to be explained to users, as long as the recommendations are effective.  Game Playing AI: How an AI like AlphaGo makes decisions may not need to be understood by players or developers.
  • 11.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 11 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Types of Architectures of Artificial Neural Networks ANNs have different architectures based on the arrangement of neurons, how they are interconnected, and the composition of their layers. Here are the main types of ANN architectures: I. Feedforward Architecture In feedforward networks, the signal flows in one direction, from input to output, without any feedback loops. This type can be further divided into: a) Single-Layer Perceptron  Consists of only one layer of output neurons.  The simplest form of neural network.  Suitable for linearly separable problems. b) Multilayer Perceptron (MLP)  Contains one or more hidden layers between the input and output layers.  Can solve non-linear problems.  Utilizes activation functions like Sigmoid, Tanh.
  • 12.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 12 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching c) Radial Basis Function (RBF) Network  Uses radial basis functions as activation functions.  Typically has one hidden layer.  Suitable for function approximation and classification. II. Feedback Architecture In feedback networks, signals can flow in both directions, incorporating loops. This allows the network to maintain a state or memory of previous inputs. Key types include: a) Recurrent Neural Network (RNN)  RNNs have connections that loop back, allowing them to maintain state or memory of previous time steps.
  • 13.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 13 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching  This enables them to process sequential data like time series, natural language, and speech.  In these networks, the outputs of the neurons are used as feedback inputs for other neurons. b) Fully Recurrent Network  This refers to a feedback architecture in which every neuron is connected to every other neuron, creating a densely interconnected network.  Here all nodes are connected to all other nodes and each node works as both input and output.
  • 14.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 14 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Perceptrons  A perceptron is a type of artificial neural network and one of the simplest forms of a neural network model.  It was first introduced by Frank Rosenblatt in 1957.  It is a simple linear classifier that can be used for supervised learning tasks, particularly binary classification problems. Structure A perceptron consists of a single layer of neurons (also known as nodes) connected to inputs and an output. Components of a Perceptron 1. Inputs (x):  The perceptron receives one or more input values, represented as a vector x= (x1, x2,...,xn)  These inputs are the features or attributes of the data points being processed. 2. Weights (w):  Each input xix_ixi is associated with a weight wiw_iwi, forming a weight vector w=(w1, w2,...,wn).  The weights determine the influence or strength of each input on the perceptron's output.
  • 15.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 15 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching 3. Bias (b):  The bias term b acts as an additional parameter that helps the perceptron model more complex decision boundaries.  It allows the activation function to shift left or right, enabling better fitting of the data. 4. Summation:  The perceptron computes the weighted sum of the inputs plus the bias.  This can be expressed as w⋅x+b, where w⋅x is the dot product of the weight and input vectors. 1 n i i i z w x b w x b        5. Activation Function:  The weighted sum is passed through an activation function to produce the output.  Typically, a step function is used, which outputs 1 if the weighted sum is greater than or equal to a specified threshold (often 0), and 0 otherwise.  Mathematically, the activation function f can be defined as: ) 0 0 ( 1 if z Output f z Otherwise      
  • 16.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 16 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Perceptron Learning Algorithm/Training Rule  The Perceptron Learning Algorithm (also known as the Perceptron Training Rule) is a simple algorithm used to train a single-layer perceptron for binary classification tasks.  It adjusts the weights and bias of the perceptron based on the error between its predictions and the actual target values.  The goal is to find the weights and bias that allow the perceptron to correctly classify the input data.
  • 17.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 17 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Example: AND Gate
  • 18.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 18 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Advantages of Perceptron Training Rule 1. Simplicity: Perceptrons are easy to understand and implement, making them a good starting point for beginners in neural network theory. 2. Convergence: When the data is linearly separable (i.e., it's possible to draw a straight line to separate two classes in a binary classification problem), the Perceptron Learning Rule guarantees convergence. It will find a solution if one exists. 3. Efficiency: In cases where the data is linearly separable, Perceptrons can converge quickly with a fixed learning rate. 4. Interpretability: Since Perceptrons are based on simple linear combinations and thresholding, the learned model's parameters (weights) are interpretable. we can understand the contribution of each input feature to the decision.
  • 19.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 19 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Disadvantages of Perceptron Training Rule 1. Limited to Linearly Separable Data: The perceptron can only solve problems where the data is linearly separable. It cannot handle cases where a linear decision boundary is insufficient, such as XOR problems. 2. Single-Layer Architecture: Perceptrons are limited to single-layer networks, which means they cannot model more complex relationships that require multiple layers of neurons, as seen in deep neural networks. 3. Noisy Data Handling: Perceptrons are sensitive to noise in the data. If the data is not perfectly linearly separable, the learning rule may not converge or may converge to an incorrect solution. 4. Binary Classification Limitation: The standard perceptron is designed for binary classification tasks. While it can be extended to handle multi-class classification (e.g., through the use of multiple perceptrons or a one-vs-all strategy), it is inherently limited to binary outputs in its basic form. Types of Classification Problems All kinds of classification problems that can be solved using neural networks into two broad categories: I)Linearly Separable Problems  These are problems where a single linear decision boundary (a straight line in 2D, a plane in 3D, etc.) can separate the data into different classes.
  • 20.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 20 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Examples:  AND: Both inputs need to be 1 for the output to be 1. This is linearly separable because we can draw a straight line to separate the (0,0), (0,1), and (1,0) inputs (which output 0) from the (1,1) input (which outputs 1).  OR: At least one input needs to be 1 for the output to be 1. This is also linearly separable because a single line can separate the (0,0) input (which outputs 0) from the other inputs (which output 1). II) Non-Linearly Separable Problems  These are problems where no single linear decision boundary can separate the data into different classes. More complex boundaries are required to correctly classify the data. Example:  XOR: The output is 1 if the inputs are different and 0 if they are the same. The XOR problem is not linearly separable because we cannot draw a single straight line that separates the (0,0) and (1,1) inputs (which output 0) from the (0,1) and (1,0) inputs (which output 1). we need a more complex boundary to separate these classes.
  • 21.
To address non-linearly separable problems, neural network machine learning techniques often involve the following strategies:
1. Multi-Layer Perceptrons (MLPs)
 Hidden Layers: Introduce one or more hidden layers between the input and output layers. These layers allow the network to learn complex, non-linear decision boundaries.
 Activation Functions: Use non-linear activation functions (e.g., ReLU, sigmoid, tanh) in the hidden layers to enable the network to capture non-linear relationships in the data.
2. Deep Learning
 Deep Neural Networks (DNNs): Utilize neural networks with many layers (deep architectures). Deep networks can model highly complex and abstract features from the data.
 Convolutional Neural Networks (CNNs): Specialized for grid-like data such as images. CNNs use convolutional layers to detect spatial hierarchies of features.
 Recurrent Neural Networks (RNNs): Suitable for sequential data. RNNs can maintain information about previous inputs using internal states, making them effective for time-series data and natural language processing.
Implications for Neural Networks
 Single-Layer Perceptrons: These can only solve linearly separable problems. They are limited to problems where the classes can be separated by a straight line or hyperplane.
 Multi-Layer Perceptrons (MLPs): These can solve non-linearly separable problems by using multiple layers of neurons. The hidden layers in MLPs allow them to create complex decision boundaries, making them capable of handling XOR and other non-linearly separable problems.
Gradient Descent Rule for Neural Networks
• Gradient Descent is an optimization algorithm used to find the best-fitting parameters (weights and biases) for a model by minimizing a given loss function.
• It updates the model parameters iteratively in the direction of the steepest decrease in the loss function, effectively searching the hypothesis space for the best parameters.
• Gradient Descent can be applied to various machine learning models, including neural networks, to find the model parameters that minimize the difference between predicted and actual outcomes.
• When dealing with non-linearly separable data, Gradient Descent can still find a set of parameters that approximates the target concept.
• However, the algorithm may not converge quickly, or it might not reach a global minimum if the loss function has multiple local minima.
Analogy: Blindfolded in Rough Terrain
 Imagine we are blindfolded and standing on rough terrain. Our objective is to reach the lowest point, which represents the global minimum of the loss function.
 Feeling the Ground: We feel the ground in all directions to find the steepest descent, analogous to calculating the gradient.
 Taking a Step: We take a step in the direction where the ground descends the fastest, similar to updating the parameters in the opposite direction of the gradient.
 Repeating the Process: By continuously feeling the ground and taking steps, we eventually reach the lowest point, assuming we don't get stuck in a local minimum or saddle point.
Gradient Descent Algorithm:
 Gradient descent minimizes a cost function J(w) parametrized by the model parameters w.
 The gradient (or derivative) indicates the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.
 Cost Function (J(w)): The cost function (or loss function) is a measure of how well the model's predictions match the actual data. The goal of training a model is to minimize this cost function.
 Parameters (w): Parameters are the weights in the model that we aim to optimize.
 Gradient (∇J(w)): The gradient of the cost function with respect to the parameters tells us the direction and rate of the steepest ascent. In gradient descent, we move in the direction opposite to the gradient to minimize the cost function.
The update rule is therefore w ← w − η ∇J(w), where η is the learning rate. By iteratively updating the parameters in the opposite direction of the gradient, Gradient Descent effectively minimizes the cost function, leading to optimal model parameters.
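As a concrete illustration (not from the notes), here is the update rule applied to a one-parameter cost J(w) = (w − 3)^2, whose minimum is at w = 3; the example also shows how too large a learning rate causes divergence.

# Minimal sketch: gradient descent on J(w) = (w - 3)^2, with gradient dJ/dw = 2*(w - 3).
def gradient_descent(w0=0.0, eta=0.1, steps=50):
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)    # gradient of the cost at the current w
        w = w - eta * grad        # step in the direction opposite to the gradient
    return w

print(gradient_descent())         # converges close to the minimum w = 3
print(gradient_descent(eta=1.5))  # a learning rate this large overshoots and diverges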
Derivation of the Gradient Descent Rule
(The derivation, presented in the original notes as worked equations and figures, is not reproduced here.)
GRADIENT DESCENT Algorithm for Training a Linear Unit
(The algorithm listing from the original notes is not reproduced here.)
Issues in Gradient Descent
Despite its simplicity and effectiveness, gradient descent faces several practical challenges:
1. Slow Convergence:
o Learning Rate Selection: If the learning rate α is too small, the algorithm will take tiny steps and converge very slowly. If α is too large, the algorithm might overshoot the minimum and diverge.
o Plateaus and Flat Regions: In regions where the gradient is close to zero, the algorithm can become stuck and make slow progress.
2. Local Minima:
o Multiple Minima: The error surface (the plot of the objective function) might have multiple local minima. Gradient descent can get trapped in one of these local minima instead of finding the global minimum.
o Saddle Points: These are points where the gradient is zero but which are neither minima nor maxima; the surface curves upward in some directions and downward in others. The algorithm can get stuck at, or slow down near, saddle points.
Variants of Gradient Descent
Several variants of gradient descent have been developed to address these issues:
1. Stochastic Gradient Descent (SGD): Instead of using the entire dataset to compute the gradient, SGD uses a single randomly chosen data point (or a small batch). This introduces noise into the gradient estimates, which can help escape local minima but may also lead to more fluctuation in the descent path.
2. Mini-Batch Gradient Descent: A compromise between batch gradient descent and SGD. It uses a small, randomly chosen subset of the data (a mini-batch) to compute the gradient, balancing the efficiency of SGD and the stability of batch gradient descent.
3. Momentum: This method adds a fraction of the previous update to the current update, which helps accelerate convergence, especially in the presence of plateaus and flat regions. (A sketch combining mini-batches and momentum follows the next item.)
4. Adaptive Learning Rate Methods: Methods like Adam adjust the learning rate dynamically based on the history of the gradients, which helps in handling varying gradients and speeds up convergence.
(Figure not reproduced.) In the referenced figure, the effective learning rate remains high for roughly the first 350 epochs and then decreases, giving faster convergence (blue curve) than the ordinary method without adaptive optimization (yellow curve).
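A minimal sketch of variants 2 and 3 above, mini-batch gradient descent with a momentum term, fitting a one-feature linear model; the synthetic data, learning rate, and momentum coefficient are assumptions chosen only for illustration.

# Minimal sketch: mini-batch SGD with momentum for y ≈ w*x + b on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 200)    # synthetic data, true w = 2, b = 0.5

w, b = 0.0, 0.0
vw, vb = 0.0, 0.0                              # momentum "velocity" terms
eta, beta, batch = 0.1, 0.9, 32

for epoch in range(100):
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch):
        j = idx[start:start + batch]           # one randomly chosen mini-batch
        err = (w * x[j] + b) - y[j]
        grad_w = 2 * np.mean(err * x[j])       # gradient of the mean squared error
        grad_b = 2 * np.mean(err)
        vw = beta * vw - eta * grad_w          # keep a fraction of the previous update
        vb = beta * vb - eta * grad_b
        w, b = w + vw, b + vb

print(round(w, 2), round(b, 2))                # close to the true values 2.0 and 0.5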
Multilayer Networks and the Backpropagation Algorithm
 Multi-layer Perceptrons (MLPs) are a class of artificial neural networks that consist of multiple layers of interconnected neurons (nodes or units).
 Unlike single-layer neural networks, MLPs have one or more hidden layers between the input and output layers. These hidden layers allow the network to learn and model more complex, non-linear relationships in the data.
 These networks are a foundational architecture in deep learning.
 The backpropagation algorithm is a crucial component for training these networks effectively.
 Multilayer networks trained using the backpropagation algorithm have demonstrated remarkable success in various machine learning tasks, including image recognition, natural language processing, and many others.
 Their ability to learn hierarchical representations of data makes them a fundamental tool in modern artificial intelligence.
Structure of Multilayer Networks
A typical multilayer network consists of:
1. Input Layer: The first layer, which receives the input features.
2. Hidden Layers: One or more layers where each neuron performs a weighted sum of its inputs, applies an activation function, and passes the result to the next layer. The hidden layers enable the network to learn and represent complex patterns.
3. Output Layer: The final layer, which produces the network's output; this can be a classification label, a regression value, or any other prediction.
Each connection between neurons has an associated weight, and each neuron has a bias term. The learning process involves adjusting these weights and biases to minimize the error between the network's predictions and the actual targets.
The Backpropagation Algorithm
 The backpropagation algorithm is a supervised learning method used to train multilayer neural networks.
 It works by computing the gradient of the loss function with respect to the weights in the network, and then using this gradient to update the weights in the direction that minimizes the loss.
 The algorithm's name, "backpropagation," comes from the way it calculates gradients of the loss function with respect to the network's parameters by working backward from the output layer to the input layer.
 It involves two main phases: forward propagation and backward propagation.
Training Process
1. Initialization: Initialize the weights and biases, typically with small random values.
2. Forward Propagation:
 Input data is passed through the network layer by layer.
 At each neuron, the weighted sum of inputs plus the bias is calculated and passed through the activation function: a = f(W·x + b).
 The output of each layer becomes the input to the next layer.
 This process continues until the final output layer, producing the network's prediction.
3. Backward Propagation:
 Calculate the error at the output layer (the difference between the predicted and actual values), e.g. the mean squared error L = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)², where yᵢ are the actual target values and ŷᵢ are the predicted values.
 Compute the gradient of the error with respect to the output layer's weights and biases.
 Propagate the error backwards through the network, layer by layer, computing the gradients of the error with respect to the weights and biases of each preceding layer: ∂L/∂W for each weight W and ∂L/∂b for each bias b.
 Update the weights and biases using the computed gradients and a learning rate: Wᵢⱼ ← Wᵢⱼ − η ∂L/∂Wᵢⱼ and bⱼ ← bⱼ − η ∂L/∂bⱼ, where η is the learning rate.
4. Iteration: Repeat the forward and backward passes for multiple epochs (complete passes through the training dataset) until the network converges (i.e., the loss stops decreasing significantly).
In the BACKPROPAGATION algorithm we consider networks with multiple output units rather than single units as before, so we redefine E to sum the errors over all of the network output units: E(w) = ½ Σ_d Σ_k (t_kd − o_kd)², where d ranges over the training examples, k over the output units, and t_kd and o_kd are the target and actual outputs of unit k for example d.
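A minimal sketch of one training loop (forward pass, backward pass, weight update) for a tiny 2-4-1 sigmoid network trained on XOR; the architecture, learning rate, and iteration count are illustrative assumptions, not the notes' own example.

# Minimal sketch: forward and backward propagation for a small sigmoid MLP on XOR.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

rng = np.random.default_rng(1)
W1, b1 = rng.normal(0.0, 1.0, (2, 4)), np.zeros((1, 4))    # input -> hidden
W2, b2 = rng.normal(0.0, 1.0, (4, 1)), np.zeros((1, 1))    # hidden -> output
eta = 0.5

for _ in range(10000):
    # forward propagation: a = f(Wx + b), layer by layer
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    # backward propagation of the squared-error loss
    dO = (O - Y) * O * (1 - O)            # error signal at the output layer
    dH = (dO @ W2.T) * H * (1 - H)        # error propagated back to the hidden layer
    # update weights and biases in the direction opposite to the gradient
    W2 -= eta * H.T @ dO;  b2 -= eta * dO.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0, keepdims=True)

print(np.round(O.ravel(), 2))   # typically approaches [0, 1, 1, 0]; another seed may be needed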
Example
(The worked backpropagation example from the original notes, presented as figures, is not reproduced here.)
Backpropagation with gradient descent can effectively minimize the loss function and help the network approach the global loss minimum, making it a powerful method for training neural networks.
Advantages of the Backpropagation Algorithm:
1. Efficiency: Backpropagation is computationally efficient and scalable, making it suitable for training large neural networks on big datasets.
2. Universal Approximator: Neural networks trained using backpropagation can approximate a wide range of complex functions, making them powerful for various machine learning tasks, including regression and classification.
3. Non-Linearity: The algorithm allows neural networks to learn non-linear relationships in data, enabling them to capture complex patterns and representations.
4. Versatility: Backpropagation can be applied to different types of neural network architectures, including feedforward networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
5. Generalization: Backpropagation helps neural networks generalize well to unseen data, reducing overfitting.
6. Deep Learning: Backpropagation is the foundation for training deep neural networks (deep learning), which have achieved state-of-the-art results in various fields, including computer vision and natural language processing.
Disadvantages of the Backpropagation Algorithm:
1. Local Minima: The algorithm can get stuck in local minima or saddle points in the loss landscape, preventing it from finding the global minimum. However, techniques like momentum and advanced optimizers help mitigate this issue.
2. Vanishing and Exploding Gradients: In deep neural networks, gradients can become too small (vanishing gradients) or too large (exploding gradients), causing convergence issues.
Proper weight initialization and gradient clipping can address this problem.
3. Hyperparameter Sensitivity: Backpropagation involves tuning hyperparameters such as the learning rate and batch size, which can be time-consuming and require trial and error.
4. Complexity: The implementation and training of deep neural networks can be complex and computationally intensive, demanding significant computational resources.
Bayesian Learning
Introduction
 Bayesian learning is a statistical approach to machine learning that involves updating the probability of a hypothesis as more evidence or information becomes available.
 It relies on Bayes' theorem, a foundational concept in probability theory, to combine prior knowledge with new data to make predictions or decisions.
 Bayesian learning provides a robust framework for updating beliefs and making decisions in the presence of uncertainty, making it a powerful tool in various applications, from spam detection to medical diagnosis and beyond.
Bayes' Theorem
Bayes' theorem provides a way to update the probability of a hypothesis H given new evidence E. It is mathematically expressed as:
P(H|E) = P(E|H) · P(H) / P(E)
Where:
 P(H|E) is the posterior probability of the hypothesis H given the evidence E.
 P(E|H) is the likelihood of observing the evidence E given that the hypothesis H is true.
 P(H) is the prior probability of the hypothesis H before observing the evidence.
 P(E) is the marginal likelihood of observing the evidence E.
Example 1: Medical Diagnosis
Suppose a doctor is trying to determine whether a patient has a particular disease D based on the result of a diagnostic test T.
Given:
 The prior probability of the disease P(D) = 0.01 (1% of the population has the disease).
 The probability of a positive test result given the disease P(T|D) = 0.9 (90% of those with the disease test positive).
 The probability of a positive test result without the disease P(T|¬D) = 0.05 (5% of those without the disease test positive).
We also need the overall probability of a positive test result, P(T).
Solution:
1. Calculate the marginal likelihood P(T):
P(T) = P(T|D)·P(D) + P(T|¬D)·P(¬D), where P(¬D) = 1 − P(D) = 0.99.
P(T) = (0.9 · 0.01) + (0.05 · 0.99) = 0.009 + 0.0495 = 0.0585
2. Apply Bayes' theorem:
P(D|T) = P(T|D) · P(D) / P(T) = 0.009 / 0.0585 ≈ 0.1538
The posterior probability P(D|T) is approximately 0.1538, or 15.38%. This means that even if the test is positive, there is only a 15.38% chance that the patient actually has the disease, given the test's characteristics and the rarity of the disease.
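A minimal sketch that reproduces this calculation; the function and argument names are only illustrative.

# Minimal sketch: Bayes' theorem for the diagnostic-test example.
def posterior(prior, sensitivity, false_positive_rate):
    # P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|~D) P(~D) ]
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

print(round(posterior(0.01, 0.9, 0.05), 4))   # 0.1538, as computed above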
Example 2: Probability of Rain on Marie's Wedding Day Given the Weatherman's Forecast
Marie is getting married tomorrow at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. Given this information, what is the probability that it will rain on the day of Marie's wedding? Use Bayes' theorem to calculate this probability.
Solution:
Given information:
 Probability of rain on any given day: P(R) = 5/365 ≈ 0.0137
 Probability of no rain on any given day: P(¬R) = 1 − P(R) = 0.9863
 Probability that the weatherman correctly predicts rain given that it rains: P(W|R) = 0.90
 Probability that the weatherman incorrectly predicts rain given that it doesn't rain: P(W|¬R) = 0.10
What we want to find:
 The probability that it will rain given that the weatherman predicts rain, P(R|W).
Using Bayes' theorem:
P(R|W) = P(W|R) · P(R) / P(W)
P(W) = P(W|R)·P(R) + P(W|¬R)·P(¬R) = (0.90 · 0.0137) + (0.10 · 0.9863) = 0.01233 + 0.09863 = 0.11096
P(R|W) = 0.01233 / 0.11096 ≈ 0.1111
The probability that it will rain on the day of Marie's wedding, given the weatherman's prediction, is approximately 0.1111, or 11.11%.
Concept Learning with Bayes' Theorem
Concept learning involves inferring a general rule or pattern from specific examples. In the context of Bayesian learning, concept learning can be understood as updating our beliefs about the correctness of different hypotheses as we observe more examples.
Steps in Bayesian Concept Learning
1. Define the Hypothesis Space: List all possible hypotheses H that can explain the data.
2. Specify the Prior Probabilities: Assign a prior probability P(H) to each hypothesis based on prior knowledge or assumptions.
3. Collect Evidence: Gather data or observations that will help in evaluating the hypotheses.
4. Compute the Likelihood: Calculate the likelihood P(E|H) of the observed data for each hypothesis.
5. Apply Bayes' Theorem: Use Bayes' theorem to update the prior probabilities and compute the posterior probabilities P(H|E) for each hypothesis.
6. Select the Best Hypothesis: Choose the hypothesis with the highest posterior probability as the best explanation for the observed data.
Example: Concept Learning with a Bayesian Approach
Suppose you are trying to determine whether an email is spam or not (binary classification). The hypotheses are:
 H1: The email is spam.
 H2: The email is not spam.
1. Define the Hypothesis Space: H1 and H2.
2. Specify the Prior Probabilities:
 P(H1) = 0.4 (prior probability of an email being spam).
 P(H2) = 0.6 (prior probability of an email not being spam).
3. Collect Evidence: Evidence E could be the presence of certain words, phrases, or features in the email.
4. Compute the Likelihood:
 P(E|H1) is the likelihood of observing the evidence given the email is spam.
 P(E|H2) is the likelihood of observing the evidence given the email is not spam.
5. Apply Bayes' Theorem: Calculate the posterior probabilities
P(H1|E) = P(E|H1) · P(H1) / P(E) and P(H2|E) = P(E|H2) · P(H2) / P(E)
6. Select the Best Hypothesis:
 Compare P(H1|E) and P(H2|E).
 If P(H1|E) > P(H2|E), classify the email as spam; otherwise, classify it as not spam.
Features of Bayesian Learning Methods
Bayesian learning methods are a set of approaches in machine learning and statistics that are based on Bayes' theorem. Here are some key features of Bayesian learning methods:
1. Probabilistic Framework:
 Bayesian methods treat both model parameters and predictions as random variables and express uncertainty using probability distributions.
 This allows for a more principled way to incorporate prior knowledge and update beliefs as new data becomes available.
 Example: In medical diagnosis, Bayesian networks are used to model the probabilistic relationships between symptoms and diseases. By updating probabilities based on observed symptoms (data), the network can provide a probabilistic assessment of the likelihood of different diseases.
2. Prior Knowledge:
 They allow the incorporation of prior beliefs about parameters before observing data.
 This prior information can be based on previous knowledge, domain expertise, or data from similar tasks.
 Example: In spam email classification, Bayesian methods can incorporate prior knowledge about typical words or patterns found in spam emails versus legitimate emails. This prior information helps improve the accuracy of classifying new emails as spam or not spam.
3. Posterior Inference:
 After observing data, Bayesian methods compute the posterior distribution over parameters.
 This distribution reflects updated beliefs about the parameters given the observed data, combining prior beliefs with the likelihood of the data.
4. Flexibility in Model Complexity:
 Bayesian methods can handle complex models and provide a framework for model selection by comparing the marginal likelihood (evidence) of different models.
 Example: Bayesian non-parametric models, such as Gaussian Processes, allow for flexible modeling of complex relationships without specifying the exact functional form of the relationship between variables. This flexibility is particularly useful in scenarios where the underlying data-generating process is not well understood.
5. Regularization: Bayesian inference naturally provides regularization by integrating over the parameter space, which helps prevent overfitting and improves generalization to unseen data.
6. Sequential Learning: They support sequential updating of beliefs as new data points become available, making them suitable for online learning scenarios.
7. Uncertainty Quantification: Bayesian methods provide a direct way to quantify uncertainty in predictions, which is useful in decision-making processes that require risk assessment.
8. Complex Inference: In some cases, Bayesian inference can be computationally intensive, especially when dealing with high-dimensional data or complex models. However, advances in computational techniques such as Markov chain Monte Carlo (MCMC) and variational inference have helped mitigate these challenges.
Overall, Bayesian learning methods offer a powerful framework for reasoning under uncertainty and integrating prior knowledge with observed data, making them widely applicable in various domains such as healthcare, finance, and natural language processing.
Density Estimation in Bayesian Learning
 Density estimation involves estimating the probability distribution that generated a given dataset.
 In principle, all ML tasks can be solved if the data-generating probability distributions are identified.
 Thus, distribution estimation is the most general approach to ML.
 However, distribution estimation is hard without prior knowledge (i.e., using non-parametric methods).
 In Bayesian learning, this involves using Bayes' theorem to update the probability distribution of the data given a model and prior information.
 This technique is particularly useful in scenarios where the true distribution is unknown, and it provides a framework for making inferences about the data.
 Bayesian density estimation provides a full distribution over possible data points, allowing for uncertainty quantification and more robust decision-making.
Objects and Needs of Density Estimation
1. Parameters: Unknown quantities in the model that need to be estimated.
2. Prior Distribution: Represents our beliefs about the parameters before observing the data.
3. Posterior Distribution: Updated beliefs about the parameters after observing the data.
4. Estimating the complete probability distribution is usually computationally infeasible, so practitioners often settle for summary statistics such as the mean or mode.
5. Estimating the full density function can be very challenging, so often the focus is on obtaining a point estimate, such as the mean.
6. Understanding Data: Density estimation helps us gain insights into the data by providing a probabilistic description of its structure. It allows us to identify patterns, modes, clusters, and outliers within the data.
7. Flexibility: The ability to model complex distributions and relationships in the data.
8. Data Visualization: Probability density functions can be used to create visual representations of the data distribution, such as histograms, kernel density plots, or probability density plots. These visualizations aid in data exploration and presentation.
9. Data Modeling: Density estimation is often a crucial step in statistical modeling. It provides the foundation for many statistical techniques, such as Bayesian inference, maximum likelihood estimation, and hypothesis testing.
10. Data Compression: In data compression techniques (for example, those based on the MDL principle or on Gaussian Mixture Models (GMMs)), density estimation helps capture essential features of the data distribution, reducing data dimensionality while retaining important information.
Gaussian (Normal) Distribution
 The Gaussian, or Normal, distribution is one of the most fundamental and widely used probability distributions in statistics and many other fields. It is essential for understanding a range of statistical concepts and methods.
 The Gaussian distribution describes a continuous probability distribution for a random variable X. It is characterized by its bell-shaped curve, symmetric about the mean.
 The probability density function (pdf) of the Gaussian distribution is given by:
f(x) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )
where:
μ is the mean of the distribution,
σ² is the variance of the distribution, and
σ is the standard deviation, which is the square root of the variance.
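A minimal sketch (not from the notes) that evaluates this pdf directly:

# Minimal sketch: evaluating the Gaussian probability density function.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(normal_pdf(0.0), 4))              # 0.3989: standard normal density at its mean
print(round(normal_pdf(12.0, 10.0, 2.0), 4))  # ≈ 0.121: N(mu=10, sigma=2) one sd above the mean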
Key Properties of the Gaussian (Normal) Distribution
1. Symmetry:
 The Gaussian distribution is symmetric about its mean μ.
 The mean, median, and mode of the distribution are all equal and located at μ.
2. Shape:
 The bell shape of the distribution is determined by the standard deviation σ.
 A larger σ results in a wider, flatter curve, while a smaller σ results in a steeper, narrower curve.
3. 68-95-99.7 Rule:
 Approximately 68% of the data falls within one standard deviation of the mean (μ ± σ).
 Approximately 95% of the data falls within two standard deviations of the mean (μ ± 2σ).
 Approximately 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ).
4. Moment Generating Function:
 The moments of the Gaussian distribution (mean, variance, skewness, and kurtosis) can be derived from its moment generating function (MGF).
5. Standard Normal Distribution:
 A special case of the Gaussian distribution where μ = 0 and σ² = 1.
 The pdf of the standard normal distribution is φ(z) = (1/√(2π)) · exp(−z²/2).
 Any normal random variable X can be standardized to Z using Z = (X − μ) / σ.
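The 68-95-99.7 rule in property 3 above can be checked numerically with a small sampling sketch (illustrative only; the mean, standard deviation, and sample size are arbitrary):

# Minimal sketch: empirical check of the 68-95-99.7 rule for N(mu=10, sigma=2).
import random

random.seed(0)
mu, sigma = 10.0, 2.0
samples = [random.gauss(mu, sigma) for _ in range(100000)]
for k in (1, 2, 3):
    frac = sum(abs(s - mu) <= k * sigma for s in samples) / len(samples)
    print(k, round(frac, 3))   # approximately 0.683, 0.954, 0.997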
Applications of the Gaussian (Normal) Distribution
1. Statistical Inference: Many statistical tests and confidence intervals are based on the assumption that the data follows a normal distribution.
2. Central Limit Theorem: The theorem states that the sum (or average) of a large number of independent and identically distributed random variables tends to follow a normal distribution, regardless of the original distribution of the variables.
3. Real-World Phenomena: Many natural and human-made phenomena are approximately normally distributed, such as heights, test scores, and measurement errors.
4. Regression Analysis: In linear regression, the assumption of normally distributed errors is crucial for hypothesis testing and constructing confidence intervals.
5. Finance and Economics: The Gaussian distribution is used in modeling asset returns and in the Black-Scholes option pricing model.
Techniques for Density Estimation in Bayesian Learning
Bayesian techniques for density estimation leverage prior knowledge and observed data to infer the underlying distribution of the data. These techniques can be broadly classified into parametric and non-parametric methods.
I) Parametric Methods
 Assume that the data is generated from a distribution with a fixed set of parameters.
 Use Bayesian inference to estimate these parameters.
 Examples: Maximum Likelihood (ML) and Least-Squares (LS) Error Hypotheses, Gaussian Mixture Models (GMM).
a) Maximum Likelihood (ML) Hypothesis:
 Hypothesis: The best estimate of the parameters is the one that maximizes the likelihood of the observed data.
 Procedure:
1. Define the likelihood function based on the chosen model.
2. Find the parameter values that maximize this likelihood function:
θ̂_ML = argmax_θ P(D | θ)
where D represents the observed data and θ represents the model parameters.
 This approach does not incorporate prior beliefs about the parameters and focuses solely on maximizing the likelihood function.
Application: Widely used in many statistical models, including those for density estimation.
Advantages:
 Provides consistent and efficient estimates under certain conditions.
 Directly related to the probabilistic model.
Example: In a linear regression model with Gaussian noise, the ML estimate of the parameters θ (which includes the intercept and slope coefficients) is found by maximizing the likelihood of the observed data given these parameters.
Maximum Likelihood (ML) for Predicting Probabilities
In Bayesian learning, ML can be adapted to predict probabilities by focusing on estimating the parameters of the probability distribution that generated the data.
1. Model Selection: Choose a parametric model for the probability distribution (e.g., Gaussian, Poisson).
2. Likelihood Function: Define the likelihood function for the chosen model based on the observed data.
3. Parameter Estimation: Use ML to estimate the parameters of the model by maximizing the likelihood function.
4. Probability Prediction: Once the parameters are estimated, use the model to predict probabilities for new data points.
b) Least Squares (LS) Error Hypothesis
 Hypothesis: The best estimate of the parameters minimizes the sum of squared differences between observed and predicted values.
 This approach is often used in regression analysis and corresponds to the ML estimate when the errors are assumed to be normally distributed.
 Procedure:
1. Define the residuals as the difference between observed and predicted values.
2. Minimize the sum of squared residuals:
θ̂_LS = argmin_θ Σᵢ (yᵢ − ŷᵢ)²
where yᵢ are the observed values and ŷᵢ are the predicted values given the parameters θ.
Application: Commonly used in regression problems; it can also be applied in density estimation contexts where the model can be expressed in a regression framework.
Advantages:
 Simple and intuitive.
 Works well with linear models and Gaussian errors.
Example: In a simple linear regression model, the LS error estimate of the parameters is obtained by minimizing the sum of squared differences between the actual values yᵢ and the predicted values ŷᵢ, as in the sketch below.
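A minimal sketch (with made-up data) of the closed-form least-squares estimates for simple linear regression; under Gaussian errors these coincide with the ML estimates discussed above.

# Minimal sketch: least-squares slope and intercept for simple linear regression.
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]          # roughly y = 2x
print(least_squares(xs, ys))            # slope ≈ 1.96, intercept ≈ 0.14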
II) Non-Parametric Methods
 Do not assume a fixed parametric form for the distribution.
 Use flexible models that can adapt to the shape of the data.
 Examples: Nearest Neighbor Density Estimation, Kernel Density Estimation (KDE), Gaussian Processes (GP).
a) Nearest Neighbor Density Estimation
 Estimates the density by considering the distance to the kth nearest neighbor of each point.
 The density estimate at a point is inversely proportional to the volume of the region containing the k nearest neighbors:
p̂(x) = k / (n · V(x))
where n is the number of data points and V(x) is the volume of the hypersphere centered at x that contains k data points.
b) Kernel Density Estimation (KDE)
 A non-parametric way to estimate the probability density function of a random variable.
 Uses a kernel function (e.g., Gaussian) placed at each data point to estimate the density:
p̂(x) = (1 / (n·h)) Σᵢ₌₁ⁿ K( (x − xᵢ) / h )
where K is the kernel function, h is the bandwidth, and xᵢ are the observed data points.
 The bandwidth parameter controls the smoothness of the resulting density estimate.
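A minimal sketch of KDE with a Gaussian kernel; the data points and bandwidth below are arbitrary illustrations.

# Minimal sketch: Gaussian kernel density estimate at a query point x.
import math

def kde(x, data, h=0.5):
    def gaussian_kernel(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.1, 1.3, 1.7, 4.0, 4.2, 4.4]     # two apparent clusters
print(round(kde(1.5, data), 3))           # higher density inside a cluster (≈ 0.34)
print(round(kde(3.0, data), 3))           # lower density between the clusters (≈ 0.03)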
Minimum Description Length (MDL) Principle
 The Minimum Description Length (MDL) principle is a method for model selection and complexity control in statistical learning.
 It is closely related to Bayesian learning and information theory.
 The core idea of MDL is to choose the model that provides the best compression of the data, balancing model complexity and goodness of fit.
Core Concept
 The MDL principle is based on the idea that the best model for a given set of data is the one that minimizes the total length of encoding the model and the data.
 In other words, it seeks the model that provides the most compact representation of the data.
Description Length
The description length consists of two parts:
1. Model Description Length: The length of the code required to describe the model parameters.
2. Data Description Length: The length of the code required to describe the data given the model.
Formally, the total description length can be expressed as:
L(M, D) = L(M) + L(D | M)
where L(M) is the length of the description of the model M and L(D | M) is the length of the description of the data D given the model.
MDL and Bayesian Inference
MDL is closely related to Bayesian inference. In fact, the MDL principle can be viewed as a practical implementation of Bayesian model selection.
 Model Description Length (Prior): In Bayesian terms, this corresponds to the prior probability of the model. A simpler model has a higher prior probability and thus a shorter description length.
 Data Description Length (Likelihood): This corresponds to the likelihood of the data given the model. A model that fits the data well will have a higher likelihood, leading to a shorter description length.
In Bayesian learning, we typically maximize the posterior probability:
M* = argmax_M P(M | D), where P(M | D) ∝ P(D | M) · P(M)
In MDL, we minimize the description length:
M* = argmin_M [ L(M) + L(D | M) ]
These two approaches are fundamentally similar, as both aim to balance model complexity (prior) and data fit (likelihood).
Applications in Bayesian Learning
1. Model Selection: MDL is used to select the model that provides the best trade-off between complexity and fit. This is particularly useful when comparing different models or hypotheses (see the sketch below).
2. Regularization: By penalizing complex models, MDL naturally implements regularization, helping to avoid overfitting.
3. Clustering: In clustering algorithms like Gaussian Mixture Models (GMMs), MDL can be used to determine the optimal number of clusters.
4. Compression: MDL can be directly applied in data compression tasks, where the goal is to compress data efficiently.
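As a rough illustration only (a BIC-style two-part code, which is one common approximation and not the notes' formulation), the sketch below compares polynomial models of different degrees by the bits needed to encode the parameters plus the bits needed to encode the data given the fitted model; the data, quantization precision, and parameter-cost formula are all assumptions.

# Minimal sketch: comparing models by an approximate two-part description length.
import math
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)       # data generated by a straight line

def description_length(degree, precision=0.01):
    coeffs = np.polyfit(x, y, degree)                  # maximum-likelihood polynomial fit
    residuals = y - np.polyval(coeffs, x)
    sigma2 = float(np.mean(residuals ** 2))
    # data cost: negative Gaussian log-likelihood in bits, for residuals
    # quantized to the chosen (arbitrary) precision
    data_bits = 0.5 * x.size * (math.log(2 * math.pi * sigma2) + 1) / math.log(2) \
                + x.size * math.log2(1.0 / precision)
    model_bits = 0.5 * (degree + 1) * math.log2(x.size)   # crude cost of the parameters
    return data_bits + model_bits

for d in (1, 3, 7):
    print(d, round(description_length(d), 1))          # the straight line has the shortest code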
Naive Bayes Classifier
 The Naive Bayes classifier is a simple yet powerful probabilistic classifier based on Bayes' theorem.
 It is called "naive" because it assumes that all features are independent of each other given the class label.
 This assumption simplifies the computation but may not always be accurate.
Bayes' Theorem
P(C|X) = P(X|C) · P(C) / P(X)
where:
 P(C|X) is the posterior probability of class C given the features X.
 P(X|C) is the likelihood of the features X given the class C.
 P(C) is the prior probability of class C.
 P(X) is the marginal likelihood of the features X.
Naive Bayes Assumption
The Naive Bayes classifier assumes that the features are conditionally independent given the class. This simplifies the likelihood term P(X|C) as:
P(X|C) = P(X₁|C) · P(X₂|C) · … · P(Xₙ|C)
where the Xᵢ are the individual features.
Classification Rule
The Naive Bayes classification rule is to choose the class C that maximizes the posterior probability:
C* = argmax_C P(C|X)
Given the independence assumption, this can be written as:
C* = argmax_C P(C) · Πᵢ P(Xᵢ|C)
Steps to Build a Naive Bayes Classifier
1. Convert the Dataset into Frequency Tables: Count the occurrences of each feature for each class.
2. Generate a Likelihood Table: Calculate the probabilities of each feature given each class.
3. Use Bayes' Theorem: Calculate the posterior probability for each class given a new instance and choose the class with the highest probability.
An Illustrative Example: Naive Bayes Classifier
(The weather dataset, frequency tables, and likelihood calculations from the original notes are not reproduced here.)
 As we can see from the calculation, P(Yes|Sunny) > P(No|Sunny).
 Hence, on a Sunny day, the player can play the game.
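A minimal sketch of the frequency-table approach for the outlook feature alone; the counts below are assumed for illustration (the notes' own table is not reproduced above), so the exact numbers may differ from the original example.

# Minimal sketch: Naive Bayes posteriors from an assumed outlook frequency table.
counts = {                      # Outlook value -> (games played, games not played)
    "Sunny":    (3, 2),
    "Overcast": (5, 0),
    "Rainy":    (2, 2),
}
total_yes = sum(y for y, n in counts.values())   # 10
total_no  = sum(n for y, n in counts.values())   # 4
total     = total_yes + total_no                 # 14

def posterior_play(outlook):
    p_yes = (counts[outlook][0] / total_yes) * (total_yes / total)   # P(outlook|Yes) P(Yes)
    p_no  = (counts[outlook][1] / total_no)  * (total_no  / total)   # P(outlook|No)  P(No)
    norm = p_yes + p_no                          # P(outlook), the evidence term
    return p_yes / norm, p_no / norm

print(posterior_play("Sunny"))   # P(Yes|Sunny) > P(No|Sunny) with these counts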
Advantages of the Naive Bayes Classifier
1. Simplicity: The Naive Bayes classifier is simple to understand and implement.
2. Speed: It is computationally efficient, requiring less training time compared to other classifiers, especially with large datasets.
3. Scalability: It performs well with a large number of features and instances.
4. Robust to Irrelevant Features: Since the model assumes independence between features, it can perform well even when irrelevant features are present.
5. Effective with Small Data: It performs well with relatively small amounts of training data.
6. Probability Outputs: It provides probabilistic outputs, which are useful in many applications such as risk assessment and decision-making.
Disadvantages of the Naive Bayes Classifier
1. Assumption of Independence: The main limitation is the naive assumption of feature independence, which is often not true in real-world data.
2. Zero Probability: If a categorical variable's category was not observed in the training data, the model assigns it a zero probability, which can be problematic. This can be mitigated by techniques such as Laplace smoothing.
3. Data Scarcity: If the training data for a particular class is scarce, the probability estimates can be unreliable.
4. Continuous Features: The assumption of a normal distribution for continuous features can be limiting and sometimes incorrect.
5. Sensitivity to Irrelevant Features: While relatively robust to irrelevant features, the presence of too many of them can still degrade performance.
Applications of the Naive Bayes Classifier
1. Text Classification:
o Spam Filtering: Distinguishing spam emails from non-spam (ham) emails.
o Sentiment Analysis: Determining the sentiment (positive, negative, neutral) from text data such as reviews and social media posts.
o Document Categorization: Classifying news articles, research papers, and other documents into predefined categories.
2. Medical Diagnosis: Predicting the likelihood of a disease based on patient symptoms and medical history.
3. Recommendation Systems: Suggesting products, movies, or other items to users based on their past behavior and preferences.
4. Anomaly Detection: Identifying unusual patterns that do not conform to expected behavior, such as fraud detection in financial transactions.
5. Real-Time Prediction: Due to its fast prediction capability, it is used in applications requiring real-time responses, such as recommendation engines and live spam detection.
6. Weather Prediction: Predicting weather conditions based on historical data and observed weather patterns.
7. Predictive Maintenance: Predicting equipment failures in industries to plan maintenance activities effectively.
8. Biological Data Classification: Classifying sequences or features in bioinformatics, such as identifying protein functions based on sequence data.
Gaussian Naive Bayes Classifier
 In the Gaussian Naive Bayes classifier, the continuous features are assumed to follow a Gaussian (normal) distribution.
 Example: Predicting house prices where features like size, number of rooms, etc., are continuous and may follow a Gaussian distribution.
 This means that for each class, the features are modeled using a Gaussian distribution with class-specific means and variances.
 To classify a new sample, the model calculates the posterior probability for each class given the sample's features, using the Gaussian distribution parameters. It then assigns the sample to the class with the highest posterior probability, as in the sketch below.
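A minimal sketch with a single continuous feature; the class names, priors, and class-conditional means and variances are made-up values used only for illustration.

# Minimal sketch: Gaussian Naive Bayes with one continuous feature (house size in m^2).
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# assumed class-conditional parameters and priors (illustrative only)
classes = {
    "cheap":     {"prior": 0.6, "mu": 80.0,  "var": 100.0},
    "expensive": {"prior": 0.4, "mu": 160.0, "var": 400.0},
}

def classify(size):
    # posterior score for each class: P(C) * P(feature | C); highest score wins
    scores = {c: p["prior"] * gaussian_pdf(size, p["mu"], p["var"])
              for c, p in classes.items()}
    return max(scores, key=scores.get), scores

print(classify(95))    # classified as "cheap"
print(classify(170))   # classified as "expensive"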
An Illustrative Example: Gaussian Naive Bayes Classifier
(The worked example from the original notes, presented as figures, is not reproduced here.)
Advantages of the Gaussian Naive Bayes Classifier:
 Simple to implement with continuous data.
 Works well if the feature distributions approximate a Gaussian.
Disadvantages of the Gaussian Naive Bayes Classifier:
 Performance may degrade if the Gaussian assumption does not hold for the features.
Applications of the Gaussian Naive Bayes Classifier
1. Medical Diagnosis: Predicting the likelihood of a disease based on continuous medical measurements (e.g., blood pressure, cholesterol levels).
2. Finance: Credit scoring, where features such as income, debt, and age are continuous.
3. Weather Prediction: Forecasting weather conditions using continuous variables like temperature, humidity, and wind speed.
4. Environmental Science: Predicting pollution levels based on continuous measurements such as the concentration of pollutants in the air.
5. Image Processing: Classifying pixel intensities in grayscale images, where the feature distribution can be approximately Gaussian.
Comparison between Standard Naive Bayes Classifiers and Gaussian Naive Bayes Classifiers
 Data Type: Standard NB uses discrete features (Multinomial) or binary features (Bernoulli); Gaussian NB uses continuous features.
 Feature Distribution: For standard NB, Multinomial assumes features are counts or frequencies and Bernoulli assumes binary presence/absence; for Gaussian NB, features are assumed to follow a Gaussian (normal) distribution.
 Use Case: Standard NB suits text classification (Multinomial) or binary/boolean classification tasks (Bernoulli); Gaussian NB suits tasks where features are continuous and approximately normally distributed.
 Feature Independence: Both assume features are conditionally independent given the class.
 Advantages: Multinomial NB is effective for count-based features such as word counts and Bernoulli NB is simple for binary features; Gaussian NB is simple and effective for continuous features if the Gaussian assumption holds.
 Disadvantages: Multinomial NB is not suitable for non-count data and Bernoulli NB is limited to binary features; Gaussian NB performance may degrade if feature distributions are not Gaussian.
 Example Applications: Standard NB is used for text classification, spam detection, and document categorization; Gaussian NB is used for predicting house prices and other classification tasks with continuous attributes.
Bayesian Belief Network (BBN)
 A Bayesian Belief Network (BBN), also known as a Bayesian network, Bayes net, or belief network, is a powerful graphical model used to represent and analyze the probabilistic relationships among a set of variables.
 It is particularly useful in domains where uncertainty and complex interactions between variables are present.
Structure of Bayesian Belief Networks
BBNs are structured as directed acyclic graphs (DAGs). A DAG is a type of graph that consists of vertices (or nodes) connected by directed edges, with the crucial property that it contains no cycles: if you start at any vertex and follow the directed edges, you cannot return to that same vertex. In a BBN:
 Nodes represent random variables, which can be observable quantities, latent variables, or even unknown parameters.
 Edges (or directed arrows) indicate the conditional dependencies between these variables. If a directed edge goes from node A to node B, it implies that A has a direct influence on B.
This structure allows for the representation of joint probability distributions through the product of conditional probabilities associated with each node, given its parent nodes. The absence of a direct edge between two nodes signifies that they are conditionally independent of each other, given their respective parent nodes.
Key Components
1. Conditional Probability Tables (CPTs): Each node in a Bayesian network has an associated CPT that quantifies the effect of the parent nodes on the node itself. For example, if a node has two parent nodes, the CPT will include probabilities for all combinations of the parent states.
2. Joint Probability Distribution: The joint probability of all variables in the network can be expressed as:
P(X₁, X₂, …, Xₙ) = Πᵢ P(Xᵢ | Parents(Xᵢ))
where the Xᵢ represent the variables in the network and Parents(Xᵢ) are the parent nodes of Xᵢ.
Example 1:
 Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes.
 Harry has two neighbors, John and Mary, who have taken on the responsibility of informing Harry at work when they hear the alarm.
 John always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then as well.
 On the other hand, Mary likes to listen to loud music, so she sometimes fails to hear the alarm.
Here we would like to compute the probability of the burglar alarm sounding.
Calculation of the Probability of an Event:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both John and Mary call.
That is, we want the joint probability that John calls (J), Mary calls (M), the alarm rings (A), no earthquake happens (¬E), and no burglary happens (¬B). Factorizing along the network:
P(J, M, A, ¬E, ¬B) = P(J|A) · P(M|A) · P(A|¬E, ¬B) · P(¬B) · P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
≈ 0.00063
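A minimal sketch of this computation, using the conditional-probability values from the calculation above:

# Minimal sketch: joint probability P(J, M, A, ~E, ~B) in the alarm network.
p_not_b = 0.999                 # P(~B)
p_not_e = 0.998                 # P(~E)
p_a_given_not_b_not_e = 0.001   # P(A | ~B, ~E)
p_j_given_a = 0.90              # P(J | A)
p_m_given_a = 0.70              # P(M | A)

joint = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(joint, 5))          # approximately 0.00063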
Example 2:
(The network figure and calculation for this example in the original notes are not reproduced here.)
Example 3:
(Figure not reproduced.)
Advantages of BBNs:
1. Uncertainty Handling: BBNs handle uncertainty and probabilistic reasoning, making them useful for scenarios where outcomes are uncertain or incomplete.
2. Interpretable Relationships: The directed acyclic graph (DAG) structure provides a clear and interpretable way to visualize and understand relationships between variables.
3. Flexibility: BBNs can model complex dependencies and conditional independencies among variables, allowing for nuanced and detailed representations of real-world systems.
4. Inference and Querying: BBNs support efficient probabilistic inference, enabling the calculation of posterior probabilities given evidence, which is useful for decision-making and predictions.
5. Learning from Data: BBNs can be learned from data, allowing them to adapt to new information and improve accuracy over time.
6. Causal Relationships: They can model causal relationships, providing insights into how changes in one variable may affect others.
    ELH -4.2: MACHINELEARNING UNIT – II Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 71 For more Notes visit :https://sites.google.com/view/chandrakanthats/teaching Disadvantages of BNNs: 1. Computational Complexity: Inference in BBNs can be computationally intensive, especially in large networks with many variables and dependencies, potentially requiring advanced algorithms or approximations. 2. Structure Learning Challenges: Learning the structure of a BBN from data can be challenging and computationally expensive, particularly with large or high-dimensional datasets. 3. Data Requirements: BBNs require sufficient and high-quality data to accurately estimate probabilities and learn network parameters or structure. 4. Assumption of Independence: BBNs assume conditional independence between variables given their parents, which may not always accurately reflect real-world dependencies. 5. Scalability Issues: For very large networks or those with complex dependencies, BBNs may face scalability issues and require significant resources to manage and analyze. 6. Expert Knowledge Requirement: Building and interpreting a BBN often requires domain expertise to define the network structure and understand the implications of the modeled dependencies. Applications of BNNs Bayesian Belief Networks have a wide range of applications, including:  Medical Diagnosis: BBNs can model the relationships between diseases and symptoms, helping to infer the likelihood of a disease given observed symptoms.  Spam Filtering: They are used to classify emails as spam or not based on various features extracted from the email content.  Gene Regulatory Networks: In bioinformatics, BBNs help model the interactions between genes and their regulatory mechanisms.  Decision Support Systems: BBNs assist in making informed decisions under uncertainty by modeling the relationships between different factors affecting the decision
Expectation-Maximization (EM) Algorithm
 The Expectation-Maximization (EM) algorithm is a powerful statistical technique for finding maximum likelihood estimates of parameters in models that involve unobserved (latent) variables.
 It is particularly useful in situations where the data are incomplete or contain missing values.
Algorithm:
1. Initialization: Choose initial values for the model parameters, for example random values or estimates obtained from a simpler method such as k-means.
2. Expectation Step (E-step): Using the current parameter estimates, compute the posterior probabilities (responsibilities) of the latent variables given the observed data, i.e., the expected value of the complete-data log-likelihood.
3. Maximization Step (M-step): Re-estimate the parameters by maximizing the expected complete-data log-likelihood computed in the E-step.
4. Convergence: Repeat the E-step and M-step until the parameters or the log-likelihood change by less than a chosen tolerance, or until a maximum number of iterations is reached.
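The following is a minimal NumPy sketch of these four steps for a two-component, one-dimensional Gaussian mixture. The synthetic data, initial guesses, and stopping tolerance are illustrative choices, not part of the original notes.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D data drawn from two Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1.0, 200), rng.normal(3, 1.5, 300)])

# 1. Initialization: rough starting guesses for the parameters
pi = np.array([0.5, 0.5])        # mixing probabilities
mu = np.array([-1.0, 1.0])       # component means
sigma = np.array([1.0, 1.0])     # component standard deviations

prev_ll = -np.inf
for _ in range(200):
    # 2. E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)

    # 3. M-step: re-estimate parameters from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

    # 4. Convergence: stop when the log-likelihood barely improves
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print("mixing weights:", pi, "means:", mu, "std devs:", sigma)
```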
Advantages of the EM Algorithm
1. Handles Incomplete Data: EM can effectively handle datasets with missing or incomplete data by treating the missing values as latent variables and iteratively estimating them.
2. Flexibility: It can be applied to a variety of statistical models, including Gaussian Mixture Models, Hidden Markov Models, and more.
3. Iterative Refinement: The alternation of E-step and M-step progressively refines the parameter estimates as more iterations are performed.
4. Convergence: Each iteration of the EM algorithm guarantees that the likelihood of the observed data does not decrease, so the algorithm converges towards a local maximum.
5. General Framework: The EM algorithm provides a general framework that can be adapted to different models by modifying the E-step and M-step accordingly.
Disadvantages of the EM Algorithm
1. Local Optima: EM often converges to a local maximum of the likelihood function, which may not be the global maximum.
2. Computational Complexity: For large datasets or complex models, the computation can be intensive, especially if the model involves high-dimensional latent variables.
3. Initialization Sensitivity: The results can be sensitive to the choice of initial parameter values; poor initialization can lead to suboptimal solutions.
4. Slow Convergence: In some cases, especially with complex models or poor initialization, the algorithm may converge slowly and require many iterations.
5. Missing-Data Assumptions: Although EM can handle missing data, it assumes that the missing values are missing at random (MAR) or missing completely at random (MCAR).
Gaussian Mixture Models (GMM)
 Gaussian Mixture Models (GMMs), fitted with the Expectation-Maximization (EM) algorithm, are a powerful method for clustering data.
 A GMM is a function composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in the data set. Each Gaussian k in the mixture is described by the following parameters:
 A mean μ that defines its center.
 A covariance Σ that defines its width; in a multivariate scenario this corresponds to the dimensions of an ellipsoid.
 A mixing probability π that defines how large or small the Gaussian component is.
 GMMs can model clusters of various shapes by adjusting the covariance matrix of each Gaussian component.
 Because GMMs use probabilistic assignments, they are robust to noise and outliers in the data.
 Unlike hard clustering methods such as k-means, a GMM assigns each data point a probability of belonging to each cluster, allowing for soft clustering and uncertainty quantification.
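Collecting these parameters, the mixture density of a GMM takes the standard form below (written in LaTeX for clarity); N denotes the multivariate Gaussian density.

```latex
% Standard Gaussian mixture density over K components
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \,\mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \text{with } \sum_{k=1}^{K} \pi_k = 1,\; \pi_k \ge 0 .
```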
The model is therefore defined by the means μ_k, covariances Σ_k, and mixing probabilities π_k of its K components.
Advantages of GMM and EM for Clustering
 Flexible Clustering: GMMs can model data with varying cluster shapes and sizes thanks to the use of covariance matrices.
 Soft Clustering: Unlike hard clustering methods such as k-means, GMMs assign probabilities to data points, allowing for soft assignments to clusters.
 Handling Overlapping Clusters: GMMs can effectively handle situations where clusters overlap by using probabilistic assignments.
Applications of the EM Algorithm
1. Clustering: EM is widely used to fit GMMs for clustering data into groups based on the underlying distribution of features.
2. Dimensionality Reduction: EM can be used to estimate parameters in factor analysis models, which reduce dimensionality by identifying latent variables.
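As a quick illustration of soft clustering in practice, the sketch below fits a two-component GMM with scikit-learn's GaussianMixture class (which runs EM internally); the availability of scikit-learn, the synthetic data, and the choice of two components are assumptions for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative 2-D data: two overlapping blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([3, 3], 1.5, size=(150, 2))])

# Fit a 2-component GMM; the fitting procedure is the EM algorithm
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

labels = gmm.predict(X)          # hard cluster assignments
resp = gmm.predict_proba(X)      # soft assignments (responsibilities)

print("mixing weights:", gmm.weights_)
print("first point responsibilities:", resp[0])   # e.g. mostly weight on one cluster
```

The responsibilities returned by predict_proba are exactly the soft assignments discussed above, whereas predict collapses them to a hard label per point.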
3. Image Processing: EM is used in image processing tasks like segmentation, where it helps identify different regions or objects within an image.
4. Bioinformatics: EM can be applied to analyze gene expression data, particularly when dealing with incomplete or noisy measurements.
5. Finance: EM is used to estimate parameters in financial models, such as those involving latent variables related to market risk or credit risk.
6. Recommendation Systems: EM can be used to model user preferences and item attributes in recommendation systems, improving predictions of user ratings or preferences.
***********