Neural Networks in Data Mining – “An Overview”
Agenda
• Introduction
• Data Mining Techniques
• Neural Networks for Data Mining?
  - Neural Networks Classification
  - Neural Networks Pruning
  - Neural Networks Rule Extraction
• Conclusion
• Questions?
Data Mining
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
• It is an essential step in the process of knowledge discovery.
Steps of Knowledge Discovery
• data cleaning
• data integration
• data selection
• data transformation
• data mining
• pattern evaluation
• knowledge presentation
Data Mining: A KDD Process
• Data mining is the core of the knowledge discovery process.
[Figure: the KDD pipeline. Databases undergo data cleaning and data integration into a data warehouse; selection produces the task-relevant data; data mining finds patterns; pattern evaluation follows.]
Why Data Mining?
• The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining.
Tasks of data mining
• Concept Description
• Association
• Classification
• Prediction
• Cluster Analysis
• Outlier Analysis
Classification
• It is the process of finding a model that is able to predict the class of objects whose label is unknown.
• For example, a classifier can identify the customers who are able to repay a loan, based on the existing records in the bank database.
Classification methods
• Decision trees
• Bayesian classification
• Neural networks
• Genetic algorithms
• Memory-based reasoning
etc.
Why Neural networks?
• High tolerance of noisy data.
• Ability to classify patterns on which they have not been trained.
• Can be used when there is little knowledge of the relationships between attributes and classes.
Why Neural networks? – Contd.
• Well suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.
• Rules can easily be extracted from a trained neural network using available techniques.
Neural Networks
It is the study of how to make computers make sensible decisions and learn from ordinary experience, as we do.
Neurons
The human brain has about 100 billion neurons and 100 trillion connections (synapses) between them.
[Figure: a typical neuron.]
Many highly specialized types of neurons exist, and these differ widely in appearance. Characteristically, neurons are highly asymmetric in shape.
Multi-layer feed-forward neural network
• It consists of an input layer, one or more hidden layers and an output layer.
[Figure: structure of a multi-layer feed-forward neural network, showing the input layer, hidden layer and output layer.]
Backpropagation
• Backpropagation is the most widely used neural network learning algorithm.
• It learns by iteratively processing a dataset of training examples, comparing the network's prediction for each example with the actual known target value.
Overview of BP
The backpropagation algorithm trains the network by iteratively processing the np training examples of a dataset, comparing the network's output ok for each example with the desired known target value dk for each target class k in the dataset.
Overview of BP – Contd.
Consider a fully connected three-layer feed-forward neural network, as in the figure below.
[Figure: input neurons X1, X2, …, Xi, …, Xl; hidden neurons h1, h2, …, hm; output neurons O1, …, On; input-to-hidden weights w11, w12, …, wl1, …, wlm; hidden-to-output weights v11, v12, …, vm1, …, vmn; a bias input fixed at -1 feeds the hidden and output layers.]
Overview of BP – Contd.
• The network consists of l input neurons, m hidden neurons and n output neurons.
• Let np be the number of examples considered for training.
• Let xip be the ith input unit of the pth example in the dataset, where i = 1, 2, …, l.
• Let wij be the weight between input neuron i and hidden neuron j, where j = 1, 2, …, m.
Overview of BP – Contd.
• Let vjk be the weight between hidden neuron j and output neuron k, where k = 1, 2, …, n.
• Initially the weights wij and vjk take random values between -1 and 1.
• Let hj be the activation value of hidden neuron j.
• Let ok be the actual output of the kth output neuron.
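A minimal sketch of this setup in Python with NumPy (the layer sizes are example values; the variable names W, V, theta_h and theta_o stand for the slides' wij, vjk and bias weights):

    import numpy as np

    rng = np.random.default_rng(0)

    l, m, n = 4, 3, 2  # numbers of input, hidden and output neurons (example sizes)

    # The weights wij and vjk initially take random values between -1 and 1.
    W = rng.uniform(-1.0, 1.0, size=(l, m))  # wij: input neuron i -> hidden neuron j
    V = rng.uniform(-1.0, 1.0, size=(m, n))  # vjk: hidden neuron j -> output neuron k

    # One bias weight per hidden and per output neuron (the bias input is fixed at -1).
    theta_h = rng.uniform(-1.0, 1.0, size=m)
    theta_o = rng.uniform(-1.0, 1.0, size=n)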
Overview of BP – Contd.
Bias
• It is a threshold value that serves to vary the activity of the neuron.
• The bias input is fixed and always equals -1.
Overview of BP – Contd.
The activation value of hidden neuron hj for the pth example can be calculated by

  hj = f(Σi wij · xip − θj),  where f(z) = 1 / (1 + e^(−z))

(the −θj term is the bias input of −1 weighted by the bias weight θj).
Overview of BP – Contd.
The actual output ok can be calculated by

  ok = f(Σj vjk · hj − θk)
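A minimal sketch of this forward pass, continuing the setup above. The logistic sigmoid f is an assumption, consistent with the error terms ok(1 - ok) and hj(1 - hj) on the later slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W, V, theta_h, theta_o):
        # hj = f(sum_i wij * xip - theta_j); the -theta_j term is the
        # bias input of -1 multiplied by the bias weight theta_j.
        h = sigmoid(x @ W - theta_h)
        # ok = f(sum_j vjk * hj - theta_k)
        o = sigmoid(h @ V - theta_o)
        return h, o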
Overview of BP – Contd.
• Weights are modified for each example so as to minimize the mean squared error (mse).
• The value of mse can be calculated according to the following equation:

  mse = (1/np) Σp Σk (dk − ok)²

where the inner sum runs over the n output units of the pth example.
Overview of BP – Contd.
Weight updates are made in the backward direction, i.e., from the output layer through the hidden layers to the input layer.
Overview of BP – Contd.
Learning Rate λ
• Helps avoid local minima (where the weights appear to converge but are not at the optimal solution).
• Encourages finding the global minimum.
• Typically takes a value between 0.0 and 1.0.
Overview of BP – Contd.
For each unit k in the output layer:
• compute the error using Errk = ok(1 − ok)(dk − ok)
For each weight vjk in the network:
• compute the weight increment using Δvjk = λ · Errk · hj
• update the weight using vjk = vjk + Δvjk
Overview of BP – Contd.
For each unit j in the hidden layers, from the last to the first hidden layer:
• compute the error using Errj = hj(1 − hj) Σk Errk · vjk
For each weight wij in the network:
• compute the weight increment using Δwij = λ · Errj · xip
• update the weight using wij = wij + Δwij
Overview of BP – Contd.
For each bias θj in the network:
• compute the bias increment using Δθj = λ · Errj
• update the bias using θj = θj + Δθj
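A minimal sketch of these update rules for a single example, continuing the sketches above. The vectorized outer products are an implementation choice; the output-layer bias update mirrors the hidden-layer rule, and the signs are copied from the slides' equations:

    import numpy as np

    def backward_update(x, h, o, d, W, V, theta_h, theta_o, lam=0.1):
        err_o = o * (1.0 - o) * (d - o)      # Errk = ok(1 - ok)(dk - ok)
        err_h = h * (1.0 - h) * (V @ err_o)  # Errj = hj(1 - hj) sum_k Errk * vjk
        V += lam * np.outer(h, err_o)        # dvjk = lam * Errk * hj
        W += lam * np.outer(x, err_h)        # dwij = lam * Errj * xip
        theta_o += lam * err_o               # bias increments: d(theta) = lam * Err
        theta_h += lam * err_h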
Overview of BP – Contd.
The algorithm stops learning when:
• the mean squared error falls below a threshold value, or
• a pre-specified number of epochs has elapsed.
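A minimal sketch of the whole training loop with both stopping criteria, reusing forward() and backward_update() from the sketches above (the threshold and epoch limit are illustrative values):

    import numpy as np

    def train(X, D, W, V, theta_h, theta_o, lam=0.1,
              mse_threshold=0.01, max_epochs=200):
        for epoch in range(max_epochs):          # stop: pre-specified number of epochs
            sq_err = 0.0
            for x, d in zip(X, D):               # iterate over the np training examples
                h, o = forward(x, W, V, theta_h, theta_o)
                backward_update(x, h, o, d, W, V, theta_h, theta_o, lam)
                sq_err += float(np.sum((d - o) ** 2))
            if sq_err / len(X) < mse_threshold:  # stop: mse below a threshold value
                break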
Data selection methods
Random data selection method
• The training and testing examples are taken randomly from each class.
K-fold cross-validation method
Example: the Iris dataset has 3 classes with 50 examples per class. From each class, 25 examples are taken randomly for training and another 25 are taken randomly for testing the network.
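A minimal sketch of the random per-class selection described above (the function name and interface are illustrative; for the Iris example, n_train would be 25):

    import numpy as np

    def per_class_split(X, y, n_train, seed=0):
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            train_idx.extend(idx[:n_train])  # n_train random examples per class for training
            test_idx.extend(idx[n_train:])   # the remaining examples for testing
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]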
Performance Measures
Accuracy
• The percentage of the test dataset that is correctly classified by the classifier.
Speed
• The computational time and cost involved in generating and using a given classifier.
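A minimal sketch of both measures (accuracy as a percentage of correct test classifications; speed as wall-clock time):

    import time
    import numpy as np

    def accuracy(y_true, y_pred):
        # Percentage of the test dataset that is correctly classified.
        return 100.0 * float(np.mean(y_true == y_pred))

    start = time.perf_counter()
    # ... generate the classifier and classify the test set here ...
    elapsed = time.perf_counter() - start  # time taken to generate and use the classifier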
Evolving Network Architectures
The success of ANNs largely depends on their architecture.
• Small networks require long training times and can easily get trapped in local minima.
• Large networks learn fast and avoid local minima, but generalize poorly.
• The optimal architecture is a network large enough to learn the problem and small enough to generalize well.
Approaches for optimizing Neural Networks
Constructive methods
- new hidden units are added during the training process; also called growing methods.
Destructive methods
- a large network is trained and then unimportant nodes or weights are removed; also called pruning methods.
Hybrid methods
- can both add and remove.
What is Pruning?
• Pruning is defined as network trimming within the assumed initial architecture.
• This can be accomplished by estimating the sensitivity of the total error to the exclusion of each weight or neuron in the network.
• The weights or neurons that are insensitive to error changes can be discarded after each step of training.
• The trimmed network is smaller and is likely to give higher accuracy than before trimming.
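A minimal sketch of one simple criterion in this spirit: discard the hidden neuron whose outgoing weights are smallest in magnitude, a common proxy for low sensitivity of the total error (the exact sensitivity measure behind the results below is not specified on the slides):

    import numpy as np

    def prune_one_hidden_neuron(W, V, theta_h):
        # A hidden neuron whose outgoing weights |vjk| are all small barely
        # affects the outputs, so the error is insensitive to its removal.
        j = int(np.argmin(np.abs(V).sum(axis=1)))
        W = np.delete(W, j, axis=1)      # drop its incoming weights
        V = np.delete(V, j, axis=0)      # drop its outgoing weights
        theta_h = np.delete(theta_h, j)  # drop its bias weight
        return W, V, theta_h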
Hepatitis Pruning Results

Step  Architecture  Acc_test (%)  Epochs  Pruned Neurons
1     19-25-2       78.2          200     18 hidden neurons
2     19-7-2        80.5          50      5 hidden neurons
3     19-2-2        83.95         50      pruning stops

• The original network with architecture 19-25-2 and accuracy 78.2% is reduced to architecture 19-2-2.
• Obtaining the pruned network requires 0.76 seconds.
Rule Extraction
Why rule extraction?
• An important drawback of neural networks is their lack of explanation capability, i.e., it is very difficult to understand how an ANN has solved a problem. Various rule extraction algorithms have been developed to overcome this problem.
• Rule extraction turns a black-box system into a white-box system by translating the internal knowledge of a neural network into a set of symbolic rules.
• The classification process of a neural network can then be described by a set of simple rules.
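For illustration, extracted rules typically take an if-then form; a hypothetical rule set for the Iris data (the attributes and thresholds are invented for illustration, not taken from the slides) might look like:

    IF petal_length <= 2.5 THEN class = Iris-setosa
    IF petal_length > 2.5 AND petal_width <= 1.7 THEN class = Iris-versicolor
    OTHERWISE class = Iris-virginica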
Extracted rules from 6 real datasets.
NNs might, in the future, allow:
• robots that can see, feel, and predict the world around them
• improved stock prediction
• common usage of self-driving cars
• composition of music
• handwritten documents to be automatically transformed into formatted word processing documents
• trends found in the human genome to aid in the understanding of the data compiled by the Human Genome Project
• self-diagnosis of medical problems using neural networks
• and much more!
QUESTIONS