Instance-Based Learning
Overview
• Instance-Based Learning
• Comparison of Eager and Instance-Based Learning
• Instance Distances for Instance-Based Learning
• Nearest Neighbor (NN) Algorithm
• Advantages and Disadvantages of the NN algorithm
• Approaches to overcome the Disadvantages of the NN algorithm
• Locally weighted regression
• Radial basis functions
• Case based Reasoning
Different Learning Methods
• Eager Learning
– Learning = acquiring an explicit structure of a classifier
on the whole training set;
– Classification = an instance gets a classification using
the explicit structure of the classifier.
• Instance-Based Learning (Lazy Learning)
– Learning = storing all training instances
– Classification = an instance gets a classification equal to
the classification of the nearest instances to the instance.
Instance-Based Learning
– All learning methods presented so far construct a general explicit description of the target function when examples are provided.
– In instance-based learning:
  – Examples are simply stored.
  – Generalizing is postponed until a new instance must be classified, which is why it is sometimes referred to as lazy learning.
  – To assign a target function value to a new instance, its relationship to the previously stored examples is examined.
– IBL includes nearest-neighbor, locally weighted regression, and case-based reasoning methods.
 Advantages:
 Instead of estimating for the whole instance space, local
approximations to the target function are possible
 Especially if target function is complex but still decomposable
 Disadvantages:
 Classification costs are high (number of computations to index
each training example at query time)
 Efficient techniques for indexing examples are important to
reduce computational effort
 Typically all attributes are considered when attempting to
retrieve similar training examples from memory
 If the concept depends only on a few attributes, the truly most
similar instances may be far away
Nearest-Neighbor Algorithm (NN)
The features of the task of the NN algorithm:
• The instance language comes with a set A of n attributes a1, a2, …, an.
• The domain of each attribute ai can be discrete or continuous.
• An instance x is represented as < a1(x), a2(x), …, an(x) >, where ai(x) is the value of the attribute ai for the instance x.
• The classes to be learned can be:
 – Discrete: in this case we learn a discrete function f(x), and the co-domain C of the function consists of the classes c to be learned.
 – Continuous: in this case we learn a continuous function f(x), and the co-domain C of the function is a set of real values rather than discrete classes.
Distance Functions
The distance functions are composed from difference metrics da w.r.t. attributes a, defined for each pair of instances xi and xj.
• If the attribute a is numerical, then:
  da(xi, xj) = |a(xi) − a(xj)| / rangea
• If the attribute a is discrete, then:
  da(xi, xj) = 0 if a(xi) = a(xj), and 1 otherwise.
Distance Functions
The main distance function for determining nearest neighbors is the Euclidean distance:
  d(xi, xj) = √( Σ a ∈ A da(xi, xj)² )
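As a minimal illustration, the two difference metrics and the Euclidean distance can be written in Python roughly as follows (the function names and the dictionary representation of instances are illustrative assumptions, not part of the slides):

from math import sqrt

# Per-attribute difference metric d_a and the Euclidean instance distance,
# following the definitions above. Instances are dictionaries mapping
# attribute names to values; `ranges` gives range_a for each numerical attribute.

def attribute_distance(a, x_i, x_j, ranges):
    if a in ranges:                                # numerical attribute
        return abs(x_i[a] - x_j[a]) / ranges[a]
    return 0.0 if x_i[a] == x_j[a] else 1.0        # discrete attribute

def instance_distance(x_i, x_j, ranges):
    return sqrt(sum(attribute_distance(a, x_i, x_j, ranges) ** 2 for a in x_i))

# Example: one numerical and one discrete attribute
ranges = {"height": 50.0}
x1 = {"height": 170.0, "color": "red"}
x2 = {"height": 180.0, "color": "blue"}
print(instance_distance(x1, x2, ranges))           # sqrt(0.2**2 + 1.0**2) ≈ 1.02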
k-Nearest-Neighbor Algorithm
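A minimal Python sketch of the k-NN classification step for a discrete-valued target, assuming the instance_distance function sketched above (the data layout as a list of (instance, class) pairs is an illustrative assumption):

from collections import Counter

def knn_classify(x_q, training_data, k, ranges):
    # training_data is a list of (instance, class) pairs
    neighbors = sorted(training_data,
                       key=lambda xc: instance_distance(x_q, xc[0], ranges))[:k]
    # majority vote among the k nearest stored instances
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]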
Classification & Decision Boundaries
[Figure: positive (+) and negative (−) training instances around a query point q1. 1-NN classifies q1 as positive; 5-NN classifies q1 as negative.]
Distance Weighted Nearest-Neighbor Algorithm
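A minimal sketch of one common distance-weighted variant for a discrete-valued target, assuming weights wi = 1/d(xq, xi)² (a standard choice) and the instance_distance function from above:

from collections import defaultdict

def distance_weighted_knn(x_q, training_data, k, ranges):
    neighbors = sorted(training_data,
                       key=lambda xc: instance_distance(x_q, xc[0], ranges))[:k]
    votes = defaultdict(float)
    for x, c in neighbors:
        d = instance_distance(x_q, x, ranges)
        if d == 0.0:
            return c                      # exact match: use its class directly
        votes[c] += 1.0 / d ** 2          # each vote weighted by inverse squared distance
    return max(votes, key=votes.get)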
Advantages of the NN Algorithm
• The NN algorithm can estimate complex target classes
locally and differently for each new instance to be
classified;
• The NN algorithm provides good generalization accuracy
on many domains
• The NN algorithm learns very quickly;
• The NN algorithm is robust to noisy training data;
• The NN algorithm is intuitive and easy to understand
which facilitates implementation and modification.
Disadvantages of the NN Algorithm
• The NN algorithm has large storage requirements because it
has to store all the data
• The NN algorithm is slow at classification time because all the training instances have to be visited
• The accuracy of the NN algorithm degrades with increase of
noise in the training data
• The accuracy of the NN algorithm degrades with increase of
irrelevant attributes
Remarks
 Nearest-neighbour learning is a highly effective inductive inference method for many practical problems, provided a sufficiently large set of training examples is available.
 The inductive bias of k-nearest neighbours is the assumption that the classification of xq will be similar to the classification of other instances that are nearby in Euclidean distance.
 Because the distance is computed over all attributes, many irrelevant attributes can dominate it; this difficulty is referred to as the curse of dimensionality.
 Solutions to this problem (see the sketch below):
 Stretch the axes of the more relevant attributes and shorten the axes of the less relevant ones, i.e. weight attributes differently.
 Eliminate the least relevant attributes from the instance space.
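A minimal sketch of the attribute-weighting idea for numerical attributes (the weight values are illustrative assumptions; setting a weight to 0 eliminates that attribute from the distance):

from math import sqrt

def weighted_distance(x_i, x_j, weights):
    # weights[a] stretches (>1) or shrinks (<1) the axis of numerical attribute a;
    # a weight of 0 removes the attribute from the distance entirely.
    return sqrt(sum(w * (x_i[a] - x_j[a]) ** 2 for a, w in weights.items()))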
A note on terminology:
Regression means approximating a real-valued target function.
Residual is the error f̂(x) − f(x) in approximating the target function.
Kernel function is the function of distance that is used to determine the weight of each training example; in other words, the kernel function is the function K such that wi = K(d(xi, xq)).
Locally Weighted Linear Regression
• It is a generalization of the NN approach.
• Why local?
  • Because the function is approximated based only on data near the query point.
• Why weighted?
  • Because the contribution of each training example is weighted by its distance from the query point.
• Why linear?
  • The target function is approximated by a linear function f̂(x) = w0 + w1a1(x) + ... + wnan(x); methods like gradient descent can be used to calculate the coefficients w0, w1, ..., wn that minimize the error in fitting such linear functions.
• Why regression?
  • Approximating a real-valued target function.
• ANNs require a global approximation to the target function, but here just a local approximation is needed;
• therefore the error function has to be redefined.
Possibilities to redefine the error criterion E
1. Minimize the squared error over just the k nearest neighbours:
   E1(xq) ≡ (1/2) Σ x ∈ k nearest nbrs of xq (f(x) − f̂(x))²
2. Minimize the squared error over the entire set D, while weighting the error of each training example by some decreasing function K of its distance from xq:
   E2(xq) ≡ (1/2) Σ x ∈ D (f(x) − f̂(x))² · K(d(xq, x))
3. Combine 1 and 2:
   E3(xq) ≡ (1/2) Σ x ∈ k nearest nbrs of xq (f(x) − f̂(x))² · K(d(xq, x))
Choice of the error criterion
 E2 allows every training example to have an impact on the classification of xq; however, its computational effort grows with the number of training examples.
 E3 is a good approximation to E2 with constant effort.
 Rederiving the gradient descent rule for E3 gives (see the sketch below):
 ∆wj = η Σ x ∈ k nearest nbrs of xq K(d(xq, x)) (f(x) − f̂(x)) aj(x)
Remarks on locally weighted linear regression:
 In most cases, constant, linear or quadratic functions are used as local approximations to the target function,
 because the cost of fitting more complex functions is prohibitively high, and
 simple approximations are good enough over a sufficiently small subregion of the instance space.
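A rough Python sketch of locally weighted linear regression at a single query point, using the k nearest neighbours, a Gaussian kernel, and the gradient rule above (the NumPy data layout and all parameter values are illustrative assumptions):

import numpy as np

def lwr_predict(x_q, X, y, k=10, sigma=1.0, eta=0.01, epochs=200):
    # X: (m, n) array of training inputs, y: (m,) real-valued targets.
    # Gaussian kernel K(d) = exp(-d^2 / (2 sigma^2)); all parameters illustrative.
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])    # add a0(x) = 1 so w0 is the bias
    xq = np.concatenate([[1.0], x_q])
    d = np.linalg.norm(X - x_q, axis=1)              # distances to the query
    idx = np.argsort(d)[:k]                          # k nearest neighbours of x_q
    K = np.exp(-d[idx] ** 2 / (2 * sigma ** 2))
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):                          # gradient rule from above:
        err = y[idx] - Xa[idx] @ w                   # f(x) - f_hat(x)
        w += eta * (K * err) @ Xa[idx]               # dw_j = eta * sum K(d) err a_j(x)
    return float(xq @ w)                             # f_hat(x_q)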
RADIAL BASIS FUNCTIONS
 It is common to choose each kernel function Ku(d(xu, x)) to be a Gaussian function centred at xu with some variance σu²:
 Ku(d(xu, x)) = e^(−d²(xu, x) / (2σu²))
 The function f̂(x) can then be viewed as describing a two-layer network:
 1. Layer 1 consists of units that compute the values of the various Ku(d(xu, x)).
 2. Layer 2 computes a linear combination of these kernel outputs.
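A minimal sketch of this two-layer view, assuming the usual form f̂(x) = w0 + Σu wu Ku(d(xu, x)) with given kernel centres xu, weights wu, and a shared width σ (how these parameters are trained is not shown here):

import numpy as np

def rbf_predict(x, centers, weights, w0, sigma=1.0):
    # Layer 1: one Gaussian kernel unit per centre x_u
    d2 = np.sum((centers - x) ** 2, axis=1)          # d^2(x_u, x) for every centre
    K = np.exp(-d2 / (2 * sigma ** 2))               # K_u(d(x_u, x))
    # Layer 2: linear combination of the kernel outputs
    return w0 + float(weights @ K)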
CASE BASED REASONING
 Three important properties shared by NN and locally weighted regression:
1. They are lazy learners.
2. A new query is classified by analyzing similar stored instances.
3. Instances are represented as real-valued points in an n-dimensional space.
• CBR is based on the first two principles,
• but instances are represented using symbolic descriptions rather than real-valued points.
• Example: the CADET system
 i. CADET uses CBR to assist the design of simple mechanical devices such as water faucets.
 ii. Its library contains approximately 75 designs and design fragments in memory.
 iii. An instance is stored by describing its structure and its qualitative function.
 iv. A new design problem is presented by specifying the desired function and requesting the corresponding structure.
A STORED CASE AND A NEW PROBLEM
+ indicates that the variable at the arrowhead increases as the variable at its tail increases.
− indicates that the variable at the arrowhead decreases as the variable at its tail increases.
Generic Properties of CBR (distinguishing it from the NN method)
 Instances are represented by rich symbolic descriptions.
 Multiple cases may be combined to form a solution to a new problem.
 There may be tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
 Summary:
 CBR is an instance-based learning method in which instances are rich relational descriptions, and in which the retrieval and combination of cases to answer the current query may rely on knowledge-based reasoning and search-intensive problem-solving methods.