Support Vector Machines
Dr. M. Aksam Iftikhar,
Assistant Professor, CS
SVM: History & Motivation
o Support Vector Machine (SVM) is a supervised learning algorithm
developed by Vladimir Vapnik; it first gained attention in 1992, when it
was introduced by Boser, Guyon and Vapnik at COLT-92.
o Vapnik is said to have outlined the idea as early as 1979 in one of his
papers, but its major development came in the 1990s.
o SVMs entered the mainstream because of their exceptional performance in
handwritten digit recognition.
SVM: Applications
o Text and image classification
o Handwriting recognition
o Data mining
o Bioinformatics
o Medicine and biosequence analysis
o and even the stock market
SVM: Algorithm
SVM: Problem Definition - Linearly Separable Case
We are given a set of n points (vectors) x1, x2, ..., xn, where each xi is a
vector of length m and belongs to one of two classes, labelled "+1" and "-1".
So our training set is
(x1, y1), (x2, y2), ..., (xn, yn), with xi ∈ R^m and yi ∈ {+1, -1}.
We want to find a separating hyperplane w.x + b = 0
that separates these points into the two classes:
"the positives" (class "+1") and
"the negatives" (class "-1"),
assuming that they are linearly separable.
So the decision function will be
f(x) = sign(w.x + b)
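As a quick illustration, here is a minimal sketch (in Python, with made-up values for w and b used only for the example) of how the decision function f(x) = sign(w.x + b) labels a point once a separating hyperplane is known:

```python
import numpy as np

def predict(w, b, x):
    """Classify a point x with a known separating hyperplane w.x + b = 0."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative values only: the 2-D hyperplane x1 + x2 - 3 = 0
w, b = np.array([1.0, 1.0]), -3.0
print(predict(w, b, np.array([2.5, 2.5])))   # +1 (above the line)
print(predict(w, b, np.array([0.5, 0.5])))   # -1 (below the line)
```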
Linear Separators
Binary classification can be viewed as the task of
separating classes in feature space by a hyperplane w^T x + b = 0:
points with w^T x + b > 0 fall on one side, points with w^T x + b < 0 on the other,
so the classifier is f(x) = sign(w^T x + b).
But there are many possibilities for such hyperplanes!
SVM: Separating Hyperplane
Yes, there are many possible separating hyperplanes.
It could be this one, or this, or this, or maybe...!
Which one should we choose?
SVM: Choosing a Separating Hyperplane
Suppose we choose a hyperplane that is close to some sample xi.
Now suppose we have a new point x' that should be in
class "-1" and is close to xi. Using our classification
function f(x) = sign(w.x + b), this point is misclassified!
Poor generalization!
(Poor performance on unseen data)
SVM: Choosing a Separating Hyperplane
The hyperplane should be as far as possible from any sample point.
This way, new data that is close to the old samples will
be classified correctly.
Good generalization!
SVM: Choosing a Separating Hyperplane
The SVM idea is to maximize the distance between the
hyperplane and the closest sample points.
In the optimal hyperplane:
the distance to the closest negative point =
the distance to the closest positive point.
SVM: Choosing a Separating Hyperplane
SVM's goal is to maximize the margin, which is twice the
perpendicular distance "d" between the separating
hyperplane and the closest sample.
Why is it the best?
• Robust to outliers, as we saw, and thus has strong
generalization ability.
• It has proved to have better performance on test data,
both in practice and in theory.
SVM: Support Vectors
Support vectors are the samples closest to the
separating hyperplane. Oh! So this is where the
name came from!
Maximum Margin Classification
Maximizing the margin is also good according to intuition.
It implies that only the support vectors matter; the other training examples
can be ignored.
Finding the SVM hyperplane
• Determining the hyperplane is equivalent to finding the optimal
parameters w and b.
• The SVM hyperplane is determined by solving a dual optimization
problem based on Lagrange multipliers, not discussed here in detail.
• One important thing to note, however, is that this optimization problem
requires computing the inner products xi^T xj between all pairs of training
examples.
• Remember, the inner product (dot product) of two vectors is simply the sum of
the products of their components.
• We will return to this point later in the topic of the kernel trick.
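As a rough sketch of that remark, the pairwise inner products can be collected into an n×n Gram matrix; the array X below is a small made-up training set used only for illustration:

```python
import numpy as np

# Hypothetical training set: n = 4 examples, m = 2 features each.
X = np.array([[ 1.0,  2.0],
              [ 2.0,  1.0],
              [-1.0, -1.5],
              [-2.0, -0.5]])

# Gram matrix G[i, j] = xi . xj, the only quantity the dual problem needs.
G = X @ X.T            # shape (n, n)

# Same value element by element: a dot product is a sum of componentwise products.
i, j = 0, 1
assert np.isclose(G[i, j], sum(X[i, k] * X[j, k] for k in range(X.shape[1])))
```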
SVM: Limitations
Limitations of linear SVM:
• It doesn't work well on non-linearly separable data.
• Noise (outlier) problem.
But it can deal with non-linear classes with a nice tweak.
SVM: Non-Linear Case
Key idea: map our points with a mapping function Φ(x) to a space of
sufficiently high dimension so that they become separable by a
hyperplane in the new higher-dimensional feature space.
o Input space: the space where the points xi are located.
o Feature space: the space of Φ(xi) after the transformation, where Φ(.) is the
transformation function. For example, a non-linearly separable case in one
dimension can be mapped to two-dimensional space with Φ(x) = (x, x^2).
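A minimal sketch of this one-dimensional example (the data points are made up): after the mapping Φ(x) = (x, x^2), the two classes are separated by a horizontal line in the new feature space.

```python
import numpy as np

# Made-up 1-D data: class -1 sits between the two groups of class +1,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])

# Map to 2-D with phi(x) = (x, x^2); the classes are now separated by
# the horizontal line x2 = 2 in the new feature space.
phi = np.column_stack([x, x**2])
print(all(phi[y == +1, 1] > 2))   # True: positives lie above the line
print(all(phi[y == -1, 1] < 2))   # True: negatives lie below it
```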
Interlude: Illustration of a hyperplane
• A hyperplane refers to a D-dimensional plane in a (D+1)-dimensional
space.
• E.g. an SVM boundary line is a hyperplane in a 2D input space
(see figure below). Similarly, in a 3D input space, the SVM
boundary will be a 2-dimensional hyperplane, and so on.
• This terminology is especially useful for 4D and higher-dimensional
input spaces, for which visualizing the input space is not possible.
Input space vs Feature space
• Earlier, we referred to a space induced by features (feature vectors) as a
feature space.
• In this lecture on SVM, the space before mapping to the higher
dimension is referred to as the input space, and after the mapping, it is
called the feature space.
SVM: Non-Linear Case
An illustration of the algorithm in 2D input space:
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x→ φ(x)
Limitations of non-linear mapping
1. The new feature space (after mapping) may be very high dimensional,
and working in the high-dimensional space is computationally expensive.
• Remember, in order to compute w (and b), we need to compute the inner
products between the feature vectors of the training examples,
i.e. xi^T xj for all i, j.
• Computing these inner products in the higher-dimensional space increases the
computational complexity to a great extent.
Limitations of non-linear mapping
2. How do we know the mapping function, i.e. what will be the new set of
features in the higher-dimensional space? E.g. for a single feature x, the
additional features could be x^2, sqrt(x), sin(x), etc.
• Finding the correct mapping function is not easy.
Both of the above limitations are addressed in SVM classification by
applying a simple tweak – the Kernel Trick.
The Kernel Trick
The Kernel Trick – intuition
• To compute the inner products of vectors in the higher-
dimensional space, we can apply the kernel trick.
• Using this trick, we can compute the inner products (in the
higher dimension) without actually going into the higher
dimension.
• Remember, it is the inner products of the vectors that we
need in the higher dimension, not the vectors themselves.
• This trick is known as the kernel trick because it is applied with the
help of a kernel function.
SVM: Kernel Function
• A kernel function is defined as a function of the input space that
can be written as a dot product of two feature vectors in the
expanded feature space:
K(xi, xj) = φ(xi)^T φ(xj)
• We noted earlier that determining the hyperplane in linear
SVM involves dot products of input vectors, i.e. xi^T xj.
• Now, we only need to compute K(xi, xj) (which uses the
input space) and don't need to perform the computations
in the higher-dimensional feature space explicitly. This is
what is called the Kernel Trick.
The "Kernel Trick" – example
• Let us understand the kernel trick with the help of an example.
• Assume a non-linearly separable case, where the original input vectors
are 2-dimensional, i.e. x = [x1, x2].
• In order to apply SVM, the input space is transformed into a
(higher) 6-dimensional feature space according to the following
mapping function:
φ(x) = [1, x1^2, √2 x1x2, x2^2, √2 x1, √2 x2],
i.e. every data point x = [x1, x2] is mapped into the high-dimensional
space via the transformation Φ: x → φ(x).
The "Kernel Trick" – example (contd.)
• Now, determining the hyperplane in this higher dimension requires
computing the dot products of the feature vectors, i.e. φ(xi)^T φ(xj) for
all i, j. Therefore,
K(xi, xj) = φ(xi)^T φ(xj), where φ(x) = [1, x1^2, √2 x1x2, x2^2, √2 x1, √2 x2]
• Expanding the product:
K(xi, xj) = [1, xi1^2, √2 xi1xi2, xi2^2, √2 xi1, √2 xi2]^T [1, xj1^2, √2 xj1xj2, xj2^2, √2 xj1, √2 xj2]
= 1 + xi1^2 xj1^2 + 2 xi1xj1 xi2xj2 + xi2^2 xj2^2 + 2 xi1xj1 + 2 xi2xj2
= (1 + xi^T xj)^2
• where the final expression is a function of the original input space, i.e. xi
= [xi1, xi2], and not of the transformed higher-dimensional feature space.
The "Kernel Trick" – example (contd.)
• In other words, if we compute (1 + xi^T xj)^2 in the input space, this is
equivalent to computing the inner product of the vectors in the higher-
dimensional feature space, i.e.
K(xi, xj) = (1 + xi^T xj)^2 = φ(xi)^T φ(xj)
• Thus, a kernel function implicitly maps data to a high-dimensional
feature space (without the need to compute each φ(x) explicitly).
• In this case, the kernel function is K(xi, xj) = (1 + xi^T xj)^2.
• Note that choosing the kernel function also relieves us from the
problem of choosing the feature set in the higher dimension, i.e. we
don't need to engineer features in the higher dimension.
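A quick numerical check of this identity (the points xi and xj below are arbitrary made-up values): computing the degree-2 polynomial kernel in the input space gives the same number as the explicit dot product in the 6-dimensional feature space.

```python
import numpy as np

def phi(x):
    """Explicit 6-D feature map from the example above."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def kernel(xi, xj):
    """Degree-2 polynomial kernel, computed purely in the input space."""
    return (1.0 + np.dot(xi, xj)) ** 2

# Two arbitrary 2-D points.
xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])

lhs = kernel(xi, xj)            # input-space computation
rhs = np.dot(phi(xi), phi(xj))  # explicit feature-space computation
print(np.isclose(lhs, rhs))     # True: same number, no 6-D work needed
```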
The "Kernel Trick"
• The kernel function K(xi, xj) = (1 + xi^T xj)^2 is a special case of a more
general class of kernel functions, called polynomial kernels
(more about this on the next slide).
• This is not the only choice of kernel function.
• In fact, we can choose from different kernel functions, including the
polynomial kernel, the sigmoid kernel, the Gaussian kernel, etc.
Examples of Kernel Functions
Linear kernel:
K(xi, xj) = xi^T xj
Polynomial kernel of power p:
K(xi, xj) = (1 + xi^T xj)^p
Gaussian kernel (also called RBF kernel); can lift to an infinite-dimensional space:
K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))
Two-layer perceptron (sigmoid) kernel:
K(xi, xj) = tanh(β0 xi^T xj + β1)
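The formulas above translate directly into code; the sketch below is one possible transcription, where p, σ, β0 and β1 are free parameters chosen only for illustration:

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    # "Two-layer perceptron" kernel; beta0 and beta1 are free parameters.
    return np.tanh(beta0 * np.dot(xi, xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(xi, xj))
```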
SVM: Kernel Issues
How do we know which kernel to use?
This is a good question and actually still an open one; many
researchers have been working on this issue, but we still don't have a
firm answer. It is one of the weaknesses of SVM.
Generally, we have to test each kernel on the particular problem.
How can we verify that mapping to a higher dimension using a specific
kernel will take the data to a space in which it is linearly separable?
Even though mapping to a higher dimension increases the likelihood that
the data will be separable, we can't guarantee it.
SVM: Kernel Issues
We saw that the Gaussian (radial basis function) kernel lifts the data to an
infinite dimension, so our data is always separable in that space.
So why don't we always use this kernel?
First of all, we have to decide which σ to use in this kernel:
K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))
Secondly, a strong kernel, which lifts the data to an infinite
dimension, may sometimes lead us to the severe problem of
overfitting.
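As an illustrative sketch of this point (assuming scikit-learn is available; the toy dataset and parameter values are made up), an RBF-kernel SVM with a very large gamma, which corresponds to a very small σ, tends to fit the training data almost perfectly while doing worse on held-out data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy, non-linearly separable toy data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma plays the role of 1/(2*sigma^2): larger gamma = narrower Gaussian.
for gamma in (0.5, 500.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
# Typically, the huge gamma fits the training set almost perfectly
# but scores noticeably worse on the held-out test set (overfitting).
```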
SVM: Kernel Issues
o In addition to the above problems, another issue is that
sometimes the points are linearly separable but the margin is low.
All these problems lead us to a compromise solution, the Soft Margin:
a solution which can work even if our data is not perfectly linearly separable.
SVM: Kernel Issues
Soft Margin:
A soft margin in a Support Vector Machine (SVM) is a technique that allows the SVM to
classify data that is not linearly separable by permitting some mistakes (margin violations).
The goal is to keep the margin wide enough so that the other points can still be classified correctly.
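A possible sketch of the soft margin in practice (assuming scikit-learn; the data and C values are illustrative): the parameter C controls how heavily margin violations are penalized, so a small C gives a wider, softer margin and a large C a narrower, harder one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

# Smaller C tolerates more margin violations (softer, wider margin);
# larger C penalizes mistakes heavily (harder, narrower margin).
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width = {margin_width:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```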