Support Vector Machines
Dr. M. Aksam Iftikhar,
Assistant Professor, CS
SVM: History & Motivation
o Support Vector Machine (SVM) is a supervised learning algorithm
developed by Vladimir Vapnik; it first gained attention in 1992, when it
was introduced by Boser, Guyon and Vapnik at COLT-92.
o Vapnik is said to have outlined the idea as early as 1979 in one of his
papers, but its major development came in the 1990s.
o SVMs entered the mainstream because of their exceptional performance in
handwritten digit recognition.
SVM: Applications
o Text and image classification
o Handwriting recognition
o Data mining
o Bioinformatics
o Medicine and biosequence analysis
o and even the stock market
SVM: Algorithm
SVM: Problem Definition - Linearly Separable Case
We are given a set of n points (vectors) x1, x2, ..., xn, where each xi is a
vector of length m and belongs to one of two classes, labelled "+1" and "-1".
So our training set is
(x1, y1), (x2, y2), ..., (xn, yn), with xi ∈ R^m and yi ∈ {+1, -1}.
We want to find a separating hyperplane w.x + b = 0
that separates these points into the two classes:
"the positives" (class "+1") and
"the negatives" (class "-1"),
assuming that they are linearly separable.
So the decision function will be
f(x) = sign(w.x + b)
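As a quick illustration, here is a minimal sketch (in Python, with made-up values for w and b used only for the example) of how the decision function f(x) = sign(w.x + b) labels a point once a separating hyperplane is known:

```python
import numpy as np

def predict(w, b, x):
    """Classify a point x with a known separating hyperplane w.x + b = 0."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative values only: the 2-D hyperplane x1 + x2 - 3 = 0
w, b = np.array([1.0, 1.0]), -3.0
print(predict(w, b, np.array([2.5, 2.5])))   # +1 (above the line)
print(predict(w, b, np.array([0.5, 0.5])))   # -1 (below the line)
```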
Linear Separators
Binary classification can be viewed as the task of
separating classes in feature space by a hyperplane w^T x + b = 0:
points with w^T x + b > 0 fall on one side, points with w^T x + b < 0 on the other,
so the classifier is f(x) = sign(w^T x + b).
But there are many possibilities for such hyperplanes!
SVM: Separating Hyperplane
Yes, there are many possible separating hyperplanes.
It could be this one, or this, or this, or maybe...!
Which one should we choose?
SVM: Choosing a Separating Hyperplane
Suppose we choose a hyperplane that is close to some sample xi.
Now suppose we have a new point x' that should be in
class "-1" and is close to xi. Using our classification
function f(x) = sign(w.x + b), this point is misclassified!
Poor generalization!
(Poor performance on unseen data)
SVM: Choosing a Separating Hyperplane
The hyperplane should be as far as possible from any sample point.
This way, new data that is close to the old samples will
be classified correctly.
Good generalization!
SVM: Choosing a Separating Hyperplane
The SVM idea is to maximize the distance between the
hyperplane and the closest sample points.
In the optimal hyperplane:
the distance to the closest negative point =
the distance to the closest positive point.
SVM: Choosing a Separating Hyperplane
SVM's goal is to maximize the margin, which is twice the
perpendicular distance "d" between the separating
hyperplane and the closest sample.
Why is it the best?
• Robust to outliers, as we saw, and thus has strong
generalization ability.
• It has proved to have better performance on test data,
both in practice and in theory.
SVM: Support Vectors
Support vectors are the samples closest to the
separating hyperplane. Oh! So this is where the
name came from!
Maximum Margin Classification
Maximizing the margin is also good according to intuition.
It implies that only the support vectors matter; the other training examples
can be ignored.
Finding the SVM hyperplane
• Determining the hyperplane is equivalent to finding the optimal
parameters w and b.
• The SVM hyperplane is determined by solving a dual optimization
problem based on Lagrange multipliers, not discussed here in detail.
• One important thing to note, however, is that this optimization problem
requires computing the inner products xi^T xj between all pairs of training
examples.
• Remember, the inner product (dot product) of two vectors is simply the sum of
the products of their components.
• We will return to this point later in the topic of the kernel trick.
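As a rough sketch of that remark, the pairwise inner products can be collected into an n×n Gram matrix; the array X below is a small made-up training set used only for illustration:

```python
import numpy as np

# Hypothetical training set: n = 4 examples, m = 2 features each.
X = np.array([[ 1.0,  2.0],
              [ 2.0,  1.0],
              [-1.0, -1.5],
              [-2.0, -0.5]])

# Gram matrix G[i, j] = xi . xj, the only quantity the dual problem needs.
G = X @ X.T            # shape (n, n)

# Same value element by element: a dot product is a sum of componentwise products.
i, j = 0, 1
assert np.isclose(G[i, j], sum(X[i, k] * X[j, k] for k in range(X.shape[1])))
```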
SVM: Limitations
Limitations of linear SVM:
• It doesn't work well on non-linearly separable data.
• Noise (outlier) problem.
But it can deal with non-linear classes with a nice tweak.
SVM: Non-Linear Case
Key idea: map our points with a mapping function Φ(x) to a space of
sufficiently high dimension so that they become separable by a
hyperplane in the new higher-dimensional feature space.
o Input space: the space where the points xi are located.
o Feature space: the space of Φ(xi) after the transformation, where Φ(.) is the
transformation function. For example, a non-linearly separable case in one
dimension can be mapped to two-dimensional space with Φ(x) = (x, x^2).
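A minimal sketch of this one-dimensional example (the data points are made up): after the mapping Φ(x) = (x, x^2), the two classes are separated by a horizontal line in the new feature space.

```python
import numpy as np

# Made-up 1-D data: class -1 sits between the two groups of class +1,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])

# Map to 2-D with phi(x) = (x, x^2); the classes are now separated by
# the horizontal line x2 = 2 in the new feature space.
phi = np.column_stack([x, x**2])
print(all(phi[y == +1, 1] > 2))   # True: positives lie above the line
print(all(phi[y == -1, 1] < 2))   # True: negatives lie below it
```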
Interlude: Illustration of a hyperplane
• A hyperplane refers to a D-dimensional plane in a (D+1)-dimensional
space.
• E.g. an SVM boundary line is a hyperplane in a 2D input space
(see figure below). Similarly, in a 3D input space, the SVM
boundary will be a 2-dimensional hyperplane, and so on.
• This terminology is especially useful for 4D and higher-dimensional
input spaces, for which visualizing the input space is not possible.
Input space vs Feature space
• Earlier, we referred to a space induced by features (feature vectors) as a
feature space.
• In this lecture on SVM, the space before mapping to the higher
dimension is referred to as the input space, and after the mapping, it is
called the feature space.
SVM: Non-Linear Case
An illustration of the algorithm in 2D input space:
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x→ φ(x)
Limitations of non-linear mapping
1. The new feature space (after mapping) may be very high dimensional,
and working in the high-dimensional space is computationally expensive.
• Remember, in order to compute w (and b), we need to compute the inner
products between the feature vectors of the training examples,
i.e. xi^T xj for all i, j.
• Computing these inner products in the higher-dimensional space increases the
computational complexity to a great extent.
Limitations of non-linear mapping
2. How do we know the mapping function, i.e. what will be the new set of
features in the higher-dimensional space? E.g. for a single feature x, the
additional features could be x^2, sqrt(x), sin(x), etc.
• Finding the correct mapping function is not easy.
Both of the above limitations are addressed in SVM classification by
applying a simple tweak – the Kernel Trick.
The Kernel Trick
The Kernel Trick – intuition
• To compute the inner products of vectors in the higher-
dimensional space, we can apply the kernel trick.
• Using this trick, we can compute the inner products (in the
higher dimension) without actually going into the higher
dimension.
• Remember, it is the inner products of the vectors that we
need in the higher dimension, not the vectors themselves.
• This trick is known as the kernel trick because it is applied with the
help of a kernel function.
SVM: Kernel Function
• A kernel function is defined as a function of the input space that
can be written as a dot product of two feature vectors in the
expanded feature space:
K(xi, xj) = φ(xi)^T φ(xj)
• We noted earlier that determining the hyperplane in linear
SVM involves dot products of input vectors, i.e. xi^T xj.
• Now, we only need to compute K(xi, xj) (which uses the
input space) and don't need to perform the computations
in the higher-dimensional feature space explicitly. This is
what is called the Kernel Trick.
The "Kernel Trick" – example
• Let us understand the kernel trick with the help of an example.
• Assume a non-linearly separable case, where the original input vectors
are 2-dimensional, i.e. x = [x1, x2].
• In order to apply SVM, the input space is transformed into a
(higher) 6-dimensional feature space according to the following
mapping function:
φ(x) = [1, x1^2, √2 x1x2, x2^2, √2 x1, √2 x2],
i.e. every data point x = [x1, x2] is mapped into the high-dimensional
space via the transformation Φ: x → φ(x).
The "Kernel Trick" – example (contd.)
• Now, determining the hyperplane in this higher dimension requires
computing the dot products of the feature vectors, i.e. φ(xi)^T φ(xj) for
all i, j. Therefore,
K(xi, xj) = φ(xi)^T φ(xj), where φ(x) = [1, x1^2, √2 x1x2, x2^2, √2 x1, √2 x2]
• Expanding the product:
K(xi, xj) = [1, xi1^2, √2 xi1xi2, xi2^2, √2 xi1, √2 xi2]^T [1, xj1^2, √2 xj1xj2, xj2^2, √2 xj1, √2 xj2]
= 1 + xi1^2 xj1^2 + 2 xi1xj1 xi2xj2 + xi2^2 xj2^2 + 2 xi1xj1 + 2 xi2xj2
= (1 + xi^T xj)^2
• where the final expression is a function of the original input space, i.e. xi
= [xi1, xi2], and not of the transformed higher-dimensional feature space.
The "Kernel Trick" – example (contd.)
• In other words, if we compute (1 + xi^T xj)^2 in the input space, this is
equivalent to computing the inner product of the vectors in the higher-
dimensional feature space, i.e.
K(xi, xj) = (1 + xi^T xj)^2 = φ(xi)^T φ(xj)
• Thus, a kernel function implicitly maps data to a high-dimensional
feature space (without the need to compute each φ(x) explicitly).
• In this case, the kernel function is K(xi, xj) = (1 + xi^T xj)^2.
• Note that choosing the kernel function also relieves us from the
problem of choosing the feature set in the higher dimension, i.e. we
don't need to engineer features in the higher dimension.
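A quick numerical check of this identity (the points xi and xj below are arbitrary made-up values): computing the degree-2 polynomial kernel in the input space gives the same number as the explicit dot product in the 6-dimensional feature space.

```python
import numpy as np

def phi(x):
    """Explicit 6-D feature map from the example above."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def kernel(xi, xj):
    """Degree-2 polynomial kernel, computed purely in the input space."""
    return (1.0 + np.dot(xi, xj)) ** 2

# Two arbitrary 2-D points.
xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])

lhs = kernel(xi, xj)            # input-space computation
rhs = np.dot(phi(xi), phi(xj))  # explicit feature-space computation
print(np.isclose(lhs, rhs))     # True: same number, no 6-D work needed
```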
The "Kernel Trick"
• The kernel function K(xi, xj) = (1 + xi^T xj)^2 is a special case of a more
general class of kernel functions, called polynomial kernels
(more about this on the next slide).
• This is not the only choice of kernel function.
• In fact, we can choose from different kernel functions, including the
polynomial kernel, the sigmoid kernel, the Gaussian kernel, etc.
Examples of Kernel Functions
Linear kernel:
K(xi, xj) = xi^T xj
Polynomial kernel of power p:
K(xi, xj) = (1 + xi^T xj)^p
Gaussian kernel (also called RBF kernel); can lift to an infinite-dimensional space:
K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))
Two-layer perceptron (sigmoid) kernel:
K(xi, xj) = tanh(β0 xi^T xj + β1)
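The formulas above translate directly into code; the sketch below is one possible transcription, where p, σ, β0 and β1 are free parameters chosen only for illustration:

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    # "Two-layer perceptron" kernel; beta0 and beta1 are free parameters.
    return np.tanh(beta0 * np.dot(xi, xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(xi, xj))
```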
SVM: Kernel Issues
How do we know which kernel to use?
This is a good question and actually still an open one; many
researchers have been working on this issue, but we still don't have a
firm answer. It is one of the weaknesses of SVM.
Generally, we have to test each kernel on the particular problem.
How can we verify that mapping to a higher dimension using a specific
kernel will take the data to a space in which it is linearly separable?
Even though mapping to a higher dimension increases the likelihood that
the data will be separable, we can't guarantee it.
SVM: Kernel Issues
We saw that the Gaussian (radial basis function) kernel lifts the data to an
infinite dimension, so our data is always separable in that space.
So why don't we always use this kernel?
First of all, we have to decide which σ to use in this kernel:
K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))
Secondly, a strong kernel, which lifts the data to an infinite
dimension, may sometimes lead us to the severe problem of
overfitting.
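As an illustrative sketch of this point (assuming scikit-learn is available; the toy dataset and parameter values are made up), an RBF-kernel SVM with a very large gamma, which corresponds to a very small σ, tends to fit the training data almost perfectly while doing worse on held-out data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy, non-linearly separable toy data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma plays the role of 1/(2*sigma^2): larger gamma = narrower Gaussian.
for gamma in (0.5, 500.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
# Typically, the huge gamma fits the training set almost perfectly
# but scores noticeably worse on the held-out test set (overfitting).
```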
SVM: Kernel Issues
o In addition to the above problems, another issue is that
sometimes the points are linearly separable but the margin is low.
All these problems lead us to a compromise solution, the Soft Margin:
a solution which can work even if our data is not perfectly linearly separable.
SVM: Kernel Issues
Soft Margin:
A soft margin in a Support Vector Machine (SVM) is a technique that allows the SVM to
classify data that is not linearly separable by permitting some mistakes (margin violations).
The goal is to keep the margin wide enough so that the other points can still be classified correctly.
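A possible sketch of the soft margin in practice (assuming scikit-learn; the data and C values are illustrative): the parameter C controls how heavily margin violations are penalized, so a small C gives a wider, softer margin and a large C a narrower, harder one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

# Smaller C tolerates more margin violations (softer, wider margin);
# larger C penalizes mistakes heavily (harder, narrower margin).
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width = {margin_width:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```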