Presented by: Derek Kane
 What is a Support Vector Machine?
 Support Vector Machine Applications
 Linear Classifier Separators
 Classification Margin
 Maximum Margin / Support Vectors
 Soft Margin
 Non Linear Support Vector Machines
 Feature Space / Holographic Projection
 Kernels / Kernel Trick
 Classification
 Practical Application Example – Breast Cancer
 SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained
increasing popularity in late 1990s.
 SVMs are currently among the best performers for a number of classification tasks
ranging from text to genomic data.
 An SVM is primarily used to classify a dichotomous (binary) response variable (0 or 1).
 They are incredibly sensitive to noise in the data. A relatively small number of
mislabeled examples can dramatically decrease the performance.
 SVMs can be applied to complex data types beyond feature vectors (e.g. graphs,
sequences, relational data) by designing kernel functions for such data.
 SVM techniques have been extended to a number of tasks such as regression [Vapnik et
al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
 Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually
done in a try-and-see manner.
Numerous real world applications:
 Hand-written character recognition.
 Image classification (facial recognition).
 Bioinformatics
 Protein classification
 Cancer classification
 Text (and hypertext) categorization.
 We can plot data in a 2-Dimensional feature space and find a separation between the different types of data. This space can be split using a linear separator called a hyperplane (a sketch of this idea follows the questions below).
 Is this a good split
between the classes?
 Which one of these hyperplanes creates a better separation?
 Is this a good split
between the classes?
 Or is this version
better?
 Which one of these hyperplanes creates the optimal separation?
 How do you know?
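 Below is a minimal R sketch (not part of the original deck) of the linear-separator idea: two well-separated 2-Dimensional clusters and a linear-kernel SVM fit with the e1071 package, whose plot method shades the two decision regions and marks the support vectors.

# Minimal sketch: a linear separator (hyperplane) between two 2-D clusters.
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
toy <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  class = factor(rep(c("A", "B"), each = 20)))

# Fit a linear-kernel SVM; the fitted hyperplane is the linear separator.
fit <- svm(class ~ ., data = toy, kernel = "linear", cost = 1)

# Shades the two decision regions and marks the support vectors with crosses.
plot(fit, toy)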
 Distance from example xᵢ to the separator (the hyperplane wᵀx + b = 0) is r = (wᵀxᵢ + b) / ‖w‖.
 Examples closest to the hyperplane are support vectors.
 Margin ρ of the separator is the distance between the support vectors of the two classes.
 An SVM relies on the principle that the optimal linear separator is the one that maximizes the margin.
 Maximizing the margin is good
according to intuition and PAC
theory.
 Implies that only support vectors
matter; other training examples are
ignorable.
 Most “important” training points are support vectors; they define the hyperplane (see the sketch below).
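 A minimal sketch, continuing the toy model above: the fitted e1071 object exposes exactly which training points became support vectors.

fit$index    # row numbers of the training points that are support vectors
fit$SV       # the (scaled) support vectors themselves
fit$nSV      # how many support vectors each class contributes
# Every other training row could be dropped without changing the fitted hyperplane.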
 What if the training set is noisy?
 Solution 1: Utilize a soft margin
calculation.
 Solution 2: Use powerful kernels.
 What if the training set is not linearly
separable?
 Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
 Overfitting can be controlled by the soft-margin approach (see the cost sketch below).
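 In e1071 the cost parameter plays the role of the soft-margin penalty. A minimal sketch, continuing the toy data with two labels deliberately flipped: a small cost tolerates more slack and typically keeps more support vectors than a large cost.

noisy <- toy
noisy$class[c(1, 21)] <- rev(noisy$class[c(1, 21)])   # mislabel one point per class

soft <- svm(class ~ ., data = noisy, kernel = "linear", cost = 0.1)   # wide, soft margin
hard <- svm(class ~ ., data = noisy, kernel = "linear", cost = 100)   # narrow, nearly hard margin

c(soft = soft$tot.nSV, hard = hard$tot.nSV)   # softer margin typically uses more support vectors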
 Rather than fitting nonlinear curves to the data, an SVM handles this by using a kernel function to map the data into a different space where a hyperplane can be used to do the separation.
 This method is prone to overfitting and should be used sparingly.
 Datasets that are linearly separable with some noise work out great:
 But what are we going to do if the dataset is just too hard?
 How about… mapping data to a higher-dimensional space:
Φ: x → φ(x)
 The original feature space of a non-linear SVM can be mapped to some higher-dimensional feature space where the training set is separable, utilizing a kernel function.
(Figure: a 2-Dimensional input space mapped to a 3-Dimensional feature space; a sketch of such a mapping follows.)
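 A minimal sketch of such a 2-Dimensional to 3-Dimensional mapping (the function names are illustrative, not from the deck): points inside a circle versus points on an outer ring cannot be split by a line in (x1, x2), but the quadratic map φ(x1, x2) = (x1², √2·x1·x2, x2²) makes them separable by a plane.

phi <- function(x1, x2) cbind(z1 = x1^2, z2 = sqrt(2) * x1 * x2, z3 = x2^2)

set.seed(2)
theta  <- runif(60, 0, 2 * pi)
radius <- c(runif(30, 0, 1), runif(30, 2, 3))          # inner disc vs. outer ring
x1 <- radius * cos(theta); x2 <- radius * sin(theta)
label <- factor(rep(c("inner", "outer"), each = 30))

z <- phi(x1, x2)
# In the mapped space z1 + z3 equals radius^2, so the plane z1 + z3 = 2.25
# separates the classes perfectly even though no line in (x1, x2) does.
table(label, beyond_plane = (z[, "z1"] + z[, "z3"]) > 2.25)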
 Fitting hyperplanes as separators is mathematically easy, and we can draw on the kernel function to handle the non-linear mapping.
 By replacing the raw input variables
with a much larger set of features we
get a nice property:
 A planar separator in the high-
dimensional space of feature vectors is
a curved separator in the low
dimensional space of the raw input
variables.
 Ex: A planar separator in a 20-Dimensional
feature space projected back to the original
2-Dimensional space
 The concept of a kernel mapping function is very powerful. It allows SVM models to
perform separations even with very complex boundaries such as shown below.
 If we map the input vectors into a
very high-dimensional feature space,
the task of finding the maximum-
margin separator becomes
computationally intractable.
 The mathematics is all linear, which is
good, but the vectors have a huge
number of components. So taking
the scalar product of two vectors is
very expensive.
 The way to keep things tractable is to
use “the kernel trick”.
Mercer’s theorem:
 Every positive semi-definite symmetric function is a kernel.
 For many mappings from a low-
dimensional space to a high-
dimensional space, there is a simple
operation on two vectors in the low-
dimensional space that can be used to
compute the scalar product of their
two images in the high-dimensional
space.
 The kernel computes that scalar product directly in the low-dimensional space:
K(x_a, x_b) = φ(x_a) · φ(x_b)
 Letting the kernel do the work replaces doing the scalar products in the obvious way (a numeric check follows below).
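 A minimal numeric check of this identity, reusing the quadratic map φ from the earlier sketch: the cheap low-dimensional operation (a · b)² equals the scalar product of the two mapped images, so φ never has to be computed explicitly.

a <- c(1.5, -0.7); b <- c(0.3, 2.0)

kernel_value <- sum(a * b)^2                              # work entirely in 2-D
mapped_value <- sum(phi(a[1], a[2]) * phi(b[1], b[2]))    # explicit trip through 3-D

all.equal(kernel_value, mapped_value)                     # TRUE (up to rounding)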
 All of the computations that we need to do to find the maximum-margin separator can
be expressed in terms of scalar products between pairs of datapoints (in the high-
dimensional feature space).
 These scalar products are the only part of the computation that depends on the
dimensionality of the high-dimensional space.
 So if we had a fast way to do the scalar products we would not have to pay a price for
solving the learning problem in the high-dimensional space.
 The kernel trick is just a magic way of doing scalar products a whole lot faster than is
usually possible.
 It relies on choosing a way of mapping to the high-dimensional feature space that
allows fast scalar products.
 There may be an infinite number of kernels which one can employ. Here are some of
the more common kernels:
 Polynomial: K(x, y) = (x · y + 1)^p
 Gaussian Radial Basis Function: K(x, y) = exp(−‖x − y‖² / 2σ²)
 Neural Network*: K(x, y) = tanh(k x · y − δ)
* For the neural network kernel, there is one “hidden unit” per support vector, so the process of fitting the maximum margin hyperplane decides how many hidden units to use. Also, it may violate Mercer’s condition.
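 As a minimal sketch (not from the deck), here are the three kernels above written as plain R functions; p, σ, k and δ are user-chosen tuning parameters.

polynomial_kernel <- function(x, y, p = 2)            (sum(x * y) + 1)^p
rbf_kernel        <- function(x, y, sigma = 1)        exp(-sum((x - y)^2) / (2 * sigma^2))
neural_net_kernel <- function(x, y, k = 1, delta = 0) tanh(k * sum(x * y) - delta)

polynomial_kernel(c(1, 2), c(3, 4), p = 3)   # (1*3 + 2*4 + 1)^3 = 1728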
 The final classification rule is elegantly simple:
 All the cleverness goes into selecting the support vectors that maximize the margin and
computing the weight to use on each support vector.
 We also need to choose a good kernel function and we may need to choose a lambda
for dealing with non-separable cases.
Classify a test case x_test as positive when
bias + Σ_{s ∈ SVs} w_s K(x_s, x_test) > 0,
where SVs denotes the set of support vectors.
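 A minimal sketch of this rule using e1071 and the toy data from earlier (refit without scaling so the stored support vectors stay on the original scale): summing weight × kernel over the support vectors and subtracting the stored offset reproduces the package's own decision value.

fit_rbf <- svm(class ~ ., data = toy, kernel = "radial", gamma = 0.5, cost = 1,
               scale = FALSE)

rbf    <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
x_test <- c(x1 = 1.5, x2 = 1.5)

# bias + sum over support vectors of w_s * K(x_s, x_test);
# e1071 stores the weights in $coefs and the negative of the bias in $rho.
decision <- sum(fit_rbf$coefs *
                apply(fit_rbf$SV, 1, rbf, v = x_test, gamma = 0.5)) - fit_rbf$rho
decision

# The sign of the decision value gives the class; it matches the package's own value:
attr(predict(fit_rbf, data.frame(x1 = 1.5, x2 = 1.5),
             decision.values = TRUE), "decision.values")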
 Support Vector Machines work very well in practice.
 The user must choose the kernel function and its parameters, but the rest is automatic.
 The test performance is very good.
 They can be expensive in time and space for big datasets.
 The computation of the maximum-margin hyperplane depends on the square of the
number of training cases.
 We need to store all the support vectors.
 SVMs are very good if you have no idea about what structure to impose on the task.
 The kernel trick can also be used to do PCA in a much higher-dimensional space, thus
giving a non-linear version of PCA in the original space.
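 A minimal sketch of the kernel-PCA remark, assuming the kernlab package (not otherwise used in the deck): kpca() performs PCA in the feature space induced by an RBF kernel.

library(kernlab)

kp <- kpca(~ x1 + x2, data = toy, kernel = "rbfdot",
           kpar = list(sigma = 0.5), features = 2)
head(rotated(kp))   # toy points projected onto the first two kernel principal components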
 The dataset contains information related to
females with tumors of different characteristics
and whether or not the tumor was benign or
malignant.
 The dataset contains 699 individuals, of which 458 (65.5%) were benign and 241 (34.5%) were
malignant.
 Our goal is to devise:
 A Support Vector Machine to classify cases as benign or malignant based upon their tumor characteristics.
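 The deck does not name its exact data source; the sketches below assume the UCI Wisconsin Breast Cancer data shipped with the mlbench package, which has the same structure (nine cell-measurement variables plus a benign/malignant class).

library(mlbench)
data(BreastCancer)

bc <- BreastCancer[complete.cases(BreastCancer), ]   # drop rows with missing values
bc$Id <- NULL                                        # keep the nine measurements plus Class
bc[, 1:9] <- lapply(bc[, 1:9], function(col) as.numeric(as.character(col)))

table(bc$Class)   # benign vs. malignant counts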
 First, let's run a tuning function to identify the best parameters to use for the SVM
model.
 The process runs a 10-fold cross-validation methodology and identified the following:
 Gamma = 0.01
 Cost = 1
 The e1071 package in R uses a radial basis function kernel as its default.
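 A minimal sketch of the tuning step over the bc data frame built above: tune.svm() in e1071 uses 10-fold cross-validation by default; the exact parameter grid below is illustrative.

library(e1071)

set.seed(3)
tuned <- tune.svm(Class ~ ., data = bc,
                  gamma = 10^(-3:0), cost = 10^(-1:2))
tuned$best.parameters   # the deck's run reported gamma = 0.01 and cost = 1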
 We created a random test sample of the dataset and included only the measurement variables.
This data was not involved in the initial training of the Support Vector Machine but will be used to
validate the results from passing the data through the SVM.
Prediction Accuracy:
97.1%
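 A minimal sketch of the validation step under the same assumptions: hold out a random test sample, train on the rest with the tuned parameters, and compare predictions with the known classes.

set.seed(4)
test_rows <- sample(nrow(bc), size = round(0.3 * nrow(bc)))
train <- bc[-test_rows, ]
test  <- bc[test_rows, ]

model <- svm(Class ~ ., data = train, gamma = 0.01, cost = 1)
pred  <- predict(model, newdata = test[, 1:9])

table(predicted = pred, actual = test$Class)   # confusion matrix
mean(pred == test$Class)                       # prediction accuracy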
 Reside in Wayne, Illinois
 Active Semi-Professional Classical Musician
(Bassoon).
 Married my wife on 10/10/10 and we have been
together for 10 years.
 Pet Yorkshire Terrier / Toy Poodle named
Brunzie.
 Pet Maine Coons named Maximus Power and
Nemesis Gul du Cat.
 Enjoy Cooking, Hiking, Cycling, Kayaking, and
Astronomy.
 Self-proclaimed Data Nerd and Technology
Lover.