Multiclass Classification
Sergey Ivanov
Plan
1. Presentation on Multiclass Classification
a. Error Rates and the Bayes Classifier
b. Gaussian and Linear Classifiers. Linear Discriminant Analysis.
Logistic Regression;
c. Multi-class classification models and methods;
d. Multi-class strategies: one-versus-all, one-versus-one, error-correcting output codes
2. Linear Classifiers and Multi-classification Tutorial
3. In-class exercise
1. Multilabel Classification format
2. Classifier Comparison
3. LDA as dimensionality reduction
4. LDA vs PCA
5. Logistic Regression for 3 classes
6. Linear models
7. LDA and QDA
8. Naive Regression
9. Cross Validation in Python
References
Naive Bayes
1. Gaussian NB
2. Bernoulli NB
Naive Bayes
Pros:
1. Fast
2. Mitigates the curse of dimensionality (each feature's distribution is estimated independently)
3. Decent classifier for several tasks (e.g. text classification)
4. Inherently multiclass
Cons:
1. Poor estimator of class probabilities.
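A minimal sketch of Gaussian and Bernoulli Naive Bayes in scikit-learn; the digits dataset and the binarization threshold are chosen here purely for illustration:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Toy multiclass problem: 10 digit classes, 64 pixel features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB models each feature as a per-class Gaussian;
# BernoulliNB expects binary features, so pixels are thresholded.
gnb = GaussianNB().fit(X_train, y_train)
bnb = BernoulliNB(binarize=8.0).fit(X_train, y_train)

print("GaussianNB accuracy:", gnb.score(X_test, y_test))
print("BernoulliNB accuracy:", bnb.score(X_test, y_test))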
Linear/Quadratic Discriminant Analysis (LDA/QDA)
● LDA = each class has the same covariance matrix, equal to the average of the per-class covariances
● QDA = each class has its own covariance matrix
Linear/Quadratic Discriminant Analysis (LDA/QDA)
Pros:
1. Closed-form solution
2. Inherently multiclass
3. No hyperparameter tuning
4. Can be used for dimensionality reduction
Cons:
1. Assumes a unimodal Gaussian distribution for each class
2. As a dimensionality reducer, cannot project to more than (number of classes − 1) dimensions
3. Not useful if the discriminative “information” lies in the variance of the data rather than in the class means.
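A minimal sketch of LDA and QDA as classifiers, and of LDA as a supervised dimensionality reducer; the iris dataset is used only for illustration:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance -> linear boundaries
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance -> quadratic boundaries
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))

# LDA as dimensionality reduction: at most n_classes - 1 = 2 components here.
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_2d.shape)  # (150, 2)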
Stochastic Gradient Descent (SGD)
Loss functions L:
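For reference, scikit-learn's SGDClassifier minimizes a regularized training error of the form

E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)

with common choices of the loss L being the hinge loss \max(0, 1 - y_i f(x_i)) (soft-margin SVM), the log loss \log(1 + e^{-y_i f(x_i)}) (logistic regression), the modified Huber loss, and the squared loss.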
Stochastic Gradient Descent (SGD)
Regularization Term R:
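The standard regularizers in scikit-learn are

R(w) = \frac{1}{2} \sum_j w_j^2 \quad (\text{L2}), \qquad R(w) = \sum_j |w_j| \quad (\text{L1}),

plus Elastic Net, a convex combination of the two controlled by the l1_ratio parameter.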
Stochastic Gradient Descent (SGD)
Practical Tips:
● Scale the data so that each feature has zero mean and unit variance, e.g. with StandardScaler in scikit-learn.
● Empirically, a reasonable first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the number of training samples.
● Averaged SGD works best with a large number of features.
● If the data were transformed with PCA, it is often wise to scale the training data by a constant c such that the average L2 norm of the samples equals 1.
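A minimal sketch of these tips with SGDClassifier (max_iter plays the role of the older n_iter parameter; the dataset and hyperparameters are illustrative only):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
n = X.shape[0]  # number of training samples

# Standardize features, then fit a linear SVM (hinge loss) with SGD.
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                  max_iter=int(np.ceil(10**6 / n)), random_state=0),
)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))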
Stochastic Gradient Descent (SGD)
Pros:
1. Fast
2. Easy to implement
3. Sound theoretical results
Cons:
1. Requires hyperparameter tuning
2. Sensitive to feature scaling
3. Not inherently multiclass (multiclass is handled via a one-vs-rest scheme)
Multilabel and Multiclass classification
● Multiclass: classification with more than 2 classes. For example, classifying digits.
● Multilabel: assigning a set of labels to each sample. For example, assigning topics to an article.
● Multioutput-multiclass: a fixed number of output variables, each of which can take on an arbitrary number of values. For example, predicting both a fruit and its color, where the color can be any value from {‘blue’, ‘orange’, ‘green’, ‘white’, ...}.
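A minimal sketch of the multilabel target format; the topic names below are made up for illustration:

from sklearn.preprocessing import MultiLabelBinarizer

# Each sample has a *set* of labels rather than a single class.
y = [["politics", "economy"], ["sports"], ["economy", "tech", "sports"]]

# Multilabel targets are usually encoded as a binary indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)
print(mlb.classes_)  # ['economy' 'politics' 'sports' 'tech']
print(Y)
# [[1 1 0 0]
#  [0 0 1 0]
#  [1 0 1 1]]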
Multilabel and Multiclass classification
● Inherently multiclass: Naive Bayes, LDA/QDA, Decision Trees, Random Forests, kNN
● One-vs-Rest
● One-vs-One
● Error-Correcting Output Codes
One-vs-Rest (OVR)
Training: fits one binary classifier per class, treating all other classes' data as negatives. In total, K classifiers.
Prediction: applies all K classifiers to a new data point and selects the class whose classifier predicts positive. In case of ties, selects the class with the highest confidence score.
Pros:
● Efficient (only K classifiers are trained)
● Interpretable (each class is represented by exactly one classifier)
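A minimal sketch of one-vs-rest with OneVsRestClassifier; LinearSVC is just one possible base estimator:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

# One binary LinearSVC per class, each trained as "this class vs. the rest".
ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
print(len(ovr.estimators_))  # 3 underlying binary classifiers
print(ovr.predict(X[:5]))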
One-vs-One (OVO)
Training: fits one binary classifier for each pair of classes. In total, K*(K-1)/2 classifiers.
Prediction: applies all K*(K-1)/2 classifiers to a new data point and selects the class that receives the majority of votes (“+1”). In case of ties, selects the class with the highest aggregate confidence.
Pros:
● Well suited to kernel algorithms (e.g. SVM) that scale poorly with the number of samples, since each binary problem uses only the data of two classes.
Cons:
● Not as fast as OVR (more classifiers to train and evaluate)
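A minimal sketch of one-vs-one with OneVsOneClassifier (note that scikit-learn's SVC already uses a one-vs-one scheme internally; the explicit wrapper is shown only to illustrate the strategy):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

# One SVC per pair of classes: K*(K-1)/2 = 3 classifiers here.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovo.estimators_))  # 3
print(ovo.predict(X[:5]))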
Error-Correcting Output Codes (ECOC)
Training: 1) Assign each class a binary codeword of length c. 2) Learn a separate binary classifier for each position of the codeword. In total, c classifiers.
Prediction: apply the c classifiers to a new data point to obtain a predicted codeword, and select the class whose codeword is closest in Hamming distance.
Error-Correcting Output Codes (ECOC)
How to obtain codewords?
1) Row separation: codewords of different classes should be far apart in Hamming distance, so that individual classifier errors can be corrected.
2) Column separation: the binary problems defined by different codeword positions should be uncorrelated with each other.
Pros:
● Can be more accurate than OVR
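A minimal sketch of ECOC with OutputCodeClassifier; code_size and the base estimator are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# code_size sets the codeword length as a fraction of the number of classes;
# values > 1 add the redundancy that makes error correction possible.
ecoc = OutputCodeClassifier(LinearSVC(random_state=0),
                            code_size=2, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))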

Linear models and multiclass classification

References
● Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
● LDA/QDA: http://scikit-learn.org/stable/modules/lda_qda.html
● SGD, mathematical formulation: http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation
● SGD, practical tips: http://scikit-learn.org/stable/modules/sgd.html#tips-on-practical-use
● Multiclass and multilabel algorithms: http://scikit-learn.org/stable/modules/multiclass.html#multiclass-and-multilabel-algorithms
● One-vs-Rest: http://scikit-learn.org/stable/modules/multiclass.html#one-vs-the-rest
● One-vs-One: http://scikit-learn.org/stable/modules/multiclass.html#one-vs-one
● Error-Correcting Output Codes (Dietterich & Bakiri, JAIR): http://www.jair.org/media/105/live-105-1426-jair.pdf
● OutputCodeClassifier: http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OutputCodeClassifier.html#sklearn.multiclass.OutputCodeClassifier