This document is licensed with a Creative Commons Attribution 4.0 International License ©2017
Machine Learning for Cyber
Unit: Data Sets and Features
Learning Outcomes
Upon completion of this unit:
• Students will have a better understanding of features for machine
learning.
• Students will have a better understanding of how to extract features.
Data sets
• Data sets are central to machine learning
• The algorithms are data hungry
• Which is better to have?
• The most powerful machine learning algorithm and a small amount of poor data
• Or
• Lots of good data, even if our algorithm is not the best
• Data is king!
• Why do companies like Facebook, Google, etc. do so well?
• Because everyone on social media feeds them new data all the time, and even annotates it for them by labeling it with likes, etc.
Difference
• What is the difference between a regular data science class and a
data science class for cyber security?
• The data
• Everything else is the same. The machine learning algorithms do not
really change.
• What changes is how the data is represented or converted to
features, samples, classes, etc.
• In this module, we will explore data sets.
Characteristics of data sets
• When thinking of a data set, think in terms of a matrix
• Rows => samples
• Columns => features
Matrix_data = np.loadtxt(dataset, delimiter=",", skiprows=1)  # dataset is the path to a CSV file
X = Matrix_data[:, :4]   # feature columns
y = Matrix_data[:, 4]    # label column
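• To make the snippet above runnable end to end, here is a self-contained sketch; the file name iris.csv is only a placeholder for any CSV with a header row, four feature columns, and a final label column:

import numpy as np

dataset = "iris.csv"  # placeholder file name, not a file provided with this unit

Matrix_data = np.loadtxt(dataset, delimiter=",", skiprows=1)
X = Matrix_data[:, :4]   # rows = samples, columns = features
y = Matrix_data[:, 4]    # one class label per sample

print(X.shape)   # e.g. (150, 4): 150 samples, 4 features
print(y.shape)   # e.g. (150,)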
Iris
• The Iris dataset is a classic multivariate data set. It contains 150
samples in total. There are 3 class labels in this data set: Setosa,
Versicolour, and Virginica. Each sample is described by 4 features
used for prediction.
X1 X2 X3 X4 y
[[5.1 3.5 1.4 0.2 0. ]
[4.9 3. 1.4 0.2 0. ]
…
[5.9 3. 5.1 1.8 2. ]]
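• As a minimal sketch (not from the original slides), the same data can also be loaded with the sklearn datasets library used later in this unit:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data      # 150 x 4 feature matrix (X1..X4 above)
y = iris.target    # 150 labels: 0 = Setosa, 1 = Versicolour, 2 = Virginica
print(X.shape, y.shape)   # (150, 4) (150,)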
Text Data set
• Phishing Websites Data Set:
• Phishing Dataset
Images data set
Speech
NSL-KDD
• Network intrusion
UNSW big data
• Network
Phishing
Honeypot unsupervised
Malware
Fraud Detection
Biometrics
What are features?
• Ways to represent a sample
• Vector space model
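• As a small illustrative sketch (an addition, not from the original slides), the vector space model turns each text sample into a vector of word counts; scikit-learn's CountVectorizer does this directly:

from sklearn.feature_extraction.text import CountVectorizer

# two toy samples, e.g. email subject lines (purely illustrative)
samples = ["verify your account now", "meeting notes for monday"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(samples)      # each row is one sample's feature vector
print(vectorizer.get_feature_names_out())  # vocabulary = feature names (get_feature_names() in older versions)
print(X.toarray())                         # rows = samples, columns = word counts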
Types of features
• Binary
• Continuous
Dimensionality
• Number of features determines dimensionality
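• A toy example (not from the original slides) with one binary and two continuous features; the number of columns is the dimensionality:

import numpy as np

# column 0: binary (e.g., "uses HTTPS"); columns 1-2: continuous (e.g., URL length, entropy)
X = np.array([[1, 54.0, 3.2],
              [0, 112.0, 4.7],
              [1, 33.0, 2.9]])

n_samples, n_features = X.shape
print(n_features)   # 3 features, so each sample is a point in 3-dimensional space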
• Machine Learning (ML) is essential for automated systems to make
decisions and to infer new knowledge about the world.
• Machine learning approaches can be divided into
• supervised learning (such as Support Vector Machines)
• unsupervised learning (such as K-means clustering).
Within supervised approaches
• Within supervised approaches, the learning methodologies can be
divided based on whether they
• predict a class (classifiers)
• or a magnitude (regression models)
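• A minimal sketch of the distinction using scikit-learn (the choice of models and of predicting petal width is purely illustrative, not from the original slides):

from sklearn import datasets
from sklearn.svm import SVC, SVR

X, y = datasets.load_iris(return_X_y=True)

clf = SVC().fit(X, y)               # classifier: predicts a discrete class
print(clf.predict(X[:1]))           # e.g. [0]

reg = SVR().fit(X[:, :3], X[:, 3])  # regressor: predicts a magnitude (here, petal width)
print(reg.predict(X[:1, :3]))       # a real-valued prediction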
• An additional categorization for these methods depends on whether
they use
• sequential
• or non-sequential data.
• Support Vector Machines: Supervised learning approach that optimizes the margin that separates data. Pros: SLT confidence characteristic (expected risk). Cons: Class imbalance issues.
• Decision Trees: Performs classification by constructing trees where branches are separated by decision points. Pros: Easy to understand. Cons: Not flexible.
• Neural Networks: Model represents the structure of the human brain with neurons and links to the neurons. Pros: Versatile. Cons: Can obscure the underlying structure of the model.
• K-means clustering: Unsupervised method that forms k clusters to minimize distance between centroids and members of each cluster. Pros: Unsupervised, so no training needed. Cons: Needs clearly defined separations in the data in order to be effective.
• Linear Discriminant Analysis (LDA): Creates a linear function of the features to classify data. Pros: Simple yet robust classification method. Cons: Normality assumptions on the classes.
• Naïve Bayes: Probabilistic learning that calculates the probability of seeing a certain condition in the world by selecting the most probable class given the feature vector. Pros: Fast; easy to understand the model. Cons: Bayes assumptions of independence.
• Maximum Likelihood Estimation (MLE): Calculates the likelihood that an object will be seen based on its proportion in the sample data. Pros: Simple. Cons: Too simplistic for some applications.
• Hidden Markov Models (HMM): A Markov chain is a weighted automaton consisting of nodes and arcs, where the nodes represent states and the arcs represent the probability of going from one state to another. Pros: Probabilistic; good for sequence mining. Cons: Combinatorial complexity; needs prior knowledge.
A few ML algorithms
• Classifiers are machine learning approaches that produce as an
output a specific class given some input features.
• Important classifiers include:
• Support Vector Machines (Burges 1998) commonly implemented using
LibSVM (Chang and Lin, 2001)
• Naïve Bayes
• artificial neural networks
• deep learning based neural networks
• decision trees
• random forests
• k-nearest neighbor classifier
Deep learning
• Deep learning based methods are simply neural nets with more
layers.
• Deep learning methods have made a big impact in the field of
machine learning in recent years.
• Given enough computational power, they can automatically learn the
optimal features to be used in a classification problem.
Feature Engineering
• In the past, deciding which features to use required humans to
engineer the features by hand.
• This issue has now been alleviated somewhat by deep learning.
• This has revolutionized the industry.
• Additionally, artificial neural networks are classifiers that can handle
non-linearly separable data.
• In theory, this capability allows them to model data that may be
more difficult to classify.
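• A small illustrative sketch (an addition, not the author's code) of a neural network handling the classic non-linearly separable XOR problem with scikit-learn:

from sklearn.neural_network import MLPClassifier

# XOR: no straight line can separate the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs", max_iter=1000, random_state=1)
clf.fit(X, y)
print(clf.predict(X))   # typically recovers [0 1 1 0]; a different seed may be needed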
Libraries
• import numpy as np
• from sklearn import datasets
• from sklearn.model_selection import train_test_split  ## sklearn.cross_validation in older scikit-learn versions
• from sklearn.preprocessing import StandardScaler
• from sklearn.metrics import accuracy_score
• from matplotlib.colors import ListedColormap
• import matplotlib.pyplot as plt
• from sklearn.metrics import confusion_matrix
• from sklearn.metrics import precision_score
• from sklearn.metrics import recall_score, f1_score
• import pandas as pd
• from sklearn.preprocessing import LabelEncoder
• from sklearn import decomposition
SKlearn library
• The sklearn (scikit-learn) library is the main library; it contains most of the
traditional machine learning tools we will use.
• The numpy library is essential for efficient matrix and linear algebra
operations.
• For those with experience with MATLAB, numpy is a way of performing linear
algebra operations in Python similar to how they are done in MATLAB.
• This makes the code more efficient in its implementation and faster as well.
• The datasets library helps to obtain standard corpora. You can use it
to obtain annotated data like Fisher’s iris data set, for instance.
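• For example (a small sketch added here for illustration), MATLAB-style matrix operations in numpy look like this:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 0.0])

print(A @ b)             # matrix-vector product
print(np.linalg.inv(A))  # matrix inverse
print(A.T)               # transpose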
• From sklearn.model_selection (sklearn.cross_validation in older scikit-learn
versions) we can import train_test_split, which is used to create splits in a
data matrix, such as 70% for training purposes and 30% for testing purposes.
• From sklearn.preprocessing we can import the StandardScaler
module, which helps to scale feature data.
• We will use functions such as these to scale our data for the
TensorFlow-based classifiers.
• Deep learning algorithms can improve significantly when data is
properly scaled, so it is recommended to do this.
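• A minimal sketch (using the iris data as a stand-in) of splitting a data set and scaling it with StandardScaler:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_normalized = scaler.transform(X_test)        # reuse the same scaling on the test data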
• Two more very important libraries are matplotlib.pyplot and pandas.
• The matplotlib.pyplot library is very useful for visualization of data
and results and the pandas library is very useful for pre-processing.
• The pandas library can be very useful for pre-processing large data sets
in fast and efficient ways.
• There are some parameters that are sometimes useful to set in your
code.
• The code sample below shows the use of np.set_printoptions.
• The function is used to print all values in a numpy array.
• This can be useful when trying to visualize the contents of a large
data set.
• ## set parameters
• np.set_printoptions(threshold=np.inf)  ## print all values in a numpy array
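• As a small sketch of typical pandas pre-processing (the file name phishing.csv and the column name label are placeholders, not files provided with this unit):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("phishing.csv")               # placeholder CSV with feature columns plus a text "label" column
df = df.dropna()                               # drop rows with missing values
y = LabelEncoder().fit_transform(df["label"])  # map text labels to integers
X = df.drop(columns=["label"]).values          # remaining columns form the feature matrix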
Splitting the data
• Let us assume that our data is stored in the matrix X.
• The code segment below uses the function train_test_split.
• This function is used to split a data set (in this case X) into 4 sets
which are X_train, X_test, y_train, y_test.
• These are the 4 sets that will be used by the traditional classifiers or
the deep learning classifiers.
• The sets that start with X hold the data (feature vectors) and the sets
that start with y hold the labels per sample (e.g. y1 for the first
feature vector, y2 for the second feature vector, and so on).
Split size
• The values test_size=0.01 and random_state=42 in the function are
parameters that define the split.
• The value 0.01 makes a train set that has 99% of all samples, while
the test set has 1% of all samples.
• In contrast, test_size=0.20 would mean an 80%/20% split.
• random_state=42 allows you to always get the same random
split since the seed is defined as 42.
• #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=48)
• ## k-folds cross validation: all data goes in the train sets (hence 0.01)
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
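• A quick check of the resulting split sizes (assuming the 150-sample iris data from earlier):

print(X_train.shape, X_test.shape)   # with test_size=0.01 on 150 samples: (148, 4) (2, 4)
print(y_train.shape, y_test.shape)   # (148,) (2,)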
• To call all the functions or classifiers you can employ the following
approach.
• Here we have defined 5 common classifiers.
• Notice that each one gets the 4 data sets obtained from the
percentage split.
• Notice also that some of the variable names have _normalized added to
them.
• This is a good, standard convention used by programmers to indicate
that the data has been scaled.
• The next chapter addresses scaling. Here you run X_train through a
scaler function to obtain X_train_normalized.
• The labels (y) are not scaled.
• #######################################
• ## ML_MAIN()
• #logistic_regression_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• #svm_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• #random_forest_rc(X_train, y_train, X_test, y_test)
• #knn_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• multilayer_perceptron_rc(X_train_normalized, y_train, X_test_normalized, y_test)
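• The *_rc wrapper functions themselves are not listed in this unit; as a rough, hypothetical sketch of what one of them might look like (the body below is an assumption, not the author's actual implementation):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_rc(X_train, y_train, X_test, y_test):
    # hypothetical wrapper: train a k-nearest neighbor classifier and report test accuracy
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("knn accuracy:", accuracy_score(y_test, y_pred))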
Optimization
• Before we begin to discuss some of the machine learning algorithms,
I should say something about optimization.
• Optimization is a key process in machine learning.
• Basically, any supervised learning algorithm needs to learn a
prediction equation given a set of annotated data.
• This prediction function usually has a set of parameters that must be
learned.
• However, the question is “how do you learn these parameters?”
• The answer is that you do so through optimization.
• In its simplest form, optimization consists of trying a set of
parameters with your model and seeing what result they give you.
• If the result is not good, the optimization algorithm needs to decide if
you should decrease the values of the parameters or increase the
values of the parameters.
• In general, you do this in a loop (increasing and decreasing) until you
find an optimal set of parameters.
• But one of the questions to answer here is: do the values go up or
down?
• Well, as it turns out, there are methodologies based on calculus that help you
to make this decision.
Optimization graph
• The above graph represents an optimization problem.
• The y axis represents the cost (or penalty) of using a given parameter.
• The x axis represents the value of the parameter (w) being used at
the given iteration.
• The curve represents the behavior that the function being used to
minimize the cost will follow for every value of the parameter w.
• As shown in the graph, the optimal value for the curve is found
where the star is located (i.e., where the value of the cost is at a
minimum).
• So, somehow the optimization algorithm needs to travel through the
function and arrive at the position indicated by the star.
• At that point, the value of “w” reduces the cost and finds the best
solution.
• Instead of trying all values of “w” at random, the algorithm can make
educated guesses about which direction to follow (up or down).
• To do this, we can use calculus to calculate the derivative of the
function at a given point.
• This will allow us to determine the slope at that point.
• In the case of the graph, this represents the tangent line to the curve
if we calculate the derivative at point w.
• If we calculate the slope at the position of the star symbol, then the
slope is zero because the tangent at that point is parallel to the x axis.
• The slope at the point “w” will be positive.
• Based on this result, we can tell the direction we want to take for
parameter w (decrease or increase).
• This type of optimization is called gradient descent and is very
important in machine learning and deep learning.
• There are several approaches to implement gradient descent and this
is just the simplest explanation for conceptual purposes.
old_x = 0
new_x = 4
step_size = 0.01
precision = 0.00001

# derivative of f(x) = x**3 - 3*x**2 + 7
def function_derivative(x):
    return 3*x**2 - 6*x

# simple gradient descent on the parameter x
while abs(new_x - old_x) > precision:
    old_x = new_x
    new_x = old_x - step_size * function_derivative(old_x)

print("result is:", new_x)
• In the previous code example we assume a function of
• f(x) = x^3 - 3x^2 + 7
• that needs to be optimized for parameter x. We will need the value of
the derivative for each point x. The derivative of f(x) is:
• f'(x) = 3x^2 - 6x
• So, the parameter x can be calculated in a loop using the derivative
function which will determine the direction to follow when increasing
or decreasing the parameter x.
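• As a quick check (an addition, not part of the original slides), setting the derivative to zero confirms where the loop should converge:
f'(x) = 3x^2 - 6x = 3x(x - 2) = 0, so x = 0 or x = 2
f''(x) = 6x - 6 is positive at x = 2, so x = 2 is the local minimum, with f(2) = 8 - 12 + 7 = 3
• Starting the code above at new_x = 4 therefore drives the result toward x = 2.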
Summary
• Data sets and features