This document is licensed with a Creative Commons Attribution 4.0 International License ©2017
Machine Learning for Cyber
Unit: Data Sets and Features
Learning Outcomes
Upon completion of this unit:
• Students will have a better understanding of features for machine
learning.
• Students will have a better understanding of how to extract features.
Data sets
• Data sets are central to machine learning
• The algorithms are data hungry
• Which is better to have?
• The most powerful machine learning algorithm and a small amount of poor data
• Or
• Lots of good data, even if our algorithm is not the best
• Data is king!
• Why do companies like Facebook, Google, etc. do so well?
• Because everyone on social media feeds them new data all the time, and even annotates it for them by labeling it with likes, etc.
Difference
• What is the difference between a regular data science class and a
data science class for cyber security?
• The data
• Everything else is the same. The machine learning algorithms do not
really change.
• What changes is how the data is represented or converted to
features, samples, classes, etc.
• In this module, we will explore data sets.
Characteristics of data sets
• When thinking of a data set, think in terms of a matrix
• Rows => samples
• Columns => features
Matrix_data = np.loadtxt(dataset, delimiter=",", skiprows=1)  # dataset is the path to a CSV file
X = Matrix_data[:, :4]   # feature columns
y = Matrix_data[:, 4]    # label column
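• To make the snippet above runnable end to end, here is a self-contained sketch; the file name iris.csv is only a placeholder for any CSV with a header row, four feature columns, and a final label column:

import numpy as np

dataset = "iris.csv"  # placeholder file name, not a file provided with this unit

Matrix_data = np.loadtxt(dataset, delimiter=",", skiprows=1)
X = Matrix_data[:, :4]   # rows = samples, columns = features
y = Matrix_data[:, 4]    # one class label per sample

print(X.shape)   # e.g. (150, 4): 150 samples, 4 features
print(y.shape)   # e.g. (150,)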
Iris
• The Iris dataset is a classic multivariate data set. It contains 150
samples in total. There are 3 class labels in this data set: Setosa,
Versicolour, and Virginica. Each sample is described by 4 features
used for prediction.
X1 X2 X3 X4 y
[[5.1 3.5 1.4 0.2 0. ]
[4.9 3. 1.4 0.2 0. ]
…
[5.9 3. 5.1 1.8 2. ]]
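• As a minimal sketch (not from the original slides), the same data can also be loaded with the sklearn datasets library used later in this unit:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data      # 150 x 4 feature matrix (X1..X4 above)
y = iris.target    # 150 labels: 0 = Setosa, 1 = Versicolour, 2 = Virginica
print(X.shape, y.shape)   # (150, 4) (150,)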
Text Data set
• Phishing Websites Data Set:
• Phishing Dataset
Images data set
Speech
NSL-KDD
• Network intrusion
UNSW big data
• Network
Phishing
Honeypot unsupervised
Malware
Fraud Detection
Biometrics
What are features?
• Ways to represent a sample
• Vector space model
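• As a small illustrative sketch (an addition, not from the original slides), the vector space model turns each text sample into a vector of word counts; scikit-learn's CountVectorizer does this directly:

from sklearn.feature_extraction.text import CountVectorizer

# two toy samples, e.g. email subject lines (purely illustrative)
samples = ["verify your account now", "meeting notes for monday"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(samples)      # each row is one sample's feature vector
print(vectorizer.get_feature_names_out())  # vocabulary = feature names (get_feature_names() in older versions)
print(X.toarray())                         # rows = samples, columns = word counts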
Types of features
• Binary
• Continuous
Dimensionality
• Number of features determines dimensionality
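• A toy example (not from the original slides) with one binary and two continuous features; the number of columns is the dimensionality:

import numpy as np

# column 0: binary (e.g., "uses HTTPS"); columns 1-2: continuous (e.g., URL length, entropy)
X = np.array([[1, 54.0, 3.2],
              [0, 112.0, 4.7],
              [1, 33.0, 2.9]])

n_samples, n_features = X.shape
print(n_features)   # 3 features, so each sample is a point in 3-dimensional space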
• Machine Learning (ML) is essential for automated systems to make
decisions and to infer new knowledge about the world.
• Machine learning approaches can be divided into
• supervised learning (such as Support Vector Machines)
• unsupervised learning (such as K-means clustering).
Within supervised approaches
• Within supervised approaches, the learning methodologies can be
divided based on whether they
• predict a class (classifiers)
• or a magnitude (regression models)
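• A minimal sketch of the distinction using scikit-learn (the choice of models and of predicting petal width is purely illustrative, not from the original slides):

from sklearn import datasets
from sklearn.svm import SVC, SVR

X, y = datasets.load_iris(return_X_y=True)

clf = SVC().fit(X, y)               # classifier: predicts a discrete class
print(clf.predict(X[:1]))           # e.g. [0]

reg = SVR().fit(X[:, :3], X[:, 3])  # regressor: predicts a magnitude (here, petal width)
print(reg.predict(X[:1, :3]))       # a real-valued prediction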
• An additional categorization for these methods depends on whether
they use
• sequential
• or non-sequential data.
• Support Vector Machines: Supervised learning approach that optimizes the margin that separates data. Pros: SLT confidence characteristic (expected risk). Cons: Class imbalance issues.
• Decision Trees: Performs classification by constructing trees where branches are separated by decision points. Pros: Easy to understand. Cons: Not flexible.
• Neural Networks: Model represents the structure of the human brain with neurons and links to the neurons. Pros: Versatile. Cons: Can obscure the underlying structure of the model.
• K-means clustering: Unsupervised method that forms k clusters to minimize distance between centroids and members of each cluster. Pros: Unsupervised, so no training needed. Cons: Needs clearly defined separations in the data in order to be effective.
• Linear Discriminant Analysis (LDA): Creates a linear function of the features to classify data. Pros: Simple yet robust classification method. Cons: Normality assumptions on the classes.
• Naïve Bayes: Probabilistic learning that calculates the probability of seeing a certain condition in the world by selecting the most probable class given the feature vector. Pros: Fast; easy to understand the model. Cons: Bayes assumptions of independence.
• Maximum Likelihood Estimation (MLE): Calculates the likelihood that an object will be seen based on its proportion in the sample data. Pros: Simple. Cons: Too simplistic for some applications.
• Hidden Markov Models (HMM): A Markov chain is a weighted automaton consisting of nodes and arcs, where the nodes represent states and the arcs represent the probability of going from one state to another. Pros: Probabilistic; good for sequence mining. Cons: Combinatorial complexity; needs prior knowledge.
A few ML algorithms
• Classifiers are machine learning approaches that produce as an
output a specific class given some input features.
• Important classifiers include:
• Support Vector Machines (Burges 1998) commonly implemented using
LibSVM (Chang and Lin, 2001)
• Naïve Bayes
• artificial neural networks
• deep learning based neural networks
• decision trees
• random forests
• k-nearest neighbor classifier
Deep learning
• Deep learning based methods are simply neural nets with more
layers.
• Deep learning methods have made a big impact in the field of
machine learning in recent years.
• Given enough computational power, they can automatically learn the
optimal features to be used in a classification problem.
Feature Engineering
• In the past, deciding which features to use required humans to
engineer the features by hand.
• This issue has now been alleviated somewhat by deep learning.
• This has revolutionized the industry.
• Additionally, artificial neural networks are classifiers that can handle
non-linearly separable data.
• In theory, this capability allows them to model data that may be
more difficult to classify.
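• A small illustrative sketch (an addition, not the author's code) of a neural network handling the classic non-linearly separable XOR problem with scikit-learn:

from sklearn.neural_network import MLPClassifier

# XOR: no straight line can separate the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs", max_iter=1000, random_state=1)
clf.fit(X, y)
print(clf.predict(X))   # typically recovers [0 1 1 0]; a different seed may be needed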
Libraries
• import numpy as np
• from sklearn import datasets
• from sklearn.model_selection import train_test_split  ## sklearn.cross_validation in older scikit-learn versions
• from sklearn.preprocessing import StandardScaler
• from sklearn.metrics import accuracy_score
• from matplotlib.colors import ListedColormap
• import matplotlib.pyplot as plt
• from sklearn.metrics import confusion_matrix
• from sklearn.metrics import precision_score
• from sklearn.metrics import recall_score, f1_score
• import pandas as pd
• from sklearn.preprocessing import LabelEncoder
• from sklearn import decomposition
SKlearn library
• The sklearn (scikit-learn) library is the main library; it contains most of the
traditional machine learning tools we will use.
• The numpy library is essential for efficient matrix and linear algebra
operations.
• For those with experience with MATLAB, numpy is a way of performing linear
algebra operations in Python similar to how they are done in MATLAB.
• This makes the code more efficient in its implementation and faster as well.
• The datasets library helps to obtain standard corpora. You can use it
to obtain annotated data like Fisher’s iris data set, for instance.
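• For example (a small sketch added here for illustration), MATLAB-style matrix operations in numpy look like this:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 0.0])

print(A @ b)             # matrix-vector product
print(np.linalg.inv(A))  # matrix inverse
print(A.T)               # transpose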
• From sklearn.model_selection (sklearn.cross_validation in older scikit-learn
versions) we can import train_test_split, which is used to create splits in a
data matrix, such as 70% for training purposes and 30% for testing purposes.
• From sklearn.preprocessing we can import the StandardScaler
module, which helps to scale feature data.
• We will use functions such as these to scale our data for the
TensorFlow-based classifiers.
• Deep learning algorithms can improve significantly when data is
properly scaled, so it is recommended to do this.
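• A minimal sketch (using the iris data as a stand-in) of splitting a data set and scaling it with StandardScaler:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_normalized = scaler.transform(X_test)        # reuse the same scaling on the test data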
• Two more very important libraries are matplotlib.pyplot and pandas.
• The matplotlib.pyplot library is very useful for visualization of data
and results and the pandas library is very useful for pre-processing.
• The pandas library can be very useful for pre-processing large data sets
in fast and efficient ways.
• There are some parameters that are sometimes useful to set in your
code.
• The code sample below shows the use of np.set_printoptions.
• The function is used to print all values in a numpy array.
• This can be useful when trying to visualize the contents of a large
data set.
• ## set parameters
• np.set_printoptions(threshold=np.inf)  ## print all values in a numpy array
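• As a small sketch of typical pandas pre-processing (the file name phishing.csv and the column name label are placeholders, not files provided with this unit):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("phishing.csv")               # placeholder CSV with feature columns plus a text "label" column
df = df.dropna()                               # drop rows with missing values
y = LabelEncoder().fit_transform(df["label"])  # map text labels to integers
X = df.drop(columns=["label"]).values          # remaining columns form the feature matrix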
Splitting the data
• Let us assume that our data is stored in the matrix X.
• The code segment below uses the function train_test_split.
• This function is used to split a data set (in this case X) into 4 sets
which are X_train, X_test, y_train, y_test.
• These are the 4 sets that will be used by the traditional classifiers or
the deep learning classifiers.
• The sets that start with X hold the data (feature vectors) and the sets
that start with y hold the labels per sample (e.g. y1 for the first
feature vector, y2 for the second feature vector, and so on).
Split size
• The values test_size=0.01 and random_state=42 in the function are
parameters that define the split.
• The value 0.01 makes a train set that has 99% of all samples, while
the test set has 1% of all samples.
• In contrast, test_size=0.20 would mean an 80%/20% split.
• random_state=42 allows you to always get the same random
split since the seed is defined as 42.
• #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=48)
• ## k-folds cross validation: all data goes in the train sets (hence 0.01)
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
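• A quick check of the resulting split sizes (assuming the 150-sample iris data from earlier):

print(X_train.shape, X_test.shape)   # with test_size=0.01 on 150 samples: (148, 4) (2, 4)
print(y_train.shape, y_test.shape)   # (148,) (2,)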
• To call all the functions or classifiers you can employ the following
approach.
• Here we have defined 5 common classifiers.
• Notice that each one gets the 4 data sets obtained from the
percentage split.
• Notice also that some of the variable names have _normalized added to
them.
• This is a good, standard convention used by programmers to indicate
that the data has been scaled.
• The next chapter addresses scaling. Here you run X_train through a
scaler function to obtain X_train_normalized.
• The labels (y) are not scaled.
• #######################################
• ## ML_MAIN()
• #logistic_regression_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• #svm_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• #random_forest_rc(X_train, y_train, X_test, y_test)
• #knn_rc(X_train_normalized, y_train, X_test_normalized, y_test)
• multilayer_perceptron_rc(X_train_normalized, y_train, X_test_normalized, y_test)
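• The *_rc wrapper functions themselves are not listed in this unit; as a rough, hypothetical sketch of what one of them might look like (the body below is an assumption, not the author's actual implementation):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_rc(X_train, y_train, X_test, y_test):
    # hypothetical wrapper: train a k-nearest neighbor classifier and report test accuracy
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("knn accuracy:", accuracy_score(y_test, y_pred))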
Optimization
• Before we begin to discuss some of the machine learning algorithms,
I should say something about optimization.
• Optimization is a key process in machine learning.
• Basically, any supervised learning algorithm needs to learn a
prediction equation given a set of annotated data.
• This prediction function usually has a set of parameters that must be
learned.
• However, the question is “how do you learn these parameters?”
• The answer is that you do so through optimization.
• In its simplest form, optimization consists of trying a set of
parameters with your model and seeing what result they give you.
• If the result is not good, the optimization algorithm needs to decide if
you should decrease the values of the parameters or increase the
values of the parameters.
• In general, you do this in a loop (increasing and decreasing) until you
find an optimal set of parameters.
• But one of the questions to answer here is: do the values go up or
down?
• Well, as it turns out, there are methodologies based on calculus that help you
to make this decision.
Optimization graph
• The above graph represents an optimization problem.
• The y axis represents the cost (or penalty) of using a given parameter.
• The x axis represents the value of the parameter (w) being used at
the given iteration.
• The curve represents the behavior that the function being used to
minimize the cost will follow for every value of the parameter w.
• As shown in the graph, the optimal value for the curve is found
where the star is located (i.e., where the value of the cost is at a
minimum).
• So, somehow the optimization algorithm needs to travel through the
function and arrive at the position indicated by the star.
• At that point, the value of “w” reduces the cost and finds the best
solution.
• Instead of trying all values of “w” at random, the algorithm can make
educated guesses about which direction to follow (up or down).
• To do this, we can use calculus to calculate the derivative of the
function at a given point.
• This will allow us to determine the slope at that point.
• In the case of the graph, this represents the tangent line to the curve
if we calculate the derivative at point w.
• If we calculate the slope at the position of the star symbol, then the
slope is zero because the tangent at that point is parallel to the x axis.
• The slope at the point “w” will be positive.
• Based on this result, we can tell the direction we want to take for
parameter w (decrease or increase).
• This type of optimization is called gradient descent and is very
important in machine learning and deep learning.
• There are several approaches to implement gradient descent and this
is just the simplest explanation for conceptual purposes.
old_x = 0
new_x = 4
step_size = 0.01
precision = 0.00001

# derivative of f(x) = x**3 - 3*x**2 + 7
def function_derivative(x):
    return 3*x**2 - 6*x

# simple gradient descent on the parameter x
while abs(new_x - old_x) > precision:
    old_x = new_x
    new_x = old_x - step_size * function_derivative(old_x)

print("result is:", new_x)
• In the previous code example we assume a function of
• f(x) = x^3 - 3x^2 + 7
• that needs to be optimized for parameter x. We will need the value of
the derivative for each point x. The derivative of f(x) is:
• f'(x) = 3x^2 - 6x
• So, the parameter x can be calculated in a loop using the derivative
function which will determine the direction to follow when increasing
or decreasing the parameter x.
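• As a quick check (an addition, not part of the original slides), setting the derivative to zero confirms where the loop should converge:
f'(x) = 3x^2 - 6x = 3x(x - 2) = 0, so x = 0 or x = 2
f''(x) = 6x - 6 is positive at x = 2, so x = 2 is the local minimum, with f(2) = 8 - 12 + 7 = 3
• Starting the code above at new_x = 4 therefore drives the result toward x = 2.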
Summary
• Data sets and features