Machine Learning
with Azure
Barbara Fusinska
@BasiaFusinska
About me
Programmer
Machine Learning
Data Scientist
@BasiaFusinska
https://github.com/BasiaFusinska/AzureMLWorkshop
Agenda
• What’s Machine Learning?
• Azure ML Experiments
• Classification
• Regression
• Publishing the Web Service
• Azure Data Sources
• Resampling methods
• Machine Learning Tuning
• Exploratory Data Analysis
• Clustering
• Cortana Intelligence Gallery
• Jupyter Notebooks
• Retraining the model
What’s the reason you’re here?
What are you hoping to find out?
When/How are you going to use this
knowledge?
My goals - Teaching
• What’s Machine Learning?
• How to use Azure ML Studio?
• Show how to start and where to
go next
https://github.com/BasiaFusinska/AzureMLWorkshop
Setup
• Clone or download
https://github.com/BasiaFusinska/AzureMLWorkshop
• Sign up for Azure Machine Learning
Studio
https://studio.azureml.net
• Sign in to Azure Machine Learning
Studio
• Other tools: Visual Studio, RStudio,
Python
Machine Learning?
Movies Genres
Title # Kisses # Kicks Genre
Taken 3 47 Action
Love story 24 2 Romance
P.S. I love you 17 3 Romance
Rush hours 5 51 Action
Bad boys 7 42 Action
Question:
What is the genre of
Gone with the wind
?
Data-based classification
Id Feature 1 Feature 2 Class
1. 3 47 A
2. 24 2 B
3. 17 3 B
4. 5 51 A
5. 7 42 A
Question:
What is the class of the entry
with the following features:
F1: 31, F2: 4
?
Data Visualization
[Figure: scatter plot of the entries with a straight line separating class A from class B]
Rule 1:
If on the left side of the
line then Class = A
Rule 2:
If on the right side of the
line then Class = B
Chick sexing
Supervised
learning
• Classification, regression
• Label, target value
• Training & Validation
phases
Unsupervised
learning
• Clustering, feature
selection
• Finding structure of data
• Statistical values
describing the data
Supervised Machine Learning workflow
[Diagram: Preprocess data -> Clean data -> Data split -> Training data -> Machine Learning algorithm -> Trained model -> Score (with Test data)]
Publishing the model
[Diagram labels: Training data, Model Training, Machine Learning Model, Publish model, Published Machine Learning Model, Test stream, Prediction, Scores]
Data -> Predictive model -> Operational web API in minutes
[Diagram labels: ML STUDIO, API]
Classification problem
Model training
Data & Labels
Classification data
Source #Links #Characters ... Fake
TopNews 10 2750 … T
Twitter 2 120 … F
TopNews 235 502 … F
Channel X 1530 3024 … T
Twitter 24 70 … F
StoryLeaks 722 1408 … T
Facebook 98 230 … T
… … … … ...
Features
Labels
Iris Dataset
• Features:
• Sepal length
• Sepal width
• Petal length
• Petal width
• Species:
• Setosa
• Versicolor
• Virginica
http://archive.ics.uci.edu/ml/datasets/Iris
Data
classification:
Two-class Iris
Demo
Evaluation methods for classification
Confusion Matrix

                      Reference Positive   Reference Negative
Prediction Positive   TP                   FP
Prediction Negative   FN                   TN

Receiver Operating Characteristic (ROC) curve
Area under the curve (AUC)

$$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
How good it is at detecting positives

$$\text{Specificity} = \frac{TN}{TN + FP}$$
How good it is at avoiding false alarms
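A minimal sketch of these four measures computed from confusion-matrix counts. The function name and the counts are hypothetical, chosen only for illustration; specificity uses the standard definition TN / (TN + FP).

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),   # true negatives among all reference negatives
    }

# Hypothetical counts, not taken from any experiment in this workshop.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics)
```

The sketch does not guard against empty denominators (for example, no positive predictions at all); a real evaluation would need to handle that case.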
https://azure.microsoft.com/en-gb/pricing/details/machine-learning/
K-Nearest Neighbours Algorithm
• Object is classified by a majority
vote
• k – algorithm parameter
• Distance metrics: Euclidean
(continuous variables), Hamming
(text)
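A small sketch of the majority vote, using the five entries from the data-based classification slide and its question (F1: 31, F2: 4). The choice of k = 3 and the function name are assumptions made for illustration; the Euclidean metric is the one named above for continuous variables.

```python
import math
from collections import Counter

# (Feature 1, Feature 2, Class) rows from the data-based classification slide.
training = [(3, 47, "A"), (24, 2, "B"), (17, 3, "B"), (5, 51, "A"), (7, 42, "A")]

def knn_predict(query, data, k=3):
    """Classify `query` by a majority vote among its k nearest training entries."""
    by_distance = sorted(data, key=lambda row: math.dist(query, row[:2]))
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict((31, 4), training, k=3)
print(prediction)  # prints B: two of the three nearest entries are class B
```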
Naïve Bayes classifier
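$$p(C_k \mid \mathbf{x}) = p(C_k \mid x_1, \dots, x_k) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}, \qquad \mathbf{x} = (x_1, \dots, x_k)$$
• posterior: p(C_k | x)
• prior: p(C_k)
• likelihood: p(x | C_k)
• evidence: p(x)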
Naïve Bayes example
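Training data:
Sex      Height   Weight   Foot size
Male     6        190      11
Male     6.2      170      10
Female   5        130      6
…        …        …        …

Entry to classify:
Sex      Height   Weight   Foot size
?        5.9      140      8

$$p(\text{male} \mid \mathbf{x}) = \frac{p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$$

$$p(\text{female} \mid \mathbf{x}) = \frac{p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$$

$$\text{evidence} = p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$$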
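One way to turn these formulas into numbers is to assume each feature is normally distributed within a class. The sketch below does that; the per-class means and variances and the equal priors are hypothetical placeholders (the slide elides most of the training rows), so the output only illustrates the mechanics.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# HYPOTHETICAL (mean, variance) per class for height, weight, foot size.
stats = {
    "male":   [(5.86, 0.035), (176.0, 123.0), (11.25, 0.92)],
    "female": [(5.42, 0.097), (132.5, 558.0), (7.5, 1.67)],
}
priors = {"male": 0.5, "female": 0.5}  # assumed equal priors

def posteriors(x):
    """Return p(class | x) for every class: prior * likelihood / evidence."""
    numerators = {}
    for label, params in stats.items():
        likelihood = math.prod(gaussian_pdf(v, m, s) for v, (m, s) in zip(x, params))
        numerators[label] = priors[label] * likelihood
    evidence = sum(numerators.values())
    return {label: n / evidence for label, n in numerators.items()}

result = posteriors((5.9, 140, 8))
print(result)
```

The "naive" part is the product of per-feature likelihoods, which treats the features as independent given the class.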
Logistic regression
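$$z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

$$y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases}$$

$$y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$$

• Logistic function: φ(z)
• Coefficients: β_0, …, β_k
• Best fit of β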
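A short sketch of the decision rule, taking φ to be the standard logistic function φ(z) = 1 / (1 + e^(-z)). The coefficients are hypothetical; in practice they come from fitting β to the training data.

```python
import math

def logistic(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(x, beta):
    """beta[0] is the intercept; beta[1:] pair up with the features in x."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1 if logistic(z) > 0.5 else 0

beta = [-1.0, 2.0]           # hypothetical coefficients
print(predict([1.0], beta))  # z = 1, so the predicted class is 1
print(predict([0.0], beta))  # z = -1, so the predicted class is 0
```

Because φ(0) = 0.5 and φ is increasing, thresholding φ(z) at 0.5 gives the same class as thresholding z at 0.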
Decision trees
• Use the information gain and
entropy
• Finding the feature that best
splits the dataset
• Build the tree
• Prune the tree
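A minimal sketch of how entropy and information gain score one candidate split. It reuses Feature 1 and the classes from the data-based classification slide; the threshold of 10 and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# (Feature 1, Class) pairs from the data-based classification slide.
rows = [(3, "A"), (24, "B"), (17, "B"), (5, "A"), (7, "A")]
threshold = 10  # hypothetical split point
left = [label for f1, label in rows if f1 <= threshold]
right = [label for f1, label in rows if f1 > threshold]
gain = information_gain([label for _, label in rows], [left, right])
print(round(gain, 3))
```

A tree builder would evaluate this gain for every feature and candidate threshold and keep the split with the highest value.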
Task: Adult Census
Income Prediction
• Built-in dataset sample
• Data exploration
• Classification statement
• Data split
• Training
• Performance evaluation
• Results visualisation
https://archive.ics.uci.edu/ml/datasets/census+income
Task: Data
preparation
• Data exploration
• Missing data
• Feature selection
Publishing the
experiment
Demo
API
https://azure.microsoft.com/en-gb/pricing/details/machine-learning/
Task: Publishing
income prediction
• Set up predictive experiment
• Set up the Web Service
• Deploy the Web Service
• Additionally:
• Remove income from the request
• Only return Scores
Azure ML data sources
• Built-in datasets
• Uploaded data
• Import Data module:
• Web URL via HTTP
• Hive Query
• SQL Database (Azure SQL or Azure VM)
• Azure Table
• Azure Blob Storage
• Data Feed Provider (OData)
• Azure CosmosDB
Task: Upload
dataset
• Download the Prestige.csv file
• Add dataset to Azure ML Studio
• Upload the downloaded file
Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
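A sketch of the gradient descent option for a single feature, minimising the mean squared error. The toy data (exactly y = 2x + 1), the learning rate and the step count are all assumptions chosen so that the loop converges.

```python
def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit f(x) = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

xs = [0, 1, 2, 3, 4]   # toy data lying on y = 2x + 1
ys = [1, 3, 5, 7, 9]
b0, b1 = fit_gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

Too large a learning rate makes the updates diverge, which is one reason the analytical solution is often preferred for small linear problems.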
Ordinary linear regression
Residual sum of squares (RSS)
$$S(w) = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw)$$

$$\hat{w} = \arg\min_{w} S(w)$$
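For one feature plus an intercept, the minimiser of S(w) has a simple closed form. The sketch below uses it; the data points and function names are hypothetical.

```python
def fit_ols(xs, ys):
    """Closed-form least squares for f(x) = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rss(xs, ys, intercept, slope):
    """Residual sum of squares S(w) for the given coefficients."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]           # hypothetical observations
ys = [2.1, 3.9, 6.2, 7.8]
intercept, slope = fit_ols(xs, ys)
print(intercept, slope, rss(xs, ys, intercept, slope))
```

Any other pair of coefficients gives a larger RSS on the same data, which is what the arg min expresses.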
Regression
problem
Demo
Evaluation methods for regression
• Errors
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (f_i - y_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2}$$
• Statistics (t, ANOVA)
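Both error measures are easy to compute directly. In this sketch, `predicted` holds the fitted values f_i and `actual` the observed values y_i; the sample numbers are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between fitted values and observations."""
    n = len(actual)
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(predicted, actual)) / n)

def r_squared(predicted, actual):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((f - y) ** 2 for f, y in zip(predicted, actual))
    ss_tot = sum((mean_y - y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]     # hypothetical observations
perfect = [1.0, 2.0, 3.0]    # a model that fits exactly
baseline = [2.0, 2.0, 2.0]   # a model that always predicts the mean
print(rmse(perfect, actual), r_squared(perfect, actual))
print(rmse(baseline, actual), r_squared(baseline, actual))
```

Always predicting the mean gives R² = 0, so R² can be read as the improvement over that baseline.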
Residuals vs
Fitted
• Check if residuals have non-
linear patterns
• Check if the model captures
the non-linear relationship
• Should show equally spread
residuals around the
horizontal line
Normal Q-Q
• Shows if the residuals are
normally distributed
• Values should be lined on the
straight dashed line
• Check if residuals do not
deviate severely
Scale-Location
• Show if residuals are spread
equally along the ranges of
predictors
• Test the assumption of equal
variance (homoscedasticity)
• Should show horizontal line
with equally (randomly)
spread points
Residuals vs
Leverage
• Helps to find influential cases
• When outside of the Cook’s
distance the cases are
influential
• With no influential cases
Cook’s distance lines should
be barely visible
Task: Prestige EDA
• Descriptive statistics (dimensions,
rows, columns, data types,
correlation)
• Distributions, correlations, outliers
• Handle missing data
• Features significance
Categorical data for regression
• Categories A, B, C are coded as
dummy variables
• In general, if the variable has k
categories it will be encoded into
k-1 dummy variables

Category   V1   V2
A          0    0
B          1    0
C          0    1

$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$$
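A small sketch of the k-1 dummy coding, treating the first category as the baseline that maps to all zeros. The function name is hypothetical; the output reproduces the A, B, C table.

```python
def dummy_encode(value, categories):
    """Encode `value` as k-1 indicators; categories[0] is the all-zero baseline."""
    return [1 if value == c else 0 for c in categories[1:]]

categories = ["A", "B", "C"]
encoded = {c: dummy_encode(c, categories) for c in categories}
print(encoded)  # prints {'A': [0, 0], 'B': [1, 0], 'C': [0, 1]}
```

Dropping one category avoids a redundant column: with an intercept in the model, the k-th indicator would be fully determined by the other k-1.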
Categorical data for regression
$$f(x) = \beta_0 + \beta_1 x + \beta_2 v_1 + \dots + \beta_k v_{k-1} + \beta_{k+1} v_1 x + \dots + \beta_{2k-1} v_{k-1} x$$

y ~ x + cat + x:cat
Task: Prestige
Regression
• Numeric and categorical features
• Linear regression training
• Algorithm evaluation
• Set Up the Web Service
Resampling: Bootstrapping
k-fold cross validation
Data
resampling
Demo
Task: Cross-
validation
• Use income prediction
classification
• Replace splitting data to train and
test with cross-validation
• Algorithm evaluation
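Outside the Studio, the idea of replacing a single train/test split with k folds can be sketched as an index generator. This is an illustrative stand-in, not the Studio's cross-validation module; the function name is hypothetical and the rows are not shuffled.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Each row is used for testing exactly once, and the per-fold evaluation results are then averaged. Real data should usually be shuffled before the folds are cut.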
Machine Learning Tuning
• Data preparation
• Data cleansing
• Normalisation
• Removing/Adding duplicates
• Algorithms
• Comparing different methods
• Adjusting algorithm to the
problem
• Hyperparameters
Parameters
tuning
Demo
Task: Tuning
• Tune the Income Classification
problem
• Use Decision Tree classification
algorithm
• Tune the parameters using range
of values
• Performance evaluation
Task: Compare
different
algorithms
• Use Income prediction experiment
• Use four different classification
algorithms
• Compare the algorithms' performance
Exploratory Data Analysis
• Descriptive statistics
(dimensions, rows, columns,
data types, correlation)
• Data visualization (distributions,
outliers)
• Missing data
• Duplicate data
• Data transformations
• Features significance
Task: Flight delays
EDA
• Dataset EDA
• Built-in datasets
• Join Airport codes & Airport names
• Join Weather dataset
• Set up categorical data
• Clean missing data
• Check for duplicates
Task: Flight delays
predictions
• Remove target-leaking features
• Classification problem
• Define the target value
• Train the model
• Regression problem
• Define the target value
• Use linear regression
Customising the process
• Programming languages: R &
Python
• R Scripts
• R Models
• Python Scripts
R Script
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset2 <- maml.mapInputPort(2) # class: data.frame
# Contents of optional Zip port are in ./src/
# source("src/yourfile.R");
# load("src/yourData.rdata");
# Sample operation
data.set = rbind(dataset1, dataset2);
# You'll see this output in the R Device port.
# It'll have your stdout, stderr and PNG graphics device(s).
plot(data.set);
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set");
Python Script
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# imports up here can be used to
import pandas as pd

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Execution logic goes here
    print('Input pandas.DataFrame #1:\r\n\r\n{0}'.format(dataframe1))

    # If a zip file is connected to the third input port,
    # it is unzipped under ".\Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule

    # Return value must be a sequence of pandas.DataFrame
    # (the trailing comma makes this a one-element tuple)
    return dataframe1,
R & Python
Scripts
Demo
R model: Trainer
# Input: dataset
# Output: model
# The code below is an example which can be replaced with your own code.
# See the help page of "Create R Model" module for the list of predefined
# functions and constants.
library(e1071)
features <- get.feature.columns(dataset)
labels <- as.factor(get.label.column(dataset))
train.data <- data.frame(features, labels)
feature.names <- get.feature.column.names(dataset)
names(train.data) <- c(feature.names, "Class")
model <- naiveBayes(Class ~ ., train.data)
R model: Scorer
# Input: model, dataset
# Output: scores
# The code below is an example which can be replaced with your own code.
# See the help page of "Create R Model" module for the list of predefined
# functions and constants.
library(e1071)
probabilities <- predict(model, dataset, type="raw")[,2]
classes <- as.factor(as.numeric(probabilities >= 0.5))
scores <- data.frame(classes, probabilities)
R Model
Demo
Clustering problem
K-means Algorithm
Hierarchical clustering
• Decision of where the cluster
should be split
• Metric: distance between pairs
of observations
• Linkage criterion: dissimilarity of
sets
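To make the two ingredients concrete, the sketch below uses the Euclidean distance as the metric and shows two common linkage criteria, single and complete linkage. Both criteria and the sample clusters are assumptions added for illustration; the slide does not name a specific linkage.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Dissimilarity of two sets: the smallest pairwise distance."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Dissimilarity of two sets: the largest pairwise distance."""
    return max(math.dist(p, q) for p, q in product(a, b))

cluster_a = [(0, 0), (0, 1)]   # hypothetical clusters
cluster_b = [(3, 0), (4, 0)]
print(single_linkage(cluster_a, cluster_b))    # closest pair: (0, 0) and (3, 0)
print(complete_linkage(cluster_a, cluster_b))  # farthest pair: (0, 1) and (4, 0)
```

An agglomerative algorithm repeatedly merges the two clusters with the smallest linkage value, so the choice of criterion changes the shape of the resulting hierarchy.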
Clustering
Irises
Demo
Evaluation
methods for
clustering
• Sum of squares
• Class based measures
• Underlying true
Task: Income
Clustering
• Use Adult Census Income dataset
• Clustering using k-means
algorithm
• Compare clusters with the original
classes assignments
• Visualise the findings
Cortana Intelligence Gallery
https://gallery.cortanaintelligence.com/
Task: Twitter
sentiment
• Find Twitter sentiment Experiment
• Open the experiment in Azure ML
Studio
• Run the experiment and visualise
the results
Cortana
Gallery
Demo
Jupyter Notebooks
• Running cells
• Markdown documentation
• Different kernels
• Visualisation
Azure
Notebooks
Demo
https://notebooks.azure.com/
Azure ML
Notebooks
Demo
Retraining the model
• Set up Retraining Web Service
• Output node connected with the
saved model
• New training dataset
• Batch execution
Keep in touch
BarbaraFusinska.com
Barbara@Fusinska.com
@BasiaFusinska
https://github.com/BasiaFusinska/AzureMLWorkshop
