COMP 499 Introduction to Data Analytics
Lecture 9 — Exploratory Data Analysis
Greg Butler
Data Science Research Centre
and
Centre for Structural and Functional Genomics
and
Computer Science and Software Engineering
Concordia University, Montreal, Canada
gregb@cs.concordia.ca
Exploratory Data Analysis (EDA)
Outline of Lecture
I EDA: Concepts, Steps, Methods
I Skewness and Kurtosis
I Regression: Curve Fitting
I Dimension reduction: PCA
I Clustering
I Feature Engineering
Data Analytics
(source: Wikipedia)
Exploratory Data Analysis
Tukey 1977 book
John Tukey (1977), Exploratory Data Analysis, Addison-Wesley.
NIST Engineering Statistics Handbook
Exploratory Data Analysis (EDA) is an approach/philosophy for data
analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
The EDA approach is not a set of techniques, but an attitude/philosophy
about how a data analysis should be carried out.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Exploratory Data Analysis
NIST Engineering Statistics Handbook
EDA is an approach to data analysis
that postpones the usual assumptions about what kind of model
the data follow
with the more direct approach of
allowing the data itself
to reveal its underlying structure and model.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
EDA Checklist
1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different
types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out
of your data?
Daniel Bourke, A Gentle Introduction to Exploratory Data Analysis,
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
EDA Circle of Life
Daniel Bourke, A Gentle Introduction to Exploratory Data Analysis,
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
EDA Methods
EDA Steps
EDA Key Concepts
EDA: Skewness and Kurtosis
Besides analyses to characterize central tendency and variability ...
a further characterization of the data includes skewness and kurtosis.
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of
symmetry.
A distribution, or data set, is symmetric if it looks the same to the left
and right of the center point.
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.
That is, data sets with high kurtosis tend to have heavy tails, or outliers.
Data sets with low kurtosis tend to have light tails, or lack of outliers.
A uniform distribution would be the extreme case.
Detecting Skewness and Kurtosis
The histogram is an effective graphical technique for showing both the
skewness and kurtosis of a data set.
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
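A minimal Python sketch of computing skewness and excess kurtosis with SciPy on a hypothetical right-skewed sample (the column name and data are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.exponential(scale=2.0, size=1000)})

print("skewness:", stats.skew(df["value"]))             # > 0 for a right skew
print("excess kurtosis:", stats.kurtosis(df["value"]))  # 0 for a normal distribution

# A histogram makes both visible at a glance (requires matplotlib)
df["value"].hist(bins=30)
```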
Process: Exploratory Data Analysis
Exploratory Data Analysis
Learn about the properties of the data
Steps for Exploratory Data Analysis
I Descriptive statistics: mean/median and variance, quantiles,
outliers
I Correlation
I Fitting curves and distributions
I Dimension reduction
I Clustering
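As a sketch of the first of these steps in Python with pandas, on a small hypothetical dataset (the column names and the 1.5 × IQR outlier rule are illustrative choices, not part of the slides):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one planted outlier in column x
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.append(rng.normal(50, 5, size=99), 500.0),  # 500 is the outlier
    "y": rng.normal(size=100),
})

print(df.describe())   # mean, std, quantiles per numeric column
print(df.corr())       # pairwise correlation

# Flag potential outliers with the 1.5 * IQR rule
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outlier(s) in x")
```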
Regression: Curve Fitting
Regression Analysis
a set of statistical processes for estimating the relationships among
variables
helps one understand how the typical value of the dependent
variable changes
when any one of the independent variables is varied,
while the other independent variables are held fixed.
Linear Regression
fit a line to (x, y) data
y is the dependent variable, x is the independent variable
Curve Fitting
Can fit other forms of curves to data
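A minimal NumPy sketch, using synthetic (x, y) data, of fitting a line and then a higher-order curve; the data and coefficients are illustrative:

```python
import numpy as np

# Synthetic (x, y) data: a linear trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)

# Linear regression: least-squares fit of y = a*x + b
a, b = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {a:.2f}x + {b:.2f}")

# Other curve forms can be fit the same way, e.g. a quadratic
c2, c1, c0 = np.polyfit(x, y, deg=2)
```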
Regression: Curve Fitting
Anscombe’s Quartet
Dimension reduction: PCA
Principal Component Analysis (PCA)
Aim: to identify the combinations of variables that explain the
variability in the data set
Method
Transform original set of correlated variables
into
set of orthogonal (independent) variables
I linear combination of original variables
I first principal component accounts for as much of variability
as possible
I second PC accounts for as much of remaining variability as
possible
I etc
Map to PC for Dimension Reduction
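A minimal scikit-learn sketch of PCA on synthetic correlated variables; standardizing first and keeping two components are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three variables, the first two strongly correlated
rng = np.random.default_rng(2)
t = rng.normal(size=(200, 1))
X = np.hstack([t,
               2 * t + rng.normal(scale=0.1, size=(200, 1)),
               rng.normal(size=(200, 1))])

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to scale

pca = PCA(n_components=2)                  # keep the first two PCs
X_reduced = pca.fit_transform(X_std)       # dimension reduced 3 -> 2

print(pca.components_)                 # each PC as a linear combination of variables
print(pca.explained_variance_ratio_)   # share of variance explained per PC
```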
Clustering
Clustering
brings together “similar” observations
Distances
Many potential distances
Euclidean distance
Manhattan distance
Cosine distance
k-Means Clustering
Creates k clusters, with k chosen in advance
Start with k random centroids
Iteratively assign points to nearest centroid,
and recompute centroids
Agglomerative Clustering
Start with each point as its own cluster
Iteratively merge closest clusters
Cluster labels define a nominal dimension
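A minimal scikit-learn sketch of both clustering methods and the three distances on synthetic 2-D points (the data, k = 2, and parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Synthetic 2-D observations forming two groups
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# Different distance measures between the first two points
for metric in ("euclidean", "manhattan", "cosine"):
    print(metric, pairwise_distances(X[:2], metric=metric)[0, 1])

# k-means: k fixed in advance, centroids refined iteratively
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Agglomerative: start from singleton clusters, merge the closest pair repeatedly
agg = AgglomerativeClustering(n_clusters=2).fit(X)

# The labels can serve as a new nominal feature
print(km.labels_[:5], agg.labels_[:5])
```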
Clustering: Consistency of Data
Cluster/sort data values
To bring together
duplicate and similar data values
to make it easy to see differences/errors
(See OpenRefine video 1 of 3)
Cluster observations
To bring together
duplicate and similar observations
to make it easy to see differences/errors
Check for consistency
Differences need to be investigated
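As a rough sketch of clustering data values for consistency checking (in the spirit of OpenRefine's key-collision clustering, not its exact algorithm), values can be grouped by a normalized "fingerprint"; the city names and the fingerprint rule are hypothetical:

```python
import unicodedata
import pandas as pd

# Hypothetical column with near-duplicate spellings of the same value
s = pd.Series(["Montreal", "montreal ", "Montréal", "MONTREAL", "Toronto"])

def fingerprint(value: str) -> str:
    # Strip accents, lowercase, collapse whitespace, sort tokens
    norm = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    return " ".join(sorted(norm.lower().split()))

# Values sharing a fingerprint are candidates for reconciliation
groups = s.groupby(s.map(fingerprint)).apply(list)
print(groups)
```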
Feature Engineering
Feature
A feature is an attribute or property shared by all of the
independent units on which analysis or prediction is to be done.
Any attribute could be a feature, as long as it is useful to the
model.
Process of Feature Engineering
I Brainstorming or Testing features;
I Deciding what features to create;
I Creating features;
I Checking how the features work with your model;
I Improving your features if needed;
I Go back to brainstorming/creating more features until the
work is done.
See video 3, Ryan Baker, Coursera, Big Data Week 3 Feature Engineering
https://www.youtube.com/watch?v=drUToKxEAUA
Feature Creation
Aggregation
Basic aggregation operators
I sum
I mean, median, mode
I frequency
Other
I binning
Transformation
Apply a transformation to features
I normalization, unification, resolution, regularization
I log
I feature split
I scaling
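A pandas sketch of aggregation and transformation on hypothetical transaction data (the customer/amount column names, log1p, and min-max scaling are illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: one row per purchase
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 25.0, 5.0, 7.5, 300.0],
})

# Aggregation: one row per customer with sum, mean and frequency
per_customer = df.groupby("customer")["amount"].agg(["sum", "mean", "count"])

# Transformation: log to tame a heavy right tail, then min-max scaling
per_customer["log_sum"] = np.log1p(per_customer["sum"])
per_customer["sum_scaled"] = (per_customer["sum"] - per_customer["sum"].min()) / (
    per_customer["sum"].max() - per_customer["sum"].min())
print(per_customer)
```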
Feature Creation: Binning
Numerical Data to Categorical Data
Example: Age
Define bins:
Infant for age between 0 – 4
Child for age between 5 – 12
Teen for age between 13 – 19
YoungAdult for age between 20 – 29
Adult for age between 30 – 44
Mature for age between 45 – 64
Senior for age between 65 – 79
Elderly for age 80 and over
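A pandas sketch of exactly these age bins with pd.cut (the sample ages and the upper cap of 200 are illustrative; the bins are right-inclusive, so 4 falls in Infant and 5 in Child):

```python
import pandas as pd

ages = pd.Series([3, 10, 16, 24, 37, 50, 70, 85])   # hypothetical ages

bins = [0, 4, 12, 19, 29, 44, 64, 79, 200]           # 200 as an open-ended upper cap
labels = ["Infant", "Child", "Teen", "YoungAdult",
          "Adult", "Mature", "Senior", "Elderly"]

# Right-inclusive intervals: (0, 4], (4, 12], ..., (79, 200]
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(age_group)
```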
Feature Creation: Splitting
Feature Splitting
Example: Name split to FirstName, LastName
Example: Date 2019-06-21 split to Year, Month, Day
Python featuretools
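A plain-pandas sketch of both splits (featuretools can automate this kind of derivation; the names and dates below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ada Lovelace", "Alan Turing"],
    "Date": ["2019-06-21", "2020-01-15"],
})

# Name -> FirstName, LastName (split on the first space only)
df[["FirstName", "LastName"]] = df["Name"].str.split(" ", n=1, expand=True)

# Date -> Year, Month, Day
dates = pd.to_datetime(df["Date"])
df["Year"], df["Month"], df["Day"] = dates.dt.year, dates.dt.month, dates.dt.day
print(df)
```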
Feature Contribution
Correlation Example
r² measures how much of the variation is explained by the linear regression
Contribution to Model
When building a model from your dataset,
does the technique allow you
to know the contribution of each feature?
Compare with PCA
PCA finds orthogonal principal components
components are ranked by the variance they explain
each component is a linear combination of the original features
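A SciPy sketch of the r² idea applied per feature, on synthetic data where the target depends strongly on one feature and weakly on another (the setup is illustrative):

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends strongly on x1, weakly on x2
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 3.0 * x1 + 0.2 * x2 + rng.normal(scale=0.5, size=200)

# r^2 of a simple linear regression of y on each feature separately
for name, x in [("x1", x1), ("x2", x2)]:
    res = stats.linregress(x, y)
    print(f"{name}: r^2 = {res.rvalue ** 2:.3f}")
```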
