Introduction to EDA and Data Analytics with Power BI
COMP 499 Introduction to Data Analytics
Lecture 9 — Exploratory Data Analysis
Greg Butler
Data Science Research Centre
and
Centre for Structural and Functional Genomics
and
Computer Science and Software Engineering
Concordia University, Montreal, Canada
Exploratory Data Analysis (EDA)
Outline of Lecture
- EDA: Concepts, Steps, Methods
- Skewness and Kurtosis
- Regression: Curve Fitting
- Dimension reduction: PCA
- Clustering
- Feature Engineering
Exploratory Data Analysis
Tukey 1977 book
John Tukey (1977), Exploratory Data Analysis, Addison-Wesley.
NIST Engineering Statistics Handbook
Exploratory Data Analysis (EDA) is an approach/philosophy for data
analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
The EDA approach is not a set of techniques, but an attitude/philosophy
about how a data analysis should be carried out.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Exploratory Data Analysis
NIST Engineering Statistics Handbook
EDA is an approach to data analysis
that postpones the usual assumptions about what kind of model
the data follow
with the more direct approach of
allowing the data itself
to reveal its underlying structure and model.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
EDA Checklist
1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different
types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out
of your data?
Daniel Bourke, A Gentle Introduction to Exploratory Data Analysis,
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
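Checklist item 3 (missing data) can be made concrete with a small sketch. The records, the column names, and the choice of median imputation are illustrative assumptions, not taken from the cited article.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "city": "Montreal", "income": 52000},
    {"age": None, "city": "Laval", "income": None},
    {"age": 29, "city": None, "income": None},
]

def missing_counts(rows):
    """Count missing (None) entries per column."""
    columns = rows[0].keys()
    return {col: sum(1 for row in rows if row[col] is None) for col in columns}

def fill_with_median(rows, column):
    """Return copies of the rows with missing values in `column`
    replaced by the median of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

print(missing_counts(records))
filled = fill_with_median(records, "age")
print([row["age"] for row in filled])
```

Median imputation is only one option; dropping incomplete rows or flagging missingness as its own feature may suit other data sets better.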
EDA Circle of Life
Daniel Bourke, A Gentle Introduction to Exploratory Data Analysis,
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
EDA: Skewness and Kurtosis
Besides analyses to characterize central tendency and variability ...
a further characterization of the data includes skewness and kurtosis.
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of
symmetry.
A distribution, or data set, is symmetric if it looks the same to the left
and right of the center point.
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.
That is, data sets with high kurtosis tend to have heavy tails, or outliers.
Data sets with low kurtosis tend to have light tails, or lack of outliers.
A uniform distribution is an extreme example of light tails.
Detecting Skewness and Kurtosis
The histogram is an effective graphical technique for showing both the
skewness and kurtosis of a data set.
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
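Besides the histogram, both quantities can be computed numerically. The following is a minimal sketch using the moment-based definitions (dividing by n); the function names and sample data are illustrative assumptions. With this definition a normal distribution has kurtosis 3.

```python
def central_moment(data, order):
    """Central moment of the given order, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n

def skewness(data):
    """Third standardized moment: 0 for symmetric data,
    positive when the right tail is longer."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth standardized moment: about 3 for normal data,
    larger for heavy tails, smaller for light tails."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]            # evenly spread, light tails
right_skewed = [1, 1, 2, 2, 3, 10]     # one large value stretches the right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

Some libraries report "excess kurtosis" (this value minus 3), so it is worth checking which convention a tool uses before comparing numbers.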
Process: Exploratory Data Analysis
Exploratory Data Analysis
Learn about the properties of the data
Steps for Exploratory Data Analysis
- Descriptive statistics: mean/median and variance, quantiles, outliers
- Correlation
- Fitting curves and distributions
- Dimension reduction
- Clustering
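The first step can be sketched with the Python standard library. The measurements are hypothetical, and flagging outliers with the 1.5 x IQR rule is one common convention assumed here, not something the lecture prescribes.

```python
import statistics

# Hypothetical measurements with one suspicious value.
values = [12, 15, 14, 10, 13, 15, 11, 14, 13, 95]

mean = statistics.mean(values)
median = statistics.median(values)
variance = statistics.variance(values)           # sample variance (n - 1)
q1, q2, q3 = statistics.quantiles(values, n=4)   # quartiles

# 1.5 * IQR rule: points far outside the middle 50% are flagged.
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print(f"mean={mean}, median={median}, variance={variance:.1f}")
print(f"quartiles=({q1}, {q2}, {q3}), outliers={outliers}")
```

The gap between the mean and the median here is itself a hint that an extreme value is pulling the mean upward.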
Regression: Curve Fitting
Regression Analysis
a set of statistical processes for estimating the relationships among
variables
helps one understand how the typical value of the dependent
variable changes
when any one of the independent variables is varied,
while the other independent variables are held fixed.
Linear Regression
fit a line to (x,y) data
y is the dependent variable, x is the independent variable
Curve Fitting
Can fit other forms of curves to data
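A minimal sketch of ordinary least-squares line fitting, followed by one way to fit a non-linear form: an exponential curve fitted as a line in log space. The function name `fit_line` and both data sets are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Linear regression on roughly linear (hypothetical) data.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
slope, intercept = fit_line(xs, ys)

# Curve fitting: y = a * exp(b * x) becomes a line after taking log(y).
ys_exp = [2.0 * math.exp(0.5 * x) for x in xs]
b, log_a = fit_line(xs, [math.log(y) for y in ys_exp])
a = math.exp(log_a)

print(f"line: y = {slope:.2f} x + {intercept:.2f}")
print(f"exponential: y = {a:.2f} * exp({b:.2f} x)")
```

The log transform only works when all y values are positive; other curve families generally need a non-linear optimizer.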
Dimension reduction: PCA
Principal Component Analysis (PCA)
Aim: to identify the combinations of variables that explain the
variability in the data set
Method
Transform original set of correlated variables
into
set of orthogonal (uncorrelated) variables
- linear combination of original variables
- first principal component accounts for as much of the variability as possible
- second PC accounts for as much of the remaining variability as possible
- etc.
Map to PC for Dimension Reduction
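The method can be sketched for the two-variable case, where the eigen-decomposition of the covariance matrix has a closed form. The function name `pca_2d` and the data are illustrative assumptions; for more variables a linear algebra library would normally be used.

```python
import math

def pca_2d(points):
    """PCA for 2-D points via the eigen-decomposition of the
    2x2 sample covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix (lam1 >= lam2).
    mid = (a + c) / 2
    half = math.sqrt(((a - c) / 2) ** 2 + b * b)
    lam1, lam2 = mid + half, mid - half

    # Eigenvector for lam1 gives the first principal direction.
    if b != 0:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    pc2 = (-pc1[1], pc1[0])          # orthogonal to pc1

    # Dimension reduction: keep only the coordinate along pc1.
    scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
    explained = lam1 / (lam1 + lam2)
    return pc1, pc2, explained, scores

# Hypothetical correlated measurements.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
pc1, pc2, explained, scores = pca_2d(points)
print(f"PC1 direction={pc1}, share of variance explained={explained:.3f}")
```

Mapping each observation to its `scores` value replaces two correlated variables with one, at the cost of the variance not explained by the first component.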
Clustering
Clustering
brings together “similar” observations
Distances
Many potential distances
Euclidean distance
Manhattan distance
Cosine distance
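The three distances listed above can be written directly from their definitions. This is a sketch for points given as equal-length tuples of numbers; cosine distance is taken here as 1 minus cosine similarity, which is a common convention and an assumption about what the slide intends.

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """1 - cosine similarity; depends on direction, not magnitude.
    Undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

print(euclidean((0, 0), (3, 4)))        # 5.0
print(manhattan((0, 0), (3, 4)))        # 7
print(cosine_distance((1, 0), (0, 1)))  # 1.0
```

The choice matters: (1, 2) and (2, 4) are far apart in Euclidean terms but have cosine distance close to zero because they point in the same direction.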
k-Means Clustering
Creates k clusters, pre-defined k
Start with k random centroids
Iteratively assign points to nearest centroid,
and recompute centroids
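The loop described above can be sketched as follows. Choosing the initial centroids from the data points, the fixed seed, the iteration cap, and the sample points are all illustrative assumptions.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Basic k-means with Euclidean distance on tuples of numbers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # k random starting centroids
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On less clearly separated data the result can depend on the starting centroids, so k-means is often run several times with different seeds.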
Agglomerative Clustering
Start with each point as its own cluster
Iteratively merge closest clusters
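A minimal sketch of the merge loop. "Closest" is interpreted here as single linkage (the smallest distance between any two members), and the loop stops at a requested number of clusters; both choices, along with the sample points, are assumptions for illustration.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage and Euclidean distance."""
    clusters = [[p] for p in points]       # every point starts alone

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])    # merge j into i
        del clusters[j]
    return clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
merged = agglomerative(points, n_clusters=2)
print(merged)
```

Other linkage rules (complete, average) change only the `linkage` function; recording the order of merges instead of stopping early would give the full dendrogram.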
Clusters define Nominal Dimension
Clustering: Consistency of Data
Cluster/sort data values
To bring together
duplicate and similar data values
to make it easy to see differences/errors
(See OpenRefine video 1 of 3)
Cluster observations
To bring together
duplicate and similar observations
to make it easy to see differences/errors
Check for consistency
Differences need to be investigated
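One way to bring similar data values together is to group them by a normalized key, in the spirit of the key-collision clustering offered by tools such as OpenRefine. The normalization rules below (trim, lowercase, strip punctuation, sort unique tokens) are a simplified assumption, not OpenRefine's exact algorithm, and the names are hypothetical.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalized key: values with the same key are likely variants."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster_values(values):
    """Group values by fingerprint; keep only groups whose members differ."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

names = [
    "Concordia University",
    "concordia university ",
    "University, Concordia",
    "McGill University",
]
for group in cluster_values(names):
    print(group)
```

Each reported group is a candidate inconsistency; whether the variants should be merged still needs to be checked by a person.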
Feature Engineering
Feature
A feature is an attribute or property shared by all of the
independent units on which analysis or prediction is to be done.
Any attribute could be a feature, as long as it is useful to the
model.
Process of Feature Engineering
- Brainstorming or Testing features;
- Deciding what features to create;
- Creating features;
- Checking how the features work with your model;
- Improving your features if needed;
- Go back to brainstorming/creating more features until the work is done.
See video 3, Ryan Baker, Coursera, Big Data Week 3 Feature Engineering
https://www.youtube.com/watch?v=drUToKxEAUA
Feature Creation
Aggregation
Basic aggregation operators
- sum
- mean, median, mode
- frequency
Other
- binning
Transformation
Apply a transformation to features
- normalization, unification, resolution, regularization
- log
- feature split
- scaling
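A sketch of feature creation by aggregation and transformation. The purchase records, the feature names, and the specific transformations (min-max scaling, log(1 + x), splitting a date string) are illustrative assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical raw records: (customer, purchase amount).
purchases = [("alice", 20.0), ("bob", 5.0), ("alice", 35.0),
             ("bob", 5.0), ("alice", 20.0)]

# Aggregation: one row of features per customer.
by_customer = defaultdict(list)
for customer, amount in purchases:
    by_customer[customer].append(amount)

features = {
    customer: {
        "sum": sum(amounts),
        "mean": mean(amounts),
        "median": median(amounts),
        "mode": mode(amounts),
        "frequency": len(amounts),
    }
    for customer, amounts in by_customer.items()
}

# Transformations applied to a feature column.
def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """log(1 + x) compresses large values; requires x > -1."""
    return [math.log1p(v) for v in values]

def split_date(text):
    """Feature split: one 'YYYY-MM-DD' string becomes three features."""
    year, month, day = (int(part) for part in text.split("-"))
    return {"year": year, "month": month, "day": day}

totals = [features[c]["sum"] for c in ("alice", "bob")]
print(features)
print(min_max_scale(totals), log_transform(totals))
print(split_date("2024-03-15"))
```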
Feature Creation: Binning
Numerical Data to Categorical Data
Example: Age
Define bins:
Infant for age between 0 – 4
Child for age between 5 – 12
Teen for age between 13 – 19
YoungAdult for age between 20 – 29
Adult for age between 30 – 44
Mature for age between 45 – 64
Senior for age between 65 – 79
Elderly for age 80 and over
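The age bins above can be sketched as a lookup on the lower bound of each bin. The function name `age_bin` is an assumption, as is the treatment of fractional ages (4.5 falls in Infant because the next bin starts at 5).

```python
from bisect import bisect_right

# Lower bound of each bin, in increasing order, with matching labels.
BIN_STARTS = [0, 5, 13, 20, 30, 45, 65, 80]
BIN_LABELS = ["Infant", "Child", "Teen", "YoungAdult",
              "Adult", "Mature", "Senior", "Elderly"]

def age_bin(age):
    """Map a numerical age to its categorical bin."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many bin starts are <= age.
    return BIN_LABELS[bisect_right(BIN_STARTS, age) - 1]

ages = [3, 12, 13, 29, 44, 64, 79, 80]
print([age_bin(a) for a in ages])
```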
Feature Contribution
Correlation Example
r² measures how much of the variation is explained by linear regression
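For a single feature and a straight-line fit, this quantity is the square of the Pearson correlation coefficient r. A minimal sketch; the function name and data are illustrative assumptions.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical feature and target values.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Here a line in x accounts for about 60% of the variation in y; the rest is left unexplained by this one feature.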
Contribution to Model
When building a model from your dataset,
does the technique allow you
to know the contribution of each feature?
Compare with PCA
PCA finds principal orthogonal components
components are ranked by contribution
components are defined as combinations of features