Types of Data &
Data Preprocessing
Prof. Navneet Goyal
Department of Computer Science &
Information Systems
BITS, Pilani
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
Why Preprocess Data?
 Data in the real world is dirty
 Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in
codes or names
 Welcome to the real world!
 No quality data, no quality mining results!
 Quality decisions must be based on quality
data
Understanding Your Data
 Descriptive data summarization
 Foundation for data preprocessing
 Central tendency: mean, mode, median
 Data dispersion: quartiles, interquartile range (IQR), variance
 Distributive measure: can be computed by partitioning the data and
merging the partial results, e.g., sum, count, max, min
 Algebraic measure: an algebraic function of one or more
distributive measures
 Examples: average, weighted average
Understanding Your Data
 Mean is sensitive to extreme values
 Solution: Trimming
 For skewed data: the median is a better
measure (the middle value of the ordered set,
or the average of the two middle values when n is even)
 Holistic measure: cannot be
computed by data partitioning
 Example: median
 Computationally more expensive
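As an illustration of trimming, here is a minimal NumPy sketch (the 20% trimming fraction and the salary values are made up for illustration):

```python
import numpy as np

def trimmed_mean(values, trim_fraction=0.2):
    """Drop the lowest and highest trim_fraction of values, then average."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(len(x) * trim_fraction)        # number of values dropped at each end
    return x[k:len(x) - k].mean() if k > 0 else x.mean()

salaries = [30, 32, 35, 38, 40, 41, 43, 45, 900]   # one extreme value
print(np.mean(salaries))            # ~133.8, pulled up by the extreme value
print(trimmed_mean(salaries))       # ~39.1, after dropping 30 and 900
```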
Understanding Your Data
 Mode: the most frequently occurring data value
 Unimodal, bimodal, trimodal, multimodal
 A data set may also have no mode!
 Dispersion of data: range, quartiles, outliers
 Range = max − min
 The kth percentile of a set of data in numerical order is the
value xi having the property that k% of the data values lie
at or below xi
 Median is the 50th percentile
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Give an idea about the center, spread, & shape of the distribution
 IQR = Q3 − Q1 (all holistic measures)
Understanding Your Data
 Outliers: often singled out as values falling at least
1.5 × IQR above Q3 or below Q1
 Which of the measures discussed so far are
themselves one of the data values?
 Five-number summary:
minimum, Q1, median, Q3, maximum
(minimum and maximum are included since Q1, median, and Q3
alone carry no information about the tails)
 Boxplots
 Variance & standard deviation
 Interpret σ = 0 & σ > 0
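A small NumPy sketch of the five-number summary, IQR, and the 1.5 × IQR outlier rule (note that percentile interpolation conventions vary slightly across tools); it reuses the price data from the binning example later in the deck:

```python
import numpy as np

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum plus the 1.5 x IQR outlier fences."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # outlier fences
    outliers = x[(x < lower) | (x > upper)]
    return {"min": x.min(), "Q1": q1, "median": median, "Q3": q3,
            "max": x.max(), "IQR": iqr, "outliers": outliers}

print(five_number_summary([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
```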
Major Tasks in Data
Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies

 Data integration
 Integration of multiple databases, data
cubes, or files

 Data transformation
 Normalization and aggregation
Major Tasks in Data
Preprocessing
 Data reduction (sampling)
 Obtains reduced representation in volume but
produces the same or similar analytical
results

 Data discretization
 Part of data reduction but with particular
importance, especially for numerical data
Forms of data preprocessing

Figure taken from Han & Kamber Book: Data Mining Concepts & Techniques, 2e
Data Cleaning
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy
data
 Correct inconsistent data
Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data

 Missing data may be due to
 equipment malfunction
 data deleted because it was inconsistent with other recorded data
 data not entered due to misunderstanding
 certain data may not have been considered important at the
time of entry
 history or changes of the data not registered

 Missing data may need to be inferred.
How to Handle
Missing Data?
 Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value:
inference-based, such as a Bayesian formula or decision tree
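The mean-based strategies above can be sketched with pandas; the table, the column names, and the -1 placeholder constant below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sales table with missing 'income' values.
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "class":    ["gold", "silver", "gold", "silver", "gold"],
    "income":   [52000, np.nan, 61000, 38000, np.nan],
})

# 1. Global constant (stands in for "unknown")
filled_const = df["income"].fillna(-1)

# 2. Attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# 3. Attribute mean per class (smarter: uses the class label)
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
```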
Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
 Smooth out the data to remove noise
Smoothing Techniques
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.

 Clustering
 detect and remove outliers

 Combined computer and human inspection
 detect suspicious values and check by human

 Regression
 smooth by fitting the data into regression functions
Binning
 Binning methods smooth a sorted data
value by consulting its neighborhood,
that is, values around it
 Sorted values are distributed into a
number of ‘buckets’ or ‘bins’
 Binning does local smoothing
 Different binning methods illustrated by
an example
 Also used as a data discretization technique
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:

 It divides the range into N intervals of equal size:
uniform grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
 Most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:

It divides the range into N intervals, each containing
approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
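A short sketch of the two partitioning schemes (NumPy; the values are the price data used on the next slide):

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: interval width W = (B - A) / N
A, B = values.min(), values.max()
width_edges = np.linspace(A, B, N + 1)            # [ 4., 14., 24., 34.]

# Equal-depth: each bin receives (approximately) the same number of values
depth_bins = np.array_split(np.sort(values), N)   # three bins of four values
```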
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21,
24, 25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
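A minimal sketch that reproduces the smoothing above (Python/NumPy; bin means are rounded to integers as the slide does):

```python
import numpy as np

prices = np.array(sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(prices, 3)                  # equi-depth bins of 4 values

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[int(np.round(b.mean()))] * len(b) for b in bins]
# -> [9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]

# Smoothing by bin boundaries: every value snaps to the closer boundary
def snap_to_boundaries(b):
    lo, hi = int(b[0]), int(b[-1])
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [snap_to_boundaries(b) for b in bins]
# -> [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]
```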
Regression
Figure: data points with a fitted regression line y = x + 1; the value Y1
at X1 is mapped to the value Y1’ on the line
Cluster Analysis
Figure: 2-D data grouped into clusters
Data Smoothing &
Reduction
 Many methods discussed above for data
smoothing are also methods for data
reduction involving discretization
 For example, binning reduces the number of
distinct values per attribute (a form of
data reduction for logic-based data
mining methods such as decision tree
induction)
Data Integration
 Data integration:
 combines data from multiple sources into a coherent
store

 Schema integration
 integrate metadata from different sources
 Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡
B.cust-#

 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from
different sources are different
 possible reasons: different representations, different
scales, e.g., metric vs. British units
Handling Redundant Data
in Data Integration
 Redundant data often occur when integrating
multiple databases
 The same attribute may have different names in
different databases
 One attribute may be a “derived” attribute in another
table, e.g., annual revenue

 Redundant data may be detected by correlation
analysis (Pearson’s Correlation Coefficient)
 Correlation does not imply Causality
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
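A hedged sketch of correlation-based redundancy detection (NumPy; the attribute values and the 0.95 threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
annual_revenue = np.array([1.2, 2.3, 3.1, 4.0, 5.2, 6.1])
monthly_revenue = annual_revenue / 12 + rng.normal(0, 0.01, 6)   # near-duplicate

# Pearson's correlation coefficient; |r| near 1 suggests a redundant attribute
r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
if abs(r) > 0.95:                 # illustrative threshold
    print(f"r = {r:.3f}: attributes look redundant; consider dropping one")
```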
Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

 Attribute/feature construction
 New attributes constructed from the given ones
Data Transformation:
Normalization
 min-max normalization
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
 z-score normalization (zero-mean)
v' = (v − mean_A) / stand_dev_A
 normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
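The three normalizations can be sketched directly from the formulas above (NumPy; the sample values and the [0, 1] target range are assumptions):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization (zero mean, unit standard deviation)
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = 0
while (np.abs(v) / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j            # here j = 4, so values become 0.02 ... 0.1
```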
Data Transformation:
Attribute Construction
 New attributes are constructed from the given
attributes and added
 Improves accuracy
 Helps in understanding the structure of
high-dimensional data
 For example, add an area attribute based on the
height & width attributes
 Knowing about relationships among
attributes helps in knowledge discovery
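For instance, the area attribute mentioned above could be constructed like this (pandas; the small table is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5, 4.0], "width": [1.5, 2.0, 2.5]})
df["area"] = df["height"] * df["width"]   # new attribute derived from existing ones
```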
Data Reduction Strategies
 Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that
is much smaller in volume but yet produces the
same (or almost the same) analytical results

 Data reduction strategies
 Data cube aggregation
 Attribute subset selection (feature subset selection)
 Dimensionality reduction
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation
 Cube at the lowest level of abstraction – base cuboid
 Cube at highest level of abstraction – apex cuboid
 Cubes created at various levels of abstraction,
depending upon the analysis task, are called cuboids
 A cube is a lattice of cuboids
 Data volume reduces as we move up from the base to the apex
cuboid
 While doing data mining, the smallest available cuboid
relevant to the given task should be used
 Cube aggregation gives smaller data without loss of
information necessary for the analysis task
Attribute subset selection
 Also called Feature subset selection
 Leave out irrelevant attributes and pick only relevant
attributes
 A difficult and time-consuming process
 Reduces the data size by removing irrelevant or
redundant attributes (dimensions)
 Goal is to select a minimum set of features such that the
resulting probability distribution of data classes is as close as
possible to the original distribution given the values of all
features
 Additional benefit: fewer attributes appear in discovered
patterns, making interpretation easier
Attribute subset selection
 How to select a good representative subset?
 For N attributes, there are 2^N possible subsets
 Heuristic methods that explore a reduced search space
are generally used (due to the exponential number of choices)
 Greedy algorithms
 Heuristic methods:
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward
elimination
 decision-tree induction
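A sketch of greedy step-wise forward selection; `score` is a caller-supplied evaluation function (e.g., cross-validated accuracy of a classifier) and is an assumption, not something defined in the slides:

```python
def forward_selection(attributes, score, k):
    """Greedily add the attribute that most improves score(subset), k times."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        # Pick the attribute whose addition gives the best-scoring subset.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Backward elimination works the same way in reverse: start from the full
# set and repeatedly drop the attribute whose removal hurts the score least.
```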
Example of Decision Tree
Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Figure: induced decision tree with A4? at the root, A1? and A6? as
internal nodes, and leaves labelled Class 1 / Class 2
Reduced attribute set: {A1, A4, A6}
Wavelet Transforms
Figure: Haar-2 and Daubechies-4 wavelet functions
 Discrete wavelet transform (DWT): linear signal
processing
 Compressed approximation: store only a small fraction of
the strongest wavelet coefficients
 Similar to the discrete Fourier transform (DFT), but better
lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (pad with 0s when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applied to pairs of data, resulting in two sets of data of length L/2
 The two functions are applied recursively, until the desired length is reached
Figure taken from Han & Kamber Book: Data Mining Concepts & Techniques, 2e
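A minimal sketch of the pairwise smoothing/difference recursion for the Haar case (plain NumPy; uses the unnormalized average/difference variant and assumes the input length is already a power of 2):

```python
import numpy as np

def haar_dwt(signal):
    """Full Haar decomposition: repeatedly split into averages and differences."""
    x = np.asarray(signal, dtype=float)
    coeffs = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2      # smoothing: pairwise averages
        detail = (x[0::2] - x[1::2]) / 2      # difference: pairwise details
        coeffs.append(detail)
        x = smooth                            # recurse on the smoothed half
    coeffs.append(x)                          # overall average remains
    return coeffs[::-1]

# Compression idea: keep only the largest-magnitude coefficients, zero the rest.
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```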
Principal Component
Analysis
 Given N data vectors in k dimensions,
find c ≤ k orthogonal vectors that can
best be used to represent the data
 The original data set is reduced to one consisting
of N data vectors on c principal components
(reduced dimensions)

 Each data vector is a linear combination of
the c principal component vectors
 Works for numeric data only
 Used when the number of dimensions is
large
Principal Component
Analysis
Figure: 2-D data in the original axes X1, X2 with principal components Y1 and Y2
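A compact sketch of the reduction step (NumPy SVD on centered data; the random input matrix is just a placeholder):

```python
import numpy as np

def pca_reduce(X, c):
    """Project an N x k data matrix X onto its first c principal components."""
    X_centered = X - X.mean(axis=0)                  # PCA works on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:c]                              # c orthogonal directions
    return X_centered @ components.T                 # N x c reduced data

X = np.random.rand(100, 5)         # 100 data vectors in 5 dimensions
X_reduced = pca_reduce(X, c=2)     # each vector now described by 2 numbers
```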
Numerosity Reduction
 Can we reduce the data volume by
choosing alternative, ‘smaller’ forms of
data representation?
 Techniques:
 Parametric
 Non-parametric methods
Numerosity Reduction
 Parametric methods
 Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
 Log-linear models

 Non-parametric methods
 Do not assume models; store reduced
representations of the data
 Major families: histograms, clustering, sampling
Regression and Log-Linear Models
 Linear regression: data are modeled to fit a straight
line
 Often uses the least-squares method to fit the line

 Multiple regression: allows a response variable Y to
be modeled as a linear function of a multidimensional
feature vector
 Log-linear model: approximates discrete
multidimensional probability distributions
Regression Analysis and
Log-Linear Models
 Linear regression: Y = α + β X
 Two parameters, α and β, specify the line and are
to be estimated using the data at hand
 by applying the least-squares criterion to the known
values of Y1, Y2, …, X1, X2, …
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into
the above
 Log-linear models:
 The multi-way table of joint probabilities is
approximated by a product of lower-order tables
 Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
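The least-squares estimates of α and β can be computed in closed form; a small sketch with made-up (X, Y) values:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates for Y = alpha + beta * X
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

# Numerosity reduction: keep only (alpha, beta) instead of the raw tuples.
Y_hat = alpha + beta * X
```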
Histograms
 A popular data reduction technique
 Divide data into buckets and store
average (sum) for each bucket
 Can be constructed optimally in one
dimension using dynamic programming
 Related to quantization problems.
Figure: example histogram (bucket values 10,000–90,000 on the x-axis,
counts 0–40 on the y-axis)
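A small sketch of an equi-width histogram as a reduced representation (NumPy; reuses the earlier price data and stores only per-bucket statistics):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)     # 3 equal-width buckets
bucket_of = np.digitize(prices, edges[1:-1])     # bucket index for each value
bucket_means = [prices[bucket_of == b].mean() for b in range(3)]

# Store only (edges, counts, means) instead of all the raw values.
print(list(zip(edges[:-1], edges[1:], counts, bucket_means)))
```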
Clustering
 Partition data set into clusters, and one can store
cluster representation only
 Can be very effective if data is clustered but not if
data is “smeared”
 Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms, further detailed in Chapter 8
Sampling
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew

 Develop adaptive sampling methods
 Stratified sampling:

 Approximate the percentage of each class (or
subpopulation of interest) in the overall database
 Used in conjunction with skewed data

 Sampling may not reduce database I/Os (page at a
time).
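A hedged sketch of the sampling schemes using pandas (the toy table, the 90/10 class split, and the sample sizes are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10,
                   "value": rng.normal(size=100)})

srswor = df.sample(n=20, replace=False, random_state=0)   # without replacement
srswr  = df.sample(n=20, replace=True,  random_state=0)   # with replacement

# Stratified sample: take ~20% of each class so the A/B proportions are kept,
# which protects against the skew that plain SRS can miss.
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0))
```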
Sampling
Figure: SRSWOR (simple random sample without replacement) and
SRSWR (simple random sample with replacement) drawn from the raw data
Sampling
Figure: raw data reduced to a cluster/stratified sample