Dimension Reduction Methods
Principal Components & Factor
Analysis
Need for Dimension Reduction
• The databases typically used in data mining may have millions of records
and thousands of variables
• It is unlikely that all of the variables are independent, with no correlation
structure among them
• Data analysts need to guard against multicollinearity, a condition where
some of the predictor variables are strongly correlated with each other
• Multicollinearity leads to instability in the solution space, leading to
possible incoherent results, such as in multiple regression, where a
multicollinear set of predictors can result in a regression which is
significant overall, even when none of the individual variables is significant
• Even if such instability is avoided, inclusion of variables which are highly
correlated tends to overemphasize a particular component of the model,
as the component is essentially being double counted
Need for Dimension Reduction
• Statisticians have noted that the sample size needed to fit a multivariate
function grows exponentially with the number of variables - higher-
dimensional spaces are inherently sparse
• For example, the empirical rule tells us that, in 1-d, about 68% of normally
distributed variates lie within ±1𝜎 of the mean; while, for a 10-d
multivariate normal distribution, only about 0.02% of the data lies within the
analogous hypersphere
• The use of too many predictor variables to model a relationship with a
response variable can unnecessarily complicate the interpretation of the
analysis, and violates the principle of parsimony
• Also, retaining too many variables may lead to overfitting, in which the
generality of the findings is hindered because new data do not behave the
same as the training data for all the variables
Need for Dimension Reduction
• Further, analysis solely at the variable-level might miss the fundamental
underlying relationships among the predictors
• Several predictors might fall naturally into a single group (a factor or a
component) that addresses a single aspect of the data
• For example, the variables savings account balance, checking account
balance, home equity, stock portfolio value, and 401k balance might all fall
together under the single component, assets
• In some applications, such as image analysis, retaining full dimensionality
would make most problems intractable. For example, a face classification
system based on 256 × 256 pixel images could potentially require vectors
of dimension 65,536
Goals
• To reduce the number of predictor items
• To help ensure that these predictor items are independent
• To provide a framework for interpretability of the results
• But how?
Using the correlation structure
among the predictor variables
Principal Components Analysis (PCA)
• PCA seeks to explain the correlation structure of a set of predictor variables, using a
smaller set of linear combinations of these variables called components
• The total variability of a data set produced by the complete set of m variables can
often be mostly accounted for by a smaller set of k linear combinations of these
variables, meaning that there is almost as much information in the k components as
in the original m variables
• If desired, the analyst can then replace the original m variables with the k < m
components, so that the working data set now consists of n records on k
components, rather than n records on m variables
• The analyst should note that PCA acts solely on the predictor variables, and ignores
the target variable
Principal Components Analysis (PCA)
• Suppose that the original variables 𝕩 = 𝑥1, 𝑥2, … , 𝑥𝑚 form a coordinate system in
m-dimensional space
• The principal components represent a new coordinate system, found by rotating the
original system along the directions of maximum variability
• When preparing to perform data reduction, the analyst should first standardize the
data, so that the mean for each variable is zero, and the standard deviation is one
• Standardization is done by Z = (V^(1/2))^(-1) (X − μ), that is, z_ij = (x_ij − μ_j) / σ_jj for each variable
• Here, μ = (μ_1, μ_2, …, μ_m) is the vector of variable means, and V^(1/2) is the diagonal matrix whose diagonal entries are the standard deviations σ_11, σ_22, …, σ_mm
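A minimal R sketch of this standardization step, assuming the predictors sit in a numeric data frame named `X` (a made-up toy example, not the course data):

```r
# Toy predictors on very different scales
X <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                x2 = runif(100, min = 0, max = 5))

# Z = (V^(1/2))^(-1) (X - mu), applied column by column
Z <- scale(X, center = TRUE, scale = TRUE)

# Check: column means ~ 0 and column standard deviations = 1
round(colMeans(Z), 10)
apply(Z, 2, sd)
```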
Principal Components Analysis (PCA)
• How is the covariance matrix created? The sample covariance matrix of the predictors has (i, j) entry σ_ij = Cov(x_i, x_j)
• Covariance measures are not scale-free: a change in the unit of measurement of a variable changes the value of its covariances. The correlation matrix is therefore developed instead, using r_ij = σ_ij / (σ_ii σ_jj)
• The correlation matrix ρ has the following structure: ones on the diagonal and the pairwise correlations r_ij = r_ji off the diagonal

  ρ = [ 1     r_12  …  r_1m
        r_12  1     …  r_2m
        ⋮     ⋮     ⋱  ⋮
        r_1m  r_2m  …  1   ]
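A short R sketch of building the correlation matrix from the covariance matrix, continuing the hypothetical `X` from the previous block:

```r
S <- cov(X)                        # covariance matrix (depends on units of measurement)
D <- diag(1 / apply(X, 2, sd))     # diagonal matrix of reciprocal standard deviations
R_corr <- D %*% S %*% D            # r_ij = cov(x_i, x_j) / (sigma_i * sigma_j)

# Same result from the built-in helpers
all.equal(unname(R_corr), unname(cor(X)))      # TRUE
all.equal(unname(cov2cor(S)), unname(cor(X)))  # TRUE
```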
Principal Components Analysis (PCA)
• Again consider the standardized data, i.e., Z = (Z_1, Z_2, …, Z_m) = (V^(1/2))^(-1) (X − μ)
• As each variable has been standardized, Cov(Z) = ρ,
i.e., for a standardized data set, the covariance and correlation matrices are the same
• Then the ith principal component (PC) can be computed for the standardized data
matrix as Y_i = e_i^T Z, where e_i^T denotes the transpose
of the ith eigenvector of ρ
• The PCs Y_1, Y_2, …, Y_m are linear combinations of the standardized variables such that
• The variances of the PCs are as large as possible
• The PCs are uncorrelated
• The first PC is Y_1 = e_1^T Z = e_11 Z_1 + e_12 Z_2 + ⋯ + e_1m Z_m, which has greater
variability than any other possible linear combination of the Z variables
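A hedged R sketch of this construction, reusing the hypothetical standardized matrix `Z` from the earlier block:

```r
rho <- cor(Z)            # correlation matrix of the standardized data (= its covariance matrix)
eig <- eigen(rho)        # eigenvalues (in decreasing order) and eigenvectors of rho

# i-th principal component: Y_i = e_i' Z; here all components at once
Y <- Z %*% eig$vectors   # columns of Y are the PC scores

# The PCs are uncorrelated, and their variances equal the eigenvalues
round(cov(Y), 10)        # (near-)diagonal matrix
eig$values               # variances of the PCs, largest first
```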
Principal Components Analysis (PCA)
• The first PC, Y_1 = e_1^T Z, is the linear combination that maximizes Var(Y_1) = e_1^T ρ e_1, subject to e_1^T e_1 = 1
• The second PC, Y_2 = e_2^T Z, is the linear combination that maximizes Var(Y_2) = e_2^T ρ e_2 and is
independent of the first PC
• In general, the ith PC, Y_i = e_i^T Z, is the linear combination which is independent of all
the other PCs and maximizes Var(Y_i) = e_i^T ρ e_i
• Eigenvalues: If B is an m×m matrix and I is the m×m identity matrix, then the
scalars λ_1, λ_2, …, λ_m are called the eigenvalues of B if they satisfy |B − λI| = 0
• Eigenvectors: For the above matrix B and one of its eigenvalues λ, a nonzero m×1 vector e
is called an eigenvector corresponding to the eigenvalue λ if Be = λe
• Two important properties of eigenvalues and eigenvectors: when the eigenvectors are
computed on the covariance matrix, the sum of the eigenvalues equals the total variance (the trace of the matrix),
and eigenvectors corresponding to distinct eigenvalues are mutually orthogonal, e_i^T e_j = 0 for i ≠ j
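A brief R check of these definitions and properties, continuing with the hypothetical `rho` and `eig` objects from the previous sketch:

```r
# Definition check: rho %*% e_1 = lambda_1 * e_1 for the first eigenpair
e1 <- eig$vectors[, 1]
all.equal(as.vector(rho %*% e1), eig$values[1] * e1)   # TRUE

# Property checks
sum(eig$values)                            # equals the trace of rho
sum(diag(rho))                             # (= number of variables for a correlation matrix)
round(t(eig$vectors) %*% eig$vectors, 10)  # eigenvectors are orthonormal (identity matrix)
```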
Results on PCA
• The total variability in the standardized set of predictors equals the sum of the
variances of the Z-vectors, which equals the sum of the variances of the
components, which equals the sum of the eigenvalues, which equals the number of
predictors
• The partial correlation between a given component and a given predictor variable is
a function of an eigenvector and an eigenvalue. Specifically, Corr(Y_i, Z_j) = e_ij √λ_i,
where (λ_1, e_1), (λ_2, e_2), …, (λ_m, e_m) are the eigenvalue-eigenvector pairs
for the correlation matrix ρ, and λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_m
• The proportion of the total variability in Z that is explained by the ith principal
component is the ratio of the ith eigenvalue to the number of variables,
that is, the ratio λ_i / m
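These results can be verified numerically; the sketch below continues the hypothetical `Z`, `Y`, and `eig` objects from the earlier blocks:

```r
m <- ncol(Z)

# corr(Y_i, Z_j) = e_ij * sqrt(lambda_i): rows = variables, columns = components
loadings <- eig$vectors %*% diag(sqrt(eig$values))
all.equal(unname(cor(Z, Y)), unname(loadings))   # TRUE

# Proportion of total variability explained by each component: lambda_i / m
eig$values / m
sum(eig$values / m)   # the proportions sum to 1
```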
PCA on a Dataset
• Houses dataset provides census information from all the block groups from the 1990
California census
• For this data set, a block group has an average of 1425.5 people living in an area that is
geographically compact
• Block groups were excluded that contained zero entries for any of the variables
• Median house value is the response variable; the predictor variables are the following
• Median income, Housing median age, Total rooms, Total bedrooms, Population,
Households, Latitude, Longitude
• The original data set had 20,640 records, of which 18,540 were randomly selected for a
training data set, and 2100 held out for a test data set
• Median house value appears to be in dollars, but median income has been scaled to a 0–
15 continuous scale
• Note that longitude is expressed in negative terms, meaning west of Greenwich. Larger
absolute values for longitude indicate geographic locations further west
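A hedged R sketch of this workflow; the file name `houses.csv` and the column names below are assumptions for illustration, not the actual dataset layout:

```r
# Assumed layout: one row per block group, response = median_house_value,
# plus the eight predictors listed above
houses <- read.csv("houses.csv")

set.seed(1)
train_idx <- sample(nrow(houses), size = 18540)   # random training/test split
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

predictors <- c("median_income", "housing_median_age", "total_rooms",
                "total_bedrooms", "population", "households",
                "latitude", "longitude")

# PCA acts on the standardized predictors only; the response is ignored
pca_train <- prcomp(train[, predictors], center = TRUE, scale. = TRUE)
summary(pca_train)   # component standard deviations and proportions of variance explained
```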
How Many Components???
• One of the motivations for PCA was to reduce the number of distinct explanatory
elements
• Should we retain only the first principal component, as it explains nearly half the
variability? Or should we retain all eight components, as they explain 100% of the
variability?
• Clearly, retaining all eight components does not help us to reduce the number of
distinct explanatory elements
• Note from the table of eigenvalues that the eigenvalues for several of the components are rather low,
explaining less than 2% of the variability in the Z-variables
• The criteria used for deciding how many components to extract are the following:
• The Eigenvalue Criterion
• The Proportion of Variance Explained Criterion
• The Minimum Communality Criterion
• The Scree Plot Criterion
Correlation Matrix
Components Matrix
Eigenvalues and Variance Explained
Eigenvalue Criterion
• Recall that the sum of the eigenvalues represents the number of variables entered into the
PCA
• An eigenvalue of 1 would then mean that the component explains about “one
variable’s worth” of the variability
• The rationale for using the eigenvalue criterion is that each component should explain at
least one variable’s worth of the variability
• Therefore, the eigenvalue criterion states that only components with eigenvalues
greater than 1 should be retained
• Note that, if there are fewer than 20 variables, the eigenvalue criterion tends to
recommend extracting too few components
• If there are more than 50 variables, this criterion may recommend extracting too
many
• From earlier Table, we see that three components have eigenvalues greater than 1,
and are therefore retained. Component 4 has an eigenvalue of 0.825, which is not
too far from one, so that we may decide to consider retaining this component as
well
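A quick R sketch of applying the eigenvalue criterion, assuming a `prcomp` fit like the hypothetical `pca_train` object from the earlier sketch:

```r
# Eigenvalues of the correlation matrix = squared standard deviations of the components
eigenvalues <- pca_train$sdev^2

# Eigenvalue criterion: retain components whose eigenvalue exceeds 1
which(eigenvalues > 1)
```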
Proportion of Variance Explained Criterion
• First, the analyst specifies how much of the total variability he or she would like
the principal components to account for
• Then, the analyst simply selects the components one by one until the desired
proportion of variability explained is attained
• For example, suppose we would like our components to explain 85% of the
variability in the variables. Then, from Table, we would choose components 1–3,
which together explain 86.057% of the variability. However, if we wanted our
components to explain 90% or 95% of the variability, then we would need to include
component 4 along with components 1–3, which together would explain 96.368% of
the variability
• Again, as with the eigenvalue criterion, how large a proportion is enough?
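In R, the cumulative proportions can be read off directly; a sketch, continuing the hypothetical `eigenvalues` vector:

```r
prop_explained <- eigenvalues / length(eigenvalues)   # lambda_i / m
cum_prop <- cumsum(prop_explained)

# Smallest number of components whose cumulative proportion reaches 85%
which(cum_prop >= 0.85)[1]
```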
Scree Plot Criterion
• A scree plot is a graphical plot of the eigenvalues against the component number
• Scree plots are useful for finding an upper bound (maximum) for the number of
components that should be retained
• The scree plot criterion is this: The maximum number of components that should
be extracted is just before where the plot first begins to straighten out into a
horizontal line
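A one-line scree plot in R, again using the hypothetical `eigenvalues` vector:

```r
# Scree plot: eigenvalues against component number; look for where the curve flattens
plot(eigenvalues, type = "b",
     xlab = "Component number", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)   # reference line for the eigenvalue criterion
```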
Comparison of Three Criteria
• To summarize, the recommendations of our criteria are as
follows:
• The Eigenvalue Criterion:
o Retain components 1–3, but do not throw away component 4 yet
• The Proportion of Variance Explained Criterion
o Components 1–3 account for a solid 86% of the variability, and tacking on
component 4 gives us a superb 96% of the variability
• The Scree Plot Criterion
o Do not extract more than four components
Comparison of Three Criteria
• In a case like this, where there is no clear-cut best solution, why not try it both ways
and see what happens - extracting three components and extracting four components?
• The component weights smaller than 0.15 are suppressed to ease the component
interpretation
• Note that the first three components are each exactly the same in both cases, and
each is the same as when we extracted all eight components
• This is because each component extracts its portion of the variability sequentially, so
that later component extractions do not affect the earlier ones.
Communalities
• PCA does not extract all the variance from the variables, but only that proportion of
the variance that is shared by several variables
• Communality represents the proportion of variance of a particular variable that is
shared with other variables
• The communalities represent the overall importance of each of the variables in the
PCA as a whole
• For example, a variable with a communality much smaller than the other variables
indicates that this variable shares much less of the common variability among the
variables, and contributes less to the PCA solution
• Communalities that are very low for a particular variable should be an indication to
the analyst that the particular variable might not participate in the PCA solution
• Overall, large communality values indicate that the principal components have
successfully extracted a large proportion of the variability in the original variables,
while small communality values show that there is still much variation in the data
set that has not been accounted for by the principal components
Communalities
• Communality values are calculated as the sum of squared component weights for a
given variable (a worked sketch follows at the end of this list)
• We are trying to determine whether to retain component 4, the “housing age”
component
• Thus, we calculate the communality value for the variable housing median age, using the
component weights for this variable (hage_z) from the Table
• Two communality values for housing median age are calculated, one for retaining three
components, and the other for retaining four components
• Communalities less than 0.5 can be considered too low, since the variable would then share less
than half of its variability in common with the other variables
• Suppose that for some reason we wanted or needed to keep the variable housing
median age as an active part of the analysis. Then, extracting only three components
would not be adequate, as housing median age shares only 35% of its variance with the
other variables.
• If we wanted to keep this variable in the analysis, we would need to extract the fourth
component, which lifts the communality for housing median age over the 50% threshold
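A small R sketch of the communality calculation, assuming the hypothetical `pca_train` object from the earlier blocks (the names `L3` and `L4` below are introduced here for illustration):

```r
# Component weights (loadings) = eigenvector * sqrt(eigenvalue)
L_all <- pca_train$rotation %*% diag(pca_train$sdev)

L3 <- L_all[, 1:3]   # three-component solution
L4 <- L_all[, 1:4]   # four-component solution

# Communality of each variable = sum of its squared component weights
communality_3 <- rowSums(L3^2)
communality_4 <- rowSums(L4^2)

# Flag variables sharing less than half of their variability with the others
communality_3 < 0.5
communality_4 < 0.5
```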
Comparison of Four Selection Criteria
• The Eigenvalue Criterion recommended three components, but did not absolutely reject
the fourth component. Also, for small numbers of variables, this criterion can
underestimate the best number of components to extract
• The Proportion of Variance Explained Criterion stated that we needed to use
four components if we wanted to account for that superb 96% of the variability.
As our ultimate goal is to substitute these components for the original data and
use them in further modeling downstream, being able to explain so much of the
variability in the original data is very attractive
• The Scree Plot Criterion said not to exceed four components
• The Minimum Communality Criterion stated that, if we wanted to keep housing
median age in the analysis, we had to extract the fourth component. As we
intend to substitute the components for the original data, then we need to keep
this variable, and therefore we need to extract the fourth component
Validation of the PCs
• Recall that the original data set was divided into a training data set and a test data set
• In order to validate the principal components uncovered here, we now perform PCA on the
standardized variables for the test data set
• The resulting component matrix is shown in Table, with component weights smaller than
±0.50 suppressed
• Although the component weights do not exactly equal those of the training set, nevertheless
the same four components were extracted, with a one-to-one correspondence in terms of
which variables are associated with which component
• If the split sample method described here does not successfully provide validation, then the
analyst should take this as an indication that the results (for the data set as a whole) are not
generalizable, and the results should not be reported as valid.
• If the lack of validation stems from a subset of the variables, then the analyst may consider
omitting these variables, and performing the PCA again
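A hedged sketch of the split-sample validation step in R, reusing the hypothetical `test` data frame, `predictors` vector, and `pca_train` fit from the earlier block:

```r
# Repeat the PCA on the standardized test-set predictors
pca_test <- prcomp(test[, predictors], center = TRUE, scale. = TRUE)

# Compare component weights side by side (eigenvector signs are arbitrary,
# so a component may appear with flipped signs in the two solutions)
round(pca_train$rotation[, 1:4], 2)
round(pca_test$rotation[, 1:4], 2)
```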
Factor Analysis
• Related to Principal Components Analysis but has different goals
• PCA seeks to identify orthogonal linear combinations of the variables, to be used either for
descriptive purposes or to substitute a smaller number of uncorrelated components for the
original variables
• In contrast, factor analysis represents a model for the data, and as such is more elaborate
especially using factor rotation
• The factor analysis model hypothesizes that the response vector X = (X_1, X_2, …, X_m) can be
modeled as linear combinations of a smaller set of unobserved, latent random variables
called common factors, along with an error term, in the following way: X − μ = L F + ε
  X: response vector, L: loading matrix, F: vector of unobservable random factors,
  ε: error vector
• Some assumptions are made here: E(F) = 0, Cov(F) = I, E(ε) = 0,
and Cov(ε) = Ψ, where Ψ is diagonal
• Unfortunately, the factor solutions provided by factor analysis are invariant to orthogonal transformations
• The two models X − μ = L F + ε and X − μ = (L T)(T^T F) + ε, where T represents an
orthogonal transformation matrix, will both provide the same results
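A tiny R sketch illustrating this invariance: an orthogonal T changes the loadings from L to LT but leaves the implied covariance LL^T + Ψ unchanged (all values below are made-up toy numbers):

```r
set.seed(1)
L   <- matrix(rnorm(5 * 2), nrow = 5)     # toy 5x2 loading matrix
Psi <- diag(runif(5, 0.2, 0.6))           # toy diagonal specific-variance matrix

theta <- pi / 6                           # any angle gives an orthogonal rotation matrix T
T_rot <- matrix(c(cos(theta), -sin(theta),
                  sin(theta),  cos(theta)), nrow = 2)

Sigma1 <- L %*% t(L) + Psi                            # covariance implied by loadings L
Sigma2 <- (L %*% T_rot) %*% t(L %*% T_rot) + Psi      # covariance implied by loadings L T

all.equal(Sigma1, Sigma2)   # TRUE: the two models are indistinguishable from the data
```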
Application of FA
• Applying to Adults dataset
• The intended task is to find the set of demographic characteristics that can best predict
whether the individual has an income of over $50,000 per year
• For this example, we shall use only the following variables for the purpose of our factor
analysis: age, demogweight (a measure of the socioeconomic status of the individual’s
district), education_num, hours-per-week, and capnet (= capital gain - capital loss)
• We use the same training and test data partition, of 25,000 and 7561 records, respectively
• The variables were standardized, and the Z-vectors found using Z = (V^(1/2))^(-1) (X − μ)
• The correlation matrix is given below
• Note that the correlations, although statistically significant in several cases, are overall much weaker than the correlations from the
houses data set. A weaker correlation structure should pose more of a challenge for the dimension-reduction method
Application
• Factor analysis requires a certain level of correlation in order to function appropriately
• The following tests have been developed to ascertain whether there exists sufficiently high
correlation to perform factor analysis
• The proportion of variability within the standardized predictor variables which is shared,
and therefore might be caused by underlying factors, is measured by the Kaiser–Meyer–
Olkin (KMO) Measure of Sampling Adequacy. Values of the KMO statistic less than 0.50
indicate that factor analysis may not be appropriate
• Bartlett’s Test of Sphericity tests the null hypothesis that the correlation matrix is an
identity matrix, that is, that the variables are truly uncorrelated. The statistic reported is
the p-value, so that very small values would indicate evidence against the null
hypothesis, that is, the variables really are correlated. For p-values much larger than
0.10, there is insufficient evidence that the variables are correlated, and so factor
analysis may not be suitable
• So we can proceed with the factor analysis
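Both tests are available in the psych package in R; a hedged sketch, assuming the standardized adult predictors sit in a data frame named `adult_z` (an assumed name):

```r
library(psych)

R_adult <- cor(adult_z)

# Kaiser-Meyer-Olkin measure of sampling adequacy (overall MSA should be >= 0.50)
KMO(R_adult)

# Bartlett's test of sphericity: H0 = the correlation matrix is an identity matrix
cortest.bartlett(R_adult, n = nrow(adult_z))
```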
Using R
• To allow us to view the results using a scatter plot, we decide a priori to extract only two
factors
• The following factor analysis is performed using the principal axis factoring option with an
iterative procedure used to estimate the communalities and the factor solution
• This analysis required 152 such iterations before reaching convergence
• The eigenvalues and the proportions of the variance explained by each factor are shown
• Note that the first two factors extract less than half of the total variability in the variables, in
contrast with the houses data set, where the first two components extracted over 72% of
the variability; the difference is due to the weaker correlation structure here
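A hedged sketch of the extraction step using psych::fa() with principal axis factoring (the data frame name `adult_z` is again an assumption):

```r
library(psych)

# Two factors, principal axis factoring ("pa"), no rotation at this stage
fa_unrotated <- fa(adult_z, nfactors = 2, fm = "pa", rotate = "none",
                   max.iter = 200)   # allow enough iterations to reach convergence

fa_unrotated$loadings      # factor loadings
fa_unrotated$communality   # communalities
print(fa_unrotated)        # includes eigenvalues and proportions of variance explained
```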
Using R
• The factor loadings 𝐋m×k are shown in Table. Factor loadings are analogous to the component
weights in PCA, and represent the correlation between the ith variable and the jth factor
• Notice that the factor loadings are much weaker than the previous houses example, again due
to the weaker correlations among the standardized variables
• The communalities are also much weaker than in the houses example: the low communality
values reflect the fact that there is not much shared correlation among the variables
• Note that the factor extraction increases the shared correlation
Factor Rotation
• To assist in the interpretation of the factors, factor rotation may be performed
• Corresponds to a transformation (usually orthogonal) of the coordinate axes, leading to a
different set of factor loadings
• This is analogous to a scientist adjusting the focus of a microscope to elicit greater contrast and
detail; the sharpest focus occurs when each variable has high factor loadings on a
single factor, with low-to-moderate loadings on the other factors
• For the Houses example, this sharp focus occurred already on the unrotated factor loadings,
so rotation was not necessary
• However, the table of factor loadings for the Adults dataset shows that we should perhaps try factor
rotation for the adult data set, to improve our interpretation
Factor Rotation
• Note that most vectors do not closely follow the coordinate axes, which means that there is
poor “contrast” among the variables for each factor, thereby reducing interpretability
• Next, a varimax rotation was applied to the matrix of factor loadings, resulting in the new set
of factor loadings
• Note that the contrast has been increased for most variables
• Figure shows that the factor loadings have been rotated along the axes of maximum
variability, represented by Factor 1 and Factor 2
• Often, the first factor extracted represents a “general factor,” and accounts for much of the
total variability
Factor Rotation
• The effect of factor rotation is to redistribute this first factor’s variability explained among the
second, third, and subsequent factors
• The sum of squared loadings for Factor 1 in the unrotated case represents 10.7% of the total
variability, and about 61% of the variance explained by the first two factors
• In the rotated case, Factor 1’s influence has been partially redistributed to Factor 2:
Factor 1 now accounts for 9.6% of the total variability and about 55% of the variance explained by the
first two factors
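A short R sketch of this bookkeeping, assuming `L_unrot` and `L_rot` are the unrotated and varimax-rotated loading matrices (variables in rows, two factor columns; both names are assumptions):

```r
m <- nrow(L_unrot)               # number of variables = total standardized variability

ss_unrot <- colSums(L_unrot^2)   # sums of squared loadings per factor, unrotated
ss_rot   <- colSums(L_rot^2)     # same quantity after varimax rotation

ss_unrot / m   # proportion of total variability per factor, unrotated
ss_rot / m     # rotation redistributes variability between the factors...
sum(ss_unrot); sum(ss_rot)       # ...while the total variability explained stays the same
```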
Goals of Factor Rotation
• There are three methods for orthogonal rotation, in which the axes are rigidly maintained at
90∘
• The goal when rotating the matrix of factor loadings is to ease interpretability by
simplifying the rows and columns of the loading matrix
• We assume that the columns in a matrix of factor loadings represent the factors,
and that the rows represent the variables
• Simplifying the rows of this matrix would entail maximizing the loading of a
particular variable on one particular factor, and keeping the loadings for this variable
on the other factors as low as possible (ideal: row of zeroes and ones)
• Similarly, simplifying the columns of this matrix would entail maximizing the loading
of a particular factor on one particular variable, and keeping the loadings for this
factor on the other variables as low as possible (ideal: column of zeroes and ones)
• Three types of Rotation
• Quartimax Rotation
• Varimax Rotation
• Equimax Rotation
Types of Factor Rotation
• Quartimax Rotation seeks to simplify the rows of a matrix of factor loadings. It
tends to rotate the axes so that the variables have high loadings for the first factor,
and low loadings thereafter. The difficulty is that it can generate a strong “general”
first factor, in which many variables have high loadings
• Varimax Rotation prefers to simplify the columns of the factor loading matrix. It
maximizes the variability in the loadings for the factors, with the goal of working
toward the ideal column of zeroes and ones for each variable. The rationale for
varimax rotation is that we can best interpret the factors when they are strongly
associated with some variables and strongly not associated with other variables
• Researchers have shown that the varimax solution is more invariant than the quartimax solution
• Equimax Rotation seeks to compromise between simplifying the columns and the
rows
• The researcher may prefer to avoid the requirement that the rotated factors remain
orthogonal (independent)
• In this case, oblique rotation methods are available, in which the factors may be
correlated with each other
• This rotation method is called oblique because the axes are no longer required to be
at 90∘, but may form an oblique angle
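Base R provides varimax() for orthogonal rotation and promax() for an oblique alternative; a sketch applied to a hypothetical unrotated loading matrix `L_unrot` (an assumed name):

```r
# Orthogonal rotation: the factors remain uncorrelated
vm <- varimax(L_unrot)
vm$loadings   # rotated loadings
vm$rotmat     # the orthogonal transformation matrix T

# Oblique rotation: the factors are allowed to correlate with each other
pm <- promax(L_unrot)
pm$loadings
```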