Dimensionality Reduction - PCA

Dimensionality Reduction with Principal Component Analysis

Working directly with high-dimensional data, such as images, comes with some difficulties: It is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage of the data vectors can be expensive. However, high-dimensional data often has properties that we can exploit. For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an intrinsic lower-dimensional structure. Dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to JPEG or MP3, which are compression algorithms for images and music.
There are two complementary families of techniques: feature selection methods that choose a subset of important features, pruning the rest, and feature extraction methods that form fewer, new features from the original inputs.

Ideally, we should not need feature selection or extraction as a separate process; the classifier (or regressor) should be able to use whichever features are necessary, discarding the irrelevant. However, there are several reasons why we are interested in reducing dimensionality as a separate preprocessing step:
In most learning algorithms, the complexity depends on the number of input dimensions, $d$, as well as on the size of the data sample, $N$, and for reduced memory and computation, we are interested in reducing the dimensionality of the problem. Decreasing $d$ also decreases the complexity of the inference algorithm during testing.

When an input is decided to be unnecessary, we save the cost of extracting it.

Simpler models are more robust on small datasets: they have less variance, that is, they vary less depending on the particulars of a sample, including noise, outliers, and so forth.

When data can be explained with fewer features, we get a better idea about the process that underlies the data and this allows knowledge extraction.

When data can be represented in a few dimensions without loss of information, it can be plotted and analyzed visually for structure and outliers.

Feature Selection and Feature Extraction
There are two main methods for reducing dimensionality: feature selection and feature extraction. In feature selection, we are interested in finding $k$ of the $d$ dimensions that give us the most information and we discard the other $(d - k)$ dimensions. We are going to discuss subset selection as a feature selection method.
In feature extraction, we are interested in finding a new set of $k$ dimensions that are combinations of the original $d$ dimensions. These methods may be supervised or unsupervised depending on whether or not they use the output information. The best known and most widely used feature extraction methods are Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and supervised respectively. PCA bears much similarity to two other unsupervised linear projection methods, namely, Factor Analysis (FA) and Multidimensional Scaling (MDS). Examples of nonlinear dimensionality reduction techniques are Isometric feature mapping (Isomap) and Locally Linear Embedding (LLE).

Subset Selection (Feature Selection)

In subset selection, we are interested in finding the best subset of the set of features. The best subset contains the least number of dimensions that most contribute to accuracy. We discard the remaining, unimportant dimensions. Using a suitable error function, this can be used in both regression and classification problems. There are $2^d$ possible subsets of $d$ variables, but we cannot test all of them unless $d$ is small; instead, we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.

There are two approaches. In forward selection, we start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only slightly).

In backward selection, we start with all variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. In either case, checking the error should be done on a validation set distinct from the training set because we want to test the generalization accuracy. With more features, generally we have lower training error, but not necessarily lower validation error.
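To make the procedure concrete, here is a minimal forward-selection sketch in Python (our own illustration, not code from the notes); it assumes a scikit-learn style classifier and a separate validation split, and the names `forward_selection` and `make_model` are hypothetical:

```python
# Forward selection sketch: greedily add the feature that lowers the
# validation error the most, and stop when no addition improves it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def forward_selection(X_train, y_train, X_val, y_val,
                      make_model=lambda: LogisticRegression(max_iter=1000)):
    d = X_train.shape[1]
    selected = []              # indices of the chosen features
    best_err = np.inf
    while len(selected) < d:
        errs = {}
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            model = make_model().fit(X_train[:, cols], y_train)
            errs[j] = 1.0 - model.score(X_val[:, cols], y_val)  # validation error
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:   # no further improvement: stop
            break
        best_err = errs[j_best]
        selected.append(j_best)
    return selected, best_err
```

Backward selection is the mirror image of this loop: start with all $d$ features and repeatedly drop the one whose removal hurts the validation error the least.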

In an application like face recognition, feature selection is not a good method for dimensionality reduction because individual pixels by themselves do not carry much discriminative information; it is the combination of values of several pixels together that carries information about the face identity. This is achieved by the feature extraction methods that we discuss next.

Principal Component Analysis (PCA)

PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression and data visualization. It is also used for the identification of simple patterns, latent factors, and structures of high-dimensional data.

Dimensionality reduction generally exploits a property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace. The figure below gives an illustrative example in two dimensions. Although the data in Figure (a) does not quite lie on a line, the data does not vary much in the $x_2$-direction, so that we can express it as if it were on a line, with nearly no loss; see Figure (b). To describe the data in Figure (b), only the $x_1$-coordinate is required, and the data lies in a one-dimensional subspace of $R^2$.


This gave an example of how a two-dimensional dataset can be represented using a single coordinate. In Figure (b), we chose to ignore the $x_2$-coordinate of the data because it did not add much information, so that the compressed data is similar to the original data in Figure (a). We could have chosen to ignore the $x_1$-coordinate instead, but then the compressed data would have been very dissimilar from the original data, and much of the information in the data would have been lost.
If we interpret information content in the data as how “space filling” the dataset is, then we can describe the information contained in the data by looking at the spread of the data. We know that the variance is an indicator of the spread of the data, and we can derive PCA as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data to retain as much information as possible. The following figure illustrates this.



In projection methods, we are interested in finding a mapping from the inputs in the original $d$-dimensional space to a new ($k < d$)-dimensional space, with minimum loss of information. The projection of $x$ on the direction of $w$ is
$ z = w^Tx$
Principal components analysis (PCA) is an unsupervised method in that it does not use the output information; the criterion to be maximized is the variance. The principal component is $w_1$ such that the sample, after projection onto $w_1$, is most spread out so that the difference between the sample points becomes most apparent. For a unique solution and to make the direction the important factor, we require $||w_1||=1$. We know that if $z_1 = w_1^Tx$ with $Cov(x) = \Sigma$, then

$Var(z_1) = w_1^T\Sigma w_1$
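For completeness, this identity follows in one line from the definition of variance (a standard derivation, not spelled out in the notes):
$Var(z_1) = E[(w_1^Tx - w_1^T\mu)^2] = E[w_1^T(x - \mu)(x - \mu)^Tw_1] = w_1^TE[(x - \mu)(x - \mu)^T]w_1 = w_1^T\Sigma w_1$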

We seek $w_1$ such that $Var(z_1)$ is maximized subject to the constraint that $w_1^Tw_1 = 1.$ Writing this as a Lagrange problem, we have
$\max_{w_1} w_1^T\Sigma w_1 - \alpha(w_1^Tw_1 - 1)$

Taking the derivative with respect to $w_1$ and setting it equal to 0, we have
$2\Sigma w_1 - 2\alpha w_1 = 0$, and therefore $\Sigma w_1 = \alpha w_1$

which holds if $w_1$ is an eigenvector of $\Sigma$ and $\alpha$ is the corresponding eigenvalue.
Because we want to maximize
$w_1^T\Sigma w_1 = \alpha w_1^Tw_1=\alpha$

we choose the eigenvector with the largest eigenvalue so that the variance is maximized. Therefore, the principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue, $\lambda_1 = \alpha$, and the variance of the projected data is
$Var(z_1) = w_1^T\Sigma w_1 = \lambda_1$

That is, the variance of the data projected onto a one-dimensional subspace equals the eigenvalue associated with the basis vector $w_1$ that spans this subspace. Therefore, to maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue (the principal component) of the data covariance matrix.
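As a quick numerical sanity check (our own illustration, not part of the notes), the NumPy snippet below confirms that projecting onto the top eigenvector of the sample covariance matrix gives a projected variance equal to the largest eigenvalue, and that no other unit direction does better:

```python
# Check: Var(z1) equals the largest eigenvalue, and it upper-bounds the
# variance of the projection onto any other unit direction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500)

S = np.cov(X, rowvar=False)            # sample covariance (estimate of Sigma)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending eigenvalues for symmetric S
w1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

z1 = (X - X.mean(axis=0)) @ w1         # z1 = w1^T (x - m)
print(np.var(z1, ddof=1), eigvals[-1]) # the two numbers agree (up to rounding)

for _ in range(1000):                  # random unit directions never exceed lambda_1
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    assert np.var((X - X.mean(axis=0)) @ w, ddof=1) <= eigvals[-1] + 1e-9
```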

The second principal component, $w_2$, should also maximize variance, be of unit length, and be orthogonal to $w_1$. This latter requirement is so that after projection $z_2 = w_2^Tx$ is uncorrelated with $z_1$. So $w_2$ should be the eigenvector of $\Sigma$ with the second largest eigenvalue, $\lambda_2 = \alpha$. Similarly, we can show that the other dimensions are given by the eigenvectors with decreasing eigenvalues.

Because $\Sigma$ is symmetric, for two different eigenvalues, the eigenvectors are orthogonal. If $\Sigma$ is positive definite ($x^T\Sigma x > 0$ for all nonnull $x$), then all its eigenvalues are positive. If $\Sigma$ is singular, then its rank, the effective dimensionality, is $k$ with $k < d$, and $\lambda_i = 0$ for $i = k + 1, \ldots, d$ (the $\lambda_i$ are sorted in descending order). The $k$ eigenvectors with nonzero eigenvalues are the dimensions of the reduced space. The first eigenvector (the one with the largest eigenvalue), $w_1$, namely the principal component, explains the largest part of the variance; the second explains the second largest; and so on.
We define
$z = W^T (x - m)$
where the $k$ columns of $W$ are the $k$ leading eigenvectors of $S$, the estimator of $\Sigma$. We subtract the sample mean $m$ from $x$ before projection to center the data on the origin. After this linear transformation, we get to a $k$-dimensional space whose dimensions are the eigenvectors, and the variances over these new dimensions are equal to the eigenvalues (see figure below). To normalize variances, we can divide by the square roots of the eigenvalues.
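A minimal NumPy sketch of this procedure (our own illustration; the function name `pca` and the `whiten` flag are not from the notes): center the data, eigendecompose the sample covariance $S$, keep the $k$ leading eigenvectors as the columns of $W$, and project $z = W^T(x - m)$:

```python
# PCA via eigendecomposition of the sample covariance matrix.
import numpy as np

def pca(X, k, whiten=False):
    m = X.mean(axis=0)                    # sample mean
    S = np.cov(X, rowvar=False)           # estimator of the covariance Sigma
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric, so use eigh
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                    # k leading eigenvectors as columns
    Z = (X - m) @ W                       # z = W^T (x - m) for every sample
    if whiten:
        Z = Z / np.sqrt(eigvals[:k])      # normalize variances to 1
    return Z, W, eigvals

# Example: project 5-dimensional data onto the first two principal components.
X = np.random.default_rng(1).normal(size=(200, 5))
Z, W, eigvals = pca(X, k=2)
print(Z.shape, np.var(Z, axis=0, ddof=1), eigvals[:2])  # variances ≈ top eigenvalues
```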

Let us see an example to get some intuition (Rencher 1995): Assume we are given a class of students with grades on five courses and we want to order these students. That is, we want to project the data onto one dimension, such that the difference between the data points becomes most apparent. We can use PCA. The eigenvector with the highest eigenvalue is the direction that has the highest variance, that is, the direction on which the students are most spread out. This works better than taking the average because we take into account correlations and differences in variances.


If we form a $(d \times d)$ matrix $C$ whose $i$th column is the normalized eigenvector $c_i$ of $S$, then $C^TC = I$ and

$SC=CD$
$S=CDC^T$
where $D$ is a diagonal matrix whose diagonal elements are the eigenvalues, $\lambda_1, \ldots , \lambda_d$. This is called the spectral decomposition of $S$. Since $C$ is orthogonal and $CC^T = C^TC = I$, we can multiply on the left by $C^T$ and on the right by $C$ to obtain
$ C^T SC = D$
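These identities are easy to verify numerically; the snippet below (an illustrative check, not from the notes) builds $C$ and $D$ from a small random sample and confirms $C^TC = I$, $SC = CD$, $S = CDC^T$, and $C^TSC = D$:

```python
# Numerical check of the spectral decomposition of a sample covariance matrix.
import numpy as np

X = np.random.default_rng(2).normal(size=(100, 4))
S = np.cov(X, rowvar=False)

eigvals, C = np.linalg.eigh(S)   # columns of C are the normalized eigenvectors
D = np.diag(eigvals)             # diagonal matrix of eigenvalues

print(np.allclose(C.T @ C, np.eye(4)))  # C^T C = I
print(np.allclose(S @ C, C @ D))        # S C = C D
print(np.allclose(C @ D @ C.T, S))      # S = C D C^T
print(np.allclose(C.T @ S @ C, D))      # C^T S C = D
```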

PCA is unsupervised and does not use output information. It is a one-group procedure. However, in the case of classification, there are multiple groups. The Karhunen-Loève expansion allows using class information; for example, instead of using the covariance matrix of the whole sample, we can estimate separate class covariance matrices, take their average (weighted by the priors) as the covariance matrix, and use its eigenvectors.
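A hedged sketch of this class-aware variant (the function name and prior weighting below are our reading of the description, not code from the notes): estimate one covariance matrix per class, average them weighted by the class priors, and project onto the leading eigenvectors of the pooled matrix:

```python
# Class-aware PCA sketch: eigenvectors of the prior-weighted average of the
# per-class covariance matrices (each class is assumed to have >= 2 samples).
import numpy as np

def pooled_covariance_pca(X, y, k):
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    d = X.shape[1]
    S_pooled = np.zeros((d, d))
    for c, p in zip(classes, priors):
        S_pooled += p * np.cov(X[y == c], rowvar=False)  # prior-weighted class covariance
    eigvals, eigvecs = np.linalg.eigh(S_pooled)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]        # k leading eigenvectors
    return (X - X.mean(axis=0)) @ W                      # project the centered data
```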
In common principal components (Flury 1988), we assume that the  principal components are the same for each class whereas the variances of these components differ for different classes:
$S_i = CD_iC^T$
This allows pooling data and is a regularization method whose complexity is less than that of a common covariance matrix for all classes, while still allowing flexible differentiation of $S_i$. A related approach is flexible discriminant analysis (Hastie, Tibshirani, and Buja 1994), which does a linear projection to a lower-dimensional space where all features are uncorrelated and then uses a minimum distance classifier.
