Packet 15

Application II: Principal component analysis

Raw data points are row vectors. Each entry in the row corresponds to a variable. For example $\mathbf{a}_1 = (\text{height}\ \ \text{weight}\ \ \text{age}\ \ \text{BP}\ \ \text{HR})$.

Defining PCA

For applications to dimension reduction, this row vector could be very large: $1{,}000 \times 1{,}000 = 1$M pixel grayscale values: a single row with 1M entries.

Data matrix
$$A_0 = \begin{pmatrix} -\ \mathbf{a}_1\ - \\ -\ \mathbf{a}_2\ - \\ \vdots \\ -\ \mathbf{a}_n\ - \end{pmatrix}$$
has row vectors that are data points: each row is another data point. So $A_0$ is an $n \times m$ matrix when the data lives in $\mathbb{R}^m$ and we have $n$ samples.

Sometimes $A$, sometimes $A^{\mathsf T}$

Some authors (e.g. Wikipedia and this Packet) put data points as row vectors. Others (Lay et al., Strang) put data points as column vectors. Pay attention.

Convert raw data $A_0$ to mean-deviation form:

$$A = A_0 - \begin{pmatrix} \mu_1 & \mu_2 & \cdots & \mu_m \\ \mu_1 & \mu_2 & \cdots & \mu_m \\ \vdots & \vdots & & \vdots \\ \mu_1 & \mu_2 & \cdots & \mu_m \end{pmatrix}$$

where $\mu_i$ is the average value of the entries in column $i$ of $A_0$. The entries of $A$ record the displacements of the entries of $A_0$ from the mean across all samples (per vector component of the data vectors).
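As a quick sketch of this centering step in NumPy (the data values below are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: n = 4 samples (rows), m = 3 variables (columns).
A0 = np.array([[1.0, 2.0, 3.0],
               [3.0, 2.0, 1.0],
               [2.0, 4.0, 6.0],
               [2.0, 0.0, 2.0]])

mu = A0.mean(axis=0)   # column means (mu_1, ..., mu_m)
A = A0 - mu            # broadcasting subtracts the mean row from every row

# Each column of A now has mean zero.
print(A.mean(axis=0))
```

Broadcasting replicates the row of means down the matrix, which is exactly the subtraction written above.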

Compute the sample covariance matrix:

S=1nโˆ’1A๐–ณA.

This is a symmetric $m \times m$ matrix. (Sized by the number of variables, not the number of samples.) The matrix $S$ is large when the row vectors are long, as for pixel data of images.

The division by $n-1$ comes from the rule for calculating the sample covariance. (For more details, take a course in probability or statistics.)

The $(i,j)$-entry of $A^{\mathsf T} A$ is the dot product of columns $i$ and $j$ of $A$. This dot product computes the covariance of variable $i$ and variable $j$ across the samples. The scale of this number depends on the number of samples and the relative scale of the typical variable values. After dividing by $n-1$, it depends only on the typical variable values.
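A numerical check of this formula against NumPy's built-in covariance routine, using a hypothetical toy matrix already in mean-deviation form:

```python
import numpy as np

# Hypothetical mean-centered data: n = 4 samples, m = 3 variables.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0,  0.0, -2.0],
              [ 0.0,  2.0,  3.0],
              [ 0.0, -2.0, -1.0]])
n = A.shape[0]

# Sample covariance matrix: m x m, sized by the number of variables.
S = A.T @ A / (n - 1)

# np.cov treats rows as variables by default; rowvar=False matches our
# convention of data points as rows. It also divides by n - 1.
print(np.allclose(S, np.cov(A, rowvar=False)))
```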

The matrix $S$ is symmetric, so the spectral theorem applies. The eigenvalues of $S$ in order, $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots$, are the principal variances of the data. The corresponding basis of (orthonormal) eigenvectors of $S$, namely $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots$, are the principal components of the data. The square roots $\sigma_1 = \sqrt{\lambda_1},\ \sigma_2 = \sqrt{\lambda_2},\ \sigma_3 = \sqrt{\lambda_3}, \ldots$ may be called the principal deviations, although that term is not commonly used. In our notation, where data vectors are rows of $A$, it is best to transpose the principal components into row vectors $\mathbf{v}_1^{\mathsf T}, \mathbf{v}_2^{\mathsf T}, \mathbf{v}_3^{\mathsf T}, \ldots$ that correspond to the format of the data vectors.
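Since $S$ is symmetric, `np.linalg.eigh` applies; this sketch extracts the principal variances, components, and deviations from the same hypothetical toy data as above:

```python
import numpy as np

# Hypothetical mean-centered data: n = 4 samples, m = 3 variables.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0,  0.0, -2.0],
              [ 0.0,  2.0,  3.0],
              [ 0.0, -2.0, -1.0]])
n = A.shape[0]
S = A.T @ A / (n - 1)

# eigh is for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)

order = np.argsort(eigvals)[::-1]          # re-sort descending
principal_variances = eigvals[order]       # lambda_1 >= lambda_2 >= ...
principal_components = eigvecs[:, order]   # columns are v_1, v_2, ...
principal_deviations = np.sqrt(principal_variances)
```

Note `eigh` orders eigenvalues ascending, the opposite of the PCA convention, hence the re-sort.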

Discussion

The SVD of $A$ is implicit in the above. PCA is almost directly the SVD, but there are two formatting changes to watch out for.

  • A๐–ณ may be used instead of A
  • Factor of 1nโˆ’1 appearing on the symmetric matrix S

PCA 'basically' from SVD

PCA is derived from the SVD of $\frac{1}{\sqrt{n-1}}A$ (this Packet) or from $\frac{1}{\sqrt{n-1}}A^{\mathsf T}$ (some authors).

If the data vectors are rows of $A$ and the variables are columns, then the right singular vectors of $\frac{1}{\sqrt{n-1}}A$ form the principal components, and the squares of the singular values of this matrix form the principal variances.

If we already have the SVD of the unscaled matrix, namely $A = U\Sigma V^{\mathsf T}$, then $\sigma_i^2/(n-1)$ are the principal variances, and the $\mathbf{v}_i$ are the principal components.
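This correspondence between the SVD of $A$ and the eigendecomposition of $S$ can be checked numerically; a sketch with the same hypothetical centered matrix as above:

```python
import numpy as np

# Hypothetical mean-centered data: n = 4 samples, m = 3 variables.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0,  0.0, -2.0],
              [ 0.0,  2.0,  3.0],
              [ 0.0, -2.0, -1.0]])
n = A.shape[0]

# SVD of the unscaled A; singular values come back in descending order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
variances_from_svd = sigma**2 / (n - 1)

# Eigenvalues of S, sorted descending, should match sigma_i^2 / (n - 1).
S = A.T @ A / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
print(np.allclose(variances_from_svd, eigvals))
```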

For authors who write their data points as column vectors in the data matrix, the SVD of $\frac{1}{\sqrt{n-1}}A^{\mathsf T}$ gives the PCA data: the squares of the singular values are the principal variances, and the left singular vectors are the principal components.

Some further terminology.

Statisticians like the concept of a variable, and this concept meshes with probability theory as well. The variables are represented by the entries of the rows $\mathbf{a}_i$ of the data matrix $A_0$. The entries of $\mathbf{v}_i^{\mathsf T}$ are also values of variables.

The vector ๐ฏ1๐–ณ is a unit vector pointing in the direction (in the data vector space) of the greatest variation in the data. The vector ๐ฏ2๐–ณ is a unit vector pointing in the direction (in the data vector space) of the greatest variation in the data from those directions that are orthogonal to ๐ฏ1๐–ณ.

Suppose we take a variable (indeterminate) input vector $\mathbf{X} = (x_1\ x_2\ \cdots\ x_m)$ in the data space. Suppose $\mathbf{v}_1^{\mathsf T} = (c_1\ c_2\ \cdots\ c_m)$. Then the dot product

$$y_1 = c_1 x_1 + c_2 x_2 + \cdots + c_m x_m$$

gives a new variable that corresponds to the first principal component. Variables for the remaining principal components can be given as well. Collecting all variables of the principal components:

$$\mathbf{Y} = \mathbf{X}V.$$
  • Variance of ๐˜=๐—๐ฏ where ๐— ranges across the data is as large as possible when ๐ฏ=๐ฏ1 the first principal component.
  • Variance capture percentage: ฯƒi2/(ฯƒ12+โ‹ฏ+ฯƒm2)=ฯƒi2/tr(S).
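Both facts can be seen numerically; the sketch below (same hypothetical toy data as before) projects the centered data onto the principal components and computes the capture percentages:

```python
import numpy as np

# Hypothetical mean-centered data: n = 4 samples, m = 3 variables.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0,  0.0, -2.0],
              [ 0.0,  2.0,  3.0],
              [ 0.0, -2.0, -1.0]])
n = A.shape[0]

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Y = X V: each row of A re-expressed in the principal-component variables.
Y = A @ Vt.T

# Column variances of Y are the principal variances sigma_i^2 / (n - 1),
# and the first column has the largest variance.
variances = Y.var(axis=0, ddof=1)

# Variance capture percentage for each component.
capture = variances / variances.sum()
print(capture)
```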

Applications of PCA
