Principal Components Analysis

Richard Arnold
9 May 2019

Summarising data

Consider a simple data set of \( n=100 \) observations of a numerical variable: petal length of irises (cm)


From: https://www.datacamp.com/community/tutorials/machine-learning-in-r

 [1] 4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5
[18] 4.1 4.5 3.9

… etc … (100 observations)

Summarising data - single variable

[Figure: plot of the 100 petal length observations]

Mean and standard deviation \[ \bar{x} = 4.906 \qquad \text{and} \qquad s = 0.826 \]
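
A minimal R sketch of these summaries, assuming the data are the 100 non-setosa rows of the built-in iris dataset (the subset used throughout these slides):

# The 100 petal lengths (setosa excluded)
x <- iris$Petal.Length[iris$Species != "setosa"]

mean(x)   # approximately 4.906
sd(x)     # approximately 0.826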

Summarising data - multiple variables

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         7.0          3.2           4.7          1.4  versicolor
         6.4          3.2           4.5          1.5  versicolor
         6.9          3.1           4.9          1.5  versicolor
         5.5          2.3           4.0          1.3  versicolor
         6.5          2.8           4.6          1.5  versicolor

… etc … (100 rows, 4 numerical variables + species)

We expect a lot of correlation:

  • larger flowers have larger petals and sepals
  • large petal width goes with a large petal length

But we also expect some variation.

Iris Data

[Figure: pairwise scatterplots of the four iris measurements]

High correlations between some pairs of variables.

Numerical measures of correlation

Covariance matrix

\[ V_{k\ell} = \frac{1}{n-1}\sum_{i=1}^n (X_{ik}-\bar{X}_k)(X_{i\ell}-\bar{X}_\ell) \]

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.4393495  0.12215758    0.4533616  0.16715960
Sepal.Width     0.1221576  0.11072323    0.1427960  0.08002828
Petal.Length    0.4533616  0.14279596    0.6815798  0.28873131
Petal.Width     0.1671596  0.08002828    0.2887313  0.18042828
  • Diagonal entries are the variances of the variables
  • Off-diagonal entries are the covariances between pairs of variables
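
A sketch of this calculation in R, assuming the same 100-row subset; cov() applies the \( V_{k\ell} \) formula above to every pair of columns:

# The four numeric variables for the 100 non-setosa irises
X <- iris[iris$Species != "setosa", 1:4]

V <- cov(X)   # built-in covariance matrix

# The same matrix computed directly from the definition
Xc <- scale(X, center = TRUE, scale = FALSE)      # subtract the column means
V_manual <- t(Xc) %*% Xc / (nrow(X) - 1)
all.equal(V, V_manual, check.attributes = FALSE)  # TRUE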

Numerical measures of correlation

Correlation matrix

\[ R_{k\ell} = \frac{V_{k\ell}}{\sqrt{V_{kk}V_{\ell\ell}}} \]

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.5538548    0.8284787   0.5937094
Sepal.Width     0.5538548   1.0000000    0.5198023   0.5662025
Petal.Length    0.8284787   0.5198023    1.0000000   0.8233476
Petal.Width     0.5937094   0.5662025    0.8233476   1.0000000
  • Diagonal entries are always +1
  • All other entries lie in \( [-1,+1] \).
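
A corresponding R sketch: divide each covariance by the product of the two standard deviations, which is what cor() and cov2cor() do:

X <- iris[iris$Species != "setosa", 1:4]
V <- cov(X)

# R_kl = V_kl / sqrt(V_kk * V_ll)
R <- V / sqrt(outer(diag(V), diag(V)))

all.equal(R, cor(X))       # TRUE: matches the built-in correlation matrix
all.equal(R, cov2cor(V))   # TRUE: cov2cor() performs the same rescaling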

Iris Data

[Figure: pairwise scatterplots of the four iris measurements]

Iris Data

We're used to doing regression analyses

  • treating one variable as the predictor (\( X= \) Petal Length)
  • one variable as the outcome (\( Y= \) Petal Width)

[Figure: scatterplot of petal width against petal length with the fitted regression line]

\[ \widehat{Y} = -0.402 + 0.424X \]
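
A sketch of this regression in R, assuming the same subset; the fitted coefficients reproduce the intercept and slope quoted above:

d <- iris[iris$Species != "setosa", ]

fit <- lm(Petal.Width ~ Petal.Length, data = d)
coef(fit)   # intercept about -0.402, slope about 0.424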

Summarising data - multiple variables

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         7.0          3.2           4.7          1.4  versicolor
         6.4          3.2           4.5          1.5  versicolor
         6.9          3.1           4.9          1.5  versicolor
         5.5          2.3           4.0          1.3  versicolor
         6.5          2.8           4.6          1.5  versicolor

Here we have more than two variables

  • no clear explanatory variable(s) with an identifiable outcome variable
  • just a set of four correlated variables

Another way to look at the data

[Figure: scatterplot of petal width against petal length]

View this as a cloud of points - try to find transformed versions of the data which describe its most important properties: its location and its spread.

Another way to look at the data

[Figure: scatterplot of petal width against petal length with the mean location marked]

The mean location is \[ \bar{X}_1 = 4.906 \qquad \bar{X}_2 = 1.676 \] where \( X_1 \) is petal length and \( X_2 \) is petal width.

Another way to look at the data

[Figure: the data with the transformed \( (Y_1, Y_2) \) axes overlaid]

Transformed variables \[ \begin{aligned} Y_1 &= 0.9098 (X_1-4.906) + 0.4151 (X_2-1.676)\\ Y_2 &= 0.4151 (X_1-4.906) - 0.9098 (X_2-1.676) \end{aligned} \] We can calculate \( (Y_1,Y_2) \) from any observation \( (X_1,X_2) \)

Another way to look at the data

[Figure]

We can also reconstruct \( (X_1,X_2) \) from \( (Y_1,Y_2) \): \[ \begin{aligned} X_1 &= 4.906 + 0.9098 Y_1 + 0.4151 Y_2\\ X_2 &= 1.676 + 0.4151 Y_1 - 0.9098 Y_2 \end{aligned} \]
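
A sketch of both directions in R, using the rounded coefficients quoted above. The 2 × 2 coefficient matrix is symmetric and (up to rounding) orthogonal, so the same matrix also performs the reconstruction:

d <- iris[iris$Species != "setosa", ]
X <- cbind(d$Petal.Length, d$Petal.Width)

centre <- c(4.906, 1.676)                       # the means quoted above
A <- matrix(c(0.9098,  0.4151,
              0.4151, -0.9098), nrow = 2, byrow = TRUE)

Y <- sweep(X, 2, centre) %*% t(A)               # (Y1, Y2) from (X1, X2)

X_back <- sweep(Y %*% A, 2, centre, FUN = "+")  # (X1, X2) back from (Y1, Y2)
all.equal(X, X_back, tolerance = 1e-3)          # equal up to rounding of the coefficients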

Another way to look at the data

[Figure]

Here \( Y_1 \) is more informative than \( Y_2 \):

  • It explains the greatest amount of variation in the relationship between \( X_1 \) and \( X_2 \)
  • If we had to choose which one of the two we'd rather know, we'd choose \( Y_1 \)
  • \( Y_1 \) alone allows a good guess at both \( X_1 \) and \( X_2 \)

Principal Components Analysis

We have a matrix of data \( X \):

  • \( n \) rows (e.g. \( n=100 \) irises)
  • \( p \) columns (e.g. \( p=4 \) variables)

In Principal Components Analysis we find transformed versions of the data

  • Each Principal Component (PC) is a linear combination of the data \[ \begin{aligned} Y_1 &= a_{10} + a_{11} X_1 + a_{12} X_2 + \ldots + a_{1p} X_p\\ Y_2 &= a_{20} + a_{21} X_1 + a_{22} X_2 + \ldots + a_{2p} X_p\\ \ldots &\\ Y_p &= a_{p0} + a_{p1} X_1 + a_{p2} X_2 + \ldots + a_{pp} X_p\\ \end{aligned} \] The numbers \( a_{10} \), \( a_{11} \) etc can be calculated from the data.

Principal Components Analysis

  • Each Principal Component (PC) explains some of the variation in the data
  • The PCs are ordered so that the first explains the most variation, and the last explains the least
  • There are \( p \) PCs (if \( n\geq p \))
  • Together all of the PCs reconstruct the data exactly
  • If the first few PCs explain most of the variance, then we can make an approximation to the data using just those PCs.

Principal Components Analysis - Iris Data

Form the four PCs for the Iris data

     Constant  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
PC1    -7.943        -0.507       -0.435        -0.544       -0.508
PC2    -0.916        -0.221        0.888        -0.380       -0.134
PC3     3.200         0.693        0.056        -0.019       -0.718
PC4     0.403        -0.462        0.136         0.748       -0.456
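
A sketch of how such a table can be produced in R with prcomp(), assuming the same 100-row subset. Note that the signs of the loadings are arbitrary, so they may differ from the table above:

X <- iris[iris$Species != "setosa", 1:4]

pca <- prcomp(X, center = TRUE, scale. = FALSE)

t(pca$rotation)   # loadings: one row per PC, one column per original variable

# One way to get a constant term for each PC: minus the loadings dotted with
# the column means, so that each PC score has mean zero (sign depends on
# prcomp's sign choice for the loadings)
-drop(t(pca$rotation) %*% pca$center)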

Principal Components Analysis - Iris Data

Scree Plot: shows the variance explained by each PC

[Figure: scree plot for the iris PCA]

  • The first PC explains most of the variance
  • The first three PCs explain almost all of it (see the sketch below)
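
A sketch of how these proportions can be computed in R, using the same pca object as in the previous sketch:

X <- iris[iris$Species != "setosa", 1:4]
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of the total variance explained by each PC
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)
round(cumsum(prop_var), 3)       # cumulative proportion

screeplot(pca, type = "lines")   # the scree plot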

Reconstructing the Iris Data

[Figure]

Reconstructing the Iris Data

[Figure]

Reconstructing the Iris Data

[Figure]

Reconstructing the Iris Data

[Figure]
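
A sketch of how such reconstructions can be computed, assuming the pca object from the earlier sketch: keep only the first k columns of the scores and loadings, then add the column means back:

X <- iris[iris$Species != "setosa", 1:4]
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Approximate the data using only the first k principal components
reconstruct <- function(pca, k) {
  approx <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
  sweep(approx, 2, pca$center, FUN = "+")   # add the column means back
}

X1 <- reconstruct(pca, 1)                              # PC1 only
X4 <- reconstruct(pca, 4)                              # all four PCs
all.equal(as.matrix(X), X4, check.attributes = FALSE)  # TRUE: exact with all PCs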

Another example - images of faces

  • 40 people with 10 images each - total of 400 images
  • Each image is 112 \( \times \) 92 pixels (10304 pixels in all)
  • Each image is grey scale (pixel values \( \in [0,1] \))

[Figure: example face images from the dataset]

Data from: https://medium.com/@sebastiannorena/pca-principal-components-analysis-applied-to-images-of-faces-d2fc2c083371

Images of faces

[Figure: a selection of the face images]

Images of faces - PCA

  • We can run a Principal Components Analysis on these data
  • Data set is \( X \): 400 rows \( \times \) 10304 columns
  • There are only \( n=400 \) PCs (since \( n < p \))
  • The “average location” is the mean of all the images (see below)

[Figure: the mean (average) face image]
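
A sketch of this step in R, assuming the images have already been read into a hypothetical 400 × 10304 matrix called faces (one flattened 112 × 92 image per row, pixel values in [0, 1]):

# 'faces' is a hypothetical 400 x 10304 matrix: one image per row
pca <- prcomp(faces, center = TRUE, scale. = FALSE)

dim(pca$rotation)   # 10304 x 400: only n = 400 PCs, since n < p

# The "average location" is the vector of 10304 pixel means; reshape it to
# 112 x 92 to view it as an image (orientation depends on how the pixels
# were flattened when the images were read in)
mean_face <- matrix(pca$center, nrow = 112, ncol = 92)
image(mean_face, col = grey.colors(256))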

Images of faces - PCA

[Figure: the first 32 PCs displayed as images]

Images of faces - Screeplots

[Figure: scree plot for the face images]

About 50 PCs are needed to explain 80% of the variance.

Reconstructing an image

[Figure]

Reconstructing an image

[Figure]

Reconstructing an image

[Figure]

  • Storing a library of PCs means any face can be represented by just 50 weights (see the sketch below)
  • \( \Rightarrow \) compression of the data
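
A sketch of the compression idea, again using the hypothetical faces matrix: one shared library (the mean image and the first 50 PCs), plus 50 weights per image:

pca <- prcomp(faces, center = TRUE, scale. = FALSE)   # as in the previous sketch
k <- 50

pc_library <- pca$rotation[, 1:k]   # shared library of PCs: 10304 x 50
centre     <- pca$center            # shared mean image (10304 pixel means)

# Any one face (here the first image) is summarised by k = 50 weights ...
w <- drop(crossprod(pc_library, faces[1, ] - centre))   # same as pca$x[1, 1:k]

# ... and approximately reconstructed from them
face_hat <- centre + drop(pc_library %*% w)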

Summary

  • PCA summarises the key features of multivariate data \( X \)
  • It allows a reduced representation of each data item:
    • store only a library of PCs
    • and, for each image, a set of weights, one for each PC

Next Lecture

  • The theory behind PCA
  • How to carry out a PCA in practice
  • Variations on PCA - centering and scaling
  • Interpretation of PCs