Richard Arnold
9 May 2019
Consider a simple data set of \( n=100 \) observations of a numerical variable: petal length of irises (cm)
[1] 4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5
[18] 4.1 4.5 3.9
… etc … (100 observations)
Mean and standard deviation \[ \bar{x} = 4.906 \qquad \text{and} \qquad s = 0.826 \]
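These summaries can be reproduced numerically. A minimal sketch using only the first 20 printed values (the full 100-value vector would give \( \bar{x} = 4.906 \) and \( s = 0.826 \)); note `ddof=1` for the sample standard deviation:

```python
import numpy as np

# First 20 of the 100 petal lengths printed above; the full vector
# would reproduce xbar = 4.906 and s = 0.826.
x = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9,
              3.5, 4.2, 4.0, 4.7, 3.6, 4.4, 4.5, 4.1, 4.5, 3.9])

xbar = x.mean()
s = x.std(ddof=1)   # ddof=1 gives the (n-1)-denominator sample sd
print(xbar, s)
```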
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
6.4 | 3.2 | 4.5 | 1.5 | versicolor |
6.9 | 3.1 | 4.9 | 1.5 | versicolor |
5.5 | 2.3 | 4.0 | 1.3 | versicolor |
6.5 | 2.8 | 4.6 | 1.5 | versicolor |
… etc … (100 rows, 4 numerical variables + species)
We expect a lot of correlation, with high correlations between some pairs of variables, but we also expect some variation.
Covariance matrix
\[ V_{k\ell} = \frac{1}{n-1}\sum_{i=1}^n (X_{ik}-\bar{X}_k)(X_{i\ell}-\bar{X}_\ell) \]
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.4393495 0.12215758 0.4533616 0.16715960
Sepal.Width 0.1221576 0.11072323 0.1427960 0.08002828
Petal.Length 0.4533616 0.14279596 0.6815798 0.28873131
Petal.Width 0.1671596 0.08002828 0.2887313 0.18042828
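The formula for \( V_{k\ell} \) can be checked directly. A sketch on a tiny made-up \( 3\times 2 \) matrix (not the iris data), comparing an explicit double loop against `np.cov`:

```python
import numpy as np

# Small made-up data matrix (3 observations, 2 variables),
# purely to check the formula; not the iris measurements.
X = np.array([[1.0, 2.0],
              [2.0, 4.5],
              [4.0, 5.5]])
n, p = X.shape
Xbar = X.mean(axis=0)

# V_{kl} = 1/(n-1) * sum_i (X_ik - Xbar_k)(X_il - Xbar_l)
V = np.zeros((p, p))
for k in range(p):
    for l in range(p):
        V[k, l] = ((X[:, k] - Xbar[k]) * (X[:, l] - Xbar[l])).sum() / (n - 1)

print(np.round(V, 3))
```

The loop agrees with `np.cov(X, rowvar=False)`, which uses the same \( n-1 \) denominator by default.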
Correlation matrix
\[ R_{k\ell} = \frac{V_{k\ell}}{\sqrt{V_{kk}V_{\ell\ell}}} \]
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.5538548 0.8284787 0.5937094
Sepal.Width 0.5538548 1.0000000 0.5198023 0.5662025
Petal.Length 0.8284787 0.5198023 1.0000000 0.8233476
Petal.Width 0.5937094 0.5662025 0.8233476 1.0000000
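The correlation matrix follows from the covariance matrix by the scaling formula above. A sketch applying it to the covariance matrix quoted earlier:

```python
import numpy as np

# Covariance matrix V from above (order: Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width).
V = np.array([
    [0.4393495, 0.12215758, 0.4533616, 0.16715960],
    [0.1221576, 0.11072323, 0.1427960, 0.08002828],
    [0.4533616, 0.14279596, 0.6815798, 0.28873131],
    [0.1671596, 0.08002828, 0.2887313, 0.18042828],
])

# R_{kl} = V_{kl} / sqrt(V_kk * V_ll)
d = np.sqrt(np.diag(V))
R = V / np.outer(d, d)
print(np.round(R, 7))
```

The result matches the correlation matrix above, with ones on the diagonal by construction.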
We're used to doing regression analyses of one variable on another, e.g. regressing Petal.Width (\( Y \)) on Petal.Length (\( X \)):
\[ \widehat{Y} = -0.402 + 0.424X \]
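This line can be recovered from the summaries already computed, assuming (consistently with the covariance matrix and the means quoted later) that it regresses Petal.Width on Petal.Length: the least-squares slope is \( \mathrm{Cov}(X,Y)/\mathrm{Var}(X) \) and the intercept is \( \bar{Y} - \text{slope}\cdot\bar{X} \):

```python
# From the covariance matrix above: Var(Petal.Length) and
# Cov(Petal.Length, Petal.Width); means as quoted in these notes.
v_xx = 0.6815798           # Var(Petal.Length)
v_xy = 0.2887313           # Cov(Petal.Length, Petal.Width)
xbar, ybar = 4.906, 1.676  # means of Petal.Length, Petal.Width

slope = v_xy / v_xx             # least-squares slope = Cov(X,Y)/Var(X)
intercept = ybar - slope * xbar
print(round(intercept, 3), round(slope, 3))  # -0.402 0.424
```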
Here we have more than two variables. View the data as a cloud of points: try to find transformed versions of the data which describe its most important properties, its location and its spread.
The mean location (here of \( X_1 = \) Petal.Length and \( X_2 = \) Petal.Width) is \[ \bar{X}_1 = 4.906 \qquad \bar{X}_2 = 1.676 \]
Transformed variables \[ \begin{aligned} Y_1 &= 0.9098 (X_1-4.906) + 0.4151 (X_2-1.676)\\ Y_2 &= 0.4151 (X_1-4.906) - 0.9098 (X_2-1.676) \end{aligned} \] We can calculate \( (Y_1,Y_2) \) from any observation \( (X_1,X_2) \)
We can also reconstruct \( (X_1,X_2) \) from \( (Y_1,Y_2) \): \[ \begin{aligned} X_1 &= 4.906 + 0.9098 Y_1 + 0.4151 Y_2\\ X_2 &= 1.676 + 0.4151 Y_1 - 0.9098 Y_2 \end{aligned} \]
Here \( Y_1 \) is more informative than \( Y_2 \): it captures almost all of the spread of the point cloud, while \( Y_2 \) varies comparatively little.
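The rotation above can be checked numerically. A sketch, assuming the \( 2\times 2 \) covariance of (Petal.Length, Petal.Width) from the covariance matrix quoted earlier: the rotation matrix is symmetric and orthogonal (so it is its own inverse, which is why the reconstruction works), and the variances of \( Y_1, Y_2 \) come out of \( A V A^\top \):

```python
import numpy as np

# 2x2 covariance of (Petal.Length, Petal.Width), from the matrix above
V = np.array([[0.6815798, 0.2887313],
              [0.2887313, 0.1804283]])

# Rotation used in the notes: rows are the directions defining Y1, Y2
A = np.array([[0.9098,  0.4151],
              [0.4151, -0.9098]])

# A is symmetric and orthogonal, so A @ A ~ I:
# X = mean + A @ Y inverts Y = A @ (X - mean)
print(np.round(A @ A, 3))

# Variances of the transformed variables: diagonal of A V A^T
W = A @ V @ A.T
print(np.round(np.diag(W), 3))  # Y1 carries almost all the variance
```

The off-diagonal of \( A V A^\top \) is essentially zero: \( Y_1 \) and \( Y_2 \) are uncorrelated.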
We have a matrix of data \( X \). In Principal Components Analysis we find transformed versions of the data that are uncorrelated with each other and ordered by decreasing variance.
Form the four PCs for the Iris data
     Sepal.Length Sepal.Width Petal.Length Petal.Width
PC1        -0.507      -0.435       -0.544      -0.508
PC2        -0.221       0.888       -0.380      -0.134
PC3         0.693       0.056       -0.019      -0.718
PC4        -0.462       0.136        0.748      -0.456
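These loadings are consistent with PCA on the standardized variables, i.e. with an eigendecomposition of the correlation matrix quoted earlier. A sketch recovering them (eigenvectors are only defined up to sign, so absolute values are compared):

```python
import numpy as np

# Correlation matrix R of the four measurements, as quoted above
R = np.array([
    [1.0000000, 0.5538548, 0.8284787, 0.5937094],
    [0.5538548, 1.0000000, 0.5198023, 0.5662025],
    [0.8284787, 0.5198023, 1.0000000, 0.8233476],
    [0.5937094, 0.5662025, 0.8233476, 1.0000000],
])

# PCA on standardized variables = eigendecomposition of R
evals, evecs = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]          # sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]

# Compare |loadings| of PC1 with the table: ~ (0.507, 0.435, 0.544, 0.508)
print(np.round(np.abs(evecs[:, 0]), 3))
print(np.round(evals / evals.sum(), 3))  # proportion of variance per PC
```

The eigenvalues sum to 4, the trace of \( R \) (four standardized variables each with variance 1).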
Scree Plot: shows the variance explained by each PC
We need 2 PCs to explain 80% of the variance.
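The scree plot heights are the eigenvalues of the correlation matrix, and the cumulative proportions behind this statement can be sketched as follows (first PC alone: about 74%; first two: about 88%):

```python
import numpy as np

# Eigenvalues of the correlation matrix quoted earlier in these notes
R = np.array([
    [1.0000000, 0.5538548, 0.8284787, 0.5937094],
    [0.5538548, 1.0000000, 0.5198023, 0.5662025],
    [0.8284787, 0.5198023, 1.0000000, 0.8233476],
    [0.5937094, 0.5662025, 0.8233476, 1.0000000],
])
evals = np.sort(np.linalg.eigvalsh(R))[::-1]  # decreasing order

# Cumulative proportion of variance explained by the first k PCs
cum = np.cumsum(evals) / evals.sum()
print(np.round(cum, 3))
```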
Next Lecture