
A warm-up example

Before running the main practical, let's first see how PCA works by using it to recover some simulated structure from a matrix in R.

As before we will start with a giant matrix:

$$X = \begin{array}{c|cccc} & \text{column 1} & \text{column 2} & \cdots & \text{column $N$} \\ \hline \text{row $1$} & x_{11} & x_{12} & \cdots & x_{1N} \\ \text{row $2$} & x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & & \ddots & \\ \text{row $L$} & x_{L1} & x_{L2} & \cdots & x_{LN} \end{array}$$

that contains some structure, and we'll see if PCA can recover this structure for us. (To make this simple we'll simulate the matrix so we know what the true structure is.)

Simulating some data

Let's simulate some structured data now. We'll work with a matrix with 1,000 rows and 100 columns:

L = 1000
N = 100
X = matrix(
  NA,
  nrow = L,
  ncol = N
)

Let's start by filling this with random values from a normal distribution:

X[,] = rnorm( L*N, mean = 0, sd = 0.5 )
Note

If you're unsure what this has done, explore the matrix now by using your R skills to subset rows and columns, or try hist(X) to see a histogram of values.
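
If you want a concrete starting point, something like this works (just a suggestion - any exploration is fine):

# Check the dimensions and look at the top-left corner of the matrix
dim(X)
X[1:5,1:5]
# Plot a histogram of all the values - they should look roughly normal,
# centred on zero with standard deviation 0.5
hist(X)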

Now let's add some structure - first we'll take a couple of sets of rows to work with:

rows1 = 1:500
rows2 = 250:500

and a couple of random sets of samples:

cols1 = sample( 1:N, 66 )
cols2 = sample( cols1, N/3 )

Let's now add the structure itself, by adding random values with nonzero mean to each of these subsets:

X[rows1,cols1] = X[rows1,cols1] + rnorm( length(rows1) * length(cols1), mean = 0.25, sd = 0.5 )
X[rows2,cols2] = X[rows2,cols2] + rnorm( length(rows2) * length(cols2), mean = -0.4, sd = 0.5 )

Finally, let's do what we will do in practice for a GWAS and normalise the matrix. We will do this by standardising the rows:

for( i in 1:L ) {
  # centre the row on zero...
  X[i,] = X[i,] - mean( X[i,] )
  # ...and scale it to have standard deviation 1
  X[i,] = X[i,] / sd( X[i,] )
}

With this setup the matrix should now be structured like this (a quick sanity check is sketched below):

  • the matrix has 100 samples in total, of which:
      • two-thirds of the samples have had a 'bit' added at the first 500 SNPs; and
      • half of those samples have had an extra 'bit' (this time with negative mean) added at SNPs 250-500.
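
If you want to convince yourself that the structure really is there before plotting anything, one rough sanity check (not part of the practical) is to average each column over the first 500 rows and compare the three groups of columns. The exact numbers will depend on your random draws, but the three groups should end up with visibly different average values:

# Average each column over the rows where structure was added
block.means = colMeans( X[rows1,] )
# Compare the three groups of columns
mean( block.means[ setdiff( cols1, cols2 ) ] )
mean( block.means[ cols2 ] )
mean( block.means[ setdiff( 1:N, cols1 ) ] )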

Let's see if we can see this structure visually using image():

image(
  t(X),
  x = 1:N,
  y = 1:L,
  xlab = "columns",
  ylab = "rows"
)

img

Question

Do you see any structure - if so is it what you expect?

Note. By default, image() will choose a colour scale to match the values in the matrix - try hist(X) again if you want to see the range of values.
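
If you'd like more control, you can also set the colour range and palette yourself - for example, to make the scale symmetric around zero. This is purely illustrative and not needed for the practical:

# Use an explicit colour range that is symmetric around zero,
# with a diverging palette
# (hcl.colors() is available in R >= 3.6; any palette will do)
limit = max( abs(X) )
image(
  t(X),
  x = 1:N,
  y = 1:L,
  zlim = c( -limit, limit ),
  col = hcl.colors( 100, "Blue-Red" ),
  xlab = "columns",
  ylab = "rows"
)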

Computing principal components

You can probably just about see the structure in the matrix above, but not very clearly - even though we know it is there. Now let's see whether principal components can identify it.

There are several equivalent ways to run a PCA, but the one we will use is based on computing a matrix of "similarity" (or, as we will say for genetic data below, 'relatedness') between columns. This similarity matrix is of dimension $N \times N$ and can be computed very simply indeed:

$$R = \frac{1}{L} \cdot X^t X$$

or in R:

R = (1/L) * t(X) %*% X
image(R)
Note

In case you're not used to looking at matrix maths, here's what the above means.

It says: take the matrix $X$, which has dimension $L \times N$ (i.e. it has $L$ rows and $N$ columns) and transpose it. (Transposing it means 'rotating' by 90 degrees, so the rows become columns and vice versa.) Then multiply it by itself. Like this:

$$X^t X = \left(\begin{array}{cccc} x_{11} & x_{21} & \cdots & x_{L1} \\ & & \ddots & \\ x_{1N} & x_{2N} & \cdots & x_{LN} \end{array}\right) \cdot \left(\begin{array}{cccc} x_{11} & \cdots & x_{1N} \\ x_{21} & \cdots & x_{2N} \\ & \ddots & \\ x_{L1} & \cdots & x_{LN} \end{array}\right)$$

To 'multiply' you go across rows and down columns, i.e. you compute the dot product of each row of $X^t$ with each column of $X$. (The rows of $X^t$ are just the columns of $X$, of course, so this is dot-producting the columns of $X$ with themselves.) The entry $r_{ij}$ in row $i$ and column $j$ of the result $R$ is thus the dot product of the $i$th and $j$th columns of $X$:

$$r_{ij} = \frac{1}{L} \sum_{l=1}^L x_{li} x_{lj}$$

Roughly speaking, this dot product quantifies the extent to which the two columns of $X$ point in the same direction (in $L$-dimensional space). In fact it is the covariance between columns $i$ and $j$ (or almost - this is not quite true because we standardised the rows, not the columns, of $X$, but this has a minor impact here).
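
To see this concretely, you can check - for an arbitrary pair of columns, say columns 1 and 2 - that the entry of R, the explicit sum, and the built-in cov() function give essentially the same number (cov() divides by L-1 and re-centres the columns, so it only agrees approximately):

# Entry (1,2) of R, computed three ways
R[1,2]
sum( X[,1] * X[,2] ) / L
cov( X[,1], X[,2] )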

We can now compute principal components by computing the eigendecomposition of the matrix $R$. (From here on we drop the $1/L$ factor: rescaling $R$ by a constant rescales its eigenvalues but leaves the eigenvectors unchanged.)

R = t(X) %*% X
pca = eigen(R)
Note

As usual you can see the structure of the output object using View() or str(). The matrix $X$ has 100 columns, and the output object has 100 eigenvalues (pca$values) and 100 eigenvectors (pca$vectors), each of length 100. The entries of the eigenvectors are the principal components.
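
For example, you could check the dimensions directly:

# The decomposition returns N eigenvalues and an N x N matrix of eigenvectors
str( pca )
length( pca$values )   # should be 100
dim( pca$vectors )     # should be 100 x 100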

Let's see what this has made of the result by plotting the first two principal components against each other:

plot(
  x = pca$vectors[,1],
  y = pca$vectors[,2],
  xlab = "Principal component 1",
  ylab = "Principal component 2",
  # Make the points filled-in, because it looks better
  pch = 19
)
grid()

You will probably see something like this: img

Woah! PCA has indeed magically found some structure in our data!

To see if it is correct, let's colour the points according to the sample sets we used:

colours = rep( "grey50", N )
colours[ cols1 ] = "orangered3"
colours[ cols2 ] = "dodgerblue4"
plot(
  x = pca$vectors[,1],
  y = pca$vectors[,2],
  xlab = "Principal component 1",
  ylab = "Principal component 2",
  # Make the points filled-in, because it looks better
  pch = 19,
  # Colour the points
  col = colours
)
grid()

img

As you can see, PCA has separated out the three populations of samples - indeed it has done it pretty well in my version of the data above.

Note

The sample sets are clearly separated in the first two principal components. But the method isn't perfect - for example, it hasn't actually told us what the three sets are. (We can see them all right, but we would need extra work to get it to tell us which samples belong to which set.)
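
One way to do that extra work - just a sketch, and not part of the practical - would be to cluster the samples on the first two principal components, for example with kmeans(), and then cross-tabulate the clusters against the true group memberships (which, in this simulation, we happen to know):

# Cluster the samples using their first two principal components.
# We know there are three groups here; with real data you would have to
# choose the number of clusters some other way.
clusters = kmeans( pca$vectors[,1:2], centers = 3, nstart = 10 )
# Compare the cluster assignments to the truth (the colours defined above)
table( clusters$cluster, colours )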

PCA duality: computing row 'loadings'

What if we don't eigendecompose $X^t X$ (of dimension $N \times N$), but decompose $X X^t$ (of dimension $L \times L$) instead? Answer: we get the row loadings:

# As above, we drop the 1/L factor so that the eigenvalues are directly
# comparable with those of t(X) %*% X
loadings = eigen( X %*% t(X) )
# Plot the (absolute) loadings for the first two components, one above the other
layout( matrix( 1:2, ncol = 1 ))
par( mar = c( 4, 4, 1, 1 ))
plot(
  1:L,
  abs(loadings$vectors[,1]),
  xlab = "row",
  ylab = "Loading",
  pch = 19
)
plot(
  1:L,
  abs(loadings$vectors[,2]),
  xlab = "row",
  ylab = "Loading",
  pch = 19
)

img

What may not be apparent is that these loadings are very closely related to the PCs themselves.

To see this, first look at the eigenvalues from both the above decompositions:

plot( pca$values, loadings$values[1:100] )

They are the same!
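
If you'd rather check this numerically than by eye, you could compare the leading eigenvalues side by side - they should agree up to floating-point error:

# The nonzero eigenvalues of t(X) %*% X and X %*% t(X) are the same
head( cbind( pca = pca$values, loadings = loadings$values[1:100] ) )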

It turns out that these two things (PCs and loadings) are 'dual' to each other - in particular the PCs are just the projections of the columns of X onto the loadings. Specifically:

  • The $k$th loading expresses how much each row contributes to the $k$th PC.

  • The $k$th PC also expresses how much each column contributes to the $k$th loading.

  • The PCs can be computed (up to a scaling factor) by projecting the columns of $X$ onto the loadings (i.e. taking dot products):

    $$PC_{k} = t(X) \cdot \text{loading}_{k} \times \text{scaling factor}$$

You can check this as follows. First compute the projections onto the loadings like this:

projections = t(X) %*% loadings$vectors

Now, for example, let's plot the first projection against the first principal component:

plot(
  projections[,1],
  pca$vectors[,1]
)
grid()

Note

It turns out that the scaling factor is one over the square root of the corresponding eigenvalue - the eigenvalues being the other thing returned by eigen(). Make them both the same like this (note that the sign of each eigenvector is arbitrary, so you may see a line of slope -1 rather than 1):

plot(
  projections[,5] / sqrt(pca$values[5]),
  pca$vectors[,5]
)
grid()

PCA as a singular value decomposition

The method of computation above is similar to what we will run in practice for the GWAS data. However, there's a simpler way in R using just the matrix $X$ itself - the singular value decomposition. Run it like this:

s = svd(X)

At this point (a quick check is sketched below):

  • The columns of s$u are the loadings
  • The columns of s$v are the principal components
  • The entries of s$d are the square roots of the eigenvalues above.
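
To convince yourself of this, you could compare the two computations directly - a quick sketch, bearing in mind that eigenvectors and singular vectors are only defined up to sign, so some columns may come out flipped:

# The squared singular values should equal the eigenvalues of t(X) %*% X
plot( s$d^2, pca$values )
# The first principal component from the two methods should match,
# up to a possible sign flip (i.e. a line of slope 1 or -1)
plot( s$v[,1], pca$vectors[,1] )
grid()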