
An aside on the maths

The maths of PCA (in the way it is usually applied) works like this. Suppose $X$ is a big matrix of genotypes, with $L$ rows representing genetic variants and $N$ columns representing different samples / individuals.

It is typical to first standardise the genotypes by mean-centring each variant (row) of $X$ and dividing it by its standard deviation, i.e. by $\sqrt{f(1-f)}$ where $f$ is the allele frequency of the variant. So do that first.
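To make this concrete, here is a minimal numpy sketch of the standardisation step. The genotype matrix `G` below is simulated purely for illustration (real data would of course come from your genotypes), coded as 0/1/2 allele counts with variants in rows and samples in columns:

```python
import numpy as np

# Simulate a hypothetical genotype matrix for illustration:
# L variants (rows) x N samples (columns), coded as 0/1/2 allele counts.
rng = np.random.default_rng(2024)
L, N = 1000, 100
p = rng.uniform(0.1, 0.5, size=L)                   # per-variant allele frequencies
G = rng.binomial(2, p[:, None], size=(L, N)).astype(float)

# Allele frequency of each variant: the mean genotype divided by 2.
f = G.mean(axis=1) / 2

# Standardise each variant (row): mean-centre and divide by sqrt(f(1-f)).
X = (G - G.mean(axis=1, keepdims=True)) / np.sqrt(f * (1 - f))[:, None]
```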

It's hip to be square

There are two closely related square matrices you can make out of $X$:

$$
R = \frac{1}{L} X^t X \quad \text{or} \quad Z = \frac{1}{N} X X^t
$$

The first matrix $R$ is an $N \times N$ matrix, i.e. it has one row and one column for each sample.

The second matrix $Z$ is an $L \times L$ matrix, with one row and column for each variant in the data.
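Continuing the sketch above (where `X` is the standardised $L \times N$ matrix), these are just two matrix products:

```python
# Continuing the sketch above: X is the standardised L x N genotype matrix.
R = (X.T @ X) / L        # N x N: one row and column per sample (relatedness)
Z = (X @ X.T) / N        # L x L: one row and column per variant (LD matrix)

print(R.shape, Z.shape)  # (100, 100) (1000, 1000)
```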

Because we have normalised and mean-centred, the entries of $R$ are essentially covariances between the genotypes of pairs of samples $i$ and $j$. Indeed, we can think of $R$ as a relatedness matrix: each entry $r_{ij}$ captures the degree of allele sharing (covariance) between individuals $i$ and $j$. (Because of the frequency scaling, sharing of rarer variants has greater weight in $r_{ij}$.)

On the other hand, the matrix $Z$ is an LD (linkage disequilibrium) matrix: entry $z_{ij}$ is the correlation between genotypes at variants $i$ and $j$. (The diagonal entries are all $1$.)

What do these matrices have to do with each other? Well, if you followed the Wright-Fisher tutorial you will realise that relatedness and LD are in some sense dual to each other. LD arises as haplotypes drift up in frequency, that is, as the level of identity by descent increases.

This duality appears in principal components analysis too. Specifically:

  • The right eigenvectors of $R$ are the principal components of $X$. Let's call them $\text{PC}_1, \text{PC}_2, \cdots$ and so on.

  • The right eigenvectors of $Z$ are the principal component loading vectors of $X$. Let's call them $z_1, z_2, \cdots$.

  • The two are related by the following: the value of the $k$th principal component for sample $i$ is the projection of that sample's genotypes onto the $k$th loading vector $z_k$.

In other words relatedness and LD are dual to each other - and this is reflected in principal components just like in other aspects of population genetics.
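To see this duality numerically, here is a continuation of the sketch above: it computes the first principal component directly from $R$, and also by projecting each sample's genotypes onto the first loading vector from $Z$, and checks that the two agree (up to sign and scale):

```python
# Continuing the sketch: eigendecompose both matrices.
# np.linalg.eigh returns eigenvalues in ascending order, so the last column
# of the eigenvector matrix is the top eigenvector.
evals_R, evecs_R = np.linalg.eigh(R)
evals_Z, evecs_Z = np.linalg.eigh(Z)

pc1 = evecs_R[:, -1]      # top eigenvector of R: the first principal component
z1 = evecs_Z[:, -1]       # top eigenvector of Z: the first loading vector
proj = X.T @ z1           # project each sample's genotypes onto z1

# proj is proportional to pc1 (sign and scale are arbitrary), so the
# correlation between them should be essentially +/- 1.
print(abs(np.corrcoef(pc1, proj)[0, 1]))
```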

Note

There are actually two sources or types of LD that could turn up here.

The one we are usually interested in for PCA is linkage disequilibrium due to population structure - sometimes thought of as 'admixture LD'. This is where variants are correlated with each other because the population is sub-structured, and the variant frequencies vary between sub-populations. This type of LD might exist between variants anywhere in the genome (not just locally in a particular region).

The other source of LD is the local patterns of LD that arise from genetic drift, even in an unstructured population. (You can see how this arises by following the genetic drift simulation tutorial.)

In one sense these two types of LD are both sort of the same thing - they both arise from patterns of relatedness in the data, generating haplotype sharing. However, LD caused by structure clearly arises from demographic features of the populations, while local LD is just a basic feature of genetic drift.

For PCA as used in a GWAS we are typically interested in population structure, so it is often appropriate to thin the data down to a set of 'independent' SNPs (not too close together) to avoid capturing local LD effects. This is why we LD-thinned our data in practice.
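For illustration only, here is one very crude way to thin variants so that no two kept variants are too close together. The `positions` array and the 100kb gap are entirely hypothetical, and real analyses usually prune on pairwise LD (r²) with dedicated tools rather than on physical distance alone:

```python
# Continuing the sketch: thin variants by (hypothetical) physical position so
# that kept variants are at least `min_gap` base pairs apart.
positions = np.sort(rng.integers(0, 50_000_000, size=L))  # hypothetical bp positions
min_gap = 100_000

keep = [0]
for i in range(1, L):
    if positions[i] - positions[keep[-1]] >= min_gap:
        keep.append(i)

X_thinned = X[keep, :]
print(X_thinned.shape)   # fewer rows (variants) than X
```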

The maths

To see why PCA works, consider an eigenvector $z$ of $Z = X X^t$, i.e. one of the 'loading' vectors. (Here we drop the $\frac{1}{N}$ and $\frac{1}{L}$ scaling factors: they rescale the eigenvalues but don't change the eigenvectors.) It's an eigenvector, which means:

$$
X X^t z = \lambda z
$$

where $\lambda$ is some number - the eigenvalue corresponding to $z$.

The projection of $X$ onto $z$ is $X^t z$. Now it turns out that $X^t z$ must in fact be an eigenvector of the relatedness matrix $R = X^t X$, because:

$$
R (X^t z) = X^t X (X^t z) = X^t (X X^t) z = X^t (Z z) = \lambda (X^t z)
$$

Conversely, if $s$ is an eigenvector of $R$ with eigenvalue $\gamma$, that is $X^t X s = \gamma s$, then

$$
X X^t (X s) = \gamma X s
$$

for similar reasons.

What this says is that the (non-zero) eigenvalues of $R$ and $Z$ are the same (up to the $\frac{1}{L}$ and $\frac{1}{N}$ scaling factors) - and the eigenvectors are related by projecting the columns and rows of $X$ onto the eigenvectors respectively.
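Continuing the numerical sketch, we can check this directly: the non-zero eigenvalues of $X^t X$ and $X X^t$ coincide (and those of $R$ and $Z$ then differ only by the $\frac{1}{L}$ versus $\frac{1}{N}$ scaling):

```python
# Continuing the sketch: compare the top eigenvalues of the two unscaled matrices.
evals_XtX = np.linalg.eigvalsh(X.T @ X)      # N eigenvalues
evals_XXt = np.linalg.eigvalsh(X @ X.T)      # L eigenvalues (all but ~N are ~0)

top_XtX = np.sort(evals_XtX)[::-1][:10]
top_XXt = np.sort(evals_XXt)[::-1][:10]
print(np.allclose(top_XtX, top_XXt))         # True
```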

Moreover this eigenvalue decomposition of the relatedness matrix has particular properties that make it useful:

  • The first principal component has the maximum possible variance (that is, if you project the samples onto any unit-length vector other than the first loading vector $z_1$, you get something with less variance).

  • The second principal component is then chosen to have the maximum possible variance after accounting for the first. (That is, if you project the samples onto any vector orthogonal to (i.e. at right angles to) $z_1$, other than the second loading vector $z_2$, you get a lower variance.)

  • and so on.

So PCA pulls out the directions of maximal variance in the data.
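Again continuing the sketch, a quick numerical check of this property: the variance of the samples projected onto $z_1$ is at least as large as the variance of the projection onto any random unit-length direction:

```python
# Continuing the sketch: compare the variance of the projection onto z1 with
# projections onto random unit vectors in variant space.
var_pc1 = (X.T @ z1).var()

random_vars = []
for _ in range(1000):
    v = rng.normal(size=L)
    v /= np.linalg.norm(v)            # random unit-length direction
    random_vars.append((X.T @ v).var())

print(var_pc1 >= max(random_vars))    # True
```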

More reading

More PCA theory. If all this hasn't melted your brain, try reading Gil McVean's paper on genealogies and PCA.

Computing in big cohorts. In this tutorial we used the eigendecompose-the-relatedness-matrix approach to compute principal components. While this works in many studies, very large cohorts such as the UK Biobank typically require other methods such as flashPCA that avoid computing either of the matrices directly.