Overview of the practical

In this practical we will use plink to do several things to the data:

  • to remove closely-related samples
  • to compute principal components
  • and to compute the SNP weights or loadings that tell us how principal components are weighted across the genome.

We'll also use R to inspect and plot results.

A note on quality control

Before carrying out a genetic analysis like PCA, it's important to have a good-quality dataset, and this typically means carrying out careful quality control (QC) first. On this course we'll cover QC in later lectures and practicals. For this practical we'll use an already-cleaned dataset contained in the file chr19-clean.vcf.gz. You can look at the data in this file by typing:

less -S chr19-clean.vcf.gz
Note

If you are using Mac OS X, you will need to use zless instead of less because the file is gzipped.

This file is a Variant Call Format file. It consists of some metadata, followed by genotype calls at different sites (rows) for different samples (columns). Feel free to look at the data by scrolling around. When you've finished, press the 'q' key to quit back to the terminal prompt.
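
If you'd like to check this layout without scrolling, it can be summarised from the command line. The following is a minimal sketch, assuming the standard gzip and awk tools are available. The helper function names (vcf_n_meta, vcf_n_samples, vcf_n_sites) are made up for this illustration and are not part of plink or any VCF toolkit. It relies on the VCF conventions that metadata lines start with '##', that the column header line starts with '#CHROM', and that a file with genotype calls has nine fixed columns (CHROM to FORMAT) before the sample columns.

```shell
# Hypothetical helper functions for this illustration (not part of plink).

# Number of metadata lines (those starting with '##').
vcf_n_meta() {
  gzip -dc "$1" | awk '/^##/ { n++ } END { print n + 0 }'
}

# Number of samples: columns in the '#CHROM' header line beyond the 9 fixed ones.
vcf_n_samples() {
  gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# Number of sites: lines that do not start with '#'.
vcf_n_sites() {
  gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Run on the practical's file if it is present in the current directory.
if [ -f chr19-clean.vcf.gz ]; then
  echo "metadata lines: $(vcf_n_meta chr19-clean.vcf.gz)"
  echo "samples:        $(vcf_n_samples chr19-clean.vcf.gz)"
  echo "sites:          $(vcf_n_sites chr19-clean.vcf.gz)"
fi
```

Each function decompresses and scans the file, so on a large dataset this may take a few moments. Using gzip -dc rather than zcat is a deliberate choice here, since zcat behaves differently on Mac OS X.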

Preparing data for PCA

Before computing PCs we will need to do some pruning of the data. We will:

  • remove SNPs that are highly correlated with each other, i.e. in 'linkage disequilibrium' (LD). This is to avoid confounding the analysis with local LD patterns.
  • remove samples that are too closely related. This is so that our PCs reflect the majority of our data rather than a few groups of close relatives.

When you're ready, go here to start pruning.