Removing highly-correlated SNPs
If you followed the population genetics simulation tutorial, you'll know that correlation between nearby SNPs, known as linkage disequilibrium (LD), arises naturally as a result of genetic drift (or selection). This can lead to patterns of local variation that dominate principal components.
Because we want to capture 'genome-wide' patterns of relationship, we will first remove SNPs that are too strongly correlated with their neighbours.
LD pruning removes SNPs until no highly correlated pairs remain, so that the SNPs left over are roughly independent. (It also makes subsequent computations quicker.) Run the following command in your terminal to prune the dataset:
plink --vcf chr19-clean.vcf.gz --maf 0.01 --indep-pairwise 50 5 0.2 --out chr19-clean
The above command tells plink to load the file chr19-clean.vcf.gz, keep only SNPs with minor allele frequency (MAF) of at least 1%, and prune them so that no pair remains with pairwise r² > 0.2 within a window. (The other parameters, here 50 and 5, control how the computation works in windows across the genome: each window contains 50 SNPs and is shifted 5 SNPs at a time. You can read about the behaviour here: http://www.cog-genomics.org/plink2/ld.)
Question
Look at the screen output from the above plink command.
- How many variants were in the original dataset?
- How many were removed because their frequency was below 1%?
- How many variants were removed due to LD pruning?
- How many variants remain?
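If the screen output has already scrolled away, note that plink also saves it to a log file named after the --out prefix (here chr19-clean.log). The following is a minimal sketch for pulling out the lines that mention variants; the helper name show_variant_lines is made up for this example and is not part of plink.

```shell
# Print the lines of a plink log that mention variants.
# show_variant_lines is a hypothetical helper, not a plink command.
show_variant_lines() {
    log="${1:-chr19-clean.log}"
    if [ -f "$log" ]; then
        # -i makes the match case-insensitive
        grep -i 'variants' "$log"
    else
        echo "log file not found: $log" >&2
        return 1
    fi
}

# Assumes the pruning command above was run in the current directory.
show_variant_lines chr19-clean.log || echo "Run the plink command first." >&2
```

The matching lines should contain the counts needed to answer the questions above.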
Type ls or use the file manager to view the directory. The command above produced a number of files that all begin with the chr19-clean prefix. For our purposes, the most important one is chr19-clean.prune.in, as this lists the SNPs that remain after pruning. Feel free to look at all these files using less or a text editor.
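As a quick sanity check, you can count the SNPs that were kept and removed. This is a small sketch, assuming the pruning command was run in the current directory and that each file lists one SNP identifier per line; plink writes the removed SNPs to chr19-clean.prune.out alongside chr19-clean.prune.in. The helper name count_snps is made up for this example.

```shell
# Count the lines (one SNP ID per line) in a file.
# count_snps is a hypothetical helper, not a plink command.
count_snps() {
    if [ -f "$1" ]; then
        # tr strips the padding that some versions of wc add
        wc -l < "$1" | tr -d ' '
    else
        echo "file not found: $1" >&2
        return 1
    fi
}

echo "kept:    $(count_snps chr19-clean.prune.in)"
echo "removed: $(count_snps chr19-clean.prune.out)"
```

The two counts should add up to the number of variants that passed the frequency filter.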
Relatedness pruning
When you're ready, go to the next page to identify and remove close relationships.