Variant calling, phasing and imputation
This morning we focussed on quality control, aligning, and inspecting some sequence data for a single sample.
In this afternoon's session we're going to focus on using a set of sequenced samples to do a few things:
- identify genetic variants in a region (and work out the sample genotypes at those variants)
- phase the genotypes so we can see how they align along haplotypes
- and use these haplotypes to impute variants into a second dataset (for which we only have microarray data).
Not every candidate variant call reflects a real genetic variant, so to make this work well we will need to do some more quality control. This time we'll take care to filter the set of variants based on sensible metrics before we use the data.
All the data in this practical comes from the International Genome Sample Resource (IGSR), which lists a huge number of open-access datasets that you can use in your analysis (including the 1000 Genomes Project data).
Note. Before doing anything else, please make sure you have downloaded the data. Then come back here.
Steps in the practical
The practical has four main steps:
The first step is to generate a VCF file of variant calls from some 1000 Genomes Project samples.
We'll then perform quality control on the initial variant calls.
Next we will phase the calls so we know how they stack up on haplotypes.
And then we'll use these haplotypes to impute some microarray data.
Finally we have - you guessed it - some challenge questions. Good luck!