A real (simulated) GWAS: quality control
We will now work with a set of data files containing many SNPs from chromosome 19, which have been genotyped on a set of individuals who were infected with norovirus (disease cases) and a set of population controls
Note
The study we're running is actually simulated - so doesn't really correspond to real individuals or real phenotypes. So you can download and manipulate this data however you like.
But it behaves just like a real GWAS!
Data from a GWAS would contain SNPs at this density across the entire genome, but we will focus on just one chromosome to make the exercises more tractable.
Our first task will be to quality control the dataset to remove unusual SNPs and samples.
Basic quality control using plink
Converting data to binary format
The key files are chr19-example.vcf.gz
and chr19-example.samples
. If you followed the getting started
section, you should already have these in your folder.
Before getting started, let's convert these files from VCF format to a 'binary PED' format that PLINK can read more easily:
./plink --make-bed \
--vcf chr19-example.vcf.gz \
--pheno chr19-example.samples \
--update-sex chr19-example.samples 2 \
--out chr19-example \
--keep-allele-order
This might take a moment to run. You should see plink has created three new files:
chr19-example.bed
chr19-example.bim
chr19-example.fam
These files contain information about the samples and SNPs, as well as the genotypes for each of the samples at each of the
SNPs. You can look at the .bim
and .fam
files using less
, but the .bed
file is in a special binary format that only
plink and other software tools can understand.
Note
We added the --keep-allele-order
option here to the above command to make sure plink preserves the two alleles at each
SNP in the same order as in the VCF file. This is important because in a GWAS analysis you will want to be able to
keep track of the direction of estimated effects.
We can tell plink
to load data in the binary plink format using the --bfile
option. For instance, to calculate
allele frequencies we now use:
./plink --bfile chr19-example --freq --out chr19-example
Question
How many SNPs and samples are in this dataset?
Note
Many other formats are in use for genetic data, including 'GEN' format, BCF (binary VCF) format, and 'BGEN' format. Learning how to deal with these is part of the job description.
Computing and analysing QC metrics
There are many useful QC metrics that we can calculate for our dataset. These metrics can tell us about the quality of loci (i.e. SNPs), and of samples. For instance, we can calculation information about missingness:
./plink --bfile chr19-example --missing --out miss-info
This will produce a .imiss
file with information about individuals and .lmiss with information about loci (SNPs).
Another QC metric is sample heterozygosity:
./plink --bfile chr19-example --het --out het-info
Let's use R to visualize these results.
You can load the output into R to look at them in more detail - we'll use tidyverse
in some of the examples, so load that now as well:
# In R:
library( tidyverse )
MissData = read_table( "miss-info.imiss", col_types = "ccciid" )
HetData <- read_table( "het-info.het", col_types = "ccdddd" )
Note
If you prefer, you can also look at the data using a spreadsheet like Excel. Open a new spreadsheet and then use the
File->Import
menu to load the data. The files are space-delimited text files.
Load the data now and take a look.
Quick question
What columns do these files have? (Note: the column names can be pretty cryptic, but they're described in the plink documentation.)
Which SNP has the highest missing rate?
To make sense of this, let's plot the metrics now - we'll start with the heterozygosity. Curiously enough, even though the
command is called --het
, plink does not actually output the heterzygosity rates in this file. Let's compute it now from the
number of observed homozygotes and plot:
# in R
HetData$heterozygosity = ( HetData[['N(NM)']] - HetData[['O(HOM)']] ) / HetData[['N(NM)']]
print(
ggplot(
data = HetData,
aes( x = heterozygosity )
)
+ geom_histogram( bins = 100, col = 'black', fill = 'grey' )
+ theme_minimal()
+ xlab( "Sample heterozygosity" )
)
The graph you see shows the distribution of heterozygosity across samples. QC plots often look like this; there is a large peak of samples that lie within a range of normal values (the normal samples), and then a small number of outlier samples that are usually poor quality samples.
Note
The idea here is that in any reasonable homogenous population, the level of heterozygosity - that is, the chance that the two chromosomes in an individual differ at a randomly chosen SNP - will be roughtly similar across individuals. Or, if the population is structured, we might see subsets of the population with different values.
If single individuals appear to have drastically different heterozygosity, this could be for example because
- the samples was contaminated (it contains more than one person's DNA); or
- the person was somewhat inbred (contains more homozygosity than expected); or
- the DNA sample was poor quality (so that genotype calling was poor).
To improve the plot without losing any data, let's clamp the values into a smaller x axis range:
clamp <- function( x, lower = -Inf, upper = Inf ) {
pmax( pmin( x, upper ), lower )
}
p = (
ggplot(
data = HetData,
aes( x = clamp( heterozygosity, 0.1, 0.3 ))
)
+ geom_histogram( bins = 100, col = 'black', fill = 'grey' )
+ theme_minimal()
+ xlab( "Sample heterozygosity, clamped to [0.1 - 0.3]" )
)
print(p)
You can see that there is a large peak in heterozygosity around 0.2, with a number of outliers below 0.15 or above 0.25.
Saving images
In Rstudio you can save any figure for later reference - go to the 'Export' button above the plot and choose a destination
file. Alternatively you can save it directly from ggplot using the png()
command like this:
# In R:
ggsave( p, file = "HetHist.pdf" )
Next, lets plot the missingness values.
# In R:
print(
ggplot(
data = MissData,
aes( x = F_MISS )
)
+ geom_histogram( bins = 100, col = 'black', fill = 'grey' )
+ theme_minimal()
+ xlab( "Sample missing data rate" )
)
That's again not very informative because of the scale... let's zoom in to the region near zero on the x axis by clamping values to 0.2:
print(
ggplot(
data = MissData,
aes( x = clamp( F_MISS, 0, 0.1 ) )
)
+ geom_histogram( bins = 100, col = 'black', fill = 'grey' )
+ theme_minimal()
+ xlab( "Sample missing data rate, clamped to [0 - 0.1]" )
)
Again, most of the samples have low missingness (close to zero), with a number of outliers above 0.025. You can see some other features there as well such as the two bumps in the distribution which looks a bit suspicious - in a real study you would want to look into this. For now, let's combine the two QC metrics (missingness and heterozygosity) to select outlying samples:
# In R:
qcFails <- MissData[
MissData$F_MISS > 0.02
| HetData$heterozygosity < 0.18
| HetData$heterozygosity > 0.22,
c(1:2) # Just capture the sample identifier fields
]
write_delim( qcFails, file = "qcFails.txt" )
Here's a plot showing the thresholds we've chosen:
print(
ggplot(
data = cbind( MissData, HetData[,3:7] ),
aes(
x = clamp( F_MISS, 0, 0.1 ),
y = clamp( heterozygosity, 0.1, 0.3 )
)
)
+ geom_point()
+ theme_minimal()
+ geom_vline( xintercept = c( 0.02 ), col = 'red' )
+ geom_hline( yintercept = c( 0.18, 0.22 ))
+ xlab( "Sample missing data rate, clamped to [0 - 0.1]" )
+ ylab( "Sample heterozygosity, clamped to [0.1 - 0.3]" )
)
We should definitely do something about those two clusters!
Challenge question
Explore the dataset. Is there anything different about the samples in the two clusters? (Information on the samples is contained
in thechr19-example.fam
file (format here) or the
chr19-example.samples
file.
Creating an analysis-ready dataset
Let's apply these sample filters and some basic SNP quality filters, to try to create a 'clean' dataset for analysis. We will:
filter samples out that have outlying missingness or heterozygosity, according to the list generated above;
and filter out SNPs based on SNP QC metrics using the
--geno
,--hwe
, and--maf
options.
Here is a plink command that does this:
./plink \
--bfile chr19-example \
--remove qcFails.txt \
--hwe 1e-4 include-nonctrl \
--geno 0.01 \
--maf 0.01 \
--make-bed \
--out chr19-clean \
--keep-allele-order
Note
The backslash characters above are 'line continuation characters' - they're just there to make the command work as if we typed it all one one line.
Questions
What SNP QC filters have we applied here? (To find out, see the documentation on --hwe, on --geno, and on --maf).
Read the output of the command carefully. How many samples did the command retain? Check in R this is the expected number. And how many SNPs did we keep - does this number look sensible?
Important question
Re-compute the sample missingness and heterozygosity metrics and re-create the above QC plot using the clean data. Have the SNP QC filters dealt with the issue that showed two clusters of samples?
(And why are the new missingness values much smaller than the old ones?)
Next steps
Congratulations! You now have a carefully QC'd dataset to work with. Now go on to the association practical.