Part 6: Interpreting GWAS signals
Locating significant signals and defining regions of interest
How do we know if the signals in our scans are worth looking at? That is, how low does a P-value need to be for us to be excited by it?
This question is a bit subtle - it really involves thinking about our prior on how many associations we think there might be, and the statistical power of the study, as well as the P-values themselves. (This reasoning is described in the 'Box' in the Wellcome Trust Case-Control Consortium paper, as well as on the page about interpreting P-values).
Luckily, however, there are some well-accepted rules of thumb. The value $P < 5 \times 10^{-8}$ is often used to identify signals that have strong evidence from a GWAS discovery analysis like this. A less stringent value, such as $P < 10^{-5}$, could be used to identify signals that have some evidence.
Note
The threshold $P < 5 \times 10^{-8}$ is sometimes referred to as 'genome-wide significance'. This isn't a particularly helpful term, so we'll avoid it here.
The threshold itself, however, is reasonably stringent for most purposes, so we'll use it.
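As a rough illustration of these two rules of thumb, the sketch below labels SNPs as having 'strong', 'some' or no notable evidence. It assumes a simplified two-column table (SNP and P) rather than the full PLINK output, and the third SNP is a made-up example row.

```shell
# Toy results table: the first two rows reuse P-values from this tutorial's
# peak; the third row (rs0000001) is hypothetical, for illustration only.
assoc='SNP P
rs2548459:49209339:T:C 1.84e-12
rs374886:49117247:A:G 9.55e-07
rs0000001:1000:A:G 0.03'

# Label each SNP using the two rule-of-thumb thresholds (5e-8 and 1e-5).
tiers=$(printf '%s\n' "$assoc" | awk 'NR > 1 {
  p = $2 + 0                      # force numeric comparison
  if (p < 5e-8)      label = "strong"
  else if (p < 1e-5) label = "some"
  else               label = "none"
  print $1, label
}')
echo "$tiers"
```

To apply the same idea to a real results file, the column numbers in the awk script would need to match that file's layout.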
Important warning
However, you shouldn't imagine that just having $P < 5 \times 10^{-8}$ is enough to be convinced by a signal! More work is needed to really make this convincing.
The gold standard for a GWAS association is that it:
- Has compelling evidence in a discovery analysis (often taken as something like )
- Has consistent evidence in at least one replication analysis (that is, an analysis of the same phenotype in an independent cohort). An often-used rule is to require 'nominal significance', e.g. $P < 0.05$, with an effect in the same direction as the discovery analysis.
- And very strong combined evidence of (at least) across both discovery and replication arms.
- And it isn't obviously caused by any genotyping artifacts like cluster plot issues, or obvious confounding.
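The numerical parts of this checklist can be sketched as a small shell function. This is only an illustration: the function name `passes_checks` and the default thresholds are assumptions made for the example, and the final criterion (ruling out genotyping artifacts and confounding) still needs manual inspection.

```shell
# Illustrative thresholds only: these defaults are assumptions, not values
# prescribed by any particular study. Override them via the environment.
: "${DISC_P:=5e-8}" "${REP_P:=0.05}" "${COMB_P:=5e-8}"

# Usage: passes_checks P_DISC BETA_DISC P_REP BETA_REP P_COMB
# Returns 0 (success) only if all three P-value criteria are met and the
# discovery and replication effects point in the same direction.
passes_checks() {
  awk -v pd="$1" -v bd="$2" -v pr="$3" -v br="$4" -v pc="$5" \
      -v td="$DISC_P" -v tr="$REP_P" -v tc="$COMB_P" 'BEGIN {
    same_direction = (bd * br > 0)
    ok = (pd + 0 < td + 0) && (pr + 0 < tr + 0) && same_direction && (pc + 0 < tc + 0)
    exit(ok ? 0 : 1)
  }'
}

# Hypothetical example values (not taken from any real study).
if passes_checks 1e-9 0.30 0.01 0.25 1e-11; then echo "candidate"; else echo "not yet"; fi
if passes_checks 1e-9 0.30 0.01 -0.25 1e-11; then echo "candidate"; else echo "not yet"; fi
```

The second call fails only because the replication effect has the opposite sign, which is the kind of inconsistency the replication criterion is meant to catch.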
Example: the WTCCC2 GWAS of multiple sclerosis declared its list of associations using the following scheme. It required:
- one-sided
- and a combined .
(A lower threshold of was also used and referred to as 'strong evidence' - 5 regions were in this category.)
Moreover if you look in the supplementary information of that paper you'll see just how much work was done to check the quality of the genotyping - it's extensive.
You should be able to see a single large peak at about 49,000,000 on chromosome 19. You may also have noticed that there are a number of points (SNPs) that have P-values around or smaller than (i.e. are higher than this on the log10 scale). While the large peak suggests a strong signal of association, the others are more difficult to distinguish from artefacts due to population sampling or other confounders we have not accounted for in the analysis.
While there is no fixed rule on what constitutes a likely signal, a good rule of thumb is to check the significance of other SNPs that are linked with the significant SNPs, and carry forward only those signals where we see multiple SNPs that are potentially significant ().
We can make use of linkage disequilibrium to define both significant signals, and potential genomic regions of interest.
For this tutorial, we define linked SNPs as those with an $r^2 > 0.2$. PLINK offers a --clump utility which can be used to group SNPs in this way based on a PLINK summary statistics file:
./plink \
--bfile Genotype_data/AMR_genotypes.filtered \
--clump output/AMR_genotypes.qassoc \
--clump-r2 0.2 \
--clump-p1 5e-8 \
--clump-p2 1e-5 \
--clump-verbose \
--out output/AMR_genotypes
This will produce output similar to that shown below, saved to a .clumped file.
CHR  F  SNP                     BP        P         TOTAL  NSIG  S05  S01  S001  S0001
19   1  rs2548459:49209339:T:C  49209339  1.84e-12  21     0     0    2    0     19

                                  KB      RSQ    ALLELES  F  P
(INDEX)  rs2548459:49209339:T:C   0       1.000  C        1  1.84e-12
         rs374886:49117247:A:G    -92.1   0.437  CA/TG    1  9.55e-07
         rs8111874:49168942:G:A   -40.4   0.578  CA/TG    1  2.51e-06
         rs35106244:49203829:C:T  -5.51   0.665  CT/TC    1  8.44e-09
         rs281377:49206603:C:T    -2.74   0.481  CC/TT    1  4.99e-09
         rs507855:49208501:A:G    -0.838  0.989  CG/TA    1  2.37e-12
         rs632111:49208978:A:G    -0.361  0.997  CG/TA    1  2.43e-12
         rs281380:49214470:T:C    5.13    0.317  CT/TC    1  4.63e-07
         rs4002471:49215095:C:T   5.76    0.642  CT/TC    1  1.36e-08
         rs281386:49217305:A:G    7.97    0.27   CA/TG    1  1.48e-06
         rs629504:49223633:C:G    14.3    0.603  CG/TC    1  8.15e-09
         rs2287921:49228272:T:C   18.9    0.616  CC/TT    1  2.35e-08
         rs479486:49229323:G:A    20      0.659  CA/TG    1  3.99e-10
         rs838147:49246866:A:G    37.5    0.509  CA/TG    1  2.83e-07
         rs8111399:49247963:G:T   38.6    0.271  CG/TT    1  3.06e-06
         rs56098615:49247999:T:A  38.7    0.271  CT/TA    1  3.06e-06
         rs8111208:49248022:A:C   38.7    0.271  CA/TC    1  3.06e-06
         rs838146:49248052:C:T    38.7    0.528  CC/TT    1  1.02e-06
         rs12611211:49248331:G:A  39      0.271  CG/TA    1  3.06e-06
         rs8103840:49254955:C:T   45.6    0.462  CC/TT    1  7.83e-06

RANGE: chr19:49117247..49254955
SPAN: 137kb
As you can see, many of the SNPs within the large peak are in LD with the most significant SNP in that region. Taking the coordinates of those SNPs, we can define the outer bounds of a region of interest.
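One way to take those coordinates is sketched below. It assumes the SNP identifiers follow this dataset's rsid:position:ref:alt convention, so the position is the second colon-separated field. Only three of the clumped SNPs are listed to keep the example short, and the chromosome is passed in by hand.

```shell
# A few SNP IDs from the clump above (index SNP plus the two outermost SNPs).
snps='rs2548459:49209339:T:C
rs374886:49117247:A:G
rs8103840:49254955:C:T'

# Track the smallest and largest position, then print chromosome:start-end.
region=$(printf '%s\n' "$snps" | awk -F: -v chr=19 '
  NR == 1 { min = max = $2 + 0 }
  { pos = $2 + 0; if (pos < min) min = pos; if (pos > max) max = pos }
  END { printf "%s:%d-%d\n", chr, min, max }')
echo "$region"   # 19:49117247-49254955
```

The result matches the RANGE line that PLINK reports for this clump.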
Question
Repeat this now for the imputed association files generated by PLINK as well.
Is the region of association consistent between imputed and genotyped data?
Does the index (most significant) SNP change when using imputed data? Why might this happen?
Identifying associated genomic features
Once we have these regions and the SNPs present within them, we can use reference annotations to determine what genes or genomic elements are present in that part of the genome. We will use the human assembly GRCh37 as the reference genome for this analysis. To locate potential genes of interest, we can search the reference assembly in Ensembl or UCSC genome browser using our genomic coordinates in the form chromosome:start-end.
Warning
Our data is in GRCh37 or 'hg19' coordinates - not GRCh38. Make sure you select the relevant genome in the browsers for this.
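As a convenience, the region can be turned into browser links that already point at the GRCh37/hg19 assembly. The URL patterns below are assumptions based on the usual address formats of the Ensembl GRCh37 site and the UCSC browser, so check that the page that opens reports the expected assembly.

```shell
# Region of interest in chromosome:start-end form (from the clump above).
region="19:49117247-49254955"

# Ensembl's GRCh37 site takes the region as-is; UCSC expects a 'chr' prefix
# and the assembly name (hg19) in the db parameter.
ensembl_url="https://grch37.ensembl.org/Homo_sapiens/Location/View?r=${region}"
ucsc_url="https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr${region}"

echo "$ensembl_url"
echo "$ucsc_url"
```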
Looking in the browsers will give us an idea of the genes that are in the region and other regulatory elements that might be annotated. Another useful database to check is dbSNP, and the Ensembl Variant Effect Predictor is a useful tool to investigate potential effects of significant SNPs.
Yet another place to look is Open Targets, which amalgamates many functional datasets that can shed light on an association.