Skip to main content

A toy GWAS

Before we get started properly let's run a little toy example.

Testing for association

A basic GWAS study begins by running regression (usually linear regression for quantitative traits, or logistic regression for case/control traits like the one we are using) between each genetic variant in the genome and the phenotype of interest. We can run a first tiny GWAS by running:

./plink --vcf snp-example.vcf --logistic --pheno snp-example.samples --allow-no-sex
Note

The --allow-no-sex option is needed when working with VCF files in plink. This works around plink's default behaviour which removes samples without a sex assignment - VCF files don't record sample sex, so this is necessary.

PLINK has again generated two files in the directory we are working in: plink.log and plink.assoc.logistic. The log file simply captures the status information that PLINK reports with each run. The other gives the logistic regression output.

Question

How many cases and controls are included in this data?

If you look in the plink.assoc.logistic file we will see that this SNP, rs8135996, is associated with our phenotype.

Questions

Q. What is the p-value of association? What is the odds ratio?

Q. And what does the odds ratio mean anyway? If this is a disease trait, is it good news to have the 'G' allele? Or to have the 'A' allele? And how bad is it to have the risk allele?

Advanced Question. What does ADD mean in the output? Can you figure out how to run a non-additive test? Which mode of inheritance has the greatest evidence?

Forest plotting

If you followed other parts of our training on statistical modelling, we argued that what you want is to summarise the likelihood by its maximum likelihood estimate and its standard error - thus approximating the likelihood function by a gaussian function. For logistic regression the appropriate values are the log odds ratio and its standard error.

By default plink hasn't given us the standard error so let's compute it now. First let's get plink to give us the logistic regression estimate - the log odds ratio instead of the odds ratio. Reading the documentation for --logistic you'll see there is an option called "beta" to the --logistic command that does this.

"For logistic regressions, the 'beta' modifier causes regression coefficients instead of odds ratios to be reported."

Finally there is also a --ci option that computes a confidence interval. Let's apply those now:

./plink --vcf snp-example.vcf --logistic beta --ci 0.95 --pheno snp-example.samples --allow-no-sex

Load this output file into R now:

# in R
X = read_table( "plink.assoc.logistic" )

If you look at the result, you'll see it has output a BETA column instad of the original OR, and also has SE (standard error) and confidence interval columns. Let's plot that now:

(
ggplot( data = X )
+ geom_point( aes( x = BETA, y = SNP ))
+ geom_segment( aes( x = BETA - 1.96 * SE, xend = BETA + 1.96 * SE, y = SNP, yend = SNP ))
+ geom_vline( xintercept = 0, linetype = 2 )
+ xlab( "Log odds-ratio" )
+ ylab( "SNP" )
+ theme_minimal( 16 )
+ theme( axis.title.y = element_text( angle = 0, vjust = 0.5 ))
)

or in base R:

plot.betas <- function( betas, ses ) {
plot(
betas, 1:length(betas),
pch = 19, # make filled dots
main = "rs8135996 forest plot",
xlab = "log odds ratio",
ylab = "",
xlim = c( -0.5, 0 ),
yaxt = 'n' # Turn off y axis ticks
)
segments(
x0 = betas - 1.96 * ses,
x1 = betas + 1.96 * ses,
y0 = 1, y1 = 1,
col = 'grey'
)
abline( v = 0 ) # draw a solid black line at 0
grid()
}

plot.betas( X$BETA, X$SE )

Congratulations! This is your first GWAS forest plot (albeit with only one row).

Renaming output files

Running plink --logistic is all very well, but it's a bit annoying to have files named simply plink.logistic.assoc and so on. Instead lets re-run plink telling it to rename its output files (important since we’ll be generating lots of them in the practicals). The --out option can be used to do this:

./plink \
--vcf snp-example.vcf \
--logistic beta \
--beta \
--ci 0.95 \
--pheno snp-example.samples \
--allow-no-sex \
--out getting-started

Now the output files will be named getting-started.log and getting-started.assoc.logistic. This is the basic pattern to working with PLINK: specifying input files and analyses, along with an output name to save results.

Note

The backslashes in the above command are shell line continuation characters - they let us put the command on multiple lines. Some of these commands will get pretty long, so it's useful to put them on multiple lines like this!

With this background in using plink - it's time to go on to analysing a simulated GWAS study.