Getting the data
You will be using four sets of files for this exercise. We recommend downloading them to your working directory before you start. To get started, make a new directory to run the tutorial in and cd into it:
mkdir gwas_tutorial
cd gwas_tutorial
Now download the data...
curl -O https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/gwas/quantitative_trait_gwas.tgz
...and unpack it:
tar -xzf quantitative_trait_gwas.tgz
Once the data is unpacked, you can safely delete the quantitative_trait_gwas.tgz file to save space, if you like.
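Before deleting the archive, it can be worth checking what it contains. The snippet below is a minimal sketch: list_archive is a hypothetical helper name (not part of the tutorial data), and it relies only on the standard tar -t (list) option.

```shell
# Hypothetical helper: list the first entries of a .tgz archive without unpacking it.
list_archive() {
    archive="$1"
    if [ -f "$archive" ]; then
        # -t lists contents, -z handles gzip compression, -f names the archive
        tar -tzf "$archive" | head -n 20
    else
        echo "$archive not found (already deleted?)"
    fi
}

list_archive quantitative_trait_gwas.tgz
```

If the listing shows the folders you expect, the archive itself is no longer needed.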
From here on, we'll assume you are working within the gwas_tutorial folder. If you ran the cd command above, you are already there and do not need to run it again.
What's in the data?
You should now have four folders which have this structure:
gwas_tutorial/
    Genotype_data/
        AMR_genotypes.bed
        ...
    Imputed_data/
        AMR_imputed.gen.gz
        ...
    Phenotype_data/
        AMR_phenotype.txt
        ...
    Example_scripts/
        ...
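A quick way to confirm that the unpacking worked is to check for the four folders from inside gwas_tutorial. This is only a sketch: check_folders is a hypothetical helper name, and the folder names are taken from the structure shown above.

```shell
# Hypothetical helper: report which of the four expected folders are present
# in the current directory, with a count of the entries in each.
check_folders() {
    for d in Genotype_data Imputed_data Phenotype_data Example_scripts; do
        if [ -d "$d" ]; then
            echo "found:   $d ($(ls "$d" | wc -l) entries)"
        else
            echo "missing: $d"
        fi
    done
}

check_folders
```

Any line starting with "missing:" suggests that you are in the wrong directory or that the archive did not unpack fully.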
The data is as follows:
The Genotype_data/ folder contains sample genotypes, as typed on a DNA microarray. Take a look: you should see three files, a .bed file, a .bim file, and a .fam file. (These are in plink binary format, a commonly used file format for GWAS data.)
The Imputed_data/ folder contains imputed genotype files: the same data 'imputed' up to the full 1000 Genomes Phase 3 reference panel. These are like the array genotype files but at a much higher resolution (more genetic variants), the extra variants having been filled in using genotype imputation.
The Phenotype_data/ folder contains (guess what?) phenotype information. There are two files: one reports the measured norovirus antibody response for each sample, and the other records the secretor status.
There are also some example scripts in the Example_scripts/ directory.
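To get a feel for the files described above, you can print their first few lines. The sketch below makes some assumptions. peek is a hypothetical helper, not one of the example scripts. The .bim and .fam file names are guessed from the AMR_genotypes.bed prefix, following the usual plink convention of a shared prefix. The .bed file is binary, so it is not printed.

```shell
# Hypothetical helper: print the first lines of a file (default 5),
# decompressing .gz files on the fly. Reports missing files instead of failing.
peek() {
    f="$1"; n="${2:-5}"
    if [ ! -f "$f" ]; then
        echo "not found: $f"
        return 0
    fi
    case "$f" in
        *.gz) gzip -dc "$f" | head -n "$n" ;;
        *)    head -n "$n" "$f" ;;
    esac
}

# File names other than those listed in the folder structure are assumptions.
peek Genotype_data/AMR_genotypes.fam
peek Genotype_data/AMR_genotypes.bim
peek Phenotype_data/AMR_phenotype.txt
# Imputed genotype lines are very long, so show only the start of each line.
peek Imputed_data/AMR_imputed.gen.gz 2 | cut -c 1-120
```

If a file is reported as not found, list the folder with ls to see the actual file names.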
Next steps
As with any GWAS, the first step is to summarise the data and perform quality control.