Skip to main content

Getting the data

You will be using four sets of files for this exercise. It is recommended that you download these files to your working directory prior to starting the exercise Genotype data files. To get started, make a new directory to run the tutorial in and cd into it:

mkdir gwas_tutorial
cd gwas_tutorial

Now download the data...

curl -O https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/gwas/quantitative_trait_gwas.tgz

...and unpack it:

tar -xzf quantitative_trait_gwas.tgz

Once the data is unpacked, you can safely delete the quantitative_trait_gwas.tgz file to save space, if you like.

Finally, to get started, we'll assume you are working within the gwas_tutorial folder:

cd gwas_tutorial

What's in the data?

You should now have four folders which have this structure:

gwas_tutorial/
Genotype_data/
AMR_genotypes.bed
...
Imputed_data/
AMR_imputed.gen.gz
...
Phenotype_data/
AMR_phenotype.txt
...
Example_scripts/
...

The data is as follows:

  • The Genotype_data/ folder contains sample genotypes, as typed on a DNA microarray. Take a look - you should see three files - a .bed file, a .bim file, and .fam file. (These are in plink binary format, which is a commonly-used file format for GWAS data.)

  • The Imputed_data folder contains imputed genotype files, which is the same data 'imputed' up to the full 1000 Genomes Phase 3 reference panel. These are like the array genotype files but at a much higher resolution (more genetic variants), having been filled in using genotype imputation.

  • The Phenotype_data folder contains (guess what?) phenotype information. There are two files: one reports the measured norovirus antibody response for each sample, and the other reflects the secretor status.

  • There are also some example scripts are in the Example_scripts directory.

Next steps

As with all GWAS, the first place to start is to summarise and perform quality control of the data.