Practical outline
Overview
In this tutorial we will demonstrate a basic pipeline for analysing paired-end short-read genomic sequencing data. We will start with raw data in a FASTQ file, inspect quality control metrics, align the data, and then use it to look for genetic variation.
Prerequisites
If you got here you should hopefully have already downloaded the practical data - if not, please follow the instructions for doing that on the prerequisites page, and then come back here.
A look at the data
You should now have a folder called sequence_data_analysis/
filled with a number of data
files. Now would be a good point to explore what's in there. The folder contains
sequence data reads from malaria parasites (under the
malaria/
folder). These reads are in gzipped fastq format.A malaria reference genome assembly (
Pf3D7_v3.fa
) in FASTA format.A similar set of sequence data reads and reference genome for a human sample - in the
human/
folder.
Note. I placed online a set of solutions files for steps in the practical. Feel free to check these as you go along.
During the practical we'll bring one or more of these datasets to an analysis-ready state.
To get started, start a terminal window and change directory into that folder:
cd sequence_data_analysis
The practical in a nutshell
This practical works as follows: for each step there's a page giving you some information about how to run the step; the page then links back to this one so you can see the next step.
Note. Most of these example work with the malaria data in:
malaria/QG0033-C_Illumina-HiSeq_read1.fastq.gz
malaria/QG0033-C_Illumina-HiSeq_read2.fastq.gz
If you want to go off-piste, feel free to work with any (or all) of the others (that's one of the Challenge questions). Just remember that the human data will align to human genome and the malaria data to the malaria genome.
Go!
The steps are as follows. Can you work out the answers to the questions below? Please make a note of the answers for the consolidation session.
Questions: How many read pairs are in the file? What is the read length?
Questions: What is the GC content in the reads? What is the fragment duplication rate? Are there any sequencing artifacts?
Questions: How are the reads represented in the aligned output file? How many reads were aligned? How many were not nmapped?
Questions: Can you find a SNP? An insertion or deletion? A structural variant?
Challenge questions
If you get this far, congratulations - you're an expert!
To test your mettle, here are some challenge questions. Good luck!