Introduction
In this tutorial you will take a set of FASTQ files, representing reads from an Illumina sequencing platform, through a basic NGS data processing pipeline. You will QC and align the reads, identify and remove duplicate reads, compute coverage, and generate an initial set of variant calls.
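To give a sense of where you're headed, those steps could be sketched, for a single sample, roughly as follows. This is an illustrative outline only, not something to run now: the filenames are placeholders, and a real pipeline involves extra details that the tutorial will walk through.

```
# Illustrative outline only - filenames here are placeholders.
fastqc sample_R1.fastq.gz sample_R2.fastq.gz               # QC the raw reads
bwa index reference.fa                                     # index the reference (once)
bwa mem reference.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools fixmate -m - - \
  | samtools sort -o sample.bam -                          # align, fix mate info, sort
samtools markdup sample.bam sample.dedup.bam               # mark duplicate reads
bedtools genomecov -ibam sample.dedup.bam > coverage.txt   # compute coverage
octopus -R reference.fa -I sample.dedup.bam -o calls.vcf   # call variants
```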
Before starting you will need to set up a few things - some data and some software. This page details what you need.
Obtaining the needed software
You will need quite a bit of software to implement this pipeline, including:
- The bwa software, for aligning reads.
- samtools, for general data manipulation.
- The pipelining tool snakemake. (Or, if you prefer, you can use another workflow management tool of your choice. WDL and Nextflow are two possibilities. This tutorial focuses on snakemake though, so if you go another route, you'll have to work out the details yourself.)
- For QC: fastqc and multiqc.
- For coverage calculations: bedtools.
- For variant calling: octopus.
The simplest way to get this software is to install conda, and to add the bioconda and conda-forge channels. At that point installing the software should be as easy as running:
mamba install bwa samtools snakemake fastqc bedtools octopus
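Once the install completes, you can sanity-check that each tool is on your PATH with a short shell loop. This just reports anything missing; it doesn't check versions:

```shell
# Report any of the required tools that are not found on the PATH.
missing=""
for tool in bwa samtools snakemake fastqc multiqc bedtools octopus; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "Missing tools:$missing"
else
  echo "All tools found."
fi
```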
If this doesn't work, or you are using another system, alternatives are to use your system's package manager or to install the software from source. Please try to install everything before starting the tutorial.
Obtaining the data
In this tutorial we will work with a set of fastq files representing P. falciparum (malaria) sequence reads from 5 samples. The data comes from the MalariaGEN Pf6 open resource (described further in the corresponding publication) and is publicly available via the European Nucleotide Archive.
To get started, create a new directory and cd into it:
mkdir ngs_tutorial
cd ngs_tutorial
To get the data, download the ngs_pipeline_data.tgz data tarball. For example, you can do this using curl:
curl -O https://www.chg.ox.ac.uk/~gav/projects/chg-training-resources/data/sequence_data_analysis/building_an_ngs_pipeline/ngs_pipeline_data.tgz
This will take a minute or two to download. Once it has finished, extract the tarball into your ngs_tutorial directory:
tar -xzf /path/to/ngs_pipeline_data.tgz
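If you haven't used tar before, here is a self-contained demonstration of creating and then extracting a .tgz archive. It uses throwaway files rather than the tutorial data, so you can try it anywhere:

```shell
# Make some throwaway files and pack them into a gzipped tarball.
mkdir -p tar_demo/data/reads
echo "example" > tar_demo/data/reads/example.txt
tar -czf demo.tgz -C tar_demo .
# Extract the tarball into a fresh directory, as you will do with the real data.
mkdir -p tar_demo_extracted
tar -xzf demo.tgz -C tar_demo_extracted
ls tar_demo_extracted/data/reads   # -> example.txt
```

(The -C flag tells tar to change into the given directory before creating or extracting; without it, tar works in the current directory, as in the command above.)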
(Once this is successful you can delete the downloaded tarball if you wish).
Have a look at what has been downloaded. You should see:
- A samples file, samples.tsv. This lists the identifiers, accessions, and original filenames for the five samples that we will process. Have a look at it now using less or in a text editor.
- Some sequence data reads in the data/reads folder.
- A reference genome assembly, named Pf3D7_v3, in data/reference.
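Your pipeline will need the sample information from samples.tsv. As a warm-up, here is one way to pull a single column out of a tab-separated file using cut. Note that the file and column layout below are made up for illustration; the real samples.tsv may be laid out differently:

```shell
# Build a made-up tab-separated samples file (the real samples.tsv may differ).
printf 'sample_id\taccession\tfilename\n' > example_samples.tsv
printf 'sample_A\tERR0000001\treads_A.fastq.gz\n' >> example_samples.tsv
printf 'sample_B\tERR0000002\treads_B.fastq.gz\n' >> example_samples.tsv
# Print the first column (sample identifiers), skipping the header line.
cut -f1 example_samples.tsv | tail -n +2
```

This prints sample_A and sample_B, one per line.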
Note
If you want to, instead of using the supplied read files you can go and get the original versions of the fastq files from the links supplied in the samples.tsv file. The data as deposited on ENA is about 12Gb; for this tutorial I have downsampled it to around 2Gb to make things a bit quicker and use less space.
Getting started
You're all set! Now go to the pipeline page.