What gene annotation data looks like
The general feature format
For this tutorial we will work with data the 'General Feature Format' or GFF3. You can read its specification here. Along with the closely-related GTF format, it is the most commonly-used format for storing gene annotations.
Before starting you will need to make sure you have a good understanding of what's in the data. To get started, create a new directory to work in:
mkdir genes_programming; cd genes_programming
Now start by downloading three files:
The human gene annotations from the GENCODE project. (The 'Comprehensive gene annotation' on the
CHR
regions will do - make sure to get the 'GFF3'-formatted files.)Another version of the human gene annotations from Ensembl. (Note: the file
Homo_sapiens.GRCh38.107.chr.gff3.gz
will do for now - this contains genes that Ensembl could assign to one of the main chromosomes.)The Plasmodium falciparum (malaria parasites) from Plasmodb.
A backup copy of these files can be found in this folder.
Use your command-line skills to download these files to the
genes_programming
folder and have a look at them. Feel free to download data for more species as well. (To keep
things organised, we recommend working in a new folder called 'genes_programming' or similar.)
Note
You can also run the "exploring gene annotations in BASH" tutorial, if you haven't already, to gain an understanding of these files.)
For example, the GENCODE file looks something like this:
##gff-version 3
#description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104)
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2021-03-12
##sequence-region chr1 1 248956422
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972.5;gene_id=ENSG00000223972.5;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;level=2;hgnc_id=HGNC:37102;havana_gene=OTTHUMG00000000961.2
chr1 HAVANA transcript 11869 14409 . + . ID=ENST00000456328.2;Parent=ENSG00000223972.5;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
chr1 HAVANA exon 11869 12227 . + . ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;exon_number=1;exon_id=ENSE00002234944.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
chr1 HAVANA exon 12613 12721 . + . ID=exon:ENST00000456328.2:2;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;exon_number=2;exon_id=ENSE00003582793.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
...
Question
GFF files from different projects differ slightly in what's in them. Can you spot differences in how the three files above are encoded?
Note
Each project has its own version of the file format specification as well. For example see:
the GENCODE file format page (this actually described
GTF
format, but it's almost identical and has the same basic encoding. We'll useGFF3
here.)The PlasmoDB files are similar to the Ensembl ones, but if you look closely you'll see the file has some difference to both those above.
Making some test files
These files are pretty big. For code development purposes it is useful to make some smaller datasets that don't use up so much space. Make these now by taking the first 1,000 lines of each file:
cat gencode.v41.annotation.gff3 | head -n 1000 > gencode.v41.annotation.head.gff3
and so on.
Note
If your data file ends in .gz
, it is gzip-compressed. You will need to use gunzip -c
instead of cat
. This
will decompress the data and pipe it into head
.
Aside: how are gene annotations created anyway?
Gene annotation pipelines are pretty complicated - they involve prediction of genes from genome sequence, alignment of known transcript sequences and short-read RNA, transfer of annotations from other genomes, and other steps.
The Ensembl gene annotations are created using a particular pipeline that is described in detail here. This includes a manual curation step, 'HAVANA', which is applied to the best-annotated genomes including humans, mice, zebrafish and rats.
GENCODE is based on the Ensembl annotation as described here but with extra annotation added.
The NCBI is a second project that creates gene annotations (NCBI RefSeq). Their pipeline for Eukaryotes is described here. The process looks like this:
Simple!
An important thing to notice is that these annotations are made from severe data sources - including computational predictions from the genome assemblies themselves, but also alignments of known RNA and protein sequences, and a great deal of automatic and for some genomes manual curation. They shouldn't be regarded as the whole truth about genes, but as a current best guess - and for some genomes like humans they are fairly complete.
Getting started coding
In this tutorial we're going to write some code to process these data files. When you're ready, get started writing some code.