Skip to main content

Counting genes

For simplicity, from here on in we will work with the GFF3 files from the Ensembl ftp site (vertebrate genomes) or from Ensembl genomes (non-vertebrate genomes). These use the terminology mRNA for a gene transcript, and they also have the genome sequence lengths written in the metadata, making life easy for us. Before starting, download some of these files now.

We will focus on protein-coding genes, and their transcripts, exons and coding sequence. They have type gene, mRNA, exon and CDS in the files Ensembl respectively. By now you should be familiar with the basic hierarchy:

  • Each gene can have multiple transcripts - they might differ in how exons are spliced together, or have different transcription start or end sites.

  • Each transcript can have one or more 'exon' records indicatiing exons. The introns in between are spliced out.

  • Each transcript, if it is for a protein-coding gene, can also have one or more 'CDS' records. These denote the coding sequence and the rest is untranslated sequence.

The files we're looking are (roughly speaking) humanities' best guess at what this picture looks like in each organism.

Getting data

I'm going to assume you have found a set of files from the Ensembl ftp site or elsewhere, and run them through your gff_to_sqlite.R program to get them into a database - say genes.sqlite. If not please go and do that now. For example, like this:

Rscript --vanilla gff_to_sqlite.R --attributes Name biotype --input http://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.110.chr.gff3.gz --output genes.sqlite
Rscript --vanilla gff_to_sqlite.R --attributes Name biotype --input http://ftp.ensembl.org/pub/current_gff3/pan_troglodytes/Pan_troglodytes.Pan_tro_3.0.110.chr.gff3.gz --output genes.sqlite
Rscript --vanilla gff_to_sqlite.R --attributes Name biotype --input http://ftp.ensembl.org/pub/current_gff3/camelus_dromedarius/Camelus_dromedarius.CamDro2.110.chr.gff3.gz --output genes.sqlite
# ...and so on

I suggest to pick a few species for now - choose ones that interest you.

Note

There are a lot to choose from. For example you can get:

Yet another place to look is the Darwin Tree of Life Data Portal, which at the time of writing has nearly 300 genomes in 'Annotation complete' state.

What are all those species?

If you're confused about how all those species relate, it's time to look at OneZoom.org:

For example, did you know that Dolphins, like Camels, are cloven-hoofed ungulates? Did you know that all species of ape - except humans - are endangered or critically endangered?

Another place to look is the Ensembl species list and the Ensembl species tree. There are also similar pages for each of the branches of life on the Ensembl Genomes site.

Pick some genomes to start with and load them into your database. Once done, you are actually in a good shape for many of our scientific questions. Let's start by counting genes:

Counting genes in sqlite

A simple way to count genes is to use sqlite. For example using the sqlite file you created in the last step, you can count all gene records like this:

sqlite> SELECT COUNT(*) FROM gff WHERE type == 'gene' ;
Note

To run the above, you have to be working interactively in sqlite3 - you get there by typing sqlite3 genes.sqlite in the shell. Type .quit again when you want to quit.

Alternatively you can run from the shell directly, for example to produce a csv file with column names:

% sqlite3 -csv -header genes.sqlite "SELECT COUNT(*) FROM gff WHERE type == 'gene'"

Of course, that just counts all the genes in the file - we'd like it split by species. Simple! If you followed the suggestions above your file will also have a dataset column that lets you differentiate the records for different species. So you can make these counts for multiple species:

sqlite> .mode column
sqlite> .header on
sqlite> SELECT dataset, COUNT(*) FROM gff WHERE type=='gene' GROUP BY dataset ;

In my data, which includes spiny chromis, dromedary camels, red junglefowl, humans, mice, chimpanzees, and malaria parasites, this gives:

dataset                                      COUNT(*)  
------------------------------------------- ----------
Acanthochromis_polyacanthus.ASM210954v1.104 24027
Camelus_dromedarius.CamDro2.104.chr.gff3 18919
Gallus_gallus.GRCg6a.104 16666
Homo_sapiens.GRCh38.104 21451
Mus_musculus.GRCm39.104 25655
Pan_troglodytes.Pan_tro_3.0.104.chr 22056
PlasmoDB-54_Pfalciparum3D7 5318

Interestingly, both house mice and spiny chromis have more (annotated) genes than humans. Chimpanzees have a similar number, while Red junglefowl and Dromedary Camels have respectively 17% and 5% fewer (annotated) protein-coding genes than humans. Plasmodium falciparum has about a quarter of the number of genes. (But that's still pretty impressive because its genome size is less than a hundredth that of humans.)

Note

To make the above query work the P.falciparum data from PlasmoDB, I kludged it by running this sql:

UPDATE gff
SET type = 'gene', biotype = 'protein_coding'
WHERE dataset LIKE 'PlasmoDB-%_Pfalciparum3D7'
AND type == 'protein_coding_gene' ;

I wouldn't generally recommend this type of manually-fix-the-data-until-it-works approach (not least because it would have to be re-done every time we imported new data) but it'll do for this tutorial.

Counting genes in R or python

So how would we do this in R or python? One of the messages of this tutorial is that the same operations are generally possible in different languages, with minor differences in syntax. The tabs below show a few different ways:

First we will load the whole dataframe into R. To do this we have to open a 'connection' to the database and then read from it. (We'll filter down to genes here as well, so that we don't use up too much memory).

library( RSQLite )
library( dplyr )
db = DBI::dbConnect( RSQLite::SQLite(), "genes.sqlite" )
genes = dbGetQuery( db, "SELECT * FROM gff WHERE type == 'gene'" )

Now let's group by dataset, and summarise just as we did in sql:

(
genes
%>% group_by( dataset )
%>% summarise( count = n() )
)

You should see the same output as above, but now in R:

# A tibble: 7 × 2
dataset count
<chr> <int>
1 Acanthochromis_polyacanthus.ASM210954v1.110.gff3 24027
2 Camelus_dromedarius.CamDro2.110.chr.gff3 18919
3 Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.chr.gff3 18121
4 Homo_sapiens.GRCh38.110.chr.gff3 21532
5 Mus_musculus.GRCm39.110.chr.gff3 5946
6 Pan_troglodytes.Pan_tro_3.0.110.chr.gff3 22056
7 PlasmoDB-65_Pfalciparum3D7 5318