Skip to main content

Counting genes again

So why are there 60,000 genes in the file - isn't that too many?

If you didn't work this out already, go back and use less -S to look at the 'gene' records in the file again, and remember that gene_type attribute. Many of the records actually don't say they are protein-coding genes but something else. (For example ENSG00000223972.6 is a ['transcribed unprocessed pseudogene', i.e. something that makes mRNA but there isn't evidence it is translated to protein - see the Ensembl biotypes page for a more specific definition.)

Let's try to count just the protein-coding ones. To do this we will use a couple of commands - awk which we are here using just to select rows with "gene" in the type column, and wc which will count the number of lines:

cat gencode.v41.annotation.gff3  | awk '$3=="gene"' | grep 'gene_type=protein_coding' | wc -l
20017

This is a much more sensible number - there are about 20,000 protein-coding genes in the human genome. That’s a lot but we are big animals!