Getting chromosome lengths

To answer questions like 'how much of the genome is in genes' we need to know what the length of each genome is. Happily, this data is already in the GFF files - though it's a bit awkward to get at. In fact:

Some of the GFF files have records with type='chromosome' in, that reflect this data
while others have records with type='region'.
still others have records with type='supercontig'.
Some GFF files - like the ones from PlasmoDB - don't have them at all. (Luckily there are files from Ensembl protists which we can use instead.)

Here is some dbplyr code to load these chromosomes.

contigs = (
    db
    %>% tbl( 'gff' )
    %>% filter(  type %in% c( 'chromosome', 'region', 'supercontig' ))
    %>% select( dataset, ID, type, seqid, source, start, end, attributes )
    %>% collect()
)

As a sanity check let's make sure that this really does capture the seqid values for gene records. We'll use the trusy stopifnot() for this:

table( genes$dataset, genes$seqid %in% contigs$seqid )

                                                     TRUE
  Acanthochromis_polyacanthus.ASM210954v1.110       24016
  Asparagus_officinalis.Aspof.V1.57                 24141
  Bufo_bufo-GCA_905171765.1-2022_05-genes           20900
  Camelus_dromedarius.CamDro2.110.chr               18896
  Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.chr 16711
  Homo_sapiens.GRCh38.110.chr                       20046
  Mus_musculus.GRCm39.110.chr                       21633
  Pan_troglodytes.Pan_tro_3.0.110.chr               21879
  Plasmodium_falciparum.ASM276v2.57                  5358
  Plasmodium_knowlesi.ASM635v1.57                    5102
  Plasmodium_vivax.ASM241v2.57                       5389

Good!

This suggests we can compute the genome lengths.

contigs$sequence_length = contigs$end - contigs$start + 1

(
    contigs
    %>% group_by( dataset, type )
    %>% summarise(
        number = n(),
        length = sum(sequence_length)
    )
    %>% mutate(
        length_in_mb = sprintf( "%.1fMb", length / 1E6 )
    )
)

 dataset                                           type        number     length length_in_mb
   <chr>                                             <chr>        <int>      <dbl> <chr>       
Acanthochromis_polyacanthus.ASM210954v1.110       region       30414  991584656 991.6Mb     
Asparagus_officinalis.Aspof.V1.57                 chromosome      10 1111887942 1111.9Mb    
Asparagus_officinalis.Aspof.V1.57                 supercontig  11782   75650082 75.7Mb      
Bufo_bufo-GCA_905171765.1-2022_05-genes           region        1306 5044744194 5044.7Mb    
Camelus_dromedarius.CamDro2.110.chr               region          37 2052758708 2052.8Mb    
Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.chr region          42 1041139641 1041.1Mb    
Homo_sapiens.GRCh38.110.chr                       chromosome      25 3088286401 3088.3Mb    
Mus_musculus.GRCm39.110.chr                       chromosome      22 2723431143 2723.4Mb    
Pan_troglodytes.Pan_tro_3.0.110.chr               chromosome      26 2967125077 2967.1Mb    
Plasmodium_falciparum.ASM276v2.57                 chromosome      14   23292622 23.3Mb      
Plasmodium_knowlesi.ASM635v1.57                   chromosome      14   23462187 23.5Mb      
Plasmodium_vivax.ASM241v2.57                      chromosome      14   22621071 22.6Mb      
Plasmodium_vivax.ASM241v2.57                      supercontig   2733    4386630 4.4Mb       

Note

You might be wondering what 'region', 'chromosome', and 'supercontig' are - if so, so am I!

Looking up these genomes in NCBI Datasets sheds some light on this. For example, try looking up the Plasmodium vivax assemblyy. You should see something like this:

In short the total genome size is 27Mb (check!) but some of this is in 2,747 'scaffolds' that are not chromosomes. (They appear in this GFF file as 'supercontig'.) These are, I think, bits of genome that have been assembled but the assemblers don't quite know where they should go. In principle they might be parts of nuclear chromosomes, parts of the mitochondria or other organelles like the apicoplast, or other bits of DNA that were sequenced but not clear where they should go.

However other genomes, like Bufo bufo, seem to have only 'region' records even though several of those do seem to represent chromosomes. It seems that files from the Darwin Tree of Life project, like the Dromedary Camel genome, use 'region' entries. (That number 37 is not wrong - camels have 74 diploid chromosomes).

In any case - for our purposes, let's use the 'regions', 'chromosomes', and 'supercontigs' to represent the genome length.

Ok! Armed with chromosome lengths, we are ready to ask how much of the genome is in genes?