How much of the genome is really in genes?

If you followed the previous section you'll have a function merge_regions() that can 'merge' a set of possibly overlapping regions into a set that don't overlap (but cover the same regions). Let's use this to correct our computation of the proportion of genome in genes now.

Note

If you didn't manage to complete it - don't worry. My final version is here:

Hint
Solution

Click the tab above to reveal the solution.

merge_regions = function( input_regions ) {
   # Catch the case where there's only one or no regions
   # There's nothing to do in this case, may as well just return the input regions:
   if( nrow( input_regions ) < 2 ) {
      return( input_regions ) ;
   }

   # Otherwise:
   # Sort regions by position
   input_regions = input_regions %>% arrange( start, end )

   # Create variables to put the result in
   result = tibble()
   current_merged_region = input_regions[1,]

   # Loop over regions
   for( i in 2:nrow(input_regions)) {
      next_region = input_regions[i,]

      overlapping = (next_region$start <= current_merged_region$end)
      if( overlapping ) {
         current_merged_region$end = max( current_merged_region$end, next_region$end )
      } else {
         result = rbind( result, current_merged_region )
         # Start a new region!
         current_merged_region = next_region
      }
   }

   # Add the last region
   result = rbind(
      result,
      current_merged_region
   )

   return( result )
}

(You can run the test to make sure it works.)

We can use this to compute the length of a set of regions as follows. (We'll put in a print statement so we can see what's happening.):

compute_length_of_regions <- function( input_regions ) {
    merged = merge_regions( input_regions )
    sum( merged$end - merged$start + 1 )
}

So let's go back to our code to compute the total length of genes on each chromosome and make it use this function:

compute_lengths_per_chromosome = function(
   data,
   chromosomes = contigs
) {
   (
      data
      # Group by species / chromosome
      %>% group_by( dataset, seqid )
      # Add up the gene lengths
      %>% summarise(
         number_of_regions          = n(),
         naive_total_length_covered = sum( end - start + 1 ),
         total_length_covered       = compute_length_of_regions( pick( start, end ) )
      )
      # Add the chromosome lengths
      %>% left_join(
         chromosomes[, c( "dataset", "seqid", "attributes", "sequence_length" )],
         by = c( "dataset", "seqid" )
      )
   )
}

Note

If you compare this to the previous, incorrect code you'll see it's the same, except that we've used our compute_length_of_regions() instead of just summing the lengths. (But I've kept the naive calculation in there as well for comparison.)

The pick() function is a way to create a tibble of just the start and end coordinates for each of our chromosomes, and pass it into our function.

Let's run that now to get the correct lengths:

correct_lengths = compute_lengths_per_chromosome( genes )
print( correct_lengths )

This takes a minute or two to run - after all, it has to run our merging loop for each species and chromosome.

Now at last we can compute the proportion of each genome covered by genes:

compute_proportion_of_genome_covered = function(
    per_chromosome_lengths
) {
    (
        per_chromosome_lengths
        %>% group_by( dataset )
        %>% summarise(
            naive_total_length_covered = sum( naive_total_length_covered ),
            total_length_covered = sum( total_length_covered ),
            total_genome_length = sum( sequence_length )
        )
        %>% mutate(
            naive_proportion_covered = naive_total_length_covered / total_genome_length,
            proportion_covered = total_length_covered / total_genome_length
        )
    )
}

It looks like this:

gene_proportions = compute_proportion_of_genome_covered( correct_lengths )
print( gene_proportions )

# A tibble: 11 × 6
   dataset                                           naive_total_length_covered total_length_covered total_genome_length naive_proportion_covered proportion_covered
   <chr>                                                                  <dbl>                <dbl>               <dbl>                    <dbl>              <dbl>
Acanthochromis_polyacanthus.ASM210954v1.110                        422171067            415283627           830201259                    0.509              0.500
Asparagus_officinalis.Aspof.V1.57                                  165883925            165883925          1015569172                    0.163              0.163
Bufo_bufo-GCA_905171765.1-2022_05-genes                           1290309576           1267077494          5003028965                    0.258              0.253
Camelus_dromedarius.CamDro2.110.chr                                825762826            815771759          2052758708                    0.402              0.397
Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.chr                  583352821            556604796          1041139641                    0.560              0.535
Homo_sapiens.GRCh38.110.chr                                       1379802830           1305325027          3088286401                    0.447              0.423
Mus_musculus.GRCm39.110.chr                                       1070748851           1040456122          2723431143                    0.393              0.382
Pan_troglodytes.Pan_tro_3.0.110.chr                               1110722691           1101782444          2967125077                    0.374              0.371
Plasmodium_falciparum.ASM276v2.57                                   13776689             13577802            23292622                    0.591              0.583
Plasmodium_knowlesi.ASM635v1.57                                     12970692             12970692            23462187                    0.553              0.553
Plasmodium_vivax.ASM241v2.57                                        14052272             14020584            23768694                    0.591              0.590

You can see that the correct proportion is a bit (but not much) smaller than the naive_total_region_length column.

Question

How much difference was there between the 'naive' calculation and the correct one? Was it big enough to worry about?