Skip to main content

How much of the genome is really in genes?

If you followed the previous section you'll have a function merge_regions() that can 'merge' a set of possibly overlapping regions into a set that don't overlap (but cover the same regions). Let's use this to correct our computation of the proportion of genome in genes now.

Note

If you didn't manage to complete it - don't worry. My final version is here:

Click the tab above to reveal the solution.

We can use this to compute the length of a set of regions as follows. (We'll put in a print statement so we can see what's happening.):

compute_length_of_regions <- function( input_regions ) {
merged = merge_regions( input_regions )
sum( merged$end - merged$start + 1 )
}

So let's go back to our code to compute the total length of genes on each chromosome and make it use this function:

compute_lengths_per_chromosome = function(
data,
chromosomes = contigs
) {
(
data
# Group by species / chromosome
%>% group_by( dataset, seqid )
# Add up the gene lengths
%>% summarise(
number_of_regions = n(),
naive_total_length_covered = sum( end - start + 1 ),
total_length_covered = compute_length_of_regions( pick( start, end ) )
)
# Add the chromosome lengths
%>% left_join(
chromosomes[, c( "dataset", "seqid", "attributes", "sequence_length" )],
by = c( "dataset", "seqid" )
)
)
}

Note

If you compare this to the previous, incorrect code you'll see it's the same, except that we've used our compute_length_of_regions() instead of just summing the lengths. (But I've kept the naive calculation in there as well for comparison.)

The pick() function is a way to create a tibble of just the start and end coordinates for each of our chromosomes, and pass it into our function.

Let's run that now to get the correct lengths:

correct_lengths = compute_lengths_per_chromosome( genes )
print( correct_lengths )

This takes a minute or two to run - after all, it has to run our merging loop for each species and chromosome.

Now at last we can compute the proportion of each genome covered by genes:

compute_proportion_of_genome_covered = function(
per_chromosome_lengths
) {
(
per_chromosome_lengths
%>% group_by( dataset )
%>% summarise(
naive_total_length_covered = sum( naive_total_length_covered ),
total_length_covered = sum( total_length_covered ),
total_genome_length = sum( sequence_length )
)
%>% mutate(
naive_proportion_covered = naive_total_length_covered / total_genome_length,
proportion_covered = total_length_covered / total_genome_length
)
)
}

It looks like this:

gene_proportions = compute_proportion_of_genome_covered( correct_lengths )
print( gene_proportions )
# A tibble: 11 × 6
dataset naive_total_length_covered total_length_covered total_genome_length naive_proportion_covered proportion_covered
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Acanthochromis_polyacanthus.ASM210954v1.110 422171067 415283627 830201259 0.509 0.500
2 Asparagus_officinalis.Aspof.V1.57 165883925 165883925 1015569172 0.163 0.163
3 Bufo_bufo-GCA_905171765.1-2022_05-genes 1290309576 1267077494 5003028965 0.258 0.253
4 Camelus_dromedarius.CamDro2.110.chr 825762826 815771759 2052758708 0.402 0.397
5 Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.chr 583352821 556604796 1041139641 0.560 0.535
6 Homo_sapiens.GRCh38.110.chr 1379802830 1305325027 3088286401 0.447 0.423
7 Mus_musculus.GRCm39.110.chr 1070748851 1040456122 2723431143 0.393 0.382
8 Pan_troglodytes.Pan_tro_3.0.110.chr 1110722691 1101782444 2967125077 0.374 0.371
9 Plasmodium_falciparum.ASM276v2.57 13776689 13577802 23292622 0.591 0.583
10 Plasmodium_knowlesi.ASM635v1.57 12970692 12970692 23462187 0.553 0.553
11 Plasmodium_vivax.ASM241v2.57 14052272 14020584 23768694 0.591 0.590

You can see that the correct proportion is a bit (but not much) smaller than the naive_total_region_length column.

Question

How much difference was there between the 'naive' calculation and the correct one? Was it big enough to worry about?