Skip to main content

Testing it out

Hopefully you have now got a working function, read_gff() and got the extremely rewarding message:

++ test_read_gff(): Congratulations,all tests passed!

In addition it will be useful for your function to also extract the 'gene_type' and 'gene_name' attributes, because these are useful in the gencode files.

If you haven't reached this point, don't worry! Here is my solution:

Solution

Please feel free to use your solution, if it's working. If not, see the tabs for my solutions.

Trying some real data

Let's use this to do some work on some real data now. Load up the gencode data:

gencode = read_gff( "gencode.v41.annotation.gff3.gz" )

Use your R skills from the Introduction to R tutorial or python skills to view these files and explore a bit - for example pulling out all gene records, or records pertaining to specific genes. (For example you could look at FUT2.)

The data 'verbs'

Now is a good point to introduce in a bit more detail a set of data manipulation verbs that make working with data frames easy. You've already been working with several of these - filter, group by, summarise, and arrange. We'll describe a couple more 'verbs' - 'join' and 'select' - below.

We'll calling them data 'verbs' because they do things to dataframes. For example

OperationDescriptionR/dplyr functionpandas/polars function
selectselects (and optionally renames) columnsselect()?
mutateadds columnsmutate()?
filterfilters rows based on column valuesfilter()?
arrange or sortorders rowsarrange()?
group bygroups rows based on column valuesgroup_by()?
summarisecomputes summary values over the rowssummarise()?
joinjoins two dataframes together, based on shared values.inner_join, outer_join, etc.?

You can build these into pipelines you can conduct complex data manipulation tasks in a highly expressive way.

For example, here's a simple way to use filter():

filter( X, ID == `ENSG00000176920.13` )

If we also wanted to filter out

Note

For example, here's a cool way to filter the dataframe - using a kind of 'pipe', just like the one in bash. Instead of using filter() like this:

filter( gencode, ID == `ENSG00000176920.13` )

you can write

X %>% filter( ID == `ENSG00000176920.13` )

Here %>% plays the same role as | does in bash - it pipes the output of one command into the input of the next. The advantage is that you can put multiple things in the same pipeline. For example let's find all the FUT2 transcripts that start before chr19:48,696,000:

(
gencode
%>% filter( Parent == 'ENSG00000176920.13' )
%>% filter( start < 48696000 )
)

So, just like in the command-line, you can build up pipelines of commands to get the data you want. This filtering syntax is a feature of dplyr, which is part of tidyverse.

If it's working, well done!

Note

Another great thing to do is group and count the data - much like the pipeline using uniq -c in BASH. For example let's make a count of record types:

(
gencode
%>% group_by( type )
%>% summarise( count = n() )
)

Note. Another way to do this is R's built-in function table():

table( gencode$type )
Warning

As you start to play around with loading multiple files, keep an eye on the memory usage of your process. (You can do this in your system monitor, or by opening a terminal and running:

  • top -u <username> -o '%MEM' on linux or Ubuntu for Windows; or
  • top -U gav -o MEM in Mac OS

Next steps

A better way to solve the memory issue to store the data in a database and only load what's needed into memory - we'll see a way to do that later. But first let's package up the code.