Skip to main content

A first go at parsing GFF

This tutorial will lead you through a process of writing some code to load GFF data into either python or R. We will then use this to create a useful utility program.

The aim of this tutorial is that you to write the code that does this (either on your own or working in a group if you prefer). But the tutorial will guide you through one way to do it, and if you run this as part of a WHG course, there will be lots of support.

Note

Remember we're not writing this code for it's own sake but to answer our questions like the ones in our introduction:

  • How many genes are there?
  • How big are they?
  • How much of the genome is in genes?
  • How complex are genes - How many exons? How many different transcripts?
  • How much of genes is actually protein-coding sequence - and how much is untranslated?
  • How much do these patterns differ across species?

Or indeed any other questions you are interested in.

Nevertheless for the moment we will focus on the programming part of this task, and leave the analysis part until later. The idea is to guide you to some useful ways to approach coding.

Writing code

How to start writing code? Well, it is worth noting that there are libraries around that will load GFF data for you. (rtracklayer in R is one). Conversely, you could use very low-level approaches to doing this - build the data structures yourself. But this tutorial will take a 'middle' way - we will use population general 'dataframe' libraries to load the data - tidyverse in R and pandas in python.

Note

You can choose to use R or python. One of the interesting things of these libraries is that they are more similar than they are different, so we can write almost the same code in both programming languages.

In the course of the tutorial we'll develop a little R or python library to help us answer the questions.

Diving straight in - parsing data

If you looked at the gene annotation data, you'll know that it comes in rows of data that are tab-delimited. That's good and will be a good fit to a dataframe structure. But the data is also relational (meaning that the records refer to each other, via the Parent attribute). And since this has several levels (e.g. exons are associated with transcripts, which are in turn associated with genes), we might ultimately have to build some form of hierarchical data structure to capture this.

That sounds complex, so let's break off a manageable first bit of the job by just focussing on getting the data in. We'll start by trying to write a function that loads the data. We'll call this function read_gff() because that's what it will do. It will look like this:

read_gff = function( file ):
result = (some code to load the data from the file here)
return( result )
and can be run like this:
X = read_gff( "gencode.v41.annotation.head.gff3" )

This should produce a dataframe with columns id, parent, seqid, source, type, start, end, score, strand, phase, and attributes.

Simple! If only we knew what bit of code to write in the function there.

Challenge

Can you write read_gff()?

Note

To make your function really good, here are a few things it should get right:

  • It should parse the main data records (skipping over the metadata).
  • it should get the column names right
  • it should get the data types of columns right.
  • it should handle missing data values appropriately.

And, because we want to capture the relational structure of the file:

  • it should extract out the ID and Parent attributes as new columns.

In other words - your function has to pass the test in the next section. Good luck!

If you don't know where to start - don't worry, we will walk through a process of writing this below.

Test-driven development

Funnily enough our function above already has a useful property, even though we haven't yet written it yet! It is already testable. To see this let's design some test data:

test_data = "##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"
Note

The \t's in there are tab characters. This is how R and python (and most other programming languages) encode tabs. If you want to see the full data (with tabs expanded), use cat() in R or print() in python:

cat( test_data )

You should see that it's properly GFF formatted.

Now let's use that test data to write a test capturing our requirements:

test_read_gff = function() {
test_data = "##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"
cat( "Using test data:\n" )
cat( test_data )
# 1. run our function to parse the data:
gff = read_gff( test_data )
print(gff)
# 2. test it:
# Check we have all the basic columns
columns = c(
"seqid", "source", "type", "start", "end",
"score", "strand", "phase", "attributes"
)
stopifnot(
length( which( columns %in% colnames(gff) )) == length(columns)
)
# check some string fields, does it get them right?
stopifnot( gff[['seqid']][1] == 'chr1' )
stopifnot( gff[['strand']][1] == '+' )
stopifnot( gff[['attributes']][1] == 'ID=gene1;other_data=stuff' )
stopifnot( gff[['seqid']][2] == 'chr1' )
stopifnot( gff[['strand']][2] == '+' )
stopifnot( gff[['attributes']][2] == 'ID=gene1.1;Parent=gene1' )

# check that start and end are integers
stopifnot( gff[['start']][1] == 1 )
stopifnot( gff[['end']][1] == 1000 )
stopifnot( gff[['start']][2] == 10 )
stopifnot( gff[['end']][2] == 900 )

# check that missing data is handled right
# "." indicates missing data in the GFF spec
# but we should have translated that to an R missing value
stopifnot( is.na( gff[['score']][2] ) )

# check that we extracted `ID` and `Parent` right.
stopifnot(
length( which( c( "ID", "Parent" ) %in% colnames(gff) )) == 2
)

stopifnot( gff[['ID']][1] == 'gene1' )
stopifnot( gff[['ID']][2] == 'gene1.1' )
stopifnot( gff[['Parent']][2] == 'gene1' )
# etc.
# add your own checks here!

cat( "\n++ test_read_gff(): Congratulations, all tests passed!\n" )
}

And then run this to test the function:

test_read_gff()

This prints an error, something like:

Error in read_gff(data) : 
could not find function "read_gff"

Of course it's an error - we haven't written the function yet!

But we now have a concrete target to shoot for: when our code passes the test it will be doing the right thing.

Note

This style of programming is known as test-driven development - in which you write the test(s) first and only then write the implementation.

You don't have to work this way but it's quite cool if you can bring yourself to do it, because

  1. It makes you figure out how you want your code to be used before you write it.
  2. It forces you to write code that can be tested. (This typically means you end up with pieces that are smallish and hopefully composable).
  3. When you've written the code - hey presto, you've also written the tests (no extra work).

For these reasons I thought we'd get the test in up-front here.

When your function passes the test, you're done! Go ahead and turn it into a module.

Getting it to work

To figure out how to write read_gff() let's try a few things. Start an R or python session now if you haven't already. Also make sure you have installed the tidyverse (in R) or pandas (in python) because that's what we'll use. You can find the documentation for these packages here:

A first go

The data in a GFF is basically tabular, so let's try to load the data using the appropriate function. In R/tidyverse this is read_tsv() while in python/pandas it's called read_table(). You can find the documentation for this function here:

Let's have a first go. To get things working let's work with a small file - the "gencode.v41.annotation.head.gff" file you made earlier:

library( readr )
X = readr::read_tsv( "gencode.v41.annotation.head.gff" )

Try running this now - what happens?

Note

You will probably find you have an error.

If using R, print the resulting dataframe X now. How many rows and columns does it have? What are the first few lines?

It doesn't look right - what has gone wrong?

Skipping metadata

We haven't told the parser about the metadata lines - so it is unsurprisingly confused. For now we're interested in the data records so would like the code to just ignore those lines. There are a few ways to do this, but the easiest is to tell read_tsv() or read_table() that these lines are comments - that is, bits of text that aren't part of the data.

If you read the read_tsv() documentation you may spot there's an argument called comment. About this argument it says:

comment: A string used to identify comments. Any text after the comment characters will be silently ignored.

This sounds like the right thing! Let's try:

X = readr::read_tsv(
"gencode.v41.annotation.head.gff",
comment = '#'
)

Try this again. Does it work? (Print X again to see.)

It sort of works!

...but not quite. Thus begins our long war of attrition to get this in shape.

Adding column names

At the momentX doesn't have the right column names. In fact, how could it? The file didn't have them in.

We can fix that by adding a col_names argument. What names? Well, the ones from the GFF spec of course:

X = readr::read_tsv(
"gencode.v41.annotation.head.gff",
comment = '#',
col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' )
)

Print out X again. It is starting to look right:

img

Now it has column names!

Let's try putting this in our function and testing now?

read_gff = function( filename ) {
readr::read_tsv(
filename,
comment = '#',
col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' )
)
}

and run the test:

test_read_gff()
Note

Look at the output - does it pass the test? If not, what fails? Can you see why?

Dealing with missing values

This raises a more subtle issue. The spec says that . indicates a missing value - but our code thinks they are simply strings. Look:

print( X$score[1:10] )

Because of this our test is failing when it tests that the score has missing values.

This is going to make the test fail when it tests for missing data. Somehow we need to convert those .'s to missing values.

However if you check the read_tsv() docs again you should see there's again an argument that we can use to tell it to treat specific strings as missing values. (Can you see it?)

Let's try again:

read_gff = function( filename ) {
readr::read_tsv(
filename,
comment = '#',
col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
na = "."
)
}
test_read_gff()

Now the missing values come up as NA - R's shorthand for missing values.

Specifying the right column types

There is also the more subtle issue of column types. Currently our function has guessed the column types for us, by looking at the data.

To see them, print X again and look at the second row - it will look something like this:

> X
# A tibble: 993 × 9
seqid source type start end score strand phase attributes
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972.5;
(etc)

The <chr> means 'character' data (i.e. strings), and <dbl> means floating-point numbers.

So this is pretty good for this particular dataset. But there are good reasons we should fix the column types. First, they're not quite right at the moment (for example score should be a floating-point number). More importantly, because we're letting it guess the types from the input data, it might conceivable guess different types depending on what data is passed in - likely to break code that uses the results.

Here we know what the types should be - they're specified in the spec. So let's specify them directly.

The documentation tells us how to do this: there's a col_types argument for this purpose. The syntax is a bit involved (described here), but works like this:

read_gff = function( filename ) {
library( readr )
readr::read_tsv(
filename,
comment = '#',
na = ".",
col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
col_types = cols(
col_character(),
col_character(),
col_character(),
col_integer(),
col_double(),
col_double(),
col_character(),
col_integer(),
col_character()
)
)
}

And let's try again:

X = read_gff( "gencode.v41.annotation.head.gff" )
Note

Run this and look at the output. Are the column types right now? (If not, fix them.)

Question

What about the test - does it pass now?

test_read_gff()

If not, why not?

Next steps

Once you've got your code working - go ahead and turn it into a module.