A first go at parsing GFF
This tutorial will lead you through a process of writing some code to load GFF data into either python or R. We will then use this to create a useful utility program.
The aim of this tutorial is for you to write the code that does this (either on your own or working in a group if you prefer). The tutorial will guide you through one way to do it, and if you are doing this as part of a WHG course, there will be lots of support.
Note
Remember we're not writing this code for its own sake but to answer questions like the ones in our introduction:
- How many genes are there?
- How big are they?
- How much of the genome is in genes?
- How complex are genes - how many exons? How many different transcripts?
- How much of genes is actually protein-coding sequence - and how much is untranslated?
- How much do these patterns differ across species?
Or indeed any other questions you are interested in.
Nevertheless for the moment we will focus on the programming part of this task, and leave the analysis part until later. The idea is to guide you to some useful ways to approach coding.
Writing code
How to start writing code? Well, it is worth noting that there are libraries around that will load GFF data for you (rtracklayer in R is one). At the other extreme, you could use very low-level approaches - building the data structures yourself. But this tutorial will take a 'middle' way: we will use popular general-purpose 'dataframe' libraries to load the data - tidyverse in R and pandas in python.
Note
You can choose to use R or python. One of the interesting things about these libraries is that they are more similar than they are different, so we can write almost the same code in both programming languages.
In the course of the tutorial we'll develop a little R or python library to help us answer the questions.
Diving straight in - parsing data
If you looked at the gene annotation data, you'll know that it comes in rows of data that are tab-delimited. That's good: it will fit naturally into a dataframe structure. But the data is also relational (meaning that the records refer to each other, via the `Parent` attribute). And since this has several levels (e.g. exons are associated with transcripts, which are in turn associated with genes), we might ultimately have to build some form of hierarchical data structure to capture this.
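To make that relational encoding concrete, here is a minimal sketch in python of how the ninth column carries it. (`parse_attributes` is a hypothetical helper name for illustration, not part of any library we'll use.)

```python
# Each record's ninth column is a semicolon-separated list of key=value
# pairs; the `Parent` key is what links records together.
def parse_attributes(attributes):
    result = {}
    for pair in attributes.split(';'):
        if '=' in pair:
            key, value = pair.split('=', 1)
            result[key] = value
    return result

exon = parse_attributes("ID=gene1.1;Parent=gene1")
print(exon['Parent'])   # prints: gene1 - the exon points at its parent gene
```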
That sounds complex, so let's break off a manageable first bit of the job by just focussing on
getting the data in. We'll start by trying to write a function that loads the data. We'll call this function `read_gff()`, because that's what it will do. It will look like this:
- In R
- In python
read_gff = function( file ) {
    result = (some code to load the data from the file here)
    return( result )
}
def read_gff( file ):
    result = (some code to load the data from the file here)
    return result
- In R
- In python
X = read_gff( "gencode.v41.annotation.head.gff" )
file = open( "gencode.v41.annotation.head.gff" )
X = read_gff( file )
This should produce a dataframe with columns `ID`, `Parent`, `seqid`, `source`, `type`, `start`, `end`, `score`, `strand`, `phase`, and `attributes`.
Simple! If only we knew what bit of code to write in the function there.
Challenge
Can you write `read_gff()`?
Note
To make your function really good, here are a few things it should get right:
- It should parse the main data records (skipping over the metadata).
- It should get the column names right.
- It should get the data types of columns right.
- It should handle missing data values appropriately.
And, because we want to capture the relational structure of the file:
- It should extract the `ID` and `Parent` attributes as new columns.
In other words - your function has to pass the test in the next section. Good luck!
If you don't know where to start - don't worry, we will walk through a process of writing this below.
Test-driven development
Funnily enough our function above already has a useful property, even though we haven't written it yet! It is already testable. To see this let's design some test data:
- In R
- In python
test_data = "##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"
test_data = """##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"""
Note
The `\t`'s in there are tab characters - this is how R and python (and most other programming languages) encode tabs. If you want to see the full data (with tabs expanded), use `cat()` in R or `print()` in python:
- In R
- In python
cat( test_data )
print( test_data )
You should see that it's properly GFF formatted.
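If you want to check that programmatically, here is a quick sketch: every non-metadata line of a GFF3 file should split into exactly nine tab-separated fields.

```python
# Sanity-check the test data: metadata lines start with '#'; every other
# line should have exactly nine tab-separated fields, as GFF3 requires.
test_data = """##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"""
data_lines = [
    line for line in test_data.splitlines()
    if line and not line.startswith('#')
]
field_counts = [len(line.split('\t')) for line in data_lines]
print(field_counts)   # prints: [9, 9]
```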
Now let's use that test data to write a test capturing our requirements:
- In R
- In python
test_read_gff = function() {
    test_data = "##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"
    cat( "Using test data:\n" )
    cat( test_data )
    # 1. run our function to parse the data:
    gff = read_gff( test_data )
    print( gff )
    # 2. test it:
    # Check we have all the basic columns
    columns = c(
        "seqid", "source", "type", "start", "end",
        "score", "strand", "phase", "attributes"
    )
    stopifnot(
        length( which( columns %in% colnames(gff) )) == length(columns)
    )
    # check some string fields, does it get them right?
    stopifnot( gff[['seqid']][1] == 'chr1' )
    stopifnot( gff[['strand']][1] == '+' )
    stopifnot( gff[['attributes']][1] == 'ID=gene1;other_data=stuff' )
    stopifnot( gff[['seqid']][2] == 'chr1' )
    stopifnot( gff[['strand']][2] == '+' )
    stopifnot( gff[['attributes']][2] == 'ID=gene1.1;Parent=gene1' )
    # check that start and end are integers
    stopifnot( gff[['start']][1] == 1 )
    stopifnot( gff[['end']][1] == 1000 )
    stopifnot( gff[['start']][2] == 10 )
    stopifnot( gff[['end']][2] == 900 )
    # check that missing data is handled right
    # "." indicates missing data in the GFF spec
    # but we should have translated that to an R missing value
    stopifnot( is.na( gff[['score']][2] ) )
    # check that we extracted `ID` and `Parent` right.
    stopifnot(
        length( which( c( "ID", "Parent" ) %in% colnames(gff) )) == 2
    )
    stopifnot( gff[['ID']][1] == 'gene1' )
    stopifnot( gff[['ID']][2] == 'gene1.1' )
    stopifnot( gff[['Parent']][2] == 'gene1' )
    # etc. - add your own checks here!
    cat( "\n++ test_read_gff(): Congratulations, all tests passed!\n" )
}
def test_read_gff():
    test_data = """##gff-version 3
#description: test data
chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1;other_data=stuff
chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1
"""
    print( "Using test data:" )
    print( test_data )
    from io import StringIO  # lets us treat the string as if it were a file
    # 1. run our function to parse the data:
    data = read_gff( StringIO( test_data ))
    # 2. test it:
    # check some string fields:
    assert data['seqid'][0] == 'chr1'
    assert data['strand'][0] == '+'
    assert data['attributes'][0] == 'ID=gene1;other_data=stuff'
    assert data['seqid'][1] == 'chr1'
    assert data['strand'][1] == '+'
    assert data['attributes'][1] == 'ID=gene1.1;Parent=gene1'
    # check that start and end are integers
    assert data['start'][0] == 1 # start and end are integers, not strings
    assert data['end'][0] == 1000
    assert data['start'][1] == 10
    assert data['end'][1] == 900
    # check that missing data is handled right
    # "." indicates missing data in the GFF spec
    # but we should have translated that to `NaN`, which
    # is pandas' way of indicating missing data.
    from math import isnan
    assert isnan( data['score'][1] )
    # check that we extracted `ID` and `Parent` right.
    assert data['ID'][0] == 'gene1'
    assert data['ID'][1] == 'gene1.1'
    assert data['Parent'][1] == 'gene1'
    # etc. - add your own checks here!
    print( "++ test_read_gff(): Congratulations, all tests passed!" )
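One line of the python test deserves a comment: `read_gff()` expects something it can read like a file, and `StringIO` (from python's standard library) wraps an in-memory string so that it behaves like an open file.

```python
from io import StringIO

# StringIO wraps a plain string in a file-like object: anything that reads
# from a file (line by line, or all at once) can read from it instead.
handle = StringIO( "chr1\tme\tgene\t1\t1000\n" )
print( handle.readline() )   # behaves just like reading a line from a file
```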
And then run this to test the function:
test_read_gff()
This prints an error, something like:
- In R
- In python
Error in read_gff(data) :
could not find function "read_gff"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in test_read_gff
NameError: name 'read_gff' is not defined
Of course it's an error - we haven't written the function yet!
But we now have a concrete target to shoot for: when our code passes the test it will be doing the right thing.
Note
This style of programming is known as test-driven development - in which you write the test(s) first and only then write the implementation.
You don't have to work this way but it's quite cool if you can bring yourself to do it, because
- It makes you figure out how you want your code to be used before you write it.
- It forces you to write code that can be tested. (This typically means you end up with pieces that are smallish and hopefully composable).
- When you've written the code - hey presto, you've also written the tests (no extra work).
For these reasons I thought we'd get the test in up-front here.
When your function passes the test, you're done! Go ahead and turn it into a module.
Getting it to work
To figure out how to write `read_gff()`, let's try a few things. Start an R or python session now if you haven't already. Also make sure you have installed the tidyverse (in R) or pandas (in python), because that's what we'll use. You can find the documentation for these packages on the tidyverse and pandas websites.
A first go
The data in a GFF is basically tabular, so let's try to load the data using the appropriate function. In R/tidyverse this is `read_tsv()`, while in python/pandas it's called `read_table()`. You can find the documentation for these functions here:
- for tidyverse in R, search for `read_tsv` in the readr documentation
- for pandas in python, search for `read_table` in the pandas documentation
Let's have a first go. To get things working let's work with a small file - the "gencode.v41.annotation.head.gff" file you made earlier:
- In R
- In python
library( readr )
X = readr::read_tsv( "gencode.v41.annotation.head.gff" )
import pandas
X = pandas.read_table( "gencode.v41.annotation.head.gff" )
Try running this now - what happens?
Note
You will probably find you have an error.
- In R
- In python
If using R, print the resulting dataframe `X` now. How many rows and columns does it have? What are the first few lines?
It doesn't look right - what has gone wrong?
Running the above command prints out a bunch of stuff and then says, right near the end:
ParserError : Error tokenizing data. C error: Expected 1 fields in line 8, saw 9
If you're not used to this kind of thing, errors like this may seem pretty cryptic. But they are often more helpful than they look at first. This one tells us for example that a problem occurred on line 8 of the input file. It expected to see 1 field there but found 9. Look back at the input data. Is there something different about line 8 than earlier lines? (Hint: yes there is!).
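The sketch below (pure python, with made-up lines) shows why the parser gets confused: it sizes the table from the first line it sees, and metadata lines contain no tabs at all.

```python
# The parser infers the number of columns from the first line it parses.
# A metadata line has no tabs, so it looks like a single field; the first
# real record then appears to have nine fields - hence the complaint about
# expecting 1 field but seeing 9.
lines = [
    "##gff-version 3",                             # metadata: 1 field
    "chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1",  # record: 9 fields
]
print( [ len( line.split('\t') ) for line in lines ] )   # prints: [1, 9]
```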
Skipping metadata
We haven't told the parser about the metadata lines - so it is unsurprisingly confused. For now we're interested in the data records, so we would like the code to just ignore those lines. There are a few ways to do this, but the easiest is to tell `read_tsv()` or `read_table()` that these lines are comments - that is, bits of text that aren't part of the data.
- In R
- In python
If you read the `read_tsv()` documentation you may spot there's an argument called `comment`. About this argument it says:
comment: A string used to identify comments. Any text after the comment characters will be silently ignored.
This sounds like the right thing! Let's try:
X = readr::read_tsv(
    "gencode.v41.annotation.head.gff",
    comment = '#'
)
If you read the `read_table()` documentation you may spot an argument called `comment`. About this argument it says:
comment: Character indicating that the remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether.
This sounds like the right thing! Let's try:
X = pandas.read_table(
    "gencode.v41.annotation.head.gff",
    comment = '#'
)
Try this again. Does it work? (Print `X` again to see.)
It sort of works!
...but not quite. Thus begins our long war of attrition to get this in shape.
Adding column names
At the moment `X` doesn't have the right column names. In fact, how could it? The file didn't have them in it.
- In R
- In python
We can fix that by adding a `col_names` argument. What names? Well, the ones from the GFF spec of course:
X = readr::read_tsv(
    "gencode.v41.annotation.head.gff",
    comment = '#',
    col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' )
)
We can fix that by adding a `names` argument. What names? Well, the ones from the GFF spec of course:
X = pandas.read_table(
    "gencode.v41.annotation.head.gff",
    comment = '#',
    names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ]
)
Print out `X` again. It is starting to look right:
Now it has column names!
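As an aside, it's worth seeing what pandas was doing before we supplied the names: with no `names=` argument it promotes the first row it parses to be the header. A small sketch with made-up two-record data:

```python
import pandas
from io import StringIO

# Without `names=`, pandas assumes the first row it parses is a header row,
# so the first data record gets consumed as column names.
text = (
    "chr1\tme\tgene\t1\t1000\t.\t+\t.\tID=gene1\n"
    "chr1\tme\texon\t10\t900\t.\t+\t.\tID=gene1.1;Parent=gene1\n"
)
X = pandas.read_table( StringIO( text ) )
print( list( X.columns ) )   # the first record, not real column names
print( len( X ) )            # only one row of actual data is left
```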
Let's try putting this in our function and testing it now:
- In R
- In python
read_gff = function( filename ) {
    readr::read_tsv(
        filename,
        comment = '#',
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' )
    )
}
def read_gff( filename ):
    return pandas.read_table(
        filename,
        comment = '#',
        names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ]
    )
and run the test:
test_read_gff()
Note
Look at the output - does it pass the test? If not, what fails? Can you see why?
Dealing with missing values
This raises a more subtle issue. The spec says that `.` indicates a missing value - but our code thinks they are simply strings. Look:
- In R
- In python
print( X$score[1:10] )
[ 'this_is_a_%s_character' % s for s in X["score"] ]
Because of this, our test fails when it checks for missing data in the `score` column. Somehow we need to convert those `.`'s to missing values.
- In R
- In python
However, if you check the `read_tsv()` docs again you should see there's an argument that we can use to tell it to treat specific strings as missing values. (Can you see it?)
Let's try again:
read_gff = function( filename ) {
    readr::read_tsv(
        filename,
        comment = '#',
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        na = "."
    )
}
test_read_gff()
Now the missing values come up as `NA` - R's shorthand for missing values.
However, if you read the `read_table()` docs you'll see this is easy as well - we need the `na_values` argument:
def read_gff( filename ):
    return pandas.read_table(
        filename,
        comment = '#',
        names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ],
        na_values = '.'
    )
Now the missing values come up as `NaN` (which is what pandas uses for a missing value).
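Incidentally, even without a parser argument the conversion could be done by hand after loading. A sketch for illustration only (in practice `na_values` handles it at load time), using `pandas.to_numeric`, which turns anything unparseable into `NaN` - note this coerces *all* non-numeric strings, not just `.`:

```python
import pandas

# Convert GFF's "." placeholders by hand: to_numeric with errors='coerce'
# turns any value that doesn't parse as a number (including '.') into NaN.
scores = pandas.Series(['.', '0.9', '.'])
numeric = pandas.to_numeric(scores, errors='coerce')
print(numeric.isna().tolist())   # prints: [True, False, True]
```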
Specifying the right column types
There is also the more subtle issue of column types. Currently our function has guessed the column types for us, by looking at the data.
- In R
- In python
To see them, print `X` again and look at the second row - it will look something like this:
> X
# A tibble: 993 × 9
seqid source type start end score strand phase attributes
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972.5;…
(etc)
The `<chr>` means 'character' data (i.e. strings), and `<dbl>` means floating-point numbers.
To see the types, try:
X.dtypes
which prints
seqid object
source object
type object
start int64
end int64
score float64
strand object
phase float64
attributes object
dtype: object
Most of this is actually fine (`object` refers to a python object, which in this case means a string), and it has correctly realised that `start` and `end` are integers. It has got `phase` wrong, though - it thinks it is a floating-point value, when it really isn't.
So this is pretty good for this particular dataset. But there are good reasons we should fix the column types. First, they're not quite right at the moment (for example `score` should be a floating-point number). More importantly, because we're letting it guess the types from the input data, it might conceivably guess different types depending on what data is passed in - which is likely to break code that uses the results.
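To see this instability concretely, here is a sketch: the very same column is inferred differently depending on its contents (the column name `score` and the values are made up for illustration):

```python
import pandas
from io import StringIO

# Type guessing depends on the data: the same column is inferred as float
# when every value parses as a number, but as plain `object` (strings)
# when any value does not.
numeric_scores = pandas.read_table( StringIO( "score\n0.9\n0.5\n" ) )
mixed_scores   = pandas.read_table( StringIO( "score\n0.9\nhigh\n" ) )
print( numeric_scores['score'].dtype )   # prints: float64
print( mixed_scores['score'].dtype )     # prints: object
```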
Here we know what the types should be - they're specified in the spec. So let's specify them directly.
- In R
- In python
The documentation tells us how to do this: there's a `col_types` argument for this purpose. The syntax is a bit involved (it's described in the readr documentation), but works like this:
read_gff = function( filename ) {
    library( readr )
    readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = cols(
            col_character(),
            col_character(),
            col_character(),
            col_integer(),
            col_integer(),
            col_double(),
            col_character(),
            col_integer(),
            col_character()
        )
    )
}
Let's specify them instead using the `dtype` argument:
def read_gff( filename ):
    return pandas.read_table(
        filename,
        comment = '#',
        names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ],
        na_values = '.',
        dtype = {
            'seqid': 'string',
            'source': 'string',
            'type': 'string',
            'start': 'Int64',
            'end': 'Int64',
            'score': 'float',
            'strand': 'string',
            'phase': 'string',
            'attributes': 'string'
        }
    )
And let's try again:
X = read_gff( "gencode.v41.annotation.head.gff" )
Note
Run this and look at the output. Are the column types right now? (If not, fix them.)
Question
What about the test - does it pass now?
test_read_gff()
If not, why not?
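If the test still fails, a likely culprit is the last requirement: the function doesn't yet extract `ID` and `Parent` from the attributes. One possible approach (a sketch only, using a tiny hand-made dataframe - in your function you would apply it to the parsed data) uses pandas' `str.extract()`:

```python
import pandas

# Pull `ID` and `Parent` out of the attributes string with regular
# expressions; `[^;]+` captures everything up to the next semicolon.
# (A sketch: these patterns would also match keys that merely end in
# 'ID' or 'Parent', so a real version should anchor more carefully.)
X = pandas.DataFrame({
    'attributes': [ 'ID=gene1;other_data=stuff', 'ID=gene1.1;Parent=gene1' ]
})
X['ID'] = X['attributes'].str.extract( r'ID=([^;]+)', expand = False )
X['Parent'] = X['attributes'].str.extract( r'Parent=([^;]+)', expand = False )
print( X[['ID', 'Parent']] )   # Parent is NaN where the record has none
```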
Next steps
Once you've got your code working - go ahead and turn it into a module.