
Extracting the ID and Parent attributes

If you followed the previous page, you'll have a function called read_gff() which almost - but not quite - loads GFF data the way we want it.

What's missing is that we need to extract the attributes - in particular the ID and Parent attributes that specify the structure of the file. As you saw before, they are tucked away inside the attributes column.

For example, look at the first of these - it will look something like:


> X$attributes[1]
[1] "ID=ENSG00000223972.5;gene_id=ENSG00000223972.5;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;level=2;hgnc_id=HGNC:37102;havana_gene=OTTHUMG00000000961.2"

The value is one big long string of key=value pairs separated by semicolons, and somewhere in there is the bit we want: ID=ENSG....... We also want to get the Parent attribute, which indicates how the records are linked together; it only exists in some rows, and where it's missing we should record a missing value.

So we have to somehow split these bits out of the string. How to do this?

Extracting attributes

It turns out there is a great tool for this type of job - regular expressions. Regular expressions are a language in their own right that can be used to parse and extract pieces of text from larger strings. It works like this: we define a regular expression that captures the bit of the attributes we want to extract. Then we use a function to apply that to all of the attributes strings in the dataframe.

For example, to capture a field of the form ID=something;, we could use the basic regular expression:

ID=([^;]+)
Note

If you haven't used regular expressions before, this might look pretty opaque! But this is how regular expressions look; they are very short and compact but pack a lot in. This one can be understood as follows:

  • The 'ID=' part looks for the exact string "ID=";
  • The '[^;]+' part looks for a run of characters that are not semicolons. (The + means the run has to be at least one character long.)
  • And the parentheses ( and ) tell the regexp to capture (i.e. remember) whatever bit of the string matched there. (This is called a 'capture group'.)

So, put together, this says: find something of the form "ID=value", stopping before any semicolon, and remember the value.
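
Before applying this to real data, here's a quick sketch of the regex in action on a single made-up string, using str_match() from stringr, which is a convenient way to see the full match and the captured group side by side:

library(stringr)

# str_match() returns a matrix: the full match in the first column,
# and each capture group in its own column after that.
str_match( "gene_id=X;ID=ENSG00000223972.5;level=2", "ID=([^;]+)" )

The first column should contain the whole "ID=ENSG00000223972.5" match, and the second column just the captured value.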

Let's test this out by making some dummy data with the ID in different places in the string:

test_attributes = tibble(
  attributes = c(
    "a=b;ID=gene_1",
    "ID=gene_2;a=b",
    "ID=gene_3",
    "a=b"
  )
)

The str_extract() function from the stringr library (part of the tidyverse) can now be used to extract attributes:

str_extract(
  test_attributes$attributes,
  'ID=([^;]+)'
)

You should see something like this:

[1] "ID=gene_1" "ID=gene_2" "ID=gene_3" NA         

This looks good - for example it has correctly returned a missing value (NA) for that last row where there was no ID attribute. Check it carefully to make sure the others are correct.

Question

Of course we really want just the value - that is, the bit matched by the capture "group" ([^;]+). In fact str_extract() has an argument that can be used to do that.

  • can you find it? Add this and check that it gives the right answers.

Getting our ID field into a new column of X is now easy. (Since the ID is so important, let's put it at the start).

In R there are a few ways to do this. For example, let's do it by creating a new column ID at the start, filled with NA values, and then filling it in with the extracted IDs:

X = add_column( X, ID = NA, .before = 1 )
X[['ID']] = str_extract(
  X[['attributes']],
  'ID=([^;]+)',
  group = TRUE
)

Note. In R, indexing starts at 1, so here .before = 1 means "place at the start".

Note

Thanks to Jonathan Chan for pointing out this neat way to extract attributes.
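
As an aside, if you prefer a piped tidyverse style, a roughly equivalent sketch (assuming X is the tibble returned by read_gff() and the tidyverse is loaded) is:

# Equivalent approach: create the ID column with mutate(), then move it to
# the front with relocate() (which places columns first by default).
X = X %>%
  mutate( ID = str_extract( attributes, 'ID=([^;]+)', group = TRUE )) %>%
  relocate( ID )

Both approaches produce the same result; use whichever reads more clearly to you.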

Print out X again to see your handiwork - there should be a new ID column in there.

An aside on timing

The full dataset has something like three million rows - that's a lot of data to process! So how long does it take to extract these attributes? Let's time it now:

In R a simple way to time some code is using system.time():

system.time({
  X[['ID']] = str_extract(
    X[['attributes']],
    'ID=([^;]+)',
    group = TRUE
  )
})

On my laptop, the R version takes about 2 seconds (measuring 'elapsed' time), pretty consistently, while the python version is slightly slower - and there are around 3.4 million rows. So it is parsing over a million rows per second.
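
The timings table further down reports a 'best of three'. If you want to reproduce that, a rough sketch is to repeat the measurement and keep the fastest elapsed time (dedicated packages such as bench or microbenchmark do this more rigorously, but aren't needed here):

# Run the extraction three times and keep the fastest elapsed time.
timings = replicate( 3, system.time({
  str_extract( X[['attributes']], 'ID=([^;]+)', group = TRUE )
})[['elapsed']] )
min( timings )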

This is in fact one of the really good reasons to use regular expressions for this task. They are extremely well optimised. For example, compare the speed of a naive implementation that tries to extract the ID attribute:

# Extract the `ID` attribute by:
# 1. splitting up the string on ';' characters,
# 2. finding the one that starts with "ID=",
# 3. and using `substring()` to get the value:
my_naive_extract = function(x) {
  elts = strsplit( x, split = ';', fixed = TRUE )[[1]]
  w = which( stringr::str_starts( elts, "ID=" ))
  result = NA
  if( length(w) == 1 ) {
    result = substring( elts[w], 4 )
  }
  return( result )
}

To apply this to the whole column we can use map() (from the purrr package, also part of the tidyverse):

system.time({
  ID = map( X[['attributes']], my_naive_extract )
})
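
(A small aside: map() returns a list, not a character vector. If you wanted output directly comparable to the str_extract() result, a minimal sketch would be to flatten it:)

# Flatten the list returned by map() into a plain character vector.
ID = unlist( map( X[['attributes']], my_naive_extract ))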

On my system (an M1 mac) this 'naive' version takes about 80 seconds in R! (Though the python version is much better, at around 7 seconds.) So we have a table of timings, something like:

Language   Method   Best of three timings
R          regex    1.8s
python     regex    2.6s
python     naive    6.0s
R          naive    78s
Note

This illustrates a key point when working with large genomic datasets: it's quite easy to find that things get really slow, and some care over implementation is often needed to speed them up again. (Although 70s might not seem much in the scheme of things, it's easily enough for your mind to wander. Or imagine we were processing fifty files instead of one...)

Particularly when working in interpreted languages like R or python, finding ways to write efficient code soon becomes important. (For text manipulation, regex methods are often hard to beat.)

Warning

Apart from speed - you should also be worried about how much memory your process is using!

Check this now using your system's activity monitor, or by running top -u <username> -o '%MEM' (on Linux) or top -U <username> -o MEM (on macOS) in a terminal. On my system, before running the above, both R and python are using somewhere around 2GB of memory, while after extracting the ID they are using about 2.4GB each.

This is closely related to speed, because if your system runs out of memory it will probably keep working but will start using 'swap' space (on your hard drive). At this point the code and everything else on your laptop will likely start to slow down dramatically.
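
If you'd rather check from inside R, here is a rough sketch using base R functions (note that object.size() measures a single object rather than the whole R process, so treat the numbers as a lower bound):

# How much memory does the data frame itself take up?
print( object.size( X ), units = "auto" )

# gc() reports the total memory currently used by R (see the "used" column, in Mb),
# which is a rough proxy for the process as a whole.
gc()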

Testing it out

Now you should be all set to get our function working!

Challenge

Get the function working, by following these steps:

  1. Add the above code to your read_gff() so that it adds the ID attribute.

  2. Don't forget to add similar code to extract the Parent attribute as well.

  3. Run the test:

test_read_gff()

If all goes well you should see:

++ test_read_gff(): Congratulations,all tests passed!

Congratulations!

Note - if you get stuck - don't worry, solutions will be provided! Just go on to the testing it out page. On the other hand, if you have got this far, we have some additional challenges.