Testing it out

Hopefully you have now got a working function, read_gff() and got the extremely rewarding message:

++ test_read_gff(): Congratulations,all tests passed!

In addition it will be useful for your function to also extract the 'gene_type' and 'gene_name' attributes, because these are useful in the gencode files.

If you haven't reached this point, don't worry! Here is my solution:

Solution

Your solution
R solution, simpler version
python solution, simpler version
R solution, post-challenge version
python solution, post-challenge version

Please feel free to use your solution, if it's working. If not, see the tabs for my solutions.

read_gff = function(
    filename
) {
    result = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
    )
    result[['ID']] = stringr::str_extract( result[['attributes']], 'ID=([^;]+)', group = TRUE )
    result[['Parent']] = stringr::str_extract( result[['attributes']], 'Parent=([^;]+)', group = TRUE )
    result[['gene_type']] = stringr::str_extract( result[['attributes']], 'gene_type=([^;]+)', group = TRUE )
    result[['gene_name']] = stringr::str_extract( result[['attributes']], 'gene_name=([^;]+)', group = TRUE )
    return( result )

def read_gff( file ):
    import pandas
    result = pandas.read_table(
        file,
        comment = '#',
        names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ],
        na_values = '.',
        dtype = {
            'seqid': str,
            'source': str,
            'type': str,
            'start': int,
            'end': int,
            'score': float,
            'strand': str,
            'phase': str,
            'attributes': str
        }
    )
    result['ID'] = result.attributes.str.extract( 'ID=([^;]+)' )
    result['Parent'] = result.attributes.str.extract( 'Parent=([^;]+)
    result['gene_type'] = result.attributes.str.extract( 'gene_type=([^;]+)' )
    result['gene_name'] = result.attributes.str.extract( 'gene_name=([^;]+)
    return result

read_gff = function(
    filename,
    extra_attributes = c( "gene_type", "gene_name" )
) {
    result = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
    )

    # Put ID and Parent at the start.
    result = tibble::add_column(result, ID = NA, .before = 1)
    result = tibble::add_column(result, Parent = NA, .before = 2)

    # Now extract all the attributes
    for( attribute in c( "ID", "Parent", extra_attributes )) {
        # Create the appropriate regex using `sprintf()`
        regex = sprintf("%s=([^;]+)[;|$]", attribute)
        # Extract the attribute...
        result[[attribute]] = stringr::str_extract(result[["attributes"]], regex, group = TRUE)
        # ...and remove it from the attributes column.
        result[["attributes"]] = stringr::str_remove(result[["attributes"]], regex)
    }
    return( result )
}

def read_gff( file, extra_attributes = [] ):
    import pandas
    result = pandas.read_table(
        file,
        comment = '#',
        names = [ 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ],
        na_values = '.',
        dtype = {
            'seqid': str,
            'source': str,
            'type': str,
            'start': int,
            'end': int,
            'score': float,
            'strand': str,
            'phase': str,
            'attributes': str
        }
    )
    result.insert( loc = 0, column = 'ID', value = None )
    result.insert( loc = 1, column = 'Parent', value = None )
    for attribute in [ 'ID', 'Parent' ] + extra_attributes:
        regexp = '%s=([^;]+)[;|$]' % attribute
        result[attribute] = result.attributes.str.extract( regexp )
        result["attributes"] = result["attributes"].str.replace( regexp, "" )
    return result

Trying some real data

Let's use this to do some work on some real data now. Load up the gencode data:

gencode = read_gff( "gencode.v41.annotation.gff3.gz" )

Use your R skills from the Introduction to R tutorial or python skills to view these files and explore a bit - for example pulling out all gene records, or records pertaining to specific genes. (For example you could look at FUT2.)

The data 'verbs'

Now is a good point to introduce in a bit more detail a set of data manipulation verbs that make working with data frames easy. You've already been working with several of these - filter, group by, summarise, and arrange. We'll describe a couple more 'verbs' - 'join' and 'select' - below.

We'll calling them data 'verbs' because they do things to dataframes. For example

Operation	Description	R/dplyr function	pandas/polars function
`select`	selects (and optionally renames) columns	`select()`	?
`mutate`	adds columns	`mutate()`	?
`filter`	filters rows based on column values	`filter()`	?
`arrange` or sort	orders rows	`arrange()`	?
`group by`	groups rows based on column values	`group_by()`	?
`summarise`	computes summary values over the rows	`summarise()`	?
`join`	joins two dataframes together, based on shared values.	`inner_join`, `outer_join`, etc.	?

You can build these into pipelines you can conduct complex data manipulation tasks in a highly expressive way.

For example, here's a simple way to use filter():

filter( X, ID == `ENSG00000176920.13` )

If we also wanted to filter out

Note

In R
In python

For example, here's a cool way to filter the dataframe - using a kind of 'pipe', just like the one in bash. Instead of using filter() like this:

filter( gencode, ID == `ENSG00000176920.13` )

you can write

X %>% filter( ID == `ENSG00000176920.13` )

Here %>% plays the same role as | does in bash - it pipes the output of one command into the input of the next. The advantage is that you can put multiple things in the same pipeline. For example let's find all the FUT2 transcripts that start before chr19:48,696,000:

(
    gencode
    %>% filter( Parent == 'ENSG00000176920.13' )
    %>% filter( start < 48696000 )
)

So, just like in the command-line, you can build up pipelines of commands to get the data you want. This filtering syntax is a feature of dplyr, which is part of tidyverse.

Here's a cool way to filter the data frame by rows in python - use query(). This takes a string as an expression, for example:

gencode.query( "ID == 'ENSG00000176920.13'" )

The query() function is actually part of the dataframe object (it's a 'method' of the dataframe), which is why you can call it in the style 'object.method()'.

This also makes it easy to chain multiple filtering criteria together. For example, let's find all the FUT2 transcripts that start before chr19:48,696,000:

(
    gencode
    .query( "Parent == 'ENSG00000176920.13'" )
    .query( "start < 48696000" )
)

This dataframe and filtering syntax is part of pandas.

If it's working, well done!

Note

Another great thing to do is group and count the data - much like the pipeline using uniq -c in BASH. For example let's make a count of record types:

In R
In python

(
    gencode
    %>% group_by( type )
    %>% summarise( count = n() )
)

Note. Another way to do this is R's built-in function table():

table( gencode$type )

(
    gencode
    .groupby( "type" )
    .agg({ "ID": 'count' })
)

This works, but I personally find this piece of code and its output harder to understand than the R / dplyr version. Another way to do this in pandas is to use the simpler value_counts():

gencode['type'].value_counts()

...although that returns something called a 'Series', as opposed to a data frame.

Warning

As you start to play around with loading multiple files, keep an eye on the memory usage of your process. (You can do this in your system monitor, or by opening a terminal and running:

top -u <username> -o '%MEM' on linux or Ubuntu for Windows; or
top -U gav -o MEM in Mac OS

Next steps

A better way to solve the memory issue to store the data in a database and only load what's needed into memory - we'll see a way to do that later. But first let's package up the code.

Testing it out

Solution

Trying some real data​

The data 'verbs'​

Note

Note

Warning

Next steps​

Trying some real data

The data 'verbs'

Next steps