Getting set up
If you followed the R version of our earlier tutorials you will have:
- An R package called gmsgff which provides the read_gff() function. (We made this in the Making an R or python package tutorial.)
- An R script called gff_to_sqlite.R which converts a GFF file into the sqlite database format. (We made this in the Making a command-line program tutorial.)
If you don't have these or want an updated version, don't worry! You can get my versions as follows.
First, to install the R package, try (from an R session):
install.packages(
"https://www.chg.ox.ac.uk/bioinformatics/training/gms/code/R/gmsgff.tgz",
repos = NULL,
type = "source"
)
This will install the gmsgff package - this is the same as the one you developed with some additions that came from the challenge questions.
For example you can use it to load GFF data like this:
data = gmsgff::read_gff( "/path/to/Homo_sapiens.GRCh38.107.chr.gff3.gz", extra_attributes = c( "biotype", "Name" ))
You should see some nice output like this:
++read_gff(): Reading GFF3 data from "Homo_sapiens.GRCh38.107.chr.gff3.gz"...
++read_gff(): Extracting "ID" attribute...
++read_gff(): Extracting "Parent" attribute...
++read_gff(): Extracting "biotype" attribute...
++read_gff(): Extracting "Name" attribute...
++read_gff(): Removing prefixes from ID fields...
++read_gff(): ok.
You should get back a dataframe with all the GFF columns, plus the extracted ID, Parent, biotype and Name attributes as separate columns.
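To make "extracted as separate columns" concrete, here is a minimal base-R sketch of the idea on two made-up attribute strings. The helper extract_attribute() is hypothetical and written only for illustration; it is not necessarily how gmsgff::read_gff() does the job internally.

```r
# Two made-up GFF3 attribute strings (illustrative values, not real records)
attributes = c(
  "ID=gene:ENSG0001;Name=ABC1;biotype=protein_coding",
  "ID=transcript:ENST0001;Parent=gene:ENSG0001;biotype=protein_coding"
)

# Hypothetical helper: return the value stored under `key`, or NA if absent
extract_attribute = function( attributes, key ) {
  pattern = sprintf( "(^|;)%s=([^;]*)", key )
  matches = regmatches( attributes, regexec( pattern, attributes ))
  # Each match is c( full match, separator, value ), or empty if no match
  vapply(
    matches,
    function( m ) if( length( m ) == 0 ) NA_character_ else m[3],
    character( 1 )
  )
}

result = data.frame(
  ID = extract_attribute( attributes, "ID" ),
  Parent = extract_attribute( attributes, "Parent" ),
  Name = extract_attribute( attributes, "Name" ),
  biotype = extract_attribute( attributes, "biotype" ),
  stringsAsFactors = FALSE
)

# Strip prefixes such as "gene:" or "transcript:" from the ID fields
result$ID = sub( "^[A-Za-z_]+:", "", result$ID )
result$Parent = sub( "^[A-Za-z_]+:", "", result$Parent )

print( result )
```

Attributes that are missing from a record (such as Parent for a gene) come out as NA in this sketch.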
Warning
Remember these are big files and they use lots of memory. It is worth keeping a separate terminal open to monitor the memory usage as this data loads: use top -u <username> -o '%MEM' (on Linux) or top -U <username> -o MEM (on Mac OS) to do this, or use your system activity monitor. How much memory does the process use?
A command-line program
If you completed the tutorial you should also have a command-line program gff_to_sqlite.R which can be used to convert a GFF file into sqlite format.
Note
If you don't have this program, fear not! You can download my version at this link: gff_to_sqlite.R. Click on 'Raw', copy the code, and paste it into a gff_to_sqlite.R file in your current directory.
(This program depends on the gmsgff R library above, so make sure to install that first.)
In the command-line you can now run the program like this:
Rscript --vanilla gff_to_sqlite.R --input Homo_sapiens.GRCh38.107.chr.gff3.gz --output genes.sqlite --attributes biotype Name
which will produce a new file called genes.sqlite. If you're not used to using sqlite files, you can see some ways to access that data on this page.
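As one possibility, sqlite data can be queried from R with the DBI and RSQLite packages (assumed to be installed). The sketch below uses a small in-memory toy table so that it runs on its own. The table name "gff" and the toy rows are assumptions made for illustration; with the real file, connect to "genes.sqlite" instead and check dbListTables() to see what the table is actually called.

```r
library( DBI )

# Toy stand-in for the real database. With the real file you would use:
#   db = dbConnect( RSQLite::SQLite(), "genes.sqlite" )
db = dbConnect( RSQLite::SQLite(), ":memory:" )

# Made-up rows; "gff" is a hypothetical table name
toy = data.frame(
  ID = c( "ENSG0001", "ENSG0002", "ENST0001" ),
  type = c( "gene", "gene", "mRNA" ),
  biotype = c( "protein_coding", "lncRNA", "protein_coding" ),
  stringsAsFactors = FALSE
)
dbWriteTable( db, "gff", toy )

# List the tables in the database
tables = dbListTables( db )
print( tables )

# Count gene records by biotype
counts = dbGetQuery(
  db,
  "SELECT biotype, COUNT(*) AS n FROM gff WHERE type = 'gene' GROUP BY biotype ORDER BY biotype"
)
print( counts )

dbDisconnect( db )
```

Results come back as ordinary data frames, so they can be passed straight on to other R code.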
Note
We added the 'biotype' and 'Name' attributes above. These are useful in the Ensembl files (but not in the Gencode files, which use gene_type and gene_name instead).
Note
Another amazing feature of this program is that it will fetch data from the internet for you! For example, try:
Rscript \
--vanilla gff_to_sqlite.R \
--input http://ftp.ensembl.org/pub/current_gff3/camelus_dromedarius/Camelus_dromedarius.CamDro2.110.chr.gff3.gz \
--attributes biotype Name \
--output genes.sqlite
This didn't require any new programming: it is all done by the underlying read_tsv() function which we used to load the data.
Next steps
You are all set to start counting genes.