Investigating specific genes
Let's switch track and try to dig out info about a specific gene - FUT2. That's an interesting gene because it encodes a fucosyltransferase which is involved the synthesis pathway for 'soluble' A and B antigens - that is A and B antigens found in bodily fluids. Quoth wikipedia:
"Secretor status refers to the presence or absence of water-soluble ABO blood group antigens in a person's bodily fluids, such as saliva, tears, breast milk, urine, and semen. People who secrete these antigens in their bodily fluids are referred to as secretors, while people who do not are termed non-secretors. Secretor status is controlled by the FUT2 gene (also called the Se gene), and the secretor phenotype is inherited in an autosomal dominant manner." - Wikipedia
A specific mutation (rs601338 G>A, at position 48703417 on GRCh38 assembly) in FUT2 is the major determinant of secretor status. Because pathogens including norovirus bind to these secreted antigens, these mutations can confer protection against norovirus.
Let's look at FUT2 in our annotation file now.
Finding the gene
A simple way to look this up is to just to grep (i.e. conduct a text search) for FUT2
:
grep FUT2 gencode.v41.annotation.gff3 | less -S
Unfortunately that returns a lot of rows - let's just get genes:
grep FUT2 gencode.v41.annotation.gff3 | awk '$3 == "gene"' | less -S
Ok, this returns two records. If you look at the gene_name
attribute you'll see one, on chromosome 19, is
FUT2, while the other is a different gene called POFUT2
. Let's use that to do a bit better:
grep 'gene_name=FUT2' gencode.v41.annotation.gff3 | awk '$3 == "gene"' | less -S
We got it! Copy its ID to the clipboard - in my file it is ENSG00000176920.13
.
Questions
- How long is FUT2 on the chromosome?
Note. to get the answer 100% right, you actually have to take the formula
This is because both start and end are expressed in a 1-based, closed coordinate system i.e. they both point at bases included in the gene. (Think of a gene with only two bases in it to see why this is.)
Finding transcripts
So how many transcripts does FUT2 have? Well we know how to do this - look for transcript records with the FUT2 gene as parent:
grep 'Parent=ENSG00000176920.13' gencode.v41.annotation.gff3 | awk '$3 == "transcript"' | less -S
So it has 4 transcripts - that is, the file suggests the gene may be transcribed to mRNA in 4 different ways. Scroll around a bit to look at the attributes of these transcripts. If you look closely you'll see there is some more information in there. For example a transcript support level which reflects how confident GENCODE is about the transcript. See the Ensembl page for a description of these.
One of these transcripts (ENST00000425340.3) is also marked as ‘Ensemble canonical’ which means "a single, representative transcript identified at every locus". So let's focus on that transcript and dig a bit deeper
Finding exons
This is easy now:
grep 'Parent=ENST00000425340.3' gencode.v41.annotation.gff3 | awk '$3 == "exon"' | less -S
Aha, it has two exons.
So, how long are these exons? To make that easier let's use cut
to get rid of the noise:
grep 'Parent=ENST00000425340.3' gencode.v41.annotation.gff3 | awk '$3 == "exon"' | cut -f1,3-5
Adding that up, the two exons have length 119 and 2,997 - so only about 30% of the gene is actually transcribed into RNA!
What about the bit that codes for protein? We can find that by looking for the coding sequence records - they
have type=CDS
:
grep 'Parent=ENST00000425340.3' gencode.v41.annotation.gff3 | awk '$3 == "CDS"' | cut -f1,3-5
If you look at this you'll see the gene has one annotated coding sequence, and it lives entirely inside the second exon. Its length is 1032 base pairs. So only about th of the gene codes for protein.
Note
If we've got this right then the nucleotide length of the coding sequence should be a multiple of something - what? Is 1032 an appropriate multiple?
Challenge question
Now repeat the above process for another gene and see if things look similar. For example, try the genes that encode
alpha globin, named HBA1
and HBA2
.