Counting genes
Quit less
if you are in it (by pressing q
) and let's generate some basic statistics.
First, how many genes and other things are in the file? For this, we can use the cut
command to cut out the
third column (which contains the type
). Then we'll pipe the output into the sort command (which sorts the
rows). And finally we will ask the uniq
command to count:
cut -f3 gencode.v41.annotation.gff3 | sort | uniq -c
This will take a minute or two to run - it's a big file!
Ok - the output is not really useful because of all the metadata. Let's use grep -v
to get rid of it:
grep -v '#' gencode.v41.annotation.gff3 | cut -f3 | sort | uniq -c
This finds lines that don't contain #
, extracts the third column from them, sorts them, and counts
the unique values.
Picking apart the pipeline
If this command isn't making sense to you, a good idea is to look at what each step does. Try running these commands one by one to parse it apart:
View the whole file:
less -S gencode.v41.annotation.gff3
Just the data rows:
grep -v '#' gencode.v41.annotation.gff3 | less -S
Just the third column of the data rows:
grep -v '#' gencode.v41.annotation.gff3 | cut -f3 | less -S
The third column sorted:
grep -v '#' gencode.v41.annotation.gff3 | cut -f3 | sort | less -S
The sorted unique values in the third column....
grep -v '#' gencode.v41.annotation.gff3 | cut -f3 | sort | uniq | less -S
...and the same thing with counts:
grep -v '#' gencode.v41.annotation.gff3 | cut -f3 | sort | uniq -c | less -S
Hopefully by this point it is clear(er) what each step is doing.
It prints:
872459 CDS
1625321 exon
171599 five_prime_UTR
61852 gene
97009 start_codon
90749 stop_codon
119 stop_codon_redefined_as_selenocysteine
203260 three_prime_UTR
251236 transcript
So - there are 1.6 million exons in the file and... wait a moment, are there really 60,000 genes in the human genome?
Question
The number 60,000 is way too big - why?