Filtering files: sort, uniq, cut, tr
Pipelines are great, but you need some commands to put in them. Luckily, the command line comes with lots of commands that can take standard input, transform it, and then send the result to standard output.
Let's imagine we have the following file:
% cat beatles.tsv
John vocals,guitars,keyboards,harmonica,bass
Paul vocals,bass,guitars,keyboards,drums
George guitars,vocals,sitar,keyboards,bass
Ringo drums,percussion,vocals
You can get a copy of this file for yourself by running one of these commands:
wget https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2023/data/beatles.tsv
or
curl -O https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2023/data/beatles.tsv
Sorting files
sort sorts its input alphabetically, numerically, or even randomly.
To try it, let's sort the Beatles based on their first names (from column 1):
% cat beatles.tsv | sort -k 1
George guitars,vocals,sitar,keyboards,bass
John vocals,guitars,keyboards,harmonica,bass
Paul vocals,bass,guitars,keyboards,drums
Ringo drums,percussion,vocals
:::tip Note
Like other commands here, sort can also take a file as input directly, as in
% sort -k 1 beatles.tsv
But we are going to keep using the pipeline syntax because it is flexible.
:::
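As a small sketch of the difference between alphabetical and numeric sorting, the following uses made-up numbers fed in with printf (they are not part of beatles.tsv). By default sort compares lines as text, so 10 comes before 2; the -n option compares them as numbers instead.

```shell
# Default (alphabetical) order: lines are compared as text
printf '10\n9\n2\n' | sort
# prints 10, 2, 9 (one per line)

# Numeric order with -n: lines are compared as numbers
printf '10\n9\n2\n' | sort -n
# prints 2, 9, 10 (one per line)
```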
Cutting columns from files
cut lets you pick out particular columns of the input to keep.
For example, if we only want the names above, we could use cut to pick out just the first column:
% cat beatles.tsv | cut -f 1
John
Paul
George
Ringo
By default, cut splits its columns on tabs, but it can use any single character as a delimiter. For example, let's get just the first instrument from each list by using cut again with the delimiter (the -d option) set to a comma:
% cat beatles.tsv | cut -f 2 | cut -d ',' -f 1
vocals
vocals
guitars
drums
Question
Check you understand how this works. The first cut in the pipeline uses tabs as a delimiter, so it cuts out the second column, which looks like 'guitars,vocals,sitar,keyboards,bass' and so on. The second cut then cuts out the first instrument from that list by using , as a delimiter.
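The -f option of cut also accepts ranges and comma-separated lists of field numbers. Here is a minimal sketch, assuming a local copy of the data in a hypothetical file called beatles_demo.tsv, with a tab between the name and the list of instruments:

```shell
# Recreate the example data in a hypothetical demo file (tab-separated)
printf 'John\tvocals,guitars,keyboards,harmonica,bass\n'  > beatles_demo.tsv
printf 'Paul\tvocals,bass,guitars,keyboards,drums\n'     >> beatles_demo.tsv
printf 'George\tguitars,vocals,sitar,keyboards,bass\n'   >> beatles_demo.tsv
printf 'Ringo\tdrums,percussion,vocals\n'                >> beatles_demo.tsv

# Keep the first two instruments (fields 1 to 2 of the comma-separated list)
cut -f 2 beatles_demo.tsv | cut -d ',' -f 1-2
# vocals,guitars
# vocals,bass
# guitars,vocals
# drums,percussion
```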
Finding matching (or mismatching) lines
grep finds lines that match a certain pattern.
For example, suppose we wanted information about just Paul. We could do it by grepping for Paul:
% cat beatles.tsv | grep Paul
We could also use the -v (or --invert-match) option to find out about all the Beatles except Paul:
% cat beatles.tsv | grep -v Paul
Do all Beatles sing?
% cat beatles.tsv | grep vocals
grep is actually a sophisticated tool that can find complex matches; we'll come back to this below.
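If only the number of matching lines is needed, the -c option of grep prints a count instead of the lines themselves. The sketch below assumes a local, tab-separated copy of the data in a hypothetical file called beatles_demo.tsv:

```shell
# Recreate the example data in a hypothetical demo file (tab-separated)
printf 'John\tvocals,guitars,keyboards,harmonica,bass\n'  > beatles_demo.tsv
printf 'Paul\tvocals,bass,guitars,keyboards,drums\n'     >> beatles_demo.tsv
printf 'George\tguitars,vocals,sitar,keyboards,bass\n'   >> beatles_demo.tsv
printf 'Ringo\tdrums,percussion,vocals\n'                >> beatles_demo.tsv

# How many lines mention drums?
cat beatles_demo.tsv | grep -c drums
# 2

# How many lines do NOT mention Paul? (-v inverts the match, -c counts)
cat beatles_demo.tsv | grep -v -c Paul
# 3
```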
Getting unique lists of values, and counting them
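uniq removes duplicate lines from standard input. It only compares adjacent lines, so you usually need to sort the input first.
For example, to get a unique list of first instruments we could extend the above with a call to uniq. (Here the two duplicate lines happen to be next to each other already, so we can get away without sorting.)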
% cat beatles.tsv | cut -f 2 | cut -d ',' -f 1 | uniq
vocals
guitars
drums
uniq also has a very useful option, -c, which also counts the number of times each line occurs:
% cat beatles.tsv | cut -f 2 | cut -d ',' -f 1 | uniq -c
2 vocals
1 guitars
1 drums
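To see why sorting matters, here is a small sketch using a made-up list of instruments (fed in with printf rather than taken from beatles.tsv) in which the duplicates are not next to each other:

```shell
# Without sorting, the two 'bass' lines are not adjacent, so both survive
printf 'bass\ndrums\nbass\n' | uniq
# bass
# drums
# bass

# Sorting first brings the duplicates together, so uniq can collapse them
printf 'bass\ndrums\nbass\n' | sort | uniq
# bass
# drums
```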
Transforming input
tr translates characters into others.
Specifically, if you give it two equal-length strings, it replaces each character from the first string that appears in standard input with the corresponding character from the second string:
% echo veryinsecurepassword | tr osi 05!
very!n5ecurepa55w0rd
It can also be used to change the case of a string:
% echo veryinsecurepassword | tr '[:lower:]' '[:upper:]'
VERYINSECUREPASSWORD
tr also has an option -d which means 'delete these characters':
% echo veryinsecurepassword | tr -d 'aeiou'
vrynscrpsswrd
As you can see, you can do a lot by combining these simple tools together with pipes.
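As one possible illustration of combining these tools, the sketch below counts how many Beatles play each instrument. It assumes a local, tab-separated copy of the data in a hypothetical file called beatles_demo.tsv, and uses tr to turn each comma into a newline so that every instrument ends up on its own line.

```shell
# Recreate the example data in a hypothetical demo file (tab-separated)
printf 'John\tvocals,guitars,keyboards,harmonica,bass\n'  > beatles_demo.tsv
printf 'Paul\tvocals,bass,guitars,keyboards,drums\n'     >> beatles_demo.tsv
printf 'George\tguitars,vocals,sitar,keyboards,bass\n'   >> beatles_demo.tsv
printf 'Ringo\tdrums,percussion,vocals\n'                >> beatles_demo.tsv

# 1. cut    keeps the instruments column
# 2. tr     puts each instrument on its own line
# 3. sort   brings identical instruments together
# 4. uniq   collapses them and counts each one (-c)
cat beatles_demo.tsv | cut -f 2 | tr ',' '\n' | sort | uniq -c
```

Each line of the output is a count followed by an instrument; for this data, vocals is counted 4 times and sitar once. The exact spacing before the counts varies between systems.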
Complex processing with grep, awk and sed
As a last part to this tutorial, we will mention three programs that are sophisticated tools in their own right. They are well worth looking into in detail, but we will use them in fairly simple ways in this module. Here are a few basic recipes that you can use with each of them without needing to understand them deeply.
grep extracts lines containing a pattern.
You could use it to see who in the Beatles plays keyboards:
% cat beatles.tsv | grep keyboards
John vocals,guitars,keyboards,harmonica,bass
Paul vocals,bass,guitars,keyboards,drums
George guitars,vocals,sitar,keyboards,bass
or to find the Beatles who play keyboards and sing:
% cat beatles.tsv | grep keyboards | grep vocals
:::tip Note
grep is actually quite a bit more sophisticated than this: the pattern is really a regular expression. We won't go into detail, but for example, here is a command which only finds lines with 'bass' right at the end of the line:
% cat beatles.tsv | grep 'bass$'
(The $ in a regular expression means 'match the end of the line'.)
:::
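Two more regular expression features, shown here only as a sketch: ^ matches the start of the line, and with the -E option (extended regular expressions) a | means 'either of these'. The example assumes a local, tab-separated copy of the data in a hypothetical file called beatles_demo.tsv.

```shell
# Recreate the example data in a hypothetical demo file (tab-separated)
printf 'John\tvocals,guitars,keyboards,harmonica,bass\n'  > beatles_demo.tsv
printf 'Paul\tvocals,bass,guitars,keyboards,drums\n'     >> beatles_demo.tsv
printf 'George\tguitars,vocals,sitar,keyboards,bass\n'   >> beatles_demo.tsv
printf 'Ringo\tdrums,percussion,vocals\n'                >> beatles_demo.tsv

# Lines that start with a capital G
cat beatles_demo.tsv | grep '^G'
# George  guitars,vocals,sitar,keyboards,bass

# Lines that contain either 'sitar' or 'harmonica'
cat beatles_demo.tsv | grep -E 'sitar|harmonica'
# John    vocals,guitars,keyboards,harmonica,bass
# George  guitars,vocals,sitar,keyboards,bass
```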
awk filters lines based on the contents of columns.
In this module we'll also make very simple use of awk to find values in specific columns. (This is different from grep, which searches for patterns in the whole line.)
For example, let's pick out just Paul using his name in column one:
% cat beatles.tsv | awk '$1 == "Paul"'
Paul vocals,bass,guitars,keyboards,drums
Here the $1 means 'look at column 1' and the == "Paul" means 'is equal to "Paul"'. (We have to put single quotes around the whole thing to make sure all this is interpreted by awk, not the command line itself.)
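The same recipe can be extended a little. The sketch below goes slightly beyond what is described above and uses three standard awk features: -F '\t' to split columns on tabs only, the ~ operator to test whether a column matches a pattern, and a { print $1 } action to output just the first column of each matching line. It assumes a local, tab-separated copy of the data in a hypothetical file called beatles_demo.tsv.

```shell
# Recreate the example data in a hypothetical demo file (tab-separated)
printf 'John\tvocals,guitars,keyboards,harmonica,bass\n'  > beatles_demo.tsv
printf 'Paul\tvocals,bass,guitars,keyboards,drums\n'     >> beatles_demo.tsv
printf 'George\tguitars,vocals,sitar,keyboards,bass\n'   >> beatles_demo.tsv
printf 'Ringo\tdrums,percussion,vocals\n'                >> beatles_demo.tsv

# Print the name (column 1) of everyone whose instruments (column 2) include drums
cat beatles_demo.tsv | awk -F '\t' '$2 ~ /drums/ { print $1 }'
# Paul
# Ringo
```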
Finally, sed can parse and transform text in a more sophisticated way than tr.
For example, it can be used to substitute some text with other text using the 's' command:
% cat beatles.tsv | sed 's/keyboards/piano/'
John vocals,guitars,piano,harmonica,bass
Paul vocals,bass,guitars,piano,drums
George guitars,vocals,sitar,piano,bass
Ringo drums,percussion,vocals
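One detail worth knowing: the 's' command replaces only the first match on each line unless you add the g ('global') flag at the end. A minimal sketch, using a made-up input line in which the word appears twice:

```shell
# Without g: only the first 'keyboards' on the line is replaced
printf 'keyboards,keyboards\n' | sed 's/keyboards/piano/'
# piano,keyboards

# With g: every 'keyboards' on the line is replaced
printf 'keyboards,keyboards\n' | sed 's/keyboards/piano/g'
# piano,piano
```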
By putting together pipelines made out of these commands you can perform complex sorting, filtering, and transformations on files.
Conclusion
Congratulations! You've reached the end of this tutorial. Remember you can see everything you've run so far by typing:
% history
Now try some test questions or read some more advanced topics.