One of the first steps towards automating repetitive tasks is to become familiar with the command line. In this chapter we will accomplish this by using the command line to extract UniProt identifiers for all human proteins in Swiss-Prot.
More precisely we will be using a terminal emulator to run Bash. A terminal emulator is an application that gives you access to another program known as the shell. The shell allows you to interact with the operating system’s services and programs. The most widely used shell and the default on Mac and most Linux based distributions is Bash.
You will get the most benefit from this chapter if you work through the example as you go along. You are therefore encouraged to open up a terminal now (Fig. 2)!
On Macs the command line is available through the Terminal application. There are a number of terminal emulators available for Linux. If you are using the Gnome-based desktop the default is likely to be the Gnome Terminal.
On Windows things are a little bit more complicated. Windows does not come bundled with Bash by default. I would recommend installing VirtualBox and running a Linux distribution such as BioLinux in it. VirtualBox is a so called hypervisor that lets you run a virtual machine. In this scenario you would run BioLinux as a virtual machine. For more information on how to run BioLinux as a virtual machine see the BioLinux installation notes.
The benefit of learning how to use Linux/Mac and Bash is that most bioinformatics software are developed on and designed to run on these platforms. Furthermore, the terminal and Bash provides an excellent way of creating analysis pipelines.
Most of the commands on the command line have built in help that can be accessed by
providing either the argument -h
or --help
. For example to access help for
the gunzip
command, which we will use later, enter the command below into the shell
and press enter:
$ gunzip --help
More descriptive documentation can usually be found using the man
(manual)
command. For example to view the manual page of the ls
command you can run
the command below.
$ man ls
To get out of the “man-page” press the “q” key. Just like I encourage you to try out the examples outlined in this chapter, I also encourage you to examine the help and man-page documentation for the commands that we use to get a better understanding of what the commands do.
First of all let us make sure that we are working in our home directory.
$ cd
The cd
command, short for change directory, is used to move between
directories. If called without a path to a directory it will move you into your
home directory.
We can print out the name of the current working directory using the pwd
command. Furthermore we can list the contents of a directory using the ls
command.
$ pwd
/home/olsson
$ ls
Now that we know where we are and what files and directories are present let us
create a new directory for our project. This is achieved using the mkdir
command, short for make directory. After having created the directory move
into it using the cd
command.
$ mkdir first_step_towards_automation
$ cd first_step_towards_automation
Note
When using the command line one learns to avoid using white spaces in
file and directory names. This is because white spaces are used to separate
arguments. In the example above we used underscores instead of white spaces.
However, one could just as well have used hyphens. This comes down to personal
preference. It is possible to represent file names with spaces in them on the
command line by using the backlash character (\
) to “escape” the
whitespace, for example first\ steps\ towards\ automation
or by surrounding
the text in quotes "first steps towards automation"
.
UniProt (Universal Protein Resource) is a comprehensive resource of protein sequences and annotations. The UniProt Knowledgebase (UniProtKB) consists of Swiss-Prot and TrEMBLE. Both are annotated. However, the procedure in which they are annotated differ. TrEMBLE uses an automatic annotation system, whereas the annotation in SwissProt is manual and includes a review process.
It is time to download the Swiss-Prot knowledge base from UniProt. We will
use the curl
program to do this. The curl
command is a C program that
allows us to stream data from URLs and FTP sites. By default the curl
program writes the content of the URL to the standard output stream.
To see this in action try running the command:
$ curl www.bbc.com
You should see a whole lot of HTML text appearing in your terminal window.
However, because we are going to download a large file we would like to
write it to disk for future use. Many command line programs allow the user to
specify additional options. In this particular case we can use the
--output
option to specify a file name that the output should be
written to. To exemplify this let us download the BBC home page to a file named
bbc.html
.
$ curl --output bbc.html www.bbc.com
Here we will use a URL shortened using bitly to save on typing. The shortened URL contains a redirect to the relevant Swiss-Prot FASTA file hosted on the UniProt FTP site. To find out where the shortned URL redirects to run the command:
$ curl http://bit.ly/1l6SAKb
To allow the redirection to occur we need to use the --location
option,
which will redirect the request to the new location.
Let us download the gzipped FASTA file from the UniProt FTP site:
$ curl --location --output uniprot_sprot.fasta.gz http://bit.ly/1l6SAKb
The downloaded file uniprot_sprot.fasta.gz
has been compressed using the
gzip
protocol. We can extract it using the gunzip
command. However,
when extracted it more than doubles in size. So we will use the --to-stdout
option to extract the content to the standard output stream whilst leaving the
original file compressed.
Try running the command:
$ gunzip --to-stdout uniprot_sprot.fasta.gz
You should see a lot of FASTA lines printed to your terminal, or more formally the standard output stream.
Options starting with two dashes, --
, are known as long options. Many of
these long options also have abbreviated “short” options. For example, the
-c
option of gunzip
is equivalent to the --to-stdout
option. Try
running the command:
$ gunzip -c uniprot_sprot.fasta.gz
From now on the text will use the short -c
option rather than the long
--to-stdout
option to save on typing.
Note
Remember that you can use the --help
or -h
option to get
information on the meanings of the various options available to you.
Now it is time to introduce one of the greatest features of the command line: pipes!
Pipes are a means to redirect the output from one command into another. The character
used to represent a pipe is the vertical bar: |
.
To illustrate the use of pipes we will redirect the output of the previous
gunzip
command to the word count program wc
. Try running the command
below:
$ gunzip -c uniprot_sprot.fasta.gz | wc
It should give you three numbers, these are the line, word and character counts. To
only see the line count one could use the -l
option:
$ gunzip -c uniprot_sprot.fasta.gz | wc -l
Pipes are powerful because they allow a set of simple commands to be combined to perform tasks that are beyond the scope of any of the individual commands. This has led to a central Unix philosophy of having simple programs that do one task well and a rich ecosystem of such programs. The user is then free to combine these programs to create personalised tools to automate repetitive processing tasks.
Another powerful feature of pipes is that the program being piped to gets access to the output stream of data from the program piping data into the pipe as soon as it is available. This means that the processing of data can happen in parallel.
Unix-based systems make a distinction between programs that are used for examining files, known as pagers, and programs that are used for editing files, known as text editors. The reason for making this distinction is to help prevent accidental changes to files when reading them.
To view the beginning of a file one can use the head
command. Let us examine
the first lines of the uniprot_sprot.fasta.gz
file by pipeing the output of the
gunzip
command into head
:
$ gunzip -c uniprot_sprot.fasta.gz | head
You should see something like the output below being written to the terminal window.
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog ...
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus ...
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
The beauty of the head
command is that it allows you to quickly view the
beginning of a file without having to read in the content of the entire file.
The latter can present a real problem if working on “big data” files. In fact,
this is also the beauty of pipes, which allows downstream programs to work on
the stream of data without having to wait for it to be written to or read from
disk.
By default the head
command writes out the first ten lines. However, this
can be modified using the -n
option, for example to write out the first 20
lines:
$ gunzip -c uniprot_sprot.fasta.gz | head -n 20
Similarly, there is a tail
command for displaying the tail end of a file,
again ten lines by default.
$ gunzip -c uniprot_sprot.fasta.gz | tail
You may have noticed that the workflow above, to view the last ten lines, took a little longer to complete. That is because we needed to decompress the whole file before we could access the last ten lines of it.
To page through an entire file one can use the less
command.
$ gunzip -c uniprot_sprot.fasta.gz | less
One can use the “Up” and “Down” arrows to navigate through the file using
less
. One can also use the “Space” key to move forward by an entire page,
hence the term pager. To page back one page press the “b” key. When you are
finished examining the file press “q” to quit less
.
Now that we have an idea of what the file looks like it is time to extract the FASTA identifiers that correspond to human proteins.
A powerful command for finding lines of interest in text is the grep
program, which can be used to search for strings and patterns. Let us use it to
search for the string “Homo”:
$ gunzip -c uniprot_sprot.fasta.gz | grep Homo | less
To make the match more visible we can add the --color=always
option, which
will highlight the matched string as red.
$ gunzip -c uniprot_sprot.fasta.gz | grep --color=always Homo | less
If you scroll through the matches you will notice that we have some false
positives. We can highlight these by performing another grep
command that
finds lines that do not contain the string “sapiens”, using the
--invert-match
option or the equivalent -v
short option.
$ gunzip -c uniprot_sprot.fasta.gz | grep Homo | grep -v sapiens
To make the search more specific we can search for the string “OS=Homo sapiens”. To do this we need to surround the search pattern by quotes, which tells the shell that the two parts separated by a white space should be treated as one argument.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens"
To work out how many lines were matched we can pipe the output of grep
to
the wc
command.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" | wc -l
Below are the first three lines identified using the grep
command.
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens ...
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=...
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHA...
Now that we can identify description lines corresponding to human proteins we
want to extract the UniProt identifiers from them. In this instance we will use
the command cut
to chop the line into smaller fragments, based on a
delimiter character, and print out the relevant fragment. The delimiter we are
going to use is the vertical bar (“|”). This has got nothing to do with
pipeing, it is simply the character surrounding the UniProt identifier. By
splitting the line by “|” the UniProt id will be available in the second
fragment.
The command below makes use of the backslash character (\
) at the end of
the first line. This tells bash that the command continues on the next line.
You can use this syntax in your scripts and in the terminal. Alternatively, you
can simply include the content from both lines below in a single line, omitting
the \
.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" \
| cut -d '|' -f 2
In the above the -d
option specifies the delimiter to use to split
the line, in this instance the pipe symbol (|
). The -f 2
option
specifies that we want to extract the second field.
Note
Remember to try out these commands on your computer to see the actual output of the commands.
By default the output from commands are written to the standard output stream.
Earlier we saw that we could use the pipes to redirect the output to another
command. However, it is also possible to redirect the output to a file, i.e.
save the output to a file. This is achieved using the greater than symbol
(>
). You can use the idea of an arrow as a mnemonic, the output is going
from the command into the file as indicated by the arrow.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" \
| cut -d '|' -f 2 > human_uniprot_ids.txt
Now if you run the ls
command you will see the file
human_uniprot_ids.txt
in the directory and you can view its contents using
less
:
$ less human_uniprot_ids.txt
Well done! You have just extracted the UniProt identifiers for all human proteins in Swiss-Prot. Have a cup of tea and a biscuit.
The remainder of this chapter will go over some more useful commands for working on the command line and reiterate some of the key take home messages.
Okay, so you have had a relaxing cup of tea and your head is no longer buzzing from information overload. However, you have also forgotten how you managed to extract those UniProt identifiers.
Not to worry. You can view the history of your previous commands using history
:
$ history
Note that each command has a history number associated with it. You can use the number in the history to rerun a previous command without having to retype it. For example to rerun command number 597 you would type in:
$ !597
Note that the exclamation mark (!
) in the above is required.
After having run the history
command the terminal window is full of information.
However, you find it distracting to have all those commands staring at you whilst
you are trying to think.
To clear the screen of output one can use the clear
command:
$ clear
Sometimes, for example if you try to view a binary file using a pager, your
shell can start displaying garbage. In these cases it may help to run the
reset
command.
$ reset
In general it is advisable to use clear
as it only clears the terminal screen
whereas reset
reinitialises the terminal.
You want to store a copy of your human_uniprot_id.txt
file in a backup
directory.
For this exercise let us start by creating a backup directory.
$ mkdir backup
Now we can copy the file into the backup directory using the cp
command.
$ cp human_uniprot_id.txt backup/
The command above uses the original name of the file. However, we could have given it a different name, for example including the date.
$ cp human_uniprot_id.txt backup/human_uniprot_id_2015-11-10.txt
Finally, suppose that one wanted to rename the original file to use hyphens
rather than under scores. To to this one would use the mv
command, mnemonic
move.
$ mv human_uniprot_id.txt human-uniprot-id.txt
Having experimented with the command line we want to clean up by removing unwanted files and directories.
One can remove files using the rm
command:
$ rm backup/human_uniprot_id.txt
Empty directories can be removed using the rmdir
command:
$ mkdir empty
$ rmdir empty
To remove directories with files in them one can use the rm
command with
the recursive option:
$ rm -r backup
Avertissement
Think twice before deleting files, they will be deleted permanently.
When using rm
there is no such thing as recycle bin from which
the files can be recovered.
-help
option to understand a command and its options