Streptomyces coelicolor is a soil-dwelling bacterium with a complex life-cycle involving mycelial growth and sporulation. It is of particular interest in that it can produce a range of natural products of pharmaceutical relevance. In fact it is the major source of natural antibiotics.
In 2002 the genome of Streptomyces coelicolor A3(2), the model actinomycete organism, was sequenced. In this chapter we will do some basic bioinformatics on this genome to illustrate some fundamental concepts of computer programming.
One feature of interest when examining DNA is the guanine-cytosine (GC) content. DNA with high GC-content is more stable than DNA with low GC-content. The GC-content is defined as:
To solve this problem we need to write code to be able to read in sequence data from a file, process the input to calculate the local GC-content and write out the results to another file. In the process the computational concepts of variables, functions and loops will be introduced.
In this chapter we will use the Python scripting language to perform our data analysis.
Python is a high-level scripting language that is growing in popularity in the scientific community. It uses a syntax that is relatively easy to get to grips with and which encourages code readability.
To start off with we will make use of Python using its interactive mode, which means that we can type Python commands straight into the terminal. In fact when working with Python in interactive mode one can think of it as switching the terminal’s shell from Bash to Python.
To start Python in its interactive mode simply type python
into the terminal.
$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Note that this prints out information about the version of Python that is being used and how it was compiled before leaving you at the interactive prompt. In this instance I am using Python version 2.7.10.
The three greater than signs (>>>
) represent the primary prompt into which
commands can be entered.
>>> 1 + 2
3
The secondary prompt is represented by three dots (...
) and denotes a
continuation line.
>>> line = ">sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R"
>>> if line.startswith(">"):
... print(line)
...
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R
A variable is a means of storing a piece of information using using a descriptive name. The use of variables is encouraged as it allows us to avoid having to repeat ourselves, the DRY principle.
In Python variables are assigned using the equals sign.
>>> pi = 3.14
When naming variables being explicit is more important than being succinct.
One reason for this is that you will spend more time reading your code than
you will writing it. Avoiding the mental overhead of trying to understand
what all the acronyms mean is a good thing. For example, suppose that we
wanted to create a variable for storing the radius of a circle. Please
avoid the temptation of naming the variable r
, and go for the longer
but more explicit name radius
.
>>> radius = 1.5
Note
Many modern text editors have auto complete functionality so longer variable names does not necessarily need to mean that there is more typing required. Remember, spend time finding a text editor that works well for you!
Suppose that we had a string representing a DNA sequence.
>>> dna_string = "attagcgcaatctaactacactactgccgcgcggcatatatttaaatata"
>>> print(dna_string)
attagcgcaatctaactacactactgccgcgcggcatatatttaaatata
A string is a data type for representing text. As such it is not ideal for data processing purposes. In this case the DNA sequence would be better represented using a list, with each item in the list representing a DNA letter.
In Python we can convert a string into a list using the built-in list()
function.
>>> dna_list = list(dna_string)
>>> print(dna_list)
['a', 't', 't', 'a', 'g', 'c', 'g', 'c', 'a', 'a', 't', 'c', 't', 'a', 'a',
'c', 't', 'a', 'c', 'a', 'c', 't', 'a', 'c', 't', 'g', 'c', 'c', 'g', 'c',
'g', 'c', 'g', 'g', 'c', 'a', 't', 'a', 't', 'a', 't', 't', 't', 'a', 'a',
'a', 't', 'a', 't', 'a']
Python’s list class has got a method called count()
that we can use
to find out the counts of particular elements in the list.
>>> dna_list.count("a")
17
To find out the total number of items in a list one can use Python’s built-in
len()
function, which returns the length of the list.
>>> len(dna_list)
50
When using Python you need to be careful when dividing integers, because in Python 2 the default is to use integer division, i.e. to discard the remainder.
>>> 3 / 2
1
One can work around this by ensuring that at least one of the numbers is represented using floating point.
>>> 3 / 2.0
1.5
Avertissement
In Python 3, the behaviour of the division operator has been changed, and dividing two integers will result in normal division.
One can convert an integer to a floating point number using Python’s built-in
float()
function.
>>> float(2)
2.0
We now have all the information required to calculate the GC-content of the DNA sequence.
>>> gc_count = dna_list.count("g") + dna_list.count("c")
>>> gc_fraction = float(gc_count) / len(dna_list)
>>> 100 * gc_fraction
38.0
Suppose that we wanted to calculate the GC-content for several sequences. In this case it would be very annoying, and error prone, to have to enter the commands above into the Python shell manually for each sequence. Rather, it would be advantageous to be able to create a piece of code that could be called repeatedly to calculate the GC-content. We can achieve this using the concept of functions. In other words functions are a means for programmers to avoid repeating themselves, thus adhering to the DRY principle.
Let us create a simple function that adds two items together.
>>> def add(a, b):
... return a + b
...
>>> add(2, 3)
5
In Python functions are defined using the def
keyword. Note that the
def
keyword is followed by the name of the function. The name of the
function is followed by a parenthesized set of arguments, in this case the
function takes two arguments a
and b
. The end of the function
definition is marked using a colon.
The body of the function, in this example the return
statement, needs to be
indented. The standard in Python is to use four white spaces to indent code
blocks. In this case the function body only contains one line of code. However,
a function can include several indented lines of code.
Avertissement
Whitespace really matters in Python! If your code is not correctly
aligned you will see IndentationError
messages telling you
that everything is not as it should be. You will also run into
IndentationError
messages if you mix white spaces and tabs.
Now we can create a function for calculating the GC-content of a sequence. As with variables explicit trumps succinct in terms of naming.
>>> def gc_content(sequence):
... gc_count = sequence.count("g") + sequence.count("c")
... gc_fraction = float(gc_count) / len(sequence)
... return 100 * gc_fraction
...
>>> gc_content(dna_list)
38.0
Suppose that we wanted to look at local variability in GC-content. To achieve this we would like to be able to select segments of our initial list. This is known as “slicing”, as in slicing up a salami.
In Python slicing uses a [start:end]
syntax that is inclusive for the start
index and exclusive for the end index. To illustrate slicing let us first
create a list to work with.
>>> zero_to_five = ["zero", "one", "two", "three", "four", "five"]
To get the first two elements we therefore use 0 for the start index, as Python uses a zero-based indexing system, and 2 for the end index as the element from the end index is excluded.
>>> zero_to_five[0:2]
['zero', 'one']
Note that the start position for the slicing is 0 by default so we could just as well have written.
>>> zero_to_five[:2]
['zero', 'one']
To get the last three elements.
>>> zero_to_five[3:]
['three', 'four', 'five']
It is worth noting that we can use negative indices, where -1 represents the last element. So to get all elements except the first and the last, one could slice the list using the indices 1 and -1.
>>> zero_to_five[1:-1]
['one', 'two', 'three', 'four']
We can use list slicing to calculate the local GC-content measurements of our DNA.
>>> gc_content(dna_list[:10])
40.0
>>> gc_content(dna_list[10:20])
30.0
>>> gc_content(dna_list[20:30])
70.0
>>> gc_content(dna_list[30:40])
50.0
>>> gc_content(dna_list[40:50])
0.0
It can get a bit repetitive, tedious, and error prone specifying all the ranges manually. A better way to do this is to make use of a loop construct. A loop allows a program to cycle through the same set of operations a number of times.
In lower level languages while loops are common because they operate in a way that closely mimic how the hardware works. The code below illustrates a typical setup of a while loop.
>>> cycle = 0
>>> while cycle < 5:
... print(cycle)
... cycle = cycle + 1
...
0
1
2
3
4
In the code above Python moves through the commands in the while loop executing
them in order, i.e. printing the value of the cycle
variable and then
incrementing it. The logic then moves back to the while
statement and
the conditional (cycle < 5
) is re-evaluated. If true the commands in the
while statment are executed in order again, and so forth until the conditional
is false. In this example the print(cycle)
command was called five times,
i.e. until the cycle
variable incremented to 5 and the cycle < 5
conditional evaluated to false.
However, when working in Python it is much more common to make use of for loops. For loops are used to iterate over elements in data structures such as lists.
>>> for item in [0, 1, 2, 3, 4]:
... print(item)
...
0
1
2
3
4
In the above we had to manually write out all the numbers that we wanted. However, because iterating over a range of integers is such a common task Python has a built-in function for generating such lists.
>>> range(5)
[0, 1, 2, 3, 4]
So a typical for loop might look like the below.
>>> for item in range(5):
... print(item)
...
0
1
2
3
4
The range()
function can also be told to start at a larger number. Say for
example that we wanted a list including the numbers 5, 6 and 7.
>>> range(5, 8)
[5, 6, 7]
As with slicing the start value is included whereas the end value is excluded.
It is also possible to alter the step size. To do this we must specify the start and end values explicitly before adding the step size.
>>> range(0, 50, 10)
[0, 10, 20, 30, 40]
We are now in a position where we can create a naive loop for for calculating the local GC-content of our DNA.
>>> for start in range(0, 50, 10):
... end = start + 10
... print(gc_content(dna_list[start:end]))
...
40.0
30.0
70.0
50.0
0.0
Loops are really powerful. They provide a means to iterate over lots of items and can be used to automate repetitive tasks.
So far we have been working with Python in interactive mode. This is a great way to explore what can be achieved with Python. It is also handy to simply use Python’s interactive mode as a command line calculator. However, it can get a little bit clunky when trying to write constructs that span several lines, such as functions.
Now we will examine how one can write a Python script as a text file and how to run that text file through the Python interpreter, i.e. how to run a Python script from the command line.
Start off by creating a new directory to work in.
$ mkdir S.coelicolor-local-GC-content
$ cd S.coelicolor-local-GC-content
Use your favourite text editor to enter the code below into a file
named gc_content.py
.
1 2 | sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
print(sequence)
|
Note
If your text editor is not giving you syntax highlighting find out how it can be enabled. If your text editor does not support syntax highlighting find a better text editor!
Open up a terminal and go to the directory where you saved the gc_content.py
script. Run the script using the command below.
$ python gc_content.py
You should see the output below printed to your terminal.
['a', 't', 't', 'a', 'g', 'c', 'g', 'c', 'a', 'a', 't', 'c', 't', 'a', 'a',
'c', 't', 'a', 'c', 'a', 'c', 't', 'a', 'c', 't', 'g', 'c', 'c', 'g', 'c',
'g', 'c', 'g', 'g', 'c', 'a', 't', 'a', 't', 'a', 't', 't', 't', 'a', 'a',
'a', 't', 'a', 't', 'a']
In the script we used Python’s built-in list()
function to convert the
DNA string into a list. We then printed out the sequence
list.
Now let us add the gc_content()
function to the script.
1 2 3 4 5 6 7 8 | def gc_content(sequence):
"Return GC-content as a percentage from a list of DNA letters."
gc_count = sequence.count("g") + sequence.count("c")
gc_fraction = float(gc_count) / len(sequence)
return 100 * gc_fraction
sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
print(gc_content(sequence))
|
In the above the gc_content()
function is implemented as per our
exploration in our interactive session. The only difference is the
addition of a, so called, “docstring” (documentation string) to the body
of the function (line 2). The docstring is meant to document the purpose
and usage of the function. Documenting the code in this way makes it
easier for you, and others, to understand it.
Note that the script now prints out the GC-content rather than the sequence (line 8). Let us run the updated script from the command line.
$ python gc_content.py
38.0
The next piece of code will be a bit more complicated. However, note that that it represents the most complicated aspect of this chapter. So if you find it difficult, don’t give up, it gets easier again later on.
Now let us implement a new function for performing a sliding window analysis.
Add the code below to the start of the gc_content.py
file.
1 2 3 4 5 6 7 8 9 10 11 12 | def sliding_window_analysis(sequence, function, window_size=10):
"""Return an iterator that yields (start, end, property) tuples.
Where start and end are the indices used to slice the input list
and property is the return value of the function given the sliced
list.
"""
for start in range(0, len(sequence), window_size):
end = start + window_size
if end > len(sequence):
break
yield start, end, function(sequence[start:end])
|
There is quite a lot going on in the code above so let us walk through it
slowly. One of the first things to note is that the
sliding_window_analysis()
function takes another function
as its
second argument. Functions can be passed around just like variables and on line
12 the function
is repeatedly called with slices of the input sequence.
The sliding_window_analysis()
function also takes a window_size
argument. This defines the step size of the range()
function used to
generate the start
indices for the slicing. Note that in this case we
provide the window_size
argument with a default value of 10. This means
that the window_size
argument does not need to be explicitly set when
calling the function (if one is happy with the default).
On line 9, inside the for loop, we generate the end
index by adding the
window_size
to the start
index. This is followed by a check that the
generated end
index would not result in a list slice that spanned beyond
the end of the sequence.
On line 11, inside the if
statment there is a break
clause. The break
statement is used to break out of the loop immediately. In other words if the
end
variable is greater than the length of the sequence we break out of the
loop immediately.
At the end of the for loop we make use of the yield
keyword to pass on the
start
and end
indices as well as the value resulting from calling the
input function
with the sequence slice. This means that rather than
returning a value the sliding_window_analysis()
function returns an
iterator. As the name suggests an “iterator” is an object that one can iterate
over, for example using a for loop. Let us add some code to the script to
illustrate how one would use the sliding_window_analysis()
function in
practise.
20 21 22 | sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
for start, end, gc in sliding_window_analysis(sequence, gc_content):
print(start, end, gc)
|
Let us test the code again.
$ python gc_content.py
(0, 10, 40.0)
(10, 20, 30.0)
(20, 30, 70.0)
(30, 40, 50.0)
(40, 50, 0.0)
The current implementation of the sliding_window_analysis()
is very
dependent on the frame of reference as the window slides along. For example if
the window_size
argument was set to 3 one would obtain the analysis of the
first codon reading frame, but one would have no information about the second
and third codon reading frames. To overcome this one can perform sliding window
analysis with overlapping windows. Let us illustrate this visually by extracting
codons from a DNA sequence.
# Original sequence.
atcgctaaa
# Non overlapping windows.
atc
gct
aaa
# Overlapping windows.
atc
tcg
cgc
gct
cta
taa
aaa
To enable overlapping windows in our sliding_window_analysis()
function
we need to add a step_size
argument to it and make use of this in the call
to the range()
function.
1 2 3 4 5 6 7 8 9 10 11 12 | def sliding_window_analysis(sequence, function, window_size=10, step_size=5):
"""Return an iterator that yields (start, end, property) tuples.
Where start and end are the indices used to slice the input list
and property is the return value of the function given the sliced
list.
"""
for start in range(0, len(sequence), step_size):
end = start + window_size
if end > len(sequence):
break
yield start, end, function(sequence[start:end])
|
Let us run the script again to see what the output of this overlapping sliding window analysis is.
$ python gc_content.py
(0, 10, 40.0)
(5, 15, 40.0)
(10, 20, 30.0)
(15, 25, 40.0)
(20, 30, 70.0)
(25, 35, 100.0)
(30, 40, 50.0)
(35, 45, 0.0)
(40, 50, 0.0)
Note that the gc_content()
function is now applied to overlapping
segments of DNA. This allows us, for example, to note that the 25 to 35 region
has a GC-content of 100%, which is something that we did not manage to pick out
before.
It is time to start working on some real data. Let us download the genome of
Streptomyces coelicolor from the Sanger Centre ftp site. The
URL shortened using bitly point to the Sco.dna
file.
$ curl --location --output Sco.dna http://bit.ly/1Q8eKWT
$ head Sco.dna
SQ Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;
cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg 60
ggtgttgcgc tcccgctccg cgggagcgct ggcgggacgc tgcgcgtccc gctcaccaag 120
cccgcttcgc gggcttggtg acgctccgtc cgctgcgctt ccggagttgc ggggcttcgc 180
cccgctaacc ctgggcctcg cttcgctccg ccttgggcct gcggcgggtc cgctgcgctc 240
ccccgcctca agggcccttc cggctgcgcc tccaggaccc aaccgcttgc gcgggcctgg 300
ctggctacga ggatcggggg tcgctcgttc gtgtcgggtt ctagtgtagt ggctgcctca 360
gatagatgca gcatgtatcg ttggcagaaa tatgggacac ccgccagtca ctcgggaatc 420
tcccaagttt cgagaggatg gccagatgac cggtcaccac gaatctaccg gaccaggtac 480
cgcgctgagc agcgattcga cgtgccgggt gacgcagtat cagacggcgg gtgtgaacgc 540
In order to be able to process the genome of Streptomyces coelicolor we need
to be able to read in the Sco.dna
file. In Python reading and writing of files
is achieved using the built-in open()
function, which returns a file handle.
Before we start adding code to our script let us examine reading and writing of files
using in Python’s interactive mode. Let us open up the Sco.dna
file for reading.
>>> file_handle = open("Sco.dna", mode="r")
We can access the current position within the file using the tell()
method of
the file handle.
>>> file_handle.tell()
0
The integer zero indicates that we are at the beginning of the file.
To read in the entire content of the file as a single string of text one can use the
read()
method of the file handle.
>>> text = file_handle.read()
After having read in the content of the file the position of the file handle will point at the end of the file.
>>> file_handle.tell()
11701261
When one has finished working with a file handle it is important to remember to close it.
>>> file_handle.close()
Let us examine the text that we read in.
>>> type(text)
<type 'str'>
>>> len(text)
11701261
>>> text[:60]
'SQ Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1'
However, rather than reading in files as continuous strings one often want to process
files line by line. One can read in a file as a list of lines using the readlines()
method.
>>> file_handle = open("Sco.dna", "r")
>>> lines = file_handle.readlines()
>>> file_handle.close()
Let us examine the lines that we read in.
>>> type(lines)
<type 'list'>
>>> len(lines)
144461
>>> lines[0]
'SQ Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;\n'
A third way of accessing the content of a file handle is to simply treat it as
an iterator. This is possible as the Python file handles implement a method called
next()
that returns the next line in the file. When it reaches the end of the
file the next()
function raises a StopIteration
exception, which tells the
iterator to stop iterating.
Let’s see the workings of the next()
method in action.
>>> file_handle = open("Sco.dna", "r")
>>> file_handle.next()
'SQ Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;\n'
>>> file_handle.next()
' cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg 60\n'
We can go to the end of the file using the seek()
method of the file handle.
>>> file_handle.seek(11701261)
Let’s see what happens when we call the next()
method now.
>>> file_handle.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
As explained above this raises a StopIteration
exception. Now that we are done
with our experiment we must remember to close the file handle.
>>> file_handle.close()
In practise one tends to use file handles directly within for loops.
>>> num_lines = 0
>>> file_handle = open("Sco.dna", "r")
>>> for line in file_handle:
... num_lines = num_lines + 1
...
>>> print(num_lines)
144461
In the for loop above the file handle acts as an iterator, yielding the lines within the opened file.
Again we must not forget to close the file handle.
>>> file_handle.close()
Having to constantly remember to close file handles when one is done with them
can become tedious. Furthermore, forgetting to close file handles can have dire
consequences. To make life easier one can make use of Python’s built-in with
keyword.
The with
keywords works with context managers. A context
manager implements the so called “context manager protocol”. In the case of a
file handle this means that the file is opened when one enters into the context
of the with
statement and that the file is automatically closed when one
exits out of the context. All that is a fancy way of saying that we do not need
to worry about remembering to close files if we access file handles using the
syntax below.
>>> with open("Sco.dna", mode="r") as file_handle:
... text = file_handle.read()
...
Let us shift the focus to the writing of files. There are two modes for writing
files w
and a
. The former will overwrite any existing files with the
same name whereas the latter would append to them. Let us illustrate this with
an example.
>>> with open("tmp.txt", "w") as file_handle:
... file_handle.write("Original message")
...
>>> with open("tmp.txt", "r") as file_handle:
... print(file_handle.read())
...
Original message
>>> with open("tmp.txt", "a") as file_handle:
... file_handle.write(", with more info")
...
>>> with open("tmp.txt", "r") as file_handle:
... print(file_handle.read())
...
Original message, with more info
>>> with open("tmp.txt", "w") as file_handle:
... file_handle.write("scrap that...")
...
>>> with open("tmp.txt", "r") as file_handle:
... print(file_handle.read())
...
scrap that...
Armed with our new found knowledge of how to read and write files we will now
create a function for reading in the DNA sequence from the Sco.dna
file.
Let us create a function that reads in a genome as a string and returns the DNA sequence as a list. At this point we have a choice of what the input parameter should be. We could give the function the name of the file containing the genome or we could give the function a file handle of the genome. Personally, I prefer to create functions that accept file handles, because they are more generic. Sometimes the data to be read comes from sources other than a file on disk. However, as long as these behave as a file object one can still pass them to the function.
Let us have a look at the file containing the Streptomyces coelicolor genome.
$ head Sco.dna
SQ Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;
cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg 60
ggtgttgcgc tcccgctccg cgggagcgct ggcgggacgc tgcgcgtccc gctcaccaag 120
cccgcttcgc gggcttggtg acgctccgtc cgctgcgctt ccggagttgc ggggcttcgc 180
cccgctaacc ctgggcctcg cttcgctccg ccttgggcct gcggcgggtc cgctgcgctc 240
ccccgcctca agggcccttc cggctgcgcc tccaggaccc aaccgcttgc gcgggcctgg 300
ctggctacga ggatcggggg tcgctcgttc gtgtcgggtt ctagtgtagt ggctgcctca 360
gatagatgca gcatgtatcg ttggcagaaa tatgggacac ccgccagtca ctcgggaatc 420
tcccaagttt cgagaggatg gccagatgac cggtcaccac gaatctaccg gaccaggtac 480
cgcgctgagc agcgattcga cgtgccgggt gacgcagtat cagacggcgg gtgtgaacgc 540
From this we want a function that:
Extracting the relevant sequence can be achieved by noting that each sequence line consists of seven “words”, where a word is defined as a set of characters separated by one or more white spaces. The first six words correspond to sequence, whereas the last word is an index listing the number of nucleotide bases.
Let us implement such a function. Add the lines below to the top of the
gc_content.py
file.
1 2 3 4 5 6 7 8 9 10 | def parse_dna(file_handle):
"""Return DNA sequence as a list."""
first_line = file_handle.next() # Discard the first line.
sequence = []
for line in file_handle:
words = line.split()
seq_string = "".join(words[:-1])
seq_list = list(seq_string)
sequence.extend(seq_list)
return sequence
|
There are a couple of new string methods introduced in the above, let’s explain them now.
Let’s look at line six first.
6 | words = line.split()
|
Here we use the split()
method to split the string into
a list of words, by default the split()
method splits text based on one
or more white space characters.
On line seven we use the join()
method to join the words together.
7 | seq_string = "".join(words[:-1])
|
In this instance there are no characters separating the words to be joined. It
is worth clarifying this with an example, if we wanted to join the words using
a comma character one would use the syntax ",".join(words[:-1])
.
On line seven it is also worth noting that we exclude the last word (the
numerical index) by making use of list slicing words[:-1]
.
Finally, on line nine we make use of the list method extend()
, this
extends the existing sequence
list with all the elements from the
seq_list
list. Because words
seq_string
and seq_list
will be
overwritten when the loop moves on to the next line in the input file.
9 | sequence.extend(seq_list)
|
Now let us update the gc_content.py
script to initalise the sequence by
parsing the Sco.dna
file.
31 32 33 34 35 | with open("Sco.dna", "r") as file_handle:
sequence = parse_dna(file_handle)
for start, end, gc in sliding_window_analysis(sequence, gc_content):
print(start, end, gc)
|
Finally, let us change the default window_size
and step_size
values. In
the below I have split the function definition over two lines so as not to make
the line exceed 78 characters. Exceeding 78 characters is considered poor “style”
because it makes it difficult to read the code.
12 13 14 15 16 17 18 19 | def sliding_window_analysis(sequence, function,
window_size=100000, step_size=50000):
"""Return an iterator that yields (start, end, property) tuples.
Where start and end are the indices used to slice the input list
and property is the return value of the function given the sliced
list.
"""
|
Let us run the script again.
$ python gc_content.py
Note that this will produce a lot of output. To find out the number of lines
that are generated we can make use of piping and the wc -l
command
(mnemonic wc
word count, -l
lines) .
$ python gc_content.py | wc -l
173
The result makes sense as there are 8667507 base pairs in the sequence and we are stepping through it using a step size of 50000.
>>> 8667507 / 50000
173
Finally we will write out the analysis to a text file. Since this data is tabular we will use the CSV file format. Furthermore, since we will want to plot the data using ggplot2 in Data visualisation we will use a “tidy” data structure, see Tidy data for details.
Edit the end of the gc_content.py
script to make it look like the below.
32 33 34 35 36 37 38 39 40 41 | with open("Sco.dna", "r") as file_handle:
sequence = parse_dna(file_handle)
with open("local_gc_content.csv", "w") as file_handle:
header = "start,middle,end,gc_content\n"
file_handle.write(header)
for start, end, gc in sliding_window_analysis(sequence, gc_content):
middle = (start + end) / 2
row = "{},{},{},{}\n".format(start, middle, end, gc)
file_handle.write(row)
|
On line 35 we open a file handle to write to. On lines 36 and 37 we write a header to the CSV file. Lines 38 to 41 then performs the sliding window analysis and writes the results as rows, or lines if you prefer, to the CSV file. Line 39 calculates the middle of the local sequence by calculating the mean of the start and end positions.
The main new feature introduced in the code snippet above is on line 40 where
we use Python’s built in string formatting functionality. The matching curly
braces in the string are replaced with the content of the format()
string
method. Let us illustrate this with an example in interactive mode.
>>> print("{},{},{},{}")
{},{},{},{}
>>> print("{},{},{},{}".format(1, 5, 10, 38.5))
1,5,10,38.5
Okay, let us see what happens when we run the script.
$ python gc_content.py
This should have created a file named local_gc_content.csv
in the working directory.
$ ls
Sco.dna
gc_content.py
local_gc_content.csv
We can examine the top of this newly created file using the head
command.
$ head local_gc_content.csv
start,middle,end,gc_content
0,50000,100000,69.124
50000,100000,150000,70.419
100000,150000,200000,72.495
150000,200000,250000,71.707
200000,250000,300000,71.098
250000,300000,350000,72.102
300000,350000,400000,72.712
350000,400000,450000,73.15
400000,450000,500000,73.27
Great, everything seems to be working. Let’s start tracking our code using Git, see Keeping track of your work.
$ git init
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
Sco.dna
gc_content.py
local_gc_content.csv
nothing added to commit but untracked files present (use "git add" to track)
We have got three untracked files in our directory, the script, the input data and the output data. We don’t want to track the input and the output data so let’s create a .gitignore
file and add those files to it.
Sco.dna
local_gc_content.csv
Let’s check the status of our Git repository again.
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
gc_content.py
nothing added to commit but untracked files present (use "git add" to track)
Let’s start tracking the gc_content.py
and the .gitignore
files and take a snapshot of them in their current form.
$ git add gc_content.py .gitignore
$ git commit -m "Initial file import"
[master (root-commit) 6d8e0cf] Initial file import
2 files changed, 43 insertions(+)
create mode 100644 .gitignore
create mode 100644 gc_content.py
Well done! We have covered a lot of ground in this chapter. I suggest digging out some good music and chilling out for a bit.