Data analysis

Streptomyces coelicolor is a soil-dwelling bacterium with a complex life-cycle involving mycelial growth and sporulation. It is of particular interest in that it can produce a range of natural products of pharmaceutical relevance. In fact it is the major source of natural antibiotics.

In 2002 the genome of Streptomyces coelicolor A3(2), the model actinomycete organism, was sequenced. In this chapter we will do some basic bioinformatics on this genome to illustrate some fundamental concepts of computer programming.

One feature of interest when examining DNA is the guanine-cytosine (GC) content. DNA with high GC-content is more stable than DNA with low GC-content. The GC-content is defined as:

\[100 * \frac{G + C}{A + T + G + C}\]

To solve this problem we need to write code to be able to read in sequence data from a file, process the input to calculate the local GC-content and write out the results to another file. In the process the computational concepts of variables, functions and loops will be introduced.

In this chapter we will use the Python scripting language to perform our data analysis.

What is Python?

Python is a high-level scripting language that is growing in popularity in the scientific community. It uses a syntax that is relatively easy to get to grips with and which encourages code readability.

Using Python in interactive mode

To start off with we will make use of Python using its interactive mode, which means that we can type Python commands straight into the terminal. In fact when working with Python in interactive mode one can think of it as switching the terminal’s shell from Bash to Python.

To start Python in its interactive mode simply type python into the terminal.

$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Note that this prints out information about the version of Python that is being used and how it was compiled before leaving you at the interactive prompt. In this instance I am using Python version 2.7.10.

The three greater than signs (>>>) represent the primary prompt into which commands can be entered.

>>> 1 + 2
3

The secondary prompt is represented by three dots (...) and denotes a continuation line.

>>> line = ">sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R"
>>> if line.startswith(">"):
...     print(line)
...
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R

Variables

A variable is a means of storing a piece of information using using a descriptive name. The use of variables is encouraged as it allows us to avoid having to repeat ourselves, the DRY principle.

In Python variables are assigned using the equals sign.

>>> pi = 3.14

When naming variables being explicit is more important than being succinct. One reason for this is that you will spend more time reading your code than you will writing it. Avoiding the mental overhead of trying to understand what all the acronyms mean is a good thing. For example, suppose that we wanted to create a variable for storing the radius of a circle. Please avoid the temptation of naming the variable r, and go for the longer but more explicit name radius.

>>> radius = 1.5

Note

Many modern text editors have auto complete functionality so longer variable names does not necessarily need to mean that there is more typing required. Remember, spend time finding a text editor that works well for you!

Determining the GC count of a sequence

Suppose that we had a string representing a DNA sequence.

>>> dna_string = "attagcgcaatctaactacactactgccgcgcggcatatatttaaatata"
>>> print(dna_string)
attagcgcaatctaactacactactgccgcgcggcatatatttaaatata

A string is a data type for representing text. As such it is not ideal for data processing purposes. In this case the DNA sequence would be better represented using a list, with each item in the list representing a DNA letter.

In Python we can convert a string into a list using the built-in list() function.

>>> dna_list = list(dna_string)
>>> print(dna_list)  
['a', 't', 't', 'a', 'g', 'c', 'g', 'c', 'a', 'a', 't', 'c', 't', 'a', 'a',
 'c', 't', 'a', 'c', 'a', 'c', 't', 'a', 'c', 't', 'g', 'c', 'c', 'g', 'c',
 'g', 'c', 'g', 'g', 'c', 'a', 't', 'a', 't', 'a', 't', 't', 't', 'a', 'a',
 'a', 't', 'a', 't', 'a']

Python’s list class has got a method called count() that we can use to find out the counts of particular elements in the list.

>>> dna_list.count("a")
17

To find out the total number of items in a list one can use Python’s built-in len() function, which returns the length of the list.

>>> len(dna_list)
50

When using Python you need to be careful when dividing integers, because in Python 2 the default is to use integer division, i.e. to discard the remainder.

>>> 3 / 2
1

One can work around this by ensuring that at least one of the numbers is represented using floating point.

>>> 3 / 2.0
1.5

Avertissement

In Python 3, the behaviour of the division operator has been changed, and dividing two integers will result in normal division.

One can convert an integer to a floating point number using Python’s built-in float() function.

>>> float(2)
2.0

We now have all the information required to calculate the GC-content of the DNA sequence.

>>> gc_count = dna_list.count("g") + dna_list.count("c")
>>> gc_fraction = float(gc_count) / len(dna_list)
>>> 100 * gc_fraction
38.0

Creating reusable functions

Suppose that we wanted to calculate the GC-content for several sequences. In this case it would be very annoying, and error prone, to have to enter the commands above into the Python shell manually for each sequence. Rather, it would be advantageous to be able to create a piece of code that could be called repeatedly to calculate the GC-content. We can achieve this using the concept of functions. In other words functions are a means for programmers to avoid repeating themselves, thus adhering to the DRY principle.

Let us create a simple function that adds two items together.

>>> def add(a, b):
...     return a + b
...
>>> add(2, 3)
5

In Python functions are defined using the def keyword. Note that the def keyword is followed by the name of the function. The name of the function is followed by a parenthesized set of arguments, in this case the function takes two arguments a and b. The end of the function definition is marked using a colon.

The body of the function, in this example the return statement, needs to be indented. The standard in Python is to use four white spaces to indent code blocks. In this case the function body only contains one line of code. However, a function can include several indented lines of code.

Avertissement

Whitespace really matters in Python! If your code is not correctly aligned you will see IndentationError messages telling you that everything is not as it should be. You will also run into IndentationError messages if you mix white spaces and tabs.

Now we can create a function for calculating the GC-content of a sequence. As with variables explicit trumps succinct in terms of naming.

>>> def gc_content(sequence):
...     gc_count = sequence.count("g") + sequence.count("c")
...     gc_fraction = float(gc_count) / len(sequence)
...     return 100 * gc_fraction
...
>>> gc_content(dna_list)
38.0

List slicing

Suppose that we wanted to look at local variability in GC-content. To achieve this we would like to be able to select segments of our initial list. This is known as “slicing”, as in slicing up a salami.

In Python slicing uses a [start:end] syntax that is inclusive for the start index and exclusive for the end index. To illustrate slicing let us first create a list to work with.

>>> zero_to_five = ["zero", "one", "two", "three", "four", "five"]

To get the first two elements we therefore use 0 for the start index, as Python uses a zero-based indexing system, and 2 for the end index as the element from the end index is excluded.

>>> zero_to_five[0:2]
['zero', 'one']

Note that the start position for the slicing is 0 by default so we could just as well have written.

>>> zero_to_five[:2]
['zero', 'one']

To get the last three elements.

>>> zero_to_five[3:]
['three', 'four', 'five']

It is worth noting that we can use negative indices, where -1 represents the last element. So to get all elements except the first and the last, one could slice the list using the indices 1 and -1.

>>> zero_to_five[1:-1]
['one', 'two', 'three', 'four']

We can use list slicing to calculate the local GC-content measurements of our DNA.

>>> gc_content(dna_list[:10])
40.0
>>> gc_content(dna_list[10:20])
30.0
>>> gc_content(dna_list[20:30])
70.0
>>> gc_content(dna_list[30:40])
50.0
>>> gc_content(dna_list[40:50])
0.0

Loops

It can get a bit repetitive, tedious, and error prone specifying all the ranges manually. A better way to do this is to make use of a loop construct. A loop allows a program to cycle through the same set of operations a number of times.

In lower level languages while loops are common because they operate in a way that closely mimic how the hardware works. The code below illustrates a typical setup of a while loop.

>>> cycle = 0
>>> while cycle < 5:
...     print(cycle)
...     cycle = cycle + 1
...
0
1
2
3
4

In the code above Python moves through the commands in the while loop executing them in order, i.e. printing the value of the cycle variable and then incrementing it. The logic then moves back to the while statement and the conditional (cycle < 5) is re-evaluated. If true the commands in the while statment are executed in order again, and so forth until the conditional is false. In this example the print(cycle) command was called five times, i.e. until the cycle variable incremented to 5 and the cycle < 5 conditional evaluated to false.

However, when working in Python it is much more common to make use of for loops. For loops are used to iterate over elements in data structures such as lists.

>>> for item in [0, 1, 2, 3, 4]:
...     print(item)
...
0
1
2
3
4

In the above we had to manually write out all the numbers that we wanted. However, because iterating over a range of integers is such a common task Python has a built-in function for generating such lists.

>>> range(5)
[0, 1, 2, 3, 4]

So a typical for loop might look like the below.

>>> for item in range(5):
...     print(item)
...
0
1
2
3
4

The range() function can also be told to start at a larger number. Say for example that we wanted a list including the numbers 5, 6 and 7.

>>> range(5, 8)
[5, 6, 7]

As with slicing the start value is included whereas the end value is excluded.

It is also possible to alter the step size. To do this we must specify the start and end values explicitly before adding the step size.

>>> range(0, 50, 10)
[0, 10, 20, 30, 40]

We are now in a position where we can create a naive loop for for calculating the local GC-content of our DNA.

>>> for start in range(0, 50, 10):
...     end = start + 10
...     print(gc_content(dna_list[start:end]))
...
40.0
30.0
70.0
50.0
0.0

Loops are really powerful. They provide a means to iterate over lots of items and can be used to automate repetitive tasks.

Creating a sliding window GC-content function

So far we have been working with Python in interactive mode. This is a great way to explore what can be achieved with Python. It is also handy to simply use Python’s interactive mode as a command line calculator. However, it can get a little bit clunky when trying to write constructs that span several lines, such as functions.

Now we will examine how one can write a Python script as a text file and how to run that text file through the Python interpreter, i.e. how to run a Python script from the command line.

Start off by creating a new directory to work in.

$ mkdir S.coelicolor-local-GC-content
$ cd S.coelicolor-local-GC-content

Use your favourite text editor to enter the code below into a file named gc_content.py.

1
2
sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
print(sequence)

Note

If your text editor is not giving you syntax highlighting find out how it can be enabled. If your text editor does not support syntax highlighting find a better text editor!

Open up a terminal and go to the directory where you saved the gc_content.py script. Run the script using the command below.

$ python gc_content.py

You should see the output below printed to your terminal.

['a', 't', 't', 'a', 'g', 'c', 'g', 'c', 'a', 'a', 't', 'c', 't', 'a', 'a',
'c', 't', 'a', 'c', 'a', 'c', 't', 'a', 'c', 't', 'g', 'c', 'c', 'g', 'c',
'g', 'c', 'g', 'g', 'c', 'a', 't', 'a', 't', 'a', 't', 't', 't', 'a', 'a',
'a', 't', 'a', 't', 'a']

In the script we used Python’s built-in list() function to convert the DNA string into a list. We then printed out the sequence list.

Now let us add the gc_content() function to the script.

1
2
3
4
5
6
7
8
def gc_content(sequence):
    "Return GC-content as a percentage from a list of DNA letters."
    gc_count = sequence.count("g") + sequence.count("c")
    gc_fraction = float(gc_count) / len(sequence)
    return 100 * gc_fraction

sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
print(gc_content(sequence))

In the above the gc_content() function is implemented as per our exploration in our interactive session. The only difference is the addition of a, so called, “docstring” (documentation string) to the body of the function (line 2). The docstring is meant to document the purpose and usage of the function. Documenting the code in this way makes it easier for you, and others, to understand it.

Note that the script now prints out the GC-content rather than the sequence (line 8). Let us run the updated script from the command line.

$ python gc_content.py
38.0

The next piece of code will be a bit more complicated. However, note that that it represents the most complicated aspect of this chapter. So if you find it difficult, don’t give up, it gets easier again later on.

Now let us implement a new function for performing a sliding window analysis. Add the code below to the start of the gc_content.py file.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def sliding_window_analysis(sequence, function, window_size=10):
    """Return an iterator that yields (start, end, property) tuples.

    Where start and end are the indices used to slice the input list
    and property is the return value of the function given the sliced
    list.
    """
    for start in range(0, len(sequence), window_size):
        end = start + window_size
        if end > len(sequence):
            break
        yield start, end, function(sequence[start:end])

There is quite a lot going on in the code above so let us walk through it slowly. One of the first things to note is that the sliding_window_analysis() function takes another function as its second argument. Functions can be passed around just like variables and on line 12 the function is repeatedly called with slices of the input sequence.

The sliding_window_analysis() function also takes a window_size argument. This defines the step size of the range() function used to generate the start indices for the slicing. Note that in this case we provide the window_size argument with a default value of 10. This means that the window_size argument does not need to be explicitly set when calling the function (if one is happy with the default).

On line 9, inside the for loop, we generate the end index by adding the window_size to the start index. This is followed by a check that the generated end index would not result in a list slice that spanned beyond the end of the sequence.

On line 11, inside the if statment there is a break clause. The break statement is used to break out of the loop immediately. In other words if the end variable is greater than the length of the sequence we break out of the loop immediately.

At the end of the for loop we make use of the yield keyword to pass on the start and end indices as well as the value resulting from calling the input function with the sequence slice. This means that rather than returning a value the sliding_window_analysis() function returns an iterator. As the name suggests an “iterator” is an object that one can iterate over, for example using a for loop. Let us add some code to the script to illustrate how one would use the sliding_window_analysis() function in practise.

20
21
22
sequence = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
for start, end, gc in sliding_window_analysis(sequence, gc_content):
    print(start, end, gc)

Let us test the code again.

$ python gc_content.py
(0, 10, 40.0)
(10, 20, 30.0)
(20, 30, 70.0)
(30, 40, 50.0)
(40, 50, 0.0)

The current implementation of the sliding_window_analysis() is very dependent on the frame of reference as the window slides along. For example if the window_size argument was set to 3 one would obtain the analysis of the first codon reading frame, but one would have no information about the second and third codon reading frames. To overcome this one can perform sliding window analysis with overlapping windows. Let us illustrate this visually by extracting codons from a DNA sequence.

# Original sequence.
atcgctaaa

# Non overlapping windows.
atc
   gct
      aaa

# Overlapping windows.
atc
 tcg
  cgc
   gct
    cta
     taa
      aaa

To enable overlapping windows in our sliding_window_analysis() function we need to add a step_size argument to it and make use of this in the call to the range() function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def sliding_window_analysis(sequence, function, window_size=10, step_size=5):
    """Return an iterator that yields (start, end, property) tuples.

    Where start and end are the indices used to slice the input list
    and property is the return value of the function given the sliced
    list.
    """
    for start in range(0, len(sequence), step_size):
        end = start + window_size
        if end > len(sequence):
            break
        yield start, end, function(sequence[start:end])

Let us run the script again to see what the output of this overlapping sliding window analysis is.

$ python gc_content.py
(0, 10, 40.0)
(5, 15, 40.0)
(10, 20, 30.0)
(15, 25, 40.0)
(20, 30, 70.0)
(25, 35, 100.0)
(30, 40, 50.0)
(35, 45, 0.0)
(40, 50, 0.0)

Note that the gc_content() function is now applied to overlapping segments of DNA. This allows us, for example, to note that the 25 to 35 region has a GC-content of 100%, which is something that we did not manage to pick out before.

Downloading the genome

It is time to start working on some real data. Let us download the genome of Streptomyces coelicolor from the Sanger Centre ftp site. The URL shortened using bitly point to the Sco.dna file.

$ curl --location --output Sco.dna http://bit.ly/1Q8eKWT
$ head Sco.dna
SQ   Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;
     cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg        60
     ggtgttgcgc tcccgctccg cgggagcgct ggcgggacgc tgcgcgtccc gctcaccaag       120
     cccgcttcgc gggcttggtg acgctccgtc cgctgcgctt ccggagttgc ggggcttcgc       180
     cccgctaacc ctgggcctcg cttcgctccg ccttgggcct gcggcgggtc cgctgcgctc       240
     ccccgcctca agggcccttc cggctgcgcc tccaggaccc aaccgcttgc gcgggcctgg       300
     ctggctacga ggatcggggg tcgctcgttc gtgtcgggtt ctagtgtagt ggctgcctca       360
     gatagatgca gcatgtatcg ttggcagaaa tatgggacac ccgccagtca ctcgggaatc       420
     tcccaagttt cgagaggatg gccagatgac cggtcaccac gaatctaccg gaccaggtac       480
     cgcgctgagc agcgattcga cgtgccgggt gacgcagtat cagacggcgg gtgtgaacgc       540

Reading and writing files

In order to be able to process the genome of Streptomyces coelicolor we need to be able to read in the Sco.dna file. In Python reading and writing of files is achieved using the built-in open() function, which returns a file handle.

Before we start adding code to our script let us examine reading and writing of files using in Python’s interactive mode. Let us open up the Sco.dna file for reading.

>>> file_handle = open("Sco.dna", mode="r")

We can access the current position within the file using the tell() method of the file handle.

>>> file_handle.tell()
0

The integer zero indicates that we are at the beginning of the file.

To read in the entire content of the file as a single string of text one can use the read() method of the file handle.

>>> text = file_handle.read()

After having read in the content of the file the position of the file handle will point at the end of the file.

>>> file_handle.tell()
11701261

When one has finished working with a file handle it is important to remember to close it.

>>> file_handle.close()

Let us examine the text that we read in.

>>> type(text)
<type 'str'>
>>> len(text)
11701261
>>> text[:60]
'SQ   Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1'

However, rather than reading in files as continuous strings one often want to process files line by line. One can read in a file as a list of lines using the readlines() method.

>>> file_handle = open("Sco.dna", "r")
>>> lines = file_handle.readlines()
>>> file_handle.close()

Let us examine the lines that we read in.

>>> type(lines)
<type 'list'>
>>> len(lines)
144461
>>> lines[0]
'SQ   Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;\n'

A third way of accessing the content of a file handle is to simply treat it as an iterator. This is possible as the Python file handles implement a method called next() that returns the next line in the file. When it reaches the end of the file the next() function raises a StopIteration exception, which tells the iterator to stop iterating.

Let’s see the workings of the next() method in action.

>>> file_handle = open("Sco.dna", "r")
>>> file_handle.next()
'SQ   Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;\n'
>>> file_handle.next()
'     cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg        60\n'

We can go to the end of the file using the seek() method of the file handle.

>>> file_handle.seek(11701261)

Let’s see what happens when we call the next() method now.

>>> file_handle.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

As explained above this raises a StopIteration exception. Now that we are done with our experiment we must remember to close the file handle.

>>> file_handle.close()

In practise one tends to use file handles directly within for loops.

>>> num_lines = 0
>>> file_handle = open("Sco.dna", "r")
>>> for line in file_handle:
...     num_lines = num_lines + 1
...
>>> print(num_lines)
144461

In the for loop above the file handle acts as an iterator, yielding the lines within the opened file.

Again we must not forget to close the file handle.

>>> file_handle.close()

Having to constantly remember to close file handles when one is done with them can become tedious. Furthermore, forgetting to close file handles can have dire consequences. To make life easier one can make use of Python’s built-in with keyword.

The with keywords works with context managers. A context manager implements the so called “context manager protocol”. In the case of a file handle this means that the file is opened when one enters into the context of the with statement and that the file is automatically closed when one exits out of the context. All that is a fancy way of saying that we do not need to worry about remembering to close files if we access file handles using the syntax below.

>>> with open("Sco.dna", mode="r") as file_handle:
...     text = file_handle.read()
...

Let us shift the focus to the writing of files. There are two modes for writing files w and a. The former will overwrite any existing files with the same name whereas the latter would append to them. Let us illustrate this with an example.

>>> with open("tmp.txt", "w") as file_handle:
...     file_handle.write("Original message")
...
>>> with open("tmp.txt", "r") as file_handle:
...     print(file_handle.read())
...
Original message
>>> with open("tmp.txt", "a") as file_handle:
...     file_handle.write(", with more info")
...
>>> with open("tmp.txt", "r") as file_handle:
...     print(file_handle.read())
...
Original message, with more info
>>> with open("tmp.txt", "w") as file_handle:
...     file_handle.write("scrap that...")
...
>>> with open("tmp.txt", "r") as file_handle:
...     print(file_handle.read())
...
scrap that...

Armed with our new found knowledge of how to read and write files we will now create a function for reading in the DNA sequence from the Sco.dna file.

Creating a function for reading in the Streptomyces sequence

Let us create a function that reads in a genome as a string and returns the DNA sequence as a list. At this point we have a choice of what the input parameter should be. We could give the function the name of the file containing the genome or we could give the function a file handle of the genome. Personally, I prefer to create functions that accept file handles, because they are more generic. Sometimes the data to be read comes from sources other than a file on disk. However, as long as these behave as a file object one can still pass them to the function.

Let us have a look at the file containing the Streptomyces coelicolor genome.

$ head Sco.dna
SQ   Sequence 8667507 BP; 1203558 A; 3121252 C; 3129638 G; 1213059 T; 0 other;
     cccgcggagc gggtaccaca tcgctgcgcg atgtgcgagc gaacacccgg gctgcgcccg        60
     ggtgttgcgc tcccgctccg cgggagcgct ggcgggacgc tgcgcgtccc gctcaccaag       120
     cccgcttcgc gggcttggtg acgctccgtc cgctgcgctt ccggagttgc ggggcttcgc       180
     cccgctaacc ctgggcctcg cttcgctccg ccttgggcct gcggcgggtc cgctgcgctc       240
     ccccgcctca agggcccttc cggctgcgcc tccaggaccc aaccgcttgc gcgggcctgg       300
     ctggctacga ggatcggggg tcgctcgttc gtgtcgggtt ctagtgtagt ggctgcctca       360
     gatagatgca gcatgtatcg ttggcagaaa tatgggacac ccgccagtca ctcgggaatc       420
     tcccaagttt cgagaggatg gccagatgac cggtcaccac gaatctaccg gaccaggtac       480
     cgcgctgagc agcgattcga cgtgccgggt gacgcagtat cagacggcgg gtgtgaacgc       540

From this we want a function that:

  1. Discards the first line, as it does not contain any sequence
  2. Iterates over all subsequent lines extracting the relevant sequence from them

Extracting the relevant sequence can be achieved by noting that each sequence line consists of seven “words”, where a word is defined as a set of characters separated by one or more white spaces. The first six words correspond to sequence, whereas the last word is an index listing the number of nucleotide bases.

Let us implement such a function. Add the lines below to the top of the gc_content.py file.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def parse_dna(file_handle):
    """Return DNA sequence as a list."""
    first_line = file_handle.next()  # Discard the first line.
    sequence = []
    for line in file_handle:
        words = line.split()
        seq_string = "".join(words[:-1])
        seq_list = list(seq_string)
        sequence.extend(seq_list)
    return sequence

There are a couple of new string methods introduced in the above, let’s explain them now.

Let’s look at line six first.

6
        words = line.split()

Here we use the split() method to split the string into a list of words, by default the split() method splits text based on one or more white space characters.

On line seven we use the join() method to join the words together.

7
        seq_string = "".join(words[:-1])

In this instance there are no characters separating the words to be joined. It is worth clarifying this with an example, if we wanted to join the words using a comma character one would use the syntax ",".join(words[:-1]).

On line seven it is also worth noting that we exclude the last word (the numerical index) by making use of list slicing words[:-1].

Finally, on line nine we make use of the list method extend(), this extends the existing sequence list with all the elements from the seq_list list. Because words seq_string and seq_list will be overwritten when the loop moves on to the next line in the input file.

9
        sequence.extend(seq_list)

Now let us update the gc_content.py script to initalise the sequence by parsing the Sco.dna file.

31
32
33
34
35
with open("Sco.dna", "r") as file_handle:
    sequence = parse_dna(file_handle)

for start, end, gc in sliding_window_analysis(sequence, gc_content):
    print(start, end, gc)

Finally, let us change the default window_size and step_size values. In the below I have split the function definition over two lines so as not to make the line exceed 78 characters. Exceeding 78 characters is considered poor “style” because it makes it difficult to read the code.

12
13
14
15
16
17
18
19
def sliding_window_analysis(sequence, function,
                            window_size=100000, step_size=50000):
    """Return an iterator that yields (start, end, property) tuples.

    Where start and end are the indices used to slice the input list
    and property is the return value of the function given the sliced
    list.
    """

Let us run the script again.

$ python gc_content.py

Note that this will produce a lot of output. To find out the number of lines that are generated we can make use of piping and the wc -l command (mnemonic wc word count, -l lines) .

$ python gc_content.py | wc -l
    173

The result makes sense as there are 8667507 base pairs in the sequence and we are stepping through it using a step size of 50000.

>>> 8667507 / 50000
173

Writing out the sliding window analysis

Finally we will write out the analysis to a text file. Since this data is tabular we will use the CSV file format. Furthermore, since we will want to plot the data using ggplot2 in Data visualisation we will use a “tidy” data structure, see Tidy data for details.

Edit the end of the gc_content.py script to make it look like the below.

32
33
34
35
36
37
38
39
40
41
with open("Sco.dna", "r") as file_handle:
    sequence = parse_dna(file_handle)

with open("local_gc_content.csv", "w") as file_handle:
    header = "start,middle,end,gc_content\n"
    file_handle.write(header)
    for start, end, gc in sliding_window_analysis(sequence, gc_content):
        middle = (start + end) / 2
        row = "{},{},{},{}\n".format(start, middle, end, gc)
        file_handle.write(row)

On line 35 we open a file handle to write to. On lines 36 and 37 we write a header to the CSV file. Lines 38 to 41 then performs the sliding window analysis and writes the results as rows, or lines if you prefer, to the CSV file. Line 39 calculates the middle of the local sequence by calculating the mean of the start and end positions.

The main new feature introduced in the code snippet above is on line 40 where we use Python’s built in string formatting functionality. The matching curly braces in the string are replaced with the content of the format() string method. Let us illustrate this with an example in interactive mode.

>>> print("{},{},{},{}")
{},{},{},{}
>>> print("{},{},{},{}".format(1, 5, 10, 38.5))
1,5,10,38.5

Okay, let us see what happens when we run the script.

$ python gc_content.py

This should have created a file named local_gc_content.csv in the working directory.

$ ls
Sco.dna
gc_content.py
local_gc_content.csv

We can examine the top of this newly created file using the head command.

$ head local_gc_content.csv
start,middle,end,gc_content
0,50000,100000,69.124
50000,100000,150000,70.419
100000,150000,200000,72.495
150000,200000,250000,71.707
200000,250000,300000,71.098
250000,300000,350000,72.102
300000,350000,400000,72.712
350000,400000,450000,73.15
400000,450000,500000,73.27

Great, everything seems to be working. Let’s start tracking our code using Git, see Keeping track of your work.

$ git init
$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        Sco.dna
        gc_content.py
        local_gc_content.csv

nothing added to commit but untracked files present (use "git add" to track)

We have got three untracked files in our directory, the script, the input data and the output data. We don’t want to track the input and the output data so let’s create a .gitignore file and add those files to it.

Sco.dna
local_gc_content.csv

Let’s check the status of our Git repository again.

$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        .gitignore
        gc_content.py

nothing added to commit but untracked files present (use "git add" to track)

Let’s start tracking the gc_content.py and the .gitignore files and take a snapshot of them in their current form.

$ git add gc_content.py .gitignore
$ git commit -m "Initial file import"
[master (root-commit) 6d8e0cf] Initial file import
 2 files changed, 43 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 gc_content.py

Well done! We have covered a lot of ground in this chapter. I suggest digging out some good music and chilling out for a bit.

Key concepts

  • Python is a powerful scripting language that is popular in the scientific community
  • You can explore Python’s syntax using its interactive mode
  • Variables and functions help us avoid having to repeat ourselves, the DRY principle
  • When naming variables and functions explicit trumps succinct
  • Loops are really powerful, they form the basis of automating repetitive tasks
  • Files are accessed using file handles
  • A file handle is a data structure that handles the book keeping of the position within the file and the mode in which it was opened
  • The mode of the file handle determines how you will interact with the file
  • Read mode only allows reading of a file
  • Append mode will keep the existing content of a file and append to it
  • Write mode will delete any previous content before writing to it