Author: Thomas Robitaille
Solved by: Marta Reina Campos
Total points: 30
Due: Wednesday 12th October 7pm CEST
Format: IPython Notebook or python program
The number of points in this problem sheet is not directly proportional to the difficulty. In fact, Part 3 is more difficult than Parts 1 and 2, but is worth fewer points relative to the amount of work/code, and is there for people who enjoy challenges. You can therefore choose not to complete Part 3, and you should still be able to get more than 60% of the points from the previous questions if you answer them correctly.
This problem set is about DNA data and methods typically applied to this kind of data. As a reminder, DNA is a molecule that encodes genetic instructions for living organisms. DNA is typically found as a double-stranded helix, in which each strand corresponds to a sequence of nucleotides. Each nucleotide consists of a nucleobase (guanine, adenine, thymine, or cytosine) attached to sugars, which are in turn separated from each other by phosphate groups:
(image from Wikipedia)
The nucleobases are commonly referred to with the letters G (guanine), A (adenine), T (thymine), and C (cytosine). The nucleobases form pairs between the two strands: G pairs with C, and T pairs with A.
DNA is therefore most commonly represented as a sequence of nucleobases, such as GATTACACCTCATTATAAA.
For more information about DNA, see the Wikipedia page in German or English.
In this problem set, we will be working with fake genetic data, since real-life data often has added complications, but the functions and techniques used here are the same as, or very similar to, the techniques used on real data.
The reverse complement of DNA is found by reversing the DNA sequence, then replacing each base by its complement (A is replaced by T, T is replaced by A, G is replaced by C, and C is replaced by G). For example, the reverse complement of ATGCGGC is GCCGCAT.
Write a Python function reverse_complement that takes a DNA sequence as a string, and returns the reverse complement. Test your function by ensuring that reverse_complement('ATGCGGC') is 'GCCGCAT', then find the reverse complement of the following sequence:
ATGCGCGGATCGTACCTAATCGATGGCATTAGCCGAGCCCGATTACGC
Print the result out using print.
Note that this function is NOT needed for the remaining questions below.
# function to give the reverse complement of a DNA sequence
def reverse_complement(dna_sequence):
    dna_dictionary = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}    # dictionary with the pairs of nucleobases
    inversed_sequence = dna_sequence[::-1]                   # 1 - Reverse the DNA sequence
    reversed_dna = ""
    for nucleobase in inversed_sequence:                     # 2 - Replace each nucleobase by its complement
        reversed_dna = reversed_dna + dna_dictionary[nucleobase]
    return reversed_dna
# input DNA sequences
dna_1 = "ATGCGGC"
dna_2 = "ATGCGCGGATCGTACCTAATCGATGGCATTAGCCGAGCCCGATTACGC"
# function calls to determine the reverse complement
reverse_dna_1 = reverse_complement(dna_1)
reverse_dna_2 = reverse_complement(dna_2)
# output reverse complements
print("Original DNA sequence: ", dna_1)
print("And its reverse complement: ", reverse_dna_1)
print("Original DNA sequence: ", dna_2)
print("And its reverse complement: ", reverse_dna_2)
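As a more compact alternative to the loop above, the complement step can be delegated to Python's built-in str.maketrans and str.translate. This is only a sketch: the names COMPLEMENT_TABLE and reverse_complement_translate are made up for illustration, and the code assumes the input contains only the upper-case letters A, T, G and C (any other character is passed through unchanged).

```python
# Hypothetical alternative to the loop-based reverse_complement above.
# Assumption: the sequence only contains the upper-case letters A, T, G, C.
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")


def reverse_complement_translate(dna_sequence):
    """Complement every base with a translation table, then reverse the result."""
    return dna_sequence.translate(COMPLEMENT_TABLE)[::-1]


print(reverse_complement_translate("ATGCGGC"))  # prints GCCGCAT
```

Applying the function twice gives back the original sequence, which is a quick sanity check for any reverse-complement implementation.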
Ribonucleic acid (RNA) is a family of large biological molecules that is transcribed from DNA by the RNA Polymerase enzyme. It consists of a single strand of nucleotides that are identical to the ones found in DNA, with the exception of uracil (U), which replaces thymine (T).
Messenger RNA molecules (or mRNA) are a subset of RNA molecules that are used to pass information from DNA to ribosomes, which then translate the mRNA to protein sequences.
Write a Python function dna_to_mrna that takes a DNA sequence and returns the corresponding mRNA sequence. For example, the DNA sequence ATCGCGAT should produce the mRNA sequence AUCGCGAU (note that you do not need to find the reverse complement of the DNA here).
# function to transcribe a DNA sequence into a mRNA molecule
def dna_to_mnra(sequence_dna):
    dna_to_mnra_dictionary = {'A':'A', 'T':'U', 'G':'G', 'C':'C'}    # dictionary to convert from DNA to mRNA
    sequence_mnra = ""
    for nucleobase in sequence_dna:                                  # the thymine nucleobase is replaced by uracil
        sequence_mnra = sequence_mnra + dna_to_mnra_dictionary[nucleobase]
    return sequence_mnra
dna_to_mrna = dna_to_mnra    # alias under the name requested in the question
# input - DNA sequence
dna_sequence = "ATCGCGAT"
# function call to transcribe a DNA sequence into a mRNA sequence
mnra_sequence = dna_to_mnra(dna_sequence)
# output the transcribed mRNA sequence
print("Original DNA sequence: ", dna_sequence)
print("mRNA sequence: ", mnra_sequence)
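Since only thymine changes during this step, the same result can be obtained with a single call to str.replace. The sketch below uses the hypothetical name dna_to_mrna_replace and assumes an upper-case DNA string; unlike the dictionary version, it does not raise an error on unexpected characters.

```python
# Hypothetical one-line variant of the DNA -> mRNA conversion.
# Assumption: the input is an upper-case DNA string.
def dna_to_mrna_replace(sequence_dna):
    """Replace every thymine (T) by uracil (U); other characters are kept."""
    return sequence_dna.replace("T", "U")


print(dna_to_mrna_replace("ATCGCGAT"))  # prints AUCGCGAU
```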
When the mRNA is translated to a protein sequence, each set of three nucleotides, called a codon, is translated into a single amino acid. For example, the codon UUC translates to the amino acid Phenylalanine. Each amino acid can be represented by a single letter; for example, Phenylalanine is represented by the letter F. A protein, which is formed from a sequence of amino acids, can therefore be written as a sequence of letters in the same way as DNA or mRNA, but using more of the letters of the alphabet since there are more than four amino acids.
The data/problem_1_codons.txt file contains two columns. The first column gives a list of codons, and the second column gives the corresponding amino acid (represented by a single letter). Certain codons do not correspond to an amino acid, but instead indicate that the amino acid sequence is finished. These are indicated by Stop.
Write a function mrna_to_protein that takes an mRNA sequence (as a string) and returns the sequence of amino acids (as a string), stopping the first time a Stop codon is encountered. Make sure that the file is only read once when running the script (and not every time you want to translate a codon). You will likely need to use a Python dictionary to help.
Finally, write a function dna_to_protein that takes a DNA sequence (as a string) and returns the sequence of amino acids (as a string), making use of the functions that you wrote previously.
Print out the amino acid sequence for the following DNA sequence:
AATCTCTACGGAAGTAGGTCAGTACTGATCGATCAGTCGATCGGGCGGCGATTTCGATCTGATTGTACGGCGGGCTAG
# function to translate a mRNA sequence into a protein
def mnra_to_protein(sequence_mnra, codons_to_aminoacids_dictionary):
    codons = []                                               # empty list to store the codons
    for i in range(len(sequence_mnra)//3):                    # the mRNA sequence is split every three nucleobases
        codons.append(sequence_mnra[i*3:(i+1)*3])             # which are stored as codons
    protein = ""                                              # empty string to store the amino acids
    for codon in codons:                                      # every codon is translated into an amino acid and stored
        if codons_to_aminoacids_dictionary[codon] != "Stop":  # in the protein (unless its value is 'Stop')
            protein = protein + codons_to_aminoacids_dictionary[codon]
        else:
            break
    return protein
# create a dictionary containing the information in the file
codons_to_aminoacids_dictionary = {}
# read the input file containing the codon-amino acid conversion (only once)
with open("PS01-data/problem_1_codons.txt", 'r') as file_codons:
    for line in file_codons:
        columns = line.split()                                # the line is split at the blanks into the two columns
        if len(columns) >= 2:                                 # blank or incomplete lines are skipped
            codons_to_aminoacids_dictionary[columns[0]] = columns[1]
# DNA sequence to convert
dna_sequence = "AATCTCTACGGAAGTAGGTCAGTACTGATCGATCAGTCGATCGGGCGGCGATTTCGATCTGATTGTACGGCGGGCTAG"
# convert DNA into a mRNA sequence
mnra_sequence = dna_to_mnra(dna_sequence)
# convert mRNA sequence into protein
protein = mnra_to_protein(mnra_sequence, codons_to_aminoacids_dictionary)
# output the protein obtained from the original DNA sequence
print("Original DNA sequence: ", dna_sequence)
print("mRNA sequence: ", mnra_sequence)
print("Protein sequenced: ", protein)
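The question also asks for a dna_to_protein function that chains the two steps. A minimal, self-contained sketch is given below. It makes some assumptions: EXAMPLE_CODONS is a small hand-written subset of the standard codon table standing in for the contents of problem_1_codons.txt, the codon table is passed explicitly as an argument, and trailing nucleotides that do not fill a complete codon are ignored.

```python
# Assumption: a tiny hand-written subset of the standard codon table, used here
# instead of reading problem_1_codons.txt so that the example runs on its own.
EXAMPLE_CODONS = {"AUG": "M", "UUC": "F", "GGC": "G", "AAA": "K", "UAA": "Stop"}


def dna_to_mrna(sequence_dna):
    """Replace thymine (T) by uracil (U)."""
    return sequence_dna.replace("T", "U")


def mrna_to_protein(sequence_mrna, codon_table):
    """Translate codon by codon, stopping at the first Stop codon."""
    amino_acids = []
    # Step through complete codons only; one or two leftover bases are ignored.
    for start in range(0, len(sequence_mrna) - 2, 3):
        amino_acid = codon_table[sequence_mrna[start:start + 3]]
        if amino_acid == "Stop":
            break
        amino_acids.append(amino_acid)
    return "".join(amino_acids)


def dna_to_protein(sequence_dna, codon_table):
    """Chain the two steps: DNA -> mRNA -> protein."""
    return mrna_to_protein(dna_to_mrna(sequence_dna), codon_table)


print(dna_to_protein("ATGTTCGGCTAAAAA", EXAMPLE_CODONS))  # prints MFG
```

In the example, the fourth codon (UAA) is a Stop codon, so the final AAA is never translated.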
In the previous questions, we have been specifying the DNA sequence by hand, but DNA sequences are usually long and are stored in files. A common file format is the FASTA format which looks similar to this:
>label1
ACTGTATCGATGCTAGCTACGTAGCTAGCTAGCTAGCTGACGTA
ACGATGTGCGAGGGTCATGGGACGCGAGCGAGTCTAGCACGATC
>label2
ACTGGGCTTGACTACGGCGGTATCTGACGGGCGAGCTGTACGAG
ACGGACTAGGGCGCGGCGGGGCGGATTTTCGAGTCGAGCGTTAT
The first line starts with a > which is immediately followed by a label (which might be the name of the gene for example). The sequence then starts on the second line, and may continue on several lines. It is common to limit the length of each line to 80, but this may vary from file to file. The sequence stops once either the file ends, or a line starts with >, which indicates that a new sequence is being given. There may be any number of sequences in a file.
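The rules above can be sketched on an in-memory copy of the example, without touching any file. The helper name parse_fasta_lines and the constant EXAMPLE_FASTA are hypothetical; the function accepts any iterable of lines (an open file would work as well) and ignores blank lines and any text that appears before the first label.

```python
import io

# The example shown above, kept in memory so that the sketch needs no input file.
EXAMPLE_FASTA = """>label1
ACTGTATCGATGCTAGCTACGTAGCTAGCTAGCTAGCTGACGTA
ACGATGTGCGAGGGTCATGGGACGCGAGCGAGTCTAGCACGATC
>label2
ACTGGGCTTGACTACGGCGGTATCTGACGGGCGAGCTGTACGAG
ACGGACTAGGGCGCGGCGGGGCGGATTTTCGAGTCGAGCGTTAT
"""


def parse_fasta_lines(lines):
    """Return a {label: sequence} dictionary from an iterable of FASTA lines."""
    sequences = {}
    label = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            label = line[1:]              # a new record starts here
            sequences[label] = ""
        elif label is not None:
            # join continuation lines, dropping any spaces inside the sequence
            sequences[label] += line.replace(" ", "")
    return sequences


example = parse_fasta_lines(io.StringIO(EXAMPLE_FASTA))
for label, sequence in example.items():
    print(label, sequence)
```

Because every record is stored in the dictionary as soon as its label is seen, the last sequence in the file is kept even though no further > line follows it.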
Write a function read_fasta, that takes the name of a file (as a string) and returns a Python dictionary containing all the sequences from the file, with the keys in the dictionary corresponding to the label. If a sequence is given over several lines, you should remove any line returns and spaces. You should then be able to access the DNA for label1 with d['label1'] for example (if d is the name of the dictionary).
Use this function and the functions you have written above to read in the data/problem_1_question_4.fasta file and print out, for each sequence, the label, followed by the amino acid sequence (not the DNA sequence!).
# function to read a FASTA file and store the DNA sequences in a dictionary
def read_fasta(name_input_file):
    # create a dictionary containing the information in the file
    genes_dictionary = {}
    label = None
    # read input file containing the DNA sequences
    with open("PS01-data/"+name_input_file, 'r') as file_fasta:
        for line in file_fasta:
            line = line.strip()                         # the final "\n" is removed
            if line == "":                              # blank lines are skipped
                continue
            if line[0] == '>':                          # line with a label: start a new, empty sequence
                label = line[1:]
                genes_dictionary[label] = ""
            elif label is not None:                     # append DNA (without spaces) until the sequence is over
                genes_dictionary[label] += line.replace(" ", "")
    return genes_dictionary                             # the last sequence in the file is included as well
# name of the input file to look at
name_input_file = "problem_1_question_4.fasta"
# function call to read the input file and store the values in a dictionary
genes_dictionary = read_fasta(name_input_file)
# for each gene in the dictionary
for label in genes_dictionary.keys():
    dna_sequence = genes_dictionary[label]                                       # DNA sequence from the genes dictionary
    mnra_sequence = dna_to_mnra(dna_sequence)                                    # transcribed mRNA sequence
    protein = mnra_to_protein(mnra_sequence, codons_to_aminoacids_dictionary)    # translated amino acid sequence (protein)
    print("Label: ", label, "and the aminoacid sequence: ", protein)
Given several sequences with the same length, but which may include point mutations (i.e. individual nucleotides are changed), we want to try and find the most likely original sequence. For example, if we have the following sequences:
sequence 1: A C T C T
sequence 2: A C T C G
sequence 3: G C C C T
sequence 4: A C T C T
sequence 5: A T G C T
we can go through each position and find the most common nucleotide. To do this, we can first construct a matrix that looks like:
A: 4 0 0 0 0
C: 0 4 1 5 0
G: 1 0 1 0 1
T: 0 1 3 0 4
which indicates how many nucleotides of each type are found at each position, and from this we can see that the most common first base is A, the most common second base is C, and so on. The most common sequence is then ACTCT. This is the consensus sequence.
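The counting procedure described above can be sketched with collections.Counter, using the five example sequences. The function names count_matrix and consensus_from_list are hypothetical, and the sketch takes a plain list of equal-length sequences rather than a dictionary. When two bases are equally common at a position, this version keeps whichever of them appears first in that column, which is an arbitrary choice.

```python
from collections import Counter

# The five example sequences from the text.
example_sequences = ["ACTCT", "ACTCG", "GCCCT", "ACTCT", "ATGCT"]


def count_matrix(sequences):
    """Count how often each base occurs at each position."""
    # zip(*sequences) yields one tuple of bases per position (column).
    columns = [Counter(column) for column in zip(*sequences)]
    return {base: [column[base] for column in columns] for base in "ACGT"}


def consensus_from_list(sequences):
    """Pick the most common base in every column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


matrix = count_matrix(example_sequences)
for base in "ACGT":
    print(base + ":", *matrix[base])
print(consensus_from_list(example_sequences))  # prints ACTCT
```

The printed rows reproduce the matrix shown above.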
Write a function consensus_sequence that takes a dictionary of sequences (such as the one returned by read_fasta), and then returns the consensus sequence. Read the data/problem_1_question_5.fasta file using the function you wrote previously, and print out the corresponding consensus sequence. All the sequences in this file are the same length.
Note that the selection is based on the nucleotides!
# function to find the consensus sequence
def consensus_sequence(sequences_dictionary):
    # construct an empty matrix
    length_each_gene_sequence = len(list(sequences_dictionary.values())[0])
    nucleotides_matrix = {'A':[0]*length_each_gene_sequence,
                          'T':[0]*length_each_gene_sequence,
                          'G':[0]*length_each_gene_sequence,
                          'C':[0]*length_each_gene_sequence}
    # find the frequency of each nucleotide at each position
    for label in sequences_dictionary.keys():
        gene_sequence = sequences_dictionary[label]
        for i in range(len(gene_sequence)):
            if gene_sequence[i] == 'A':
                nucleotides_matrix['A'][i] += 1
            elif gene_sequence[i] == 'T':
                nucleotides_matrix['T'][i] += 1
            elif gene_sequence[i] == 'G':
                nucleotides_matrix['G'][i] += 1
            elif gene_sequence[i] == 'C':
                nucleotides_matrix['C'][i] += 1
    # find the consensus sequence
    the_consensus_sequence = ""
    for index in range(length_each_gene_sequence):
        maximum_frequency = max(nucleotides_matrix['A'][index],    # for each position, find the maximum
                                nucleotides_matrix['T'][index],    # number of repetitions
                                nucleotides_matrix['G'][index],
                                nucleotides_matrix['C'][index])
        for label in nucleotides_matrix.keys():
            if maximum_frequency == nucleotides_matrix[label][index]:      # and to which nucleotide it corresponds
                the_consensus_sequence = the_consensus_sequence + label    # and add it to the consensus sequence
                break                                                      # only one nucleotide per position, even in a tie
    return the_consensus_sequence
# input file containing all the sequences
name_input_file = "problem_1_question_5.fasta"
# sequences stored in a dictionary
sequences_dictionary = read_fasta(name_input_file)
# consensus sequence found
the_consensus_sequence = consensus_sequence(sequences_dictionary)
print("The consensus sequence is: ", the_consensus_sequence)
In some cases, it is useful to be able to identify the longest common sub-sequence between two sequences. For example, in the sequences ACTGCT and TGCCCT, the longest common sub-sequence is TGC (positions 3 to 5 of ACTGCT and positions 1 to 3 of TGCCCT). Note that these do not have to be at the same position in each sequence.
Write a function, longest_common_sequence, that takes a dictionary of sequences (such as the one returned by read_fasta) and returns the longest common sub-sequence found in all the sequences.
Read the data/problem_1_question_6.fasta file, and print out the longest common sequence between all the sequences.
# function to find the longest common sequence in a set of string sequences
def longest_common_sequence(sequences_dictionary):
    sequences = list(sequences_dictionary.values())       # list with all the sequences
    if len(sequences) == 0:
        return ""
    shortest = min(sequences, key=len)                    # the answer cannot be longer than the shortest sequence
    for length in range(len(shortest), 0, -1):            # try the longest candidates first
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start+length]      # contiguous piece of the shortest sequence
            common = True
            for sequence in sequences:                    # the candidate must appear in every sequence
                if candidate not in sequence:
                    common = False
                    break
            if common:
                return candidate                          # first hit at this length is the longest common one
    return ""                                             # the sequences share no nucleotide at all
# input file containing all the sequences
name_input_file = "problem_1_question_6.fasta"
# sequences stored in a dictionary
sequences_dictionary = read_fasta(name_input_file)
# longest common substring from a set of string sequences
longest_common_substring = longest_common_sequence(sequences_dictionary)
print("The longest common substring is: ", longest_common_substring)
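For the two-sequence case used as an example in the question, a common alternative is a dynamic-programming table in which each entry holds the length of the common run ending at a given pair of positions. The sketch below is not the method used in the solution above; the name longest_common_substring_pair is hypothetical, and when several common runs share the maximum length it returns the one that ends earliest in the first sequence.

```python
# Hypothetical pairwise helper based on dynamic programming.
def longest_common_substring_pair(first, second):
    """Return the longest contiguous run of characters shared by two strings."""
    best_length = 0
    best_end = 0                                   # end index (exclusive) of the best run in `first`
    previous_row = [0] * (len(second) + 1)
    for i in range(1, len(first) + 1):
        current_row = [0] * (len(second) + 1)
        for j in range(1, len(second) + 1):
            if first[i - 1] == second[j - 1]:
                # extend the common run that ended at the previous pair of positions
                current_row[j] = previous_row[j - 1] + 1
                if current_row[j] > best_length:
                    best_length = current_row[j]
                    best_end = i
        previous_row = current_row
    return first[best_end - best_length:best_end]


print(longest_common_substring_pair("ACTGCT", "TGCCCT"))  # prints TGC
```

Only two rows of the table are kept at any time, so the memory use grows with the length of the second sequence rather than with the product of both lengths.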