Building the Central Dogma in Python: Part 1

Michael French,genomicspython

I am a Data Engineer by trade, and haven't officially touched anything biology related since high school. I recently picked up an interest in genomics and wanted to improve my biology basics. If you were like me coming in, you probably aren't familiar with the central dogma of biology or it probably was an idea tucked away in the depths of your brain.

In a nutshell, the central dogma of molecular biology (opens in a new tab) describes how DNA turns into stuff. It says DNA holds information on how to make RNA which then makes proteins which then do stuff.

This idea that DNA holds all of the information to make us and every other living thing on this planet fascinates me as a software guy. I find it very analagous to how 1s and 0s in a computer can lead to the multitude of applications that exist on the web. I wanted to learn how and why it worked, so that's what I'm doing.

All of the code for this project exists on my Github (https://github.com/frenchytheasian/dna_simulator).If (opens in a new tab) I got anything wrong throughout this article, or you have any questions, feel free to reach out to me at michael.x.french@gmail.com.

The project that I am sharing in this article is a fun little aid in what I have learnt so far. The current implementation mirrors a high school level explanation of transcript and translation, but I hope to build upon it to make it better reflect the nuances of nature.

DNA

The first thing I must address is what DNA is. DNA is the molecule in which our biological information is stored. I like to think of it as a long string of beads, where each bead is a different letter. DNA is stored in tightly packed bundles called chromosomes (opens in a new tab). We humans have 23 of these DNA packages under the majority of circumstances, but that number varies throughout nature. DNA is composed of four different nucleotides (opens in a new tab); adenine, thynine, guanine, and cytosine. These different nucleotides are the lettered beads on the long string that I mentioned earlier.

Typically, each of the different nucleotides is represented with its capitalized first letter; A, C, G, T. An example stretch of DNA might look like this: ATGTTTAAAGTAGAGAAAGTGAGAGAGTGAGTAGTAAAGAGTGGGAGGAGGAGAG When working with DNA in software, this makes things really easy as a lot of operations on DNA are string operations.

Transcription

The first step of turning any sequence of DNA is to transcribe it into RNA. This is a biologically complex operation. Here's an awesome video that goes relatively in depth to this process with animations and everything.

Here's a shorter animation of this process.

However, describing this process in code is actually a pretty straightforward process (at least at a high school level description. I have learnt that there is a ton of nuance in determining what should be transcribed and when which I do not attempt to capture here). The only change that needs to be made to the DNA string is to change all of the T's to U's because in transcription, all of the thynines are converted to uracils. The python code for transcription looks like this:

def transcribe(sequence: str) -> str:
    return sequence.replace("T", "U")

Translation

The next step is to turn the RNA (this piece of RNA is specifically called the messenger RNA or mRNA. It turns out that RNA is super cool and can do a bunch of different stuff) into a Protein. This is also a super complicated and nuanced biological process, but the simplified rundown of it is that the DNA is read in threes (called codons). Each codon maps to a specific amino acid shown in the below table. RNA to Amino Acide mapping table

You might notice that AUG is highlighted blue. AUG is the start codon. It is used by your cellular machinery to determine where in the mRNA it should start the translation process. UAA, UGA, and UAG are stop codons. They are used to determine where to stop translation.

In python, this process looks like the following:

def translate(mrna: str) -> str:
    start = mrna.find("AUG")
    
    aa = []
    for start_bp in range(start, len(mrna), 3):
        current_codon = mrna[start_bp: start_bp + 3]
        if MRNA_TO_AA_MAPPING[current_codon] == "*":
            return "".join(aa)
        
        aa.append(MRNA_TO_AA_MAPPING[current_codon])

where MRNA_TO_AA_MAPPING is a python dictionary that maps the input codons to their corresponding amino acids, which are represented by single letters.

So after completing this second step, we have completed our Python implementation of the central dogma! Now to actually see it in action and to verify, we need some data.

Does this Implementation Work?

To verify how well this worked on real data, I decided that I wanted to test this program on all of the real protein coding human genes. In order to do this, I created scripts to download all of the necessary data from NCBI's (https://www.ncbi.nlm.nih.gov/ (opens in a new tab)) nucleotide and Protein databases. I grabbed a table containing a list of all protein coding nucleotide sequences from the following link: (https://www.ncbi.nlm.nih.gov/datasets/gene/taxon/9606/ (opens in a new tab)). I then used this table to download the raw DNA sequences that I was to run my program on and the corresponding protein sequence to validate my results against.

Here's a snippet of code showing the process of downloading the nucleotide files. The process for downloading the protein sequence files was very similar and is located here: (https://github.com/frenchytheasian/dna_simulator/blob/main/src/utils.py (opens in a new tab))

def download_verification_files(table: pd.DataFrame, email: str) -> None:
    Entrez.email = email
    for i in range(0, len(table), 100):
        end = i + 100 if i + 100 < len(table) else len(table)
        sub_table = table[i:end]
 
        transcripts_accessions = ",".join(sub_table['Transcripts accession'].astype('str').to_list())
        dna_seqs_handle = Entrez.efetch(db="nucleotide", id=transcripts_accessions, rettype="fasta", retmode="text")
        dna_parser = SimpleFastaParser(dna_seqs_handle)
        for entry in dna_parser:
            title, seq = entry 
            accession = title.split(" ")[0]
            row = sub_table[sub_table["Transcripts accession"] == accession].iloc[0]
            symbol = row["Symbol"]
            begin = row["Begin"]
            file_name = f"dna_seq/{symbol}_{begin}_dna.fasta"
            with open(file_name, "w") as f:
                f.write(f">{title}\n{seq}")

I then wrote a simple test script to run all of the downloaded files through my program and mark whether or not they were correctly translated.

import os 
 
from src.main import main, read_fasta
 
dna_dir = 'dna_seq'
dna_sequences = os.listdir(dna_dir)
 
matches = 0
for file in dna_sequences:
    try:
        protein_seq = main(f"{dna_dir}/{file}")
    except (KeyError, ValueError) as e:
        print(e)
        continue 
 
    try:
        protein_seq_val = read_fasta(f"protein_seq/{file.replace('_dna.fasta', '_protein.fasta')}")
    except FileNotFoundError as e:
        print(e)
        continue
    if protein_seq_val != protein_seq:
        print(f"{file} does not match")
        # print(protein_seq_val, end="\n\n")
        # print(protein_seq)
    else:
        matches += 1
 
print(f"{matches}/{len(dna_sequences)} sequences matched their protein sequence!")

Interestingly, this program only correcly translates 12975/22859 of the sequences that I downloaded from NCBI. Upon doing a little more digging, this discrepency seems to results from the fact that our program doesn't encode all of the nuances that exist in the real world. For example, the couple of genes that I spot checked, I noticed that in the validation data, translation actually doesn't start at the first AUG, as to why, that will be more for me to research and implement in later iterations of this program.