Pretend-playing with nucleic acids: using free software tools to manipulate DNA and RNA sequences to get them in the form you need and figure out what proteins they might code for… A tech tip post on complementing, reverse-complementing, translating, and more! Basically, a post trying to explain some of the nucleic acid sequence terminology that might trip you up and providing some tips and tools to help. So put on your reading frames and let’s go!
Today’s post is more technical, so I’m going to assume that if you care enough to read it, you know a bit about nucleic acids (DNA and RNA) already (and I’ll provide links to relevant posts if you want to review). The basic gist is that proteins are like molecular workers, made up of a string of amino acid letters. And the instructions for them are kept in the form of double-stranded DNA (dsDNA) as part of long chromosomes. In this genomic DNA, there are parts that have protein-making instructions (coding parts aka exons) and parts which “just” have regulatory instructions (non-coding parts – introns (parts between exons that get removed during mRNA processing) and intergenic regions (parts between genes)).
When a cell wants to make a protein, it makes a messenger RNA (mRNA) copy of one strand of the DNA and that is used by ribosomes to make the encoded protein in a process called translation.
So you have: gene (DNA) gets transcribed -> pre-mRNA gets processed (introns spliced out, cap and tail added) -> mature mRNA gets translated by ribosomes -> protein
If you have a nucleic acid sequence (DNA or RNA form) you might want to know what the corresponding amino acid sequence is – so you want to “translate” it. You can use online software like Expasy translate to do this (and there are a bunch of other tools if you just Google it). https://web.expasy.org/translate/
What that does is it takes the sequence you enter and “pretends it’s a ribosome” – it (in make-believe-land) translates all 6 possible open reading frames (ORFs).
Note on open reading frames: the ribosome reads mRNAs in 3-RNA-letter “words” called codons. For example, GAA spells Glutamate (Glu, E), GGU spells Glycine (Gly, G), GCU spells Alanine (Ala, A), GAC spells Aspartate (Asp, D), and AGC spells Serine (Ser, S). DNA has the letter T instead of U (so, for example, you would get the mRNA sequence GAAGGUGCUGACAGC by copying the DNA sequence GAAGGTGCTGACAGC). The biochemical basis for this is that when the ribosome sits on a codon, a molecule called a transfer RNA (tRNA) with the complementary 3 letters (the anticodon) and the corresponding amino acid on the other end brings that amino acid to the ribosome, and the ribosome adds it to the growing chain of amino acids and then scoots over to the next codon to do it again. https://bit.ly/translationtimestwo
That biochemical basis doesn’t really matter for this, but I hate leaving people hanging… The thing that does matter is that as a consequence, codons are read as non-overlapping words. http://bit.ly/learngeneticcode
So, for example, …GAAGGUGCUGACAGC… (…GAAGGTGCTGACAGC… in DNA letters) spells …EGADS…. This means that where you start reading (your reading frame) matters. (… GAA GGU … is different from … ..G AAG GU. … which is different from … .GA AGG U.. …).
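If it helps to see the “non-overlapping words” idea as code, here’s a minimal Python sketch (just an illustration, not how any particular tool does it). It rewrites the DNA letters as mRNA letters (T becomes U), chops the message into 3-letter codons, and looks each one up in a toy table that holds only the five codons named above. The names `MINI_CODON_TABLE`, `transcribe`, and `read_codons` are made up for this example, and a real table would have all 64 codons.

```python
# Toy codon table: ONLY the five codons named in this post (a real table has 64).
MINI_CODON_TABLE = {"GAA": "E", "GGU": "G", "GCU": "A", "GAC": "D", "AGC": "S"}


def transcribe(dna):
    """Rewrite a DNA sequence in mRNA letters (T becomes U)."""
    return dna.upper().replace("T", "U")


def read_codons(mrna, frame=0):
    """Split an mRNA into non-overlapping codons, starting at offset `frame` (0, 1, or 2).

    Leftover letters at the end that don't fill a whole codon are dropped.
    """
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]


mrna = transcribe("GAAGGTGCTGACAGC")
codons = read_codons(mrna)
peptide = "".join(MINI_CODON_TABLE[c] for c in codons)

print(mrna)     # GAAGGUGCUGACAGC
print(codons)   # ['GAA', 'GGU', 'GCU', 'GAC', 'AGC']
print(peptide)  # EGADS

# Start one letter later and you get completely different words:
print(read_codons(mrna, frame=1))  # ['AAG', 'GUG', 'CUG', 'ACA']
```

The second frame isn’t translated here because the toy table doesn’t know those codons; that’s a limit of the sketch, not of the genetic code.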
What strand you’re on matters too. The details are too complicated to get into here, but DNA and RNA have 2 ends – a starting end called the 5’ (five prime) end and an ending end called the 3’ end. So you read (and write) them 5’->3’. And the ribosome makes protein starting at the 5’ end of the mRNA and moving towards the 3’ end. DNA is typically double-stranded, and those strands are antiparallel, meaning that if you looked at the sequence of double-stranded DNA (dsDNA), one strand (usually the top strand by convention) would be written 5’->3’ and the other strand would be written 3’->5’.
The other thing to know about dsDNA is that the sequences are complementary, not identical! Genes are present in double-stranded DNA, but only one of the strands has “readable” instructions for a protein (but which strand this is varies for different proteins). DNA (and RNA) have 4 letters (A, C, G, & T (in DNA) or U (in RNA)), with specific 1:1 base pairing relationship (A to T (U in RNA) and C to G). So, for example, if one strand of DNA reads GAAGGTGCTGACAGC, the other strand would read CTTCCACGACTGTCG, except that the strands are directional and antiparallel, so if the first strand is 5’-GAAGGTGCTGACAGC-3’ the second strand is 5’-GCTGTCAGCACCTTC-3’. The unreversed form is called the complement and the reversed form is the reverse-complement.
So it’d be like
5’-GAAGGTGCTGACAGC-3’
3’-CTTCCACGACTGTCG-5’
where the complement of the top strand is CTTCCACGACTGTCG and the reverse complement is GCTGTCAGCACCTTC.
Basically, the complement is the one that “looks opposite” and the reverse complement is the one that “looks opposite and backwards.”
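Here’s a small Python sketch of “opposite” vs “opposite and backwards,” assuming a plain DNA sequence made only of A, C, G, and T (no ambiguity codes, no U). The function names are just ones I picked for illustration.

```python
# Base-pairing partners for DNA: A pairs with T, C pairs with G.
PAIRS = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its pairing partner, keeping the same left-to-right order."""
    return dna.translate(PAIRS)


def reverse_complement(dna):
    """Complement, then reverse, so the result is written 5'->3' like the input."""
    return complement(dna)[::-1]


top = "GAAGGTGCTGACAGC"
print(complement(top))          # CTTCCACGACTGTCG
print(reverse_complement(top))  # GCTGTCAGCACCTTC

# Reverse-complementing twice gets you back where you started.
print(reverse_complement(reverse_complement(top)) == top)  # True
```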
WARNING!!!!!! If you’re trying to design oligonucleotide (short DNA) probes or PCR primers, you need to use the reverse complement of the strand you want them to bind so they will stick! http://bit.ly/pcrtrain
But you don’t need to try to do it by hand. Here’s a helpful website that will convert them for you: https://www.bioinformatics.org/sms/index.html
Going back to the idea of reading frames. The ribosome knows where to start reading (and making a protein) because of something called a start codon. The codon AUG signals “start” but it also stands for the amino acid Methionine (Met, M) and when it’s not at the start it will just mean Met. So the ribosome needs other clues to know which AUG is the start. The translate software doesn’t have these clues. And it’s likely that you’ve only given it part of a protein’s coding sequence, one that doesn’t contain that start part. So the software gives you all the possible reading frames.
Say you put the sequence GAAGGTGCTGACAGC into Expasy translate. It doesn’t know whether you’ve given it the “right strand” so, unless you tell it you have, it will generate the reverse-complementary strand and then translate in all 6 possible reading frames (3 forward reading frames from the original strand and 3 reverse reading frames from the reverse-complementary sequence). The reason there aren’t more than 3 of each is that, since codons are 3 letters, if you shift one more letter, you’ll be back to the reading frame you started with (just starting one codon later).
the forward frames:
GAA GGT GCT GAC AGC
G AAG GTG CTG ACA GC
GA AGG TGC TGA CAG C
and the reverse frames:
GCT GTC AGC ACC TTC
G CTG TCA GCA CCT TC
GC TGT CAG CAC CTT C
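If you’d like to generate those six layouts yourself, here’s a rough Python sketch. It assumes a plain A/C/G/T sequence, and the helper names (`show_frame`, `codons`) are made up for this example. The last line also checks the “shift by 3 and you’re back where you started” point.

```python
def reverse_complement(dna):
    """Swap each base for its partner (A<->T, C<->G), then reverse the result."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def show_frame(seq, offset):
    """Return seq with spaces between codons, reading from `offset` (0, 1, or 2).

    Letters before the first codon and leftovers at the end are kept so the
    layout matches how the frames are written out above.
    """
    head = [seq[:offset]] if offset else []
    words = [seq[i:i + 3] for i in range(offset, len(seq), 3)]
    return " ".join(head + words)


def codons(seq, offset):
    """Return only the complete 3-letter codons, reading from `offset`."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


forward = "GAAGGTGCTGACAGC"
reverse = reverse_complement(forward)

for strand in (forward, reverse):
    for offset in range(3):
        print(show_frame(strand, offset))

# Shifting by a whole codon (3 letters) gives frame 1 again, minus its first codon.
print(codons(forward, 3) == codons(forward, 0)[1:])  # True
```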
The translated frames (amino acid sequences) are:
5’3’ Frame 1: EGADS
5’3’ Frame 2: KVLT
5’3’ Frame 3: RC-Q (where the – indicates a stop codon)
3’5’ Frame 1: AVSTF
3’5’ Frame 2: LSAP
3’5’ Frame 3: CQHL
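For the curious, here’s a minimal Python sketch of a six-frame “pretend ribosome.” To be clear, this is not Expasy’s code, just my guess at the general idea: it assumes the standard genetic code, a plain A/C/G/T (or U) sequence, and it prints stop codons as “-”. The function names and frame labels are made up to mirror the output above.

```python
from itertools import product

# Standard genetic code, with codons in TTT, TTC, TTA, TTG, TCT, ... order.
# "-" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Swap each base for its partner (A<->T, C<->G), then reverse the result."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(dna, offset):
    """Translate complete codons only, starting at `offset` (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(seq):
    """Translate 3 forward frames and 3 frames of the reverse complement."""
    dna = seq.upper().replace("U", "T")  # accept RNA letters too
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"5'3' Frame {offset + 1}"] = translate_frame(dna, offset)
    for offset in range(3):
        frames[f"3'5' Frame {offset + 1}"] = translate_frame(rev, offset)
    return frames


for label, peptide in six_frames("GAAGGTGCTGACAGC").items():
    print(f"{label}: {peptide}")
```

Running it on GAAGGTGCTGACAGC should reproduce the six translations listed above.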
Note that if I started with the reverse complement, I would get the same translated frames except the 5’3’ vs 3’5’ labels would be swapped, i.e.
5’3’ Frame 1: AVSTF
5’3’ Frame 2: LSAP
5’3’ Frame 3: CQHL
3’5’ Frame 1: EGADS
3’5’ Frame 2: KVLT
3’5’ Frame 3: RC-Q (where the – indicates a stop codon)
This is because even though we might write one strand on the top, there really is no “top” – if you rotate the double-stranded sequence 180° you still get a 5’-3’ strand on top.
I don’t know if this post was helpful to anyone. I really hope it was. I just know this whole “reverse complementarity” thing can really trip people up so thought I’d make this tech tip to try to help out. Maybe at least the figures…
more on nucleic acids: http://bit.ly/nucleicacids2
more on topics mentioned (& others) #365DaysOfScience All (with topics listed) 👉 http://bit.ly/2OllAB0