Protein expression problems? Perhaps you need to be more polite… What color is the sky before a storm? Brits may write “grEy” while Americans write “grAy” but both recognize it as meaning a color between white & black. Similarly, the 3-letter genetic “words” that “spell” the amino acid building blocks of proteins can be written with different spellings. All organisms can understand these words (the GENETIC CODE is universal), BUT different organisms prefer different spellings (CODON BIAS). If you want to get an organism to express a protein for you, you might want to “be polite” and use their spelling! That’s the idea behind codon optimization (which also includes considering the codon in context, making sure you don’t cause the DNA or RNA to fold up weird, and making sure you don’t accidentally introduce sequences the cell thinks are ribosome binding sites or something – more on all this stuff in the video).
note: refreshed & video added 2/8/22
Proteins are like charm bracelets made up of amino acids, where the chain links are a generic peptide backbone and the charms sticking off are the unique side chains (for more check out https://bit.ly/aminoacidsposts). Due to the letters’ unique properties (e.g. big & bulky, small & flexible, negative, neutral, or positive, water-loving or water-excluded) the order of the charms largely determines how the protein folds & functions.
This order is written in DNA form in the protein’s gene. Through the process of transcription, this DNA version gets copied into a messenger RNA (mRNA) version (really similar except DNA has one less oxygen on its sugar part and, instead of the nucleotide letter T, RNA has U).
In addition to linking up through their generic backbone to form chains, nucleotides can “base pair” through their unique bases: A’s attracted to T (or U) and G to C. These attractions are of the type we call “hydrogen bonds” (H-bonds) and the important thing to know here is that they’re specific and reversible. So DNA and RNA strands can be specifically zipped & unzipped & rezipped, and if you know the sequence of 1 strand you can predict the sequence of a “complementary strand.”
When we talk about “complementary strands” we’re usually referring to the 2 strands of double-stranded DNA, which is the way your DNA usually hangs out. It’s great for protecting the bases from damage and easy to make copies of (just unzip and use one strand as a template for making a copy of the other).
mRNA, however, is single-stranded. So it’s like half a zipper, and its bases are exposed in prime position for complementary sequences to bind. Instead of full strands of complementary nucleic acids bound long-term, these mRNA strands are bound by shorter RNA pieces that come and go – these come-and-go-ers are called transfer RNAs (tRNAs) and what’s super cool about them is that they’re hooked up to amino acid letters that they hand off before leaving. tRNA is a type of “functional RNA” meaning that, unlike the messenger RNA (mRNA) intermediary that’s just an RNA copy of the DNA gene, tRNA never gets made into protein – but it does help make other proteins by TRANSFERing free-floating amino acids to a growing protein chain!
It works like this: The protein charm bracelet is put together amino acid by amino acid through the process of TRANSLATION. Molecular machinery called RIBOSOMES help link them together, but they rely on tRNA “servants” to bring them the right amino acid charm to add.
How do they know which one’s right? Different amino acids are specified by 3-letter RNA “words” called CODONS. There are 4 nucleotide letters – A, C, G, & T/U – and 3 positions per codon, so there are 4 × 4 × 4 = 64 possible codons. BUT there are only 20 (common) amino acids. 3 of the codons don’t spell an amino acid – instead they spell STOP & signal the end of the protein. But that still leaves 61. So some amino acids have multiple codons (we call this degeneracy or redundancy). BUT any 1 codon will only ever spell 1 amino acid (the code is NOT ambiguous). If you want to learn more about how this was discovered: http://bit.ly/nirenbergcodecracking
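If you want to check that codon arithmetic yourself, here is a minimal Python sketch. The variable names are just ones I made up for illustration, and the 3 STOP codons listed are those of the standard genetic code.

```python
from itertools import product

BASES = "UCAG"                        # the 4 RNA letters
STOP_CODONS = {"UAA", "UAG", "UGA"}   # the 3 codons that spell STOP (standard code)

# Every possible 3-letter word from a 4-letter alphabet: 4 * 4 * 4
all_codons = ["".join(letters) for letters in product(BASES, repeat=3)]
# Whatever is left after removing the STOPs has to be shared among the amino acids
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons))    # 64
print(len(sense_codons))  # 61
```

With 61 codons left over for only 20 amino acids, some amino acids have to get more than one spelling.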
One part of tRNA binds a specific amino acid and the other end contains a 3-nucleotide anticodon that is complementary to the matching 3-letter codon on the mRNA. Different tRNAs have different anticodons & carry different amino acids.
Because of degeneracy, multiple servants may bring the same amino acids (e.g. “gray” and “grey” both bring the same colored charm), but the ribosome’s a bit of a snob – it will only add that amino acid if it’s brought by the right servant. And the servant it wants is determined by the codon in the mRNA.
Each tRNA only ever brings 1 type of amino acid, BUT some tRNAs can read multiple codons because there’s some “wiggle room” in the 3rd position. In this so-called “wobble position,” “non-canonical base-pairing” (such as U to G) is sometimes allowed, so you don’t need 61 different tRNAs (you need at least 32; some cells use more). This is why a lot of the degenerate codons have the same first 2 letters. This also provides a source of genetic protection – if the 3rd base gets mutated the protein made may still be ok! This kind of change, where the DNA/RNA is altered without changing the resultant protein, happens naturally throughout evolution and is sometimes referred to as a “silent mutation.”
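To make “silent” concrete, here is a small Python sketch that looks up two codons in the standard genetic code and checks whether they spell the same amino acid. The helper name `is_silent` and the compact way of building the table are my own choices for illustration, not part of any particular tool.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order for the 1st, 2nd, then 3rd base ("*" = STOP)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def is_silent(codon_before, codon_after):
    """True if swapping one codon for the other leaves the amino acid unchanged."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]

print(is_silent("CUA", "CUG"))  # True  (3rd letter changed, still Leu)
print(is_silent("CUA", "CCA"))  # False (2nd letter changed, Leu -> Pro)
```

Note that this only asks whether the protein sequence stays the same. It says nothing about whether the new spelling is one the cell likes.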
Yesterday we looked at how we could introduce such “silent mutations” to create cut sites for restriction enzymes – these DNA scissors recognize specific sequences and cut them, which can be really useful for molecular cloning, in which you “recombine” pieces of DNA, such as sticking protein instructions into a circular piece of DNA called a plasmid vector and then sticking that into cells (often harmless bacteria or insect cells) to make the protein for us. We call this recombinant expression, and we’re in control of the DNA we put in there, but the expression cells are in control of whether that protein actually gets made.
To make the making more likely, another thing we can introduce “silent mutations” for is “codon optimization.” Organisms make each type of tRNA servant, but how many of each they have depends on how popular the corresponding codon spelling is – they stock up on the ones they have to use the most.
You can think of it kinda like one of those magnet poetry sets – if you’re selling a color-themed word magnet set in America, you’d probably include more “grAy” than “grEy” because that’s what your customers will demand more of. And vice versa in England. If an American needs a “grEy” magnet, this might slow down their poem-making because they’ll have to do a bunch of digging through magnets to find one. They might even “give up.”
Similarly, when a ribosome’s traveling along and it comes to a “rare codon” it’ll have to stop and wait for the matching tRNA. If it has to wait too long, “translational stalling” can lead to things like premature termination (“giving up” before the full protein’s made) or mistake-making (things like skipping letters or sticking in the wrong ones). However, sometimes a bit of a brief breather can be a benefit – cells use it to the forming protein’s advantage because “pausing” can allow for proper folding of the part of the protein that’s been made so far. Another time rareness can be useful for cells is as a way to control gene expression. If an mRNA contains a lot of rare codons, the corresponding protein will likely be translated more slowly, so less of it will be made.
That can be a good way for cells to regulate gene expression (it can even be used to respond/adapt to changing external conditions which lead to certain tRNAs being made more than others). But, while cells may benefit from reducing expression of certain genes at certain times, when it comes to recombinant protein expression, we want to get maximum expression of our protein (well, maybe not maximum maximum because if you overwhelm the cell too much, your protein can become “toxic” and/or it can clump up into insoluble “inclusion bodies”). So how to do this?
If you want to sell a magnet set, consider your audience’s preferences – replace those “grAy” magnets with “grEy” or vice versa. Similarly, we can order genes codon-optimized for the cell type we want to express them in (e.g. bacteria, yeast, insect cells). Companies like GenScript use algorithms to determine what the optimal codons are based in large part on what codons are most “popular” in those cells. Then they synthesize the genes to match.
For example, there are 6 codons that spell Leucine (Leu, L) & E. coli have 4 Leu tRNAs. The tRNA that recognizes CUG is very abundant, whereas the one for CUA is rare, so swapping CUA for CUG can lead to the recruitment of a more common servant and less holdup.
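Here is a deliberately naive Python sketch of that kind of swap. It is not how GenScript or any other company actually does it: the `PREFERRED` list is a hypothetical stand-in that only contains the CUG-for-Leu example from above, the function names are mine, and the test sequence is made up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order for the 1st, 2nd, then 3rd base ("*" = STOP)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Hypothetical "preferred spelling" list for the host. Only the E. coli Leu
# example from the text is filled in; a real table would cover every amino acid.
PREFERRED = {"L": "CUG"}

def split_codons(mrna):
    """Chop an in-frame mRNA coding sequence into 3-letter codons."""
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [mrna[i:i + 3] for i in range(0, len(mrna), 3)]

def translate(mrna):
    """Spell out the protein (1-letter amino acid codes, '*' for STOP)."""
    return "".join(CODON_TABLE[c] for c in split_codons(mrna))

def swap_to_preferred(mrna, preferred=PREFERRED):
    """Replace each codon with the host's preferred synonym, if one is listed."""
    return "".join(preferred.get(CODON_TABLE[c], c) for c in split_codons(mrna))

original = "AUGCUAGAACUAUAA"     # made-up sequence: AUG CUA GAA CUA UAA
optimized = swap_to_preferred(original)
print(optimized)                                     # AUGCUGGAACUGUAA
print(translate(original) == translate(optimized))   # True: same protein
```

The protein spelling stays the same and only the codon spelling changes. A real optimizer also has to worry about the other things mentioned at the top of the post (codon context, DNA/RNA folding, accidental ribosome binding sites), which this sketch ignores entirely.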
I said “in large part” based on relative #s, because it’s not just quantity that matters – even when a tRNA can read 2 codons, 1 form is usually “preferred” – the tRNA can read either, but it works better for one of them. So even if the cells have lots of the tRNA that can read that codon, you’ll get better results if you use the one it prefers.
Going back to our E. coli codons: The phenylalanine (Phe, F) codons UUU & UUC are read by the same tRNA, but that tRNA has a 3’-AAG-5’ anticodon, so it prefers UUC, which it can bind perfectly to. So if you change a sequence from UUU to UUC you might have better luck even though they use the same tRNA.
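The Phe example can be sketched in Python as a simple pairing check. This is a simplified model I'm assuming for illustration: it only allows standard A-U and G-C pairs plus the G-U wobble pair in the 3rd codon position, and it ignores the modified bases real tRNAs often carry. The function name `pairing` is made up.

```python
# Standard pairs, plus the G-U "wobble" pair allowed only at codon position 3
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def pairing(codon, anticodon_3to5):
    """Classify how an anticodon (written 3'->5') pairs with a codon (5'->3').

    Returns "perfect", "wobble", or None if the tRNA cannot read the codon.
    """
    used_wobble = False
    for position, pair in enumerate(zip(codon, anticodon_3to5)):
        if pair in WATSON_CRICK:
            continue
        if position == 2 and pair in WOBBLE:
            used_wobble = True   # tolerated, but only in the 3rd position
        else:
            return None          # mismatch the tRNA cannot accommodate
    return "wobble" if used_wobble else "perfect"

print(pairing("UUC", "AAG"))  # perfect
print(pairing("UUU", "AAG"))  # wobble
print(pairing("UUA", "AAG"))  # None
```

Both Phe codons get read by the 3’-AAG-5’ anticodon, but only UUC gets the perfect match, which is the “preferred” form described above.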
Codon optimization can improve recombinant protein expression, but it can also be expensive, so you probably don’t want to try it unless you’re having problems with the native version. BUT making point mutations (changing single codons) is much easier than replacing lots of them, and we do this ourselves a lot. When I do this site-directed mutagenesis, I consult a codon usage bias table like this https://www.biologicscorp.com/tools/CodonUsage#.XFcv_89KhsM
A (cheaper) alternative to codon optimization is expression of alternative tRNAs – basically, instead of changing your sequence to match tRNA servant availability, you change tRNA servant availability to match your sequence, getting the cells to make more of the tRNAs your mRNA calls for. For example, Rosetta™ E. coli strains contain a “pRARE” plasmid carrying extra copies of tRNA genes for codons that are more common in humans than in bacteria (for example it adds copies of the CUA-readers as well as tRNAs that recognize AGG, AGA, AUA, CCC & GGA), so human proteins can be expressed in E. coli with fewer translational holdups.
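One quick way to get a feel for whether a strain like this might help is to count how often those codons show up in your coding sequence. Here is a rough Python sketch, assuming the sequence is written as mRNA and starts in frame. The function name and the example sequence are made up, and the codon list is just the one given above.

```python
from collections import Counter

# Codons listed above as being read by the extra tRNAs on the pRARE plasmid
PRARE_CODONS = {"CUA", "AGG", "AGA", "AUA", "CCC", "GGA"}

def count_rare_codons(mrna, rare=PRARE_CODONS):
    """Count in-frame occurrences of each listed codon in a coding sequence."""
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return Counter(codon for codon in codons if codon in rare)

example = "AUGAGGCUAAGGGGCUAA"   # made-up sequence: AUG AGG CUA AGG GGC UAA
counts = count_rare_codons(example)
print(counts["AGG"], counts["CUA"], counts["GGA"])  # 2 1 0
```

Reading in frame matters here: the letters G-G-A can appear across a codon boundary without there being an actual GGA codon, so the sketch only looks at every 3rd position.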
This can be great because it can help out for expressing a whole host of different proteins – but your host might not think it’s so great… Cells have evolved to work best at their natural tRNA #s/proportions, so if you skew those you can anger them and this can have effects including reduced cell growth.
for more info on other things that go into codon optimization…
Kudla G, Murray AW, Tollervey D, Plotkin JB. Coding-sequence determinants of gene expression in Escherichia coli. Science. 2009 Apr 10;324(5924):255-8. doi: 10.1126/science.1170160. PMID: 19359587; PMCID: PMC3902468.
GenScript webinar by Dr. Rachel Speer: Codon optimization: Why & how to design DNA sequences for optimal soluble protein expression https://youtu.be/P-fjZPf3Dnw