Cloning success! How do I know? Because my DNA sequencing data tells me so! I recently got back the sequencing results for a molecularly-cloned protein construct. Basically, there’s a me-altered version of the protein (aka a construct) which I want to express (get cells to make for me), so I took the genetic instructions for that protein and stuck that recipe into a circular piece of DNA called a plasmid. That plasmid serves as a vector or “vehicle” for getting (and keeping) the protein instructions in bacterial cells. But before I try to get cells to make the protein, I want to make sure that the recipe got into the plasmid okay and there aren’t any typos. A technique called colony PCR can quickly tell me if my recipe *likely* got in there, but only sequencing can tell me if there are any typos. How do they work? Here goes…

note, this is an updated form of a past post – I’ve included more practical advice on analyzing sequencing data 

Polymerase Chain Reaction (PCR) is a way to amplify (make lots of copies of) short stretches of DNA from longer pieces of double-stranded (ds) DNA we call the TEMPLATE. We choose what region to copy by designing short pieces of DNA called PRIMERS to bookend the start & stop of this region (1 per strand) so that a protein called DNA POLYMERASE (DNA Pol) can copy each strand.
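
To make the “bookend” idea concrete, here’s a minimal Python sketch of PCR done “in silico.” The function names and the toy sequences are made up for illustration, and real primers are much longer (roughly 18-25 letters). It only looks for perfect matches, so it ignores mispriming and everything else that happens in a real tube.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, G<->C)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def amplicon(template, forward_primer, reverse_primer):
    """Return the stretch of the top strand that the primers bookend, or None.

    Both primers are written 5'->3'. The forward primer matches the top
    strand. The reverse primer matches the bottom strand, so we search
    the top strand for its reverse complement.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None  # forward primer has nothing to bind to
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None  # reverse primer has nothing to bind to
    return template[start:end + len(reverse_site)]


# Hypothetical toy template and 6-letter "primers"
template = "TTGACCATGGCTAAAGGTGAAGAATTGTAACTCGAGGCA"
product = amplicon(template, "GACCAT", "TGCCTC")
print(product, len(product))
```

The product includes the primer sequences themselves, which is why primers get built into every copy.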

But where does the template itself come from? That depends.

You’ve likely heard of PCR being used to test for paternity or presence at a crime scene. In this case, the template comes from cheek swabs, trace evidence, etc. My templates come from someplace a bit different… Most frequently I use PCR to make lots of copies of a gene to put into a vector to put into bacteria to make more of the gene to put into other bacteria to make a bacmid to put into insect cells to make a baculovirus to infect more insect cells to make more baculovirus to infect more insect cells to express my protein.

There are different variations of PCR and reasons to use it, but I mainly use PCR during the process of MOLECULAR CLONING -> I copy an (edited) gene for a protein I want to study from one template and put that gene INSERT into a circular piece of DNA called a PLASMID VECTOR that has the “bells and whistles” I want, like “tags” to help with purification and start signals for turning the gene into protein. Then I stick this RECOMBINANT plasmid into bacterial cells so the bacteria will make more of the DNA and/or protein.

But how do I know if the bacteria *really* have my gene in them? The plasmid vector has a selection marker – often an antibiotic resistance gene – so that if you grow the bacteria that should have it on food containing that antibiotic, only the bacteria that have the plasmid (and hence the resistance gene) are able to grow. These bacteria grow and replicate to form individual “colonies” on a bacterial plate. Each colony has lots of cells but they all have the same genetic makeup.

BUT this only tells you if the *plasmid* is inside the bacteria, not if your gene is inside that plasmid. To answer this latter question, you can use PCR with cleverly designed primers. You have a few options, and the presence/size of the copied products (which you can tell by agarose gel electrophoresis) can tell you different things:

INSERT-SPECIFIC PRIMERS: both primers are in the INSERT (the gene you put in). This is a YES/NO for whether your insert’s present. If your gene’s not there, there will be nothing for the primers to bind to -> no product. But if your gene is there, the primers will latch on & Pol will copy between them -> product (note that by product I mean a defined, specific product, not “nonspecific products” that can come from primers binding incorrectly (mispriming)).

🔹tells you if your gene is present BUT NOT if your gene is where you want it…

🔹advantage is that you can use this same set of primers to test for your insert in different plasmids

VECTOR-SPECIFIC PRIMERS: both primers are in the VECTOR, straddling the insertion site. As long as the plasmid’s present, you should get some sort of product, but it’s the SIZE of the product that gives you your answer (not a simple yes/no like above) – if your insert’s not in the vector the product will be really short but if your insert’s in there, the product should be bigger (that short length PLUS the length of your insert)

🔹tells you if your gene (or something of that same size) is present IN YOUR VECTOR

🔹useful because you can use the same pair of primers to test different constructs since the primers are specific for the vector not the insert

🔹does NOT tell you whether your insert is inserted in the correct direction. for that you can use

ORIENTATION-SPECIFIC PRIMERS: one primer is in the insert & the other is in the vector

🔹you’ll only get a defined product if your gene is facing the right way (not put in backwards so that the “start making protein here” message on the plasmid is next to the “stop making protein here” message on the DNA)

🔹tells you 1) is your plasmid present 2) is your gene present 3) is your gene in your plasmid and 4) is your gene “backwards”

🔹downside is you have to design a specific primer
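
The three primer strategies above can be compared with a small simulation. This is a rough sketch under loud assumptions: the vector, insert, and 6-letter primers are all invented toy sequences, only perfect matches count, and a “product” here just means the two primers point toward each other on opposite strands.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def product_length(template, primer_a, primer_b):
    """Length of the product the primer pair would give, or None.

    Either primer may match the top strand; the other must then point
    back toward it from the bottom strand (perfect matches only).
    """
    for top, bottom in ((primer_a, primer_b), (primer_b, primer_a)):
        start = template.find(top)
        if start == -1:
            continue
        site = revcomp(bottom)
        end = template.find(site, start + len(top))
        if end != -1:
            return end + len(site) - start
    return None


# Hypothetical toy sequences (top strand only, written 5'->3')
VECTOR_LEFT = "CAGTTCGACTAG"
VECTOR_RIGHT = "GTCAACCTGTCA"
INSERT = "ATGGCTAAAGGTGAATAA"

constructs = {
    "empty vector": VECTOR_LEFT + VECTOR_RIGHT,
    "insert forward": VECTOR_LEFT + INSERT + VECTOR_RIGHT,
    "insert backwards": VECTOR_LEFT + revcomp(INSERT) + VECTOR_RIGHT,
}

primer_pairs = {
    "insert-specific": ("ATGGCT", "TTATTC"),
    "vector-specific": ("CAGTTC", "TGACAG"),
    "orientation-specific": ("CAGTTC", "TTATTC"),
}

results = {
    (pair_name, construct_name): product_length(sequence, *primers)
    for pair_name, primers in primer_pairs.items()
    for construct_name, sequence in constructs.items()
}
for key, length in results.items():
    print(key, length)
```

In this toy setup the insert-specific pair gives the same product whether the insert is forward or backwards, the vector-specific pair gives a short product for the empty vector and a longer one for either orientation, and only the orientation-specific pair tells forward from backwards.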

So we can use PCR as a secondary “screen” when cloning, but we still haven’t answered the question of how we get the DNA to screen. You can purify plasmid DNA out of bacteria, often with easy-to-use “mini prep kits,” but if you have lots of colonies to test, you don’t want to waste time purifying something “useless.” So you can skip the purification (for now) and add a teeny bit of the whole bacterial cells into your PCR mix.

Just barely touch the colony with a sterile toothpick or pipet tip & swirl it around a bit in your PCR mix. (Alternatively, you can resuspend a bit of it (pipet it up and down in some water) and add some of this to the PCR mix.)

When the reaction heats up to MELT the DNA (separate the strands) it also LYSES the cells (breaks them open) so that the DNA “spills out” and DNA Pol can latch on.

If you get a positive result, you can then go ahead and grow up more of that colony and purify it. 

Another “quick check” is an analytical restriction digest. The basic idea is that you cut out, within, etc., the part of your plasmid that should contain your gene. Then you see how many & how big those pieces are (with agarose gel electrophoresis). If your gene is there the piece will be much bigger than if it’s not there and/or, depending on where your cut sites are, you will get more pieces. And while you can’t tell exactly how many DNA letters are there, you get an idea whether you’re in the right ballpark. 
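
Here’s a small sketch of the fragment-size arithmetic for a circular plasmid. All of the numbers (plasmid lengths, cut positions, insert size) are hypothetical, and the function name is my own. It just measures the distances between neighboring cut sites, including the piece that wraps around the circle.

```python
def circular_fragments(plasmid_length, cut_positions):
    """Fragment sizes (largest first) after cutting a circular plasmid."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [plasmid_length]  # no cut sites: the circle stays intact
    # pieces between neighboring cut sites
    sizes = [later - earlier for earlier, later in zip(cuts, cuts[1:])]
    # the piece that wraps around from the last cut back to the first
    sizes.append(plasmid_length - cuts[-1] + cuts[0])
    return sorted(sizes, reverse=True)


# Hypothetical 5000 bp vector with cut sites flanking the cloning site
empty = circular_fragments(5000, [1000, 1050])
# Same vector carrying a 1200 bp insert between those two sites
with_insert = circular_fragments(6200, [1000, 2250])
# Same construct if the insert also happens to contain a cut site
insert_with_site = circular_fragments(6200, [1000, 1600, 2250])

print(empty, with_insert, insert_with_site)
```

The backbone piece stays the same size in all three cases, and it’s the smaller piece (or pieces) that gives the insert away.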

BUT – with either of these methods, you still don’t know if there are any typos! (is the sequence correct?) Both restriction enzymes and colony PCR primers only require that the short stretches of DNA they recognize are there & typo-free but that’s like seeing that one word in a document is spelled correctly and then taking that as proof you didn’t make any typos anywhere else in the document. 

For definitive evidence, you turn to DNA SEQUENCING (note: I don’t usually do the colony PCR or digest step unless I’m having problems (often not worth it)). But unlike the type of sequencing that sequences “all” your DNA, we’re only interested in sequencing the specific region with our gene.

Using sequencing primers is similar in setup and concept to vector-specific colony PCR – use 1 primer that matches a sequence upstream of your gene and one downstream. But, unlike in colony PCR, where you have both primers in the same reaction, for sequencing you run a separate reaction for each primer. Instead of focusing on making tons of copies, you focus on reading carefully – you read out the sequence as you add each base. So instead of making double-stranded (ds) copies of a defined region of DNA, you start making a copy of a single strand and you “stalk” the polymerase as it works.

You put in fluorescently-labeled nucleotides (the building blocks of nucleic acids) so you can “watch them be added” – most methods use dye-terminator chemistry where the fluorescently-labeled letters are “defective” – they’re dideoxynucleotides (ddNTPs), as opposed to the “normal” singly-oxygen-deficient deoxynucleotides (dNTPs). DNA has 1 less oxygen than RNA (at the 2’ position, the “right leg” of the sugar), and ddNTPs are also missing the 3’ OH (“left leg”), so there’s nowhere for more nucleotides to be added after one – it thus acts as a chain terminator – and since it’s fluorescently-labeled (with different colors for the different letters) you can see what letter each piece ended in.

You put in a mix of unlabeled, normal letters and labeled defective letters so you get pieces that all start at the same place (primer binding site) but stop at different letters after traveling different distances – you can run those pieces through capillary electrophoresis to separate them by size (like a really long, thin version of the agarose slab gels we often run) and you shine a laser at them as they travel so you can “read out” what letter each piece ends in, and from that read out the sequence.
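
The logic of “same start, different stops, sort by size, read the last letter” can be sketched in a few lines of Python. This is an idealized toy (the function name is made up): it pretends you get exactly one terminated piece for every position and that every piece separates perfectly, which real reactions only approximate.

```python
import random


def sanger_read(new_strand):
    """Rebuild a sequence from its chain-terminated fragments.

    `new_strand` is the strand being synthesized from the primer (5'->3').
    """
    # One fragment per position: each starts at the primer and ends
    # wherever a labeled terminator happened to be added.
    fragments = [new_strand[:n] for n in range(1, len(new_strand) + 1)]
    random.shuffle(fragments)  # in the tube they're all mixed together

    # Capillary electrophoresis: the shortest pieces reach the laser first.
    by_size = sorted(fragments, key=len)

    # The dye on each piece reports only its final letter.
    return "".join(fragment[-1] for fragment in by_size)


read = sanger_read("ATGGCTAAAGGT")
print(read)
```

Even though each piece only reports its last letter, sorting by length puts those letters back in order.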

To make gene-in-plasmid checking easier, plasmids often contain “standard” sequencing primer sites flanking the gene insertion site. These match standard primers that the sequencing companies will often provide for free – just ship them your plasmid, tell them what to use & they’ll send you the sequences.

But you can also use your own sequencing primers. I have primers that match the regions of the plasmids I commonly use right before and after where the gene goes in. They work no matter what gene’s in there and they’re designed so that their orientation sends Pol traveling into the insert. When you’re checking cloning products it’s especially important that you get good coverage of the insertion sites because that’s where errors are most likely to occur.

If you have a short gene you might be able to read it all with just end primers (if it’s really short only one may suffice) – but if it’s longer you might need to add additional primers that start in the gene itself (you’ll definitely have to custom-design those since they’re gene-specific not vector-specific) 

Note: I’m going to talk in “I” terms – and when I first wrote this post, I was doing all this, but we have 2 awesome lab technicians now who have been handling the cloning and I’m super grateful because it gives me more time to purify and play with the proteins!

I usually send DNA from 3-5 colonies of each construct for sequencing – I add the colony to liquid media to let it grow lots overnight, then do a mini prep (alkaline lysis) to purify out the plasmid DNA. Then I use the NanoDrop to figure out its concentration based on its UV absorbance (light-stealing) (the more DNA the higher the absorbance & you can use Beer’s Law to convert between the 2). I want to know the concentration because sequencing companies want a certain quantity of DNA so I need to know how much to add.

I calculate so I’m in the recommended range, add one of the primers, & add water to the desired final volume. I do this once per primer. And then I wrap it in bubble wrap, stick it in a 50mL Falcon tube, and send it off! A couple days later I get sequencing results as a chromatogram with a peak of a “different color” for each letter (the different colors are just the company’s way of overlaying the different fluorescence channels, so it’s not like “T” is actually red & “C” actually blue).
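
Backing up to the concentration and dilution arithmetic: here’s a minimal sketch. The conversion factor (an A260 of 1.0 at a 1 cm path length corresponds to about 50 ng/µL of double-stranded DNA) is the standard one. The target DNA amount, primer volume, and final volume in the example are hypothetical placeholders, so check what your sequencing company actually asks for.

```python
def dsdna_ng_per_ul(a260, path_cm=1.0):
    """Estimate dsDNA concentration from absorbance at 260 nm.

    Uses the standard factor of 50 ng/uL per absorbance unit at 1 cm.
    """
    return a260 * 50.0 / path_cm


def sequencing_mix(conc_ng_per_ul, target_ng, primer_ul, final_ul):
    """Return (DNA volume, water volume) in uL for one sequencing tube."""
    dna_ul = target_ng / conc_ng_per_ul
    water_ul = final_ul - dna_ul - primer_ul
    if water_ul < 0:
        raise ValueError("DNA is too dilute to fit in the final volume")
    return round(dna_ul, 2), round(water_ul, 2)


# Hypothetical example: A260 of 3.0, and a company that wants
# 600 ng of plasmid plus 1 uL of primer in 15 uL total
concentration = dsdna_ng_per_ul(3.0)
dna_ul, water_ul = sequencing_mix(concentration, target_ng=600,
                                  primer_ul=1, final_ul=15)
print(concentration, dna_ul, water_ul)
```

You’d repeat the mix once per primer, same as at the bench.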

In addition to this raw chromatogram data you get the corresponding sequence – at least how the computer reads it…

Sometimes if the traces are “messy” or you have a long string of the same letter it can miss one, call the wrong one etc – so be sure to look at the traces and not just the letters!

You can do this using a variety of different software programs – I use CLC Main Workbench, but in undergrad I used SnapGene and there are also programs like DNAStar, ApE (which is free), and 4 Peaks, which you can use to view traces but you have to do the alignment separately I think. Note: if you’re trying to semi-manually do an alignment with sequencing data alone (the called letters) try using a snippet from the middle, where the sequencing data is most reliable – if you include the ends you might not get a match. 
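
Here’s a rough sketch of that “use a snippet from the middle” trick, assuming an exact-match search is good enough. The function name, snippet length, and toy sequences are all made up for illustration, and real alignment software is far more forgiving of mismatches than this.

```python
def locate_read(reference, read, snippet_length=12):
    """Find where a read sits on the reference using its middle snippet.

    Returns the reference position that lines up with read position 0,
    or None if the middle snippet isn't found exactly.
    """
    snippet_start = len(read) // 2 - snippet_length // 2
    snippet = read[snippet_start:snippet_start + snippet_length]
    hit = reference.find(snippet)
    if hit == -1:
        return None
    return hit - snippet_start


# Hypothetical reference and a read with messy, unreliable ends
reference = "ATGGCTAAAGGTGAAGAATTGTTCACTGGTGTTGTCCCAATTTTGGTTGAATTAGATGGT"
read = "NNCN" + reference[10:50] + "NANN"

print(reference.find(read))          # whole read: no exact match
print(locate_read(reference, read))  # middle snippet: found
```

Searching with the whole read fails because of the junk at the ends, while the middle snippet finds its spot.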

You want to align the traces to your “reference sequence,” which is the sequence you want. The software should be able to show both the nucleotides (DNA letters) and the corresponding amino acids (protein letters). It’s able to do this because 3 nucleotides spell one amino acid, and these 3-letter groups (codons) don’t overlap. Therefore, as long as the program knows where to start reading (which reading frame) it can show you the protein letters (we call this the translation because the process of piecing together amino acids to make a protein is called translation). There’s no punctuation, so when you’re viewing, if you want to look at the translation (and have it make sense for your protein) you may need to tell the software which reading frame is right. It should give you the option to view all the reading frames (forward and reverse). I’m not going to try to give too much explainer about how to do it because it’ll differ between the software programs and I haven’t even used them all.
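
To show why the reading frame matters, here’s a small sketch that translates the same toy DNA in all 3 forward frames using the standard genetic code. The function name and the example sequence are made up, and it reads a stop codon as “*” and anything it can’t interpret as “X.” For the reverse frames you’d do the same thing on the reverse complement.

```python
# Standard genetic code, built from the usual compact TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate DNA starting at offset `frame` (0, 1, or 2).

    Leftover letters at the end that don't fill a codon are ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


toy_gene = "ATGGCTAAAGGTGAATAA"  # hypothetical
for frame in range(3):
    print(frame, translate(toy_gene, frame))
```

Only frame 0 gives the protein that makes sense here (starts with M, ends at a stop), and shifting by a single letter gives something completely different.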

Regardless of which you choose, be suspicious of the beginning and ends of the data where the signal’s weak and the base calling not too reliable. If there seems to be a mutation there, it might not really be one. Sometimes the software can’t decide so it displays an “N” or an “X” – this can happen because the signal’s too weak or if there are overlapping peaks because you have contaminating DNA in there. 

Speaking of contaminating DNA, sometimes you get back sequencing results and have no idea where the sequence comes from… sometimes the sequence comes from another location on the plasmid or the bacteria’s own DNA, but sometimes it doesn’t match anything that makes sense. In those cases, out of curiosity, I usually copy and paste some of the sequence and search BLAST (a free tool from NCBI). It should be able to find similar sequences and tell you what that sequence corresponds to. Sometimes it’s something one of your labmates is working on and sometimes it’s just a big mystery. 

The software should point out places where the sequence you put in doesn’t match the reference sequence. Go to each of those places, see if you agree, and see if the change makes a difference. There are often several different ways to spell an amino acid (multiple synonymous codons). For example, “GCT” and “GCC” both spell alanine (A). If the software says there’s a mutation but it doesn’t change the amino acid it spells IN THE CORRECT READING FRAME (i.e. it’s a synonymous mutation like GCT to GCC) you should still be okay if you only care about the protein. It might not even be a mutation, it could just be that the reference sequence is off, which I think is the case with this clone of mine. 
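
Here’s a sketch of that “does the change matter?” check. It assumes the read has already been lined up letter-for-letter with the reference (no insertions or deletions) and that you know the reading frame. The function name and the toy sequences are hypothetical.

```python
# Standard genetic code, built from the usual compact TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_differences(reference, read, frame=0):
    """List codons that differ between an aligned reference and read."""
    differences = []
    shared_length = min(len(reference), len(read))
    for i in range(frame, shared_length - 2, 3):
        ref_codon, read_codon = reference[i:i + 3], read[i:i + 3]
        if ref_codon == read_codon:
            continue
        ref_aa = CODON_TABLE.get(ref_codon, "X")
        read_aa = CODON_TABLE.get(read_codon, "X")
        if read_aa == "X":
            kind = "ambiguous base call"  # e.g. an N in the read
        elif ref_aa == read_aa:
            kind = "synonymous"
        else:
            kind = "changes amino acid"
        differences.append((i, ref_codon, read_codon, ref_aa, read_aa, kind))
    return differences


# Hypothetical reference and read with two differences
reference = "ATGGCTAAAGGTGAATAA"
read = "ATGGCCAAAGATGAATAA"
differences = classify_differences(reference, read)
for difference in differences:
    print(difference)
```

Anything flagged as ambiguous is a cue to go look at the trace itself rather than trust the called letter.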

Hopefully that was helpful and not too technical.

