Learning to read DNA! You’re probably familiar with the DNA double helix (thanks Rosalind Franklin!). That helix is 2 complementary DNA strands, and yesterday we saw how PCR uses an enzyme (reaction mediator/speed-upper) called DNA Polymerase (DNA Pol) to make lots of copies of specific stretches of DNA by using one strand as a template for making the other strand. So, we have a way to copy DNA, which is a huge deal because DNA holds genetic instructions for almost everything. But how do you read that DNA? The traditional DNA sequencing method is SANGER SEQUENCING (aka CHAIN TERMINATION METHOD). It may be “old-gen” but it’s really accurate and is still the go-to if you have specific regions of DNA you want to read the sequence of. I’m going to explain this method in the most detail, but I also want to touch on the so-called “next-gen” sequencing methods including the 3 main ones: Illumina, PacBio, & NanoPore.
I will go into more detail below on DNA & Sanger style, but I thought I’d start with an overall comparison of the methods. note: I’m not going to try to give exact numbers of maximum read lengths and accuracy percentages and stuff because they’re constantly changing – and genomics really isn’t my field, but people kept asking for this so…
DNA letters are called nucleotides (technically deoxynucleotides). There are 4 of them (A, T, C, & G) and they specifically base pair (A:T, G::C) so if you have one strand you can use it as a template to create the complementary strand, which can be used as a template for creating the “original” strand (so basically you just need to be able to read one of them). Sanger sequencing, Illumina, and PacBio read DNA as (or after) it gets written but with key differences, and NanoPore takes a wildly different approach – instead of making copies using labeled letters, it threads DNA through a pore and senses changes in electricity as the DNA passes through. Because the 4 DNA letters are slightly different, they produce different electricity changes, so the machine can tell what letter is going through the pore.
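If it helps to see the base-pairing rule as code, here’s a tiny Python sketch. The names (`PAIRS`, `complement`, `reverse_complement`) and the example sequence are made up for illustration. The one extra detail it assumes is that the two strands run in opposite directions, so the partner strand is usually written out reversed.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Letter-by-letter partner of each base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """The partner strand written in its own reading direction."""
    return complement(strand)[::-1]

template = "ATGCCGTT"  # hypothetical sequence
print(complement(template))          # TACGGCAA
print(reverse_complement(template))  # AACGGCAT
```

Running `reverse_complement` twice gets you back to where you started, which is the “if you know one strand you know the other” idea.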
Sanger: incorporates “chain terminator nucleotides” – these nucleotides are labeled, but they’re “dead ends” – so if they get incorporated, it’s the end of the line. By using small amounts of these labeled letters and larger amounts of normal letters, and then letting DNA Pol go to work you get a mix of fragments that all start at the same place but end in different places and you can read the last letter of each, so you can read out the sequence.
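To make the read-out idea concrete, here’s a minimal Python sketch of the idealized case: one terminated fragment for every position, sorted by length, with the last letter of each read off in order. The sequence and the function names (`terminated_fragments`, `read_from_fragments`) are invented for illustration, and real reactions are messier than this.

```python
def terminated_fragments(new_strand: str) -> list[str]:
    """Every possible fragment: same start, but ending at a different letter."""
    return [new_strand[:end] for end in range(1, len(new_strand) + 1)]

def read_from_fragments(fragments: list[str]) -> str:
    """Sort by length (the job the gel does) and read each fragment's last letter."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

strand = "GATTACA"  # hypothetical newly made strand
fragments = terminated_fragments(strand)
fragments.reverse()  # order in the tube doesn't matter; sorting by size recovers it
print(read_from_fragments(fragments))  # GATTACA
```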
You can get longer fragments by using lower levels of terminator, but you’re still limited to fairly short reads. If you want to get longer reads, you can use primers that start at different start points and then line up overlapping regions to figure out the bigger thing (keep this idea in mind for later). Sanger sequencing was used for the first whole-genome sequencing initiatives, but these days, if you want to sequence a whole genome, you typically turn to one of the “next-gen sequencing” (NGS) methods.
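Here’s a rough Python sketch of the “line up overlapping regions” idea for just two reads. It’s a toy under stated assumptions: the reads are error-free, they’re on the same strand, and the function names, the `min_overlap` parameter, and the sequences are all hypothetical. Real assembly software has to cope with errors and repeats.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two reads at their overlap; complain if they don't overlap enough."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        raise ValueError("reads do not overlap enough to merge")
    return left + right[size:]

read_1 = "ATGGCGTACGTTAG"  # hypothetical read from primer 1
read_2 = "CGTTAGCCTAAGGT"  # hypothetical read from primer 2, started further along
print(merge_reads(read_1, read_2))  # ATGGCGTACGTTAGCCTAAGGT
```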
Illumina: incorporates REVERSIBLE chain terminator nucleotides. When they get added, no more letters can get added *under the adding reaction conditions*. This allows the machine to dump in labeled letters -> let the one that matches the template strand get added -> wash away un-added letters -> read what letter was added (the letters are labeled with different fluorophores so they glow different colors) -> change the reaction conditions to remove the protective group from the labeled letter which was preventing further letters from being added. This thus “un-terminates” the chain and allows more letters to be added, so you can do this all again. And again. They call this “Sequencing by Synthesis” or SBS.
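The add -> wash -> read -> unblock cycle can be sketched as a loop. This is a toy Python model, not how the instrument’s software works: the color assignments in `COLORS`, the function name, and the template are all made up, and strand direction is ignored to keep it short.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hypothetical color assignment, just so each letter has a distinct label.
COLORS = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in COLORS.items()}

def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Toy loop: one blocked, labeled letter is added and read per cycle."""
    growing = []
    for position in range(min(cycles, len(template))):
        added = PAIRS[template[position]]  # only the matching letter gets added
        color_seen = COLORS[added]         # what the camera sees after the wash
        growing.append(COLOR_TO_BASE[color_seen])  # call the letter from its color
        # In the real chemistry the block is removed here; in this toy model
        # "unblocking" is simply moving on to the next loop iteration.
    return "".join(growing)

print(sequence_by_synthesis("TACGGA", cycles=4))  # ATGC
```

The number of cycles is what sets the read length in this picture: each extra letter costs one more full round.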
Might not sound that fancy, but if you try looking up how Illumina works you get these kinda crazily complicated diagrams – most of the fancy stuff is happening in the “prep” phase. Basically you start by making a lot of copies of the DNA you want to sequence. And this copying is done in lots and lots and lots and lots of sites on a chip using something called isothermal amplification (unlike PCR, this copying takes place at a single temperature).
You start by shearing the DNA into shorter pieces and adding adapters to the end that allow them to stick to the chip and then copies are made “on site” in some DNA gymnastics that are best explained in their video and which I’m not gonna try to get into, sorry! https://bit.ly/33qIBXW
Bottom line is you can get massively parallel high-throughput sequencing (you can read a lot of DNA at once). It’s still limited to short reads though and it’s more error-prone than Sanger – even though you’re not terminating each time, you still have to do all the copying and stuff and there are a lot of chances to mess up, especially as you try to push the length. In order to be able to sequence something really big, like a whole genome, you have to line up lots and lots of overlapping fragments. (With Sanger, you use primers to dictate where each read starts. Here, you randomly shear the DNA in the beginning and use adapters, which you can add to *any* sequence, so you generate lots of overlapping sequences.) So the computers have a lot of work to do once the sequencing’s done (and the scientists have a lot of work to do analyzing it and helping the computer out and stuff – basically it’s not as easy as it sounds because of things like repetitive sequences, where you have to figure out where they go).
You can reduce the “what goes where?” problem by using longer reads.
You can get longer (but not super long) reads from Pacific Biosciences (PacBio), which uses labeled DNA letters that give off fluorescent pulses when held by the DNA Pol (which is in the path of an excitation laser), so you can read out the sequence of fluorophores as they’re added. The fluorophore is attached to the end phosphate, which gets released when a letter gets added, so you don’t have to worry about the signal lingering around and you don’t have to pause after each letter like Illumina does. They call this Single Molecule Real Time Sequencing (SMRT). https://bit.ly/3bQxcEI
If you want *really long* reads, you can use Oxford NanoPore. As I mentioned above, it doesn’t “add” anything, it just looks to see what’s already there. But DNA is really tiny so instead of trying to visually read it they sense its vibes… They thread DNA through a pore and measure the electrical current flowing through the channel. That current changes differently depending on which DNA letters are in the pore, so they can read out the DNA sequence as it passes through.
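Here’s a deliberately oversimplified Python sketch of turning current measurements into letters. The current levels in `LEVELS` are invented numbers in arbitrary units, and the model assumes one clean measurement per letter. Real pores sense several letters at once, so real signals are much less tidy and need far fancier decoding than “pick the closest level”.

```python
# Made-up current levels (arbitrary units), one per letter.
LEVELS = {"A": 10.0, "C": 20.0, "G": 30.0, "T": 40.0}

def call_bases(currents: list[float]) -> str:
    """Assign each measurement to the letter whose level is closest."""
    return "".join(
        min(LEVELS, key=lambda base: abs(LEVELS[base] - current))
        for current in currents
    )

noisy_signal = [11.2, 38.9, 41.5, 19.4, 29.1, 9.0]  # hypothetical readings
print(call_bases(noisy_signal))  # ATTCGA
```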
Since you’re able to get long reads, it can be great for repetitive sequences and stuff.
Unlike Illumina, it doesn’t do any “pre-copying” (there’s no amplification). This might sound like a bad thing because you have less starting material. But it can actually be a good thing because it reduces bias that can occur if some DNA regions get copied more/better than others. They have a number of different devices ranging from tiny sequencers like the MinION to the big PromethION (there’s also the Flongle, the MinION Mk1C & the GridION). This is totally not an advertisement or endorsement or anything, just so that you can connect the dots if you see these names in papers.
NanoPore and PacBio are both more error-prone than Illumina, though. So one strategy that’s sometimes taken is to use a combination. Use NanoPore to figure out the general arrangement of things and then use Illumina to get shorter, more accurate pieces that you can fit onto that arrangement.
Illumina is more accurate than NanoPore and PacBio, but it’s still not as accurate as Sanger sequencing. And it’s not as cost-effective if you only have a single sample where you want to look at a single region, as opposed to trying to figure out an entire genome. So let’s take a closer look at Sanger sequencing.
The way NGS is talked about, you might think that Sanger sequencing is a thing of the past. But it’s definitely not! We use it ALL THE TIME! (or at least we mail samples to a company that uses it all the time (and can do it fast & cheap) – we used to use GenScript but they shut down (at least temporarily) due to the COVID-19 pandemic, so we switched to GENEWIZ). In fact, just yesterday I was analyzing sequencing results. I didn’t need a whole genome sequenced, I just wanted to check the sequence of a very specific part of a specific gene for the protein I was cloning. Molecular cloning is where we stick a gene into a vector (such as a circular piece of DNA called a plasmid) that we can stick into cells like bacteria cells to get them to use it.
We want to make sure that the sequence got into the plasmid ok, without any typos in the DNA sequence (which could cause typos in the resultant protein or even prevent it from being made altogether). So, before we try to get cells to express the protein, we put the plasmid we engineered into bacteria to make lots of copies of it, which we then purify out using alkaline lysis (“minipreps”). More here: http://bit.ly/3azLDMh
After that, you’re left with pure plasmid. And you want to check the sequence of the part you put in, which “we” do using Sanger sequencing. Yesterday’s sample was actually some detective-izing involving a case of “what the heck did our ex-colleague leave in the freezer?!” From the lab database I could see the name he’d given the protein construct (the modified version of the protein) (e.g. ProteinX_middle_deleted). But he hadn’t specified which part he actually deleted – I’m sure it was in his notes but I didn’t have those, so I sent it for sequencing and figured it out.
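Once the sequence comes back, figuring out “which part got deleted?” is basically a comparison against the full-length sequence. Here’s a minimal Python sketch of that comparison. It assumes the only difference is one continuous deletion and that the read is error-free. The function name and both sequences are hypothetical stand-ins, not the actual construct.

```python
def find_deletion(reference: str, sequenced: str) -> tuple[int, int]:
    """Return (start, end) so that reference[start:end] is missing from `sequenced`.

    Assumes the only difference is one continuous deletion.
    """
    missing = len(reference) - len(sequenced)
    if missing < 0:
        raise ValueError("sequenced read is longer than the reference")
    # Walk in from the left until the two sequences stop matching.
    start = 0
    while start < len(sequenced) and reference[start] == sequenced[start]:
        start += 1
    end = start + missing
    # Check that cutting out that stretch really reproduces the read.
    if reference[:start] + reference[end:] != sequenced:
        raise ValueError("difference is not a single clean deletion")
    return start, end

reference = "ATGGCTAGC" + "TTTGGA" + "CCCGGGTAA"  # hypothetical full-length gene
sequenced = "ATGGCTAGC" + "CCCGGGTAA"             # hypothetical read from the mystery tube
start, end = find_deletion(reference, sequenced)
print(start, end, reference[start:end])  # 9 15 TTTGGA
```

In practice you’d do this with an alignment tool rather than by hand, since real reads have noisy ends and can contain other differences too.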
All this detective work looks pretty boring from our lab’s end. We just wrap up tiny tubes with bubblewrap, stick them in a big tube, stick that in an envelope (too many “your tubes arrived damaged” emails…) and drop them in the outgoing mail box (too many crushed tubes…).
We do this a lot, as do lots of people in labs all around the world, but we don’t often stop to think about what really happens when it gets to the facility. To understand what happens when it gets to the sequencing people, let’s first review what it is we’re trying to read – what is DNA?
DNA stands for DeoxyriboNucleic Acid & it’s made up of long chains of “letters” called NUCLEOTIDES (nt), which usually pair with another strand to form double-stranded DNA (dsDNA). There are 4 DNA nt -> A, T, C, & G & they’re made up of 3 main parts – a deoxyribose sugar & phosphate(s) form the generic “backbone” part & then each letter has a unique “nitrogenous base” (“base”) which has 1 ring (the pyrimidines C & T) or 2 rings (the purines A & G). The different bases pair with specific other bases on other strands – A:T and G::C. So if you know the sequence of one strand you know the sequence of the other.
I like to picture them as tiny little cartoons where the sugar’s 5-sided ring forms the core body & various groups stick off of its arms & legs. The “right arm” (as in the right of your screen/paper) is the “1’” position (the ‘ is pronounced “prime”) & this is where the base attaches. The “left arm” (5’ position) is where the phosphate(s) link on. The 5’ position is actually more like an elbow because there’s a “linker” from the 4’ “shoulder” & the “left leg” (3’ position) has a hydroxyl (-OH) group.
Nucleotides link together left arm (5’ phosphate) to left leg (3’ OH) through PHOSPHODIESTER BONDS. You can link up as many as you want to get a chain, one end of which will have a free 5’ phosphate (the 5’ end) & the other end of which will have a free 3’ hydroxyl (the 3’ end).
DNA & RNA (RiboNucleic Acid) are different in that RNA has a right leg (2’ -OH) and a left leg (3’ -OH) but DNA only has a left leg (they actually both have 2 right legs and 2 left legs, but if the leg is just a “stub” (hydrogen) it doesn’t really do anything but take up a little space and satisfy the electrons carbon needs, so we don’t usually draw it). (The other difference between DNA & RNA is that RNA has a “U” instead of a T.)
sidenote: So if you see a carbon with fewer than 4 bonds, you just assume that there are hydrogens there. Also, if you see a “corner” without an element letter, you assume that there’s a carbon there. Carbons (with hydrogens as sorts of “placeholders”) form the skeleton of organic molecules (organic as in carbon-based, not “all-natural”), but it’s often the things they’re bonded to (functional groups) that do the exciting reacting stuff so we want to make them stand out more. So we’ll often draw the chemical structures of organic molecules with implied carbons and hydrogens, and just write in the C’s or H’s in places where they’re actually involved in what we’re interested in. This shorthand is really helpful, but it can also be confusing to people unfamiliar with it, so I hope this helps make biochem a bit more accessible.
So, back to the sequencing story -> it’s okay that DNA doesn’t have that right leg because it doesn’t need it to link to another letter (polymerize). But this linking DOES need the left leg – that’s where the incoming letter will latch on.
A molecule called DNA Polymerase (DNA Pol) facilitates this linkage. It acts like a train that can only travel on double-stranded track. So to travel on single-stranded track it first has to add the complementary nucleotide (the one that base pairs with it)(e.g. to travel past an A on the template strand it has to add a T to the growing strand)(so the product that’s being made is the complement to the template strand, but if you know one you know the other).
Because it can only travel on double-stranded track, you also have to provide a primer (short complementary sequence) for it to start from. In Polymerase Chain Reaction (PCR), you use 2 primers to define the “start” and “stop” of a region you want to make copies of. With SANGER SEQUENCING, you only use 1 primer – you just give it the “start” station and then you let it stop wherever it adds one of the defective tracks you give it and see how far it goes.
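The “you only give it a start station” idea can be sketched in Python like this. It’s a toy with no terminators in it yet: the function name and sequences are made up, and to keep it simple the template is written 3’ to 5’ from left to right, so the new strand comes out in the usual 5’ to 3’ direction.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_from_primer(template_3_to_5: str, primer: str) -> str:
    """Toy DNA Pol: start where the primer pairs, then add matching letters to the end."""
    # What a complete copy of the whole template would look like.
    paired = "".join(PAIRS[base] for base in template_3_to_5)
    start = paired.find(primer)
    if start == -1:
        raise ValueError("primer does not pair with this template")
    # The primer is the first stretch of the product; everything after is newly added.
    return paired[start:]

template = "GGCTACGGATTC"  # hypothetical template, written 3' to 5'
primer = "GATG"            # hypothetical primer, written 5' to 3'
print(extend_from_primer(template, primer))  # GATGCCTAAG
```

Nothing upstream of the primer gets copied, which is why the primer defines where your read begins.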
These “defective tracks” are DIdeoxynucleotides (ddNTPs), which don’t have a left leg to latch onto (they have a 3’ H instead of an -OH). So these defective NTs act as CHAIN TERMINATORS.
The basic premise of SANGER SEQUENCING is -> you give it mostly normal NTs (dNTPs) mixed in with some “defective” NTs (ddNTPs). DNA Pol will add normal NTs (dNTPs) normally, but when a terminator gets incorporated, nothing else can be added. So, depending on how many normal ones got added before the terminator, you’ll get pieces of different sizes.
You can run this on a urea-PAGE gel which separates them by their size by using the DNA’s negative charge to drive it through a gel towards a positive charge, with the gel mesh slowing bigger things down more along the way. Compared to agarose gels, urea-PAGE offers much higher resolution because you can make a tighter gel mesh (more here: http://bit.ly/2XsNzQg) -> can detect single-NT length differences – so you can tell XXX apart from XXXX, BUT you can’t tell what letters those X’s are (e.g. AAA and TTT look the same). So, you had to do 4 separate reactions, with each reaction only having terminator versions of a single letter.
You don’t want all of the letter to be terminator-y because then you’d never be able to get past the 1st instance of it, so you include ~100X less of the ddNTP than the dNTP (e.g. in the “A” reaction, for every 100 A’s you have in the mix, 99 will be dATP (normal) and 1 will be ddATP (terminating)) (and you’ll also have all normal dGTP, dTTP, & dCTP in there).
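To get a feel for what that 99:1 mix does, here’s a back-of-the-envelope Python sketch. It assumes DNA Pol picks between the normal and terminator version purely based on how much of each is around (real polymerases can be pickier), and the function names are made up.

```python
def chance_of_stopping_at(nth_a: int, terminator_fraction: float = 0.01) -> float:
    """Chance a chain gets past the first n-1 A's and terminates at the nth."""
    keep_going = 1.0 - terminator_fraction
    return terminator_fraction * keep_going ** (nth_a - 1)

def chance_of_reaching(nth_a: int, terminator_fraction: float = 0.01) -> float:
    """Chance a chain is still growing after passing n A's."""
    return (1.0 - terminator_fraction) ** nth_a

for n in (1, 10, 100):
    print(n, chance_of_stopping_at(n), chance_of_reaching(n))
```

With 1% terminator, only about a third of chains are still going after 100 A’s, so far-away positions are represented by fewer and fewer fragments. Lowering `terminator_fraction` shifts things toward longer pieces.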
Then, technology advanced, allowing for DYE-TERMINATOR SEQUENCING -> scientists began using fluorescently-labeled nucleotides. Fluorescent molecules absorb light at one wavelength (excitation wavelength) and release it at a different wavelength (emission wavelength). Different wavelengths have different colors, so if you use fluorophores that have different emission wavelengths, you can tell them apart.
You can label the different terminators (ddATP, ddTTP, ddGTP, ddCTP) with different fluorophores and add all 4 at once. The fluorophores are added to the base, in a position that doesn’t interfere with the base pairing.
To make things even easier, you can use CAPILLARY GEL ELECTROPHORESIS. Instead of running it through a “slab gel”, you run it through a vertical tube of gel. And as it runs through it gets “scanned” by a laser.
The light from the laser is at the fluorophore’s excitation wavelength so it excites the fluorophore, which then emits light at a different wavelength (its emission wavelength), which gets recorded by a detector as peaks of fluorescence intensity at each wavelength, drawn on a CHROMATOGRAM. Because the different ddNTPs have different fluorophores and give off light with different wavelengths, the detector can tell them apart.
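Reading letters off those peaks can be sketched in Python as “at each peak, which color is brightest?”. The intensity numbers below are invented, and the sketch assumes the peaks have already been found and lined up. Real base-calling software also has to deal with overlapping peaks and noisy ends.

```python
# One fluorescence reading per dye channel at each peak position
# (arbitrary units, hypothetical numbers).
trace = {
    "A": [90, 5, 3, 8, 85],
    "C": [4, 88, 6, 2, 7],
    "G": [6, 3, 92, 4, 5],
    "T": [2, 7, 4, 95, 9],
}

def call_peaks(trace: dict[str, list[int]]) -> str:
    """At each peak position, call the letter whose channel is brightest."""
    n_peaks = len(next(iter(trace.values())))
    return "".join(
        max(trace, key=lambda base: trace[base][i]) for i in range(n_peaks)
    )

print(call_peaks(trace))  # ACGTA
```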
Sanger sequencing is kinda like the “gold standard” in terms of accuracy (which is really important in our case), but it’s expensive per letter read (relatively speaking). It’s really cheap if you only have like one reaction (~$5), and if you only have a low number of “targets” it’s still the most cost-effective way to go. In addition to our using it in the lab, doctors can use it for things like sequencing specific genes from their patients if they have a disease known to be caused by a mutation in that gene and they want to figure out what the exact mutation is.
But if you wanted to sequence an entire genome (which you’d first have to break up into lots of shorter pieces you’d later “stitch together” computationally) it’d be really expensive. So for big projects, things have switched to those massively-parallel “next-gen sequencing” methods where you have lots of reactions happening at the same time, usually on a chip, with really tiny volumes.
note: In addition to Whole Genome Sequencing (WGS), there’s something called whole exome sequencing, which only sequences the protein-coding parts of genes (the exons).
more on DNA polymerization: http://bit.ly/2TFdQN9
more on PCR: http://bit.ly/2FiBXsl
more on peptide bonds: http://bit.ly/2lVQsuJ
more on Sanger’s insulin protein sequencing: http://bit.ly/2OkqE6T
more on topics mentioned (& others) #365DaysOfScience All (with topics listed) 👉 http://bit.ly/2OllAB0