How do you spell “insulin”? Hint: it starts with a G and an F! Before you go thinking WTF?!, let me clarify – I’m talking about “spelling” in terms of the order of amino acid “letters” in the 2 peptide chains of the protein hormone insulin, the first protein to be sequenced. The winner of the protein spelling bee? Frederick Sanger – who, with colleagues in the early 1950s, correctly spelled insulin (and developed crucial protein-sequencing techniques along the way).
note: originally posted February 2020 but updated & added vid 11/3/21
When you hear of “Sanger sequencing” if anything comes to mind it’s likely to be the dideoxynucleotide (ddNTP) method of sequencing DNA by making copies of it while including “terminator” DNA letters that get added but serve as dead ends that can be identified. More on that here: http://bit.ly/DNAsequencingmethods
But before Sanger developed his RNA and then DNA sequencing methods (Nobel #2) he developed protein sequencing methods which he used to determine the order of amino acids in the hormone insulin (Nobel #1). Amino acids are “protein letters” – there are 20 common ones, and different proteins have different numbers of them arranged in different ways (based on the sequence of DNA in their genetic “recipes”). Each letter has a generic part that allows it to link with others, through peptide bonds, into chains, and a unique part (a side chain or “R group”) that sticks off like a charm in a charm bracelet & influences how the peptide chain folds up to make a functional protein.
So sequence matters. And Sanger wanted to sequence a protein. He knew there would be a lot of trial and error, a lot of loss of yield along the way, and a lot of room for impurities to mess with his results and interpretation of his experiments. So he wanted to choose a protein that he could get lots of relatively cheaply and with high purity. Oh – and if it was a protein that had some broader relevance that people cared about, that would be great too!
Insulin fit the bill – insulin is a hormone (chemical messenger) that gets released from the pancreas when blood sugar is high (like after you’ve eaten a carb-rich meal). It binds to receptors on cells and those receptors pass the message on into the cell to let in glucose. So glucose enters & gets put to use and blood sugar goes back down to normal. Except that is, in the case of diabetes. Patients with diabetes either don’t make enough insulin (type I) or have receptors that have been desensitized to it so they don’t respond readily (type II) so blood sugar remains high. more here: http://bit.ly/insulindiabetes
Especially for patients with type I diabetes (T1D), a mainstay of treatment is injection of insulin (since their receptors are fine, it’s just the insulin-making that’s defective). Once doctors figured out the life-saving powers of insulin, they started producing lots of it (first by purifying it from cows & pigs, and then later, making it “recombinantly” – sticking the gene for insulin into bacteria and having the bacteria make it for them). More here: http://bit.ly/bacoverexpression
During Sanger’s sequencing time, they were still in the animal-produced-insulin phase, and Sanger readily got his hands on what he called “cattle insulin” (more frequently referred to as bovine insulin these days).
Turned out the choice of protein was fortuitous in one sense – it turned out to be really small – but unlucky in some other ways – the mature form of insulin is made up of 2 chains (the A chain & B chain) that are connected through interchain disulfide crosslinks (more explanation forthcoming). So even though insulin is small (only 51 letters instead of the several hundreds many other proteins have), it took them several years to finish the job, with different pieces of the puzzle getting published along the way. The basic history is:
1945: figure out the “starting ends”
1949: figure out the beginnings (first few letters)
1951: complete sequencing of the 30-amino-acid-long “phenylalanyl chain” (B chain)
1953: complete sequencing of the 21-amino-acid-long “glycyl chain” (A chain)
1955: figure out the crosslinks between the chains
1958: win first Nobel Prize
1980: win second Nobel Prize (for nucleic acid sequencing)
Each step came with its own challenges and required Sanger and his colleagues to be innovative (and very patient!) The general scheme was to use different methods (e.g. partial acid hydrolysis & enzymatic digestion) to break up the chains into smaller, overlapping, peptide pieces – separate and isolate those pieces based on charge, solubility, etc. Figure out what their first letter was, and then break them up into individual letters using complete acid hydrolysis. Then separate those letters & compare the letters in the peptides to one another to determine where they overlapped and in that way piece together the whole sequence.
But before we get into the details, to understand what they were faced with and how they were able to carry it out, it’s important to have a basic understanding of what “proteins” and “amino acids” are, so a quick review: Amino acids are protein “letters” that have a generic part that allows them to link together, through peptide bonds, into chains of amino acids called peptides. Peptides can be short or long and when they’re really long and fold up into a nice functional 3D shape we call them proteins.
The reason proteins fold up and the manner in which they do so is largely determined by what letters are in the protein in what order – because in addition to that generic part, each letter has a unique “side chain” or “R group.” There are 20 common amino acids with side chains running from small and flexible (like glycine (Gly, G), whose side chain is just a hydrogen) to big & bulky (like phenylalanine (Phe, F)), and from hydrophilic (water-loving) to hydrophobic (water-avoiding). Proteins fold up in order to best accommodate each letter’s desires, so the sequence of amino acids matters for the final structure and is, in fact, considered the “primary structure” of a protein. Bonds/attractions between the backbone atoms lead to “secondary structure” (alpha helices, beta strands, etc.) and bonds/attractions between side chain atoms lead to “tertiary” & “quaternary” structure (tertiary is between atoms of the same chain, whereas quaternary refers to interactions between different chains (of the same or different protein)). More here: https://bit.ly/proteinstructure
Dorothy Crowfoot Hodgkin later solved the 3D structure of insulin, and her groundbreaking work was aided greatly by Sanger figuring out its primary structure (sequence of amino acids), so how did he do it?
Things always make a lot more sense with hindsight, so, to help you interpret things, I’m gonna give you some “spoiler alerts” that Sanger didn’t have…
Well first some stuff he did know. The “generic part” of amino acids consists of an amino group (-NH₂ or -NH₃⁺ depending on the pH) and a carboxyl group (-(C=O)-OH in the carboxylic acid form and -(C=O)-O⁻ in the carboxylate form) attached to the same central “alpha” carbon. This is also the same carbon that is attached to the unique side chains (aka R groups) that make different protein letters different. When amino acids link into chains through peptide bonds, they join the carbonyl carbon of one letter to the amino group N of the next, so the only free “alpha” amino group is at the first letter in the chain (which we call the N-terminus) and the only free “alpha” carboxyl group is at the last letter in the chain (which we call the C-terminus). Each chain only has 1 N-terminus and 1 C-terminus, but “proteins” can sometimes consist of multiple chains, which can be made separately or come from cutting a longer chain up.
The latter is the case with insulin – he didn’t know this yet, but it gets made as a longer chain called preproinsulin that gets cleaved in 2 places to give you mature insulin which is made up of 2 stuck together smaller chains in each “monomer” of insulin. The first 24 amino acids form a “signal peptide” – As the protein gets made, these come out of the ribosomal tunnel first and signal to the cell that this protein is destined for secretion (getting shipped out). For such secreted proteins, processing usually takes place in a special compartment in the cell called the endoplasmic reticulum (ER). So the ribosome sends the finished chain in there. And then, since it’s no longer needed, the signaling peptide gets cleaved off, leaving you with proinsulin.
This proinsulin then folds up, and gets cleaved again to give you 2 chains, A (21 amino acids) & B (30 amino acids) (both from that original chain). These chains stay stuck to one another because they have 2 key disulfide crosslinks. Unlike most side chain interactions, disulfide bonds, which can form between cysteine residues (e.g. protein-SH + HS-protein -> protein-S-S-protein), are covalent bonds. So they’re strong. And keep the strands stuck together even though their backbone’s broken. So each mature insulin “monomer” is 51 amino acids in 2 chains from 1 original chain.
These monomers are the active form, but they can also dimerize and even hexamerize and this caused some confusion about how big insulin actually was. When Sanger and friends started out (and for the first several years of their work even) they didn’t even know the true molecular weight of insulin (they thought it was ~12,000 Da (12 kDa) while it’s actually closer to 6 – so don’t let that confuse you if you go to read their papers). Speaking of those papers, let’s dive in!
Let’s start at the very beginning… It’s a very good place to start… When you count you begin with 1, 2, 3, when you sequence insulin you begin with DNP! Actually you start with FDNB. FDNB stands for 1-Fluoro-2,4-DiNitroBenzene. Benzene is that six-sided resonance-stabilized ring you see a lot in biochemistry & o-chem – long story short, atoms join together by sharing pairs of electrons and with resonance stabilization multiple atoms share “extra” electrons evenly amongst the group. More here: http://bit.ly/phenylalaninearomatic
But for now just know that this ring system makes it easier to absorb visible light, and it changes solubility so that whatever it’s attached to becomes more soluble in “organic” solvents like ether. So, if you attach the benzene derivative DNB, which has 2 nitro (NO₂) groups sticking off of it, to something (like an amino acid), you get DNP-something, which will appear yellow and which you can “extract” into things like ether that non-derivatized amino acids avoid. (If you’re wondering how that B became a P, P stands for “phenyl” and it’s the name given to benzene attached to something.)
This would give Sanger a way to label peptides & amino acids and visually track them with “real” chromatography (more on this in a minute). But how to convince the DNB to stick on? In addition to those 2 nitro groups, FDNB has a fluorine atom. And that fluorine is relatively easy to convince to leave, so FDNB can swap out the F for another nitrogen (leaving as hydrofluoric acid, HF). And it can find such a nitrogen in the amino groups of proteins. It can only do this at “end” amino groups – like the free N-termini of proteins, or the end amino group in lysine’s side chain – so it will only attach to a protein at the N-termini & lysines.
So if you label a protein with FDNB and then use acid to split the protein up into individual letters and separate those letters, you can isolate the yellow ones and, based on how quickly they travel through various solvents compared to known standards (synthesized versions of the DNP derivatives of those letters), figure out what they are – if it’s not a lysine you’ve located an N-terminal residue. If it is a lysine, it could be anywhere in the protein, but if it’s a “middle” one it’ll only have a single DNP (and different chromatographic properties).
When Sanger did this, he isolated DNP derivatives of glycine (Gly, G) & phenylalanine (Phe, F) & a “middle” lysine. This told him he had 2 peptide chains. Note: in 1935, Jensen & Evans had figured out that there was an N-terminal Phe (using a different method) – and in his Nobel lecture paper (which I highly suggest reading) Sanger credits them with the first discovery of the position of an amino acid in a protein http://bit.ly/391JVBI
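To make that labeling logic concrete, here is a minimal Python sketch of which yellow (DNP-tagged) letters you would expect after labeling and complete hydrolysis. The one-letter sequences are the commonly listed bovine insulin A and B chains, which is my assumption (Sanger obviously didn't have them yet), and `dnp_products` is a hypothetical helper name, not anything from his papers.

```python
# Assumed bovine insulin chains in one-letter code (not given in the text above).
A_CHAIN = "GIVEQCCASVCSLYQLENYCN"
B_CHAIN = "FVNQHLCGSHLVEALYLVCGERGFFYTPKA"

def dnp_products(chains):
    """Return the DNP-labeled letters expected after FDNB labeling + full hydrolysis."""
    products = set()
    for chain in chains:
        first = chain[0]
        if first == "K":
            # An N-terminal lysine would carry DNP on both of its amino groups.
            products.add("di-DNP-K")
        else:
            # The free alpha-amino group of the first letter gets tagged.
            products.add(f"DNP-{first}")
        if "K" in chain[1:]:
            # A "middle" lysine is tagged only on its side-chain amino group.
            products.add("epsilon-DNP-K")
    return products

labeled = dnp_products([A_CHAIN, B_CHAIN])
n_terminal = sorted(p for p in labeled if p.startswith("DNP-"))
```

Two different N-terminal letters in `n_terminal` is the clue that there are (at least) two chains.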
Now Sanger had to figure out what letters were in the chains and in what order.
He started by using partial acid hydrolysis, which “randomly” cleaves peptide bonds. This would give him a random assortment of pieces, good for generating overlapping peptides. But he had to time things carefully – if he let it go too long he’d get pieces that were too short to be of any help – but if he didn’t let it go long enough he’d get pieces that were too long to have much usefulness since he could only figure out which letter was first (by DNP labeling) and what other letters were in the pieces, not what order they were in.
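Here is a rough sketch of why long pieces weren't much use: if all you know is the first letter plus the overall composition, the number of spellings that fit grows very quickly with length. The function name is hypothetical and the counting is just distinct arrangements of the remaining letters.

```python
from collections import Counter
from math import factorial

def possible_orders(peptide):
    """Count spellings consistent with a known first letter and known composition."""
    rest = Counter(peptide[1:])          # everything after the DNP-identified first letter
    count = factorial(sum(rest.values()))
    for copies in rest.values():
        count //= factorial(copies)      # identical letters are interchangeable
    return count

dipeptide = possible_orders("FV")        # first letter known, one letter left
tetrapeptide = possible_orders("FVDE")
pentapeptide = possible_orders("GIVEE")  # two identical E's reduce the count
```

A dipeptide is fully determined, which is part of why lots of short overlapping pieces were so valuable.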
How’d he do that? Lots of separating in different ways including fractionation via “extraction” where some peptides dissolve in one solvent while others dissolve in another solvent, so you give them a choice (mix both solvents) to kinda bulk separate them. He could also use charcoal to adsorb some of the aromatic amino acid containing ones.
This winnowed down the number of overlapping spots or bands he’d end up with when he used his more sensitive methods – a combination of electrophoretic and chromatographic techniques to separate them based on things like charge (in the case of the electrophoresis), size, & solubility in various solvents. You might remember some of these techniques from the “peptide fingerprinting” Ingram developed to identify the amino acid swap in the protein hemoglobin that causes sickle cell anemia http://bit.ly/paulingingram
Because some amino acids are charged (positively or negatively) at some pHs, and different peptides have different numbers and combos of these charged letters, peptides can be separated based on their charge at different pHs using a technique called electrophoresis (Sanger refers to it as ionophoresis), which uses an electric field to motivate peptides to move (e.g. across a wet paper or through a gel) towards oppositely-charged electrodes. But a lot of the pieces had similar (or no) charges and thus didn’t separate well with electrophoresis. So he also used silica gel column chromatography and paper chromatography.
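As a very crude sketch of the charge idea, you can count positively and negatively charged side chains near neutral pH and see which peptides would end up traveling together. This ignores histidine, the chain ends (which roughly cancel), and real pKa values, and the peptide strings are hypothetical examples chosen for illustration.

```python
POSITIVE = set("KR")   # lysine, arginine: positive side chains near neutral pH
NEGATIVE = set("DE")   # aspartate, glutamate: negative side chains near neutral pH

def crude_net_charge(peptide):
    """Very rough net side-chain charge; not a real isoelectric calculation."""
    plus = sum(residue in POSITIVE for residue in peptide)
    minus = sum(residue in NEGATIVE for residue in peptide)
    return plus - minus

peptides = ["TPKA", "LVEALY", "GFFY", "SHLV"]  # hypothetical fragments

# Group peptides that this crude model says would migrate together.
by_charge = {}
for peptide in peptides:
    by_charge.setdefault(crude_net_charge(peptide), []).append(peptide)
```

Any group in `by_charge` with more than one member is a band that needs a second method to pull apart.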
Whereas electrophoresis exploits differences in molecules’ charge, chromatography exploits differences in a molecule’s desire to hang out with a movable phase (like a liquid solvent wicking through a piece of paper) compared to a stationary phase (like the fibers in that piece of paper). The more a molecule likes the liquid, the faster/further it will travel with the liquid without getting sidetracked by the paper. Molecules can have the same charge but different solubilities in different solvents, and thus, if you take that band of peptides that ran together in electrophoresis and subject them to chromatography, you can get them to separate.
And you can even run chromatographs in various solvents to separate the spots even further – Sanger used a lot of “2D paper chromatography” where he ran the peptides in one solvent, then turned the paper 90 degrees and ran in another solvent. Then he could compare them to standards to see what letter they are and estimate how many of that letter there are based on the intensity of the spot when dyed with ninhydrin: http://bit.ly/ninhydrin
I called the acid hydrolysis “random” but its cut spots are actually biased because some bonds are easier to break up than others – in particular, serine (Ser, S) & threonine (Thr, T) are split-happy and would get split even with short hydrolysis times.
So, with acid alone, he was only able to figure out the first 4-5 letters of the chains: Phe.Val.Asp.Glu & Gly.Ileu.Val.Glu.Glu, as well as an internal lysine-containing peptide, Thr.Pro.Lys.Ala.
It became clear that he’d need to find another technique to help him figure out the rest. So he, ignoring the warnings of some colleagues, turned to endoproteases, or as I like to refer to them, peptide/protein scissors. Endoproteases are enzymes (biochemical reaction mediators/speed-uppers) that cleave peptide bonds at specific locations – like next to the amino acids lysine (Lys, K) & arginine (Arg, R) in the case of the endoprotease trypsin, and next to aromatic amino acids like Tyrosine (Tyr, Y), Phenylalanine (Phe, F), and Tryptophan (Trp, W) in the case of the endoprotease chymotrypsin. Pepsin’s less predictable, but prefers to cleave next to aromatic amino acids & leucine. The cuts wouldn’t be random, but by using multiple enzymes and cleaving for various lengths of time he could still generate overlapping pieces (and even get some sequence info based on knowing the cleavage preferences).
He had been warned against using them because, as enzymes catalyze (speed up) reactions in both directions, some scientists were worried that proteases could piece together pieces in addition to cutting them, which could lead to new-sequence-making. But turns out this isn’t a problem because the cut products are much happier apart (it takes energy and a lot of help to stick them back together) – and acid causes many more artifacts!
So, by using a combination of endoproteases (trypsin, chymotrypsin, & pepsin) they were able to figure out the complete sequences of the chains (the B chain in 1951 & the A chain in 1953).
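Here is a toy Python sketch of the cut-with-different-scissors-then-overlap idea. It is a big simplification: the proteases are idealized (they cut after every target letter, every time), each piece arrives already spelled out (Sanger had to work that out piece by piece), and the B-chain string is the commonly listed bovine sequence, which I'm assuming rather than taking from the text above. The function names and the greedy merging strategy are mine, not Sanger's.

```python
B_CHAIN = "FVNQHLCGSHLVEALYLVCGERGFFYTPKA"  # assumed bovine insulin B chain

def digest(sequence, cut_after):
    """Idealized protease: cut after every letter found in cut_after."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence[:-1]):  # never cut after the last letter
        if residue in cut_after:
            pieces.append(sequence[start:i + 1])
            start = i + 1
    pieces.append(sequence[start:])
    return pieces

def overlap(a, b, min_len=2):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(fragments, min_len=2):
    """Greedy merge: repeatedly join the two pieces with the longest overlap."""
    # Drop pieces that sit entirely inside another piece, then drop duplicates.
    frags = [f for f in fragments if not any(f != g and f in g for g in fragments)]
    frags = list(dict.fromkeys(frags))
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing left that overlaps
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags

tryptic = digest(B_CHAIN, "KR")         # trypsin-like: after Lys/Arg
chymotryptic = digest(B_CHAIN, "FYW")   # chymotrypsin-like: after Phe/Tyr/Trp
contigs = assemble(tryptic + chymotryptic)
```

Neither digest alone tells you how its pieces connect, but because the two sets of cuts fall in different places, the pieces from one bridge the gaps in the other. A 2-letter overlap is flimsy evidence in real life, which is one reason varied digestion times and a third enzyme helped.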
But speaking of those acid artifacts – this isn’t a major part of things, and it isn’t emphasized a lot, but it confused me when I was making the figures and seeing “Glu” instead of “Gln” and “Asp” instead of “Asn” – in addition to cutting apart amino acids, acid can cut off the amide groups of glutamine & asparagine as ammonia. But proteases can’t do that. So by comparing peptides derived from acid vs enzymatic cleavage they were able to figure out where they were dealing with glutamate or glutamine, aspartate or asparagine. But he indicated glutamine (which we normally abbreviate Gln) as Glu-NH₂. So hope that doesn’t confuse you too!
Back from the super geeky segue – one of the more emphasized problems they had to overcome – after figuring out the sequences of the individual chains, they still had to figure out how the chains linked together. They knew there were 3 cys-cys crosslinks. And only 2 chains, so one of the cross-links was within a chain. The other 2 were cross-chain, but they didn’t know what crossed with what. And to further complicate things, those crosslinks had a tendency to rearrange themselves (i.e. switch partners) under the acid hydrolysis conditions they were using. Proteases wouldn’t pose that problem, but one of the crosslinks involved a Cys-Cys sequence that none of their proteases could cut between. So they had to figure out the right conditions to do acid hydrolysis without the interchange reaction. Then they were able to cut up the insulin while the chains were still together. And they isolated the pieces (which included still-crosslinked pieces) and then further oxidized the cysteines, converting each protein-S-S-protein into protein-SO₃⁻ + ⁻O₃S-protein. And then sequenced these free pieces to figure out what was paired with what.
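A tiny sketch of that acid artifact in one-letter code: acid turns N (Asn) into D (Asp) and Q (Gln) into E (Glu), so the acid-derived spelling of the B chain's start reads Phe.Val.Asp.Glu. The letter-by-letter comparison below is a toy stand-in (the real work relied on the chemical properties of the peptides, not on reading off letters), and the amidated spelling "FVNQ" is taken from the commonly listed bovine sequence, which is an assumption.

```python
def acid_deamidate(peptide):
    """Mimic acid knocking the amide off Asn (N -> D) and Gln (Q -> E)."""
    return peptide.replace("N", "D").replace("Q", "E")

def amide_positions(acid_form, enzyme_form):
    """1-based positions where the gentler enzymatic route preserved an amide."""
    return [
        i + 1
        for i, (acid_letter, enzyme_letter) in enumerate(zip(acid_form, enzyme_form))
        if acid_letter != enzyme_letter
    ]

acid_view = acid_deamidate("FVNQ")                 # what acid hydrolysis reports
amides = amide_positions(acid_view, "FVNQ")        # where Asp/Glu were really Asn/Gln
```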
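To get a feel for the size of that pairing puzzle, here is a small sketch that enumerates every way 6 cysteines could pair up into 3 crosslinks, then keeps only the patterns with exactly 2 cross-chain links. The chain sequences are the commonly listed bovine ones (my assumption), and the helper names are hypothetical.

```python
A_CHAIN = "GIVEQCCASVCSLYQLENYCN"             # assumed bovine A chain
B_CHAIN = "FVNQHLCGSHLVEALYLVCGERGFFYTPKA"    # assumed bovine B chain

def cysteines(name, chain):
    """List (chain name, 1-based position) for every cysteine in a chain."""
    return [(name, i + 1) for i, residue in enumerate(chain) if residue == "C"]

def pairings(items):
    """Yield every way to split items into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

cys = cysteines("A", A_CHAIN) + cysteines("B", B_CHAIN)
all_patterns = list(pairings(cys))

# Keep patterns with exactly two links that join an A-chain Cys to a B-chain Cys.
candidates = [
    pattern for pattern in all_patterns
    if sum(a[0] != b[0] for a, b in pattern) == 2
]
```

Even after using what they knew about the number of cross-chain links, a dozen patterns remain on paper, so only isolating still-crosslinked pieces could settle it.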
And then he won his first Nobel prize. Here are a couple of my favorite quotes from his lecture:
“Examination of the sequences of the two chains reveals no evidence of periodicity of any kind nor does there seem to be any basic principle which determines the arrangement of the residues. They seem to be put together in a random order, but nevertheless a unique and most significant order, since on it must depend the important physiological action of the hormone.”
“The determination of the structure of insulin clearly opens up the way to similar studies on other proteins and already such studies are going on in a number of laboratories. These studies are aimed at determining the exact chemical structure of the many proteins that go to make up living matter and hence at understanding how these proteins perform their specific functions on which the processes of Life depend. One may also hope that studies on proteins may reveal changes that take place in disease and that our efforts may be of more practical use to humanity.”
and here are links to some of those key papers:
1945: “The free amino groups of insulin” http://bit.ly/2RSsXA9
1949: “The terminal peptides of insulin” http://bit.ly/3b4yw69
1951 (With H. Tuppy): “The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates” http://bit.ly/31i1vyR
1953: (With E. O. Thompson): “The amino-acid sequence in the glycyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates” http://bit.ly/394Ihj1
1955 (With E. O. Thompson & R. Kitai): “The amide groups of insulin” http://bit.ly/2GLvW7h
1955: “The disulphide bonds of insulin” http://bit.ly/2OiPowk
more on Dorothy Crowfoot Hodgkin’s solving the 3D structure of insulin: http://bit.ly/dorothycrowfoothodgkin
more on some topics mentioned (and others) #365DaysOfScience All (with topics listed) 👉 http://bit.ly/2OllAB0