Happy 50th birthday to the PDB! 50 years since its founding, the Protein Data Bank continues to be *the* place to go if you want to protein-see! When it was launched in 1971, it only had 7 structures in it. Now, there are over 175,000! And they’re all freely accessible to anyone around the world through the PDB’s websites: RCSB PDB (based out of the US); PDBe (based out of Europe); and PDBj (based out of Japan).
If you’ve ever read an article discussing the structure of a protein (what it “looks like” at the atomic scale) or seen a picture of a protein model, you might have seen something like: PDB ID 2hhb, 1hho, or 6vsb. Each of those “accession codes” is a specific name for a structure. 2hhb is a structure of deoxyhemoglobin; 1hho is a structure of oxyhemoglobin; and 6vsb is one of the many structures of the coronavirus Spike protein (the one that juts off from the viral membrane and docks onto our cells). Yeah, with >175,000 structures, they had to move away from codes that “made sense” to random codes instead… Sometimes, you’ll see the codes written in uppercase, but that makes it hard to tell apart “O” and “0,” capital i and lowercase l, etc., so people are moving towards consistent use of lowercase. Speaking from experience as someone who has tried to look up structures and gotten totally unrelated ones because that “O” was really a “0,” I support the movement!
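If you handle a lot of these codes, a tiny helper can enforce the lowercase habit for you. This is just a minimal sketch: the function name is made up for illustration, and it assumes the classic four-character form of a PDB ID (a digit from 1 to 9 followed by three letters or digits).

```python
import re

# Classic PDB IDs are four characters: a digit 1-9 followed by three
# letters or digits (for example 2hhb, 1hho, 6vsb).
PDB_ID_PATTERN = re.compile(r"[1-9][a-z0-9]{3}")


def normalize_pdb_id(raw: str) -> str:
    """Return the lowercase form of a four-character PDB ID, or raise ValueError."""
    candidate = raw.strip().lower()
    if not PDB_ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"{raw!r} does not look like a four-character PDB ID")
    return candidate


# A leading letter "O" typed instead of a zero-like digit is caught,
# because the first character has to be a digit.
for code in ["2HHB", " 1hho ", "6VSB", "6KI6"]:
    print(code.strip(), "->", normalize_pdb_id(code))
```

Note that this can only catch mix-ups in the first position. An “O” swapped for a “0” later in the code still gives a valid-looking ID, so it’s worth double-checking that the structure you land on is the one you expected.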
These codes are more than just nicknames. In one sense they’re more like the structural biology equivalent of citing the photographer. Except they’re way cooler because they go way beyond just giving credit where credit’s due!
If you search the PDB for that name, it’ll pop right up and let you explore it. In 3D! You can actually play around with rotating it, coloring it different ways, etc. instead of just looking at a static snapshot. And you can find out more information about it as well as do a lot more with it, especially if you’re in the field and you know what you’re doing. So, basically the PDB is awesome for everyone from casual protein admirers to hard-core structural biologists.
Let me step back a sec and explain what “structural biology” is, because I didn’t even hear the term until college, but now that I’m in the field I can sometimes forget that most people aren’t and might not know it. More on structural biology here: http://bit.ly/cryoemxray but basically, it’s the sub-field of biology that deals with trying to figure out what macromolecules (things like proteins, DNA, RNA, and mix-and-matched complexes of those components) look like (their form) and how that form fits with what they do (their functions).
To get an idea of why this might be relevant, think of how a spoon is good for scooping ice cream whereas a knife is good for cutting cake. Macro means large, but these molecules are only large in comparison to other molecules. Compared to us, they’re tiny! So tiny that they’re invisible to our eyes, and even to our conventional microscopes. Therefore, we have to use fancy-dancy techniques like X-ray crystallography, cryo-electron microscopy (cryo-EM), and nuclear magnetic resonance (NMR) to visualize them.
The details are complicated and vary from technique to technique, but a key thing to know about all of them is that they don’t give you the actual atomic positions of each of the atoms that make up the molecule (individual carbons, hydrogens, oxygens, nitrogens, etc.). Instead, they just give evidence of whereabouts those atoms are, and then you have to use math-y stuff (or at least the computer does) to generate a “model” of the atomic structure, placing atoms to fit the evidence. When someone does this, we say they’ve “solved the structure” of that protein or complex or whatever it was.
If they then want to publish a paper talking about that structure, they have to deposit the “coordinates” for it in the PDB. These coordinates are like addresses for each of the atoms in the model – plus some extra info like how confident the depositors are in the position. Basically, the structure-solvers have to upload enough data that everyone can evaluate the validity of the model for themselves and see if they can find any “hidden treasures” in it (features that the depositors might have missed, like evidence for a bound metal ion). And the way that anyone can look at it for themselves is through the PDB, so I want to tell you a bit more about how to use it.
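To make the “addresses for each atom” idea concrete, here’s a minimal sketch of reading one atom record from the older fixed-width PDB file format (newer depositions use the mmCIF format, which is laid out differently). The function name is made up, and the example record uses illustrative values that are not copied from a real entry.

```python
def parse_atom_line(line: str) -> dict:
    """Pull the main fields out of one fixed-width ATOM/HETATM record."""
    return {
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        # The "address": Cartesian coordinates in angstroms.
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        # The "extra info": occupancy and B-factor (temperature factor).
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# Illustrative record, built with explicit column widths so the layout
# does not depend on hand-counted spaces.
example_line = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'VAL':>3} {'A'}{1:>4}    "
    f"{6.204:8.3f}{16.869:8.3f}{4.854:8.3f}{1.00:6.2f}{49.05:6.2f}"
)

atom = parse_atom_line(example_line)
print(atom)
```

For real work you’d probably reach for an established structure-parsing library rather than slicing columns by hand, but this shows what’s actually sitting in the file.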
Some of the specific features and viewing options will depend on what technique was used to generate the data. I haven’t done NMR or cryo-EM (although everyone else in my lab has, so I hear about it a lot), but I have done X-ray crystallography, so I’m most familiar with that. Lucky for me, most of the structures in the PDB were solved using crystallography (although cryo-EM has gained steam in recent years thanks to technological advances).
much more on crystallography here: http://bit.ly/xraycrystallography2
The basic idea with X-ray crystallography is that you shoot X-rays at a crystallized molecule -> the X-rays interact with electrons surrounding the nuclei of the atoms in the crystal, knocking them off-course -> the knocked-off (scattered) X-rays interact with one another to either cancel out or make a stronger X-ray beam in a process called diffraction -> those new “mega-beams” hit a detector to give you a pattern of spots called a diffraction pattern -> you work backwards from that diffraction pattern to make a blobby mesh thing called an electron density map representing the rough position of the electrons -> you build an atomic model into that map showing what you think is the position of each atom in the molecule.
Most of the time, what people look at when they go to the PDB is just that model. But, at least for more recent structures, the PDB also holds all that data which was used to make the model. So, if you know what you’re doing, you can see the blobby stuff (electron density map) and how well (or poorly) the model fits it.
Some of the data the PDB tells you is about the quality of the diffraction data itself, and some of it is about the quality of the modeling.
As for the diffraction data itself, you can find a lot of different details, but a key one is the resolution, which is a measure of how close together two things can be while you can still tell them apart. The blobbier the map is, the poorer the resolution, and therefore the higher the value (in angstroms, Å). Most structures you’ll be looking at are probably in the 2ish Å range.
The better the data you start with, the better the potential is for a really good model. Think about looking at a baseball mitt and trying to model where the fingers are vs looking at a tight-fitting glove and doing the same. It’s much easier to do it (and you can be more confident you got it right) if you have better resolution (like that tight-fitting glove). But you can also make a really bad model from really good data! (even with that nice glove you might draw the fingers sticking the wrong way or something). So it’s important not to just look at the resolution. You want to look at how well the molecule fits that data. And the PDB tries to make this easier with their wwPDB validation report (wwPDB stands for worldwide PDB and we’ll talk more about it later).
At least on the RCSB website, which is what I’m going to be referencing and showing in the figures, an overview of the validation report is shown in a quick visual form as slider bars, and you can click to see the full report. These bars contain some global statistics that tell you about the overall fit and the molecular feasibility of the model, on a scale from red (worse than average) to blue (better than average). “Average” here is compared to all other X-ray structures (black bars) as well as to all other X-ray structures with similar resolution (unshaded bars).
quick terminology note: proteins are chains of letters called amino acids linked together and folded up. Amino acids have a generic backbone that lets them link up and unique “side chains” that stick off from the backbone. When they link together, they do it through their amino and acid groups, so they’re no longer technically “amino acids,” and we call the individual linked letters residues instead.
A few of the metrics you see in the sliders are:
- Rfree: this is a measure of how well their model fits the evidence overall
- clashscore: this is a measure of how much the atoms would physically clash with each other if they’re really where the model says they are (basically, the models only model the center of the atoms but the atoms also have electron clouds around their center, and those electron clouds don’t like to get near each other, so each atom needs some elbow room and the clashscore looks at whether they have enough)
- Ramachandran outliers: because different residues have different side chains, and those side chains have different amounts of bulkiness, the flexibility of the protein backbone is more constrained around certain residues and there are particular backbone angles that each residue likes to take, which can be visualized as “allowed zones” on a Ramachandran plot. If you plot the angles in the model, some of them will likely fall outside of their allowed zone, indicating an awkward angle and we call that a Ramachandran outlier.
- sidechain outliers: this is similar in concept to the Ramachandran outliers, but here you’re looking at the angles of the side chains instead of the backbone
- RSRZ outliers: this is a measure of how many individual residues don’t fit the density well (kinda like how many of the stick-y groups in the diagram stick out of the blob they’re supposed to be inside)
note: it’s known from really high-res structures that it’s normal for actual proteins to physically have some outliers and deviations from ideal, so it’s also normal for models to have some. If the number of outliers is too low, it could mean that the model was refined too much toward “ideal” geometry instead of being refined to actually match the data. Refinement should include a mixture of the two, and you should end up with a low, but not too low, level of weirdos.
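To make the Rfree metric from the list above a bit more concrete: R-factors compare the diffraction amplitudes you actually measured (usually written Fobs) with the amplitudes you’d expect if the model were exactly right (Fcalc). Rfree uses that same comparison, but only on a small set of reflections that were deliberately held out of refinement. Here’s a simplified sketch with made-up toy numbers; real programs also handle scaling between the two sets of amplitudes, which this skips.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over a set of reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(obs) - abs(calc)) for obs, calc in zip(f_obs, f_calc))
    denominator = sum(abs(obs) for obs in f_obs)
    return numerator / denominator


# Toy amplitudes for three held-out reflections (invented numbers).
observed = [100.0, 50.0, 25.0]
calculated = [90.0, 55.0, 25.0]

# Differences add up to 15 out of a total of 175, so R is about 0.086.
print(round(r_factor(observed, calculated), 3))
```

Lower is better: a perfect match would give 0, and the bigger the mismatch between model and data, the higher the value climbs.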
Those only tell you an overall assessment of the fit and sometimes you want to look in more detail at a specific region you’re interested in, and there, looking at the density maps can help, as well as something called a B-factor which can tell you about how reliable the pinpointed position in the model is likely to be. You can set the PDB to color based on B-factor if you want to see that. You can also click on the “3D report” to see where potential problems are on the structure. And if you click on the validation report, you can get a lot more details.
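If you’re wondering what a B-factor number actually means physically, the standard isotropic relationship is B = 8π² × (mean-square displacement), so you can convert a B-factor into a rough “wobble distance” for an atom. This is a minimal sketch with a made-up function name, and it assumes simple isotropic B-factors in Å².

```python
import math


def rms_displacement_from_b(b_factor: float) -> float:
    """Isotropic B = 8 * pi^2 * <u^2>, so u_rms = sqrt(B / (8 * pi^2))."""
    if b_factor < 0:
        raise ValueError("B-factors should not be negative")
    return math.sqrt(b_factor / (8 * math.pi ** 2))


for b in (20.0, 80.0):
    print(f"B = {b:5.1f} A^2 -> about {rms_displacement_from_b(b):.2f} A of smearing")
```

So a B-factor of 20 Å² corresponds to roughly half an angstrom of smearing, and quadrupling the B-factor doubles that distance. Keep in mind that B-factors also soak up other sources of uncertainty, so treat this as a rough guide, not a literal measurement of motion.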
Speaking of details, there is sooooo much you can do with the PDB (way more than I even know how to do) and the details are too complex to try to explain here, but thankfully, the PDB is also an educational resource! I’ll get into it a bit more later, but “the PDB” is actually a multi-continent collaboration with several different centers. The one based out of the US is the Research Collaboratory for Structural Bioinformatics (RCSB) PDB, and they have a great website called PDB-101: http://pdb101.rcsb.org/
If you go to this page in the “learn” tab, you’ll find this Introduction to PDB Data: http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction
They also have lots of cool resources for teachers and learners, including downloadable infographics and posters and make-it-yourself paper models. I used the GFP model when I taught a summer camp lesson: https://bit.ly/gfpfunscience
Structural biology can be kinda hard to get into, so they have ways to lure you in, like their “Molecule of the Month” series. Each month they feature a different molecule or class of molecules, tell you cool stuff about them, and walk you through some structures. May 2021’s Molecule of the Month is fetal hemoglobin.
Hemoglobin is the protein that carries oxygen through your blood. It’s a multi-subunit protein with different subunits made by different genes. One of those subunits has a fetal version and an “adult” version. The fetal version is extra good at snatching up oxygen, which allows the fetus to get oxygen from its mother’s placenta, which is low in oxygen since the mom needs oxygen too. Then, once the baby is born and breathing on its own, there’s plenty of oxygen, so it doesn’t need such desperate oxygen snatchers and it starts making adult hemoglobin instead.
Behind the switch is a protein called BCL11A which normally acts as transcriptional brakes, binding to the promoter region in front of the fetal hemoglobin gene and preventing mRNA copies of that genetic recipe from getting made. Without those recipe copies, the protein-making complexes (ribosomes) don’t make the fetal hemoglobin subunit. But, when you take away the brakes, such as by using CRISPR-based genetic engineering on blood stem cells, the fetal hemoglobin can get made even in adult cells, and this can compensate for deficiencies with the adult version caused by sickle cell anemia or other diseases involving adult hemoglobin (hemoglobinopathies). This strategy is still experimental, but it’s showing great promise (though at a cost that will make it out of reach for many). More on this here: http://bit.ly/sicklecelldiseases
That’s exciting, and I hope you’ll read that post to learn more. But that’s not why I told you that story. Instead, it’s just a little background for an exploration of the structure of BCL11A bound to DNA of the fetal hemoglobin promoter, PDB ID 6ki6 (note that this is one of those places where lowercase is so much better! compare 6ki6 to 6KI6…)
Once you’ve read more about it in the PDB-101 article ( http://pdb101.rcsb.org/motm/257 ), you can visit the page of the structure and play around: https://youtu.be/ZffY6f6pxvE That’s the link to it on the RCSB, but you can access it from PDBe or PDBj as well – just search for “6ki6”
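If you’d rather jump straight to an entry than type it into a search box, the sites use predictable web addresses. Here’s a small sketch that builds a few of them from a PDB ID. The URL patterns are assumptions based on how the RCSB and PDBe sites are laid out at the time of writing (they aren’t guaranteed to stay that way), and the function name is made up.

```python
def pdb_links(pdb_id: str) -> dict:
    """Build entry-page and download URLs for one PDB ID (assumed URL patterns)."""
    code = pdb_id.strip().lower()
    return {
        # RCSB entry page
        "rcsb_entry": f"https://www.rcsb.org/structure/{code.upper()}",
        # PDBe entry page
        "pdbe_entry": f"https://www.ebi.ac.uk/pdbe/entry/pdb/{code}",
        # Coordinate file in mmCIF format from the RCSB file server
        "rcsb_mmcif_download": f"https://files.rcsb.org/download/{code.upper()}.cif",
    }


for name, url in pdb_links("6ki6").items():
    print(f"{name}: {url}")
```

This only builds the addresses; it doesn’t download anything or check that the entry exists.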
The best way to learn to use the PDB is really just to play around and check out the different features. And one of my favorite features is the “Protein Feature View,” which you can access if you scroll down on the main entry page in RCSB or, with more detail, under the “sequence” tab. Here they show various things, but one of the ones I want you to be aware of is “unmodeled regions” – these are parts of the structure where the density wasn’t resolvable. These regions often correspond to flexible or dynamic regions of the protein that didn’t like to sit still for the X-ray beam, so their signals kinda canceled each other out and there wasn’t enough blobby stuff there to try to fit atoms into. It’s not really like this at the technical level, but you can think of it a bit like taking a picture of someone waving their hand – you can see most of the person fine, but the waving hand is too blurry to make out. With unmodeled regions, you have a similar thing – you can see most of the protein but not some parts, even though those parts are physically there.
Sometimes, however, there will be parts that aren’t in the model that aren’t there because they were physically taken out of the protein – scientists will often intentionally make mutations to proteins like chopping off regions that are predicted to be disordered, or changing some residues in order to try to get them to crystallize better. These will show up in the feature view as well if there are any.
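One quick way to spot candidate missing stretches yourself is to look for jumps in the residue numbering of the modeled residues. Here’s a minimal sketch; the function name and the residue numbers are hypothetical, not taken from 6ki6.

```python
def find_numbering_gaps(residue_numbers):
    """Return (first_missing, last_missing) ranges between consecutive modeled residues."""
    ordered = sorted(set(residue_numbers))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


# Hypothetical chain where residues 5-20 and 28-40 were modeled.
modeled = list(range(5, 21)) + list(range(28, 41))
print(find_numbering_gaps(modeled))
```

A jump in numbering doesn’t tell you *why* the residues are missing (too floppy to see vs. deliberately chopped out), and it can’t detect missing ends of the chain, which is exactly why the annotated feature view is so handy.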
You can see even more features on the “Sequence” tab – things like metal binding sites – you can see that BCL11A binds to zinc.
If you go to look at the 3D structure, you have a bunch of options of what you want to display and how. The terminology can get pretty confusing and overwhelming, but hopefully this helps give you an overview for some of the X-ray crystallography terms.
A crystal is made up of lots and lots and lots of copies of the molecule(s) you’re trying to look at arranged in a repeating 3D pattern. Each copy is too tiny to see by itself, but if you put a ton of them together, their signals can contribute to each other such that you get strong enough spots on the diffraction pattern to work back from.
The protein or complex you’re interested in looking at is usually the “biological assembly,” which is the form of the molecule that is thought to be the active form – so, in our case it would be one copy of the protein bound to the DNA. But, if you look at the “Model” in the 3D viewer, you’ll see 2 copies. This is because it’s showing you the “asymmetric unit,” which is the smallest repeating part of the crystal. If you know the structure of the asymmetric unit and the dimensions of something called the unit cell, you can recreate the whole crystal just using symmetry operations (rotate, move left, move up, etc.). Confusingly, that asymmetric unit might itself contain more than one copy of the biological assembly (like maybe one’s upside down and the other is right-side up). And that’s the case here.
Even though there are 2 copies of the BCL11A/DNA complex per asymmetric unit, and multiple asymmetric units within the unit cell, its functional form is just 1 copy. The copies of the biological assembly within the asymmetric unit are identical in their sequence but they might be slightly different in their shape, and during the structure-solving they are therefore treated separately. In the PDB, these “different copies” within the asymmetric unit are called “instances” so if you’re confused by that, that’s what it refers to and you can choose which one you want to look at.
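The “rotate, move left, move up” idea behind symmetry operations boils down to a rotation plus a shift applied to every atom position. Here’s a minimal sketch using fractional coordinates (positions expressed as fractions of the unit cell edges). The operation shown is a generic two-fold screw axis chosen purely for illustration; I’m not claiming it’s one of the symmetry operations of the 6ki6 crystal, and the atom position is invented.

```python
def apply_symmetry_operation(fractional_xyz, rotation, translation):
    """Apply x' = R.x + t to one point in fractional coordinates, wrapped into the unit cell."""
    moved = []
    for row, shift in zip(rotation, translation):
        value = sum(r * c for r, c in zip(row, fractional_xyz)) + shift
        # Wrap back into the 0-1 range so the point stays inside one unit cell.
        moved.append(value % 1.0)
    return tuple(moved)


# Two-fold screw axis along z, often written "-x, -y, z+1/2":
# rotate 180 degrees about z, then slide half a cell along z.
SCREW_ROTATION = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
SCREW_TRANSLATION = [0.0, 0.0, 0.5]

atom = (0.10, 0.20, 0.30)  # invented fractional position
mate = apply_symmetry_operation(atom, SCREW_ROTATION, SCREW_TRANSLATION)
print([round(v, 3) for v in mate])
```

Apply every symmetry operation of the crystal to every atom in the asymmetric unit and you’ve filled one unit cell; stack copies of that cell in all three directions and you’ve rebuilt the crystal.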
The paper associated with that structure was
Yang, Y., Xu, Z., He, C. et al. Structural insights into the recognition of γ-globin gene promoter by BCL11A. Cell Res 29, 960–963 (2019). https://rdcu.be/ckj6Y
So if you want to learn more, you can read the authors discuss some of the implications of their structures. One cool thing – it’s been known that some people have mutations in BCL11A promoter that cause them to produce fetal hemoglobin even as adults. This structure shows that the residues that are mutated in those patients make critical interactions with the promoter DNA, so the mutations would likely mess up the binding.
Now that we’ve talked about how you can explore the PDB, let’s explore the PDB’s history! The PDB’s actual 50th birthday won’t be until October, but celebrations have already begun, inspiring this post to help you all get in on the fun!
The PDB had super humble beginnings. It was operated by Brookhaven National Laboratory (in the US) and the Cambridge Crystallographic Data Centre (in the UK). Instead of a website, they’d mail coordinate files to interested people on magnetic tapes! You can see their public debut announcement in the figures and here: https://www.nature.com/articles/newbio233223b0.pdf
In 1995, they took things to the World Wide Web, and then in 2003, they took things worldwide, launching the Worldwide PDB (wwPDB) with centers located on three continents: the Research Collaboratory for Structural Bioinformatics (RCSB) PDB in the US, the EMBL-EBI’s PDB in Europe (PDBe), and PDB Japan (PDBj). The wwPDB’s debut paper can be found here: https://rdcu.be/ckqM8
All these institutions hold the same data, but they distribute it in different ways (but still all free). A 4th member group is the Biological Magnetic Resonance Data Bank which deals with the NMR stuff.
I typically use the RCSB site, which is managed by three member institutions of the RCSB: Rutgers; the University of California, San Diego (UCSD); and the University of California at San Francisco (UCSF).
here’s a cool interactive timeline: https://www.rcsb.org/pages/about-us/history
The PDB does a lot more than just display data – it’s a huge community effort to maintain data quality and accessibility. Their “vision” is to “Sustain freely accessible, interoperating Core Archives of structure data and metadata for biological macromolecules as an enduring public good to promote basic and applied research and education across the sciences.”
and their mission points are:
- Manage the wwPDB Core Archives as a public good according to the FAIR Principles.
- Provide expert deposition, validation, biocuration, and remediation services at no charge to Data Depositors worldwide.
- Ensure universal open access to public domain structural biology data with no limitations on usage.
- Develop and promote community-endorsed data standards for archiving and exchange of global structural biology data.
that’s direct from their website where you can find out LOTS more: http://www.wwpdb.org/
Going back to the whole concept of models not being flawless, I also encourage you to check out this awesome article about the coronavirus structural task force (structural biology superheroes) working to validate & correct any problems with the constant stream of SARS-CoV-2 structures. And the corresponding author (the person who you contact with any questions) is Andrea Thorn, whom I met at CSHL’s X-ray crystallography course and found incredibly awesome and definitely superhero-worthy! Huge thanks to all of the team members. https://t.co/6Kas9tiz2i?amp=1
If you’re interested in learning more about how to evaluate the quality of published structures for yourself, I highly recommend the free article, “Protein crystallography for non‐crystallographers, or how to get the best (but not more) from published macromolecular structures” by Alexander Wlodawer, Wladek Minor, Zbigniew Dauter, and Mariusz Jaskolski. It is literally one of my all-time favorites. They do a really great job explaining what the various terms mean, what to look out for, etc. https://febs.onlinelibrary.wiley.com/doi/full/10.1111/j.1742-4658.2007.06178.x
figure on PDB growth: wwPDB consortium, Protein Data Bank: the single global archive for 3D macromolecular structure data, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D520–D528, https://doi.org/10.1093/nar/gky949