A guide to interpreting, analyzing, and playing with structures in the Protein Data Bank (PDB), with an emphasis on X-ray crystallography structures. Apologies for the roughness and extra-bumbliness. Grad school busyness is not conducive to high-quality video making, but I hope this is still helpful.
text adapted from a May post I did on the PDB, video new (first the full video and then an abbreviated version focused on PDB entry basics and then an abbreviated version focused on understanding & evaluating crystal structures in the PDB)
If you’ve ever read an article discussing the structure of a protein (what it “looks like” at the atomic scale) or seen a picture of a protein model, you might have seen something like: PDB ID 2hhb, 1hho. Each of those “accession codes” is a specific name for a structure. In these cases, 2hhb is a structure of deoxyhemoglobin; 1hho is a structure of oxyhemoglobin. These codes are more than just nicknames. In one sense they’re more like the structural biology equivalent of citing the photographer. Except they’re way cooler because they go way beyond just giving credit where credit’s due! If you search the Protein Data Bank (PDB) for that name, it’ll pop right up and let you explore it. In 3D! You can actually play around with rotating it, coloring it different ways, etc. instead of just looking at a static snapshot. And you can find out more information about it (how it was “solved,” how reliable it is, etc.) as well as do a lot more with it, especially if you’re in the field and you know what you’re doing. So I thought I’d make a video walking you through how you can make the most of it even if you aren’t a hard-core structural biologist.
Let me step back a sec and explain what “structural biology” is, because I hadn’t even heard the term until college, and now that I’m in the field I can sometimes forget that most people aren’t and might not know it. There’s more on structural biology here: http://bit.ly/cryoemxray but basically it’s the sub-field of biology that deals with trying to figure out what macromolecules (things like proteins, DNA, RNA, and mix-and-matched complexes of those components) look like (their form) and how that form fits with what they do (their functions).
To get an idea of why this might be relevant, think of how a spoon is good for scooping ice cream whereas a knife is good for cutting cake. Macro means large, but these molecules are only large in comparison to other molecules. Compared to us, they’re tiny! So tiny that they’re invisible to our eyes, and even to our conventional microscopes. Therefore, we have to use fancy-dancy techniques like X-ray crystallography, cryo-electron microscopy (cryo-EM), and nuclear magnetic resonance (NMR) to visualize them.
The details are complicated and vary from technique to technique, but a key thing to know about all of them is that they don’t give you the actual atomic positions of each of the atoms that make up the molecule (individual carbons, hydrogens, oxygens, nitrogens, etc.). Instead they just give evidence of whereabouts those atoms are and then you have to use math-y stuff (or at least the computer does) to generate a “model” of the structure, placing in the atoms to fit the evidence. This allows scientists to generate “models” of the atomic structure of the thing they were looking at. And when someone does this, we say they’ve “solved the structure” of that protein or complex or whatever it was.
If they then want to publish a paper talking about that structure, they have to deposit the “coordinates” for it in the PDB. These coordinates are like addresses for each of the atoms in the model – plus some extra info like how confident the depositors are in the position. Basically, the structure-solvers have to upload enough data that everyone can evaluate the validity of the model for themselves and see if they can find any “hidden treasures” in it (features that the depositors might have missed, like evidence for a bound metal ion). And the way that anyone can look at it for themselves is through the PDB, so I want to tell you a bit more about how to use it.
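To make the “addresses for each atom” idea a bit more concrete, here’s a minimal Python sketch that pulls the fields out of one atom record in the legacy fixed-column PDB file format (the PDB’s primary format these days is PDBx/mmCIF, which is laid out differently). The record itself is made up for illustration, and the function name `parse_atom_line` is just my own choice, not part of any library.

```python
# A made-up ATOM record in the legacy fixed-column PDB format (illustrative values only).
SAMPLE = "ATOM      1  N   VAL A   1       6.204  16.869   4.854  1.00 49.05           N"


def parse_atom_line(line):
    """Slice one ATOM/HETATM record into named fields using fixed column positions."""
    return {
        "serial": int(line[6:11]),            # atom serial number
        "atom_name": line[12:16].strip(),     # e.g. N, CA, C, O
        "residue_name": line[17:20].strip(),  # three-letter residue code
        "chain": line[21],                    # chain identifier
        "residue_number": int(line[22:26]),   # position in the sequence numbering
        "x": float(line[30:38]),              # coordinates, in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),      # fraction of molecules with the atom here
        "b_factor": float(line[60:66]),       # "smeared-out-ness" of the position
        "element": line[76:78].strip(),       # element symbol
    }


atom = parse_atom_line(SAMPLE)
print(atom)
```

The x, y, z values are the “address,” and the occupancy and B-factor columns are the kind of extra info that hints at how well-pinned-down that address is.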
Some of the specific features and viewing options will depend on what technique was used to generate the data. I haven’t done NMR or cryo-EM (although everyone else in my lab has done cryo-EM, so I hear about it a lot), but I have done X-ray crystallography, so I’m most familiar with that. Lucky for me, most of the structures in the PDB were solved using crystallography (although cryo-EM has gained steam in recent years thanks to technological advances).
much more on crystallography here: http://bit.ly/xraycrystallography2
The basic idea with X-ray crystallography is that you shoot X-rays at a crystallized molecule -> the X-rays interact with electrons surrounding the nuclei of the atoms in the crystal, knocking them off-course -> the knocked-off (scattered) X-rays interact with one another to either cancel out or make a stronger X-ray beam in a process called diffraction -> those new “mega-beams” hit a detector to give you a pattern of spots called a diffraction pattern -> you work backwards from that diffraction pattern to make a blobby mesh thing called an electron density map representing the rough position of the electrons -> you build an atomic model into that map showing what you think is the position of each atom in the molecule.
Most of the time, what people look at when they go to the PDB is just that model. But, at least for more recent structures, the PDB also holds all that data which was used to make the model. So, if you know what you’re doing, you can see the blobby stuff (electron density map) and how well (or poorly) the model fits it.
Some of the data the PDB tells you is about the quality of the diffraction data itself, and some of it is about the quality of the modeling.
As for the diffraction data itself, you can find a lot of different details, but a key one is the resolution, which is a measure of how close together two things can be while you can still tell them apart. The blobbier the map is, the poorer the resolution, and therefore the higher the value (in angstroms, Å). Most structures you’ll be looking at are probably in the 2ish Å range.
The better the data you start with, the better the potential is for a really good model. Think about looking at a baseball mitt and trying to model where the fingers are vs looking at a tight-fitting glove and doing the same. It’s much easier to do it (and you can be more confident you got it right) if you have better resolution (like that tight-fitting glove). But you can also make a really bad model from really good data! (even with that nice glove you might draw the fingers sticking the wrong way or something). So it’s important not to just look at the resolution. You want to look at how well the molecule fits that data. And the PDB tries to make this easier with their wwPDB validation report (wwPDB stands for worldwide PDB and we’ll talk more about it later).
At least on the RCSB website, which is what I’m going to be referencing and showing in the figures, an overview of the validation report is shown in a quick visual form as slider bars, and you can click through to see the full report. These bars show some global statistics that tell you about the overall fit and the molecular feasibility of the model, on a scale from red (worse than average) to blue (better than average). “Average” here is compared to all other X-ray structures (black bars) as well as to all other X-ray structures with similar resolution (unshaded bars).
quick terminology note: proteins are chains of letters called amino acids linked together and folded up. Amino acids have a generic backbone that lets them link up and unique “side chains” that stick off from the backbone. When they link together, they do it through their amino and acid groups, so they’re no longer technically “amino acids” so we call the individual linked letters residues instead
A few of the metrics you see in the sliders are:
– Rfree: this is a measure of how well their model fits the evidence overall
– clashscore: this is a measure of how much the atoms would physically clash with each other if they’re really where the model says they are (basically, the models only model the center of the atoms but the atoms also have electron clouds around their center, and those electron clouds don’t like to get near each other, so each atom needs some elbow room and the clashscore looks at whether they have enough)
– Ramachandran outliers: because different residues have different side chains, and those side chains have different amounts of bulkiness, the flexibility of the protein backbone is more constrained around certain residues and there are particular backbone angles that each residue likes to take, which can be visualized as “allowed zones” on a Ramachandran plot. If you plot the angles in the model, some of them will likely fall outside of their allowed zone, indicating an awkward angle and we call that a Ramachandran outlier.
– sidechain outliers: this is similar in concept to the Ramachandran outliers, but here you’re looking at the angles of the side chains instead of the backbone
– RSRZ outliers: this is a measure of how many individual residues don’t fit the density well (kinda like how many of the stick-y groups in the diagram stick out of the blob they’re supposed to be inside)
note: it’s known from really high-res structures that it’s normal for actual proteins to physically have some outliers and deviations from ideal, so it’s also normal for models to have some. If the number of outliers is too low, it could mean that the model was refined too heavily toward “ideal” geometry instead of being refined to actually match the data. Refinement should include a mixture of the two, and you should end up with a low, but not too low, level of weirdos.
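If you’re curious what’s under the hood of a number like Rfree, here’s a rough Python sketch of the standard R-factor formula: add up the differences between the observed and model-calculated diffraction amplitudes and divide by the sum of the observed amplitudes. Rfree is that same calculation done only on a small set of reflections that were held out of refinement. The amplitudes below are invented, and I’m assuming the two lists are already on the same scale (real software handles the scaling for you).

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over a set of reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Hypothetical amplitudes for five held-out ("free") reflections.
f_obs_free = [100.0, 80.0, 60.0, 40.0, 20.0]   # what was measured
f_calc_free = [90.0, 85.0, 50.0, 45.0, 20.0]   # what the model predicts

r_free = r_factor(f_obs_free, f_calc_free)
print(f"Rfree = {r_free:.3f}")
```

Lower is better: a value of 0 would mean the model predicts the held-out data perfectly, which never happens with real data.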
Those figures only give you an overall assessment of the fit. Sometimes you want to look in more detail at a specific region you’re interested in, and there, looking at the density maps can help, as can something called a B-factor, which can tell you about how reliable the pinpointed position in the model is likely to be. You can set the PDB to color based on B-factor if you want to see that. You can also click on the “3D report” to see where potential problems are on the structure. And if you click on the validation report, you can get a lot more details.
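To get a feel for what a B-factor value means physically, here’s a small Python sketch using the standard relationship for an isotropic B-factor, B = 8π² × (mean-square displacement). It converts a B-factor (in Å²) into a root-mean-square displacement (in Å). The function name and the example values are just mine for illustration, and in practice B-factors also soak up other sources of error, so treat the result as a rough guide rather than a literal measurement of wiggle.

```python
import math


def b_factor_to_rms_displacement(b_factor):
    """Convert an isotropic B-factor (in square angstroms) to an RMS displacement (in angstroms)."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# Hypothetical B-factors: a well-ordered atom vs. a floppier one.
for b in (20.0, 80.0):
    print(f"B = {b:5.1f} -> RMS displacement of about {b_factor_to_rms_displacement(b):.2f} angstroms")
```

Because of the square root, quadrupling the B-factor only doubles the displacement.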
The example I want to explore is the structure of BCL11A bound to DNA of the fetal hemoglobin promoter, PDB ID 6ki6 (note that this is one of those places where lowercase is so much better! compare 6ki6 to 6KI6…)
You can read more about it in the PDB-101 article http://pdb101.rcsb.org/motm/257 and there’s more on what it is here: http://bit.ly/thepdb but I just want to use it as a “generic” example in this post.
The best way to learn to use the PDB is really just to play around and check out the different features. So let’s visit the page of the structure and play around: https://www.rcsb.org/structure/6KI6 That’s the link to it on the RCSB, but you can access it from PDBe or PDBj as well – just search for “6ki6”
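If you’d rather grab the coordinate file itself than click around, here’s a small Python sketch. It assumes the RCSB’s file-download address follows the pattern `https://files.rcsb.org/download/<ID>.<format>`, which is how it works as far as I know, but treat that pattern as an assumption and check the RCSB’s own documentation if it doesn’t work. The function names are my own.

```python
from urllib.request import urlopen


def rcsb_file_url(pdb_id, fmt="cif"):
    """Build the (assumed) RCSB download address for an entry, e.g. fmt='cif' or 'pdb'."""
    return f"https://files.rcsb.org/download/{pdb_id.strip().upper()}.{fmt}"


def fetch_structure_text(pdb_id, fmt="cif"):
    """Download the coordinate file as text (needs an internet connection)."""
    with urlopen(rcsb_file_url(pdb_id, fmt)) as response:
        return response.read().decode("utf-8")


print(rcsb_file_url("6ki6"))
# To actually download the file, call: text = fetch_structure_text("6ki6")
```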
And one of my favorite features is the “Protein Feature View,” which you can access if you scroll down on the main entry page in RCSB or, with more detail, under the “sequence” tab. Here they show various things, but one of the ones I want you to be aware of is “unmodeled regions” – these are parts of the structure where the density wasn’t resolvable. These regions often correspond to flexible or dynamic regions of the protein that didn’t like to sit still for the X-ray beam, so their signals kinda canceled each other out and there wasn’t enough blobby stuff there to try to fit atoms into. It’s not really like this at the technical level, but you can think of it a bit like taking a picture of someone waving their hand – you can see most of the person fine, but the waving hand is too blurry to make out. With unmodeled regions, you have a similar thing – you can see most of the protein but not some parts, even though those parts are physically there.
Sometimes, however, there will be parts that aren’t in the model that aren’t there because they were physically taken out of the protein – scientists will often intentionally make mutations to proteins like chopping off regions that are predicted to be disordered, or changing some residues in order to try to get them to crystallize better. These will show up in the feature view as well if there are any.
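Here’s a minimal Python sketch of how you could hunt for regions missing from a model yourself: take the residue numbers that actually appear in the model and list the stretches of the expected range that are absent. The residue numbers below are invented (they are not from 6ki6), and a gap in numbering alone can’t tell you whether a region was unmodeled or deliberately chopped off, so you’d still want to check against the construct described in the paper.

```python
def missing_ranges(expected_first, expected_last, modeled_numbers):
    """Return (start, end) residue ranges in the expected span that have no modeled residue."""
    modeled = set(modeled_numbers)
    gaps = []
    start = None
    for n in range(expected_first, expected_last + 1):
        if n not in modeled:
            if start is None:
                start = n                      # a new gap opens here
        elif start is not None:
            gaps.append((start, n - 1))        # the gap closed at the previous residue
            start = None
    if start is not None:
        gaps.append((start, expected_last))    # a gap runs to the end of the expected span
    return gaps


# Hypothetical model: residues 5-40 and 48-95 were built, out of an expected 1-100.
modeled = list(range(5, 41)) + list(range(48, 96))
print(missing_ranges(1, 100, modeled))
# prints [(1, 4), (41, 47), (96, 100)]
```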
You can see even more features on the “Sequence” tab, things like metal binding sites. For example, you can see that BCL11A binds to zinc.
If you go to look at the 3D structure, you have a bunch of options of what you want to display and how. The terminology can get pretty confusing and overwhelming, but hopefully this helps give you an overview for some of the X-ray crystallography terms.
A crystal is made up of lots and lots and lots of copies of the molecule(s) you’re trying to look at arranged in a repeating 3D pattern. Each copy is too tiny to see by itself, but if you put a ton of them together, their signals can add up such that you get strong enough spots on the diffraction pattern to work back from.
The protein or complex you’re interested in looking at is usually the “biological assembly,” which is the form of the molecule that is thought to be the active form – so, in our case it would be one copy of the protein bound to the DNA. But, if you look at the “Model” in the 3D viewer, you’ll see 2 copies. This is because it’s showing you the “asymmetric unit,” which is the smallest repeating part of the crystal. If you know the structure of the asymmetric unit and the dimensions of something called the unit cell, you can recreate the whole crystal just using symmetry operations (rotate, move left, move up, etc.). Confusingly, that asymmetric unit might itself contain more than one copy of the biological assembly (like maybe one’s upside down and the other is right-side up). And that’s the case here.
Even though there are 2 copies of the BCL11A/DNA complex per asymmetric unit, and multiple asymmetric units within the unit cell, its functional form is just 1 copy. The copies of the biological assembly within the asymmetric unit are identical in their sequence but they might be slightly different in their shape, and during the structure-solving they are therefore treated separately. In the PDB, these “different copies” within the asymmetric unit are called “instances” so if you’re confused by that, that’s what it refers to and you can choose which one you want to look at.
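To show what a “symmetry operation” does to coordinates, here’s a toy Python sketch: a rotation (written as a 3x3 matrix) followed by a shift, applied to every atom in the asymmetric unit to generate a neighboring copy. The two “atoms” and the operation itself (a two-fold rotation about the z axis plus a 10 Å shift along z) are invented for illustration. Real crystallographic operators come from the crystal’s space group and are usually written in fractional unit-cell coordinates, so this is a simplified sketch, not how a crystallography package does it.

```python
def apply_symmetry(coords, rotation, translation):
    """Apply new = rotation * old + translation to each (x, y, z) coordinate."""
    transformed = []
    for x, y, z in coords:
        new_point = tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
            for i in range(3)
        )
        transformed.append(new_point)
    return transformed


# Hypothetical operation: two-fold rotation about z, (x, y, z) -> (-x, -y, z), then shift along z.
TWO_FOLD_ABOUT_Z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
SHIFT = (0.0, 0.0, 10.0)

asymmetric_unit = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]  # two made-up atom positions
symmetry_copy = apply_symmetry(asymmetric_unit, TWO_FOLD_ABOUT_Z, SHIFT)
print(symmetry_copy)
# prints [(-1.0, -2.0, 13.0), (-4.0, -5.0, 16.0)]
```

Repeat that with every operation the crystal’s symmetry allows, and then stack the resulting unit cells side by side, and you’ve rebuilt the crystal from just the asymmetric unit.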
much more here: http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies
The paper associated with that structure was
Yang, Y., Xu, Z., He, C. et al. Structural insights into the recognition of γ-globin gene promoter by BCL11A. Cell Res 29, 960–963 (2019). https://rdcu.be/ckj6Y
If you’re interested in learning more about how to evaluate the quality of published structures for yourself, I highly recommend the free article, “Protein crystallography for non‐crystallographers, or how to get the best (but not more) from published macromolecular structures” by Alexander Wlodawer, Wladek Minor, Zbigniew Dauter, and Mariusz Jaskolski. It is literally one of my all-time favorites. They do a really great job explaining what the various terms mean, what to look out for, etc. https://febs.onlinelibrary.wiley.com/doi/full/10.1111/j.1742-4658.2007.06178.x