I’m a wet lab scientist through and through, but once you do data-generating experiments, you’ve got some stats-y stuff to do. And this can lead me to get confused about STANDARD DEVIATION & STANDARD ERROR OF THE MEAN, and when each should be used! And I’m confident I needed a refresher on CONFIDENCE INTERVALS. So, since I was refreshing myself, I thought I would share with my non-self.
Scientists often may seem wishy-washy when discussing results, saying things like “it’s likely that” or “while we cannot rule out other causes…” It’s not because we don’t have confidence, it’s just that we know that we can never really be certain of anything. BUT we can be more or less certain & we need a way to express how certain or uncertain we are. We can do this with a few different statistical terms depending on what we’re trying to show:
- if we want to show the distribution within a sample of the thing we’re measuring, we use STANDARD DEVIATION (SD) (for example, if we were measuring heights, the spread of heights in the group we sample would be the SD)
- if we want to show how representative the value we calculate from a sample is of the overall population, we use STANDARD ERROR (SE) (how much variation is there between the average heights of different samples from the same population? How accurate is your estimate?)
- if we want to show where we’re pretty sure that the true population value lies, we use the CONFIDENCE INTERVAL (e.g. we’re 95% sure that the average height is between x & y).
So, SE can kind of tell us how confident we are that we know what the true average (aka mean) is. But it doesn’t tell us whether the non-average people are really short/tall or only kinda short/tall, or if there’s a wide range with some of each! And confidence interval? We’ll get back to that later. First let’s look at a couple examples.
Most of the examples you’ll find for this sort of thing are population-based studies. You take several samples of a population, measure something, and want to show how much variability there was in the thing you were measuring (standard deviation) and how much variability there was between samples in the “final” calculations you got from the thing you were measuring in each (standard error).
note: whenever a standard deviation is calculated for a statistic, we call it a standard error. So you can have a standard error of a slope or a median or anything. But, most commonly, standard error is used for means, where “mean” means the average you get if you add up all the individual things and divide by the number of things. (it’s basically what probably first comes to mind when you think of “average”).
Normally, they’re in terms of people populations, and I’ll give you one of those in a minute, but I’m not really a people person… Instead, one of the only times I deal with “populations” is with insect cells, so let’s start there….
I study proteins & in order to get proteins to study, I express them in insect cells. We grow them in flasks & they multiply quickly. So they can quickly get too dense (too many cells competing for too little food & space). Therefore, you have to check them regularly to see when you need to add more media (liquid food) & split them into more flasks so they have more breathing room.
If I wanted to know the true *population* density (how packed together are cells in the whole flask), I’d have to count every cell in the flask & divide that # by the total volume in the flask. Not only is this not practical, it would waste all our cells! So, instead, I check them by removing a SAMPLE – a small amount of liquid from the flask (which has our population) & counting them under a microscope.
To help, I use a hemocytometer, which has a little grid to help me count & I can use a clicker to keep track of counts. The grid has a bunch of different “mini-grids”, so you can choose which mini-grid to count. Sometimes there are randomly more cells in some mini-grids than others, so if I counted # of cells in the exact same sample, but from different mini-grids, I’d get different numbers. I can average them together to get the mean (add them up and divide by the # of counts), & how close the individual counts are to this calculated mean is a measure of variability within the sample (and a measure of the precision of my measuring skills, which depends on my patience & eyesight). We can calculate variability between measurements of a single sample as the STANDARD DEVIATION (SD) (σ).
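To make that concrete, here is a minimal sketch of the mean and SD calculation. The five mini-grid counts are made-up numbers for illustration (not real data), and it uses the "sample" SD from Python's standard library, which divides by n - 1.

```python
import statistics

# Hypothetical counts from five mini-grids of the SAME sample (made-up numbers)
counts = [48, 52, 50, 47, 53]

# Mean: add the counts up and divide by how many counts there are
mean_count = statistics.mean(counts)

# Sample standard deviation: how spread out the counts are around that mean
sd_count = statistics.stdev(counts)

print(f"mean = {mean_count} cells, SD = {sd_count:.2f} cells")
```

If the counts were all bunched up near 50, the SD would shrink; if they were all over the place, it would grow.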
There’s this useful rule of thumb called the 68-95-99.7 rule: 68% of the data are w/in 1 SD of the mean (either above or below), 95% are w/in 2 SDs, & 99.7% are w/in 3 SDs (when your data is normally distributed, i.e. shaped like a symmetrical bell curve). A normal curve is always symmetric, but it can be steeper or flatter, depending on how spread out the data are. The flatter & wider the curve, the larger the SD.
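If you'd rather check the rule of thumb than memorize it, here is a small sketch using the normal distribution built into Python's standard library. The helper name `fraction_within` is just something I made up for this example.

```python
from statistics import NormalDist

# A "standard" normal curve: mean 0, SD 1. The rule holds for any mean and SD,
# because it is stated in units of SD.
standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.1%}")
```

The exact values are a touch off from the round numbers in the rule (it's a rule of thumb, after all), and they only apply when the data really are normally distributed.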
SD isn’t really that useful in this particular situation, because we don’t really care that much about the variation between the different mini-grids. What we really care about is how representative that mean is of the population mean. And, right now, we kinda have no clue! We’ve only tested a single sample. We’d feel more confident if we got similar results from multiple samples (and this confidence would be reflected in a lower SEM). So let’s take some more!
So, now, imagine you were to take a DIFFERENT RANDOM SAMPLE from the same flask and count it multiple times. This sample might, by chance, have more or fewer cells than the previous sample. The SD shouldn’t change much, because that was pretty random anyway, so differences in the average values you count BETWEEN SAMPLES reflect variability in the POPULATION (i.e. it’s something about the sample itself & not the measurer or measuring method that’s different).
If we only take a couple random samples, & our population’s not well-mixed, we might get 2 that are much denser than the “true” average, etc., so we can get an inaccurate average density. BUT the more samples we take, the better our estimate: the denser samples cancel out the less-dense ones and our calculated average (from the samples) gets closer & closer to the true population average. We can describe this using STANDARD ERROR of the MEAN (SEM) -> mathematically, SEM = SD/√(sample size). When you divide by a bigger number, you get a smaller number, so the more samples you have, the lower the SEM.
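Here is a rough sketch of that formula in action. The sample densities are hypothetical (made-up numbers, in millions of cells per mL), and the function names are just for illustration.

```python
import math
import statistics


def sem_from_sd(sd, n):
    """SEM = SD / sqrt(sample size)."""
    return sd / math.sqrt(n)


def sem(values):
    """SEM calculated directly from a list of measurements."""
    return sem_from_sd(statistics.stdev(values), len(values))


# Same SD, more and more samples: the SEM keeps shrinking
for n in (4, 16, 100):
    print(f"SD = 10, n = {n:>3} -> SEM = {sem_from_sd(10, n):.1f}")

# Hypothetical average densities from five samples of one flask (made-up data)
sample_densities = [1.9, 2.1, 2.0, 2.2, 1.8]
print(f"SEM of the sample densities: {sem(sample_densities):.3f}")
```

Notice the square root: to cut the SEM in half you need 4x as many samples, not 2x.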
To summarize: SD measures variability in the data we used to get 1 average (in this case, cell counts). If we then average a bunch of averages (in this case, average cell counts from different samples), we can use SEM to describe the variability in that new average.
So, in that situation, SD by itself didn’t really tell us much that was biologically interesting, but sometimes it can. So let’s look at a different situation, one you might be more familiar with: consider an experiment to find average heart rate (HR) of Americans. To get the true answer, you’d have to measure the HR of every single American. That’s not practical, so instead you measure a sample of Americans.
Assuming your sampling’s RANDOM (you’re not only measuring Americans visiting cardiologists), the spread of HRs within your sample (SD) should reflect the spread of HRs among all Americans.
So if you were to take another RANDOM sample, it should have a similar spread w/in the sample (similar SD). If the SDs are NOT similar, you may have biased samples (I told you not to ask at the cardiologist’s office!). AND it should have a similar average to the 1st sample’s (low SEM), since they’re sampling the same population.
SD: if you were to randomly select a person in the study, how close to the sample average are they likely to be? (how certain are we that they’re a good representative of the sample?)
SEM: If you were to take the average of another random sample, how close is it likely to be to the real average? (how certain are we that the average from a random sample is a good representative of the entire population?)
SD can be really helpful in medicine because it can be used to determine whether some value is “normal.” So, for example, doctors don’t just measure your HR for fun – they want to see if it’s “abnormal” which could indicate a problem.
If they only compared your measured HR to the average heart rate, that wouldn’t be that useful. Because just knowing the average doesn’t tell you about the spread. Think about it – the type of average we’re talking about is the mean, which is where you just add up the values and divide by how many values you added up. So as long as you have something bigger to exactly cancel out something smaller, the mean won’t change no matter what is going on in either side.
For example, these all have the same mean:
4, 4, 4: (4+4+4)/3 = 4
1, 4, 7: (1+4+7)/3 = 4
3, 4, 5: (3+4+5)/3 = 4
So if your mean is 4 and you measure and get a 1 is that abnormal? In situation 2, not really. But, in situations 1 or 3, it is (obviously with just 3 values you can’t really know, but hopefully you see what I’m getting at). So, we need to know the spread of “normal” – we need a normal curve! (that’s not why they’re called normal curves, but…)
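A quick sketch of why the spread matters, using those same three toy data sets. The labels and the "how many SDs away is a 1?" calculation are my own additions for illustration.

```python
import statistics

# The three toy data sets: same mean, very different spreads
datasets = {
    "situation 1": [4, 4, 4],
    "situation 2": [1, 4, 7],
    "situation 3": [3, 4, 5],
}

# Store (mean, SD) for each one
summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in datasets.items()
}

measured = 1
for name, (mean, sd) in summary.items():
    if sd == 0:
        note = "no spread at all, so anything other than the mean stands out"
    else:
        note = f"a {measured} is {abs(measured - mean) / sd:.0f} SD(s) from the mean"
    print(f"{name}: mean = {mean}, SD = {sd:.1f} -> {note}")
```

Same mean every time, but a measurement of 1 goes from "totally unremarkable" to "way out there" depending on the SD.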
There have been a number of studies to try to determine what’s normal for resting heart rate. Here’s an example I found that was recently published. https://bit.ly/311kzBC
They took a bunch of people wearing Fitbit heart rate monitors and calculated their average resting heart rate (RHR). They found it varied a lot between people based on things like age, weight, sex, etc. and they broke it down further for those groups. But, overall, they found that the average RHR was 65.5 ± 7.7 bpm.
If we look at our 68-95-99.7 rule (68% of the data are w/in 1 SD of the mean (either above or below), 95% are w/in 2 SDs, & 99.7% are w/in 3 SDs) and apply it here, with a mean of 65.5 and an SD of 7.7, we get that
68% of values are between (65.5-7.7) and (65.5+7.7), so 57.8-73.2
95% of values are between (65.5-(2*7.7)) and (65.5+(2*7.7)), so 50.1-80.9
and 99.7% are between (65.5-(3*7.7)) and (65.5+(3*7.7)), so 42.4-88.6
So, if your average RHR is below 42.4 or above 88.6, you might wanna get that checked out…
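The arithmetic is easy to fumble by hand, so here is a tiny sketch that does it for any mean and SD. It plugs in the 65.5 ± 7.7 bpm figure quoted from the study above and assumes the RHRs are roughly normally distributed; the function name `sd_ranges` is just for this example.

```python
def sd_ranges(mean, sd, multiples=(1, 2, 3)):
    """Return {k: (low, high)}, the range within k SDs of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in multiples}


# Rule-of-thumb share of normally distributed data within 1, 2 and 3 SDs
rule_of_thumb = {1: "68%", 2: "95%", 3: "99.7%"}

rhr_ranges = sd_ranges(mean=65.5, sd=7.7)
for k, (low, high) in rhr_ranges.items():
    print(f"~{rule_of_thumb[k]} of values: {low:.1f} to {high:.1f} bpm")
```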
But how sure are we that we actually know the true average? What if we were to find the averages calculated for a bunch of different similar studies? Then we could find the average of those. And we can calculate the SEM and then, as promised, the CONFIDENCE INTERVAL, which tells you where you’re pretty sure that the true value is (typically, 95% sure). What a 95% CI means in the strict statistical sense is that, if you were to repeat the whole sampling process over and over and calculate a 95% CI each time, about 95% of those intervals would contain the true population value. In a more intuitive, though not strictly accurate, sense, a CI can be interpreted as saying: we’re 95% sure that the true mean falls somewhere in this interval. The higher the confidence level, the wider a net you have to cast.
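One way to get a feel for what the "95%" refers to is a simulation: invent a population where we KNOW the true mean, sample from it over and over, build a 95% CI from each sample, and count how often the interval catches the true mean. This is only a sketch under assumptions I picked for illustration: a normally distributed population with mean 65.5 and SD 7.7, samples of 30, and the standard table t-value of 2.045 for 29 degrees of freedom.

```python
import math
import random
import statistics


def ci_coverage(true_mean=65.5, true_sd=7.7, n=30, trials=2000,
                t_value=2.045, seed=0):
    """Fraction of simulated 95% CIs that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one random sample from the pretend population
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        sem = statistics.stdev(sample) / math.sqrt(n)
        # 95% CI for this sample: mean +/- t * SEM
        low, high = mean - t_value * sem, mean + t_value * sem
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


print(f"fraction of intervals containing the true mean: {ci_coverage():.3f}")
```

The fraction should land close to 0.95. Any single interval either contains the true mean or it doesn't; the 95% describes how often the method works in the long run.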
CI is calculated from the SE. SE varies based on how many samples you measure: the more samples, the lower the SE (and the more confident you are). You get the CI by multiplying the SE by a value called the t-value and then going that far above and below the mean (CI = mean ± t × SE). You can find charts of t-values that tell you what the t-value is for a certain confidence level and a certain # of samples. For a 95% CI, for n above 10 or so, the t-value is pretty close to 2 (about 2.26 at n = 10, creeping down toward 1.96 as n gets big), so you can estimate the 95% CI by doubling the SE. But, for n below 10, the t-value climbs, getting higher the fewer samples you have. This makes sense because the fewer samples you have, the less sure you are that your samples are really representative of the population. Therefore, you’re gonna have wider error bars (larger CI), which you get by multiplying SE by larger values. A benefit of CI is that it gives you a way to standardize and directly compare calculated values with different n-values.
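Here is a minimal sketch of that recipe. To keep it dependency-free, it hard-codes a few two-sided 95% t-values from a standard t-table (keyed by degrees of freedom, which is n - 1), so it only works for those sample sizes. The data are made-up densities, and `ci95` is a hypothetical helper name.

```python
import math
import statistics

# Two-sided 95% t-values from a standard t-table, keyed by degrees of freedom (n - 1)
T_95 = {2: 4.303, 4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}


def ci95(values):
    """Return (low, high) for the 95% CI of the mean: mean +/- t * SEM."""
    n = len(values)
    t_value = T_95[n - 1]  # raises KeyError for sample sizes not in the mini-table
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - t_value * sem, mean + t_value * sem


# Hypothetical average densities from five samples (made-up data)
densities = [1.9, 2.1, 2.0, 2.2, 1.8]

low5, high5 = ci95(densities)       # n = 5, t = 2.776
low3, high3 = ci95(densities[:3])   # n = 3, t = 4.303
print(f"n = 5: 95% CI = {low5:.2f} to {high5:.2f}")
print(f"n = 3: 95% CI = {low3:.2f} to {high3:.2f}")
```

With only 3 samples the t-value more than doubles compared to the "just double the SE" shortcut, which is the wider-error-bars penalty for a small n.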
⚠️ But be careful: a ± sign is used to show SD OR SEM OR CI! -> e.g. 21 +/- 1 g. Look to the figure legend to see what they’re using.
And speaking of that ± business… how many digits to use? That +/- is an indicator of certainty and you certainly can’t be more certain than the measuring device. So when it comes to sig figs, with error you just get 1! Even if your calculator says the error is 1.12836 you only report the 1. If your measuring device only lets you report 20 or 21 or 22, there’s no way that 21.112836 could ever get reported. In that case the number that the device would record would be 21, so you’d report 21 ± 1. more on that here: http://bit.ly/33hkCd8
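If you want the computer to do the rounding for you, here is one possible sketch: round the error to 1 significant figure, then round the value to the same decimal place. The helper `format_with_error` is hypothetical (not from any library), and this follows the simple 1-sig-fig convention described above; other conventions exist.

```python
import math


def format_with_error(value, error):
    """Format 'value ± error' with the error rounded to 1 significant figure."""
    if error <= 0:
        raise ValueError("error must be positive")
    # Position of the error's first significant digit (0 = ones, -1 = tenths, ...)
    exponent = math.floor(math.log10(error))
    rounded_error = round(error, -exponent)
    # Rounding can bump the error up a digit (e.g. 0.96 -> 1.0), so re-check
    exponent = math.floor(math.log10(rounded_error))
    rounded_error = round(rounded_error, -exponent)
    # Round the value to the same place as the error
    rounded_value = round(value, -exponent)
    decimals = max(0, -exponent)
    return f"{rounded_value:.{decimals}f} ± {rounded_error:.{decimals}f}"


print(format_with_error(21.112836, 1.12836))
print(format_with_error(65.53, 0.27))
```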
I got the figures I used for the examples in the graphics from this really great paper I encourage you to check out: “Error bars in experimental biology” https://bit.ly/39O4MdH
more on sig figs: http://bit.ly/2T1zN5r