Since two of my loves are data and the law, I have been interested in the recent controversy surrounding Edward Snowden’s leak of the NSA program. One of the original articles from the Washington Post states that the NSA will not search the content of a communication if key terms do not produce at least 51 percent confidence in a target’s “foreignness.”
Most articles I have read automatically assume that 51% confidence means a 51% probability. There is something wrong with making that assumption: statisticians reserve the word confidence to mean something other than probability. So, what is confidence in statistics? That is what this post tries to explain.
Confidence in statistics relates to the confidence interval and is more akin to the accuracy of the probability. When a statistician takes a sample from the whole population, they calculate a probability based upon that sample; confidence tells us how well the figure from the sample might match the actual percentage in the population.
From our NSA example, let’s say that a key term gives a 90% probability that the communication source is a foreigner. That is the probability. The statistician has taken, let’s say, 1,000 people and has found that 90% of the people in that sample who use the key term are foreign. However, the statistician realizes that there might be some difference between the 1,000-person sample and the whole population. We might find a 90% probability of foreignness in the sample, but the actual figure for the whole population could be 93%. That is why you see statistics that say, “There is a 90% +/- 5% probability.” The 5% is the error rate (more formally, the margin of error) that accounts for this probable difference between the sample and the whole population. That error rate is what statistical confidence is about. The statistician has a certain level of confidence that the key term indicates a foreign source somewhere between 85% and 95% of the time.
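To make the error rate concrete, here is a minimal sketch of how it could be computed for the example above. It assumes the usual normal-approximation formula for a sample proportion and a 95% confidence level; neither is stated in the Washington Post article, and the function name margin_of_error is just illustrative. With these particular numbers the error rate comes out closer to 2 points than 5, so treat the +/- 5% in the example as a round illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Half-width of a normal-approximation interval for a sample proportion.

    Assumption: a simple random sample that is large enough for the
    normal approximation to be reasonable.
    """
    # z is the number of standard errors needed to cover `confidence`
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error


p_hat = 0.90  # share of the sample that is foreign (from the example)
n = 1000      # sample size (from the example)

moe = margin_of_error(p_hat, n)
print(f"{p_hat:.0%} +/- {moe:.1%} at 95% confidence")
```

Notice that the confidence level is an input to the calculation, not something the data hands you.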
But how confident are they? The larger the error rate, the more confident the statistician is in the calculation. Think about it. If a key term has a foreign source 93% of the time in the whole population, a probability of 90% +/- 5% from the sample data will capture the whole population, but a probability of 90% +/- 1% won’t. A statistician has a higher confidence in the 5% error rate than in the 1% error rate. When a statistician says that there is a 90% +/- 5% probability of something, a key piece of information is being left out: the confidence. Usually, you can find the confidence in the small print below a published statistic. The statistician will calculate the confidence of the error rate. A 90% +/- 5% probability may have a 95% confidence. You can see why statistics uses a separate term for confidence. It would get confusing to write that it is 95% probable that it is 90% +/- 5% probable.
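If you want to see the trade-off between error rate and confidence for yourself, a small simulation helps. This is only a sketch under assumptions of my own choosing: a population where the key term is truly foreign 93% of the time, repeated random samples of 1,000, and a hypothetical helper called coverage that counts how often the sample figure plus or minus the error rate captures the true value.

```python
import random


def coverage(true_p, n, half_width, trials=2000, seed=0):
    """Fraction of simulated samples whose interval
    p_hat +/- half_width contains the true population value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample of n people; each is foreign with probability true_p
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        if abs(p_hat - true_p) <= half_width:
            hits += 1
    return hits / trials


narrow = coverage(0.93, 1000, 0.01)
wide = coverage(0.93, 1000, 0.05)
print(f"+/- 1%: {narrow:.0%} of samples capture the true value")
print(f"+/- 5%: {wide:.0%} of samples capture the true value")
```

The wider interval captures the true value in a larger share of the samples, and that share is the confidence.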
There are two common confidence rates that statisticians use: 68% and 95%. Next time you see a statistic that says 75% +/- 5%, look around for the small print showing the confidence. 68% may seem strange, but it has to do with how confidence is calculated: in a normal distribution, about 68% of values fall within one standard deviation of the average and about 95% fall within two. 51% is very rare because it has such a low accuracy rate. There really is no difference in the population, sample, or the calculation between 68% and 95% confidence. It is like the difference between 6 and 1/2 a dozen, or the difference between 90% foreign and 10% not foreign. Most lay people might be a little surprised that the error rate is this arbitrary. We can have a 90% +/- 5% probability with a 95% confidence, and the exact same data can have roughly a 90% +/- 2.5% probability of being foreign with a 68% confidence.
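Here is a short sketch of why the same data can be quoted at more than one confidence. It assumes the error rate comes from a normal approximation, so that it scales with the z-score of the chosen confidence; the helper names z_for and rescale_margin are mine. Under that assumption, a +/- 5% error rate at 95% confidence shrinks to about +/- 2.5% at 68% confidence.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_for(confidence):
    """Number of standard deviations that cover `confidence` of a normal curve."""
    return std_normal.inv_cdf(0.5 + confidence / 2)


def rescale_margin(margin, from_conf, to_conf):
    """Restate an error rate quoted at one confidence level at another level."""
    return margin * z_for(to_conf) / z_for(from_conf)


# Where 68% and 95% come from: the share of a normal curve
# within one and two standard deviations of the average.
within_one = std_normal.cdf(1) - std_normal.cdf(-1)
within_two = std_normal.cdf(2) - std_normal.cdf(-2)
print(f"within 1 sd: {within_one:.1%}, within 2 sd: {within_two:.1%}")

# The same 90% +/- 5% (at 95% confidence) restated at other levels
for level in (0.51, 0.68, 0.95):
    margin = rescale_margin(0.05, 0.95, level)
    print(f"{level:.0%} confidence: 90% +/- {margin:.1%}")
```

Nothing about the sample changes between the lines of output; only the confidence chosen to report it does.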
Since it is a little arbitrary, how do statisticians pick the confidence rate: 68% or 95%? I will have some mathematician angry at me, but it depends on how well the sample matches the whole population and which one looks best. If the data matches the population poorly, it looks better to use 68% confidence. Let’s say the probability is 75% and the error rate is 25% with a confidence of 95%. A statistician might reduce the confidence to 68%, since saying there is a 50% to 100% probability really doesn’t convey much. Saying roughly a 62% to 88% probability with 68% confidence looks a little better. Conversely, if the sample data matches the population really well and the error rate is low, statisticians love to brag and point out that it has a 95% confidence. The confidence rate is usually in the fine print below the published statistic. You will probably notice, from now on, that large error rates use a 68% confidence, and that when an error rate is really low, the 95% confidence rate in the small print is very discernible. Statisticians love to brag when their numbers work out well.
I don’t think that this adds anything to the privacy and security debate. I think they are just using the wrong term and mean a 51% probability, not a 51% confidence. First, as I stated, 51% confidence is really low and almost never used. Also, we don’t know what the probability is. You could be 51% confident that a key term has a 0.00001% probability of being foreign, which I am pretty sure isn’t what they mean. Confidence could also be a legal term, like beyond a reasonable doubt or a preponderance of the evidence. Nevertheless, this has given me a good opportunity to explain the difference between the statistical terms confidence and probability.