That feeling when your p-value is lower than your alpha, aww yeah! But what does it really mean? It’s one thing to say there is significance and on the surface it means the two things are different “enough” to be considered two things, but I think there’s a simpler way to explain it. So today we’re going to talk about what significance actually means in the practical sense. Maybe it’s super…
“Oi, ye proclaimers of scientific truth, how is it that the scientific method finds objective truths?”
“Through the scientific method, of course. Scientists make hypotheses (e.g. "Aspirin reduces headaches"), acquire data, then make conclusions. You’d be absurd to deny their proofs. Don’t show such folly by taking your philosophy to the extreme. Scientists are principled, rational, and logical. Take a leaf out of their book.”
“You’re speaking with a matter of fact (vintage behaviour for a scientist, I should add, and I am one!) but can you follow it up? You see, you’re placing faith in scientific realism: in the notion that science finds objective truths to contribute to real-world knowledge. But how? It is totally plausible that scientific theories are built on the *rejection* of hypotheses (e.g. through null hypotheses and p-values), not by finding objective truths. May I suggest some Karl Popper or, perhaps, some David Hume. Oh, yes, Hume. Consider this: Will the sky be blue tomorrow? While I believe that it will, I accept that the scientific method can’t show me why. Although we have deduced the phenomenon of Rayleigh scattering, this deduction is based on prior data and says nothing of tomorrow’s facts. There is no proof today that can serve as proof tomorrow. As absurd as this all seems to our personal intuitions, given our daily experiences of the sky, with one simple example we have reached the boundaries of the scientific method.”
The case for, and against, redefining "statistical significance."
There’s a huge debate going on in social science right now. The question is simple, and strikes near the heart of all research: What counts as solid evidence?
The answer matters because many disciplines are currently in the midst of a “replication crisis” where even textbook studies aren’t holding up against rigorous retesting. The list includes: ego depletion, the idea that willpower is a finite resource; the facial feedback hypothesis, which suggested if we activate muscles used in smiling, we become happier; and many more.
Scientists are now figuring out how to right the ship, to ensure scientific studies published today won’t be laughed at in a few years.
One of the thorniest issues with this question is statistical significance. It’s one of the most influential metrics to determine whether a result is published in a scientific journal.
Most casual readers of scientific research know that for results to be declared “statistically significant,” they need to pass a simple test. The answer to this test is called a p-value. And if your p-value is less than .05, bingo, you got yourself a statistically significant result.
Now a group of 72 prominent statisticians, psychologists, economists, sociologists, political scientists, biomedical researchers, and others want to disrupt the status quo. A forthcoming paper in the journal Nature Human Behavior argues that results should only be deemed “statistically significant” if they pass a higher threshold.
“We propose a change to P < 0.005,” the authors write. “This simple step would immediately improve the reproducibility of scientific research in many fields.”
This may sound nerdy, but it’s important. If the change is accepted, the hope is that fewer false positives will corrupt the scientific literature. It’s become too easy, using shady techniques known as p-hacking and outcome switching, to find some publishable result that reaches the .05 significance level.
“There’s a major problem using p-values the way we have been using them,” says John Ioannidis, a Stanford professor of health research and one of the authors of the paper. “It’s causing a flood of misleading claims in the literature.”
Don’t be mistaken: This proposal won’t solve all the problems in science. “I see it as a dam to contain the flood until we make sure we have the more permanent fixes,” Ioannidis says. He calls it a “quick fix.” Though not everyone agrees it’s the best course of action.
At best, the proposal is an easy change to implement to protect academic literature from faulty change. At worst, it’s a patronizing decree that avoids addressing the real problem at the heart of science’s woes.
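To get a feel for what moving the threshold does, here is a minimal sketch using a toy case rather than anything from the paper: a genuinely fair coin tossed 100 times and checked with an exact two-sided binomial test. The function names and the choice of 100 tosses are assumptions made for illustration only.

```python
from math import comb

def two_sided_p(heads: int, n: int) -> float:
    """Exact two-sided p-value under a fair-coin null: the probability of a
    head count at least as far from n/2 as the observed one."""
    deviation = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= deviation)
    return extreme / 2 ** n

def false_positive_rate(n: int, alpha: float) -> float:
    """Probability that a truly fair coin produces p < alpha in n tosses."""
    significant = sum(comb(n, k) for k in range(n + 1) if two_sided_p(k, n) < alpha)
    return significant / 2 ** n

rate_05 = false_positive_rate(100, 0.05)    # today's conventional cutoff
rate_005 = false_positive_rate(100, 0.005)  # the proposed stricter cutoff
print(f"alpha=0.05  -> false positive rate {rate_05:.4f}")
print(f"alpha=0.005 -> false positive rate {rate_005:.4f}")
```

In this honest, single-test setting the stricter cutoff simply lets far fewer flukes through. It says nothing about p-hacking or outcome switching, which is part of why critics call the proposal a quick fix.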
There is a lot to unpack and understand here. So we’re going to take it slow.
Anti-Ivermectin Deception in a Major Medical Journal?
The failure to reach a significant p-value can come from too few patients in a trial or from selecting an outcome, like COVID death, that's too infrequent to give a significant p-value. Do Big Pharma and JAMA use this against Ivermectin?
In the image above you can see the weakest to the strongest information categories with the strongest, most logically reliable type of study at the top (called Meta Analysis) and the weakest at the bottom (called Expert Opinion / Background Information). Of course meta analysis articles like the one linked here brought a swift response from the powerful gatekeepers who tried to push meta studies…
We’ve not done a critical appraisal nugget for a while now. No excuses, we’ve just been busy with other stuff, but we need to rectify that now. This week I sat down with Rick Body and talked through some of the issues around the use and abuse of p-values in research, and how that affects how we interpret them in critical appraisal.
Some useful links to stuff we’ve mentioned on the podcast.
Uncertainty Wednesday: The Problem with P-Values (Learning)
Today’s Uncertainty Wednesday will be the concluding post in my mini-series on the problem with p-values. We have already seen that it is much easier than expected to reject a null hypothesis if you have incentives to do so. We also saw that the ability to work backwards and generate hypotheses from the data is a big issue. Today we will consider a more foundational, epistemological problem with p-values: what is it that we are really learning when we are rejecting a null hypothesis?
Let’s once again consider the original example of a coin toss where our null hypothesis is that the coin is fair (and independent). We have done everything by the book. We had our null hypothesis ahead of time (not generated from the data). We did exactly 6 tosses and they all came up as heads (or tails for that matter), instead of cheating on our data collection. And so with great satisfaction we reject the null hypothesis at a p-value of 0.03125.
But what does that actually mean? What have we really learned from doing so? Our null hypothesis here is incredibly narrow. It is that the coin is precisely fair. Rejecting that leaves open a ton of other possibilities. Is the coin just slightly unfair or is it extremely unfair? Which of these two possibilities is more likely given what we have observed? And why did we pick this narrow null hypothesis in the first place?
Let’s take a step back. Suppose I don’t tell you that we are dealing with a coin, just with a process that has two possible observable signals H and T. If you know nothing else about the process, that allows for anything from observing only Hs to only Ts to some random mix of the two. That makes it clear that having as your null hypothesis that the mix will be random at exactly 50% Hs and 50% Ts is an incredibly narrow assumption. It is picking a single real number, 0.5, on a continuous interval from 0 (no Hs) to 1 (all Hs).
This is related to the issue we encountered previously with spurious correlation. A null hypothesis of zero correlation between two variables is an incredibly narrow assumption, when possible correlation is a continuous interval from -1 to +1. So again, when we reject that narrow hypothesis what have we actually learned? Only that some very narrowly defined assumption is unlikely. That’s not a lot of learning.
This is a fundamental limitation of the p-values approach. Generally people tend to pick very narrow null hypotheses and rejecting them doesn’t tell us much about the alternatives. Now this can be seen as a slightly unfair criticism. If you get a p-value of 0.0000001 on a coin toss and you do it with a large number of tosses you have the information that the coin is likely to be very unfair. But with the p-values approach that additional step tends to be buried.
What is the alternative? The alternative is to take a Bayesian approach instead. We already saw in the case of correlation how that provides a lot more information than the rejection of a null hypothesis.
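To make that concrete for the six-heads example, here is a minimal sketch of one possible Bayesian treatment. It assumes a uniform prior over the coin's probability of heads and uses a simple grid approximation; both are choices made for this illustration, and `posterior_on_grid` is a hypothetical helper name.

```python
def posterior_on_grid(heads, tails, points=10_000):
    """Grid approximation to the posterior over the coin's heads-probability,
    starting from a uniform prior (an assumption made for this sketch)."""
    grid = [(i + 0.5) / points for i in range(points)]
    # Likelihood of the observed flips at each candidate heads-probability.
    weights = [t ** heads * (1 - t) ** tails for t in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

grid, post = posterior_on_grid(heads=6, tails=0)

posterior_mean = sum(t * w for t, w in zip(grid, post))
prob_favors_heads = sum(w for t, w in zip(grid, post) if t > 0.5)
prob_very_unfair = sum(w for t, w in zip(grid, post) if t > 0.9)

print(f"posterior mean of P(H):   {posterior_mean:.3f}")
print(f"P(coin favors heads):     {prob_favors_heads:.3f}")
print(f"P(P(H) greater than 0.9): {prob_very_unfair:.3f}")
```

Under these assumptions the posterior mean comes out near 0.875, the coin almost certainly favors heads, and yet there is only about a coin flip's chance that it is extremely unfair. That directly addresses the "slightly unfair or extremely unfair?" question that the rejected null leaves open.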
Uncertainty Wednesday: The Problem with P-Values (Generating Hypotheses)
Today’s Uncertainty Wednesday continues our exploration into p-values and why they are problematic. Last Wednesday we saw that if you have incentives to reject a null hypothesis, it takes less work than you would initially think to find data that gets you there. I ended that post suggesting that the problem is even bigger than that. How so?
We now live in the age of “big data” -- researchers in many fields have access to massive data sets. This lends itself to an approach that has become known as “data dredging.” Instead of starting with the null hypothesis of a “fair and independent coin” we start with a large database of pre-recorded coin flips. Now we work backwards to find a hypothesis that we can reject with a p-value of 0.05 or maybe even 0.01 in our data set!
How would we do such a thing and what would such a hypothesis look like? Well with a dataset containing just Hs and Ts we would have to be a bit creative. But we could generate hypotheses that take the form of a probabilistic finite state machine. For instance: the coin first has a probability of 20% H and 80% T, if H it has a subsequent probability of 70% H again, but if T then it only has a 10% probability of repeating T. You get the idea. You could write computer code that generates such hypotheses until you find one that you can reject with a really significant p-value in your dataset. Then you go and publish!
Now you might object: Albert, these are completely arbitrary hypotheses, why would anyone believe these? Well, they only come across as arbitrary because I purposely stayed within the domain of a coin flip. But most big datasets are really complex, containing many different variables. Just take the coin flip database and combine it with a database of stock price fluctuations. Now you can test tons of different hypotheses of the form: price movements for stock x are not correlated with the coin flips (where H might be stock price for x moves up and T it moves down).
Again you can have your computer generate these hypotheses for you and test them until you find one you can reject with a p-value that’s deemed significant. These hypotheses are just as arbitrary as the coin state machines I suggested above, but they don’t look that way. They look really simple and thus credible.
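Here is a minimal sketch of what such dredging could look like. Everything in it is made up for illustration: the "stocks" are just independent random up/down sequences, the test is an exact two-sided binomial test on how often a stock's move matches the coin, and the names `dredge` and `two_sided_p` are hypothetical.

```python
import random
from math import comb

def two_sided_p(matches, n):
    """Exact two-sided p-value for 'matches occur with probability 0.5'."""
    deviation = abs(matches - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= deviation)
    return extreme / 2 ** n

def dredge(num_stocks=1000, num_days=100, alpha=0.05, seed=0):
    """Test every simulated stock against the same coin flips and keep the
    ones whose 'no relationship' null gets rejected at the given alpha."""
    rng = random.Random(seed)
    coin = [rng.choice("HT") for _ in range(num_days)]
    hits = []
    for stock in range(num_stocks):
        # H = price moved up, T = price moved down; unrelated to the coin by construction.
        moves = [rng.choice("HT") for _ in range(num_days)]
        matches = sum(c == m for c, m in zip(coin, moves))
        p = two_sided_p(matches, num_days)
        if p < alpha:
            hits.append((stock, p))
    return hits

hits = dredge()
print(f"{len(hits)} of 1000 unrelated 'stocks' look significantly tied to the coin")
```

None of these series has anything to do with the coin, yet a handful will clear the 0.05 bar purely because so many hypotheses were tried. Pick the best-looking one and you have a simple, credible-sounding finding.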
But this approach completely violates the statistical reasoning behind p-values. That reasoning only applies if you start with the hypothesis and then apply the test. In any large dataset you will always be able to work backwards towards hypotheses that can be rejected *in that dataset*. Just recall the prior posts about spurious correlation.
OK, so that’s pretty bad given that so many people have incentives to find hypotheses they can reject so that they can publish a paper or claim that a product is effective. But next Wednesday we will look into an even more profound problem with p-values.
Uncertainty Wednesday: The Problem with P-Values (Intro)
As promised for some time, today in Uncertainty Wednesday, I will talk about p-values and what makes them so problematic. We will once again look at a super simple example by going back to considering a coin flip. As before we will consider the highest uncertainty explanation which is that the mechanism producing the coin flip produces heads (H) and tails (T) each with probability 0.5 and that each flip is independent.
Now the idea behind p-values is to attempt an argument of reductio ad absurdum but in a setting with uncertainty. We will assume that the explanation is true (this is also called the null hypothesis) and then see if our observations are so unlikely that they amount to a contradiction of our assumption. In the coin example: we will start by assuming that H and T are equally probable, then we will observe a sequence of Hs and Ts and if that sequence is really unlikely given our assumption, then we will reject that assumption.
Now the first question we have to ask ourselves is what does it mean for a sequence to be really unlikely given our assumption? This is an important question because we know that every sequence of a given length is actually equally probable given our assumption of equal probability and independence. What do I mean by that? Let’s take sequences of length 6 for example:
And we see that each of them is equally likely (or unlikely) given our assumptions. So this would not seem to help us much at all!
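For anyone who wants to check this, a quick sketch that enumerates every length-6 sequence and assigns each its probability under the fair, independent coin assumption (the variable names are just for illustration):

```python
from fractions import Fraction
from itertools import product

# Every possible sequence of 6 flips, e.g. "HHTHTT".
sequences = ["".join(flips) for flips in product("HT", repeat=6)]

# Under a fair, independent coin each flip contributes a factor of 1/2.
probability = {seq: Fraction(1, 2) ** len(seq) for seq in sequences}

print(len(sequences))                                      # 64
print(probability["HHHHHH"], probability["HTHHTT"])        # 1/64 1/64
```

A streak of six heads is exactly as probable as any particular jumbled-looking sequence.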
So how could we distinguish between these sequences? The idea is to compute a statistic, i.e. a condensation of the data. The statistic we might be most interested in here is the sample mean. Let’s say we take H=1 and T=0, then the sample means are as follows:
Now we are getting somewhere. There are many sequences that will give us a mean of 0.5 or close to it. There are only 2 sequences that will give us the most extreme means: HHHHHH (mean 1) and TTTTTT (mean 0).
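A short sketch that tallies how many of the 64 sequences produce each sample mean, using the same H=1, T=0 coding (the variable name `mean_counts` is an illustrative choice):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Code H as 1 and T as 0, then count sequences by their sample mean.
mean_counts = Counter(
    Fraction(sum(flips), 6) for flips in product((1, 0), repeat=6)
)

for mean in sorted(mean_counts):
    print(f"mean {float(mean):.3f}: {mean_counts[mean]} sequences")
```

The tally is 1, 6, 15, 20, 15, 6, 1 going from a mean of 0 up to a mean of 1: twenty sequences land exactly on 0.5, while each of the two most extreme means comes from a single sequence.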
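The p-value then is defined as the probability, given our explanation (aka assumption, aka null hypothesis), of a sample statistic at least as extreme as the one observed. So in our example the p-value of observing mean = 1, counting all heads and all tails as equally extreme, is 2 × (1/2)^6 = 2/64 = 0.03125.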
Finding a sample mean of 1 given our assumptions has a p-value of 0.03125. That is less than 0.05 which is often used as the cutoff in many studies across fields as diverse as medicine and education. Following that approach we would thus reject our explanation that both H and T are equally probable.
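The same number can be reproduced by brute force. This sketch treats "extreme" as two-sided, meaning at least as far from 0.5 as the observed mean, which is the reading under which all heads and all tails count together; the function name `p_value` is just an illustrative choice.

```python
from fractions import Fraction
from itertools import product

def p_value(observed_mean, n=6):
    """Share of equally likely length-n sequences whose sample mean is at
    least as far from 0.5 as the observed mean (two-sided, fair-coin null)."""
    half = Fraction(1, 2)
    distance = abs(observed_mean - half)
    sequences = list(product((1, 0), repeat=n))  # H=1, T=0
    extreme = [s for s in sequences if abs(Fraction(sum(s), n) - half) >= distance]
    return Fraction(len(extreme), len(sequences))

print(float(p_value(Fraction(1))))     # 0.03125, i.e. 2 of 64 sequences
print(float(p_value(Fraction(5, 6))))  # 0.21875, i.e. 14 of 64 sequences
```

Note how quickly the p-value climbs: a single tail among the six flips already puts the result well above the 0.05 cutoff.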
Now all of this sounds super logical. There doesn’t seem to be some obvious error of reasoning. And yet the use of p-values is wildly problematic. Over the next few posts we will explore why.
As “homework,” you might consider the following scenario: you are a researcher who gets paid only if you reject the explanation of equal probability with a p-value cutoff of 0.05. How much work do you have to do to come up with a sequence of observations that gets you the desired result?
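If you want to experiment with the homework, here is a sketch of one possible reading of it, not a definitive answer. It assumes the researcher simply keeps running fresh 6-toss experiments with a fair coin and reports the first one that comes up all heads or all tails; the function name and the number of simulated researchers are illustrative.

```python
import random

def experiments_until_rejection(rng, tosses=6):
    """Run 6-toss experiments with a fair coin until one is all heads or all
    tails (p = 0.03125 < 0.05); return how many experiments that took."""
    attempts = 0
    while True:
        attempts += 1
        flips = [rng.random() < 0.5 for _ in range(tosses)]
        if all(flips) or not any(flips):
            return attempts

rng = random.Random(42)
runs = [experiments_until_rejection(rng) for _ in range(5000)]
average = sum(runs) / len(runs)

# Each experiment "succeeds" with probability 2 * (1/2)^6, so on average
# the researcher needs the reciprocal of that many experiments.
expected = 1 / (2 * 0.5 ** 6)

print(f"simulated average: {average:.1f} experiments")
print(f"theoretical value: {expected:.1f} experiments")
```

Under these assumptions it takes about 32 short experiments on average, fewer than 200 coin flips, to "discover" that a perfectly fair coin is unfair.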