To P or Not to P
The exercise of hypothesis testing in experimental research has long been a fundamental practice, and is commonplace in publications of medical research.
The technical definition of a P-value confuses many researchers. Without going into the details, in a standard randomized controlled trial comparing two interventions A and B, a P-value is a number between 0 and 1 giving the probability, calculated under the assumption that the null hypothesis is true, of observing a difference between A and B at least as large as the one actually seen. It is not the probability that either hypothesis is true. The two competing hypotheses are the null hypothesis (which states that there is no difference in the relative effectiveness of A versus B for some relevant clinical outcome of interest) and the alternative hypothesis (which states that some form of difference between A and B exists).
A classical approach to hypothesis testing would apply an appropriate statistical test to evaluate this pair of hypotheses. Such a test would return what is commonly known as a P-value, often used by researchers to interpret their findings. The exercise of hypothesis testing and the use of P-values have long been topics of debate. Investigators often have a primary goal of achieving statistical significance (typically taken to mean a P-value less than 0.05) rather than estimating the difference of interest. This is problematic on two counts: the question posed by the hypothesis test itself, and the reliance on P-values to answer it.
First, in regard to the standard exercise of hypothesis testing, asking the question of whether or not the level of response in groups of patients receiving A and B is exactly the same (as the null hypothesis suggests) is a question whose answer is virtually always known a priori: “no” is that answer, as the probability of being exactly the same is remarkably small.
A more relevant question, to the extent that it has important clinical implications, is whether or not the levels of response in groups A and B are sufficiently different. Second, regarding reliance on P-values, these measures suggest only whether or not the difference between A and B is non-zero; this information is less interesting than an estimate of the size of the difference, which is more helpful in clinical interpretation.
The use of P-values alone fails readers on several fronts: they do not provide insight as to the magnitude of difference in the clinical response of A versus B; they provide no indication of the direction of the difference between A and B; and they rely on a standard cut-off (typically 0.05) for determining “important” and “unimportant” results, a simplistic practice that can potentially lead to misinterpretations.
P-values are also highly influenced by the sample size of a study. Studies with particularly large numbers of patients may return P-values indicating a statistically significant difference between A and B that is clinically negligible. Conversely, studies of limited sample size may identify clinically important differences but fail to achieve statistical significance because of the small sample size or other issues.
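The effect of sample size can be illustrated with a rough sketch. The code below assumes a binary outcome (response or no response) and uses a pooled two-proportion z-test, which is only one of several tests that could be applied; the function name and the patient counts are hypothetical and chosen purely for illustration.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided P-value from a pooled two-proportion z-test (illustrative only)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled response rate under the null hypothesis of no difference.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical large trial: 51% vs 50% response, 100,000 patients per arm.
p_large = two_proportion_p_value(51_000, 100_000, 50_000, 100_000)

# Hypothetical small trial: 60% vs 40% response, 20 patients per arm.
p_small = two_proportion_p_value(12, 20, 8, 20)

print(f"large trial, 1-point difference:  P = {p_large:.2g}")
print(f"small trial, 20-point difference: P = {p_small:.2g}")
```

Under these assumed numbers, the one-percentage-point difference in the large trial falls below the 0.05 threshold, while the twenty-point difference in the small trial does not.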
It has been suggested several times in prominent journals that dependence upon P-values in the reporting of research findings be replaced by the use of confidence intervals. Confidence intervals focus on the concept of estimation mentioned earlier; they accompany a “best guess” of the difference between A and B with a range of plausible values within which the true difference between A and B may lie.
This approach enables clinical interpretation and avoids sacrificing information relevant to readers. The direction of effect is immediately clear, and an unnecessary “yes” or “no” decision as to the importance of the finding is replaced with a more thoughtful consideration of the data. These advantages produce a more transparent presentation of clinical findings that is of greater value to interested medical professionals, and lead to a more sensible interpretation of findings by the researchers involved.
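As a minimal sketch of what such a report might contain, the code below computes a point estimate and an approximate 95% confidence interval for the difference in response rates between A and B. It assumes a binary outcome and uses the simple Wald (normal approximation) interval, which can be inaccurate for small samples; the function name and the trial counts are hypothetical.

```python
import math


def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Point estimate and approximate 95% Wald interval for p_a - p_b."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    # Unpooled standard error of the difference in proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Hypothetical large trial: 51% vs 50% response, 100,000 patients per arm.
diff_large, low_large, high_large = risk_difference_ci(51_000, 100_000, 50_000, 100_000)

# Hypothetical small trial: 60% vs 40% response, 20 patients per arm.
diff_small, low_small, high_small = risk_difference_ci(12, 20, 8, 20)

print(f"large trial: {diff_large:+.3f} ({low_large:+.3f} to {high_large:+.3f})")
print(f"small trial: {diff_small:+.3f} ({low_small:+.3f} to {high_small:+.3f})")
```

With these assumed counts, the interval for the large trial is narrow and sits entirely above zero but close to it, so the reader can see that the effect is small. The interval for the small trial is wide and spans zero, yet it also includes large differences, which a bare "not significant" label would hide.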
Confidence intervals rather than P values: estimation rather than hypothesis testing. Gardner MJ, Altman DG. BMJ 1986;292:746-750.
Confidence interval or p-value? Du Prel JB, Hommel G, Rohrig B, Blettner M. Dtsch Arztebl Int 2009;106(19):335-9.
P values: what they are and what they are not. Schervish MJ. Am Stat 1996;50(3):203-6.
That confounded p-value. Lang JM, Rothman KJ, Cann CI. Epidemiology 1998;9(1):7-8.