Here are some alternatives to the p-value that you can compute and report: confidence intervals, Bayes factors, and magnitude-based inference. These apply in mostly the same situations as p-values, and all of them are more informative, especially in combination with a p-value.
The Problems with P-Values
First, what is the p-value, and why do people hate it? The p-value is the probability of obtaining evidence at least as extreme as what you observed against the null hypothesis, assuming that null hypothesis is actually true.
There are some complications with this definition. First, “as extreme” needs to be further clarified by specifying a one-sided or two-sided alternative hypothesis. Another issue is that we treat the null hypothesis as if it were already true. If the parameter comes from a continuous distribution, the chance of it being any single given value is zero, so we’re assuming something that is impossible by definition. Worse, a hypothesis about a continuous parameter could be false by some trivial amount that would take an extremely large sample to detect.
P-values also convey little information on their own. When used to describe effects or differences, they can only really reveal whether some effect can be detected. We use terms like “statistically significant” to describe this detectability, which makes the problem more confusing. The word “significant” sounds like the effect should be meaningful in real-world terms; it isn’t.
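To make that concrete, here is a small simulated illustration in R (the data and the 0.01 effect size are invented for demonstration): with a large enough sample, even a trivially small difference produces a tiny p-value.

```r
# Simulated illustration: a trivially small effect becomes
# "statistically significant" with a large enough sample.
set.seed(42)
n <- 1e6
x <- rnorm(n, mean = 0)       # control group
y <- rnorm(n, mean = 0.01)    # shifted by a trivial 0.01 standard deviations

t.test(x, y)$p.value          # far below 0.05: detected, but meaningless in practice
```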
P-values are sometimes used as an automatic tool to decide if something is publication-worthy (this is not as pervasive as it was even ten years ago, but it still happens). There’s also undue reverence for the threshold of 0.05. If a p-value is less than 0.05, even by a little, then the effect or difference it describes is (sometimes) seen as much more important than if the p-value were even a little greater than 0.05. There is no meaningful difference between p-values of 0.049 and 0.051, but using default methods, the smaller p-value leads to the conclusion that an effect is ‘significant’ where the larger one does not. Adapting to this reverence for 0.05, some researchers make small adjustments to their analysis when a p-value is slightly above 0.05 in order to push it below that threshold artificially. This practice is called p-hacking.
So, we have an unintuitive, but very general, statistical
method that gets overused by one group and reviled by another. These two groups
aren't necessarily mutually exclusive.
The general-purpose nature of p-values is fantastic, though; it’s hard to beat a p-value for appropriateness in varied situations. P-values aren’t bad, they’re just misunderstood. They’re also not alone.
Confidence intervals
Confidence intervals are ranges constructed so that, across repeated samples, a fixed proportion of them (say, 95%) contain the true parameter value. In many cases confidence intervals are computed alongside p-values by default. A hypothesis test can be conducted by checking whether the confidence interval includes the null hypothesis value for the parameter. If we were looking for a difference between two means, the null hypothesis would be that the difference is 0, and we would check whether the confidence interval includes 0. If we were looking for a difference in odds, we could get a confidence interval for the odds ratio and check whether it includes 1.
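As a minimal sketch in R (with simulated data standing in for a real study), the same t.test() call that produces a p-value also returns the interval, and the test reduces to checking whether 0 falls inside it:

```r
# Two simulated groups; t.test() reports both a p-value and a 95% CI.
set.seed(1)
a <- rnorm(50, mean = 10, sd = 2)
b <- rnorm(50, mean = 11, sd = 2)

fit <- t.test(a, b)
fit$conf.int              # 95% CI for the difference in means
fit$p.value               # the usual p-value

# Equivalent test at the 5% level: does the interval exclude 0?
ci <- fit$conf.int
ci[1] > 0 || ci[2] < 0    # TRUE -> reject the null of no difference
```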
There are two big advantages to confidence intervals over p-values.
First, they explicitly state the parameter being estimated. If we're estimating
a difference of means, the confidence interval will also be measured in terms
of a difference. If we're estimating a slope effect in a linear regression model,
the confidence interval will give the probable bounds of that slope effect.
The other, related, advantage is that confidence intervals
imply the magnitude of the effect. Not only can we see if a given slope or difference
is plausibly zero given the data, but we can get a sense of how far from zero
the plausible values reach.
Furthermore, confidence intervals expand nicely into
two-dimensional situations with confidence bands, and into multi-dimensional
situations with confidence regions. There are Bayesian analogues called credible intervals and credible regions, which have similar end results to confidence intervals and regions, but different mathematical interpretations.
Bayes factors
Bayes factors are used to compare pairs of hypotheses; for simplicity, let’s call these the alternative and the null. If the Bayes factor of an alternative hypothesis is 3, that implies the data are three times as likely under the alternative as under the null (and, with equal prior odds, that the alternative is three times as likely as the null given the data).
The simplest implementation of a Bayes factor compares two hypotheses that each fix the parameter at some value, like a difference of means of 5 versus a difference of 0, or a slope coefficient of 3 versus a slope of 0. However, we can also set the alternative hypothesis value to our best (e.g. maximum likelihood, or least squares) estimate of that value. In this case the Bayes factor is never less than 1, and it grows naturally as the estimate moves further from the null hypothesis value. For these situations we typically use the log Bayes factor instead.
As with p-values, we can set thresholds for rejecting a null hypothesis. For example, we may use the informal convention that a Bayes factor of 10 constitutes strong evidence for the alternative hypothesis, and reject any null hypothesis for tests that produce a Bayes factor of 10 or greater. This has an advantage over p-values: it gives a concrete interpretation of one hypothesis as more likely than another, instead of relying on the assumption that the null is true. Furthermore, stronger evidence produces a larger Bayes factor, which makes it more intuitive for people expecting a large number for strong evidence. In programming languages like R, computing Bayes factors is nearly as simple as computing p-values, albeit more computationally intense.
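As one illustration (using the BayesFactor package, which is one of several options, and the same kind of simulated two-group data as above):

```r
# Sketch using the BayesFactor package (install.packages("BayesFactor")).
library(BayesFactor)

set.seed(1)
a <- rnorm(50, mean = 10, sd = 2)
b <- rnorm(50, mean = 11, sd = 2)

# Bayes factor for a difference between the two group means,
# comparing the alternative against the null of no difference.
bf <- ttestBF(x = a, y = b)
bf                  # printed as evidence for the alternative over the null
extractBF(bf)$bf    # the numeric Bayes factor itself
```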
Magnitude-based inference
Magnitude-based inference (MBI) operates a lot like confidence intervals, except that it also incorporates information about biologically significant effects. MBI requires a confidence interval (generated
in the usual ways) and two researcher-defined thresholds: one above and one
below the null hypothesis value. MBI was developed for physiology and medicine,
so these thresholds are usually referred to as the beneficial and detrimental thresholds,
respectively.
If we only had a null hypothesis value and a confidence interval, we could make one of three inferences: the parameter being estimated is less than the null hypothesis value, it is more than the null hypothesis value, or it is uncertain. These correspond to the confidence interval being entirely below the null hypothesis value, entirely above it, and straddling it, respectively.
With these two additional thresholds, we can make a greater range of inferences (a code sketch of these rules follows the list). For example:

- If a confidence interval is entirely beyond the beneficial threshold, we can say with some confidence that the effect is beneficial.

- If the confidence interval is entirely above the null hypothesis value, but includes the beneficial threshold, we can say with confidence that the effect is real and non-detrimental, and that it may be beneficial.

- If a confidence interval includes the null hypothesis value but no other threshold, we can say with some confidence that the effect is trivial. In other words, we don't know what the value is, but we're reasonably sure it isn't large enough to matter.
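Here is a minimal sketch of how those rules could be encoded, assuming a two-sided interval and researcher-chosen thresholds. The function name classify_mbi, the labels, and the example threshold values are illustrative, not a standard implementation:

```r
# Illustrative MBI-style classification of a confidence interval.
# 'lower'/'upper' bound the CI; 'null' is the null hypothesis value;
# 'detrimental' < null < 'beneficial' are researcher-defined thresholds.
classify_mbi <- function(lower, upper, null = 0,
                         detrimental = -0.2, beneficial = 0.2) {
  if (lower > beneficial)       "beneficial"
  else if (upper < detrimental) "detrimental"
  else if (lower > null)        "real and non-detrimental, possibly beneficial"
  else if (upper < null)        "real and non-beneficial, possibly detrimental"
  else if (lower > detrimental && upper < beneficial) "trivial"
  else                          "unclear"
}

# Usage with a CI from an ordinary one-sample t-test on simulated data:
ci <- t.test(rnorm(40, mean = 0.1), mu = 0)$conf.int
classify_mbi(ci[1], ci[2])
```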
MBI offers much greater insight than a p-value or a confidence interval alone, but it does require some additional expertise from outside of statistics in order to determine the minimum beneficial effect or the minimum detrimental effect. These thresholds sometimes involve guesswork and often involve researcher discretion, so MBI also opens up a new avenue for p-hacking. However, as long as the thresholds are transparent, it’s easy for readers to check the work for themselves.