Statistical power
From Wikipedia, the free encyclopedia
The power of a statistical test is the probability that the test will reject a false null hypothesis (that it will not make a Type II error). As power increases, the chances of a Type II error decrease. The probability of a Type II error is referred to as the false negative rate (β). Therefore power is equal to 1 − β.
Power analysis can be used to calculate the minimum sample size required for a statistical test to detect an effect of a given size with a given probability (the power).
A priori vs. post hoc analysis
Power analysis can be done either before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) the data are collected. A priori power analysis is conducted before the research study, and is typically used to determine the sample size needed to achieve adequate power. Post hoc power analysis is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power of the study was, assuming that the effect size in the sample equals the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, the usefulness of retrospective techniques is controversial [1].
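The two kinds of analysis can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive procedure: it assumes a one-sided, one-sample z-test with known variance, an effect size expressed in standard-deviation units, and hypothetical function names and input values chosen only for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """A priori analysis: smallest n that reaches the target power.

    Assumes a one-sided one-sample z-test (an illustrative simplification).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold
    z_power = STD_NORMAL.inv_cdf(power)       # quantile for the target power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

def achieved_power(effect_size, n, alpha=0.05):
    """Post hoc analysis: power implied by a given n and effect size."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - effect_size * sqrt(n))

# Planning stage: hypothetical anticipated effect of 0.5 standard deviations.
n_planned = required_sample_size(effect_size=0.5)
print("planned sample size:", n_planned)

# After the study: hypothetical observed effect of 0.3 standard deviations.
print("post hoc power:", round(achieved_power(effect_size=0.3, n=n_planned), 3))
```

The post hoc figure depends entirely on treating the observed effect size as the population value, which is the assumption that makes retrospective analysis controversial.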
Background
Statistical tests use data from samples to determine whether differences or similarities exist in a population. That is, they ask whether the criteria used to select the samples divide the population into statistically distinct sub-populations. For example, to test the null hypothesis that the mean scores of men and women on a test do not differ, samples of men and women are drawn, the test is administered to them, and the mean score of one group is compared to that of the other group using a statistical test. The power of the test is the probability that the test will find a statistically significant difference between men and women, as a function of the size of the true difference between those two populations. Note that power is the probability of finding a difference that does exist, as opposed to the probability of declaring a difference that does not exist (which is known as a Type I error).
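The idea that power is a function of the true difference can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: both groups are taken to be normally distributed with a known standard deviation of 1, a two-sided two-sample z-test is used in place of whatever test a real study would choose, and the group size, number of simulations and function name are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulated_power(true_difference, n_per_group, alpha=0.05,
                    n_simulations=2000, seed=0):
    """Estimate power as the fraction of simulated studies that reject H0.

    Assumes normal scores with known standard deviation 1 in both groups
    (an illustrative simplification).
    """
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    standard_error = sqrt(2 / n_per_group)  # SE of the difference in means
    rejections = 0
    for _ in range(n_simulations):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        z = (fmean(group_b) - fmean(group_a)) / standard_error
        if abs(z) > critical_value:
            rejections += 1
    return rejections / n_simulations

for difference in (0.0, 0.25, 0.5):
    rate = simulated_power(difference, n_per_group=50)
    print(f"true difference = {difference}: rejection rate = {rate:.3f}")
```

When the true difference is zero, the rejection rate estimates the Type I error rate rather than the power; for nonzero differences it estimates the power.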
Statistical power depends on:
- the statistical significance criterion used in the test
- the size of the difference or the strength of the similarity (that is, the effect size) in the population
- the sensitivity of the data.
A significance criterion is a statement of how unlikely a result must be, if the null hypothesis is true, to be considered significant. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of obtaining a difference at least as large as the one observed, assuming the null hypothesis is true, must be less than 0.05, and so on. One way to increase the power of a test is to increase (that is, weaken) the significance level. This increases the chance of obtaining a statistically significant result (rejecting the null hypothesis) when the null hypothesis is false; that is, it reduces the risk of a Type II error. But it also increases the risk of obtaining a statistically significant result when the null hypothesis is in fact true; that is, it increases the risk of a Type I error.
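The trade-off between the significance level and power can be made concrete with a small calculation. The sketch below assumes a one-sided, one-sample z-test with known variance; the effect size of 0.5 standard deviations and the sample size of 25 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha):
    """Approximate power of a one-sided one-sample z-test.

    effect_size is the true mean shift in standard-deviation units
    (an assumed value), n is the sample size.
    """
    std_normal = NormalDist()
    critical_value = std_normal.inv_cdf(1 - alpha)  # stricter alpha, higher threshold
    # Under the alternative, the test statistic is centered at effect_size * sqrt(n).
    return 1 - std_normal.cdf(critical_value - effect_size * n ** 0.5)

powers = {alpha: z_test_power(effect_size=0.5, n=25, alpha=alpha)
          for alpha in (0.05, 0.01, 0.001)}
for alpha, power in powers.items():
    print(f"alpha = {alpha}: power = {power:.3f}")
```

Under these assumptions, each tightening of the criterion lowers the power, which is the Type I versus Type II trade-off described above.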
Calculating the power requires first specifying the effect size under the alternative (non-null) hypothesis, that is, the size of the effect the test is expected to detect. All else being equal, the greater the effect size, the greater the power.
Sensitivity can be increased by using statistical controls, by increasing the reliability of measures (as in psychometric reliability), and by increasing the size of the sample. Increasing sample size is the most commonly used method for increasing statistical power.
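The influence of sample size and measurement reliability on power can be sketched as follows. The example rests on two assumptions that are not part of the text above: a one-sided z-test approximation, and the classical test theory result that measurement error shrinks the observable standardized effect by the square root of the reliability. The true effect of 0.4 and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(true_effect, n, reliability=1.0, alpha=0.05):
    """One-sided z-test power with an attenuation correction for reliability."""
    std_normal = NormalDist()
    # Assumption (classical test theory): unreliable measures inflate the
    # observed standard deviation, so the standardized effect shrinks.
    observed_effect = true_effect * sqrt(reliability)
    critical_value = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(critical_value - observed_effect * sqrt(n))

reliabilities = (0.6, 0.8, 1.0)
print("n    " + "  ".join(f"rel={r}" for r in reliabilities))
for n in (10, 20, 40, 80):
    row = "  ".join(f"{approximate_power(0.4, n, r):7.2f}" for r in reliabilities)
    print(f"{n:<4} {row}")
```

Reading down a column shows the gain from a larger sample; reading across a row shows the gain from a more reliable measure at a fixed sample size.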
A common misconception among those new to statistical power is that power is a property of a study or experiment as a whole. In reality, power is a property of each individual test: any statistical result that has a p-value has an associated power. For example, in the context of a single multiple regression, there will be a different level of statistical power associated with the overall R-squared and with each of the regression coefficients. When determining an appropriate sample size for a planned study, it is important to consider that power will vary across the different hypotheses.
There are times when the recommendations of power analysis regarding sample size will be inadequate. Power analysis is appropriate when the concern is with the correct acceptance or rejection of a null hypothesis. In many contexts, the issue is less about determining whether there is a difference and more about obtaining a refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of about 30 will give us approximately 80% power (alpha = .05, two-tailed). However, in doing this study we are probably more interested in knowing whether the correlation is closer to .30, .50 or .60. In this context we would need a much larger sample size in order to narrow the confidence interval of our estimate to a range that is acceptable for our purposes. These and other considerations often result in the recommendation that when it comes to sample size, "More is better!"
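The gap between detecting a correlation and estimating it precisely can be illustrated with approximate confidence intervals. The sketch below uses the Fisher z-transformation, which is one common approximation and is an assumption of this example rather than something prescribed above; the sample sizes and the function name are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def correlation_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation (Fisher z method)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = atanh(r)                   # transform r to the z scale
    half_width = z_crit / sqrt(n - 3)   # standard error on the z scale is 1/sqrt(n - 3)
    return tanh(center - half_width), tanh(center + half_width)

for n in (20, 30, 100, 400):
    low, high = correlation_confidence_interval(0.5, n)
    print(f"n = {n:>3}: 95% CI = ({low:.2f}, {high:.2f})")
```

With an observed correlation of .50, the intervals at the smaller sample sizes are wide enough to contain both .30 and .60, so several hundred observations are needed before those values can be told apart under this approximation.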
Application
Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis, for example to determine the minimum number of animal test subjects needed for an experiment. If a study is inadequately powered, then, in frequentist statistics, there is little point in completing the research, as it is unlikely to allow one to choose between hypotheses at the desired significance level. By contrast, in Bayesian statistics, any properly conducted experiment is valuable, as the data are used in the context of all data collected and allow one to update one's beliefs via Bayesian inference, regardless of how little is collected. However, even in Bayesian statistics, power is a useful measure of how much an experiment of a given size can be expected to refine one's beliefs.
Although there are no formal standards for power, most researchers who assess the power of their tests use 0.80 as a standard for adequacy.
Example
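Suppose we plan to compare research subjects in terms of a quantity that is measured before and after a treatment, analyzing the data using a paired t-test. Let $B_i$ and $A_i$ denote the pre-treatment and post-treatment measures on subject $i$. In the paired t-test, we let $D_i = A_i - B_i$, then proceed by analyzing $D$ as in a one-sample t-test. Begin by computing the sample variance $\hat{\sigma}_D^2$ of the $D_i$, which estimates the corresponding population variance $\sigma_D^2$. The one-sided test for the alternative hypothesis $E[D] > 0$ rejects the null hypothesis if

$$\frac{\sqrt{n}\,\bar{D}}{\hat{\sigma}_D} > 1.64,$$

where $n$ is the sample size, $\bar{D}$ is the average of the $D_i$, and 1.64 is the approximate decision threshold for a level 0.05 test based on a normal approximation to the test statistic.
Now suppose that the alternative hypothesis is true and $E[D] = \tau$. Then the power is

$$P\left(\frac{\sqrt{n}\,\bar{D}}{\hat{\sigma}_D} > 1.64\right) = P\left(\frac{\sqrt{n}\,(\bar{D} - \tau)}{\hat{\sigma}_D} > 1.64 - \frac{\sqrt{n}\,\tau}{\hat{\sigma}_D}\right).$$

Since $\sqrt{n}\,(\bar{D} - \tau)/\hat{\sigma}_D$ approximately follows a standard normal distribution when the alternative hypothesis is true, the approximate power can be calculated as

$$P\left(Z > 1.64 - \frac{\sqrt{n}\,\tau}{\hat{\sigma}_D}\right) \approx 1 - \Phi\left(1.64 - \frac{\sqrt{n}\,\tau}{\sigma_D}\right),$$

where $Z$ is a standard normal random variable and $\Phi$ is its cumulative distribution function.
Note that according to this formula, as either $n$ or $\tau$ increases, the power increases, whereas if $\sigma_D$ (and hence its sample-based estimate) increases, the power decreases.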
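The normal approximation in this example can be evaluated numerically. The sketch below assumes the approximate power takes the form 1 − Φ(1.64 − √n τ/σ_D), where Φ is the standard normal cumulative distribution function; the function name and the input values (n = 30, τ = 0.5, σ_D = 1) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def paired_test_power(n, tau, sigma_d, threshold=1.64):
    """Approximate power of the one-sided paired test.

    Evaluates 1 - Phi(threshold - sqrt(n) * tau / sigma_d), where
    n is the number of pairs, tau the true mean difference and
    sigma_d the standard deviation of the differences.
    """
    return 1 - NormalDist().cdf(threshold - sqrt(n) * tau / sigma_d)

baseline = paired_test_power(n=30, tau=0.5, sigma_d=1.0)
print(f"baseline power:       {baseline:.3f}")
print(f"larger sample (n=60): {paired_test_power(60, 0.5, 1.0):.3f}")
print(f"larger effect (0.8):  {paired_test_power(30, 0.8, 1.0):.3f}")
print(f"noisier data (sd=2):  {paired_test_power(30, 0.5, 2.0):.3f}")
```

Changing one input at a time reproduces the qualitative behavior noted above: power rises with n and τ and falls as σ_D grows.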
Notes
- [1] Thomas, L. (1997). Retrospective power analysis. Conservation Biology 11(1): 276-280.
References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). ISBN 0-8058-0283-5.
External links
- Hypothesis Testing and Statistical Power of a Test
- G*Power – A free program for Statistical Power Analysis for Macintosh OS and MS-DOS
- R/Splus package of power analysis functions along the lines of Cohen (1988)
- Examples of all ANOVA and ANCOVA models with up to three treatment factors, including tools to estimate design power
- Free A-priori Sample Size Calculator for Multiple Regression from Daniel Soper's Free Statistics Calculators website. Computes the minimum required sample size for a study, given the alpha level, the number of predictors, the anticipated effect size, and the desired statistical power level.




