
QUANTITATIVE RESEARCH SERIES

How To Detect Effects: Statistical Power and Evidence-Based Practice in Occupational Therapy Research
Kenneth J. Ottenbacher, Frikkie Maas

Key Words: probability (statistics) • quantitative method • statistics

The findings from 30 research investigations examining the effectiveness of occupational therapy interventions were reviewed and analyzed. Statistical conclusion validity was assessed by computing post hoc power coefficients for the statistical hypothesis tests included in the examined studies. Data analysis revealed that the median power values to detect small, medium, and large effect sizes were .09, .33, and .66, respectively. These results suggest a high probability of Type II errors in the sample of occupational therapy intervention research examined. In practical terms, this means that an intervention may have produced a potentially useful treatment effect, but the effect was not detected as statistically significant. Examples are provided that illustrate how low statistical power contributes to increases in Type II errors and inhibits the development of consensus through replication in the research literature. The presence of low-power studies with high rates of false negative findings prevents the establishment of guidelines for evidence-based practice and impedes the scientific progress of rehabilitation professions such as occupational therapy.
Ottenbacher, K., & Maas, F. (1999). Quantitative Research Series-How to detect effects: Statistical power and evidence-based practice in occupational therapy research. American Journal of Occupational Therapy, 53, 181-188.

Kenneth J. Ottenbacher, PhD, OTR, FAOTA, is Vice Dean, School of Allied Health Sciences, University of Texas Medical Branch, 301 University Boulevard, Galveston, Texas 77555-1028; kottenba@utmb.edu. Frikkie Maas, PhD, is Senior Lecturer, Department of Occupational Therapy, School of Health and Rehabilitation Science, University of Queensland, Australia.

This article was accepted for publication June 19, 1998.
Evidence-based practice is gaining wide acceptance in health care service delivery (Pittler & Ernst, 1997; Rosenberg & Donald, 1996; Sackett, Rosenberg, Gray, Haynes, & Richardson, 1996). Sackett and colleagues (1996) described evidence-based practice as the "conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients" (p. 71). The "best evidence" used to support evidence-based practice is derived from a series of research studies that address a particular clinical problem. Ideally, the series of studies results in an empirical consensus regarding the effectiveness of a treatment approach.


Evidence-based practice is being advocated to help clinicians answer the following questions:

• How does rehabilitation intervention work?
• For whom is it most effective?
• Are certain programs better for certain persons?
• Who should be enrolled? When? And for how long? (Fletcher & Fletcher, 1997; Kirby, 1994)

For the past 20 years, rehabilitation researchers have attempted to address these questions (Fuhrer, 1997; Kirby, 1994). Numerous studies have been reported, and the results of these investigations have been synthesized in several quantitative reviews or meta-analyses of rehabilitation intervention research (Casto & Mastropieri, 1986; Lau et al., 1992; Ottenbacher & Jannell, 1993).


These studies have contributed substantial knowledge regarding the effectiveness of various rehabilitation interventions, but they have not answered all of the questions posed above. Additional research is needed to establish the "best evidence" necessary to make informed clinical judgments regarding treatment for persons with disabilities and their family members (Wilkerson & Johnston, 1997). On the basis of the progress made during the past two decades, it has been argued that rehabilitation researchers should move beyond asking whether rehabilitation intervention works and focus instead on determining what dimensions of rehabilitation intervention are related to changes in different outcome measures (Fuhrer, 1997; Kirby, 1994). Answering these questions will require well-designed examinations of clinical intervention that are sensitive to specific treatment effects.

Lack of sensitivity and inappropriate clinical design have frequently been identified as concerns in the rehabilitation intervention research literature (Dobkin, 1989; Dunst, Snyder, & Mankiner, 1989; Reding & McDowell, 1989). For example, Lind (1982) reviewed the literature on the effectiveness of stroke rehabilitation treatment and concluded that "the results of these studies conflicted and the conclusions vary widely" (p. 134). Lind criticized the studies for poor design and inadequate statistical analysis. In a comprehensive examination of the methodology of 49 studies examining the effectiveness of early intervention rehabilitation programs for children with biological impairment, Dunst and Rheingrover (1981) concluded that "flaws in the experimental designs make most of the studies uninterpretable from a scientific point of view" (p. 320). Along similar lines, Halpern (1984) reviewed existing early intervention studies and concluded that limitations in measurement, research design, and statistical analysis obscured positive effects.

An important methodological factor that may obscure positive intervention effects in some rehabilitation intervention studies is low statistical power (Cohen, 1988). Inadequate statistical power was considered a form of statistical conclusion invalidity by Cook and Campbell (1979), who defined statistical conclusion invalidity as the inappropriate use or interpretation of statistical tests or procedures. The presence of statistical conclusion invalidity can reduce the sensitivity of statistical procedures and affect the ability of a clinical trial to produce results reflecting the effect of the treatment under investigation.

The research hypothesis, which may be directional or nondirectional, is the researchers' prediction of the outcome of their investigation. The aim of the remaining steps in the traditional scientific process is to test the research hypothesis by obtaining sample data, statistically analyzing these data, and arriving at rational conclusions.

Through the statistical analysis of the sample data, researchers are able to make (cautious) inferences about the populations in terms of two hypothetical states of the actual situation within those populations. Unlike the research hypothesis, these statistical hypotheses are always complementary. The first is the null hypothesis (H0), which states that the observed differences between samples occurred by chance and do not reflect a similar situation in the populations. The second, complementary statistical hypothesis is the alternative hypothesis (Ha), which states that the observed sample differences did not occur by chance; instead, differences in sample statistics indicate similar differences among the populations from which the samples were drawn.

Statistical decision theory dictates that it is the null hypothesis (not the alternative hypothesis) that is tested, although usually with the intention of trying to reject it. If the probability that the observed sample differences arose by chance is less than 5% (p < .05), it is usually assumed that the risk of error is small enough to reject H0. This arbitrary cutoff is called the alpha (α) level, or level of significance, and represents the maximum risk the researchers are willing to take of incorrectly rejecting a true null hypothesis. Incorrectly rejecting a true null hypothesis is called a Type I error, and the probability of this occurring is usually set at p = .05. Researchers also run the risk of incorrectly retaining a false null hypothesis. On most occasions, this means they fail to confirm their research hypothesis when in fact it is true (a "miss"). Incorrectly retaining a false null hypothesis is called a Type II error. Researchers must be willing to take some risk of committing a Type II error; most researchers prefer to limit this risk to no more than .20, or 20% (the miss rate). This arbitrary level of .20 is called the beta (β) level. If β is set at .20, then the power of the statistical analysis to avoid a Type II error is .80 (1 - β).

In summary, because decisions regarding Type I and Type II errors are made on sample data, it is not guaranteed that researchers make correct decisions when they generalize from the samples to the corresponding populations. Table 1 shows the possible decisions in statistical hypothesis testing. Theoretically, the statistical power of a study is defined as its ability to detect a phenomenon of a specified magnitude, given the existence of that phenomenon (Cohen, 1988; Oakes, 1986). In practical terms, statistical power is the complement of the miss rate (Type II error; see Table 1).
Table 1
Possible Decisions and Outcomes in Statistical Null Hypothesis Testing

                                            State in the Population
Possible Decision                           H0 Is True                  H0 Is False
Reject H0 (call result significant)         Incorrect decision:         Correct decision:
                                            Type I error,               probability = 1 - β
                                            probability = α             (statistical power)
Retain H0 (call result nonsignificant)      Correct decision:           Incorrect decision:
                                            probability = 1 - α         Type II error,
                                                                        probability = β
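To make the decision matrix in Table 1 concrete, the following minimal simulation sketch (an editorial illustration, not part of the original article) estimates the empirical Type I error rate and power of an independent-groups t test. The per-group sample size of 20 and the true effect of d = 0.50 are illustrative assumptions; with these values, power comes out near one in three.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, alpha, n_trials = 20, 0.50, 0.05, 10_000

def rejection_rate(effect):
    """Proportion of simulated studies in which H0 is rejected at p < alpha."""
    rejections = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect, 1.0, n_per_group)  # mean shift in SD units = d
        _, p_value = stats.ttest_ind(treated, control)
        rejections += p_value < alpha
    return rejections / n_trials

print("Type I error rate (H0 true, d = 0):", rejection_rate(0.0))   # close to alpha
power = rejection_rate(true_d)
print("Power (d = 0.50):", power)                                   # roughly .33
print("Type II error rate (beta):", 1 - power)                      # roughly .67

Under these assumed conditions, roughly two of every three studies of a genuinely effective treatment would fail to reach statistical significance.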

Implications of Low Statistical Power in Intervention Research

Power can be readily calculated for various statistics (Cohen, 1988). For example, when studies are concerned with mean differences between treatment and nontreatment groups, power can be calculated if the effect size, the required Type I error probability (usually .05), and the sample size are specified. These components of power have been described in the rehabilitation intervention literature (Ottenbacher, 1989, 1992, 1995, 1996). The larger the treatment effect (effect size), the higher its probability of being detected as significant, given that other factors such as sample size and alpha level remain constant. Factors such as uncontrolled sources of random error, subject heterogeneity, and error of measurement all contribute to decreased power (Lipsey, 1990). These issues have received minimal attention in the occupational therapy intervention literature, although they are relevant to the development of the statistical consensus essential to establishing guidelines for evidence-based practice (Rosenberg & Donald, 1996).

A consideration of statistical power is necessary in any attempt to establish guidelines for evidence-based practice. Evidence-based practice, as currently defined, is supported (or not supported) by the results of statistical hypothesis testing (Rosenberg & Donald, 1996). A body of research literature with low statistical power implies, by definition, an increase in Type II errors. Less widely appreciated is the demonstration by Lipsey (1990) and others (Oakes, 1986; Ottenbacher, 1995, 1996) that low power also contributes to the accumulation of Type I errors in the published literature over time. The influence of power on the Type I error rate is mediated by the prior probability that a true population effect has been selected for testing. Oakes (1986) demonstrated that the proportion of Type I errors in a body of research literature depends on factors other than the Type I error rate selected for the individual statistical tests. Two key factors are:

1. The prior probability that the investigator is conducting a study on a hypothesis with an effect of the stipulated magnitude in the population
2. The power of the study

Consider a hypothetical body of research literature consisting of 1,000 studies (Table 2). Assume, on the basis of current theories, experience, and previous research, that clinical investigators are able to select hypotheses that contain a true population effect of the anticipated magnitude 30% of the time. Further assume that the average power of the published research studies is 50% for detecting what Cohen (1988) defined as a medium-sized effect (i.e., d = .50).


Finally, assume that alpha (the Type I error rate) is controlled in all individual studies at the 5% level (p = .05).

Table 2
Hypothetical Outcomes of 1,000 Studies

                               Researchers' Conclusion
True State of the Field        Accept H0       Reject H0       Total
H0 is true                     665             35              700
H0 is false                    150             150             300

Note. Average power of .50, Type I error rate set at .05, and a 30% incidence of true effects.

What then is the proportion of Type I errors in this hypothetical body of research literature? The data in Table 2 show that, under these circumstances, 35 of the 1,000 studies would be false claims that an effect exists (Type I errors), and 150 would be accurate claims that an effect exists. Thus, the proportion of Type I errors ("false alarms") among the significant results in this literature would be 18.9% (35/185 × 100), not 5%. Stated more generally, if α is the stipulated probability of a Type I error, 1 - β is the power, and p(H0) is the proportion of null hypotheses in the population of hypotheses, then p(FA), the actual proportion of erroneous claims that an effect was detected (Type I errors), is:
p(FA) = [α × p(H0)] / [(1 - β)(1 - p(H0)) + α × p(H0)]     (1)

This equation, or a reworking of Table 2, readily demonstrates the effect of varying the power on the actual proportion of Type I errors. For example, if we increase the power to .80, then p(FA) is reduced from 18.9% to 12.7%. This is still not the desired 5%, but it does represent roughly a one-third reduction in the proportion of false alarms in the literature. What if we were as stringent about controlling the Type II error as we typically are about the Type I error? If 1 - β is .95, then p(FA) becomes 10.9%. Clearly, the power of a study ought to be of concern if we are interested in keeping the proportion of Type I errors in the research literature low.

These arguments suggest that low power affects the incidence of both Type I and Type II errors in the literature. The presence of Type I and Type II errors contributes to a body of research literature with contradictory findings and impedes the development of evidence-based practice. As Lipsey (1990) accurately observed, "statistical comparisons that lack power make for treatment effectiveness research that cannot accomplish its central purpose--to determine the effectiveness of treatment" (p. 21).

The full implications of low power and Type II errors for the interpretation of occupational therapy research cannot be evaluated without current knowledge of the level of statistical power in the published literature.
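Equation 1 and the worked example above can be reproduced with a few lines of code. The following sketch (an editorial illustration, not part of the original analysis) treats alpha, average power, and the proportion of true null hypotheses as inputs and returns the expected proportion of false alarms among significant findings.

def false_alarm_proportion(alpha, power, p_h0):
    """Equation 1: expected share of Type I errors among all significant results."""
    false_alarms = alpha * p_h0          # true nulls incorrectly rejected
    true_alarms = power * (1 - p_h0)     # false nulls correctly rejected
    return false_alarms / (true_alarms + false_alarms)

for power in (0.50, 0.80, 0.95):
    print(f"average power {power:.2f}: p(FA) = {false_alarm_proportion(0.05, power, 0.70):.1%}")
# average power 0.50: p(FA) = 18.9%  (the 35 of 185 significant results in Table 2)
# average power 0.80: p(FA) = 12.7%
# average power 0.95: p(FA) = 10.9%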

To determine the current power levels in the research literature, 30 intervention studies were examined. The purpose of the study was to find representative power values for occupational therapy research and to investigate the relationships among power, sample size, effect size, and alpha level.

Method

Thirty studies from the recent occupational therapy intervention literature were examined. The studies were obtained from the American Journal of Occupational Therapy and the Occupational Therapy Journal of Research. To be included in the analysis, a study had to meet the following criteria:

• Include a rehabilitation intervention program involving occupational therapy
• Involve a sample with clearly identified disabilities
• Include some type of quantitative evaluation of performance
• Contain outcome measures of functional performance (studies that included only physiological measures such as heart rate or respiration were not included)
• Contain adequate descriptive and statistical information about the sample and results

Studies were selected by two research assistants with graduate-level training in research design and statistical methods. Each journal was reviewed beginning with the last issue of 1997 and working back until 15 studies meeting the above criteria were identified. Studies included in the analysis were published from 1991 to 1997.

The following data were collected from each of the studies: type of research design, sample size, statistical tests used, and statistical outcomes. These values were transcribed to a coding form and used in subsequent analyses. The coding was performed by two trained raters with experience in data reduction and coding. Two of the original studies were eliminated because of disagreements between the raters regarding coding of the type of research design; two additional studies were obtained to bring the total number of studies to 30. Reliabilities were determined for all items coded from the 30 studies. Reliability values ranged from κ values of .77 to .86, with intraclass correlation coefficients of .86 to 1.0.
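As an illustration of the kind of inter-rater agreement statistic reported above, the following sketch computes Cohen's kappa for two raters assigning nominal design codes. The rating vectors and category labels are invented for illustration; they are not the article's coding data.

import numpy as np

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same items."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.mean(a == b)
    # chance agreement expected from each rater's marginal category proportions
    chance = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (observed - chance) / (1 - chance)

# hypothetical design codes from two raters (invented for illustration only)
rater_1 = ["RCT", "RCT", "quasi", "quasi", "single-subject", "RCT", "quasi", "RCT"]
rater_2 = ["RCT", "quasi", "quasi", "quasi", "single-subject", "RCT", "quasi", "RCT"]
print(round(cohen_kappa(rater_1, rater_2), 2))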

Determination of Power

Power coefficients were obtained for each statistical test of a hypothesis associated with a dependent variable included in the 30 occupational therapy intervention studies. Power values were obtained from tables provided by Cohen (1988). The power of each statistical test to detect a "small," "medium," and "large" treatment effect was read directly from the appropriate table in Cohen's text or calculated by interpolation between tabled values.

Cohen's (1988) guidelines were used to label treatment effects as small, medium, or large. These guidelines were developed as conventional frames of reference for making judgments concerning the effect of independent variables in various fields. Cohen noted that what may be considered a small effect in one area may have important practical (clinical or educational) implications in another field. Given this caveat, Cohen's guidelines have proven useful in establishing basic information for interpreting treatment effects when additional information is not available. For two-group comparisons, Cohen provides the d index as a measure of effect size. A small d index ranges from .20 to .50, a medium d index ranges from .50 to .80, and a large d index is defined as a value of .80 or greater. A d index of .30 means that 3/10 of a standard deviation separates the average subjects in the two groups or conditions being compared. The minimum effect size value in each range described by Cohen was used to determine power. For example, when determining power for a small effect, the d index value of .20 was used.

To complete the power analysis of the occupational therapy intervention studies, some additional parameters were defined. First, the .05 alpha level was assumed for all individual statistical tests when determining power levels. Second, a two-tailed (nondirectional) version of the null hypothesis was assumed for all statistical tests. Adopting this convention may have produced underestimates of power in some cases; however, the assumption of a two-tailed test eliminated the more serious problem of inflated significance levels and power in nonpredicted directions.
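The table look-up described above can also be reproduced numerically. The sketch below (an editorial illustration) computes the two-tailed power of an independent-groups t test from the noncentral t distribution. The per-group n of 19 is an illustrative assumption, roughly half of the median total sample size reported later in Table 3; with it, the computed values fall near the median power values reported in the Results.

import numpy as np
from scipy import stats

def t_test_power(d, n_per_group, alpha=0.05):
    """Two-tailed power of an independent two-group t test for effect size d."""
    df = 2 * n_per_group - 2
    noncentrality = d * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, noncentrality)
            + stats.nct.cdf(-t_crit, df, noncentrality))

for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    print(f"{label} effect (d = {d}): power = {t_test_power(d, n_per_group=19):.2f}")
# with an assumed n of 19 per group, the values come out close to the medians in Table 4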

Results

There were 144 statistical hypothesis tests of various dependent measures included in the 30 intervention studies. The average investigation included 4.80 statistical tests (SD = 3.81). This is slightly fewer dependent measures per study than reported in the biomedical research literature (Pocock, Hughes, & Lee, 1987).

Power is determined by a combination of factors, including the alpha level, effect size, and sample size.


It is often suggested or implied that sample size is the easiest factor to manipulate in improving power (Kraemer & Thiemann, 1987). Table 3 contains sample size information for the 30 intervention studies. Each sample size is indicated by a combination of stem and leaf following procedures suggested by Tukey (1977). The stem and leaf table provides all the information contained in a traditional bar graph and shows the actual values for all sample sizes. For example, in the 20 to 29 range (stem = 2), the trailing digits in the body of the table are 0, 0, 2, 2, and 4. These digits are combined with the stem to indicate the sample size values: there were two investigations with sample sizes of 20, two studies with sample sizes of 22, and one with a sample size of 24. The stem and leaf table also includes descriptive and summary statistics for the sample sizes in the occupational therapy intervention studies.

Table 3
Stem and Leaf Table for Sample Size of 30 Clinical Trials Included in the Analysis

Stem    Leaf
  0     8
  1     1 3 4 9
  2     0 0 2 2 4
  3     0 0 3 6 7 8
  4     0 2
  5     3
  6     2 4
  7     1
  8     3 6
  9     1
 10     1 3
 11
 12
 13     2
 14     0
 15     2

Note. M = 53.23, Median = 37.5, SD = 40.00, N = 30, Max = 152, Min = 8, Q1 = 21, Q3 = 71.

Results of the power analysis for small, medium, and large effect sizes are presented in Table 4. The table includes summary statistics for the power to detect small, medium, and large effect sizes, with the 144 individual statistical hypothesis tests from the 30 investigations as the unit of analysis. Inspection of Table 4 reveals that the median power of tests in the 30 studies to detect a small treatment effect as defined by Cohen (1988) (d = .20) was .09. The median power values to detect medium and large treatment effects in the target population were .33 and .66, respectively.

Another way to interpret the results is to estimate the probability of committing a Type II error for the range of effect sizes examined. The probability of committing a Type II error (β) was 1 - .09 = .91, or 91%, if the population treatment effect was in the range Cohen considered small (i.e., d index = .20). The miss rates (β) for medium and large effect sizes were 67% and 34%, respectively. These results suggest that even when the population treatment effect is in the range Cohen considered large, the probability of committing a Type II error is substantial, approximately 34%.

Table 4
Summary Statistics for Power To Detect Effect Size, With Individual Statistical Tests as the Unit of Analysis

                         Effect size
Statistic                Small      Medium     Large
Maximum                  .37        .92        .99
Minimum                  .05        .10        .32
Median                   .09        .33        .66
Mean                     .14        .43        .76
Standard deviation       .09        .29        .48

Note. All effects (small, medium, large) are formulated on the basis of criteria developed by Cohen (1988).

Discussion

Examination of the 30 occupational therapy intervention studies revealed that investigators had a poor chance of correctly rejecting the null hypothesis unless the effect produced by the intervention was quite large. The argument could be made that effectiveness research in occupational therapy will typically be concerned with small and medium effect sizes because of the lack of rigorous control possible in clinical and community settings, the heterogeneous nature of the target population, and the limited sensitivity of clinically based measuring instruments. The median power values to detect small and medium effect sizes in the 30 trials reviewed were .09 and .33, respectively. The median power value for all 144 reviewed hypothesis tests was .37, which suggests that the probable Type II error rate for the reviewed literature was high (β = .63). The results suggest a high probability of Type II errors in the occupational therapy intervention literature examined in this study. These results are similar to those found for research in medical rehabilitation (Ottenbacher & Barrett, 1990).

A traditional method for setting a desirable power for an effectiveness study is to set beta at four times the alpha level: power = 1 - 4(α). If the .05 level is selected for a study, then power is established at 1 - 4(.05) = .80. This corresponds to a β rate of .20. By this criterion, none of the statistical evaluations included in the 30 investigations showed sufficient power to detect a small effect in the population. The percentages of studies with adequate power (>.80) to detect medium and large population effect sizes were 24% and 46%, respectively.
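Read in the planning direction, the .80 convention implies a required sample size for a given effect. The sketch below is an editorial illustration (not part of the original article) using the statsmodels library; the effect size values are Cohen's conventions, and the resulting per-group sample sizes are approximate.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       ratio=1.0, alternative="two-sided")
    print(f"{label} effect (d = {d}): about {n_per_group:.0f} participants per group")
# roughly 393, 64, and 26 participants per group, respectively

Comparing these figures with the sample sizes in Table 3 makes clear why so few of the reviewed tests reached adequate power for anything but large effects.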

Implications
Several investigations have identified and described the negative effects that studies with low power can have on a developing research literature (Fagley, 1985; Freiman, Chalmers, Smith, & Kuebler, 1978; Lipsey, 1990; Ottenbacher, 1996; Ottenbacher & Barrett, 1990). First, low power decreases the probability of rejecting the null hypothesis and thus increases the likelihood of committing a Type II error. Second, inadequate power may increase the proportion of Type I errors in the literature; this interpretation derives from the Bayesian statistical concept of conditional alpha probability (Overall, 1969), has been illustrated by Oakes (1986), and is briefly described in the introduction to this article. A third negative consequence of low power concerns the interpretation of nonsignificant findings. When a researcher reports a failure to reject the null hypothesis, the various methodological factors and design features contributing to the obtained outcome (e.g., small sample size, measurement error, sample heterogeneity) should be carefully considered, particularly when power is low.


These three factors associated with low power contribute to the lack of statistical consensus that appears in the rehabilitation intervention literature. As Oakes (1986) has observed, control of alpha at p < .05 is sometimes (erroneously) imagined to mean that the probability of replication is 95%. In fact, the probability of replication is determined by statistical power as well as by the prior probability of null and nonnull hypotheses. It is instructive to consider quantitatively the influence of these factors on the probability of replication, because replication of research findings is essential to the development of guidelines for evidence-based practice.

What is the probability that a second occupational therapy intervention study that uses independent and dependent variables identical to the first will replicate the original finding? If the null hypothesis being tested is in fact true and alpha is .05, then the probability that H0 will be accepted twice in succession when using conventional statistical methods is (1 - α)² = .95² = .9025. The probability that it will be falsely rejected twice is .05² = .0025. Thus, the overall probability of replication is .9025 + .0025 = .905. If, however, we are dealing with a genuine effect (Ha true), power becomes a critical factor. Assuming a power of .33, the median power found in the sample of occupational therapy intervention literature to detect a medium effect size (see Table 4), we obtain p(replication) = .33² + (1 - .33)² = .56. The probability of replication is even more dismal if we consider that, in this last example, most of the replications represent confirmations of a false conclusion, or Type II error. That is, with a power of .33, the probability of a Type II error is .67, or 67%. In fact, (1 - .33)² / [.33² + (1 - .33)²] = .80, meaning that approximately 80% of the replications strengthen a false impression.

The overall probability of replication in a given field will further depend on the proportion of times investigators select what are actually null or nonnull hypotheses (Ottenbacher, 1996), that is, the prior probability of the alternative or research hypothesis being true. If we assume that (a) p(Ha) = .30, that is, 30% of the time we identify a research hypothesis that is in fact true in the population; (b) the traditional Type I error rate of .05; and (c) a power of .33 (see Table 4), then the overall probability of replication is:

p(replication) = .7[.05² + (1 - .05)²] + .3[.33² + (1 - .33)²] = .80     (2)

Clinical researchers might argue that current theories, observation, and prior research are valuable guides in selecting hypotheses to study. If we reverse the above example so that p(Ha) = .70 and p(H0) = .30, that is, 70% of the time researchers conduct investigations in which the intervention effect really exists in the population, then p(replication) falls to .66--a lower value. The apparently paradoxical conclusion is that the more often we are well guided by theory and prior clinical observation or research but conduct a low-power study, the more we decrease the probability of replication. Thus, a literature with low power not only commits a passive error but also can actively contribute to statistical confusion. Of course, if theories and prior observations lead to testing what are true null hypotheses, that is, our treatments have no effect, then the practice of setting alpha at .05 will allow a high probability of replication (as above). When low-power studies are conducted, our best chance of replication occurs when we examine interventions that produce no effect, which is hardly the ideal situation if the goal is to establish evidence-based practice.

Rosenthal (1990) provided an excellent example of the confusion that can result when statistical power is ignored in considering the results of replication studies. Assume that two investigations have been conducted. The first, Study A, produces statistically significant results. A second study, Study B, is conducted using the same research design, same treatment, and same outcome but with one important difference--a smaller sample size--and therefore lower statistical power. The results of Study B are not statistically significant, and the researcher's conclusion is a failure to replicate, that is, conflicting results. A closer look at the actual statistical outcomes reveals the following:

• Study A: t = 2.21, df = 78, p < .05
• Study B: t = 1.06, df = 18, p = nonsignificant

If a d index is computed for these two investigations with the formula presented by Rosenthal (1990) and others (Hedges & Olkin, 1985), d = 2t/√df, we find that the effect size (d = .50) is identical for both studies. Cohen (1988) defines an effect size as the degree to which a null hypothesis is false. In this example, the degree to which the null hypothesis is false is identical in Studies A and B. The discrepancy between the statistical significance test results and the effect size values is due to the smaller sample size in Study B (N = 20) and the subsequently lower power. For Study B, the power to reject the null hypothesis at p < .05, two-tailed, is .18, and the probability of a Type II error is (1 - .18) = .82, or 82%.
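The arithmetic of Rosenthal's example can be checked directly. The sketch below (an editorial illustration) converts each reported t value to a d index with d = 2t/√df and then computes the post hoc power of Study B from the noncentral t distribution, assuming equal group sizes.

import numpy as np
from scipy import stats

def d_from_t(t, df):
    """Rosenthal's conversion of a two-group t statistic to a d index."""
    return 2 * t / np.sqrt(df)

print(round(d_from_t(2.21, 78), 2))   # Study A: d = 0.50
print(round(d_from_t(1.06, 18), 2))   # Study B: d = 0.50

# Post hoc power of Study B (N = 20, assumed 10 per group) to detect d = .50
# with a two-tailed alpha of .05 -- close to the .18 reported in the text.
df_b, delta = 18, 0.50 * np.sqrt(10 / 2)
t_crit = stats.t.ppf(0.975, df_b)
power_b = 1 - stats.nct.cdf(t_crit, df_b, delta) + stats.nct.cdf(-t_crit, df_b, delta)
print(round(power_b, 2))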

The agreement or consistency of treatment effect across trials is obscured when the focus is primarily on the results of statistical significance testing. In such cases, low power leads to statistical confusion. As Lipsey, Crosse, Pankle, and Stobart (1985) accurately noted, "research with low statistical power has the potential for falsely branding a program as a failure when, in fact, the problem is the inability of the research to detect an effect, not the inability of the program to produce one" (p. 26).
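The replication probabilities discussed above follow from a single expression: the chance that two independent studies reach the same accept or reject decision, averaged over the prior probability that the null hypothesis is true. The sketch below is an editorial illustration of that calculation using the article's values (α = .05, power = .33).

def p_same_decision(alpha, power, p_h0):
    """Probability that two independent studies reach the same accept/reject decision."""
    agree_if_h0_true = (1 - alpha) ** 2 + alpha ** 2
    agree_if_h0_false = power ** 2 + (1 - power) ** 2
    return p_h0 * agree_if_h0_true + (1 - p_h0) * agree_if_h0_false

print(round(p_same_decision(0.05, 0.33, p_h0=1.0), 3))  # H0 true: .905
print(round(p_same_decision(0.05, 0.33, p_h0=0.0), 3))  # genuine medium effect: about .56
print(round(p_same_decision(0.05, 0.33, p_h0=0.7), 3))  # 30% of hypotheses true: about .80
print(round(p_same_decision(0.05, 0.33, p_h0=0.3), 3))  # 70% of hypotheses true: about .66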
Conclusion

One consequence of a failure to replicate a previous investigation is a proliferation of studies developed around hypothesized "explanations" for the discrepancies observed in the literature. These explanations are usually made on the basis of theories or conjecture related to the original

independent variable. Rarely is the problem of sampling variation or low statistical power considered as the preferred explanation and the inspiration for a more powerful replication study. As noted by Lindsay and Ehrenberg (1993), exact replication is rare; even rarer is replication with a more powerful design. Thus, the consequence of low power is a contradictory research literature that is apparently dynamic but fails to resolve uncertainties--it fails to establish consensus.

A failure to establish consensus means a failure to develop guidelines for evidence-based practice. Evidence-based practice requires a series of studies that support the use of a particular intervention with a patient population. Rosenberg and Donald (1996) described evidence-based practice as "the process of systematically finding, appraising, and using contemporaneous research findings as the basis for clinical decision making" (p. 1122). When the research literature is characterized by studies with conflicting results, evidence-based practice is not possible.

The responsible rehabilitation researcher must have a working knowledge of power and effect size. A concern with power, however, cannot end with its calculation. Because the ability to detect treatment effects must be optimized, the investigator should be concerned with all of the factors that determine whether a treatment effect will be detected. These include not only parameters controlled by research circumstances and method, such as sample size, sensitivity of measuring instruments, sample heterogeneity, and design characteristics, but also parameters influenced by the goals set for the study, such as the effect size to be detected as clinically meaningful. Several excellent resources exist to assist the clinical researcher in understanding and improving statistical power in applied investigations (Cohen, 1988; Lipsey, 1990).

Rehabilitation researchers who use quantitative approaches should supplement the results of statistical significance testing with effect size values or confidence intervals. Cohen (1988) has provided effect size measures for use with a wide variety of statistical tests. The importance of reporting effect size values is becoming widely recognized. Carver (1993) observed that "nothing is more important in educational and psychological research than making sure the effect size of results is evaluated when tests of statistical significance are used" (p. 291). Along similar lines, a former editor of the Journal of Gerontology stated, "We are increasingly insistent that authors address the issue of effect magnitude, in both tests of the statistics reported and the discussion of the implications of the research. Any effect can be shown to be significant, given sufficiently large sample sizes. The real question is whether or not the effect is important; measures of effect size help the researcher answer this important question" (Storandt, 1983, p. 2).

An appreciation of statistical conclusion validity, including information about power and effect size, will help ensure that future intervention research is accurately reported and correctly interpreted and will contribute to the development of evidence-based practice in occupational therapy.

Acknowledgments

We thank Richard Roach, MS, and Allen Holcome for assistance in data collection, coding, and review. This study was completed while the first author received support from the National Institutes of Health, R01 HD34622.
References

Carver, R. P. (1993). The case against null hypothesis testing-Revised. Journal of Experimental Education, 61, 287-292.
Casto, G., & Mastropieri, M. A. (1986). The efficacy of rehabilitation intervention programs: A meta-analysis. Exceptional Children, 52, 417-424.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues in field settings. Chicago: Rand McNally.
Dobkin, B. H. (1989). Focused stroke rehabilitation programs do not improve outcome. Archives of Neurology, 46, 701-703.
Dunst, C. J., & Rheingrover, R. M. (1981). An analysis of the efficacy of infant intervention programs with organically handicapped children. Evaluation and Program Planning, 4, 287-323.
Dunst, C. J., Snyder, S. W., & Mankiner, M. (1989). Efficacy of rehabilitation intervention. In M. C. Wang, M. C. Reynolds, & H. J. Walberg (Eds.), Handbook of special education: Research and practice. Vol. 3: Low incidence conditions (pp. 259-294). New York: Pergamon.
Fagley, N. S. (1985). Applied statistical power analysis and the interpretation of nonsignificant results by research consumers. Journal of Counseling Psychology, 32, 391-396.
Fletcher, R. H., & Fletcher, S. W. (1997). Evidence-based approach to the medical literature. Journal of General Internal Medicine, 12(Suppl.), S5-S12.
Freiman, J., Chalmers, T., Smith, H., & Kuebler, R. (1978). The importance of beta, the Type II error and sample size in the design and interpretation of the randomized trial. New England Journal of Medicine, 299, 690-694.
Fuhrer, M. J. (Ed.). (1997). Assessing medical rehabilitation practices: The promise of outcomes research. Baltimore: Brookes.
Halpern, D. (1984). Lack of effects for home-based rehabilitation intervention? Some possible explanations. American Journal of Orthopsychiatry, 54, 33-42.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Kirby, R. L. (1994). Priorities and rehabilitation. American Journal of Physical Medicine & Rehabilitation, 73, 56-57.
Kraemer, H. C., & Thiemann, S. (1987). How many subjects: Statistical power analysis in research. Beverly Hills, CA: Sage.
Lau, J., Antman, E. M., Jimenez-Silva, J., Kupelnick, B., Mosteller, F., & Chalmers, T. C. (1992). Cumulative meta-analyses of therapeutic trials of myocardial infarction. New England Journal of Medicine, 327, 248-254.
Lind, K. A. (1982). A synthesis of studies on stroke rehabilitation. Journal of Chronic Diseases, 35, 133-149.
Lindsay, R. M., & Ehrenberg, A. S. C. (1993). The design of replicated studies. The American Statistician, 47, 217-228.
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury Park, CA: Sage.
Lipsey, M. W., Crosse, S., Pankle, J., & Stobart, G. (1985). Evaluation: The state of the art and the sorry state of the science. New Directions for Program Evaluation, 27, 7-28.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Ottenbacher, K. J. (1989). Statistical conclusion validity of rehabilitation intervention research with handicapped children. Exceptional Children, 55, 534-54.
Ottenbacher, K. J. (1992). Practical significance in rehabilitation intervention research: From affect to empirical effect. Journal of Early Intervention, 16, 181-193.
Ottenbacher, K. J. (1995). Why rehabilitation research does not work (as well as we think it should). Archives of Physical Medicine & Rehabilitation, 76, 123-129.
Ottenbacher, K. J. (1996). The power of replications and replications of power. The American Statistician, 50, 271-275.
Ottenbacher, K. J., & Barrett, K. A. (1990). Statistical conclusion validity of rehabilitation research. American Journal of Physical Medicine and Rehabilitation, 69, 102-107.
Ottenbacher, K. J., & Jannell, S. (1993). The results of clinical trials in stroke rehabilitation research. Archives of Neurology, 50, 37-44.
Overall, J. E. (1969). Classical statistical hypothesis testing within the context of Bayesian theory. Psychological Bulletin, 71, 285-292.
Pittler, M. H., & Ernst, E. (1997). Evidence-based PM&R [Letter to the editor]. Archives of Physical Medicine & Rehabilitation, 78, 1281.
Pocock, S. J., Hughes, M. D., & Lee, R. J. (1987). Statistical problems in the reporting of clinical trials: A survey of three medical journals. New England Journal of Medicine, 317, 426-432.
Reding, M. J., & McDowell, F. H. (1989). Focused stroke rehabilitation programs improve outcome. Archives of Neurology, 46, 700-701.
Rosenberg, W., & Donald, A. (1996). Evidence based medicine: An approach to clinical problem solving. British Medical Journal, 310, 1120-1126.
Rosenthal, R. (1990). Replication in behavioral research. In J. W. Neulep (Ed.), Handbook of replication research in the behavioral and social sciences (pp. 1-30). Newbury Park, CA: Sage.
Sackett, D., Rosenberg, W. M., Gray, J. A., Haynes, R. B., & Richardson, W. S. (1996). Evidence-based medicine: What is it and what it is not. British Medical Journal, 312, 71-72.
Storandt, M. (1983). Significance does not equal importance [Editorial]. Journal of Gerontology, 38, 2.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wilkerson, D. L., & Johnston, M. V. (1997). Clinical program monitoring systems: Current capability and future directions. In M. J. Fuhrer (Ed.), Assessing medical rehabilitation practices: The promise of outcomes research (pp. 275-306). Baltimore: Brookes.
