The Choice of [alpha] for One-Sided Tests

Neuhauser, Markus

The International Conference on Harmonisation (ICH) E9 guideline recommends using a significance level of [alpha]/2 for one-sided tests in regulatory settings. Two arguments are presented to demonstrate that this approach may not be universally sensible. First, a two-sided p-value is not always twice the minimum of the two tail probabilities, that is, of the two possible one-sided p-values. Based on Fisher’s exact test, examples are presented in which the one-sided p-value is larger than [alpha]/2 although the corresponding two-sided p-value is smaller than [alpha]. Second, the choice between one- and two-sided tests is an artificial dichotomy, since there is a continuum of choices when asymmetrical critical regions are used. Such an unequal split of [alpha] is implicitly used when Fisher’s exact test is applied two-sided. Furthermore, a test intermediate to one- and two-sided tests is sometimes appropriate in group sequential designs.

Key Words

Significance level; One-sided test; Two-sided test; Asymmetrical critical region; Fisher’s exact test; ICH E9

INTRODUCTION

A statistical test, when it is not anti-conservative, guarantees that the probability of a type I error is not larger than the nominal significance level. The conventional significance level is [alpha] = 5%. Although this choice is arbitrary, “it is the most widely used and perhaps the most widely useful significance level” (1). Sometimes an adjustment of the level is necessary, for example, in the case of multiple testing (2). The international guideline entitled Statistical Principles for Clinical Trials (ICH E9) (3) says that “the approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests is preferable in regulatory settings.” One-sided tests can be applied when the research hypothesis is one-sided because there is an a priori expectation of the direction of change, for instance, when the aim is to confirm the superiority of an active treatment over placebo previously shown in another trial.

Although the ICH E9 guideline has been found to be broadly acceptable (4), the proposal to halve the type I error rate is often not applied. Recent clinical trials in various indications considered a one-tailed p-value smaller than 0.05 as significant (eg, 5-13). Therefore, I discuss the suitability of the approach of halving [alpha].

THE DOUBLED ONE-SIDED P-VALUE CAN DIFFER FROM THE TWO-SIDED P-VALUE

One-sided tests can offer a large gain in power over the corresponding two-sided test (14). For instance, consider two groups with 45 patients each and an effect size of 0.6. Student’s t test has, when its assumptions are satisfied, a power of 80% if applied two-sided ([alpha] = 0.05), but 88% if applied one-sided with [alpha] = 0.05. O’Suilleabhain et al. (10) used one-tailed t tests to maximize the power. For some tests, such as Student’s t test, a two-sided p-value is twice the minimum of the two possible one-sided p-values and, consequently, the gain in power is completely lost when [alpha] is halved. This relationship, however, does not hold for all tests. George and Mudholkar (15) pointed out that the definition of a two-sided p-value as twice the minimum of the tail probabilities is only appropriate when the null distribution is unimodal and symmetric.
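The power figures quoted above can be checked roughly with a normal approximation. The sketch below is not the exact noncentral t calculation behind the quoted 80% and 88%, so its values differ slightly. The function name approx_power is hypothetical, and the negligible rejection probability in the wrong tail of the two-sided test is ignored.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha, one_sided):
    """Normal-approximation power for comparing two equally sized groups.

    Assumption: the test statistic is treated as normal with unit variance,
    which only approximates Student's t test.
    """
    z = NormalDist()
    # Critical value: all of alpha in one tail, or alpha/2 in each tail.
    crit = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    # Expected value of the statistic under the alternative.
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - crit)


power_two_sided = approx_power(0.6, 45, 0.05, one_sided=False)
power_one_sided = approx_power(0.6, 45, 0.05, one_sided=True)
power_one_sided_halved = approx_power(0.6, 45, 0.025, one_sided=True)

print(f"two-sided, alpha = 0.05:  {power_two_sided:.3f}")
print(f"one-sided, alpha = 0.05:  {power_one_sided:.3f}")
print(f"one-sided, alpha = 0.025: {power_one_sided_halved:.3f}")
```

Under this approximation the one-sided test at [alpha]/2 has the same power as the two-sided test at [alpha], which illustrates why halving [alpha] removes the gain for symmetric null distributions.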

If the group sizes are unequal, the exact randomization distribution of the t statistic may be asymmetric (16). Hence, dividing the two-sided p-value by two does not provide a valid one-sided p-value. In an extreme case (eg, eight observations: 7, 5, 5 for treatment A and 4, 4, 3, 3, 2 for treatment B) the one-sided exact randomization t test gives a p-value (0.018) that is equal to the two-sided p-value (16).
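The extreme case above can be reproduced by enumerating all assignments of the pooled data to groups of three and five values. This is a minimal sketch, not the procedure of reference 16: it uses the difference in means as the test statistic, on the assumption that for a fixed pooled sample this orders the assignments in the same way as the pooled-variance t statistic. All variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

group_a = [7, 5, 5]
group_b = [4, 4, 3, 3, 2]
pooled = group_a + group_b
n_a, n_b = len(group_a), len(group_b)
total = sum(pooled)


def mean_difference(sum_a):
    """Mean of group A minus mean of group B, given the sum of group A."""
    return Fraction(sum_a, n_a) - Fraction(total - sum_a, n_b)


observed = mean_difference(sum(group_a))

# One statistic per way of choosing which observations form group A.
diffs = [
    mean_difference(sum(pooled[i] for i in idx))
    for idx in combinations(range(len(pooled)), n_a)
]

# One-sided: assignments at least as favourable to A as the observed one.
p_one_sided = Fraction(sum(d >= observed for d in diffs), len(diffs))
# Two-sided: assignments at least as extreme in absolute value.
p_two_sided = Fraction(sum(abs(d) >= abs(observed) for d in diffs), len(diffs))

print(len(diffs), float(p_one_sided), float(p_two_sided))
```

Because the most extreme assignment in the opposite direction yields a smaller absolute difference than the observed one, no assignment from the other tail enters the two-sided p-value.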

Let us consider Fisher’s exact test, a frequently used exact test. The common rule to calculate a two-sided p-value is to sum the probabilities of all possible outcomes (ie, 2 × 2 tables with identical marginal totals) that are at least as unlikely as the observed table (17). When, for example, 42 out of 180 patients are responders under treatment A and 39 out of 251 are responders under treatment B, Fisher’s exact test yields a two-sided p-value of 0.046, which is smaller than [alpha] = 5%. However, the one-sided p-value (of the test that treatment A is better) is larger than [alpha]/2 (it is 0.028; see upper part of Table 1). In extreme cases the one- and the two-sided p-values can be identical (17; see lower part of Table 1). As an alternative way to obtain a two-sided p-value for Fisher’s exact test, some authors (18,19) suggested doubling the one-sided p-value. This approach, applied, for example, by Paggiaro et al. (20), in most cases produces tests with the undesirable property of biasedness, although in many cases the bias is rather slight (21).
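The summation rule described above can be sketched directly from the hypergeometric probabilities of the 2 × 2 tables. The function fisher_exact_p_values is a hypothetical helper written for this illustration; it uses exact rational arithmetic so that the comparison “at least as unlikely as the observed table” is not affected by rounding.

```python
from fractions import Fraction
from math import comb


def fisher_exact_p_values(resp_a, n_a, resp_b, n_b):
    """Return (p for A better, p for B better, two-sided p).

    The two-sided p-value sums the probabilities of all tables with the
    same margins that are no more likely than the observed table.
    """
    responders = resp_a + resp_b
    n = n_a + n_b
    lowest = max(0, responders - n_b)
    highest = min(n_a, responders)
    denom = comb(n, responders)
    # Hypergeometric probability of x responders under treatment A.
    prob = {
        x: Fraction(comb(n_a, x) * comb(n_b, responders - x), denom)
        for x in range(lowest, highest + 1)
    }
    p_obs = prob[resp_a]
    p_a_better = sum(p for x, p in prob.items() if x >= resp_a)
    p_b_better = sum(p for x, p in prob.items() if x <= resp_a)
    p_two_sided = sum(p for p in prob.values() if p <= p_obs)
    return p_a_better, p_b_better, p_two_sided


# Example from the text: 42/180 responders under A, 39/251 under B.
p_a_better, p_b_better, p_two_sided = fisher_exact_p_values(42, 180, 39, 251)
print(f"one-sided (A better): {float(p_a_better):.3f}")
print(f"two-sided:            {float(p_two_sided):.3f}")
print(f"doubled one-sided:    {float(2 * p_a_better):.3f}")
```

The text reports 0.028 for the one-sided and 0.046 for the two-sided p-value, so the doubled one-sided p-value exceeds the two-sided one.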

THERE ARE INTERMEDIATE STAGES BETWEEN ONE- AND TWO-SIDED TESTS

The choice [gamma] = [delta] = [alpha]/2 corresponds to a usual two-sided test, whereas a one-sided test results in [gamma] = [alpha] and [delta] = 0; here [gamma] and [delta] denote the parts of [alpha] assigned to the anticipated and the unanticipated direction, respectively. As Rice and Gaines (14) noted, “there is no compelling reason to consider only these extreme options.” They suggested [gamma]/[alpha] = 0.8 as a pragmatic value, that is, 80% of the critical region is given to the anticipated direction. As a result, there is more power in comparison to a two-sided test when the direction of change is as predicted, but large changes in the unanticipated direction can be detected, too. Which significance level should be chosen for that test? By analogy with setting the level of a one-sided test at [alpha]/2 = 2.5%, the p-value of this test, p_acr, must be compared with (1 + [delta]/[gamma]) × 2.5%, which depends on the partition. Thus, the significance level in the case of [gamma]/[alpha] = 0.80 would be 0.03125, but approximately 0.0333 in the case of [gamma]/[alpha] = 0.75. Alternatively, one may argue that the appropriate significance level is 5%, since p_acr can be regarded as a p-value of a ‘two-sided’ test.
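The arithmetic behind these levels can be written out as a short sketch. It assumes that [gamma] + [delta] = [alpha], which the two limiting cases above imply, so that [delta]/[gamma] follows from the share [gamma]/[alpha] alone. The function name acr_level is hypothetical.

```python
from fractions import Fraction


def acr_level(gamma_share, one_sided_level=Fraction(25, 1000)):
    """Level (1 + delta/gamma) * 2.5% for a given share gamma/alpha.

    Assumption: gamma + delta = alpha, so delta/gamma = (1 - share) / share.
    """
    share = Fraction(gamma_share)
    if not 0 < share <= 1:
        raise ValueError("gamma/alpha must lie in (0, 1]")
    delta_over_gamma = (1 - share) / share
    return (1 + delta_over_gamma) * one_sided_level


for share in (Fraction(1), Fraction(4, 5), Fraction(3, 4), Fraction(1, 2)):
    print(f"gamma/alpha = {float(share):.2f}: level = {float(acr_level(share)):.5f}")
```

The limiting cases behave as expected: a share of 1 gives the one-sided level of 2.5%, and an equal split gives the two-sided level of 5%.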

Is an unequal split of [alpha] relevant in a regulatory setting? Again, I look at Fisher’s exact test. For illustrative purposes, I consider an example with three patients under treatment A and two patients under treatment B, and I set [alpha] = 0.5 for the two-sided test. Let the observed number of responders be two under treatment A and zero under treatment B. In this case, there are three possible 2 × 2 tables with identical marginal totals, that is, the number of responders under treatment A can be zero, one, or two. The corresponding probabilities of these tables are 0.1, 0.6, and 0.3, respectively. Consequently, the one-sided p-value for the alternative that treatment A is better is 0.3, that is, the probability of the observed table, since there is no table more extreme than the observed one. The two-sided p-value is 0.4, since the table with no responder under treatment A is less likely than the observed table and, therefore, its probability (0.1) must be added to 0.3. The resultant two-sided p-value of 0.4 is smaller than the level [alpha] = 0.5 as defined above.

A two-sided hypothesis test can be regarded as a combination of two one-sided tests. When an equal split of [alpha] = 0.5 is considered, both one-sided tests have a significance level of [alpha]/2 = 0.25. In that case, the null hypothesis could not be rejected since both one-sided p-values are larger than 0.25 (0.3 for the alternative that treatment A is better and 1.0 for the alternative that treatment B is better). Hence, the two-sided Fisher’s exact test implicitly uses an unequal split of [alpha].
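The implicit unequal split can be made visible by enumerating the three tables of this small example and collecting those that the two-sided test rejects at [alpha] = 0.5. This is a minimal sketch with illustrative variable names; splitting the rejection region at the expected number of responders under treatment A is a choice made for this illustration.

```python
from fractions import Fraction
from math import comb

n_a, n_b, responders = 3, 2, 2
alpha = Fraction(1, 2)

# Possible numbers of responders under treatment A for the fixed margins.
support = range(max(0, responders - n_b), min(n_a, responders) + 1)
prob = {
    x: Fraction(comb(n_a, x) * comb(n_b, responders - x),
                comb(n_a + n_b, responders))
    for x in support
}


def two_sided_p(x):
    """Sum of the probabilities of all tables no more likely than table x."""
    return sum(p for p in prob.values() if p <= prob[x])


# Tables for which the two-sided test rejects at level alpha.
rejection = [x for x in support if two_sided_p(x) <= alpha]

# Split the rejection region at the expected count under the null hypothesis.
expected = Fraction(n_a * responders, n_a + n_b)
upper_share = sum(prob[x] for x in rejection if x > expected)
lower_share = sum(prob[x] for x in rejection if x < expected)

print({x: float(p) for x, p in prob.items()})
print(rejection, float(upper_share), float(lower_share))
```

The upper tail of the rejection region carries a probability of 0.3 and the lower tail 0.1, so the two-sided test spends more than [alpha]/2 = 0.25 in the direction favouring treatment A.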

Another example of a test intermediate to one- and two-sided tests comes from group sequential test designs (22,23). Such a hybrid design is appropriate for the comparison of a new treatment to a standard therapy when the trial objective is to show that the treatments are equivalent with respect to the primary endpoint (eg, survival). If the treatments are found to be equivalent, a further goal is to show that the new treatment is superior with respect to a secondary endpoint (eg, quality of life). In that situation, one may not want to use a group sequential design that treats the two treatments symmetrically (23). The trial should be stopped in the case of a trend for the new therapy to be worse regarding the primary endpoint. However, to abandon the standard treatment, a highly significant result would be necessary.

CONCLUSION

With the focus on Fisher’s exact test, a test that is often applied in clinical research for the comparison of two binomial proportions, I demonstrated that using [alpha]/2 in one-sided testing may not be universally sensible. The [alpha]/2 approach seems to be useful only when the null distribution of the test statistic is unimodal and symmetric and, furthermore, when intermediate stages between one- and two-sided tests are not considered.

Thus, “the approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests” (ICH E9) may not always be preferable. However, because the ICH E9 guideline uses the word ‘preferable,’ it leaves open the possibility of taking an alternative, and justified, position. This article might help to justify taking an alternative position.

REFERENCES

1. Zar JH. Biostatistical Analysis. Englewood Cliffs, NJ: Prentice-Hall; 1984.

2. Proschan MA, Waclawiw MA. Practical guidelines for multiplicity adjustment in clinical trials. Control Clin Trials. 2000;21:527-539.

3. ICH E9 Expert Working Group. ICH Harmonised Tripartite Guideline: Statistical Principles for Clinical Trials. Stat Med. 1999;18:1905-1942.

4. Lewis J, Louv W, Rockhold F, Sato T. The impact of the international guideline entitled Statistical Principles for Clinical Trials (ICH E9). Stat Med. 2001;20:2549-2560.

5. Pelletier JP, Yaron M, Haraoui B, Cohen P, Nahir MA, Choquette D, Wigler I, Rosner IA, Beaulieu AD. Efficacy and safety of diacerein in osteoarthritis of the knee. Arthritis Rheumatism. 2000;43:2339-2348.

6. Sacristan JA, Gilaberte I, Boto B, Buesching DP, Obenchain RL, Demitrack M, Perez Sola V, Alvarez E, Artigas F. Cost-effectiveness of fluoxetine plus pindolol in patients with major depressive disorder: results from a randomized, double-blind clinical trial. Int Clin Psychopharmacol. 2000;15:107-113.

7. Cardozo L, Chapple CR, Toozs-Hobson P, Grosse-Freese M, Bulitta M, Lehmacher W, Strosser W, Ballering-Bruhl B, Schafer M. Efficacy of trospium chloride in patients with detrusor instability: a placebo-controlled, randomized, double-blind, multicentre clinical trial. BJU Int. 2000;85:659-664.

8. Tollefson GD, Birkett MA, Kiesler GM, Wood AJ. Double-blind comparison of olanzapine versus clozapine in schizophrenic patients clinically eligible for treatment with clozapine. Biolog Psychiatry. 2001;49:52-63.

9. Akin MD, Weingand KW, Hengehold DA, Goodale MB, Hinkle RT, Smith RP. Continuous low-level topical heat in the treatment of dysmenorrhea. Obstetrics Gynecol. 2001;97:343-349.

10. O’Suilleabhain P, Bullard J, Dewey RB. Proprioception in Parkinson’s disease is acutely depressed by dopaminergic medications. J Neurology, Neurosurgery Psychiatry. 2001;71:607-610.

11. International Recombinant Human Chorionic Gonadotropin Study Group. Induction of ovulation in World Health Organization group II anovulatory women undergoing follicular stimulation with recombinant human follicle-stimulating hormone: a comparison of recombinant human chorionic gonadotropin (rhCG) and urinary hCG. Fertil Steril. 2001;75:1111-1118.

12. Cohen MB, Giannella RA, Bean J, Taylor DN, Parker S, Hoeper A, Wowk S, Hawkins J, Kochi SK, Schiff G, Killeen KP. Randomized, controlled human challenge study of the safety, immunogenicity, and protective efficacy of a single dose of Peru-15, a live attenuated oral cholera vaccine. Infect Immun. 2002;70:1965-1970.

13. Bernard P, Chosidow O, Vaillant L. Oral pristinamycin versus standard penicillin regimen to treat erysipelas in adults: randomised, noninferiority, open trial. Br Med J. 2002;325:864-866.

14. Rice WR, Gaines SD. ‘Heads I win, tails you lose’: Testing directional alternative hypotheses in ecological and evolutionary research. Trends Ecol Evolut. 1994;9:235-237.

15. George EO, Mudholkar DS. P-values for two-sided tests. Biometrical J. 1990;32:747-751.

16. Onghena P, May RB. Pitfalls in computing and interpreting randomization test p values: A commentary on Chen and Dunlap. Behavior Res Meth, Instruments, Computers. 1995;27:408-411.

17. Lloyd CJ. Statistical Analysis of Categorical Data. New York, NY: Wiley; 1999.

18. Dupont WD. Sensitivity of Fisher’s exact test to minor perturbations in 2 × 2 contingency tables. Stat Med. 1986;5:629-635.

19. Terwilliger JD, Ott J. Handbook of Human Genetic Linkage. Baltimore, MD: Johns Hopkins University Press; 1994.

20. Paggiaro PL, Dahle R, Bakran I, Frith L, Hollingworth K, Efthimiou J. Multicentre randomised placebo-controlled trial of inhaled fluticasone propionate in patients with chronic obstructive pulmonary disease. Lancet. 1998;351:773-780.

21. Lloyd CJ. Doubling the one-sided P-value in testing independence in 2×2 tables against a two-sided alternative. Stat Med. 1988;7:1297-1306.

22. Kittelson JM, Emerson SS. A unifying family of group sequential test designs. Biometrics. 1999; 55:874-882.

23. Emerson SS. S + SeqTrial: Technical Overview. Seattle, WA: MathSoft, Inc.; 2000.

Markus Neuhauser

Senior Lecturer, Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand