Basic statistics and the inconsistency of multiple comparison procedures

Saville, David J

Abstract This paper has two main themes. First, the various statistical measures used in this journal are summarized and their interrelationships described by way of a flow chart. These are the pooled standard deviation, the pooled variance or mean square error (MSE), the standard error of each treatment mean (SEM) and of the difference between two treatment means (SED), and the least difference between two means which is significant at (e.g.) the 5% level of significance (LSD(5%)). The last three measures can be displayed as vertical bars in graphs, and the relationship between the lengths of these bars is graphically illustrated. It is suggested that the LSD is the most useful of these three measures. Second, when the experimenter has no prior hypotheses to be tested using analysis of variance “contrasts,” a multiple comparison procedure (MCP) that examines all pairwise differences between treatment means may be appropriate. In this paper a fictitious experimental data set is used to compare several well-known MCPs by focussing on a particular operating characteristic, the consistency of the results between an overall analysis of all treatments and an analysis of a subset of the experimental treatments. The procedure that behaves best according to this criterion is the unrestricted least significant difference (LSD) procedure. The unrestricted LSD is therefore recommended with the proviso that it be used as a method of generating hypotheses to be tested in subsequent experimentation, not as a method that attempts to simultaneously formulate and test hypotheses.

The format for this paper is as follows. First I shall discuss the various ways in which statistical information can be summarized in scientific papers, and recommend usage of the least significant difference (LSD) as a succinct and useful measure of experimental variability. Then I shall illustrate differences in an important operating characteristic between various multiple comparison procedures by analyzing a specific data set from an experiment with 32 treatments, and subsets corresponding to selected treatments. This leads to a recommendation that planned and unplanned comparisons between treatment means should be made using the same statistical tool (e.g., a 5% level test per comparison), without any adjustment for multiplicity. The proviso is that experimenters should clearly distinguish between the formulation of new hypotheses involving such comparisons (that require confirmation in subsequent experimentation), and the testing of pre-existing, a priori hypotheses that are being confirmed in the current experiment.

Basic Statistical Measures

As an agricultural research statistician attempting to write an article for an unfamiliar readership, I decided to browse through the 2001 volume (No. 55) of the Canadian Journal of Experimental Psychology in order to gain an appreciation of the usual method of presenting statistical information. I especially searched for papers in which data were assumed to follow a “normal” distribution, leading to the calculation of pooled or unpooled standard deviations and related statistical measures. I selected five such papers for copying and further perusal (Baranski & Petrusic, 2001; Christie & Klein, 2001; Gold & Pratt, 2001; Hubbard, 2001; Shore, McLaughlin, & Klein, 2001).

My study of these five papers yielded the following. In all five papers, F values, MSE (Mean Square Error) values, and p values were reported; these summarized the output produced by the analysis of variance technique. In two of the five papers, the SEM (Standard Error of the Mean) was presented; in another paper, 95% confidence intervals were presented for each mean; in another paper the PLSD (Protected Least Significant Difference) was presented; and in the fifth paper none of these three measures was presented. I shall now discuss these alternative measures, spell out the links between them, and describe their relationship to the analysis of variance.

When the analysis of variance technique is employed for data analysis, the implicit assumption is that all treatments have similar standard deviations. This assumption is routinely checked by plotting the residuals (or errors) from the model against the fitted values; if the resulting scattergram forms a patternless band of roughly constant width (rather than, say, a funnel that widens as the fitted values increase), the assumption is judged to be reasonable.

Under this assumption that all experimental treatments have the same standard deviation (or are homogeneous in variance, to use statistical jargon), a single estimated standard error of the mean (SEM) can be calculated using the formula

SEM = s / √n

if all experimental treatments have the same sample size n, where s is the pooled standard deviation (Figure 1). This SEM applies equally to all of the experimental treatments.

The 95% confidence interval (CI) associated with each of the treatment means is

mean ± SEM × (t critical value)

where the t critical value is the 97.5 percentile of the t distribution with the residual (or pooled error) degrees of freedom associated with the analysis of variance model (Figure 1). For example, if the residual degrees of freedom is 16, the 95% CI is [mean ± SEM × 2.120]. The confidence interval has the same width for each treatment (under the assumptions of the analysis of variance).
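As a concrete illustration, the SEM and the 95% CI for a single treatment mean can be computed in a few lines of Python. This is a minimal sketch, not taken from the paper: the sample numbers (s = 4.0, n = 4) are invented, while the t critical value 2.120 for 16 residual degrees of freedom is the one quoted in the text.

```python
import math

def sem(pooled_sd, n):
    """Standard error of a treatment mean: SEM = s / sqrt(n)."""
    return pooled_sd / math.sqrt(n)

def ci95(mean, pooled_sd, n, t_crit):
    """95% confidence interval for one treatment mean: mean +/- SEM * t."""
    half = sem(pooled_sd, n) * t_crit
    return (mean - half, mean + half)

# Invented numbers: pooled SD s = 4.0 and n = 4 give SEM = 2.0;
# 2.120 is the t critical value for 16 residual d.f. quoted in the text.
lo, hi = ci95(10.0, 4.0, 4, 2.120)
print(round(lo, 2), round(hi, 2))  # 5.76 14.24
```

Note that, as stated above, the same interval half-width applies to every treatment, because the single pooled SEM is used throughout.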

For estimating the variability in the estimated difference between any two experimental treatments, a single estimated standard error of the difference (SED) between the two treatment means can be calculated using the formula

SED = √2 × SEM = √(2s²/n)

This SED applies equally to all pairs of experimental treatments (Figure 1).

The 95% CI associated with the difference between any two of the treatment means is

(difference in means) ± SED × (t critical value)

using the t critical value described above (Figure 1). For example, if the residual degrees of freedom is 16, the 95% CI is [(difference in means) ± SED × 2.120], which equals [(difference in means) ± SEM × √2 × 2.120]; this reduces to [(difference in means) ± 3 × SEM] in this cunningly chosen example, since √2 × 2.120 ≈ 3.0! Again, the confidence interval has the same width for each pair of treatments.

To determine whether one treatment mean differs “significantly” from a second treatment mean, one approach is to determine whether zero is included in the 95% CI for the difference between the two treatment means. If zero is included, then “no difference” is a plausible scenario, so the difference between the two treatment means is “not significant.” This is equivalent to determining whether the difference between the two means is less than SED × (t critical value), the half-width of the 95% CI; this latter quantity is therefore called the “Least Significant Difference” at the 5% level of significance, or LSD(5%).
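The LSD decision rule just described can be sketched in Python. The function names and the sample numbers (s = 4.0, n = 4, t critical value 2.120 for 16 residual d.f.) are illustrative inventions, not from the paper.

```python
import math

def lsd5(pooled_sd, n, t_crit):
    """Least significant difference: SED * t, where SED = sqrt(2) * s / sqrt(n)."""
    sed = math.sqrt(2.0) * pooled_sd / math.sqrt(n)
    return sed * t_crit

def differ_significantly(mean_a, mean_b, pooled_sd, n, t_crit):
    """LSD rule: declare a significant difference iff |mean_a - mean_b| >= LSD."""
    return abs(mean_a - mean_b) >= lsd5(pooled_sd, n, t_crit)

# Invented numbers: s = 4.0, n = 4, t = 2.120 give LSD(5%) of about 6.0.
print(round(lsd5(4.0, 4, 2.120), 2))                    # 6.0
print(differ_significantly(10.0, 16.5, 4.0, 4, 2.120))  # True
print(differ_significantly(10.0, 15.0, 4.0, 4, 2.120))  # False
```

The reader can apply the same rule mentally: a reported difference is significant at the 5% level exactly when it exceeds the reported LSD(5%).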

Of course, LSD(1%) and LSD(0.1%) values can be calculated if the experimenter wishes to determine whether the difference is significant at p = .01 or p = .001, respectively.

As an aside, note that the pooling of the standard deviations between experimental treatments simply means that the pooled standard deviation is a more accurate estimate of the “true” common standard deviation than could be obtained from each treatment individually, and hence the resulting LSD is also a more accurate estimate of its “true” value. Therefore the effect of increasing the number of experimental treatments is simply to increase the accuracy of estimation of the LSD, not to decrease or increase the estimate.

The above procedure for determining which pairs of treatment means differ significantly is an example of a multiple comparison procedure. The one described above is called the unrestricted (or unprotected) LSD procedure (Saville, 1990). This procedure is equivalent to carrying out multiple t-tests of the form

t = (difference between two means) / SED

subject to the restrictions that the SED is based upon a pooled variance estimate as described above, and the t critical value has the corresponding pooled residual d.f. It is also equivalent to carrying out multiple F tests of the form

F = (difference between two means)² / (SED)²

subject to the same restrictions. Some of the pros and cons of such a procedure will be discussed in the next section of this paper.
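The equivalence of the two forms of test is easy to verify numerically: the F statistic for a pairwise comparison is exactly the square of the t statistic. A minimal Python check (the helper names are hypothetical; the difference of 25 and SED of 6.658 are borrowed from the example later in the paper):

```python
def lsd_t(diff, sed):
    """t statistic for a pairwise comparison."""
    return diff / sed

def lsd_f(diff, sed):
    """F statistic for the same comparison; equals t squared."""
    return (diff / sed) ** 2

# Difference 186 - 161 = 25 and SED = 6.658, from the example below.
t = lsd_t(25.0, 6.658)
f = lsd_f(25.0, 6.658)
print(round(t, 2), round(f, 2))  # 3.75 14.1
assert abs(f - t * t) < 1e-12
```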

From the above discussion, it will be apparent that if an author reports any one of the statistical measures MSE (s²), SEM, SED, or LSD(5%), the other measures can be calculated by the reader, assuming that the methods section of the paper gives the sample size (n) and enough information on the statistical design for the reader to calculate the residual degrees of freedom. However, some measures are clearly more convenient for the reader than other measures.
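To make the interconversion concrete, here is a sketch (not from the paper) that recovers the other measures from a reported LSD(5%), given n and the t critical value. The round-trip example uses a hypothetical t critical value of exactly 2.0 so that the recovered values match the SED, SEM, and MSE quoted later in the paper.

```python
import math

def measures_from_lsd(lsd5, n, t_crit):
    """Recover SED, SEM, and MSE from a reported LSD(5%):
    SED = LSD / t, SEM = SED / sqrt(2), MSE = n * SEM**2."""
    sed = lsd5 / t_crit
    sem = sed / math.sqrt(2.0)
    mse = n * sem ** 2
    return sed, sem, mse

# Hypothetical t critical value of exactly 2.0: an LSD(5%) of 13.316
# with n = 16 gives back SED = 6.658, SEM ~ 4.708, and MSE ~ 354.6.
sed, sem, mse = measures_from_lsd(13.316, 16, 2.0)
print(round(sed, 3), round(sem, 3), round(mse, 1))  # 6.658 4.708 354.6
```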

In my opinion, the most convenient measure is the LSD(5%); this can be reported in a table along with the treatment means, or displayed as a vertical bar when results are graphed (Figure 2a). The LSD or vertical bar can be used directly by the reader to determine whether there is a significant difference between two particular treatment means. No calculations are required, unlike in the case of the other three measures. In Figure 2(a), the means of treatments A and B differ by 3.0, while the LSD(5%) is 3.0; hence treatments A and B “only just” differ from one another at p = .05.

If the author chooses to report SEM values, then the SEM must be multiplied by √2 × (t critical value) if the reader wishes to calculate the LSD(5%) in order to determine if two treatments differ from one another at p = .05; for 16 residual degrees of freedom this multiplier is √2 × 2.120 ≈ 3.0. On a graph of the treatment means, the SEM is often displayed by drawing a vertical bar extending one SEM above and below each treatment mean, as shown in Figure 2(b). Note that in Figure 2(b), the vertical bars are far from overlapping, while treatments A and B “only just” differ at p = .05.

If the author chooses to report 95% confidence intervals for each mean, then two confidence intervals can overlap considerably even when the corresponding treatment means are in fact significantly different (perhaps even at p < .01).

In summary, usage of SEM bars tends to make the experimental results look better than is perhaps justified, while usage of 95% confidence intervals will mean that many readers will not realize that differences are statistically significant. In agricultural research in New Zealand, LSD values or bars are the most common method of presentation; I would also recommend this to experimental psychologists. If this method of presentation were to be adopted, the reporting of the overall F, MSE, and the overall p values in the text could also be dispensed with.
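The relationship between the three kinds of bar can be made concrete. Measured in SEM units for 16 residual d.f., the half-lengths are 1 (±SEM bar), 2.12 (95% CI bar), and the LSD bar has length √2 × 2.12 ≈ 3. A short Python check (a sketch, not from the paper) shows why ±SEM bars flatter the results and CI bars hide them:

```python
import math

# Half-lengths of the three kinds of bar, measured in SEM units,
# for 16 residual d.f. (t critical value 2.120, as in the text).
t_crit = 2.120
sem_bar = 1.0                       # +/- one SEM
ci_bar = t_crit                     # +/- SEM * t (95% CI for a mean)
lsd_bar = math.sqrt(2.0) * t_crit   # LSD = sqrt(2) * SEM * t

# Two means exactly one LSD apart are ~3 SEMs apart, so their +/- SEM
# bars are far from overlapping, while their 95% CI bars DO overlap.
print(sem_bar, ci_bar, round(lsd_bar, 2))  # 1.0 2.12 3.0
print(2 * ci_bar > lsd_bar)                # True: the CI bars overlap
```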

Multiple Comparison Procedures

In general, statisticians become unhappy when faced with the prospect of making large numbers of correlated and unplanned pairwise comparisons between the means of the treatments included in an experiment. The usual scientific method is to investigate specific ideas or hypotheses by carrying out an experiment and statistically testing these hypotheses using specific contrasts between the treatment means. The simultaneous formulation and testing of hypotheses concerning all pairwise comparisons within the same experiment can be said to constitute a degradation of the scientific method, and conjures up visions of data dredging. In the face of such considerations, most statisticians react conservatively by advocating a multiple comparison procedure with a high level of protection for the “null” hypothesis (and therefore with low power for detecting real differences). By comparison, my response is to suggest using the simplest of all procedures, the unrestricted LSD, in conjunction with the notion that differences “detected” by the procedure require confirmation in a second, independent experiment (Saville, 1990). When viewed in this light, the multiple comparison procedure is seen as an hypothesis-generating procedure, not a procedure for simultaneous formulation and testing (e.g., in the case of “truly no treatment effects,” the unrestricted LSD procedure will generate false hypotheses at the rate of 5%).

The main reason that I prefer the simplest of formal procedures is the “inconsistency” of the other multiple comparison procedures. In brief, I call a procedure “inconsistent” if the probability of judging two treatments to be different depends on either the number of treatments included in the statistical analysis, or on the values of the treatment means for the remaining treatments. More precisely, in Saville (1990) I call a procedure “consistent” if the “decision it generates as to whether two population means are different is dependent only on (a) the difference between the two sample means, (b) the standard error of this difference, (c) the number of error degrees of freedom, and (d) the significance level at which the procedure is operated” (p. 177). To illustrate the undesirability of inconsistency, I shall now present an example.

An Example

Suppose (fictitiously) that we have 32 treatment programs for problem gamblers that have been included in a trial involving 512 problem gamblers, with 16 gamblers randomly allocated to each treatment program. Each gambler is subjected to a battery of psychological tests prior to the treatment program, and again at the completion of the program. The data we analyze is the increase in a standardized score that is a total over all of the tests included in the battery of psychological tests (with scales converted so that a low value corresponds to a poor psychological state, and a high value corresponds to a good psychological state).

Suppose that the (fictitious) mean increases in score for the 32 treatment programs, sorted into ascending order, are 158, 159, 161, 163, 164, 166, 167, 167, 168, 169, 170, 170, 171, 173, 173, 174, 175, 175, 176, 176, 179, 180, 182, 182, 183, 183, 185, 185, 186, 188, 189, and 190, with a pooled variance estimate of s² = 354.6 (pooled SD = s = 18.8) with 480 residual degrees of freedom. The common SEM is 4.708, the common SED is 6.658, and the LSD(5%) is 13.1.
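These summary statistics can be verified directly from s² = 354.6 and n = 16. A short Python check (the t critical value 1.965 for 480 d.f. is an approximation consistent with the quoted LSD):

```python
import math

# Recompute the example's summary statistics from MSE = 354.6, n = 16.
mse, n = 354.6, 16
s = math.sqrt(mse)              # pooled standard deviation
sem = s / math.sqrt(n)          # standard error of a mean
sed = math.sqrt(2.0) * sem      # standard error of a difference
t_480 = 1.965                   # approximate t critical value, 480 d.f.
lsd5 = sed * t_480              # least significant difference at 5%

print(round(s, 1), round(sem, 3), round(sed, 3), round(lsd5, 1))
# 18.8 4.708 6.658 13.1
```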

I shall use this fictional data set to illustrate the differences between the following multiple comparison procedures: Bonferroni procedure, Tukey’s honest significant difference (HSD) procedure, Student-Newman-Keuls’ multiple range test (MRT), Fisher’s restricted LSD procedure, Duncan’s multiple range test, and the unrestricted LSD procedure. I shall especially focus on the significance of the difference between two of the most popular treatment programs, referring to them as programs A and B (with mean increases in score of 161 and 186, respectively); the data for these two programs are displayed as histograms in Figure 3. For each MCP, I shall first analyze the full data set, then a subset of 13 treatments (including A and B), then a subset of four treatments (including A and B), and lastly a subset of just the two treatments A and B. I have artificially arranged that in all of these analyses, the pooled variance estimate is s² = 354.6, so the SEM is 4.708 and the SED is 6.658 in all analyses. The residual d.f. varies from 480 to 195 to 60 to 30 in the four analyses, and as a result, the LSD(5%) varies within the range 13.1 to 13.6. All analyses are carried out using the analysis of variance and “all-pairwise” routines in the statistical package Genstat (Genstat Committee, 2002).

First, I shall consider the results from an analysis of the full data set. Table 1 shows that in this case, the Bonferroni and Tukey procedures say that treatments A and B do NOT differ significantly (at p = .05).

Second, suppose a British Columbia provincial association of problem-gambling treatment providers requests an analysis of just the 13 treatments that their members have used (those with means 158, 161, 164, 166, 169, 170, 173, 175, 179, 182, 185, 186, and 188). In this case, Table 1 shows that the Bonferroni, Tukey, and Student-Newman-Keuls procedures say that Treatments A and B differ significantly at p = .05.

Third, suppose a North Vancouver treatment provider requests an analysis of just the four treatments that they have used (those with means 161, 171, 176, and 186). In this case, Table 1 shows that the Bonferroni, Tukey, Student-Newman-Keuls, and restricted LSD procedures say that Treatments A and B differ significantly at p = .05.

Lastly, suppose a Kamloops treatment provider requests an analysis of just the two treatments that they have used (Treatments A and B, with means 161 and 186). In this case, Table 1 shows that all six MCPs say that Treatments A and B differ significantly at p = .05.

Inconsistency

In this particular example, the Bonferroni and Tukey HSD procedures are both very “inconsistent” in terms of the decision they return about whether treatment programs A and B differ in mean increase in standardized score. The reason for this inconsistency is that both of these procedures are providing a 95% level of protection for the overall, experiment-wise null hypothesis of “no treatment effects” – this is also referred to as an experiment-wise error rate of 5%. This ensures that in the null case, false hypotheses are generated in at most 5% of experiments. The “down-side” of this is that “interesting” effects may go un-noticed; such effects could be explored in further experimentation. This is also referred to as a low “power,” meaning a low probability of detecting real effects.
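This experiment-wise behaviour can be illustrated numerically. The sketch below (not from the paper) computes an approximate Bonferroni least difference for all pairwise comparisons, using a normal approximation in place of the t distribution (reasonable here, with 480 residual d.f.); the function name is hypothetical. The A-vs-B difference of 25 and the SED of 6.658 stay fixed, yet the criterion shrinks as treatments are dropped from the analysis, so the verdict flips from "not significant" (32 treatments) to "significant" (13 or fewer).

```python
from statistics import NormalDist

def bonferroni_lsd(sed, n_treatments, alpha=0.05):
    """Approximate Bonferroni least difference for all pairwise
    comparisons among n_treatments means.  A normal approximation
    stands in for the t distribution (fine here: 480 residual d.f.)."""
    n_pairs = n_treatments * (n_treatments - 1) // 2
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))
    return z * sed

# The A-vs-B difference is fixed at 25 with SED = 6.658, yet the
# Bonferroni criterion shrinks as treatments are dropped:
for k in (32, 13, 4, 2):
    print(k, round(bonferroni_lsd(6.658, k), 1))
```

With all 32 treatments the criterion is near 26, so the difference of 25 is judged nonsignificant; with 13 treatments it drops to about 23, and with only the two treatments it is essentially the unrestricted LSD of about 13.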

The Student-Newman-Keuls MRT procedure is somewhat less conservative than the Bonferroni and Tukey procedures, since when examining a group of means for “homogeneity,” it uses the expected range for groups with that number of means, rather than the range for 32 means as in Tukey’s procedure. It is therefore slightly more consistent than the first two procedures, though still quite inconsistent (Table 1).

The restricted LSD and Duncan’s MRT are both reasonably consistent in the example that I have chosen (Table 1). In general, the restricted LSD has more potential for inconsistency than Duncan’s MRT (Saville, 1990). The only procedure that is always consistent is the unrestricted LSD, sometimes referred to as Fisher’s unprotected LSD (Saville, 1990).

In practice, consistency is an important criterion. Imagine what would happen at a problem-gambling conference if four presenters of four different papers all happened to refer to the difference between treatment programs A and B in the experimental work described above, each discussing it from their own perspective. If all presenters had used either the Bonferroni or Tukey procedures with the differing subsets of treatments described above, there would have been four different statements concerning the significance of the difference between programs A and B (not significant, significant at p = .05, and so on).

Other Discussions of Inconsistency

Other writers have similarly pointed out the above anomaly introduced by MCPs. Carmer and Walker (1982) tell the delightful story of three porridge breeders, Papa Bear, Mama Bear, and Baby Bear, who consult their statistician, Goldilocks, about the anomalous results induced by usage of an MCP with an experiment-wise error rate; their recommendation is for researchers to use the restricted LSD. In a later paper, the same authors recommend the unrestricted LSD except if the overall null hypothesis is plausible, in which case they suggest the restricted LSD (Carmer & Walker, 1985).

In a similar vein, Klein (1990) presents a parable “indicating why the Bonferroni correction … interferes with seeing real differences between the treatments” (p. 682) in the context of a National Institute of Mental Health research program comparing four treatment regimes. He points out that there could be good evidence for a difference between two treatments when these are analyzed alone, weakened by the inclusion of a third treatment (from “down in the basement”), and reduced to nonsignificance by the inclusion of a fourth treatment (from “up in the attic”).

Similarly, Holland and Cheung (2002) point out that “A criticism of multiple-comparison procedures is that the family of inferences over which an error rate is controlled is often arbitrarily selected, yet the conclusion may depend heavily on the choice of the family” (p. 63). With this in mind, they state that “if the testing result differs with the family selected, the decision is family size inconsistent” (p. 65). This leads to a definition of a “familywise robust” testing procedure as one that, in brief, “tends to make family size consistent decisions” (p. 65). They then use their new criterion to examine the relative merits of some MCPs that are relatively inconsistent in their behaviour. It is interesting to observe that the unrestricted LSD would perform very well under this new criterion.

Related Topics

Writers such as O’Brien (1983) take the notion of consistency further, and contest the idea of pooling variances between treatments, instead advocating that each pairwise comparison should involve only the data from the two treatments currently under comparison. They rightly point out that if variances vary between treatments, the usage of variance estimates from other treatments can introduce inconsistencies similar to those I have discussed in this paper. This means that pooling of variances should be done with caution.

Another question arises in relation to several correlated or repeated measures such as the changes in score as measured by several psychological tests. In this paper I have taken my preferred approach of deriving “summary statistics,” in this case one such variable that collapses the several measures into a single variable. However, if it was decided to statistically analyze each variable separately, I would make no adjustment for the number of variables being analyzed, again on the grounds of consistency.

It is interesting to note that in the case of a replicated 2⁵ factorial design, data analysts will happily carry out 2⁵ − 1 = 31 tests of main effects and interactions, all at the 5% level of significance, without any thought of adjustment for multiplicity. The implicit supposition is that all 31 tests were planned, yet this is unlikely to be true.

Many proponents of multiple comparison procedures argue that a comparison-wise Type I error rate is inappropriate, and that some account must be taken of the multiplicity of tests. I disagree. I feel that the comparison is a more natural conceptual unit than the experiment, or the project consisting of several experiments, or the research program consisting of several projects, or, at the extreme, the family of all comparisons made by a statistician during his or her lifetime. My point is that once the comparison is abandoned as the conceptual unit, it is hard to know where to draw the line.

The Practical Solution

Multiple comparisons are only appropriate when you have no prior hypotheses, and wish to compare all treatments with all other treatments. In this situation, I suggest that you use the simplest MCP, the unrestricted LSD procedure. When used with a 5% level of significance, this MCP has the simple operating characteristic that if there are k equal pairs of treatment means, it will on average falsely declare 0.05k of these pairs to be unequal (e.g., if 400 null comparisons are made, an average of 0.05 × 400 = 20 will be spuriously declared significant at p = .05).
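This 0.05k operating characteristic is easy to check by simulation. The sketch below is not from the paper: the group size and standard deviation are borrowed from the gambling example, and an unadjusted 5%-level z-test (with the true SD assumed known) stands in for the t-test, for simplicity.

```python
import random

random.seed(1)

# Simulate "null" pairwise comparisons: both groups share the same true
# mean, and each comparison uses an unadjusted 5%-level two-sided test.
n, sd, trials = 16, 18.8, 4000
sed = sd * (2.0 / n) ** 0.5          # true SED, assumed known
false_positives = 0
for _ in range(trials):
    a = [random.gauss(170.0, sd) for _ in range(n)]
    b = [random.gauss(170.0, sd) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    if abs(diff) >= 1.96 * sed:      # 5%-level two-sided z-test
        false_positives += 1

rate = false_positives / trials
print(rate)  # close to 0.05, as the 0.05k operating characteristic predicts
```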

For consistency, I treat other contrasts between the treatment means (e.g., class comparisons and linear trends) in the same manner.

In conclusion, my suggestions are as follows. First, when planning an experiment, put in writing your ideas, including any specific hypotheses that you hope to confirm, and give a copy to a reputable acquaintance who can later “testify” to this (not normally required of course!). Then your modus operandi is simple, as follows. (1) Analyze your data and test these specific hypotheses using the orthogonal (or sometimes, nonorthogonal) contrasts that correspond to your hypotheses. For example, one hypothesis may be that a child’s learning rate improves with age, so that a “linear trend with age” contrast may be appropriate. Such contrasts result in a partitioning of the treatment sum of squares in the analysis of variance table, as described in statistics textbooks such as Snedecor and Cochran (1980) and Saville and Wood (1991). Use the significance level of your choice (e.g., 5%, 1%, and so on) for each hypothesis test, and do not “correct” for the number of such tests (their “multiplicity”). (2) When writing your report on the experiment, first describe your prior beliefs, report on the results of the hypothesis tests and whether your beliefs have been strengthened or weakened by the experimental data. Then go on to describe any new ideas that you have formed from further examination of the experimental data; make it clear that your new ideas have been formulated from the experimental data, so must be confirmed by subsequent experimentation. For the formation of such new ideas, again use the significance level of your choice (e.g., 5%, 1%, and so on) for each hypothesis test, and do not correct for multiplicity (this is discussed further in the next paragraph).

General Discussion

In this paper, I have summarized the basic statistical measures that are associated with “normally” distributed data, and have described how they are interrelated. In many cases a convenient way of summarizing the level of statistical variation is to present an LSD(5%) value; this can be used via Figures 1 and 2 to derive other statistical quantities of interest (assuming the methods section is adequate).

Such an LSD value should be used with caution, however. If it is applied to a pairwise difference for which the experimenter has a prior belief, the result can be treated as confirmation or denial of a prior belief. If it is applied to a pairwise difference for which the experimenter has no prior belief, the result corresponds to the formulation of a new hypothesis that needs confirmation in a second experiment.

A similar dichotomy exists in relation to more general contrasts, or comparisons between several treatment means (Saville, 1990). If there is a prior belief that corresponds to a particular contrast, this belief can be tested and thereby confirmed or denied. By comparison, if an unexpected pattern in the treatment means leads to a new hypothesis that corresponds to a new contrast, the new hypothesis requires confirmation in a second experiment.

In summary, with regard to multiple comparison procedures, my belief is that we should use the simple, relatively powerful unrestricted LSD procedure and rely upon better statistical education of researchers to protect against erroneous interpretation of the results, rather than rely upon the conservatism of the procedure to protect researchers from spurious results.

I would like to thank Michael Masson and John Vokey for suggesting that I write a paper for this special issue, and the journal referees for helpful suggestions for its improvement. I thank Jacqueline Rowarth for suggesting the flow chart presented in Figure 1 (which is based upon a similar chart published in the newsletter “New Zealand Soil News”). Russel McAuliffe is thanked for helping with the figures. AgResearch, a government-owned New Zealand Crown Research Institute, is thanked for supporting this work. Please address correspondence to David Saville, Statistics Group, AgResearch, P.O. Box 60, Lincoln 8152, New Zealand (E-mail: dave.saville@agresearch.co.nz).

References

Baranski, J. V., & Petrusic, W. M. (2001). Testing architectures of the decision-confidence relation. Canadian Journal of Experimental Psychology, 55(3), 195-206.

Carmer, S. G., & Walker, W. M. (1982). Baby bear’s dilemma: A statistical tale. Agronomy Journal, 74, 122-124.

Carmer, S. G., & Walker, W. M. (1985). Pairwise multiple comparisons of treatment means in agronomic research. Journal of Agronomic Education, 14(1), 19-26.

Christie, J., & Klein, R. M. (2001). Negative priming for spatial location? Canadian Journal of Experimental Psychology, 55(1), 24-38.

Genstat Committee (2002). The guide to Genstat release 6.1: Part 2: Statistics. Oxford: VSN International.

Gold, J. M., & Pratt, J. (2001). Is position “special” in visual attention? Evidence that top-down processes guide visual selection. Canadian Journal of Experimental Psychology, 55(3), 261-270.

Holland, B., & Cheung, S. H. (2002). Familywise robustness criteria for multiple-comparison procedures. Journal of the Royal Statistical Society, Series B, 64(1), 63-77.

Hubbard, T. L. (2001). The effect of height in the picture plane on the forward displacement of ascending and descending targets. Canadian Journal of Experimental Psychology, 55(4), 325-329.

Klein, D. F. (1990). Letter to the editor: NIMH collaborative research on treatment of depression. Archives of General Psychiatry, 47, 682-684.

O’Brien, P. C. (1983). The appropriateness of analysis of variance and multiple-comparison procedures. Biometrics, 39, 787-794.

Saville, D. J. (1990). Multiple comparison procedures: The practical solution. The American Statistician, 44(2), 174-180.

Saville, D. J., & Wood, G. R. (1991). Statistical methods: The geometric approach. New York: Springer-Verlag.

Shore, D. I., McLaughlin, E. N., & Klein, R. M. (2001). Modulation of the attentional blink by differential resource allocation. Canadian Journal of Experimental Psychology, 55(4), 318-324.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods. Ames, IA: Iowa State University Press.

DAVID J. SAVILLE, AgResearch, New Zealand

Copyright Canadian Psychological Association Sep 2003
