Use of the jackknife statistic to evaluate result replicability

Rebecca P. Ang

Multiple regression is widely used in behavioral science research. When using multiple regression analysis, or any other parametric statistical procedure, researchers are typically concerned with whether the results obtained are generalizable to the larger population of interest. Because all classical analytic methods are correlational (Knapp, 1978), they capitalize on sampling error. Techniques such as multiple regression, for example, take advantage of sampling error by overfitting the data. The regression equation is “fitted” to the data by what is called the method of least squares. This method fits the equation or the line in such a way that the sum of squared distances from the data points to the line is a minimum. Stated differently, the least-squares criterion minimizes the sum of squared errors of prediction, which is equivalent to maximizing the correlation between the observed and predicted y scores.
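
The least-squares criterion described above can be illustrated with a minimal sketch. The toy data and the function names (`fit_line`, `sse`) are hypothetical and are not taken from the article; the sketch fits a single-predictor line and checks that small changes to the fitted coefficients only increase the sum of squared errors.

```python
# Toy data, hypothetical and used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Closed-form least-squares intercept and slope for one predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def sse(intercept, slope, x, y):
    """Sum of squared errors of prediction for a candidate line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


b0, b1 = fit_line(x, y)
best = sse(b0, b1, x, y)

# Nudge the fitted coefficients in every direction; each nudge should
# produce a larger sum of squared errors than the least-squares line.
perturbed = [
    sse(b0 + d0, b1 + d1, x, y)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]
print(best, min(perturbed))
```

Because the fitted line is tuned to this particular sample, the same coefficients would usually fit a new sample somewhat less well, which is the overfitting concern raised in the text.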

The method results in two interrelated problems: (a) overfitting results in overestimates of the statistical parameters, which (b) might result in the lack of generalizability of results upon replication. Overestimates of statistical parameters might result in a possible failure to replicate the obtained results, which might indicate in turn the possibility that the results constitute an artifact of a specific sample and could differ from those of the population. If sampling error poses a threat to the generalizability of a set of results, then the best way to eliminate that threat is to find approximately the same-sized result in a replication study (Carver, 1993). Finding that the original results are stable across future samples is important because it allows researchers to evaluate how well predictions are likely to hold up in the larger group in which they will actually be used.

Many researchers continue to believe that obtaining a statistically significant result is equivalent to result importance or result replicability (Rosnow & Rosenthal, 1989; Thompson, 1989). Result importance rests fundamentally on personal value judgments, not statistics (Thompson, 1993). Result replicability cannot be evaluated by the p values calculated in statistical significance tests (Carver, 1978, 1993; Cohen, 1990). Shaver (1993) argued that

[s]tatistical significance not only provides no information about the probability that replications of a study would yield the same results, but is of little relevance in judging whether actual replications yield similar results. (p. 304)

Although it is true that statistical significance tests focus on the null hypothesis and assess sample statistics in relation to unknown population parameters, many researchers erroneously assume that statistical significance tests evaluate the probability that the null is true in the population, given the sample statistics for the data set. However, what statistical significance tests evaluate is the probability of the sample statistics for the data set, given that the null hypothesis is presumed to be true with regard to parameters in the population (Thompson, 1994). It should be emphasized that stating the first proposition is very different from stating the second. Inferring the likelihood of population parameters, given these sample statistics, is not the same as inferring the likelihood of sample statistics, given an assumption about the population parameters.

One way to examine the likelihood that results will be replicated in future research is to repeat a study with a different sample. Researchers have not favored this method because it is both time- and labor-intensive. Other methods to examine result replicability without reconducting the same study with a new sample exist; they include the jackknife procedure (Crask & Perreault, 1977; Tukey, 1958), double cross-validation (Thompson, 1989), and the bootstrap method (Diaconis & Efron, 1983). These “internal” replication methods evaluate the generalizability of obtained results based on a single sample of participants. The jackknife involves omitting one case or a subset of cases of a fixed size at a time and conducting separate analyses for each configuration. The double cross-validation procedure involves three steps: (a) randomly splitting the original sample in two, (b) conducting separate analyses on the subsets, and (c) empirically comparing the results. The bootstrap conceptually involves the copying of a data set on top of itself multiple times, creating a large data file. A large number of resamples (e.g., n = 500) is randomly drawn, and analyses are conducted on each new subsample. The results of various configurations are then compared.
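
The three internal replication schemes differ mainly in how they form subsamples from the single available sample. The following sketch, written under stated assumptions, generates only the case indices that each scheme would analyze. The function names are hypothetical, and the bootstrap is represented here as drawing N cases with replacement, which is a common reading of the "copying" description above rather than something the article specifies.

```python
import random


def jackknife_samples(n):
    """Yield n index lists, each omitting exactly one case."""
    for omit in range(n):
        yield [i for i in range(n) if i != omit]


def split_halves(n, rng):
    """Randomly split the case indices in two for double cross-validation."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]


def bootstrap_samples(n, n_resamples, rng):
    """Yield resamples of size n drawn with replacement (an assumption)."""
    for _ in range(n_resamples):
        yield [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 25
jk = list(jackknife_samples(n))
half_a, half_b = split_halves(n, rng)
boot = list(bootstrap_samples(n, 500, rng))
print(len(jk), len(half_a), len(half_b), len(boot))
```

Each index list would then be passed to the same analysis (e.g., a regression), and the resulting estimates compared across configurations.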

Because all three internal replicability methods evaluate the stability of obtained results based on a single sample of participants, why might a researcher choose one technique over another? Given that the double cross-validation technique divides the original sample into two groups before generating prediction equations, the jackknife and the bootstrap are especially appropriate when the sample size is small (Taylor, 1991). Double cross-validation reduces the sample size by arbitrarily dividing the data for purposes of conducting the analyses, which could be problematic when the sample is limited in size. Splitting an already small sample increases the risk that the beta weights obtained in the subgroups are merely artifacts of the sample. Of the three internal replication techniques, the double cross-validation method is used most often because of its simplicity in computation (Cooil, Winer, & Rados, 1987).

The jackknife has an added advantage over the other two internal replication techniques because it permits researchers to estimate changes in sampling error related to the idiosyncrasies of a single observation (Tucker & Daniel, 1992). Only the jackknife procedure permits a researcher to omit one observation at a time; hence the jackknife procedure is sensitive enough to detect the impact of outliers on the analysis. The jackknife approach makes use of all of the data in a particular data set while eliminating bias related to the inclusion of atypical cases. Use of the jackknife approach has been demonstrated to produce more conservative and less biased estimates of true population characteristics (Crask & Perreault, 1977). Another advantage of the jackknife is that its methods and computation lend themselves to being easily used on any commonly available statistical package, such as SPSS or SAS. The bootstrap, however, requires the use of more specialized computer programs, such as Lunneborg’s (1987) for univariate analyses and Thompson’s (1988, 1992) for multivariate analyses.

In this article, I demonstrate the use of the jackknife statistic. The jackknife is an important and versatile statistical procedure. However, its lack of use in psychology is in part attributable to researchers’ unfamiliarity with the procedure. Thus, my purpose is pedagogical, not empirical. The jackknife method (Crask & Perreault, 1977) involves repetitively dropping different cases or subsets of cases from the analysis to determine how stable results are across various configurations of cases. I also demonstrate an analysis in which no participants are dropped. The decision whether to omit one case at a time or a subset of cases at a time depends on sample size and the researcher’s purposes. Omitting one case at a time certainly renders the jackknife procedure more sensitive and precise in detecting the presence of outliers within the data set. However, if one has a very large sample size (e.g., N = 800), the computation for omitting one case at a time is both labor intensive and time consuming. Thus, it is useful to note that there is a tradeoff between the benefit of sensitivity of the statistical procedure and its ease of computation.

I used a data set from Edwards (1985, p. 57) to assess the generalizability of multiple regression results. I used three independent variables (X1, X2, X3) and one dependent variable (DV) for a sample of 25 cases in the analysis. In practice, a sample size of 25 is too small to run multiple regression analysis with confidence. However, I used a sample size of 25 only for heuristic purposes. Stevens (1996) suggested that approximately 15 cases/participants are needed per predictor/independent variable for a reliable equation that generalizes with little loss of predictive power. Sample size (n) and number of predictors (k) are two important factors that determine how well a given equation will predict results in future samples. In fact, Stevens suggested that in particular, the n/k ratio is crucial. For small ratios (e.g., 5:1 or less), the shrinkage in predictive power can be substantial. Park and Dudycha (1974) found that with about 15 participants per predictor, the amount of shrinkage was small, with high probability that the squared population multiple correlation equaled .05. In line with that finding, Stevens stated that an estimate of .05 for the squared population multiple correlation is reasonable for social science research.
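
To give a rough sense of how the n/k ratio affects shrinkage, the sketch below uses the familiar adjusted R-squared correction, 1 - (1 - R^2)(n - 1)/(n - k - 1). That formula is not given in this article and is introduced here only as an illustrative assumption; the function name `adjusted_r2` is hypothetical. The value .689 is the multiple R-squared reported later in Table 2.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k predictors (illustrative)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same observed R-squared, two sample sizes with three predictors:
small = adjusted_r2(0.689, 25, 3)  # about 8 cases per predictor
larger = adjusted_r2(0.689, 45, 3)  # 15 cases per predictor
print(round(small, 3), round(larger, 3))
```

Under this correction, the same observed value shrinks less when there are more cases per predictor, which is consistent with the emphasis on the n/k ratio.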

Jackknife Method

I shall outline a step-by-step procedure for using the jackknife statistic as a measure of replicability. (See the Appendix for the SPSS commands used to conduct the analysis. See Table 1 for the raw data set.)

Daniel (1989) provided a comprehensive description of the jackknife procedure. A given sample of size N is partitioned into k subsets, all of which must be of the same size. The value of k can range between 1 and the largest multiplicative factor of N. A predictive estimator (e.g., beta weight, discriminant function coefficient), designated $\theta'$, is then computed with all k of the subsets from the original sample of size N. The same estimator is also computed with the ith subset (i = 1 to k) omitted from the sample; this estimator is designated $\theta'_i$. The procedure is repeated k times, with a different subset omitted each time. Weighted combinations of the $\theta'$ and $\theta'_i$ values are computed before computing the jackknifed estimator. Those weighted values are called pseudovalues (Quenouille, 1956) and are designated by the letter J. Pseudovalues are computed via the following equation.

$J_i(\theta') = k\theta' - (k - 1)\theta'_i$, (1)

where i = 1, 2, 3, . . ., k.

The average of the pseudovalues is the jackknifed estimator:

$J(\theta') = \frac{1}{k} \sum_{i=1}^{k} J_i(\theta')$,

where i = 1, 2, 3, . . ., k.
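
The two equations above can be sketched in a few lines. The function names and the four-value data set are hypothetical; the sample mean is used as the estimator only because its pseudovalues are easy to check by hand (each pseudovalue of a mean is simply the omitted observation).

```python
def pseudovalues(theta_full, theta_omitted):
    """Pseudovalues: k * (full estimate) - (k - 1) * (estimate with subset i omitted)."""
    k = len(theta_omitted)
    return [k * theta_full - (k - 1) * t for t in theta_omitted]


def jackknifed_estimator(theta_full, theta_omitted):
    """The jackknifed estimator is the average of the k pseudovalues."""
    pv = pseudovalues(theta_full, theta_omitted)
    return sum(pv) / len(pv)


def mean(values):
    return sum(values) / len(values)


data = [4.0, 7.0, 1.0, 8.0]  # hypothetical observations
theta_full = mean(data)
# Estimator recomputed k times, each time with one observation omitted.
theta_omitted = [mean(data[:i] + data[i + 1:]) for i in range(len(data))]

pv = pseudovalues(theta_full, theta_omitted)
jk = jackknifed_estimator(theta_full, theta_omitted)
print(pv, jk)
```

For estimators such as beta weights, the pseudovalues no longer reduce to the raw observations, but the computation is the same.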

TABLE 1

Heuristic Data Set

Case  X1  X2  X3  DV
   1  11  38  10  15
   2   7  42  16  18
   3  12  38  18  17
   4  13  36  15  16
   5  14  40  15  17
   6  15  32  11  14
   7   5  20  13  10
   8  14  44  18  21
   9  14  34  12  17
  10  10  28  16  11
  11   8  24  10  13
  12  16  30  16  18
  13  15  26  15  16
  14  14  24  12  15
  15  10  26  12  14
  16   9  18  14  13
  17  11  30  16  16
  18   9  26  13  13
  19   7  18  11  11
  20  10  10  17   6
  21   9  12   8  12
  22  10  32  18  18
  23  10  18  14  12
  24  16  20  15  16
  25  10  18  12  15

First, the original data set (N = 25) is analyzed through the multiple regression procedure in SPSS, yielding multiple $R^2$ and beta weights for the independent variables X1, X2, and X3 (see Table 2). These beta weights and multiple $R^2$ values would be designated as $\theta'$ in the procedure described earlier.

Second, cases are systematically deleted one at a time (e.g., Case 1 is omitted, then Case 2 is omitted) from the data set. The truncated data sets of n = 24 are then each analyzed again through the multiple regression procedure, yielding multiple $R^2$ and beta weights for each analysis. These beta weights and multiple $R^2$ values would be designated as $\theta'_i$, as described in the procedure mentioned earlier. Pseudovalues for the beta weights and for the multiple $R^2$ values are calculated using the equation $J_i(\theta') = k\theta' - (k - 1)\theta'_i$, where i = 1, 2, 3, . . ., k. Because pseudovalues are unbounded real numbers, they can fall anywhere between $-\infty$ and $+\infty$; a pseudovalue for $R^2$, for example, can be negative or exceed 1. To illustrate the procedure in a concrete manner, I shall compute pseudovalues for the instance in which Case 1 is omitted. The computed values are as follows.

Pseudovalue for X1 = 25(.301) - 24(.293) = .493. Pseudovalue for X2 = 25(.666) - 24(.711) = -.414. Pseudovalue for X3 = 25(.024) - 24(-.043) = 1.632. Pseudovalue for $R^2$ = 25(.689) - 24(.707) = .257.
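
The arithmetic for Case 1 can be checked with a short script. The full-sample values and the Case 1 omitted values are the ones quoted in the text; the dictionary names are hypothetical.

```python
k = 25  # number of cases, each omitted once

# Estimates from the full sample (N = 25), as reported in the text.
full = {"X1": 0.301, "X2": 0.666, "X3": 0.024, "R2": 0.689}

# Estimates with Case 1 omitted (n = 24), as reported in the text.
case1_omitted = {"X1": 0.293, "X2": 0.711, "X3": -0.043, "R2": 0.707}

# Pseudovalue = k * (full estimate) - (k - 1) * (estimate with case omitted)
pseudo = {
    name: round(k * full[name] - (k - 1) * case1_omitted[name], 3)
    for name in full
}
print(pseudo)
```

Note that the pseudovalue for X3 (1.632) falls well outside the usual range of a beta weight, which is expected because pseudovalues are not bounded.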

This procedure is repeated 25 times, with a different case omitted each time (see Table 3).
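
For readers who prefer to script the 25 repetitions, the following is a minimal sketch in Python rather than SPSS, using the Table 1 data. It is an illustration under stated assumptions, not the procedure used to produce the tables: the helper names (`corr`, `solve`, `fit`) are hypothetical, standardized beta weights are obtained by solving the correlation-matrix form of the normal equations, and the output may differ slightly from the published values because of rounding.

```python
import math

# Table 1 rows as (X1, X2, X3, DV).
DATA = [
    (11, 38, 10, 15), (7, 42, 16, 18), (12, 38, 18, 17), (13, 36, 15, 16),
    (14, 40, 15, 17), (15, 32, 11, 14), (5, 20, 13, 10), (14, 44, 18, 21),
    (14, 34, 12, 17), (10, 28, 16, 11), (8, 24, 10, 13), (16, 30, 16, 18),
    (15, 26, 15, 16), (14, 24, 12, 15), (10, 26, 12, 14), (9, 18, 14, 13),
    (11, 30, 16, 16), (9, 26, 13, 13), (7, 18, 11, 11), (10, 10, 17, 6),
    (9, 12, 8, 12), (10, 32, 18, 18), (10, 18, 14, 12), (16, 20, 15, 16),
    (10, 18, 12, 15),
]


def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows):
    """Return [beta_X1, beta_X2, beta_X3, R2] for the given cases."""
    cols = list(zip(*rows))
    xs, y = cols[:-1], cols[-1]
    rxx = [[corr(p, q) for q in xs] for p in xs]  # predictor intercorrelations
    rxy = [corr(p, y) for p in xs]  # predictor-criterion correlations
    betas = solve(rxx, rxy)
    r2 = sum(b * r for b, r in zip(betas, rxy))
    return betas + [r2]


k = len(DATA)
theta_full = fit(DATA)
# Refit k times, omitting one case each time.
theta_omit = [fit(DATA[:i] + DATA[i + 1:]) for i in range(k)]
pseudo = [
    [k * f - (k - 1) * t for f, t in zip(theta_full, row)] for row in theta_omit
]
jackknifed = [sum(col) / k for col in zip(*pseudo)]
print([round(v, 3) for v in theta_full])
print([round(v, 3) for v in jackknifed])
```

The first printed list holds the full-sample beta weights and multiple R-squared; the second holds the corresponding jackknifed estimators.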

TABLE 2

Summary of Multiple Regression Results (N = 25)

Variable    B       SE B    β       t       Significance of t
X1          .319    .139    .301    2.302   .032
X2          .228    .047    .666    4.862   .000
X3          .0277   .154    .024    .180    .859
Constant    4.368   2.241           1.949   .065

Note. $R^2$ = .689 (p < .05). B = unstandardized regression coefficients. SE B = standard errors of the unstandardized regression coefficients. β = standardized regression coefficients. t = the Student t ratio.

Third, the jackknifed estimator is obtained by averaging the pseudovalues (see Table 3). The jackknifed estimator (the mean of the k pseudovalues) and its standard error are produced with a spreadsheet program. Fourth, the stability of the jackknifed estimator can be evaluated by constructing a confidence interval about the estimator, because the jackknifed estimator is postulated to be normally distributed (Crask & Perreault, 1977; Tukey, 1958). Equivalently, one can divide the jackknifed estimator by its associated standard error to obtain a Student t value, also known as $t_{calculated}$. The process of generating $t_{calculated}$ follows the generic hypothesis-testing procedure of obtaining a data-based calculated Student t value and comparing it with the appropriate critical value obtained from statistical tables. (See Huck & Cormier, 1996, for an excellent discussion on this topic.) The degrees of freedom equal the number of pseudovalues minus one (k - 1), which is 24 in my example. If $t_{calculated}$ is larger than $t_{critical}$, then the jackknifed estimator can be considered stable (see Tables 3 and 4).
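
Steps 3 and 4 can be sketched as follows. The pseudovalues below are hypothetical (they are not the Table 3 values), the function name is an assumption, and the standard error is taken to be the standard deviation of the pseudovalues divided by the square root of k, which is the usual convention but is not spelled out in the text. The critical value 2.064 is the two-tailed .05 value of Student's t for 24 degrees of freedom.

```python
import math

T_CRITICAL_DF24 = 2.064  # two-tailed, alpha = .05, df = 24 (from t tables)


def jackknife_summary(pseudo, t_critical):
    """Jackknifed estimator, standard error, t ratio, and confidence interval."""
    k = len(pseudo)
    estimate = sum(pseudo) / k
    variance = sum((p - estimate) ** 2 for p in pseudo) / (k - 1)
    se = math.sqrt(variance / k)  # assumed form of the standard error
    t_calculated = estimate / se
    return {
        "estimate": estimate,
        "se": se,
        "t": t_calculated,
        "lower": estimate - t_critical * se,
        "upper": estimate + t_critical * se,
        "stable": abs(t_calculated) > t_critical,
    }


# 25 hypothetical pseudovalues cycling through .3, .4, .5, .6, .7.
pseudo = [0.5 + 0.1 * ((i % 5) - 2) for i in range(25)]
summary = jackknife_summary(pseudo, T_CRITICAL_DF24)
print(summary)
```

An estimator whose interval excludes zero has a calculated t larger than the critical t, so the two ways of stating the decision agree.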

Discussion

The jackknifed estimators were rather close in value to the original beta weights and the $R^2$ value. I used Step 4 of the procedure outlined in the previous section to determine the stability of the jackknifed estimators. In all cases, the original sample value lay within the confidence interval constructed about the jackknifed estimator (see Table 4). To the extent that the jackknifed estimator and the original sample value are similar, the sample value may be judged to be stable and generalizable. Of the three independent variables assessed, the jackknifed estimator for X2 was stable, whereas the jackknifed estimators for X1 and X3 were not (see Table 3).

[Tabular data for Table 3 omitted]

The present findings indicate that only X2 predicts the DV with sufficient generalizability to the larger population of interest. Independent variables X1 and X3 tended to be more biased indicators; I found them to be unstable against changes in the composition of the sample. It appears that the replicability of $R^2$ was influenced primarily by the stability of the beta weight on variable X2.

TABLE 4

95% Confidence Intervals for Jackknifed Coefficients, for All Variables

Coefficient   X1        X2       X3       $R^2$
Original      .301*     .666*    .024*    .689*
Jackknifed    .275      .725     -.0173   .623
Lower         -.00677   .313     -.521    .433
Upper         .557      1.136    .486     .813

Note. * indicates that a coefficient lies within the 95% confidence interval.

However, a caveat is in order. A superficial analysis of the data and the results might lead the reader to conclude erroneously that the variables that have the largest beta weights are replicable. In a condition commonly known as multicollinearity, predictors are highly correlated with each other. As Cooley and Lohnes (1971) noted, researchers typically rely on squared regression weights to determine the relative importance of the predictors. However, that reliance is not appropriate when there is multicollinearity among predictors. Multicollinearity makes determining the relative importance of predictors difficult because the effects of the predictors are confounded by the correlations among them. In addition, multicollinearity increases the variances of the regression coefficients; the greater the variances, the more unstable the prediction equation will be (Stevens, 1996). Because beta weights are greatly affected by collinearity, one should not use the size of a beta weight as the sole basis for judging its importance or stability. If a variable is highly correlated with other predictor variables and is only slightly more correlated with the criterion, sampling error (resulting in a slight fluctuation of a couple of bivariate correlation coefficients) could radically alter that variable's beta weight. Thompson and Borrello (1985) used an empirical example from an actual study to illustrate this concept.
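
The effect of collinearity on coefficient variance can be made concrete with the variance inflation factor. The sketch below covers only the two-predictor case, in which the factor reduces to 1 / (1 - r^2), where r is the correlation between the two predictors. The article does not present this formula, so it is offered here as an illustrative assumption, and the function name is hypothetical.

```python
def vif_two_predictors(r):
    """Variance inflation factor when two predictors correlate at r."""
    return 1.0 / (1.0 - r ** 2)


# Inflation of a coefficient's sampling variance at several correlations.
for r in (0.0, 0.5, 0.9):
    print(r, round(vif_two_predictors(r), 3))
```

With uncorrelated predictors there is no inflation, whereas a correlation of .9 between predictors multiplies the sampling variance of each coefficient more than fivefold, which is why beta weights can swing sharply from sample to sample.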

Because the values in the data set did not constitute real data and were used merely for illustrative purposes, no comparison can be made between the results obtained from the jackknife analysis and the true nature of empirical populations. Although limited in number, examples of actual applications of the jackknife do exist (Brillinger, 1966; Daniel & Okeafor, 1987; Okeafor, Licata, & Ecker, 1987).

Result replicability lies at the very heart of science, because generalizability strengthens confidence in research results. If the predictive power drops off sharply when the regression equation is applied to an independent sample, it has no generalizability and is therefore of limited scientific value. The ultimate purpose of deriving prediction equations is for prediction with future samples; if an equation does not predict well with future samples, it has not fulfilled the purpose for which it was designed. Thus, when an external replication is not feasible, researchers should use the jackknife statistic or any other appropriate procedures, such as the bootstrap or double cross-validation, to determine result stability.

REFERENCES

Brillinger, D. R. (1966). The application of the jackknife to the analysis of sample surveys. Journal of the Market Research Society, 8, 74-80.

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Carver, R. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61(4), 287-292.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.

Cooil, B., Winer, R. S., & Rados, D. L. (1987). Cross-validation for prediction. Journal of Marketing Research, 24, 271-279.

Cooley, W. W., & Lohnes, P. R. (1971). Multivariate data analysis. New York: Wiley.

Crask, M. R., & Perreault, W. D., Jr. (1977). Validation of discriminant analysis in marketing research. Journal of Marketing Research, 14, 60-68.

Daniel, L. G. (1989, January). Use of the jackknife statistic to establish the external validity of discriminant analysis results. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston. (ERIC Document Reproduction Service No. ED 305 382).

Daniel, L. G., & Okeafor, K. R. (1987, November). Teaching experience and confidence in teachers. Paper presented at the annual meeting of the Mid-South Educational Research Association, Mobile, AL. (ERIC Document Reproduction Service No. ED 292 763).

Diaconis, P., & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.

Edwards, A. L. (1985). Multiple regression and the analysis of variance and covariance (2nd ed.). New York: Freeman.

Huck, S. W., & Cormier, W. H. (1996). Reading statistics and research (2nd ed.). New York: Harper Collins.

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410-416.

Lunneborg, C. E. (1987). Bootstrap applications for the behavioral sciences (Vol. 1). Seattle: University of Washington.

Okeafor, K. R., Licata, J. W., & Ecker, G. (1987). Toward an operational definition of the logic of confidence. Journal of Experimental Education, 56, 47-54.

Park, C., & Dudycha, A. (1974). A cross validation approach to sample size determination for regression models. Journal of the American Statistical Association, 69, 214-218.

Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43, 353-360.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in the psychological science. American Psychologist, 44, 1276-1284.

Shaver, J. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61(4), 293-316.

Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.). Mahwah, NJ: Erlbaum.

Taylor, D. L. (1991, January). Evaluating the sample specificity of discriminant analysis results using the jackknife statistic. Paper presented at the annual meeting of the Southwest Educational Research Association, San Antonio, TX. (ERIC Document Reproduction Service No. ED 328 574).

Thompson, B. (1988). Program FACSTRAP: A program that computes bootstrap estimates of factor structure. Educational and Psychological Measurement, 48, 681-686.

Thompson, B. (1989). Statistical significance, result importance, and result generalizability: Three noteworthy but somewhat different issues. Measurement and Evaluation in Counseling and Development, 22, 2-6.

Thompson, B. (1992). DISCSTRA: A computer program that computes bootstrap resampling estimates of descriptive discriminant analysis function and structure coefficients and group centroids. Educational and Psychological Measurement, 52, 905-911.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61(4), 361-377.

Thompson, B. (1994, February). Why multivariate methods are usually vital in research: Some basic concepts. Paper presented at the biennial meeting of the Southwestern Society for Research in Human Development, Austin, TX. (ERIC Document Reproduction Service No. ED 367 687).

Thompson, B., & Borrello, G. M. (1985). The importance of structure coefficients in regression research. Educational and Psychological Measurement, 45, 203-209.

Tucker, M. L., & Daniel, L. G. (1992, January). Investigating result stability of canonical function equations with the jackknife technique. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston, TX. (ERIC Document Reproduction Service No. ED 305 382).

Tukey, J. W. (1958). Bias and confidence in not-quite large samples. Annals of Mathematical Statistics, 29, 614.

APPENDIX

SPSS Commands for the Jackknife Procedure

TITLE 'REGRESSION WITH NO CASES OMITTED'.

DATA LIST FILE 'A:ABC.DAT' RECORDS = 1
  /CASE 1-2 X1 4-5 X2 7-8 X3 10-11 DV 13-14.

VARIABLE LABELS CASE 'CASE NUMBER'
  X1 'FIRST INDEPENDENT VARIABLE'
  X2 'SECOND INDEPENDENT VARIABLE'
  X3 'THIRD INDEPENDENT VARIABLE'
  DV 'DEPENDENT VARIABLE'.

REGRESSION VARIABLES = DV X1 X2 X3
  /DESCRIPTIVES = ALL
  /DEPENDENT = DV
  /ENTER = X1 X2 X3.

subtitle 'regression with case # omitted'.
temporary.
select if (case > # or case < #).
regression variables = dv x1 x2 x3
  /descriptives = all
  /dependent = dv
  /enter = x1 x2 x3.

Note. The commands typed in lowercase letters should be repeated as many times as there are cases, with # substituted for the case number to be dropped.

COPYRIGHT 1998 Heldref Publications

COPYRIGHT 2004 Gale Group