On theory, statistics, and the search for interactions in the organizational sciences – Research Methods & Analysis

Philip Bobko

Several of the articles in this issue (and in recent prior issues) of the Journal of Management are devoted to moderated regression; others are devoted to regression in general. Due to the inability of researchers to detect moderator effects, even in the presence of what is believed to be strong theory, investigation of the properties of moderated (interactive) regression has remained the focus of many methodologically oriented articles in the organizational sciences. As Bedeian and Mossholder (1994) note in this issue, moderated regression models are used in a variety of management subdisciplines, including organizational behavior, organizational theory, strategy, operations management, and human resources management. The purpose of our article is to consider some recent methodological developments in moderated regression, with an eye toward identifying the critical themes and questions worthy of study in the upcoming years.

Bedeian and Mossholder

One paper in this issue, by Bedeian and Mossholder (1994), considers whether a value of multiple R² that is significantly different from zero is required before one is “allowed” to test the significance of a regression weight for an interaction term. Put another way, does the overall coefficient of determination (R²) have to be significant before one can ask if the incremental R², due to an interaction term, is significant? The bottom line for Bedeian and Mossholder is, “… given a theory-based, a priori hypothesis, an MMR [moderated multiple regression] analysis is analogous to a planned statistical comparison and, thus, a significant overall F value is not a prerequisite for interpreting a significant interaction term” (1994: abstract). We couldn’t agree more with their conclusion. It is refreshing to see theory used as the primal criterion by which an analytic procedure is determined.

There is some controversy in the statistical literature regarding the analysis of interaction terms. As Evans (1991) notes, some researchers mistakenly analyze just the bivariate relationship between the dependent variable and the cross-product term. However, in the context of traditional regression analysis, one should not simply place the interaction term in the model and then test it for significance without first adding the “main effects.” That is, if the cross-product term, (X₁)(X₂), is in the equation, then the variables X₁ and X₂ should also be in the equation. In such analyses, the burden of proof is on demonstrating that the interaction adds unique explanatory power over and above main effects (Bobko, in press; Cohen & Cohen, 1983). Using only interaction terms in a model confounds main effects and interactions and is not congruent with the field’s usual appeal to parsimony.
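The hierarchical logic described above can be sketched in a few lines of code. The following is a minimal illustration, not a reproduction of any study discussed here; the sample size, coefficients, and variable names are our own choices. It fits the main-effects model first, then adds the cross-product term and examines the incremental R² that the interaction contributes.

```python
import random

def ols_r2(X, y):
    # Fit y on the given predictors (plus an intercept) by solving the
    # normal equations (X'X)b = X'y with Gaussian elimination; return R².
    n = len(y)
    Xa = [[1.0] + list(row) for row in X]  # prepend intercept column
    k = len(Xa[0])
    A = [[sum(Xa[i][p] * Xa[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    b = [sum(Xa[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):                   # elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):         # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    yhat = [sum(bb * xx for bb, xx in zip(beta, row)) for row in Xa]
    ybar = sum(y) / n
    ss_tot = sum((v - ybar) ** 2 for v in y)
    ss_res = sum((v - h) ** 2 for v, h in zip(y, yhat))
    return 1.0 - ss_res / ss_tot

random.seed(1)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
# simulated criterion with a genuine interaction effect
y = [0.4 * a + 0.4 * b + 0.5 * a * b + random.gauss(0, 1)
     for a, b in zip(x1, x2)]

r2_main = ols_r2(list(zip(x1, x2)), y)                     # main effects only
r2_full = ols_r2([(a, b, a * b) for a, b in zip(x1, x2)], y)  # plus interaction
print(round(r2_main, 3), round(r2_full, 3), round(r2_full - r2_main, 3))
```

In practice the incremental R² would be tested with a partial F-test; the sketch simply shows that the nested models make the interaction's "unique explanatory power" directly visible.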

As an aside, note that the primal use of theory to guide hypothesis testing procedures has led to different adaptations of hierarchical regression, as well as suggestions involving procedures other than traditional hierarchical regression. For example, McClelland and Judd (1993) discuss the controversial practice of oversampling extreme observations on the independent variables in order to maximize the power of moderated regression in field settings. Further, in the design literature, Bobko (1986) has proposed a series of planned contrasts to increase power in the search for interactive effects. Strube and Bobko (1989) demonstrated that the power gain from this theory-driven set of contrasts offsets any increases in Type I error associated with the procedure. The point here is that such theory-generative approaches are consistent with the logic of Bedeian and Mossholder.

To repeat, we agree with Bedeian and Mossholder’s (1994) basic conclusion. We also suggest that researchers in the field consider this conclusion as a reaffirmation of Wilson’s (1962) notion of hypothesis-wise error rates. To explain further, we note that individuals who recommend omnibus significance tests are correctly concerned with spurious significance that can arise when many post-hoc tests are conducted (where each test is conducted at the usual .05, or .01, level). In this case, one might appeal to experiment-wise error rates in order to control overall Type I error. However, Wilson (1962) noted that a single data collection effort and analysis might entail several a priori hypotheses (i.e., those based on theory). In this case, he argued, each hypothesis “deserves” its own error rate. We endorse this type of theoretically driven thinking and believe that Bedeian and Mossholder’s conclusions are an example of hypothesis-wise error rate logic (the interaction effect being given its own hypothesis).

In fact, we suggest that Bedeian and Mossholder could have (and should have) taken their conclusion even one step further. These authors state that if there is not a sound theoretical case for doing so, then the requirement of a significant overall R² is “judicious” (1994, p. 164). As noted, we certainly agree with concerns for Type I error rates and spurious results due to “data snooping”. As a basis for their conclusion, Bedeian and Mossholder borrow from typical practice in the analysis of experimental designs. They note that a significant F-ratio for a factor should probably precede any tests (i.e., post-hoc contrasts) among levels of that factor. However, the requirement of a significant overall R² is not conceptually equivalent to testing an overall F-ratio for a particular factor. Rather, the test of R² considers the significance of the entire equation, complete with all the variables and interactions. In a two-factor experimental design, this would be akin to asking if the sums of squares for the entire model (e.g., total sums of squares for Factor A, Factor B, and the linear-by-linear interaction) were “significant”. We are unaware of any organizational researcher who advocates such an omnibus test. [Again, the test that might be required is for a given factor, not an omnibus test across all factors.]

Thus, the logic for requiring a significant overall R² breaks down, and such a test should not be a prerequisite for testing either of the main effects or the interaction. If Type I error rates are a concern, we suggest that the nominal alpha-level be reduced for each significance test, rather than linking any such test to the overall R². The alpha level could be reduced, if desired, using the Bonferroni procedure. Indeed, Castaneda, Levin, and Dunham (1993) have recently re-affirmed the utility of the Bonferroni procedure, a procedure which can be adopted within our recommended theory-driven approach.
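As a concrete sketch of the Bonferroni recommendation (the three-hypothesis example is ours, chosen for illustration):

```python
def bonferroni_alpha(family_alpha, n_tests):
    # Per-test alpha so the familywise Type I error rate across
    # n_tests planned, a priori hypotheses stays at or below family_alpha.
    return family_alpha / n_tests

# Three planned hypotheses (two main effects and one interaction)
# tested at a familywise rate of .05:
per_test = bonferroni_alpha(0.05, 3)
print(round(per_test, 4))  # 0.0167

# For independent tests the familywise error rate is then bounded below .05:
familywise = 1 - (1 - per_test) ** 3
print(round(familywise, 4))
```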

Stone-Romero, Alliger, and Aguinis

Another article in this issue, by Stone-Romero, Alliger, and Aguinis (1994), also concerns moderated regression. Rather than focusing on Type I error rates, Stone-Romero et al. are concerned about the more usual problem with interactive models (i.e., the lack of statistical power in detecting moderators). These authors demonstrate that sample size, the magnitude of differences between subgroup correlations, and differences in subgroup sample sizes all affect the power to detect underlying interactions. Of course, the published literature on differential validity would have predicted these findings. For example, one can algebraically demonstrate that it is more difficult to show that two correlations differ by a particular amount (say, .20) than to show that a single correlation of that amount (e.g., .20) is significantly different from zero (see footnote #2 in Bartlett, Bobko, & Pine, 1977).
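The Bartlett, Bobko, and Pine (1977) point can be illustrated with Fisher's r-to-z transformation; the sample sizes and correlation values below are our own, chosen only to make the contrast visible:

```python
import math

def fisher_z(r):
    # Fisher's r-to-z transformation
    return 0.5 * math.log((1 + r) / (1 - r))

def se_z(n):
    # standard error of a z-transformed correlation
    return 1 / math.sqrt(n - 3)

n = 100  # observations per (hypothetical) subgroup

# Test a single correlation of .20 against zero:
z_single = fisher_z(0.20) / se_z(n)

# Test a .20 *difference* between two subgroup correlations
# (r1 = .30 vs. r2 = .10); the variance of the difference doubles:
se_diff = math.sqrt(se_z(n) ** 2 + se_z(n) ** 2)
z_diff = (fisher_z(0.30) - fisher_z(0.10)) / se_diff

print(round(z_single, 2), round(z_diff, 2))  # ~2.0 vs. ~1.46
```

With the same nominal effect size (.20), the single correlation reaches conventional significance while the subgroup difference does not.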

On the other hand, there are some interesting interactions among the parameters in the Stone-Romero et al. study that, to the best of our knowledge, have not been studied before. Nor, in fact, has the literature systematically examined the summative effects of these factors. We encourage the field to be aware of the factors studied in Stone-Romero et al., particularly in cases where interactive models do not lead to statistically significant findings.

Having mentioned the differential validity literature, it is important to note that researchers in this area have concluded that differential prediction, rather than differential validity, is the real issue for organizational researchers and personnel psychologists (Bobko & Bartlett, 1978; Linn, 1978). In a regression model containing two main effects (X₁ and X₂) and one interaction (X₁X₂), it is important to realize that the test for the interaction is a test for slope differences. Differential slopes can be a function of differential correlations and/or variances (e.g., in simple regression, the slope is r_xy(s_y / s_x)). Therefore, due to differences in variances, the interaction can be significant even if the subgroup correlations are equal. The work of Stone-Romero et al. is important in that it shows that a technique which has been lamented to have low power can have even less power under some fairly typical circumstances. We suggest that this work be extended to consider parameters which reflect differential subgroup variances.
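The slope formula makes this point in one line of arithmetic. A hypothetical sketch (the correlations and standard deviations are invented for illustration):

```python
def slope_from(r, s_y, s_x):
    # In simple regression the slope is b = r_xy * (s_y / s_x).
    return r * s_y / s_x

# Two subgroups with identical validity coefficients (r = .30) but
# different criterion standard deviations: the slopes differ, so the
# cross-product (interaction) term, which tests slope differences,
# can be significant even though the subgroup correlations are equal.
b_group1 = slope_from(0.30, s_y=1.0, s_x=1.0)
b_group2 = slope_from(0.30, s_y=2.0, s_x=1.0)
print(b_group1, b_group2)  # 0.3 0.6
```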

At its core, the Stone-Romero et al. (1994) simulation is concerned with inefficiencies in the detection of true interaction effects caused by the joint distribution of X₁ and X₂. In ANOVA terms, their simulation can be characterized by a 2-way design (where one factor is dichotomous) with size of the interaction, total sample size, and joint distribution of X₁ and X₂ being varied. Indeed, a number of authors have commented on how the joint distributions of the predictor space can affect the search for moderators (McClelland & Judd, 1993; Schepanski, 1983). As an example, note that a 9×9 experimental design contains 81 possible cells. If one takes all possible pairs of these cells, 60% would exhibit what is called “strict dominance” (i.e., one of the cells would contain higher values on at least one of the independent variables and an equal value on the other). Schepanski (1983) demonstrated how changing the proportion of non-dominated to dominated cells in this experimental design, from 60% to 18%, can increase the effect size in moderated regression by almost 600%! Thus, the distribution of scores on the independent variables can have a substantial effect on the proportion of variance accounted for by the interaction term.

Also, McClelland and Judd (1993) noted that, in traditional hierarchical moderated regression, it is the residual variance of the product term that determines the statistical power of the test for a moderator. These authors show how different joint distributions of X₁ and X₂ impact this term, concluding that “moderator effects are easier to detect to the extent that: (a) extreme values [of the independent variables] occur … and (b) extreme values of each predictor variable co-occur with extreme values of the other predictor variable” (1993, p. 381).

McClelland and Judd (1993) further demonstrated that a sample containing only cells found in the far “corners” of a 9×9 design (with equal numbers of observations in these cells) will yield maximal power to detect true interaction effects. To the extent that a field study contains X₁ and X₂ observations in cells other than those found in the extreme corners of the design, power to detect true interaction effects will decrease. While such findings are known by those interested in the design of maximally efficient experiments, they are often overlooked in moderated regression analyses within field settings (where the frequency with which cells are represented is difficult to control).
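The corner result can be checked directly for a 9×9 design coded -4 to +4 (the coding is our own). Because such symmetric designs leave the cross-product uncorrelated with the main effects, the raw variance of the product term serves as its residual variance:

```python
from itertools import product as cart

def product_variance(points):
    # Variance of the X1*X2 cross-product over a design. For designs
    # symmetric about zero the product is uncorrelated with the main
    # effects, so this equals the product term's residual variance.
    prods = [a * b for a, b in points]
    m = sum(prods) / len(prods)
    return sum((p - m) ** 2 for p in prods) / len(prods)

levels = range(-4, 5)                   # 9 levels coded -4..+4
full_grid = list(cart(levels, levels))  # all 81 cells, equally weighted
corners = [(-4, -4), (-4, 4), (4, -4), (4, 4)]

v_grid = product_variance(full_grid)
v_corners = product_variance(corners)
print(v_grid, v_corners, v_corners / v_grid)
```

The four-corner design yields a product-term variance of 256 versus roughly 44.4 for the full grid, a ratio of about 5.8, which illustrates why observations falling off the corners dilute the power to detect true interactions.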

As one possible solution, McClelland and Judd (1993) presented the strategy of oversampling extreme observations when it is “theoretically important” to demonstrate a moderator effect. Clearly, this may be difficult to accomplish in field studies (i.e., situations where the investigator does not know what values of the predictors characterize each subject until after the data have been collected). This is one aspect of the situation captured in the Stone-Romero et al. study (i.e., differential cell proportions) that, unfortunately, many field researchers cannot take steps to address. Finally, the notion of oversampling based on values of the predictor space has both positive and negative aspects. While the values of the regression parameters remain unbiased (and are estimated with increased efficiency), the values of R² and the proportion of R² due to the interaction terms are affected by the oversampling strategy (note that the effect on R² is consistent with Schepanski’s, 1983, presentation).

Weinzimmer, Mone, and Alwan

The manuscript by Weinzimmer, Mone, and Alwan (1994) that appears in this issue is concerned with the use of regression diagnostics to assess violations of assumptions. The contrived (but heuristically valuable) case of Anscombe (1973) aside, it would be very useful for the field to consider the effects of violations of assumptions under more realistic circumstances. In the context of the prior two articles, it would be useful to see how the violation of any assumptions affected interactive hypotheses, as compared with hypotheses on main effects.

Note that the two articles discussed above were concerned with the mechanics of the test and characteristics of the data (which did not necessarily violate assumptions). The focus on assumptions by Weinzimmer et al. leads us to suggest even more fundamental issues that should surround the use of interactive regression. First, somewhat consistent with the discussion of the Bedeian and Mossholder paper, we suggest that theory should be sufficiently strong before interactive models are used. Second, we suggest that any variables that are used in a continuous fashion be assessed for their “interval” nature. That is, Arnold and Evans (1979) long ago reaffirmed that interactive variables need not be ratio in nature; interval properties were sufficient. However, what are the effects of violating even the interval properties of the measurement? We suggest these effects may be severe. For example, Busemeyer and Jones (1983) have shown that interactive terms used within general linear models can be made to disappear (or appear) by ordinal transformations of the data at hand. Thus, a lack of interval scale properties could create complete ambiguity in interpreting interactive models. Third, we note that researchers do not often assess whether or not their regression models are mis-specified. It is common in linear structural relations (LISREL) analyses to consider variable specification. This makes sense, given that the values of parameter estimates (e.g., regression weights) can be affected by the presence of other explanatory variables. This issue has been recognized in the path analytic literature (see Bobko, 1990, or Pedhazur, 1982, for summaries). Surely, concerns about mis-specification apply to all general linear model estimation, including interactive regression. Future research should consider what the effects of missing variables are on our ability to accurately assess the presence of moderators.

A mention of the Weinzimmer et al. (1994) article would not be complete without noting that these authors recommend the use of traditional transformations of the data if diagnostics indicate violation of assumptions. We believe such recommendations deserve further study before they become routinely accepted. First, transformations lead to practical (and philosophical) questions such as, “Precisely what does the ‘arcsine of turnover’ mean to organizational decision makers?” Second, the above-mentioned work of Busemeyer and Jones (1983) makes quite clear that transformations are two-edged swords: while they reduce apparent violations of assumptions, it is not clear whether the resulting interactions (or lack thereof) are artificial by-products of the fact that interactive terms are not invariant to non-linear transformations of the data. The field needs more definitive research in this area.
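The Busemeyer and Jones (1983) point can be seen with 2×2 cell means. Take a purely additive model, y = x1 + x2 with x1, x2 in {0, 1}: the interaction contrast f(2) - 2f(1) + f(0) is zero on the raw scale, but becomes nonzero once the same responses are scored through a monotone (here, exponential) transformation:

```python
import math

def interaction_contrast(f):
    # 2x2 cell means under an additive model y = x1 + x2 (cells 0, 1, 1, 2),
    # scored through a monotone transformation f. The contrast
    # f(2) - f(1) - f(1) + f(0) is zero if and only if no interaction
    # appears on the transformed scale.
    return f(2) - 2 * f(1) + f(0)

print(interaction_contrast(lambda y: y))  # 0: additive on the raw scale
print(interaction_contrast(math.exp))     # (e - 1)^2: interaction appears
```

The transformation is strictly monotone, so the ordinal information is identical in both scorings; only the interval properties differ, yet the interaction comes and goes with the scaling.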

Don’t Forget the Dependent Variable

The research noted above is indicative of the moderated regression literature in general; these studies have focused either on diagnostics of the regression residuals or on properties and/or characteristics of the independent variables. Russell and his colleagues (Russell, Pinto, & Bobko, 1991; Russell & Bobko, 1992) have taken a different approach by demonstrating that a characteristic of the dependent variable, scale “coarseness”, can affect the power to detect interactive relationships. These authors note that the dependent scale needs to have a sufficient number of outcome possibilities in order to capture the increased information richness implied by interactive models. While Schepanski (1983) and McClelland and Judd (1993) demonstrated a reduction in power when various combinations of the independent variables are examined, a “coarse” operationalization of the dependent variable is also expected to decrease the power of moderated regression (Russell & Bobko, 1992). In the context of the present article, we suggest that the field should consider both sides of the regression equation. Research is needed which integrates (into a single study) how the statistical properties/characteristics of dependent and independent variables operate together in determining the power to detect interactions.
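The coarseness effect is easy to simulate; the parameters below are our own, not values from Russell and Bobko (1992). We generate a criterion driven by a true interaction, then collapse it onto 5-point and 2-point scales and compare the observed correlations with the cross-product term:

```python
import random

def pearson(xs, ys):
    # Pearson correlation, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def coarsen(y, points):
    # Collapse a continuous response onto a k-point scale via
    # equal-width bins over [-3, 3] (values outside are clamped).
    lo, hi = -3.0, 3.0
    width = (hi - lo) / points
    return min(points, max(1, int((y - lo) / width) + 1))

random.seed(7)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
prod = [a * b for a, b in zip(x1, x2)]
# criterion driven by the interaction, plus noise
y = [0.5 * p + random.gauss(0, 1) for p in prod]

r_cont = pearson(prod, y)                           # continuous criterion
r_5pt = pearson(prod, [coarsen(v, 5) for v in y])   # 5-point scale
r_2pt = pearson(prod, [coarsen(v, 2) for v in y])   # dichotomous criterion
print(round(r_cont, 3), round(r_5pt, 3), round(r_2pt, 3))
```

The coarser the dependent scale, the more the observed interaction effect is attenuated, with the dichotomous criterion losing the most; this is the dependent-variable side of the power problem.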

Such research would relate directly to several of the issues mentioned earlier. For example, when a subject must respond using a dependent scale that is too “coarse”, what (presumably non-linear) transformation does the subject use to fit his/her response to the scale provided by the researcher? How do these transformations explicitly affect statistical power? Or, how could researchers specifically use theory, not only to determine the significance testing procedure, but to also determine the number of scale points required on the dependent variable?

Where Do We Go From Here?

We have a variety of suggestions for how the field’s interest in moderated/interactive regression should proceed. From the above considerations we suggest that:

1. Theory should be a primal determinant of analytic technique.

2. One need not require an omnibus test of R² in order to test interactive hypotheses. (If Type I error rates are a concern, then reduce the nominal alpha rate at which each test is conducted.)

3. We need to know more about how distributional properties of the predictors at hand affect the power of moderated regression. (Relatedly, given the low power of moderated regression in field applications (McClelland & Judd, 1993), guidelines need to be developed to assist researchers in deciding when, in the theory testing/development process, it is appropriate to oversample extreme values of the predictor space.)

4. We need to know more about how the characteristics of the dependent variables affect the power of moderated regression analyses.

5. We need to know more about how non-linear transformations of the independent and dependent variables affect the accuracy and power of moderated regression.

6. We need more consideration of fundamental premises and how they affect moderated regression. (These premises include assumptions about underlying distribution theory, the nature of the variables (e.g., interval properties), and accurate specification of the pool of independent variables).

We believe that the articles reviewed above provide steps in the right direction and suggest that integrated, holistic research on these topics will lead to a better understanding and use of moderated regression in theory development and testing.

Acknowledgment: We wish to thank Larry Williams for re-affirming the importance of moderated regression analyses to the field and for giving us the opportunity to provide commentary on articles recently published and articles in this issue of the Journal of Management.

References

Anscombe, F. (1973). Graphs in statistical analyses. The American Statistician, 27: 17-21.

Arnold, H. & Evans, M. (1979). Testing multiplicative models does not require ratio scales. Organizational Behavior and Human Performance, 24: 41-59.

Bartlett, C., Bobko, P., & Pine, S. (1977). Single-group validity: Fallacy of the facts? Journal of Applied Psychology, 62: 155-157.

Bedeian, A. & Mossholder, K. (1994). Simple question, not so simple answer: Interpreting interaction terms in moderated multiple regression. Journal of Management, 20(1): 159-165.

Bobko, P. (1986). A solution to some dilemmas when testing hypotheses about ordinal interactions. Journal of Applied Psychology, 71: 323-326.

—–. (1990). Multivariate correlational data analysis. Pp. 637-686 in M. Dunnette & L. Hough (Eds.), Handbook of industrial and organizational psychology, Vol. 1, 2nd ed. Palo Alto, CA: Consulting Psychologists Press.

—–. (In press). Correlation and regression in industrial/organizational psychology and management. New York: McGraw-Hill.

Bobko, P. & Bartlett, C. J. (1978). Subgroup validities: Differential definitions and differential predictions. Journal of Applied Psychology, 63: 12-14.

Busemeyer, J. & Jones, L. (1983). Analysis of multiplicative combination rules when the causal variables are measured with error. Psychological Bulletin, 93: 549-562.

Castaneda, M., Levin, J., & Dunham, R. (1993). Using planned comparisons in management research: A case for the Bonferroni procedure. Journal of Management, 19: 707-724.

Cohen, J. & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum.

Evans, M. (1991). The problem of analyzing multiplicative composites. American Psychologist, 46: 6-15.

Linn, R. (1978). Single-group validity, differential validity, and differential prediction. Journal of Applied Psychology, 63: 507-512.

McClelland, G. & Judd, C. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114: 376-390.

Pedhazur, E. (1982). Multiple regression in behavioral research, 2nd ed. New York: Holt, Rinehart, & Winston.

Russell, C. & Bobko, P. (1992). Moderated regression analysis and Likert scales: Too coarse for comfort. Journal of Applied Psychology, 77: 336-342.

Russell, C., Pinto, J., & Bobko, P. (1991). Appropriate moderated regression and inappropriate research strategy: A demonstration of the need to give your respondents space. Applied Psychological Measurement, 15: 257-266.

Schepanski, A. (1983). The predictive ability criterion in experimental judgment research in accounting. Decision Sciences, 14: 503-512.

Stone-Romero, E., Alliger, G., & Aguinis, H. (1994). Type II error problems in the use of moderated multiple regression for the detection of moderating effects for dichotomous variables. Journal of Management, 20(1): 168-178.

Strube, M. & Bobko, P. (1989). Testing hypotheses about ordinal interactions: Simulations and further comments. Journal of Applied Psychology, 74: 247-252.

Weinzimmer, L., Mone, M., & Alwan, L. (1994). An examination of perceptions and usage of regression diagnostics in organization studies. Journal of Management, 20(1): 179-192.

Wilson, W. (1962). A note on the inconsistency inherent in the necessity to perform multiple comparisons. Psychological Bulletin, 59: 296-300.

COPYRIGHT 1994 JAI Press, Inc.
