Journal of Management

An exploratory and confirmatory factor-analytic investigation of item wording effects on the obtained factor structures of survey questionnaire measures

Chester A. Schriesheim

Psychometricians (e.g., Anastasi, 1982; Nunnally, 1978) have generally made the broad recommendation that both standard (“positive”) and reverse-scored (“negative”) items be included in applied research instruments so as to control for response biases (such as “acquiescence” or agreement response tendency). This recommendation has implicitly been based upon the assumptions that: (1) such response biases are serious threats to instrument validity; (2) reverse-scored items can be used without serious negative consequences (i.e., that such items have no major adverse side-effects on the psychometric properties of an instrument); and (3) no major differences exist in psychometric quality between the two types of positively-scored items (regular and negated polar opposite; e.g., “I am happy” and “I am not sad,” respectively) or between the two types of negatively-scored items (polar opposite and negated regular; e.g., “I am sad” and “I am not happy,” respectively).

Research, however, raises substantial concerns about the validity of all three assumptions. For example, the weight of evidence casts serious doubt on whether “response biases” are, in fact, serious response contaminants. Summarizing these findings, Nunnally (1978), a strong advocate of employing mixtures of regular and reverse-scored items on instruments, concluded that “…the overwhelming weight of evidence now points to the fact that the agreement tendency is of very little importance either as a measure of personality or as a source of systematic invalidity in measures of personality and sentiments” (Nunnally, 1978, p. 669). Given this evidence, the continued use of reverse-scored items can be justified as a “conservative practice” – but only if the second assumption holds (i.e., that using item reversals is without serious negative consequences). Unfortunately, however, the existing evidence also shows that there are sometimes two serious drawbacks associated with employing reverse-scored items in a questionnaire instrument.

First, although findings to the contrary do exist (Winkler, Kanouse & Ware, 1981), a number of studies have shown that reverse-scored items often negatively affect the reliability and/or assumed validity of measures (e.g., Benson & Hocevar, 1985; Campbell & Grissom, 1979; Schriesheim, Eisenbach & Hill, 1991; Schriesheim & Hill, 1981; Simpson, Rentz & Shrum, 1976). Second, the inclusion of reversed or differentially-worded items may distort factor-analytic results by causing the appearance of artificial factors composed largely (or entirely) of these items. For example, Schmitt and Stults (1985) noted that factor-analytic studies of field-collected data often report that a majority of reverse-scored items load on one or more separate factors (distinct from those on which the remaining non-reversed items load). Following up on this observation, Schmitt and Stults (1985) conducted a series of simulations to explore the effect of careless respondents (those who fail to take note of item reversals) on factor structures, and found that factors composed entirely of reverse-scored items can appear when as few as ten percent of respondents are careless in their responses.

Additional research on measures of role conflict and ambiguity (House, Schuler & Levanoni, 1983; Rizzo, House & Lirtzman, 1970) has likewise shown that the intended conceptual distinction between conflict and ambiguity may be confounded with whether the items are positively or negatively worded (Harris, 1991; McGee, Ferguson & Seers, 1989; Tracy & Johnson, 1981), although some contradictory evidence also exists (Kelloway & Barling, 1990). Similarly, a number of studies have shown that the negatively-worded items in the original Job Diagnostic Survey (Hackman & Oldham, 1975) are one possible cause of its frequently problematic factor structure (Cordery & Sevastos, 1993; Harvey, Billings & Nilan, 1985; Idaszak, Bottom & Drasgow, 1988; Idaszak & Drasgow, 1987; Kulik, Oldham & Langer, 1988). Finally, further evidence exists in a number of diverse areas. For example, one study (Schmitt & Coyle, 1976) factor-analyzed a 74-item measure dealing with the reactions of college applicants to placement interviewers and found one factor to be defined almost entirely by negative descriptors, with very few negative items loading highly on other factors. Similarly, Siegel and Kaemmerer (1978) evaluated 525 statements describing innovative and traditional organizations and obtained a three-factor solution which included one factor that was generally composed of negative items. Finally, Pilotte and Gable (1990) and Miller and Cleary (1993) found that positively- and negatively-worded items formed separate and distinct factors for measures of anxiety about computers and feelings about loneliness, respectively.

Although the findings summarized briefly above suggest that concern is warranted when standard and reverse-scored items are used together, it should be noted that, virtually without exception, researchers have implicitly accepted the third assumption outlined earlier and have treated the two regularly-scored and two reverse-scored item formats as being essentially the same. This appears to mirror the practice of scale developers, who often mix item types together. For example, in measuring organizational commitment, Porter, Steers, Mowday and Boulian (1974) include both polar opposite items (“Often, I find it difficult to agree with this organization’s policies on important matters relating to its employees”) and negated items (“There’s not too much to be gained by sticking with this organization indefinitely”), as do Meyer and Allen (1984) (polar opposite: “I think I could easily become as attached to another organization as I am to this one;” negated: “I do not feel a strong sense of belonging to my organization”). Similarly, in the job satisfaction domain, the Brayfield and Rothe (1951) scale mixes polar opposites (“I consider my job rather unpleasant”) and negated items (“I feel that my job is no more interesting than others I could get”), as does the Job Descriptive Index (Smith, Kendall & Hulin, 1969) (e.g., the supervisor satisfaction subscale contains the polar opposite item, “Lazy,” and the negated item, “Doesn’t supervise enough”).

Unfortunately, however, the limited empirical evidence suggests that the four item types are not similar in their psychometric quality. In a study directed specifically at this issue, Schriesheim et al. (1991) experimentally examined the four item formats and found that the regular items were the most reliable and produced the most accurate responses; these were followed by the negated regular, polar opposite, and negated polar opposite items, in that order. Thus, while the evidence concerning the effects of item type is probably best viewed as suggestive (more weight of evidence is needed to draw firm conclusions), it does raise the question of whether factor-analytic distortions are equal for the regular, polar opposite, negated regular, and negated polar opposite items and whether one or more item formats may be superior to the others in terms of being less prone toward producing artifactual factors or having lower levels of artifactual method (i.e., wording) and/or error variance.

Since no scale with which we are familiar (e.g., Brayfield & Rothe, 1951; Hackman & Oldham, 1975; Meyer & Allen, 1984; Porter et al., 1974; Rizzo et al., 1970; Smith et al., 1969) contains a sufficient number of all four item types to allow an archival analysis of already-collected data, the present study was undertaken to examine this issue and determine: (1) whether all four item formats are equally likely to produce separate factors in an exploratory factor analysis (EFA); and (2) whether confirmatory factor analysis (CFA) shows all four item formats to produce method factors and to have equal amounts of trait, method, and error variance. If some of the item formats are more likely than the others to cause separate factors to appear, or if some have substantially less trait (and more method and/or error) variance, instrument developers and users would be well-advised to avoid their use. However, if one or both of the two reverse-scored item formats (polar opposite and negated regular) did not yield problematic factors and/or levels of method and error variance, the practice of including reverse-scored items on measures might still be followed without seriously impairing instrument quality.



The sample consists of 496 upper-division undergraduates enrolled in business administration courses at a medium-sized private university. The study was conducted on a voluntary basis, and all responses were kept anonymous. The assignment of respondents to the 16 completely-crossed scenario and instrument combinations was random; an equal number (31) was assigned to each. Upon completion of their questionnaires, the respondents were debriefed and thanked for their cooperation. All guidelines of the Academy of Management and the American Psychological Association for the protection of human subjects were followed.


To ensure that each respondent had a suitable referent for the current study, each was given one of four different descriptions of the behaviors displayed by a fictitious supervisor. The respondents were asked to read their particular scenario very carefully and to turn it face down on their desks when they completed it (and to not consult it further). The respondents were then given one of four questionnaire versions and told to describe the behavior of the supervisor in their scripts by completing this instrument.

Two of the scenarios were identical to those previously used by Schriesheim and Hill (1981) and Schriesheim et al. (1991); these two scripts were identical to each other except for the level of supervisory behavior portrayed (high or low, to obtain adequate variance in responses for the analyses; Schriesheim & Hill, 1981, pp. 1107-1108). The two modified scenarios were created by replacing each regularly-worded behavior descriptor in the scripts (e.g., “clear”) with a polar opposite (e.g., “ambiguous”) that was drawn from Roget’s Thesaurus and judged by a group of colleagues as not altering its essential meaning or connotation. (This was done to ensure that scenario wordings did not bias the results due to enhanced recall or reduced confusion effects. Separate analyses by regular or polar opposite scenario wording produced virtually identical results; thus, only combined sample results are presented. However, full tabular results for these separate analyses are available from the authors.)


Selection of the LBDQ. The 95-item questionnaire used to record the subjects’ responses to the scripts was a modification of Form XII of the Leader Behavior Description Questionnaire (LBDQ; Stogdill, 1963). The standard LBDQ contains 100 items measuring 12 dimensions of perceived leader behavior. These items are presented in alternating (“random”) order and the respondents are asked to describe their perceptions by using five-point Likert-type response alternatives (“Always,” “Often,” “Occasionally,” “Seldom,” and “Never”). The LBDQ was selected because it mixes reverse-scored items and items measuring diverse constructs (and because it is lengthy). These properties allowed the use of alternative forms of the same items without their being detected by the respondents. (Not one respondent complained about item redundancies and post-debriefing interviews with 30 of the respondents indicated that the length and diversity of LBDQ items had apparently been successful in preventing the discovery of the alternate form items which were included in the questionnaires.)

Development of modified LBDQ items. To produce an instrument of approximately the same length as the original LBDQ, two randomly-selected LBDQ subscales were deleted (Tolerance of Uncertainty, TU, and Persuasion, P). Then, five of the original Initiating Structure (IS) items (LBDQ items 14, 34, 54, 74, and 94) were slightly modified to yield regular, polar opposite, negated regular, and negated polar opposite versions. This resulted in a 95-item revised LBDQ; the four sets of new IS items are shown in Table 1. It should be mentioned that there is no standard or agreed-upon way of developing item reversals which are not subtly different from the non-reversed items they are supposed to mirror (Rorer, 1965). Consequently, care was taken in the development of the new items for this study. The polar opposite items were developed using common antonyms (listed in Roget’s Thesaurus) which a group of colleagues felt reversed the scoring of each item but did not alter its meaning or connotation. Then, negated versions of the regular and polar opposite items were produced by adding either the phrase “does not” or “not” to each item. The final set of items shown in Table 1 was then pretested on a small class of M.B.A. students (N = 32) who were enrolled in an introductory behavioral course at the same university as the main sample of this study. None of the polar opposite, negated, or negated polar opposite items were judged by more than 8% of the pretest sample as being substantially different from or as coming from a content domain different than the regular items from which they were derived. Thus, while it cannot be claimed that the items shown in Table 1 do not inadvertently incorporate differences in connotation, they are probably equal to or better in quality than the item reversals which are typically employed in most survey questionnaire instruments.
(Parenthetically, we should mention that we did not use all 95 LBDQ items in the analyses reported below, but only 20 – the five Initiating Structure items which appeared on the questionnaire in regular, polar opposite, negated, and negated polar opposite formats. The other LBDQ scales and items were not employed because they nest substantive item content within wording formats; additionally, excluding them also yielded a good respondent-to-item ratio for our analyses.)

Counterbalancing. To offset possible presentation order effects, four versions of the 95-item LBDQ were developed and employed. All four used a “random” format and intermixed items from the LBDQ subscales in the standard alternating manner; the one difference among the four versions was that the order in which the regular, polar opposite, negated polar opposite, and negated regular IS items appeared was counterbalanced. The first questionnaire replaced the five original IS items with the new regular items (see Table 1). The TU and P items were then replaced by the polar opposite, negated polar opposite, and negated regular IS items (in that order). The second questionnaire form used the replacement ordering of polar opposite, regular, negated polar opposite, and negated regular, while the third form employed negated polar opposite, negated regular, regular, and polar opposite items. Finally, the last questionnaire version used the replacement order of negated regular, negated polar opposite, regular, and polar opposite for replacing the original IS, TU, and P items. (Copies of all four instruments are available from the authors.)

Method of Analysis

Exploratory factor analyses (EFA). The reverse-scored (polar opposite and negated regular) LBDQ items were first properly scored. Then, a Pearson product-moment correlation matrix was computed and a principal axis factor analysis undertaken with R² values as initial communality estimates. Eigenvalue-one (Rummel, 1970) and scree tests (Cattell, 1966) suggested that either two or three factors were the appropriate number to extract (Harman, 1976) (the eigenvalues of the first six unrotated factors were 9.14, 1.34, 1.04, 0.82, 0.79, and 0.73, explaining 45.7, 6.7, 5.2, 4.1, 4.0, and 3.6 percent of the variance, respectively); two and three factors were thus extracted and then subjected to direct oblimin (with Δ = 0) and varimax rotations (so as to conform to recommended and common practice, respectively; Harman, 1976; Rummel, 1970). As is frequently done (Hair, Anderson & Tatham, 1987), items with absolute factor loadings ≥ .30 were treated as meaningful for interpretation.
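The arithmetic behind the eigenvalue-one rule and the reported percentages can be sketched as follows. This is a minimal illustration, not the original analysis: it assumes that each percentage is simply the eigenvalue divided by the number of analyzed items (20), which reproduces the figures quoted above to within rounding. The function names are hypothetical.

```python
# Eigenvalues of the first six unrotated factors, as reported in the text.
eigenvalues = [9.14, 1.34, 1.04, 0.82, 0.79, 0.73]

# Assumption: 20 items were analyzed (five IS items in each of four formats),
# so the total standardized variance is 20.
N_ITEMS = 20


def percent_variance(eigenvalue, n_items=N_ITEMS):
    """Percent of total standardized variance associated with one factor."""
    return 100.0 * eigenvalue / n_items


def eigenvalue_one_count(values, cutoff=1.0):
    """Number of factors retained under the eigenvalue-one rule."""
    return sum(1 for value in values if value > cutoff)


percents = [percent_variance(value) for value in eigenvalues]
n_retained = eigenvalue_one_count(eigenvalues)

for value, pct in zip(eigenvalues, percents):
    print(f"eigenvalue {value:5.2f} -> {pct:4.1f}% of variance")
print("factors with eigenvalue above 1.0:", n_retained)
```

Three eigenvalues exceed 1.0 here, which is consistent with the decision to examine both the two- and three-factor solutions once the scree test is also taken into account.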

Confirmatory factor analyses (CFA). The EFA’s outlined above are useful to address the question of whether the four item types are equally-likely to produce separate factors in typical applied management research. However, since they are not theoretically-driven, EFA’s tend to capitalize on chance error in a data set; additionally, they do not allow the clear partitioning of variance into separate trait, method, and error components (Fornell & Larcker, 1981; Schmitt & Stults, 1986). We thus conducted a series of CFA’s employing the maximum likelihood estimation procedures of LISREL VII (Joreskog & Sorbom, 1989) and a slight modification of the general multitrait-multimethod (Campbell & Fiske, 1959) approach outlined by Widaman (1985) and recommended by Schmitt and Stults (1986) (since only one trait – IS – was examined, we omitted the model that tests for the “discriminant validity” of traits). To obtain meaningful estimates of trait, method, and error variance (Fornell & Larcker, 1981), and as suggested by Cudeck (1989), item intercorrelations were analyzed in addition to covariances. These yielded identical results with respect to the significance of parameters and all goodness of fit indices – except for the root mean-square residuals (RMSR). Consequently, being more informative, we report RMSR’s (in addition to parameter estimates for the best-fitting model) obtained from our correlation matrix analyses.

Model testing proceeded as follows. First, a full matrix model (Model 3C in Widaman’s taxonomy) was fit, consisting of three sets of elements: a trait factor (IS), four method factors (regular, polar, negated regular, and negated polar wording), and twenty error terms (item uniquenesses); each IS item was specified as having one loading on the trait factor, one loading on its appropriate method factor, and one unique error term; for all of the models which were specified, no “garbage” parameters (additional cross-loadings or error correlations) were estimated to inflate model fit (MacCallum, 1986). Following convention (Harris, 1991), and to facilitate identification (Schmitt & Stults, 1986; Widaman, 1985), the trait factor was not allowed to be correlated with the method factors or with the error terms, and the method factors were allowed to be correlated only among themselves. The error (uniqueness) terms were also specified as being uncorrelated among themselves or with any other factors.

To examine item format (method) effects, the full model described above was then compared to three additional models. The first rival model consisted of only four method factors and twenty error terms – with no trait factors (Widaman’s Model 1C) – and was used to assess the lack of significant trait variance (i.e., a lack of “convergent validity”). The second rival model (Model 3A) included one trait factor and twenty error terms (but no method factors), and was used to assess the presence of method (item wording) effects. Finally, the third rival model was a refinement of the full model (Revised Model 3C), combining the polar opposite and negated regular method factors into one factor (i.e., this model had one trait factor, three method factors, and twenty error terms). Although the full model’s (Model 3C) significant (p < .01) intercorrelation between the polar opposite and negated regular method factors (.42 – see Table 4) was also significantly (p < .01) less than 1.0, this last rival model was estimated to directly test whether the polar opposite and negated regular method factors could be combined into one method factor (representing items which are reverse-scored) without a significant decrement in model fit.
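The nesting of the full, methods-only, and trait-only models can be made concrete by counting free parameters. The sketch below is an illustration under stated assumptions rather than the LISREL specification itself: it assumes that all factor variances are fixed at 1.0 for identification, that each item has one loading per factor type present in the model, and that only method factors are intercorrelated. The helper names are hypothetical. Under these assumptions the degrees-of-freedom differences come out to 26 (full versus trait-only) and 20 (full versus methods-only), matching the values reported with the chi-square difference tests in the confirmatory results.

```python
def unique_moments(n_items):
    """Number of distinct variances and covariances among the items."""
    return n_items * (n_items + 1) // 2


def free_parameters(n_items, has_trait, n_method):
    """Free parameters, assuming factor variances are fixed at 1.0."""
    trait_loadings = n_items if has_trait else 0
    method_loadings = n_items if n_method > 0 else 0
    method_correlations = n_method * (n_method - 1) // 2
    uniquenesses = n_items
    return trait_loadings + method_loadings + method_correlations + uniquenesses


def degrees_of_freedom(n_items, has_trait, n_method):
    """Model degrees of freedom: distinct moments minus free parameters."""
    return unique_moments(n_items) - free_parameters(n_items, has_trait, n_method)


df_full = degrees_of_freedom(20, has_trait=True, n_method=4)           # Model 3C
df_trait_only = degrees_of_freedom(20, has_trait=True, n_method=0)     # Model 3A
df_methods_only = degrees_of_freedom(20, has_trait=False, n_method=4)  # Model 1C

print("Model 3C df:", df_full)
print("3A minus 3C:", df_trait_only - df_full)
print("1C minus 3C:", df_methods_only - df_full)
```

The revised model is left out of this count because its degrees of freedom depend on how the merged method factor is constrained, which is not specified in the description above.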

Since a relatively large sample size (N = 496) is involved, the adequacy of model fit was not assessed by the chi-square statistic (due to its being particularly sensitive to sample size; Gerbing & Anderson, 1992; Marsh, Balla & McDonald, 1988; Mulaik, James, Van Alstine, Bennett, Lind & Stilwell, 1989). Instead, Joreskog and Sorbom’s (1989) goodness-of-fit (GFI) and adjusted goodness-of-fit (AGFI) indices were employed, along with each model’s RMSR, comparative fit index (CFI; Bentler, 1990), relative noncentrality index (RNI; McDonald & Marsh, 1990), and nonnormed fit index (NNFI; Bentler & Bonett, 1980). These latter three indices were selected for use based upon the recent recommendations of Medsker, Williams, and Holahan (1994), Goffin (1993), Gerbing and Anderson (1992), and others (e.g., Tanaka, 1993). Comparisons between the models were undertaken using the chi-square likelihood test for nested models described by Bentler and Bonett (1980) and recommended by others (Bollen, 1989; Hayduk, 1987; Joreskog & Sorbom, 1989), along with differences in the fit indices (Medsker et al., 1994; Gerbing & Anderson, 1992; Tanaka, 1993). Following the suggestion of Widaman (1985), differences in fit indices of less than .01 were not considered meaningful.


Exploratory Analyses

Table 2 presents the EFA results for both the three- and two-factor solutions, each with oblimin and varimax rotations.

Three-factor results. As shown in the left-hand side of Table 2, in the three-factor findings, Factor 1 is an Initiating Structure (trait) factor, with all but the negated polar opposite items loading on it in the oblimin analysis. The same pattern of Factor 1 loadings exists for the varimax analysis, except that negated polar opposite item 4 (NP-4) also obtains a meaningful (.32) loading on the first factor.

In the oblimin analysis, Factor 2 is composed of only the five negated polar opposite items, clearly showing a method (item wording) effect for this format. In the varimax analysis, however, Factor 2 is more complex: the six items with meaningful loadings include two regularly-worded items (R-2 and R-3), one polar opposite item (P-4), and three negated polar opposite items (NP-1, NP-4, and NP-5).

[TABULAR DATA FOR TABLE 2 OMITTED]

Factor 3 also shows pronounced differences between the oblimin and varimax analyses. In the oblimin results, only two polar opposite items (P-4 and P-5) obtain meaningful Factor 3 loadings, while varimax Factor 3 has two regular items (R-1 and R-5), one polar opposite item (P-5), and three negated polar opposite items (NP-2, NP-3, and NP-4) with meaningful loadings.

Two-factor results. As shown in the right-hand side of Table 2, the two-factor oblimin results mirror the three-factor oblimin findings. Factor 1 is clearly an Initiating Structure (trait) factor, with all but the negated polar items loading on it, and Factor 2 is strictly a method (item wording) factor, with only the negated polar items loading on it.

The two-factor varimax results are also similar to those of the three-factor varimax solution in that they are more complex than the oblimin findings. Here, Factor 1 is again a clear trait factor, as all but four negated polar items (NP-1, NP-2, NP-3, and NP-5) load at the |.30| criterion level or greater. Factor 2 has all of its highest loadings on the five negated polar items (all ≥ .45), but eight other items also have meaningful loadings (items R-1, R-2, R-3, R-5, P-1, P-3, P-4, and NR-5). Thus, the item loading pattern of Factor 2 suggests that it is a method factor, but these results are not as strong or as clear as those of the two-factor oblimin solution.

Confirmatory Analyses

Table 3 presents the goodness-of-fit indices for the alternative models which were examined (the CFI and RNI indices yielded identical results and are therefore not presented separately; Medsker et al., 1994). As shown in Table 3, several models fit the data reasonably well, with the full model (Model 3C – with 1 trait factor, 4 method factors, and 20 error terms) having a good RMSR (i.e., ≤ .05; Cordery & Sevastos, 1993) and good GFI, AGFI, CFI/RNI, and NNFI values (all in excess of .90; Bentler & Bonett, 1980; Bollen, 1989; Cuttance, 1987; Medsker et al., 1994; Tanaka, 1993).


Examination of the differences in chi-square values [Mathematical Expression Omitted] for the first three models of interest shows that the full model (Model 3C) provides a significantly better fit to the data (p < .01) than does the trait-only (Model 3A) ([Mathematical Expression Omitted], 26 df) and the methods-only (Model 1C) ([Mathematical Expression Omitted], 20 df) models. Additionally, the GFI, AGFI, CFI/RNI, and NNFI values of the full model are clearly superior to those of the trait-only and methods-only models. Table 4 presents this full model, showing a significant (p < .01) correlation between the polar opposite and negated regular method factors (Factors 3 and 5). Consequently, a revised model was estimated which collapsed these two factors into one which was reflective of just reverse-scoring method effects (Model Revised 3C). However, upon examination of Table 3 it can be seen that the original full model still provides a better fit than does its revision, as indicated by its uniformly higher GFI, AGFI, CFI/RNI, and NNFI values, as well as by a significant (p < .01) chi-square difference test ([Mathematical Expression Omitted], 6 df). Thus, the original full matrix model (Model 3C), with one trait factor, four method factors, and twenty error terms appears to provide both a satisfactory fit and the best fit to the data.

Examining the parameter estimates shown in Table 4 suggests three principal conclusions. First, some of the regular (R-4) and polar opposite (P-1, P-4, and P-5) items do not appear to suffer from significant method variance (item wording) effects – although their estimated error variances (as reflected in their Θδ, or theta-delta, estimates) are higher than those of the other regular and polar opposite items. This suggests less consistency of method effects for regularly-worded and polar opposite items.

Second, all of the negated polar and negated regular items have significant (p < .05) method factor loadings. In fact, for each item format, the variance attributable to trait (IS), method (item wording), and error (uniqueness) can be computed by squaring the trait (Factor 1) loadings, summing them, and dividing by 5 (the number of items) to yield the average percent of item variance attributable to the underlying trait (IS) (Fornell & Larcker, 1981). The same computations can be performed for the method factor loadings (Factors 2-5), yielding an estimate of average method variance effects. Finally, adding the error estimates (Θδ) of the five items of each type and dividing by five produces an estimate of average item error variance. These calculations clearly show the regular items to be superior to all the others – with 63.6% trait, 5.9% method, and 30.5% error variance. The negated polar opposite items, in stark contrast, have 17.0% trait, 15.9% method, and 67.1% error variance, while the polar opposite and negated regular items are mid-way between the regular and negated polar opposite items with respect to measurement quality – the polar opposite items have 41.5% trait, 9.1% method, and 49.4% error variance, while the negated regular items have 44.2% trait, 8.0% method, and 47.8% error variance.
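The variance partitioning described above can be sketched as follows. The loadings in this example are hypothetical values chosen only for illustration (the Table 4 estimates are not reproduced here), and the error variances are derived from them under the assumption of a standardized solution in which trait, method, and error variance sum to 1.0 for each item.

```python
def partition_variance(trait_loadings, method_loadings, error_variances):
    """Average trait, method, and error variance across one item format."""
    n_items = len(trait_loadings)
    trait = sum(loading ** 2 for loading in trait_loadings) / n_items
    method = sum(loading ** 2 for loading in method_loadings) / n_items
    error = sum(error_variances) / n_items
    return trait, method, error


# Hypothetical standardized estimates for the five items of one format.
trait_loadings = [0.80, 0.75, 0.82, 0.78, 0.84]
method_loadings = [0.20, 0.30, 0.25, 0.10, 0.28]

# Assumption: standardized items, so each uniqueness is the remainder.
error_variances = [
    1.0 - t ** 2 - m ** 2 for t, m in zip(trait_loadings, method_loadings)
]

trait, method, error = partition_variance(
    trait_loadings, method_loadings, error_variances
)
print(f"trait {100 * trait:.1f}%, method {100 * method:.1f}%, error {100 * error:.1f}%")
```

Because the three components are averages of quantities that sum to 1.0 within each item, the three percentages for a format also sum to 100, as they do for each of the four formats reported above.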

Finally, it may be noted that four method effects are present and that these are relatively independent of each other, although the polar opposite and negated regular items share some similarity due to their reverse-scoring (this is reflected in the significant .42 intercorrelation between their factors).



The EFA results show that the polar opposite and negated polar opposite item formats are clearly capable of obtaining problematic loadings in an exploratory factor analysis. However, as also shown by the obtained results, the specific factors which are produced may be difficult to interpret, particularly if – as is commonly done – an orthogonal rotation is employed. Thus, one recommendation which naturally follows from this finding is that researchers interested in exploring item wording effects on factor-analytic structures are probably well-advised to employ oblique rotations to enhance the interpretability of their findings. However, even with oblique rotation, the present study clearly shows that an appropriately-designed CFA can be far more informative than an EFA – for example, by allowing the statistical testing of different rival models and by allowing the partitioning of item variance into that attributable to trait, method, and error. It therefore appears reasonable to suggest that CFA should be considered the analytic method of choice in research on item format effects, although EFA may yield additional information which is useful in some research contexts. It also seems reasonable to suggest that researchers who employ items with different wording formats should routinely test for item wording effects (using CFA procedures), before employing their data for substantive (i.e., hypothesis-testing) purposes. Otherwise, it is possible that item wording effects will go undiscovered, perhaps impairing results and distorting substantive conclusions.

Three decades ago Rorer (1965) observed that it is extremely difficult to produce reverse-scored items which do not change the meanings of the regularly-scored items from which they are derived – even if the changes in meaning are sometimes small and subtle. The current study clearly supports this assertion but also suggests a refinement in that differences in item wording, in addition to item directionality (positive or negative scoring), appear capable of causing the emergence of separate factors in a CFA: had directionality been the sole determinant of factor production, the two regularly-scored item formats (regular and negated polar opposite) should have formed one factor and the two reverse-scored item formats (polar opposite and negated regular) should have formed another. However, four separate method factors were found, suggesting an interaction between scoring directionality and item phrasing.

Finding that all four item formats yielded separate method factors (despite our best efforts at constructing alternative item forms which minimized differences in connotation) suggests something which has apparently not been recognized in earlier research on negatively- and positively-scored items: when compared among themselves, all formats may yield their own unique “method” effects. Thus, if we cannot eliminate item format effects (if they are a part of all items – no matter how they are worded), the question arises as to which format or formats seem preferable.

In this regard, if all four item types are theoretically appropriate for measuring a particular construct, the current research suggests that the polar opposite and negated polar opposite items may yield problematic EFA results and the CFA findings show that the four item types may not be equal with respect to trait, method, and error variance. In fact, the regularly-worded items were clearly superior to the three other formats in the CFA results: they had substantially higher levels of trait variance (63.6%), lower levels of error variance (30.5%), and relatively little method variance (5.9%). Computing coefficient alpha internal consistency reliabilities for the four item formats (separately and in various pair-wise combinations) yielded the results shown in Table 5.

As shown in Table 5, the regularly-worded items were the most reliable in the current sample’s data, followed by the negated regular, polar opposite, and negated polar opposite items (in that order). The Table 5 results also show that mixing polar opposite or negated regular items with an equal number of regular items results in a slight decrease in reliability (compared with what would be expected from an equally long scale of only regularly-worded items). However, mixing the negated polar opposite items with any of the other item formats yields even stronger decrements in scale reliability.
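The comparison described above involves two quantities: the coefficient alpha observed for a scale, and the reliability that would be expected from an equally long scale of regular items only. A minimal sketch of both follows. It assumes that the Spearman-Brown formula is used to form the expectation, which is one conventional choice and is not stated in the text. The function names and the response data are hypothetical.

```python
from statistics import variance


def coefficient_alpha(rows):
    """Coefficient alpha for a respondents-by-items score matrix.

    rows: list of respondents, each a list of k item scores, with
    reverse-scored items already recoded to the common direction.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item and of the summed scale score.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical responses: four respondents, three items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
alpha_short = coefficient_alpha(responses)
# Expected reliability if the scale were doubled with comparable items.
alpha_expected = spearman_brown(alpha_short, 2)
print(round(alpha_short, 3), round(alpha_expected, 3))
```

Under this reading, a mixed-format scale shows a decrement when its observed alpha falls below the Spearman-Brown projection computed from the regular items alone.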

Table 5. Coefficient Alpha Internal Consistency Reliabilities for Item Formats


Anastasi, A. (1982). Psychological testing, 5th ed. NY: Macmillan.

Benson, J. & Hocevar, D. (1985). The impact of item phrasing on the validity of attitude scales for elementary school children. Journal of Educational Measurement, 22: 213-240.

Bentler, P.M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107: 238-246.

Bentler, P.M. & Bonett, D.G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88: 588-606.

Bollen, K.A. (1989). Structural equations with latent variables. New York: Wiley.

Brayfield, A.H. & Rothe, H.F. (1951). An index of job satisfaction. Journal of Applied Psychology, 35: 307-311.

Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56: 81-105.

Campbell, N.O. & Grissom, S. (1979). Influence of item direction on student responses in attitude assessment. Paper presented at the 63rd annual meeting of the American Educational Research Association, San Francisco, CA, April.

Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1: 245-276.

Cordery, J.L. & Sevastos, P.P. (1993). Responses to the original and revised job diagnostic survey: Is education a factor in responses to negatively worded items? Journal of Applied Psychology, 78: 141-143.

Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. Psychological Bulletin, 105: 317-327.

Cuttance, P. (1987). Issues and problems in the application of structural equation models. Pp. 241-279 in P. Cuttance & R. Ecob (Eds.), Structural modeling by example. New York: Cambridge University Press.

Fornell, C. & Larcker, D.F. (1981). Evaluating structural equation models with observable variables and measurement error. Journal of Marketing Research, 18: 39-50.

Gerbing, D.W. & Anderson, J.C. (1992). Monte Carlo evaluations of goodness of fit indices for structural equation models. Sociological Methods and Research, 21: 132-160.

Goffin, R.D. (1993). A comparison of two new indices for the assessment of fit of structural equation models. Multivariate Behavioral Research, 28: 205-214.

Hackman, J.R. & Oldham, G.R. (1975). Development of the Job Diagnostic Survey. Journal of Applied Psychology, 60: 159-170.

Hair, J.F. Jr., Anderson, R.E. & Tatham, R.L. (1987). Multivariate data analysis, 2nd ed. New York: Macmillan.

Harman, H.H. (1976). Modern factor analysis, 3rd ed. Chicago: University of Chicago Press.

Harris, M.M. (1991). Role conflict and role ambiguity as substance versus artifact: A confirmatory factor analysis of House, Schuler, and Levanoni’s (1983) scales. Journal of Applied Psychology, 76: 122-126.

Harvey, R.J., Billings, R.S. & Nilan, K.J. (1985). Confirmatory factor analysis of the Job Diagnostic Survey: Good news and bad news. Journal of Applied Psychology, 70: 461-468.

Hayduk, L.A. (1987). Structural equation modeling with LISREL: Essentials and advances. Baltimore: Johns Hopkins University Press.

House, R.J., Schuler, R.S. & Levanoni, E. (1983). Role conflict and ambiguity scales: Reality or artifacts? Journal of Applied Psychology, 68: 334-337.

Idaszak, J.R., Bottom, W.P. & Drasgow, F. (1988). A test of the measurement equivalence of the revised Job Diagnostic Survey: Past problems and current solutions. Journal of Applied Psychology, 73: 647-656.

Idaszak, J.R. & Drasgow, F. (1987). A revision of the Job Diagnostic Survey: Elimination of a measurement artifact. Journal of Applied Psychology, 72: 69-74.

Jöreskog, K.G. & Sörbom, D. (1989). LISREL 7: User’s reference guide. Mooresville, IN: Scientific Software, Inc.

Kelloway, E.K. & Barling, J. (1990). Item content versus item wording: Disentangling role conflict and role ambiguity. Journal of Applied Psychology, 75: 738-742.

Kulik, C.T., Oldham, G.R. & Langer, P.H. (1988). Measurement of job characteristics: Comparison of the original and revised Job Diagnostic Survey. Journal of Applied Psychology, 73: 462-466.

MacCallum, R.C. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 100: 107-120.

Marsh, H.W., Balla, J.R. & McDonald, R.P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103: 391-410.

McDonald, R.P. & Marsh, H.W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107: 247-255.

McGee, G.W., Ferguson, C.E. & Seers, A. (1989). Role conflict and role ambiguity: Do the scales measure these two constructs? Journal of Applied Psychology, 74: 815-818.

Medsker, G.J., Williams, L.J. & Holahan, P.J. (1994). A review of current practices for evaluating causal models in organizational behavior and human resources management research. Journal of Management, 20: 439-464.

Meyer, J.P. & Allen, N.J. (1984). Testing the “side bet theory” of organizational commitment: Some methodological considerations. Journal of Applied Psychology, 69: 372-378.

Miller, T.E. & Cleary, T.A. (1993). Direction of wording effects in balanced scales. Educational and Psychological Measurement, 53: 51-60.

Mulaik, S.A., James, L.R., Van Alstine, J., Bennett, N., Lind, S. & Stilwell, C.D. (1989). Evaluation of goodness-of-fit indices for structural equation models. Psychological Bulletin, 105: 430-445.

Nunnally, J.C. (1978). Psychometric theory, 2nd ed. New York: McGraw-Hill.

Pilotte, W.J. & Gable, R.K. (1990). The impact of positive and negative item stems on the validity of a computer anxiety scale. Educational and Psychological Measurement, 50: 603-610.

Porter, L.W., Steers, R.M., Mowday, R.T. & Boulian, P.V. (1974). Organizational commitment, job satisfaction, and turnover among psychiatric technicians. Journal of Applied Psychology, 59: 603-609.

Rizzo, J.R., House, R.J. & Lirtzman, S.E. (1970). Role conflict and ambiguity in complex organizations. Administrative Science Quarterly, 15: 150-163.

Rorer, L.G. (1965). The great response style myth. Psychological Bulletin, 63: 129-156.

Rummel, R.J. (1970). Applied factor analysis. Evanston, IL: Northwestern University Press.

Schmitt, N. & Coyle, B.W. (1976). Applicant decisions in the employment interview. Journal of Applied Psychology, 61: 184-192.

Schmitt, N. & Stults, D.M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9: 367-373.

Schmitt, N. & Stults, D.M. (1986). Methodology review: Analysis of multitrait-multimethod matrices. Applied Psychological Measurement, 10: 1-22.

Schriesheim, C.A., Eisenbach, R.J. & Hill, K.D. (1991). The effect of negation and polar opposite item reversals on questionnaire reliability and validity: An experimental investigation. Educational and Psychological Measurement, 51: 67-78.

Schriesheim, C.A. & Hill, K.D. (1981). Controlling acquiescence response bias by item reversals: The effect on questionnaire validity. Educational and Psychological Measurement, 41: 1101-1114.

Siegel, S.M. & Kaemmerer, W.F. (1978). Measuring the perceived support for innovation in organizations. Journal of Applied Psychology, 63: 553-562.

Simpson, R.D., Rentz, R.R. & Shrum, J.W. (1976). Influence of instrument characteristics on student responses in attitude assessment. Journal of Research in Science Teaching, 13: 275-281.

Smith, P.C., Kendall, L.M. & Hulin, C.L. (1969). The measurement of satisfaction in work and retirement. Chicago: Rand-McNally.

Stogdill, R.M. (1963). Manual for the leader behavior description questionnaire-Form XII. Columbus: Bureau of Business Research, Ohio State University.

Tanaka, J.S. (1993). Multifaceted conceptions of fit in structural equation models. Pp. 10-39 in K.A. Bollen & J.S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.

Tracy, L. & Johnson, T.W. (1981). What do the role conflict and role ambiguity scales measure? Journal of Applied Psychology, 66: 464-469.

Widaman, K.F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9: 1-26.

Winkler, J.D., Kanouse, D.E. & Ware, J.E. Jr. (1981). Controlling for acquiescence response set in scale development. Paper presented at the 90th annual meeting of the American Psychological Association, Los Angeles, CA, August.

Table 1. Experimental Initiating Structure (IS) Items

Regular Format Items (R)

R-1. He makes the use of uniform procedures required.
R-2. He communicates his attitude to the group in a precise manner.
R-3. He gives group members precise task assignments.
R-4. He is active in scheduling the work to be done.
R-5. He tells group members that rules and regulations are to be followed.

Polar Opposite Items (P)

P-1. He makes the use of uniform procedures optional.
P-2. He communicates his attitude to the group in a vague manner.
P-3. He gives group members vague task assignments.
P-4. He is passive in scheduling the work to be done.
P-5. He tells group members that rules and regulations are to be ignored.

Negated Polar Opposite Items (NP)

NP-1. He does not make the use of uniform procedures optional.
NP-2. He does not communicate his attitude to the group in a vague manner.
NP-3. He does not give group members vague task assignments.
NP-4. He is not passive in scheduling the work to be done.
NP-5. He does not tell group members that rules and regulations are to be ignored.

Negated Regular Items (NR)

NR-1. He does not make the use of uniform procedures required.
NR-2. He does not communicate his attitude to the group in a precise manner.
NR-3. He does not give group members precise task assignments.
NR-4. He is not active in scheduling the work to be done.
NR-5. He does not tell group members that rules and regulations are to be followed.

COPYRIGHT 1995 JAI Press, Inc.
