Structural properties and psychometric qualities of organizational self-reports: field tests of connections predicted by cognitive theory

David A. Harrison

Most organizational researchers have used some type of self-report, either to operationalize constructs for theory testing or to gather evidence for theory formulation. Yet, much needs to be learned about the mechanisms underlying this fundamental mode of data collection (Podsakoff & Organ, 1986). Cognitive theory and research have recently shed some light on item-answering processes. The purpose of this paper is to review those theories, and use the sequential cognitive mechanisms within them to hypothesize three links between the structural (contextual) properties and psychometric qualities of self-report instruments. We test the hypotheses in three field experiments, then discuss the ramifications of our findings for the development and use of organizational self-reports.

Cognitive Processes in Self-Reports

Several cognitive theories have been developed to explain how respondents form answers to self-report items (Ericsson & Simon, 1984; Feldman & Lynch, 1988; Hippler, Schwarz & Sudman, 1987; Porac, 1987; Tourangeau & Rasinski, 1988). These theories propose that a respondent goes through a version of most or all of the following processes: (a) interpretation of an item’s meaning, (b) retrieval of relevant information from memory, (c) judgment about some attribute of the retrieved information, and (d) mapping of the judgment to a response on a scale provided by the item. Cognitive theorists and critics of self-reports (e.g., Podsakoff & Organ, 1986) also assume two simultaneous processes: (e) conservation of cognitive effort, and (f) positive self-presentation. Coupling the last two with the first four processes suggests that respondents do not normally take great care in interpreting, retrieving, judging, and responding to questions, nor do they ignore extraneous information (Feldman & Lynch, 1988). In fact, evidence shows that respondents use heuristic versions of (a)–(d), especially when unthinking answers have no personal consequences (Eder & Fedor, 1989). This condition is especially characteristic of anonymous self-report data.

Cognitive mechanisms are not explicitly accounted for in applications of standard measurement theories, which instead posit a (sometimes latent) “true score.” Most measurement inaccuracies are treated as random error. Some researchers and instrument developers, however, have long known the limitations of measurement theories, and have guarded against potentially systematic response biases, styles, and sets that distort true-score estimation by relying on several rules of thumb as guides for structuring their instruments (Sudman & Bradburn, 1982). These rules, in turn, rest on typically untested assumptions about the cognitive processes underlying self-reports.

Tourangeau and Rasinski (1988) have assembled a conceptual framework of these assumed cognitive processes. They focus attention on the effects of an item or scale’s context on interpretation, retrieval, judgment, and response selection. In the present paper, we use their cognitive framework to develop predictions about the psychometric effects of three structural features of self-report instruments: evaluative context, physical context, and response context.

Reverse-Wording, Evaluative Context, and Question Interpretation

One structural rule of thumb directs self-report instruments to contain a few or perhaps as many as one-half negatively or reverse-worded items, to control for acquiescence, leniency bias, and spurious response consistencies (for a review, see Schriesheim, 1981a). Reverse-worded items scattered throughout an instrument are presumed to act as cognitive “speed bumps,” to slow a kind of inattentive inertia that might develop from answering a series of overlapping questions. Tourangeau and Rasinski (1988, p. 299) term this inertial mechanism “cognitive carryover.” They propose that as a respondent moves through a series of positive or negative items, he or she becomes more likely to interpret each subsequent item in terms of the evaluative context created by the series.

Furthermore, sets of self-report items are designed to differ somewhat in their extremity, even if none are reverse-worded, to provide precise measurement and discrimination among respondents throughout the range of the construct being measured (Hulin, Drasgow & Parsons, 1983; Lord & Novick, 1968). That is, even in a positive series, some items are more positive than others, with a few items only mildly so, or perhaps even neutral. For example, the Job Descriptive Index (JDI; Smith, Kendall & Hulin, 1969) contains some descriptive, and somewhat evaluatively neutral items, such as “on your feet” on the Work Itself scale, “leaves me on my own” on the Supervisor scale, and “ambitious” on the Co-workers scale. Because of cognitive carryover prompted by the surrounding (especially prior) items, such neutral items embedded in a consistently negative or positive series should have psychometric qualities that are similar to those of the contextual items (Tourangeau & Rasinski, 1988). This leads us to:

H1: A series of self-report items in the same evaluative (positive or negative) tone will produce context-consistent responses. A neutral item embedded in such a series will have a mean, item-total correlation, and correlation with external constructs that is consistent with the surrounding, contextual items.

Item Grouping (Dispersal), Physical Context, and Information Retrieval

Another structural rule of thumb calls for items measuring different constructs to be randomly spread throughout an instrument, rather than grouped together and/or labeled. Many well-developed and widely used measures in organizational research employ this rule. For example, the Job Diagnostic Survey (Hackman & Oldham, 1975), the Survey of Work Values (Wollack, Goodale, Wijting & Smith, 1971), the Work Values Inventory (Super, 1970), the Manifest Needs Questionnaire (Steers & Braunstein, 1976), the Leader Behavior Description Questionnaire (LBDQ: Stogdill, 1963), and the Anxiety-Stress Questionnaire (House & Rizzo, 1972) all arrange items from different subscales in a random or mixed order.

Random item dispersal is designed to reduce demand characteristics, hypothesis guessing, and resulting spurious relations among constructs (Schriesheim, Solomon & Kopelman, 1989b; Budd, 1987). On the other hand, a random sequence of items is purposely disorganized, perhaps making the respondent’s task more tedious or confusing, prompting the use of heuristic response processes as a coping mechanism (Schriesheim et al., 1989b). In contrast, characteristics of the layout of a questionnaire, such as grouped and labeled sets of items, can facilitate respondents’ cognitive processing. Such features of the physical context of items should prompt respondents to think that items in a demarcated group share the same content, cuing subjects to retrieve information about those grouped items from the same areas in memory (Tourangeau & Rasinski, 1988). Within-group response consistencies among physically grouped items would increase because of the overlap in sources of retrieved information, and internal estimates of reliability (e.g., coefficient α) based on these response consistencies should also be higher.

The physical contexts and cognitive processes that lead to stronger within-group consistencies should also lead to stronger between-group distinctions and (contrary to the assumption of demand characteristics) higher discriminant validity. Such structural properties may help a respondent distinguish between constructs by cuing him or her to retrieve a new source of information when moving to a new item grouping. Some evidence exists that physical grouping cues increase within-group response consistencies and between-group discriminant validities, especially with instruments having somewhat weak psychometric properties (Schriesheim et al., 1989b). Therefore, we propose:

H2: Questionnaires with grouped sets of items provide a physical context that produces stronger item-item response consistencies and higher homogeneity estimates of reliability, within the item groupings. The same structural property also produces higher discriminant validities across groupings.

The degree of support for H2 might depend on respondent perceptions of the similarity of measured constructs. The effects of physically grouping and demarcating items may be weakest for collections of highly dissimilar or orthogonal constructs, and become stronger as the perceived similarity among the constructs increases, reaching its strongest point for collections of items measuring highly similar constructs. Groupings could lead respondents to interpret the physically separated sets of items as referring to different categories of information, resulting in distinct memory search and retrieval processes for each category. In the absence of such groupings, respondents may be more likely to interpret the items as eliciting the same broad class of information, searching for and retrieving information from memory on the basis of a single category. Regardless of the effect of construct similarity, we still expect a main effect of item grouping on psychometric qualities of instruments.

Standardizing Scale Ranges, Response Context, and Judgment-Response Mapping

Many self-report instruments ask subjects to estimate the frequency of an event, behavior, or cognition. A third (more implicit) rule of thumb about structuring self-reports says that one should constrain frequency estimates by instructing respondents to check one of several response options (in the form of verbal or numerical anchors) from a given list. It is intended to reduce some sources of estimation error and retain only meaningful differences among responses (e.g., Bass, Cascio & O’Connor, 1974; Miller, 1991). The practice is quite common. Measures of Physical Symptoms of Stress (Patchen, 1970), Job Pressure (Sutton & Rousseau, 1979), Job-Related Tension (Kahn, Wolfe, Quinn, Snoek & Rosenthal, 1964), and Conflict Resolution Strategies (Howat & London, 1980) are all carefully developed organizational instruments that use some form of frequency anchors.

However, if respondents answer items via the cognitive processes reviewed earlier, they may have already made some form of magnitude judgment in their minds before giving an observable response (Schwarz & Bienias, 1990). Rather than allowing respondents to render this judgment with little to no modification, standardized scale ranges and anchors generate a response context that can cue respondents to a presumably expected or normative answer (Hippler et al., 1987), as well as compel respondents to translate their internal estimate into a different functional form (Ostrom, 1970; Tourangeau & Rasinski, 1988; Upshaw, 1969). For example, a respondent who smokes only while she completes her federal tax forms might fall between standard frequency anchors such as “none” and “a few per week,” with no clear reason for choosing either. As this judgment-to-response mapping becomes a complex task, subjects are likely to adopt the heuristic of anchoring on inadvertent, contextual information – the presumably expected response in the middle of the range of options. Psychometrically, this pushes the item’s response distribution toward the scale midpoint (Schwarz & Bienias, 1990). This leads us to our final hypothesis:

H3: For self-reported frequency items, standardized scale ranges create a response context that biases response distributions toward the scale midpoint.

Research Plan and Overview

These three hypotheses are derived from theory about complementary and sequential (rather than redundant) cognitive mechanisms involved in answering self-report questions. To test Hypotheses 1, 2, and 3, we conducted Studies 1, 2, and 3, respectively. In each of the studies, we manipulated a structural property of an existing self-report instrument to test the hypothesis. We did not choose instruments in any of our studies to single them out for criticism. Instead, they were chosen because they were well-developed and are widely used in management research.

Study 1: Effects of Evaluative Context

Some recent studies have examined the cognitive carryover effects that result from evaluative context (e.g., Harrison & McLaughlin, 1993). This work is limited by the narrow set of constructs studied (dimensions of job satisfaction), the unusually lengthy instruments used (18 items, and therefore an unusually potent context), and the single response format examined (modified adjective checklist). Responding to the recommendations of those investigators, Study 1 extends this past work by using a different construct, measured via a shorter instrument with a different response format (a 5-point scale format, which may be less apt to foster cognitive carryover than the previously studied checklist format).

In addition, past studies have embedded neutral items in a mixed series of positive and negative items to produce a null context or a control condition. It is not clear that this generates a truly null context, as respondents will have already processed some prior items containing positive and/or negative cues. Therefore, we placed our neutral item before all other items in the control (null context) condition. That is, in our control condition, respondents had not yet been exposed to or “cued” by previous positive or negative items. There could be no carryover in our control condition as there were no prior items to carry information from.

Finally, earlier work on the effects of evaluative context focused exclusively on internal, within-scale psychometric qualities and item parameters. We further extend past work by examining external, between-scale relations. Specifically, we test for context-dependent differences in the correlations of an embedded, neutral item with other constructs measured in an employee survey.

Method

Instruments and Design

To create different item contexts, we adapted the Job Affect Scale (JAS: Brief, Burke, George, Robinson & Webster, 1988; Watson & Tellegen, 1985), which is a carefully crafted measure of the affective-expressive component of reactions to one’s job (e.g., “How often at work do you feel . . . peaceful? hostile?”). A 5-point response scale for the adapted JAS items ranged from “not at all” (scored as -2) to “all the time” (scored as +2). We chose five items from both the positive and negative JAS subscales, each of which loaded on both an affective and an arousal-activity dimension. Therefore, our embedded item needed to be close to neutral with respect to both affect and activity. The word “watchful” had near-zero (< .10) loadings on both such dimensions from Osgood, Suci and Tannenbaum’s (1957) review of factor analyses of bipolar adjectives. It could plausibly describe one’s job affect – especially with the security employees in our sample (see below).

We created three forms for the adapted JAS instrument in a one-way, three-level experimental design. One form was a null context or control condition. In this form, “watchful” was the first item, and all the positive and negative affect items followed it. In the other two JAS arrangements, all positive items were listed first in a block, followed by a block of negative items. In the positive context condition, “watchful” was embedded as the fourth item in the positive block; in the negative context condition, “watchful” was embedded as the fourth item in the negative block (making it the ninth item overall). This potentially confounds the effect of an item’s serial position with the effect of its context. It would have been optimal to have serial position (fourth vs. ninth item) crossed with context in a fully factorial design. However, our anticipated sample size and the fairly low power of detecting differences in relationships (Bobko, 1995) led us to retain the evaluative context manipulation at the expense of separating it from serial position. Still, previous research (two studies in Harrison and McLaughlin, 1993, and one in McLaughlin and Harrison, 1990) and pilot studies(1) provide some evidence that counterbalancing the block of positive and negative items would have had no detectable effect.

Subjects and Procedures

The three forms of the JAS were distributed randomly to all employees in a nation-wide security firm, as part of a comprehensive attitude survey. Each employee was given 45-60 minutes of paid work time to complete the survey, usually in a departmental or office meeting, but for field employees (those who worked at customers’ homes) at a time of their choosing. No employee was rushed or urged to complete the survey quickly. Participation was voluntary and confidential.

The survey had a 57% response rate; 366 employees in all departments, functional specialties, locations, and organizational levels returned it. The positive, null (control), and negative context forms were returned by 124, 121, and 119 employees, respectively. Chi-square tests for sex, marital status, and education, as well as t tests for age and tenure showed that respondents did not differ significantly on demographic characteristics from non-respondents (whose demographic data came from an organization-wide database). Most respondents were male (71%) and married (61%). Most respondents also had a high school education (80%), and reported that their job was the main source of their family’s income (71%). Virtually all respondents worked full-time (99%), with a mean tenure of only 2.2 years at the firm (SD = 1.9; the firm had existed for only eight years). Mean age was 33.9 years (SD = 9.0).

Results

The positive and negative affect subscales of the JAS had adequate reliability estimates in the positive, null (control), and negative context conditions. For the positive affect subscale, coefficient α was .84, .91, and .90, across respective conditions. For the negative affect subscale, coefficient α was .84, .76, and .84, respectively. The subscales were also substantially correlated (r’s = -.47, -.53, and -.45, respectively; all p’s < .01). Therefore, in addition to each subscale, a total job affect (JA) scale was created, in which the negative items were reverse-scored and summed with the positive items. Coefficient α for this total scale was .87 (α = .86, .87, and .88, across respective conditions). “Watchful” was not one of the summed items in total JA. To test Hypothesis 1, means for the “watchful” item, along with its correlations with both job affect subscales, total JA, and a set of external constructs, were compared across the negative, null, and positive context conditions. Results are presented in Table 1.
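The reliability estimates above are internal-consistency (coefficient α) estimates. As a minimal illustration of how such an estimate is computed (the response matrix below is hypothetical, not the study’s data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents, k_items) response matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    n, k = items.shape
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 respondents x 3 items on a 5-point scale
responses = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
    [4, 4, 5],
])
print(round(cronbach_alpha(responses), 2))  # -> 0.95
```

Reverse-scored items (as in the total JA scale) would simply be recoded before the matrix is assembled, so that all items point in the same evaluative direction.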

Results are consistent with the predictions of H1. “Watchful” item means were higher in the positive than the negative context condition, with the null context (control) mean falling between these two, F(2, 361) = 5.40, p < .01. There were no significant differences in variance across conditions. Almost all (98%) of this effect stemmed from the difference between item means embedded in positive versus negative contexts: planned contrast F(1, 361) = 10.68 (p < .01).

We also expected an interactive effect of JA and context on “watchful” responses, because H1 states that responses to a neutral or ambiguous item should be consistent with responses in its evaluative context – in this case, responses to prior items in a positive or negative series.(2) Thus, we conditioned on JA to fully observe the evaluative context effects. In a hierarchical regression, we entered evaluative context-by-JA interaction terms after JA and the context main effects. The results of this regression supported Hypothesis 1, F(2, 359) = 9.60, p < .01. The interaction terms improved the prediction of “watchful” responses by ΔR² = .08. Once again, a planned comparison of effects showed that most (86%) of this interactive influence was due to different regression slopes in the positive versus negative contexts (ΔR² = .07; F(1, 359) = 16.46, p < .01). This interaction is clear in Figure 1. It shows that as JA scores became more positive, employees responded: (a) more positively (higher mean) to “watchful” embedded in a positive context – interpreting it to mean something positive, (b) more negatively (lower mean) to “watchful” embedded in a negative context – interpreting it to mean something negative, and (c) roughly the same regardless of their JA in the control condition.
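The logic of this hierarchical test can be sketched as follows. The data, variable names, and effect sizes below are simulated assumptions for illustration only; the sketch computes the R² increment from adding the context-by-JA product terms to a main-effects model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 360  # hypothetical sample size, close to the study's

# Hypothetical predictors: total job affect (JA) and two dummy codes
# for the three context conditions (positive / null / negative).
ja = rng.normal(size=n)
ctx = rng.integers(0, 3, size=n)
d_pos = (ctx == 0).astype(float)
d_neg = (ctx == 2).astype(float)

# Simulated "watchful" responses with a context-by-JA interaction built in:
# positive slope in the positive context, negative slope in the negative one.
y = 0.3 * ja * d_pos - 0.3 * ja * d_neg + rng.normal(scale=0.5, size=n)

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an OLS fit, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Step 1: main effects only; Step 2: add the context-by-JA product terms.
r2_main = r_squared(np.column_stack([ja, d_pos, d_neg]), y)
r2_full = r_squared(np.column_stack([ja, d_pos, d_neg, ja * d_pos, ja * d_neg]), y)
print(f"Delta R^2 = {r2_full - r2_main:.3f}")
```

Because the models are nested, ΔR² is never negative; the F test reported in the text assesses whether the increment exceeds what chance alone would produce.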

[TABULAR DATA FOR TABLE 1 OMITTED]

As predicted, all correlations involving responses to “watchful” were significantly different and in opposite directions for positive and negative contexts. The “watchful”-total JA correlation was .10 in the positive context condition and -.50 in the negative context condition (z test of the difference in Fisher-transformed r’s = 5.05, p < .01). In the negative context condition, employees who responded that they often felt “watchful” also reported less job satisfaction, more withdrawal cognitions, and more negative health symptoms (r = -.25, .22, and .28, respectively). In the positive context condition, employees who responded that they often felt “watchful” reported more job satisfaction, fewer withdrawal cognitions, and fewer health symptoms (r = .07, -.14, and -.04, respectively). Again, each of the correlations in the positive context condition was significantly different from its counterpart in the negative context condition (p < .05), with effects that involve not only changes in magnitude, but direction as well.
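The z test used here compares two independent correlations after Fisher’s r-to-z transformation. A minimal sketch, using the reported correlations and the per-condition sample sizes from the Method section (any small discrepancy from the reported z = 5.05 would reflect the exact n’s available after missing data):

```python
import math

def fisher_z_diff(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations,
    via Fisher's r-to-z transformation (atanh)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# "Watchful"-total JA correlations: .10 in the positive context (n = 124)
# versus -.50 in the negative context (n = 119).
z = fisher_z_diff(0.10, 124, -0.50, 119)
print(round(z, 2))  # approximately 5.0, close to the reported 5.05
```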

Discussion

Summary

These results converge with findings from previous studies, showing that effects of evaluative context predicted by a theory of cognitive processing can occur in field settings, as respondents use contextual cues to help interpret self-report items (Harrison & McLaughlin, 1993; Tourangeau & Rasinski, 1988). Indeed, the major contribution of this study is that it supports the existence of underlying item-answering mechanisms predicted by cognitive theories. The importance of these context effects and their generalizability to most self-reports, however, are limited by the potential confound of serial position with context, and the focus on a single, neutral item.

Limitations and Research Directions

One restriction on our findings is the combination of no counterbalancing for serial position (discussed earlier) with the small number of items used to deliver the context manipulation. The somewhat weaker effect of positive context on “watchful” may have occurred because respondents had only answered three previous (positive) items. In the negative context, there may have been greater buildup or cumulative effect over the eight previous (five positive, then three negative) items. This speculation leads to an interesting, testable notion: use of heuristic response mechanisms may be heightened by repeated exposure to the same structural properties over instruments in a survey.

A clear restriction on the external validity of Study 1 results is the fact that context effects were predicted and found only for responses to a single item. Most instruments used in management research contain multiple items. Despite their acknowledged weaknesses, however, single-item measures are still used in many studies. For example, probabilities, utilities, and strengths of preferences are measured with single items in decision-making (Carroll & Johnson, 1990) and policy-capturing research (Murphy, Herr, Lockhart & Maguire, 1986). Single items have often been used to measure constructs in research that tests procedural justice (Greenberg, 1990), motivation (Schmidt, 1973), job satisfaction (e.g., Kunin’s Faces Scale, 1955), absenteeism (Johns, 1994), turnover decision (Steel & Ovalle, 1984), behavioral intention (Ajzen, 1991; Ajzen & Fishbein, 1980), and strategic management (Miles & Snow, 1978) theories. It would be both interesting and useful to manipulate item context in some of these domains, to see what effect it has not only on item-level psychometric qualities, but also on the fit of theories (Feldman & Lynch, 1988).

It would be even more interesting and useful to test for instrument-level context effects (e.g., measuring coworker trust before or after a measure of perceived ingratiation tactics). Cognitive carryover may involve context-consistent responses to entire sets of items, because of the structure of the questionnaire in which the instrument is embedded, or because of the researcher’s or practitioner’s stated purpose for administering it (Tourangeau & Rasinski, 1988). Although theories predict such scale-level effects, data have not been uniformly supportive (e.g., Stone & Gueutal, 1984).

A further limitation on external validity is that we found context effects for a plainly neutral item. A serious question remains about what context effects to expect for items worded to be increasingly positive or negative. Such studies have been done with respect to negative items. A growing body of evidence suggests that using just a few negative or reverse-worded items scattered throughout a positive series can have detrimental effects on an instrument’s psychometric properties (see Harrison & McLaughlin, 1993, for a brief review). Little research has examined the impact of embedding a few positive items in a negative series, although similar results might be expected.

Another important consideration is that items might be neutral only with respect to the context in which they are embedded. In instruments measuring more than one factor, factor loadings could change because of the change in context that results from mixing items from different dimensions. For example, our “watchful” item might have a high loading on a construct such as “focus of attention.” If focus of attention and job affect items were mixed, “watchful” might acquire a different interpretation and different psychometric qualities, depending on the items just prior to it. More research is needed to test for such context-dependent psychometric qualities.

Finally, the 57% response rate for Study 1 could be considered low enough to raise questions about respondent involvement (although the response rate compares favorably to other surveys; Heberlein & Baumgartner, 1978). Low involvement subjects are predicted to be more likely to adopt effort-saving cognitive heuristics, which would result in stronger item-item response consistencies and carryover (Feldman & Lynch, 1988; Tourangeau & Rasinski, 1988). The two largest groups of employees, sales representatives and technicians, were in the field when the survey was administered. They were told to complete the survey if and when they had a chance. Therefore, those who did respond probably had higher involvement than those who did not, because they took the time to voluntarily complete and return it. Hence, effects may have been underestimated because fewer low-involvement persons took part. In any case, future research on context and carryover would be strengthened by including measures of respondent involvement in the survey’s subject matter.

Study 2: Effects of Physical Context

Effects of the grouping of items on psychometric properties have been reported in earlier research. To test Hypothesis 2, we used a research design, operations, and analyses that differ substantially from those of previous studies. Both Schriesheim, Kopelman, and Solomon (1989a) and Schriesheim et al. (1989b) assessed item grouping effects using within-subjects designs and measures of job characteristics, life-, job-, family-, and self-satisfaction. Their results suggested there is no major psychometric advantage in grouping items. They also encouraged further research on format effects, especially using measures with better psychometric properties than those in their research. Other studies of the effects of grouping items on scale reliabilities and discriminant validities have also been published (Schriesheim, 1981a, 1981b; Schriesheim & DeNisi, 1980). Although those studies clearly contributed to our understanding of grouping effects, additional research using different methods is needed because of some of their limitations (e.g., confounding sample with format, limited format comparisons; described in Schriesheim et al., 1989a). In Study 2 we employed a between-subjects design, measured different constructs, and included a physical cue (described below) to distinguish between groups of items.

Method

Instruments and Design

For this study, we created two types of physical item groupings for six different instruments. Constructs measured by these instruments were similar enough that respondents could perceive them to be related in some way, so that in a randomly dispersed format there would be no contrast or “backfire” effects (Tourangeau & Rasinski, 1988, p. 300). The first instrument was a six-item measure of overall job satisfaction (JS) adopted from the Job Diagnostic Survey (Hackman & Oldham, 1975; e.g., “All in all, I like working on this job”). The second instrument was a seven-item measure of general affect toward one’s schedule (AS; e.g., “I am dissatisfied with my current work schedule”; Dunham & Pierce, 1986). The third instrument was a five-item measure of psycho-physiological adjustment to one’s work schedule (ADJ; e.g., “My body hasn’t had any trouble getting used to the current work schedule.”(3)) Items for the ADJ scale came from interviews and focus group meetings with the eventual subjects, who suggested that how, and how well, they adjusted to their stressful schedules (e.g., by developing and using specific coping mechanisms) were important variables to consider in an employee survey. Response formats for all of these measures were 5-point Likert scales. These three instruments had a different response scale from those described below, so they were physically separated (by survey pages). Roughly half of the items in each of the originally published instruments were positively worded and half were negatively worded. Rather than create potential context effects with some negative items embedded in a positive block, or vice versa, we evenly intermixed the positive and negative items in each of the conditions described below.

The other three instruments measured employee perceptions of how work schedules interfered with activities outside work. Each instrument was taken from Dunham and Pierce (1986). All items began with “How easy or difficult is it for you now to . . .” and had a response format ranging from 1 = “very easy” to 5 = “very difficult.” The constructs they measured included interference with friend or family activities (INT_FF; nine items, e.g.: “. . . maintain personal family relations?”), with services, events, and consumables (INT_SEC; seven items, e.g.: “. . . see a movie?”), and with personal business (INT_PB; four items, e.g.: “. . . go to the bank?”).

The grouping manipulation involved dispersing these items in different orders across physical “boxes” arranged on the questionnaire pages. Boxes were defined by .05-inch-wide, shaded borderlines. In the uniform grouping condition each box contained a homogeneous set of items, items that all measured the same construct. The first box contained six JS items, the second box contained seven AS items, the third box contained five ADJ items, and so on. In the random grouping condition, items were spread throughout the boxes in an indiscriminate order. For example, the first box still contained six items, but they were a mixed set of JS, AS, and ADJ questions. As a methodological safeguard, uniform and random groupings were counterbalanced. Respondents who received uniform groupings of JS, AS, and ADJ items on one page received random groupings of INT_FF, INT_SEC, and INT_PB items on the other page. Other respondents received the opposite pattern.

Subjects and Procedures

The two questionnaire forms were distributed randomly to line-level employees of a chemical processing plant. The instruments were part of an organizational survey of employee preferences about work schedules, as the firm considered a change from its existing 8-day, forward-rotating shifts. Each employee was given 30 minutes of paid work time to complete the survey at employee meetings. No one was rushed or urged to fill out the survey quickly. Participation was completely voluntary and anonymous.

Ninety-two percent of all line-level employees completed the survey: 200 employees returned form 1 (in which JS, AS, and ADJ items were in uniform groupings and items from INT scales were mixed randomly on the page); 192 returned form 2 (with the opposite grouping pattern). Mean tenure of the respondents was 7.9 years (SD = 5.4); mean age was 35.6 (SD = 9.1). Most of the participants were married (58%), with the mean number of children = 1.3 (SD = 1.2; median = 1, range = 0-6). Males made up 66% of the sample. For 72% of the participants, their jobs at the chemical plant were their family’s primary source of income. All employees held full-time positions.

Results

Hypothesis 2 predicts that physically grouping self-report items will increase psychometric estimates of their homogeneity and improve discriminant validities between instruments in separate groupings. Table 2 shows coefficient α estimates of homogeneity for each instrument, under uniform and random grouping conditions. Coefficient α was significantly higher for job satisfaction (F(199, 191) = 1.25, p < .05; from a test by Feldt, 1969), for INT_FF (F(191, 199) = 1.33, p < .05), and for INT_SEC (F(191, 199) = 1.57, p < .01). It was slightly higher in uniform grouping conditions for all six instruments, which is an improbable event (binomial test, p < .05, adjusted for correlations between instruments and validated via bootstrapped subsamples).(4)
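The alpha comparisons above rest on Feldt’s (1969) result that, under the null hypothesis of equal reliabilities, W = (1 − α₁)/(1 − α₂) follows an F distribution with (n₁ − 1, n₂ − 1) degrees of freedom. A minimal sketch of both the alpha computation and the Feldt test, using illustrative values rather than the study’s data:

```python
import numpy as np
from scipy.stats import f as f_dist

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def feldt_test(alpha1, n1, alpha2, n2):
    """Feldt (1969): W = (1 - alpha1)/(1 - alpha2) ~ F(n1 - 1, n2 - 1).
    One-tailed p for the alternative that alpha2 exceeds alpha1."""
    w = (1 - alpha1) / (1 - alpha2)
    return w, f_dist.sf(w, n1 - 1, n2 - 1)

# Illustrative alphas (not the study's): .875 on one 200-person form
# versus .900 on a 192-person form gives W = 1.25.
w, p = feldt_test(0.875, 200, 0.900, 192)
```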

We also tested the equality of item-item covariance (Σ) matrices between uniform and random grouping conditions (Box, 1949). As Table 2 shows, for five of six instruments these covariance matrices were not equal. This occurred because item-item covariances were slightly higher in the uniform grouping condition. The average item-item correlation (within an instrument) in the uniform grouping condition was about .05 higher than in the random grouping condition. Item variances did not differ significantly across the two conditions.
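Box’s (1949) test compares the log determinant of the pooled covariance matrix with the log determinants of the per-group matrices, with a standard χ² approximation. A sketch under illustrative inputs (not the study’s matrices):

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(covs, ns):
    """Box's M test for equality of covariance matrices across groups.
    covs: list of (p x p) sample covariance matrices; ns: group sizes."""
    g, p, N = len(covs), covs[0].shape[0], sum(ns)
    pooled = sum((n - 1) * S for S, n in zip(covs, ns)) / (N - g)
    logdet = lambda S: np.linalg.slogdet(S)[1]
    M = (N - g) * logdet(pooled) - sum((n - 1) * logdet(S)
                                       for S, n in zip(covs, ns))
    # Box's small-sample correction factor
    c = ((2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))) * (
        sum(1.0 / (n - 1) for n in ns) - 1.0 / (N - g))
    stat = M * (1 - c)
    df = p * (p + 1) * (g - 1) / 2.0
    return stat, chi2.sf(stat, df)
```

When the two group matrices are identical, M is zero and the test cannot reject, which is the sanity check used below.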

The predicted pattern of discriminant validities was also observed (see Table 2). The average correlation of each construct with the other constructs was always higher in the random grouping condition. Variance accounted for in the INT_FF, INT_SEC, and INT_PB measures was 7-12% greater in the random grouping condition than in the uniform grouping condition. Likewise, variance accounted for in the JS, AS, and ADJ measures was 7-18% greater in the random grouping condition.

Tests of the equality of construct-construct covariance matrices between uniform and random grouping conditions were conducted as a necessary precondition for the confirmatory factor analytic procedures discussed below (Idaszak & Drasgow, 1987). Both tests were significant. For the Σ matrices involving JS, AS, and ADJ: [Mathematical Expression Omitted]; for the Σ matrices involving INT_FF, INT_SEC, and INT_PB: [Mathematical Expression Omitted]. There were no significant differences across conditions in construct-level variances. However, when items were in uniform groupings, the correlation of one construct with another was always slightly lower than in the random grouping condition. The pattern is especially interesting given the correction-for-attenuation formula, which shows that relations between instruments should go up, rather than down, when each instrument’s reliability improves.
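The correction-for-attenuation formula referenced here implies that, for a fixed true-score correlation, the observed correlation is r_observed = r_true × √(r_xx · r_yy), so raising either reliability should raise the observed correlation. A quick illustration with hypothetical values:

```python
from math import sqrt

def observed_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the true-score correlation
    and the reliabilities of the two measures (classical test theory)."""
    return true_r * sqrt(rel_x * rel_y)

# With a fixed true correlation of .50, better reliabilities yield a
# *higher* observed correlation -- the opposite of the pattern found
# across the grouping conditions above.
low = observed_r(0.50, 0.70, 0.70)    # 0.35
high = observed_r(0.50, 0.80, 0.80)   # 0.40
```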

As a final, more powerful set of tests of Hypothesis 2, we completed a series of confirmatory factor analyses and path analyses, in which we progressively relaxed constraints on the equivalence of measurement and structural models across the two forms of the questionnaire (often called a SIFASP procedure – see Idaszak & Drasgow, 1987, for details). In all analyses, we used Σ matrices as input and estimated parameters via maximum likelihood. In any confirmatory factor analysis it is also necessary to set the scale of the latent variable, either by fixing the loading of one “marker” item for each factor to be a constant (usually 1.0), or by fixing the variance of a factor to be 1.0. As differences in item parameters across grouping conditions were of primary interest in this study, we set factor variances to 1.0. Results are in Table 3, including indices of the improvement in fit when we relaxed the constraints that fixed parameters to be equal on the two forms. Our interest was in testing the impact of the grouping manipulation, not in validating any particular model. Therefore, we focus below on the improvement in fit when between-condition parameters were allowed to vary.

For the JS, AS, and ADJ items, we first fit an overall measurement model in which each item loaded on its content factor. Factor loadings, unique variances, and factor correlations were fixed across the two grouping forms. We added a fourth factor to account for the negative wording of six items – two from each scale ([Mathematical Expression Omitted], p < .01; cf. Schriesheim and Eisenbach, in press). Next, we allowed the item parameters (18 loadings and 18 error variances) to differ across the two forms, which significantly improved the fit of the measurement model. Loadings were higher and error variances were lower when the items were in uniform groupings. We then freed the factor correlations across forms, which once again significantly improved model fit. Supporting the earlier analyses regarding discriminant validity, factor correlations were higher in the random item groupings. In fact, the estimated correlation between the ADJ and AS factors approached unity (r = .98), signifying that after one corrects for unreliability in the observed variables, these two factors were indistinguishable when their items were mixed together. Next, we created three method factors (“artifactors”: Idaszak & Drasgow, 1987), one for each of the item groupings on the random form. As expected, estimating the loadings for these factors on the random form significantly improved the fit of the model, and freeing a set of loadings for this “grouping” factor in the uniform condition did not.

We performed identical analyses for the INT_FF, INT_SEC, and INT_PB items. Table 3 shows that they yielded similar results. Item parameters and factor correlations differed significantly across grouping forms. Loadings were higher and error variances were lower in the uniform grouping condition than in the random grouping condition. Addition of three “artifactors” for random grouping also improved fit on the random form. These method factors did not significantly improve fit under the uniform condition.

[TABULAR DATA FOR TABLE 2 OMITTED]

[TABULAR DATA FOR TABLE 3 OMITTED]

Finally, we examined the invariance of a path model that linked total scores for each of the six constructs to one another. Attitude theory (e.g., Fishbein & Ajzen, 1975) and research on work scheduling (e.g., Dunham & Pierce, 1986) suggest a model in which the six constructs are arranged in three waves. In it, causal flows move from the most narrow perceptions (INT_FF, INT_SEC, and INT_PB) to intermediate affective constructs (three paths each to AS and ADJ, with a path linking the two) and finally to the global attitudinal outcome (two paths to JS). We first fit a model that constrained these nine paths and the error variances of the endogenous variables to be identical across grouping conditions. Freeing error variances across forms improved model fit: [Mathematical Expression Omitted]. Freeing paths across forms also improved fit: [Mathematical Expression Omitted].

Discussion

Summary

Results from Study 2 support Hypothesis 2 and converge with (the direction of) results in previous studies (Schriesheim, 1981a, 1981b; Schriesheim et al., 1989b). Physically grouping items on a questionnaire slightly improves internal consistency and discriminant validity, by sharpening the within-set commonalities and between-set distinctions that guide respondents to retrieve relevant caches of information. As with Study 1, the primary contribution of this study is its support for cognitive theories’ predictions about information retrieval processes in item-answering.

Some practical contributions are also worth noting. The results, in part, contradict the random dispersal rule of thumb, which is based on the idea that grouping creates demand characteristics that inflate estimated relations between constructs. We found the opposite pattern. In factor analytic tests, we also found that both measurement models and structural models differed across physical item grouping, in ways that were consistent with Hypothesis 2. That is, an important finding of this study is its demonstration that researchers would have come to different conclusions about the relations of items to underlying factors, correlations of factors with one another, and structural relations of constructs solely because the physical contexts they provided for their items differed.

Limitations and Research Directions

Despite the statistical significance of our findings, it is important to point out that the differences we observed across structural properties (item groupings) were not large. In most cases, differences in reliabilities were in the second decimal place. Although this is in some ways to be expected because of the non-linearity of the correlation metric, as a change in r from .80 to .89 is three times the effect size of a change from .30 to .39, it nevertheless limits the impact that grouping would have on the results of much organizational research. Indeed, one of our intentions in this study was to examine effects of grouping on instruments that already had strong measurement characteristics. Taken together, the results of Schriesheim et al. (1989a, 1989b) and the present study imply that the practical effect of grouping on psychometric quality is weaker when the initial psychometric quality of an instrument is stronger. A systematic test of this implication might compare grouping effects on a wide variety of instruments within a single investigation.
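The non-linearity noted above can be checked with Fisher’s r-to-z transformation, z = atanh(r), the usual effect-size metric for correlations:

```python
from math import atanh

# In the z metric, equal raw changes in r are worth far more near the
# top of the scale than in the middle of it.
high_gain = atanh(0.89) - atanh(0.80)   # change from .80 to .89
mid_gain = atanh(0.39) - atanh(0.30)    # change from .30 to .39
ratio = high_gain / mid_gain            # roughly 3, as the text notes
```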

Even with these limitations on the practical consequences of our findings, it seems reasonable to suggest that psychometric qualities of uniformly grouped and randomly mixed sets of items should be routinely compared during instrument development. The cost of creating and administering two survey forms is quite small, even when compared to the moderate psychometric benefits that are possible. A series of such tests by different researchers could help clarify the likelihood and magnitude of physical context effects with other combinations of scales (including combinations of constructs with differing degrees of relationship), and clarify the cognitive processes underlying them.

Finally, because we wanted to separate information retrieval from item interpretation processes, our manipulation of grouping cues also lacks some of the impact that might have been provided by another structural property of surveys: a “label” or title associated with each item grouping (Schriesheim et al., 1989b). Such labels are fairly common on organizational instruments (Cook et al., 1981). Future research could examine how labels might add to or interact with physical context effects, or activate respondents’ implicit theories about item content (Lord, Binning, Rush & Thomas, 1978).

Study 3: Effects of Response Context

Study 3 builds on earlier work on response range effects in several ways. First, we tested our hypothesis using a measure of internal states (specifically, somatic complaints) rather than measures of readily observable, unambiguous, and public behaviors that were used in previous response range studies (hours of TV watching and beer consumption at bars in Schwarz & Bienias, 1990, and hours of TV watching in Schwarz, Hippler, Deutsch & Strack, 1985). Second, for each (somatic complaint) item, we asked about the ease with which that information was retrieved from memory, to help identify which cognitive processing stage is affected by response scale range. Finally, we looked at the impact of response context on estimates of relations between constructs.

Method

Instruments and Design

We adapted commonly used self-report instruments that measure the frequency of health symptoms, somatic complaints, psychosomatic distress, and physiological strains during a given time period (Caplan, Cobb, French, Van Harrison & Pinneau, 1975; Goldberg & Hillier, 1979; Greller & Parsons, 1988; Landsbergis, 1988; Smith, Cohen & Stammerjohn, 1981; Steffy, Jones & One, 1990). Scores on these instruments often serve as the primary dependent variable in studies of work stress (e.g., Spector, Dwyer & Jex, 1988).

We took ten commonly used items from these measures (e.g., “. . . had trouble sleeping at night,”) to create a prototypical instrument that we refer to as the Health Symptoms scale. Instructions on our instrument told respondents to estimate “How often you’ve experienced each of the following conditions during the past month.” We also took what was typically a 3-point scale range (based on the “never” to “once or twice” to “three or more times” range in Caplan et al., 1975), and created low, benchmark (free response), and high scale range forms. We used six response options on the low range form: “0,” “1,” “2,” “3,” “4,” or “5 or more” times, because they include options typically found in such measures (e.g., Caplan et al., 1975; Patchen, 1970; Quinn & Staines, 1979). On the benchmark form we left a blank for respondents to fill in their own number. In the high range form, response options were: “4 or less,” “5,” “6,” “7,” “8,” and “9 or more” times. We chose these options for the high range form because we needed overlap in options with the low range form, to enable comparisons across forms.

Scale ranges can affect responses by affecting retrieval of information from memory or by affecting judgment-response mapping processes. To distinguish the effect as one involving judgment-response mapping, we measured respondents’ perceived difficulty in retrieving the frequencies of health symptoms from memory. For each Health Symptom item, subjects were asked to rate “How easy or difficult is it to remember?” the frequency of that symptom on a seven-point scale that ranged from 1 (easy) to 7 (difficult). Because we hypothesized this as a judgment-response mapping effect, we wanted to ensure that the impact of scale range on self-reported frequencies of health symptoms would be unrelated to respondents’ perceived difficulty in retrieving the information.

Subjects and Procedures

Three forms of a “Student Experience Survey” were randomly distributed to 187 undergraduates in organizational behavior courses at a large, southwestern university. Each of the three forms contained a version of the Health Symptoms instrument that used one of the three scale ranges described above. Fifty-eight, 58, and 59 students returned the low, benchmark, and high range forms, respectively (a 95% response rate). Participation partially fulfilled course requirements. Anonymity and confidentiality were assured. Subjects received a full, end-of-semester debriefing.

Of those who completed the survey, 23% listed themselves as the primary source of income for their family. Chiefly because of the university’s urban location, 76% of the students currently held full- or part-time, low-level jobs in a wide range of (mostly service) occupations. Fourteen percent were married, with a mean number of children = .22 (SD = .64; median = 0, range 0 – 5). Sixty-three percent were male. Mean age was 22.8 years (SD = 5.1).

Results

As an initial, omnibus test of Hypothesis 3, we created a total Health Symptoms score by first dichotomizing each item response. Estimates of “≥ 5 times” were assigned 1s; estimates of “≤ 4 times” were assigned 0s. This threshold of dichotomization was necessary because the options “≤ 4 times” and “≥ 5 times” were the only ones shared by all three scale ranges. By the same token, this dichotomization removes any artificial floor, ceiling, or range restriction effects because it uses only the response options common to all three forms. Health Symptoms total scores were computed by summing these dichotomous item scores across the 10 items. As Table 4 shows, there were strong differences in the total scores across the high- and low-range forms, gauged by a one-way ANOVA, F(2, 172) = 6.71 (p < .01; η² = .07). The total Health Symptoms mean on the high range form [Mathematical Expression Omitted] was nearly double that on the low range form [Mathematical Expression Omitted].
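The dichotomization step can be expressed directly: each item response is scored 1 if it falls at or above the shared “5 or more” boundary and 0 otherwise, and the 0/1 item scores are summed. A sketch with made-up response data:

```python
import numpy as np

def dichotomized_total(responses, cutoff=5):
    """responses: (n_respondents x n_items) array of frequency estimates.
    Scores >= cutoff as 1 and below as 0, then sums across items, so
    only response options shared by all scale-range forms contribute."""
    responses = np.asarray(responses)
    return (responses >= cutoff).astype(int).sum(axis=1)

totals = dichotomized_total([[6, 2, 5], [4, 4, 4], [9, 5, 1]])
# -> respondent totals 2, 0, 2
```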

Table 4 also shows a pattern in Health Symptom item responses that is strikingly consistent with Hypothesis 3 and previous research (Schwarz & Bienias, 1990). The proportion of students in the high-range condition who reported a stress symptom “≥ 5 times” was higher than the proportion of students in the low-range condition, on each item (binomial test for this completely transitive pattern, p < .01, adjusting for correlation among items in the same way as described in Note 4 for Study 2). That is, subjects receiving the high range form were more likely than those who received the low range form to give high frequency estimates (“≥ 5 times”).

Although we did not make this prediction, it is interesting to note that the items with significantly different response distributions were also those that asked about psychological ailments rather than physical aches and pains (“felt very low or emotionally depressed,” [Mathematical Expression Omitted], p < .01; “felt nervous, anxious or stressed out,” [Mathematical Expression Omitted], p < .01; and “had difficulty sleeping,” [Mathematical Expression Omitted], p < .05). These items may have been more ambiguous. “Difficulty in sleeping,” for example, can be regarded as extra time needed to fall asleep or as a completely sleepless night. Our post hoc explanation is that the ambiguity of such items may have made it more likely that subjects would use inadvertent information in the item’s scale range to convey a response. To distinguish this effect from one in which subjects had more difficulty interpreting, retrieving, or formulating a judgment for these items, we correlated the effect size for each item with subjects’ reported difficulty in retrieving information about it. The correlation was not significant, r = .06 (p > .10).(5)

To assess the impact of response scale ranges on estimates of relations between constructs, we compared correlations of Health Symptoms total scores with other measures across the high and low range forms. The survey measured five other constructs that might reasonably be associated with Health Symptoms: age, number of hours of paid employment per week, perceived control over class attendance, percent of university classes missed, and number of times that the respondent had missed their organizational behavior (OB) class. Correlations between the high and low range forms were significantly different (p < .05) for number of hours worked (r = -.04 on the low range form and r = .40 on the high range form; z-test of the difference in Fisher-transformed r’s = 2.85), percent of university classes missed (r’s = -.19 and .15 for the low and high range forms, respectively; z = 2.54), and number of times the OB class was missed (r’s = -.09 and .39 for the low and high range forms; z = 2.39). Interestingly, these differences were not just in magnitude, but in sign. The correlations were greater and positive in the high-range condition for all three significant comparisons, which might stem from the greater total score variance (for the sum of dichotomized items) in that condition.

[TABULAR DATA FOR TABLE 4 OMITTED]
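The between-form comparisons use the standard two-sample test on Fisher-transformed correlations, z = (atanh r₁ − atanh r₂)/√(1/(n₁ − 3) + 1/(n₂ − 3)). In the sketch below the two forms’ overall return counts (59 and 58) are used as illustrative group sizes, so the statistic differs somewhat from the values reported in the text:

```python
from math import atanh, sqrt
from scipy.stats import norm

def fisher_z_diff(r1, n1, r2, n2):
    """Two-sample z test for the difference between two independent
    correlations, via Fisher's r-to-z transformation."""
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return z, 2 * norm.sf(abs(z))   # two-tailed p

# Hours worked vs. Health Symptoms, with assumed group sizes of
# 59 (high range form) and 58 (low range form):
z, p = fisher_z_diff(0.40, 59, -0.04, 58)
```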

Recall that the Health Symptoms total score was computed by summing dichotomized item scores. The dichotomization was necessary to make scores comparable across the high and low scale range conditions, as only two options (≤ 4 times and ≥ 5 times) were common to both the high and low scale ranges. Because such dichotomization is atypical in management research, we also computed and compared (across conditions) the correlations of Health Symptoms with other constructs using total scores from raw data (keeping in mind that such correlations may be differentially affected by the floor, ceiling, and range restriction effects mentioned earlier). Correlations between the high and low range forms were significantly different (p < .05) for percent of university classes missed (r’s = -.15 and .23 for the low and high range forms, respectively; z = 2.86), and number of times the OB class was missed (r’s = -.10 and .40 for the low and high range forms; z = 3.89), but not for any of the other variables.

Discussion

Summary

Study 3 supports Hypothesis 3, in showing that response context (scale range) can change how people answer items on self-report surveys. The strong transitivity of response percentages in Table 4 shows that respondents probably used inadvertent information in an instrument’s scale range to map their own internal judgments (which may have been “fuzzy” intervals, rather than points) into responses. This is consistent with other studies of response context effects (e.g., Schwarz & Bienias, 1990). The main contribution of these results, along with Studies 1 and 2, is that they support the viability of applying theories about cognitive processing mechanisms to the measurement domain.

In addition to extracting information from subjects, our results suggest that instruments themselves can inject noise into the measurement process by cuing subjects to adjust their responses toward a scale midpoint. Therefore, a practical implication of Study 3 is that it may be unwise to routinely adopt the rule of thumb mandating standardized response categories, especially for frequencies. It seems clearly so for estimates of the frequency of stress or health symptoms, especially given the effects on correlations of Health Symptoms with other constructs. It may be more advisable to use open-ended, free response formats when respondents are asked to provide frequency estimates; responses can be standardized after the data are collected.

Limitations and Research Directions

Our high range scale was a fairly extreme manipulation. It therefore lacks external validity with respect to most organizational instruments. The means and within-condition correlations of Health Symptoms with other constructs may be unlikely to generalize to other settings. On the other hand, the low range scale is more typical of instruments in use (e.g., similar, low-range scales are used in Caplan et al., 1975; Patchen, 1970; Quinn & Staines, 1979). Of particular interest in Table 4 is the consistently lower percentage of high frequency responses given by subjects in the low range condition compared to those who filled in a blank. That is, it is difficult to argue that the free responses are biased upward, rather than the constrained responses being biased downward, especially considering that the reported health symptoms are all rather negative and personal.

Differences that these response options generate might also depend on the nature of the construct being measured. Our results hinted that response context may have a stronger effect on more ambiguous or covert internal states. The frequency distributions that respondents infer from response options, and their unintended effects may depend on the extent to which the construct is familiar, salient, observed by others, or socially desirable. Differences in response context effects across types of events (e.g., perceptions of incidents that may or may not be sexual harassment), behaviors (e.g., extra-role behaviors), affective states (e.g., positive and negative affectivity), and cognitions (e.g., frequency of thoughts of quitting) should be evaluated in future research.

Finally, the design of Study 3, as well as Studies 1 and 2, focused on strictly psychometric or response-based outcomes of differences in structural properties. Such designs allow only indirect, somewhat tentative inferences about respondents’ cognitive processes. Response latencies might provide more clues about processing mechanisms. Recording concurrent, “think-aloud” verbal protocols from subjects as they answer items on organizational surveys might further illuminate item-answering processes (Ericsson & Simon, 1984).

Conclusions

The findings of these three studies provide support for our hypotheses about the links between structural properties and psychometric qualities of self-report instruments, via intervening cognitive processes. We hope that the results will not only further our understanding of self-report measurement, but also help to improve it. A clear theme from our results is that respondents will seize and use irrelevant but easily accessible information that might ease the dual burdens of time spent and cognitive effort required to complete an instrument. Given a fixed amount of time, perhaps investigators should ask fewer questions, and try to elicit more deliberate and thoughtful responses to each one. In any case, researchers should continually try to improve both the validity and utility of organizational measures. Cognitive theories can be important tools in that effort.

Acknowledgement: An earlier version of this paper was presented at the 1992 national Academy of Management meetings in Las Vegas, Nevada, where it won the Best Paper award from the Research Methods Division. We are grateful to Devakumar Doraisamy, Doug Jantzen, Tom Lumpkin, Bob Mayer, Lynn Nguyen, Margaret Shaffer and Cheryl Surber for their assistance in data collection and preparation. We also appreciate the comments and helpful suggestions of Chet Schriesheim, Jerry Wofford, and several anonymous reviewers.

Notes

1. In an unpublished study with over 1500 respondents, in which the serial position and evaluative context were manipulated in the JDI Work Itself scale, serial position had no effect, although there were item context effects.

2. If responses to the neutral item are consistent with responses in its evaluative context, respondents with high JA should respond to each item in the positive block (including the embedded item) by reporting high frequencies. Respondents with low total JA should respond to items in the positive block (including the embedded item) by reporting low frequencies. Respondents with high JA and respondents with low JA would tend to “cancel out” each other’s response to the embedded item.

3. The four remaining ADJ items were: “It’s hard to mentally adjust when we rotate to a new shift,” “I’ve gotten pretty accustomed to working on the current schedule,” “Sometimes, personal problems come up for me trying to work around my schedule,” and, “I’ve been able to find ways to adjust to our current schedule.”

4. The binomial test assumes six independent “trials,” but the six measures in this study were not independent – they were correlated. Degrees of freedom for the binomial test were therefore adjusted to account for this dependence. We operationalized dependence as proportions of shared variance; independence was a proportion of unique variance. That is, the degrees of freedom were equal to one (for the first measure) plus the sum of the uniquenesses of the other variables. In that way, any unique variance that a measure had (variance that was not shared by the other variables) added to the degrees of freedom. This is a conservative test because the uniqueness of a variable is one minus the squared multiple correlation of the variable with all the other variables, and the single sample estimate of the true squared multiple correlation is biased upward (Bobko, 1995).
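One way to sketch the adjustment described in this note: recover each measure’s squared multiple correlation from the inverse of the correlation matrix (SMC_i = 1 − 1/[R⁻¹]_ii), take uniqueness = 1 − SMC, and use 1 plus the summed uniquenesses of the remaining measures as the effective number of independent trials. Under this reading (an assumption on our part about the exact operationalization), the probability that all trials fall on the predicted side is .5 raised to that effective count:

```python
import numpy as np

def effective_binomial_p(R):
    """Binomial-style p for 'all k measures shifted the same direction',
    discounting the trial count for dependence among the measures.
    R: (k x k) correlation matrix among the measures."""
    Rinv = np.linalg.inv(np.asarray(R, dtype=float))
    smc = 1.0 - 1.0 / np.diag(Rinv)       # squared multiple correlations
    uniqueness = 1.0 - smc
    n_eff = 1.0 + uniqueness[1:].sum()    # first measure counts as one trial
    return 0.5 ** n_eff

# With six uncorrelated measures this reduces to the ordinary
# six-trial binomial probability, .5**6.
p_indep = effective_binomial_p(np.eye(6))
```

As the note observes, any shared variance among the measures shrinks the effective trial count below six, making the resulting p value larger and the test more conservative.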

We also compared this test to results we would have obtained by pooling all the data and repeatedly splitting the pooled sample into two random subsamples. These random subsamples contained a mix of those who had the two questionnaire forms. When we repeatedly compared coefficient α’s across subsamples using the modified binomial test, we formed a “bootstrapped” sampling distribution of the test. The tails of this sampling distribution can serve as rejection regions. Our original comparison of α’s in the uniform versus random grouping conditions produced a test value that fell in the rejection region.

5. The low correlation between effect size and subjects’ difficulty in retrieving information was not due solely to the unreliability of the latter measure. Because it is impossible to estimate the reliabilities of the single-item measures, we used corrected item-total correlations (between each item and the sum of the remaining items) as evidence of accountable variance. The corrected item-total correlations ranged from .26 to .56, indicating that at least some of the variance in the items was non-error variance.
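The corrected item-total correlation referred to here correlates each item with the sum of the remaining items, so an item does not correlate with itself through the total. A brief sketch:

```python
import numpy as np

def corrected_item_totals(items):
    """Correlation of each item with the total of the *other* items.
    items: (n_respondents x k_items) array."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])
```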

References

Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50: 179-211.

Ajzen, I. & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice-Hall.

Bass, B.M., Cascio, W.F. & O’Connor, E.J. (1974). Magnitude estimations of frequency and amount. Journal of Applied Psychology, 59: 313-320.

Bobko, P. (1995). Correlation and regression: Principles and applications for industrial/organizational psychology and management. New York: McGraw-Hill.

Box, G.E.P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika, 36: 317-346.

Brief, A.P., Burke, M.J., George, J.M., Robinson, B.S. & Webster, J. (1988). Should negative affectivity remain an unmeasured variable in the study of job stress? Journal of Applied Psychology, 73: 193-198.

Budd, R.J. (1987). Response bias and the theory of reasoned action. Social Cognition, 5: 95-107.

Caplan, R.D., Cobb, S., French, J.R.P., Jr., Van Harrison, R. & Pinneau, S.R., Jr. (1975). Job demands and worker health. Cincinnati, OH: NIOSH Publication: 75-168.

Carroll, J.S. & Johnson, E.J. (1990). Decision research: A field guide. Newbury Park, CA: Sage.

Cook, J.D., Hepworth, S.J., Wall, T.D., & Warr, P.B. (1981). The experience of work: A compendium and review of 249 measures and their use. London: Academic Press.

Dunham, R.B. & Pierce, J.L. (1986). Attitudes toward work schedules: Construct definition, instrument development, and validation. Academy of Management Journal, 29: 170-182.

Eder, R.W. & Fedor, D.B. (1989). Priming performance self-evaluations: Moderating effects of rating purpose and judgment confidence. Organizational Behavior and Human Decision Processes, 44: 474-493.

Ericsson, K.A. & Simon, H.A. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Feldman, J.M. & Lynch, J.G., Jr. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73: 421-435.

Feldt, L.S. (1969). A test of the hypothesis that Cronbach’s alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34: 363-373.

Fishbein, M. & Ajzen, I. (1975). Belief, attitude, intention, and behavior: An introduction to theory and research. Reading, MA: Addison-Wesley.

Goldberg, D.P. & Hillier, V.F. (1979). A scaled version of the General Health Questionnaire. Psychological Medicine, 9: 139-145.

Greenberg, J. (1990). Organizational justice: Yesterday, today, and tomorrow. Journal of Management, 16: 399-432.

Greller, M.M. & Parsons, C.K. (1988). Psychosomatic complaints scale of stress: Development and psychometric properties. Educational and Psychological Measurement, 48: 1051-1065.

Hackman, J.R. & Oldham, G.R. (1975). Development of the Job Diagnostic Survey. Journal of Applied Psychology, 60: 159-170.

Harrison, D.A. & McLaughlin, M.E. (1993). Cognitive processes in self-report responses: Tests of item context effects in work attitude measures. Journal of Applied Psychology, 78: 129-140.

Heberlein, T. & Baumgartner, R. (1978). Factors affecting response rates to mailed questionnaires: A quantitative analysis of the published literature. American Sociological Review, 43: 447-462.

Hippler, H.J., Schwarz, N. & Sudman, S. (1987). Social information processing and survey methodology. New York: Springer-Verlag.

House, R.J. & Rizzo, J.R. (1972). Role conflict and ambiguity as critical variables in a model of organizational behavior. Organizational Behavior and Human Performance, 7: 467-505.

Howat, G. & London, M. (1980). Attributions of conflict management strategies in supervisor-subordinate dyads. Journal of Applied Psychology, 65: 172-175.

Hulin, C.L., Drasgow, F. & Parsons, C.K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Irwin.

Idaszak, J.R. & Drasgow, F. (1987). A revision of the job diagnostic survey: Elimination of a measurement artifact. Journal of Applied Psychology, 72: 69-74.

Johns, G. (1994). How often were you absent? A review of the use of self-reported absence data. Journal of Applied Psychology, 79: 574-591.

Kahn, R.L., Wolfe, D.M., Quinn, R.P., Snoek, J.D. & Rosenthal, R. (1964). Organizational stress: Studies in role conflict and ambiguity. New York: Wiley.

Kunin, T. (1955). The construction of a new type of attitude measure. Personnel Psychology, 8: 65-78.

Landsbergis, P.A. (1988). Occupational stress among health care workers: A test of the job demands-control model. Journal of Organizational Behavior, 9: 217-239.

Lord, F.M. & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lord, R.G., Binning, J.F., Rush, M.C. & Thomas, J.C. (1978). The effect of performance cues and leader behavior on questionnaire ratings of leadership behavior. Organizational Behavior and Human Performance, 21: 27-39.

McLaughlin, M.E. & Harrison, D.A. (1990). Item arrangement influences measurement properties of an attitude scale: Support for cognitive hypotheses. Paper presented at the annual meeting of the American Psychological Society, Dallas, Texas.

Miles, R. & Snow, C. (1978). Organizational strategy, structure, and process. New York: McGraw-Hill.

Miller, D.C. (1991). Handbook of research design and social measurement, 5th ed. Newbury Park, CA: Sage.

Murphy, K.R., Herr, B.M., Lockhart, M.C. & Maguire, E. (1986). Evaluating the performance of paper people. Journal of Applied Psychology, 71: 654-661.

Osgood, C.E., Suci, G.J. & Tannenbaum, P.H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.

Ostrom, T. (1970). Perspective as a determinant of attitude change. Journal of Experimental Social Psychology, 6: 280-292.

Patchen, M. (1970). Participation, achievement, and involvement on the job. Englewood Cliffs, NJ: Prentice-Hall.

Podsakoff, P.M. & Organ, D.W. (1986). Self-reports in organization research: Problems and prospects. Journal of Management, 12: 531-544.

Porac, J.F. (1987). The job satisfaction questionnaire as a cognitive event: First- and second-order processes in affective commentary. Pp. 51-102 in K. Rowland & G. Ferris (Eds.), Research in personnel and human resources management, Vol. 5. Greenwich, CT: JAI Press.

Quinn, R.P. & Staines, G.L. (1979). The 1977 Quality of Employment Survey. Ann Arbor, MI: Institute for Social Research, University of Michigan.

Schmidt, F.L. (1973). Implications of a measurement problem for expectancy theory research. Organizational Behavior and Human Performance, 10: 243-257.

Schriesheim, C.A. (1981a). The effect of grouping or randomizing items on leniency response bias. Educational and Psychological Measurement, 41: 401-411.

Schriesheim, C.A. (1981b). Leniency effects on convergent and discriminant validity for grouped questionnaire items: A further investigation. Educational and Psychological Measurement, 41: 1093-1101.

Schriesheim, C.A. & DeNisi, A.S. (1980). Item presentation as an influence on questionnaire validity: A field experiment. Educational and Psychological Measurement, 40: 175-182.

Schriesheim, C.A. & Eisenbach, R.J. (in press). Item wording effects on factor-analytic results: An experimental investigation. Journal of Management.

Schriesheim, C.A., Kopelman, R.E. & Solomon, E.S. (1989a). The effect of grouped versus randomized questionnaire format on scale reliability and validity: A three-study investigation. Educational and Psychological Measurement, 49: 487-508.

Schriesheim, C.A., Solomon, E.S. & Kopelman, R.E. (1989b). Grouped versus randomized format: An investigation of scale convergent and discriminant validity using LISREL confirmatory factor analysis. Applied Psychological Measurement, 13: 19-32.

Schwarz, N. & Bienias, J. (1990). What mediates the impact of response alternatives on frequency reports of mundane behaviors? Applied Cognitive Psychology, 4: 61-72.

Schwarz, N., Hippler, J., Deutsch, B. & Strack, F. (1985). Response scales: Effects of category range on reported behavior and comparative judgments. Public Opinion Quarterly, 49: 388-395.

Smith, M.J., Cohen, B.G. & Stammerjohn, L.W., Jr. (1981). An investigation of health complaints and job stress in video display operations. Human Factors, 23: 387-400.

Smith, P.C., Kendall, L.M. & Hulin, C.L. (1969). The measurement of satisfaction in work and retirement. Chicago: Rand-McNally.

Spector, P.E., Dwyer, D.J. & Jex, S.J. (1988). Relation of job stressors to affective, health, and performance outcomes: A comparison of multiple data sources. Journal of Applied Psychology, 73: 11-19.

Steel, R.P. & Ovalle, N.K. (1984). A review and meta-analysis of research on the relationship between behavioral intentions and employee turnover. Journal of Applied Psychology, 69: 673-686.

Steers, R.M. & Braunstein, D.N. (1976). A behaviorally-based measure of manifest needs in work settings. Journal of Vocational Behavior, 9: 251-266.

Steffy, B.D., Jones, J.W. & Noe, A.W. (1990). The impact of health habits and life-style on the stressor-strain relationship: An evaluation of three industries. Journal of Occupational Psychology, 63: 217-229.

Stogdill, R.M. (1963). Manual for the leader behavior description questionnaire – form XII: An experimental revision. Columbus, OH: Bureau of Business Research.

Stone, E.F. & Gueutal, H.G. (1984). On the premature death of need-satisfaction models: An investigation of Salancik and Pfeffer’s view on priming and consistency artifact. Journal of Management, 10: 237-249.

Sudman, S. & Bradburn, N.M. (1982). Asking questions: A practical guide to questionnaire design. San Francisco: Jossey-Bass.

Super, D.E. (1970). Work values inventory. Boston: Houghton Mifflin.

Sutton, R.I. & Rousseau, D.M. (1979). Structure, technology, and dependence on a parent organization: Organizational and environmental correlates of individual responses. Journal of Applied Psychology, 64: 675-687.

Tourangeau, R. & Rasinski, K.A. (1988). Cognitive processes underlying context effects in attitude measurement. Psychological Bulletin, 103: 299-314.

Upshaw, H. (1969). The personal reference scale: An approach to social judgment. Pp. 315-371 in L. Berkowitz (Ed.), Advances in experimental social psychology, Vol. 4. San Diego: Academic Press.

Watson, D. & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98: 219-235.

Wollack, S., Goodale, J.G., Wijting, J.P. & Smith, P.C. (1971). Development of the Survey of Work Values. Journal of Applied Psychology, 55: 331-338.

COPYRIGHT 1996 JAI Press, Inc.