Identifying Growth Indicators for Low-Achieving Students in Middle School Mathematics

Anne Foegen

The purpose of this study was to explore the technical adequacy of potential indicators of growth in mathematics at the middle school level. Correlation and regression analyses were used to investigate the reliability and criterion validity of four measures in mathematics. One hundred students from a racially and socioeconomically diverse urban middle school completed 4 measures (1 involving basic facts and 3 involving estimation) twice in a 1-week period. Criterion measures included grades in school, standardized test scores, and teacher ratings. Results indicated the measures are reliable and possess acceptable levels of criterion validity; thus, they may prove to be useful as indicators of mathematics proficiency for middle school students.

Attention to outcomes produced by the public education system is increasing. Improved student achievement has served as a focal point for the national educational goals outlined in the Goals 2000: Educate America Act (20 U.S.C. § 5801 et seq.; Hoff, 1998) and its related appropriations (Sack, 1998), as well as for efforts to institute national testing programs (Lawton, 1997). The demand for increased accountability and improved outcomes has not been confined to general education. Special educators also find themselves urged to demonstrate the effects of their programs and practices (Ysseldyke, Thurlow, & Shriner, 1992).

In the field of mathematics education, concerns about curricula, instructional practices, and levels of student achievement led to the publication of standards for curriculum and evaluation, teaching, and assessment by the National Council of Teachers of Mathematics (NCTM; 1989, 1991, 1995). These documents outline a vision for mathematics education that moves away from the computation-laden curriculum of the past and toward a challenging, concept-driven curriculum that empowers students to solve problems and reason logically. The standards articulate five general goals for all students (NCTM, 1989, p. 5):

1. they learn to value mathematics,

2. they become confident in their ability to do mathematics,

3. they become mathematical problem solvers,

4. they learn to communicate mathematically, and

5. they learn to reason mathematically.

Although the standards themselves have not been accepted without criticism (Carnine, 1992; Hofmeister, 1993; Shriner, Kim, Thurlow, & Ysseldyke, 1992, 1993), the difficulties with mathematics instruction that spurred their development are rarely questioned.

Efforts to revise mathematics instruction have been accompanied by a push toward authentic assessment strategies involving open-ended tasks, checklists, interviews, extended investigations, and portfolios (Cain & Kenney, 1992; Clarke, 1992; NCTM, 1995). Spurred in part by dissatisfaction with multiple-choice standardized tests that assess computation and isolated skills, proponents of reform advocate procedures that are based on student performance, more closely tied to the curriculum, representative of realistic mathematical tasks, and more useful to teachers for improving instruction (Kulm, 1994; Lesh & Lamon, 1992; Romberg, 1992). Although descriptive writings outlining proposals for authentic assessment are abundant, few proposals have been operationalized into assessment procedures with reported levels of reliability and validity. Furthermore, studies of the technical adequacy of performance-based assessments have illustrated difficulties in developing reliable and valid procedures (Baxter, Shavelson, Herman, Brown, & Valadez, 1993).

General outcome measurement (GOM; Fuchs & Deno, 1991) has been developed in the field of special education as a means of monitoring the progress of individual students. As articulated by Fuchs and Deno, the two most salient features of GOM are “(a) the assessment of proficiency on the global outcomes toward which the entire curriculum is directed, and (b) the reliance on a standardized, prescriptive measurement methodology that produces critical indicators of student performance” (p. 493). Two approaches have characterized efforts to develop general outcome measures. The first relies on systematic sampling of the annual curriculum to generate representative probes. In this approach, the broad, general outcome of the curriculum is demonstrated through mastery of the multiple “pieces” that constitute the year’s curriculum. The second approach involves identifying a task representing year-end difficulty that requires students to integrate and apply the skills they are learning in the annual curriculum in a complex performance that serves as an indicator of growth across the year. The probe in this case represents the general outcome that the curriculum aims to achieve, rather than a microversion of the skills and content comprising the curriculum.

Curriculum-based measurement (CBM; Deno, 1985) is one form of GOM that has an extensive research base (Shinn, 1989, 1998). By collecting data from brief samples of student work in the curriculum, teachers can examine the progress of individual students and make timely changes in instruction that enhance student progress in reading (L. S. Fuchs, Fuchs, Hamlett, & Ferguson, 1992), mathematics (L. S. Fuchs, Fuchs, Hamlett, & Stecker, 1991), written expression (Tindal & Parker, 1989a), and spelling (L. S. Fuchs, Fuchs, Hamlett, & Allinder, 1991). With regard to mathematics, researchers have applied the first GOM approach (curriculum sampling) to develop probes through the sixth-grade level that reflect aspects of the mathematics curriculum, including computation (L. S. Fuchs, Hamlett, & Fuchs, 1998; Marston, 1989) and problem solving (L. S. Fuchs et al., 1994). Studies examining the technical adequacy of these measures have revealed reliability coefficients ranging from .48 to .93 and criterion validity coefficients ranging from .52 to .93 (L. S. Fuchs et al., 1994; L. S. Fuchs et al., 1998; Skiba, Magnusson, Marston, & Erickson, as cited in Marston, 1989). Criterion measures have typically included student performance on standardized, norm-referenced tests and mastery tests, grades, and teacher judgments. Recent research involving the technical adequacy of CBM has expanded to include student performance on statewide tests as a criterion measure (Deno, Fuchs, & Marston, 1997). Tindal (in press) reported correlations between second- and third-grade CBM mathematics measures and a state math assessment ranging from .40 to .55.

Although empirical support for CBM at the elementary level is extensive across multiple content areas, only a small number of studies have examined extensions of GOM beyond the elementary level. Researchers investigating applications of GOM at the secondary level have explored means of evaluating student performance in basic skills and content area learning (Espin & Tindal, 1998). The measures developed for indexing students' academic success have included reading aloud and classroom study tasks (Espin, 1990; Espin & Deno, 1993); oral reading, maze, and vocabulary (Espin & Foegen, 1996); written retells (Tindal & Parker, 1989b); and essays, chapter tests, and perception probes (Nolet & Tindal, 1994; Tindal & Nolet, 1995). No GOM studies have been conducted at the secondary level in mathematics. This gap is likely due, in part, to the diverse curricula and the general lack of agreement regarding outcomes in mathematics (Hartocollis, 2000; "Math Wars," 2000; NCTM, 1989).

When the approaches to developing general outcome measures for advanced levels of the curriculum are examined, it becomes clear that the range or character of the behavior measured varies considerably across domains. For example, the essays explored by Nolet and Tindal (1994) require students to use writing skills to summarize and discuss conceptual content in science. This broad, complex task is then scored in various ways to appraise growth in knowledge in the curriculum. In contrast, Espin and Foegen (1996) examined growth in content knowledge through a much simpler task: matching vocabulary words with their definitions. As efforts have extended the basic measurement approach used in CBM to more complex curriculum content, the question of what to measure has also become increasingly complex. This increased complexity is well represented in the explorations of Fuchs and her associates as they sought to develop mathematics measures for elementary students (L. S. Fuchs, Fuchs, Hamlett, & Stecker, 1991; L. S. Fuchs et al., 1994; L. S. Fuchs & Fuchs, 1996).

The purpose of the study was to explore the possibility of identifying indicators of growth in middle school mathematics that could be used in the formative evaluation of instruction in much the same fashion as GOM and CBM. The basic strategy used was similar to that originally employed by Deno and others (Deno, Marston, & Mirkin, 1982; Deno, Mirkin, & Chiang, 1982). That strategy used an empirical approach to identify a simple, economical, efficient, and technically adequate performance measure that teachers could use repeatedly to measure student growth (Deno, 1985). Rather than applying the curriculum-sampling approach frequently used in CBM, we sought to identify tasks that would reflect global proficiency in mathematics.

We examined two types of measures to explore their potential as growth indicators: students’ fluency with estimation and with basic facts. We selected estimation as the primary focus of the study because it is representative of the type of mathematical skill that is widely applied by adults in daily living situations and thus is likely to represent a general outcome of many middle school mathematics curricula. Reys (1992) noted that “estimation is a basic skill, and its growing importance in a technological society is recognized. It is used much more than exact computation” (p. 281). In fact, Reys suggested that more than 80% of all mathematical applications use estimation rather than exact computation. We selected fluency with facts because research in cognitive science supports the importance of a certain level of automaticity in basic skills to attaining higher levels of performance (Gagne, 1983; Glaser, 1987; Resnick, 1989). Studies in reading (LaBerge & Samuels, 1974) have suggested that automatic decoding skills allow students to focus increased attention on the task of comprehension. Similarly, in mathematics, the “ability to succeed in higher-order skills appears to be directly related to the efficiency at which lower-order processes are executed” (Hasselbring, Goin, & Bransford, 1988, p. 1). Applying the information-processing approach drawn from cognitive psychology, we hypothesized that students who fluently computed basic facts would be more likely to have greater information-processing capacity available for addressing more complex problem-solving tasks.

Four alternative measures were examined in this study: a basic facts task and three forms of an estimation task that included both computation and word problem solving. The primary question was whether any of the measures possessed sufficient reliability and criterion validity to be a candidate for use in repeated classroom-based growth assessment.

Method

Participants and Setting

Participants for the study were students from four sections of middle school mathematics classes, all taught by the same teacher. The school in which the study took place was located in an urban district in a large metropolitan area. The school’s population, which numbered approximately 1,000 students, was diverse in racial and ethnic composition. Data collected from the district’s records were used to index the socioeconomic status of the school and revealed that approximately three quarters of the students qualified for a free or reduced lunch program. Of the total school population (not including a districtwide setting for students with severe behavioral disorders housed in the building), 9.9% of the students received special education services.

Of the 128 students in four sections who were eligible to participate, the 100 students for whom permission was obtained made up the sample for this study. The sample included 48 males and 52 females. Forty-two percent of the students were of European American descent, 39% were African American, 12% were Asian American, 4% were American Indian, and 3% were Hispanic American. The mean age of the participants was 13.52 years, with a range of 11.7 to 15.4 years. The sample included primarily seventh- and eighth-grade students, with 9, 49, and 41 students in the sixth, seventh, and eighth grades, respectively. Twelve students were receiving special education services. These students were identified as having learning disabilities or behavioral disorders using state guidelines.

The four classes taking part in the study included one algebra class, one pre-algebra class, and two general math classes. The teacher and his classes were participating in a larger research project (Robinson & Deno, 1992) exploring the use of a group communication technology to support students with mild disabilities in general education classrooms. The technology system (Discourse®, 1990) had been in place in the teacher's classroom since the beginning of the school year and was used daily for instructional activities. Although the data for this research were collected through the technology system, the system itself was not a focus of the study. All of the students had used the technology extensively and were familiar with its operation.

Materials and Measures

Measures. The study examined 4 measures. Design specifications for each of the measures are described in the following sections; critical features of each task and sample problems are presented in Table 1. A single measure was used to quantify students' fluency with basic facts, and three different measures were used to examine students' estimation skills. The brief measures, lasting 1 minute (basic facts) or 3 minutes (estimation), provided an index of students' proficiency in basic facts and estimation. The fixed time intervals were selected to ensure comparability of the data across testing sessions and to promote efficiency in data collection across repeated measurements.

TABLE 1. Critical Features of Experimental Tasks

Task                   Time        Number of   Response format          Forms completed
                                   problems                             per session
Facts task
  BMOT                 1 minute    80          numerical                2
Estimation tasks
  BET                  3 minutes   40          single letter (A, B, C)  1
  MET-A(a)             3 minutes   40          single letter (A, B, C)  2
  MET-B(a)             3 minutes   40          single letter (A, B, C)  2

Sample problems
Facts task
  BMOT                 35 / 5 =      3 - 0 =      6 + 8 =      2 x 7 =
Estimation tasks
  BET                  Computation: 921 - 480 =    A. 2    B. 48    C. 441
                       Word problem: Each month I earn $37. How much will I earn in 6 months?    A. $3    B. $76    C. $222
  MET-A(a)             Computation: 7.9 x 9 =    A. 16.6    B. 71.1    C. 711
                       Word problem: The jacket costs $95. It is on sale for 30% off. About how much will you save?    A. $28    B. $125    C. $285
  MET-B(a)             Computation: 7.9 x 9 is about    A. .7    B. 7    C. 70
                       Word problem: The jacket costs $95. It is on sale for 30% off. About how much will you save?    A. $3    B. $30    C. $300

Note. BMOT = basic math operations task; BET = basic estimation task; MET = modified estimation task.
(a) The MET version (A or B) was randomly assigned by class; students completed either two forms of MET-A or two forms of MET-B.

Basic Math Operations Task (BMOT). The BMOT was designed to index students’ accuracy and fluency in mental computation of whole-number facts (0-9) in each of the four operations. Two parallel forms of the BMOT were developed for use in the study by randomly selecting 20 single-digit combinations for each of the operations of addition, subtraction, multiplication, and division. The 80 items were randomly ordered for placement on the probe. Students had 1 minute to respond to each BMOT. The probes utilized a constructed response format in which students were presented with the problems and responded by typing numerical answers.
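The BMOT item-sampling procedure can be illustrated with a short Python sketch. It assumes that "single-digit combinations" means operands 0 through 9 for addition and multiplication, with subtraction and division facts formed as their inverses (so 35 / 5 arises from 5 x 7, and no division has a 0 divisor). The function names and seeds are hypothetical and are not part of the original materials.

```python
import random

OPERATIONS = ("+", "-", "x", "/")

def fact_pool(op):
    """List every fact for one operation as (left, op, right, answer).

    Assumption: addition and multiplication use operands 0-9; subtraction and
    division are built as inverses of those facts (division skips a 0 divisor).
    """
    pool = []
    for a in range(10):
        for b in range(10):
            if op == "+":
                pool.append((a, op, b, a + b))
            elif op == "-":
                pool.append((a + b, op, b, a))
            elif op == "x":
                pool.append((a, op, b, a * b))
            elif op == "/" and b != 0:
                pool.append((a * b, op, b, a))
    return pool

def make_bmot_form(seed, per_operation=20):
    """Randomly select 20 facts per operation, then randomly order all 80 items."""
    rng = random.Random(seed)
    items = []
    for op in OPERATIONS:
        items.extend(rng.sample(fact_pool(op), per_operation))
    rng.shuffle(items)
    return items

# Two parallel forms drawn from the same pools with different (hypothetical) seeds.
form_1 = make_bmot_form(seed=1)
form_2 = make_bmot_form(seed=2)
```

Fixing the seed makes each parallel form reproducible, which matters when the same form must be readministered.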

Basic Estimation Task (BET). The basic estimation task (BET) was composed of 40 problems, 10 from each operation. A table of specifications used in constructing each probe is presented in Table 2. All 40 problems involved whole numbers. The "basic" problems were comparable to the single-digit facts combinations used in the BMOT. "Intermediate" problems included combinations of two-digit and one- or two-digit numbers for addition and subtraction; for multiplication and division, combinations of a two-digit number and a one- or two-digit number were used. Division problems at the intermediate level were designed so there were no remainders. At the "advanced" level, addition and subtraction problems involved combinations of a three-digit and a two- or three-digit number. Multiplication problems at the advanced level required multiplying a three-digit number by a one- or two-digit number. Advanced division consisted of dividing a three-digit number by a two- or three-digit number. No effort was made to avoid problems involving remainders. Computation problems were created by means of the random number generator function on a handheld calculator.
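As an illustration, intermediate and advanced BET computation items could be generated under the digit rules above with Python's pseudorandom generator standing in for the handheld calculator. Basic items are omitted because they mirror the BMOT facts. Three details are assumptions not stated in the text: subtraction operands are ordered so that answers are never negative, advanced divisors do not exceed the dividend, and a one-digit operand may be any value from 0 to 9. The answer key follows the rule stated for the correct answer (the exact result, with division rounded to the nearest whole number); rounding halves up is itself an assumption.

```python
import random

def round_half_up(numerator, denominator):
    """Nearest whole number for positive integers, with halves rounded up."""
    return (2 * numerator + denominator) // (2 * denominator)

def n_digit(rng, n):
    """Random integer with exactly n digits (for n = 1 the range is 0-9)."""
    low = 0 if n == 1 else 10 ** (n - 1)
    return rng.randint(low, 10 ** n - 1)

# (digits in first number, allowed digit counts for second number), per the text
SPEC = {
    ("intermediate", "+"): (2, (1, 2)),
    ("intermediate", "-"): (2, (1, 2)),
    ("intermediate", "x"): (2, (1, 2)),
    ("intermediate", "/"): (2, (1, 2)),
    ("advanced", "+"): (3, (2, 3)),
    ("advanced", "-"): (3, (2, 3)),
    ("advanced", "x"): (3, (1, 2)),
    ("advanced", "/"): (3, (2, 3)),
}

def bet_computation_problem(op, level, rng):
    """Return (left, op, right, key); key is the exact answer, with division
    keys rounded to the nearest whole number."""
    first_digits, second_choices = SPEC[(level, op)]
    while True:
        left = n_digit(rng, first_digits)
        right = n_digit(rng, rng.choice(second_choices))
        if op == "-" and right > left:
            left, right = right, left  # assumption: answers are never negative
        if op == "/":
            if right == 0 or right > left:
                continue  # assumption: nonzero divisor no larger than the dividend
            if level == "intermediate" and left % right != 0:
                continue  # intermediate division has no remainders
            return left, op, right, round_half_up(left, right)
        key = {"+": left + right, "-": left - right, "x": left * right}[op]
        return left, op, right, key
```

Rejection sampling (drawing again until a pair satisfies the constraints) keeps the code close to the verbal rules rather than deriving operands arithmetically.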

TABLE 2. Design Specifications for Constructing Estimation Probes

Probe and problem type        Addition        Subtraction     Multiplication  Division
                              W     R         W     R         W     R         W     R
Basic estimation task (BET)
  Basic computation           3     0         3     0         3     0         3     0
  Intermediate computation    2     0         2     0         2     0         2     0
  Advanced computation        2     0         2     0         2     0         2     0
  Basic word                  2     0         2     0         2     0         2     0
  Intermediate word           1     0         1     0         1     0         1     0
Modified estimation task (MET)
  Intermediate computation    1     1         1     1         1     1         2     2
  Advanced computation        1     1         1     1         1     1         2     2
  Word problems               2     2         2     2         2     2         4     4

Note. W = whole numbers; R = rational numbers. Numerical values represent the number of problems of each type in the task.
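Encoding Table 2 as data makes the stated differences between the BET and the MET easy to check. The dictionary layout and the count_items helper below are illustrative choices, not part of the study.

```python
# Table 2 encoded as {problem type: {operation: (whole, rational)}}.
BET_SPEC = {
    "Basic computation":        {"+": (3, 0), "-": (3, 0), "x": (3, 0), "/": (3, 0)},
    "Intermediate computation": {"+": (2, 0), "-": (2, 0), "x": (2, 0), "/": (2, 0)},
    "Advanced computation":     {"+": (2, 0), "-": (2, 0), "x": (2, 0), "/": (2, 0)},
    "Basic word":               {"+": (2, 0), "-": (2, 0), "x": (2, 0), "/": (2, 0)},
    "Intermediate word":        {"+": (1, 0), "-": (1, 0), "x": (1, 0), "/": (1, 0)},
}
MET_SPEC = {
    "Intermediate computation": {"+": (1, 1), "-": (1, 1), "x": (1, 1), "/": (2, 2)},
    "Advanced computation":     {"+": (1, 1), "-": (1, 1), "x": (1, 1), "/": (2, 2)},
    "Word problems":            {"+": (2, 2), "-": (2, 2), "x": (2, 2), "/": (4, 4)},
}

def count_items(spec, rows=None, ops=None, number_type=None):
    """Sum problem counts, optionally restricted to some rows, operations,
    or one number type ("whole" or "rational")."""
    total = 0
    for row_name, by_op in spec.items():
        if rows is not None and row_name not in rows:
            continue
        for op, (whole, rational) in by_op.items():
            if ops is not None and op not in ops:
                continue
            if number_type == "whole":
                total += whole
            elif number_type == "rational":
                total += rational
            else:
                total += whole + rational
    return total
```

Under this encoding both probes total 40 items; word problems rise from 12 on the BET to 20 on the MET, division items rise from 10 to 16, and half of the MET items (20 of 40) involve rational numbers.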

Word problems paralleled the construction specifications used for basic and intermediate computation problems. Three general topics were used when creating word problems: money and purchasing; sports, hobbies, and music; and school and work.

The correct answer for each problem was the answer that would be obtained if the exact computation were carried out. In the case of remainders in division, the answer was rounded to the nearest whole number. Distracters for both types of problems were constructed to simulate common student errors (e.g., incorrect operation, digits out of order, faulty algorithm). The three response options presented were varied in magnitude to facilitate students' use of estimation or "number sense" (Sowder, 1992) to quickly select the correct answer. Students had 3 minutes to respond to the 40 items by selecting the letter (A, B, or C) corresponding to the correct answer.

Modified Estimation Tasks (METs). Two forms of the MET (Form A and Form B) were developed to represent variations of the BET; they included (a) an increased proportion of word problems, (b) an increased proportion of division problems, (c) elimination of basic computation problems, and (d) inclusion of problems involving rational numbers. The number of problems of each type is indicated in Table 2. "Intermediate" computation problems involved combinations of numbers with two digits; "advanced" problems involved at least one number with three or more digits. Only common fractions (1/4, 1/2, 3/4) and percentages (10%, 25%, 30%, 50%) were used to facilitate student understanding and encourage estimation. Word problems included numbers similar to those in the intermediate and advanced computation problems. They were constructed to represent the types of math problems encountered in daily living situations. Rather than using a random number generator, we developed the MET problems by selecting numbers that could be rounded to obtain a problem that could be computed using mental arithmetic, thus encouraging students to use reformulation and other estimation strategies (R. E. Reys, Rybolt, Bestgen, & Wyatt, 1982).
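One reformulation strategy that such numbers invite is rounding each operand before computing mentally. The sketch below assumes rounding to one significant figure with halves rounded up; students may round differently, and the helper names are hypothetical. Under this rule, 7.9 x 9 becomes 8 x 9 and 921 - 480 becomes 900 - 500.

```python
import operator
from decimal import ROUND_HALF_UP, Decimal

OPS = {"+": operator.add, "-": operator.sub, "x": operator.mul, "/": operator.truediv}

def round_to_one_sig_fig(x):
    """Round a positive number to one significant figure, halves up (7.9 -> 8, 480 -> 500)."""
    d = Decimal(str(x))
    exponent = d.adjusted()  # power of 10 of the leading digit
    leading = d.scaleb(-exponent).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return float(leading.scaleb(exponent))

def reformulated_estimate(left, op, right):
    """Estimate by rounding both operands first, then computing exactly."""
    return OPS[op](round_to_one_sig_fig(left), round_to_one_sig_fig(right))
```

Using Decimal avoids binary floating-point surprises when the leading digit sits exactly on a rounding boundary.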

The two METs contained identical problems and differed only in response format. MET-A provided alternatives in the form of exact numbers (e.g., 16, 247, 3,562), paralleling the format used in the BET. The correct response was the exact answer obtained by computation or, in the case of decimals and fractions, rounded to the nearest significant digit matching the form of the numbers used in the stem. Distracters were constructed using the same procedures used for the BET. MET-B response alternatives, rounded numbers that differed from one another by a power of 10 (e.g., 40, 400, 4,000), represented a response format often used in tests of estimation (R. E. Reys, 1986; Schoen, Blume, & Hoover, 1990). For all three estimation tasks, the position of the correct answer within the multiple-choice alternatives was varied systematically so that all three positions were regularly used. Two parallel forms of each type of task (MET-A and MET-B) were developed. Students had 3 minutes to respond to 40 problems by selecting the letter corresponding to the correct answer.
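The MET-B response format, three rounded alternatives separated by powers of 10, can be generated mechanically. This sketch assumes that the correct alternative is the exact answer rounded to one significant figure and that the correct position cycles through A, B, and C; the text says only that positions were "varied systematically," so the rotation scheme is an assumption. Under those assumptions the code reproduces both MET-B sample items in Table 1.

```python
from decimal import ROUND_HALF_UP, Decimal

def round_one_sig_fig(x):
    """Round a positive number to one significant figure (halves up), as a Decimal."""
    d = Decimal(str(x))
    exponent = d.adjusted()
    leading = d.scaleb(-exponent).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return leading.scaleb(exponent)

def met_b_options(exact_answer, correct_position):
    """Three alternatives differing by powers of 10, with the rounded exact answer
    at index correct_position (0 = A, 1 = B, 2 = C)."""
    rounded = round_one_sig_fig(exact_answer)
    return [float(rounded.scaleb(k - correct_position)) for k in range(3)]

def rotate_positions(n_items):
    """Cycle the correct answer through A, B, C so each position is used about equally."""
    return ["ABC"[i % 3] for i in range(n_items)]
```

For example, 7.9 x 9 = 71.1 rounds to 70; placing it at C yields .7, 7, and 70, and the jacket problem (a $28.50 saving) yields $3, $30, and $300 with the correct answer at B.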

Criterion Variables. Data were collected from the teacher and school records for use as external criteria to which scores from the measures could be related. We selected 4 variables related to proficiency in mathematics as representative of information typically available to teachers for making decisions regarding students’ academic progress. The 4 variables included math grade point average, overall grade point average, standardized test scores, and teacher ratings. School records were used to retrieve data on all criterion variables except the teacher ratings.

School Record Data. School records were used to obtain data on student grades and standardized test scores. Math GPA was defined as the semester grade earned for the first semester of the school year during which the study was conducted. A 4-point scale (A = 4, B = 3, C = 2, D = 1, F = 0) was used for all grade-related data. Overall GPA was defined as the average grade across four "core" courses. As defined by the school, the core courses for students participating in this study included mathematics, English, geography, and family life science. The overall GPA was calculated to the nearest hundredth of a point. Standardized test scores were drawn from subtests of the California Achievement Test (CAT; CTB/McGraw-Hill, 1986). The Mathematics Computation subtest assesses skills in computing with whole numbers, fractions, mixed numbers, decimals, and algebraic equations. Problems on the Computation subtest are exclusively numerical. The Mathematics Concepts and Applications subtest assesses skills in comprehending and applying concepts involving numeration, number sentences, number theory, problem solving, measurement, and geometry (Salvia & Ysseldyke, 1991). For comparison purposes, students' scores on reading subtests of the CAT were also gathered. The CAT is administered to students in the district in sixth and eighth grades. Because scores for the current school year were not yet available at the time of the study, sixth-grade scores (available only for the seventh and eighth graders in the sample) were used. Sixth-grade students (n = 9) were excluded from analyses involving standardized test scores.
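The overall GPA computation reduces to a mean on the 4-point scale, reported to the nearest hundredth. In this minimal sketch the course grades are hypothetical.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def overall_gpa(core_grades):
    """Mean grade points across the core courses, rounded to the nearest hundredth."""
    points = [GRADE_POINTS[grade] for grade in core_grades.values()]
    return round(sum(points) / len(points), 2)

# Hypothetical student: B, A, C, B across the four core courses -> 3.0
example = overall_gpa({"mathematics": "B", "English": "A",
                       "geography": "C", "family life science": "B"})
```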

Teacher ratings. The NCTM curricular goals were used to create a scale for teacher ratings. For each student, the teacher used a Likert scale (1 = very low, 5 = very high) to rate the student’s (a) overall proficiency in mathematics, (b) value for mathematics, (c) confidence in his or her mathematics ability, (d) mathematical problem-solving ability, (e) mathematical communication ability, and (f) mathematical reasoning ability. The first scale was used to obtain the teacher’s estimate of general competence in mathematics. The last five ratings corresponded to the goals advocated by the NCTM. Ratings were collected from the teacher during the same week in which the four measures were administered.

Procedure

On two occasions during 1 week in the spring of 1995, students completed a series of measurement tasks including the BET, two parallel forms of the BMOT, and two parallel forms of either the MET-A or the MET-B. Each of the four classes was randomly assigned to one of the two MET forms. The task sequence began with the BET for all four classes, with the order of the BMOT and MET counterbalanced across classes. The same sequence of administration was used in each testing session. Prior to the first administration of unfamiliar tasks (BMOT and MET), students were provided with brief (5-minute) instructional sessions during which they were shown examples of the types of problems in the task, provided a demonstration of the solution process, and allowed to practice responding to sample items.
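The assignment and counterbalancing scheme can be sketched as follows. The text does not say how many classes received each MET form or exactly how the BMOT/MET order was crossed, so this sketch assumes two classes per form, with each order occurring once within each form; the class labels are hypothetical.

```python
import random

def assign_conditions(classes, seed):
    """Randomly assign MET forms to classes and counterbalance BMOT/MET order.

    Assumption: half the classes get each MET form, and within each form one
    class takes the BMOT first and the other takes the MET first. The BET is
    always first, and the same sequence is reused in both sessions.
    """
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    plan = {}
    for i, name in enumerate(shuffled):
        met_form = "MET-A" if i < len(shuffled) // 2 else "MET-B"
        if i % 2 == 0:
            middle = ("BMOT", met_form)
        else:
            middle = (met_form, "BMOT")
        plan[name] = ("BET",) + middle
    return plan

plan = assign_conditions(["Algebra", "Pre-algebra", "General math 1", "General math 2"], seed=3)
```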

Data for the study were collected via a group response technology system (Discourse®, 1990; see Foegen & Hargrave, 1997; Robinson, 1991, 1994) used for instructional purposes in the teacher's classroom as part of the larger study. The group response technology used networking to link individual student terminals to a teacher workstation. Each terminal consisted of an eight-line display (40 characters per line) and a standard keyboard. Students viewed each task on the display and were able to control the pacing of item presentation by moving from one screen to the next using simple keyboard commands. The system allowed us to control precisely the timing of the tasks and permitted students' responses to be immediately recorded, scored, and saved to disk.

Scoring

Scores for the measures were the number of correct responses made by each student. Because the tasks were administered on the technology system, all scoring was completed automatically. Students' raw scores on the CAT were converted to standard scores with a mean of 100 and a standard deviation of 15 by using information about subtest means and standard deviations provided in the CAT technical manual. Because of the high degree of overlap in the distributions of scores for Grades 7 and 8 students, we deemed cross-grade analyses to be appropriate.
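Both scoring steps can be sketched briefly. The conversion assumes a linear z-score transformation onto a scale with mean 100 and SD 15; norm_mean and norm_sd stand in for the subtest values from the CAT technical manual, which are not reported here, so the numbers below are made up.

```python
def number_correct(responses, answer_key):
    """Probe score: count of responses matching the key (blank responses never match)."""
    return sum(r == k for r, k in zip(responses, answer_key))

def to_standard_score(raw, norm_mean, norm_sd, new_mean=100, new_sd=15):
    """Linear conversion of a raw score to a scale with the given mean and SD."""
    z = (raw - norm_mean) / norm_sd
    return new_mean + new_sd * z

# Hypothetical norms: a raw score half an SD above the mean maps to 107.5.
example_ss = to_standard_score(40, norm_mean=35, norm_sd=10)
```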

Analyses

The first step in analyzing the data was to examine the reliability of each of the measures. In the second step, we explored the degree to which students' scores on the measures were related to other indicators of mathematics proficiency (standardized test scores, grades, teacher ratings). A third analysis was conducted to explore the efficacy of using the measures singly or in combination to predict performance on the criterion variables. The primary analyses involved correlation and multiple regression. Correlation coefficients were used to explore the reliability of the measures and relations between students' scores on the 4 performance measures and the criterion variables. Multiple regression analyses permitted an examination of the degree to which the performance measures could be used singly or in combination to predict teacher ratings.
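The regression step can be sketched with ordinary least squares. This pure-Python helper is hypothetical (it is not the software used in the study); it fits an intercept plus any number of predictor columns through the normal equations and reports R-squared, so one measure (singly) or several measures (in combination) can be compared as predictors of a teacher rating.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_r_squared(predictors, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2).

    predictors is a list of columns, one list of scores per measure.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot
```

Calling ols_r_squared([bet_scores], ratings) and then ols_r_squared([bet_scores, bmot_scores], ratings) (hypothetical variable names) shows how much a second measure adds beyond the first.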

Results

The means and standard deviations for scores obtained from the measures and the criterion variables are presented in Table 3. Mean scores on the BET increased incrementally across grades; the same was not true of the other measures. One factor influencing this result might have been students' familiarity with the BET. As part of the larger study, students had completed one BET per week for the 24 weeks of school preceding the data collection period for this study. In contrast, the BMOT and METs were new tasks to which the students had not previously been exposed. Students' lack of familiarity may be reflected in the restricted range of scores obtained for some measures. For the METs in particular, two thirds of the students scored between 0 and 10, so floor effects for these measures should be considered in interpreting the results. Similarly, the limited variability in the teacher ratings should be noted when evaluating the results of subsequent analyses. Limited variability is a particular problem for rating scales, which tend to be cross-sectional rather than longitudinal in character.

TABLE 3. Means and Standard Deviations for Measures and Criterion Variables by Grade and Full Sample

Measure/variable         Full sample         Grade 6             Grade 7             Grade 8
                         N    M      SD      n    M      SD      n    M      SD      n    M      SD
Measures
  BMOT                   94   11.84  6.03    8    11.19  5.39    48   11.22  5.64    37   12.84  6.71
  BET                    95   19.86  7.30    8    16.81  8.00    48   18.76  6.94    38   22.04  7.26
  MET-A                  53   8.66   3.98    4    6.63   5.02    25   8.74   3.66    24   8.92   4.21
  MET-B                  42   8.50   3.92    4    7.63   2.66    23   7.41   3.65    14   10.57  4.16
Grade point average
  Math GPA               90   2.69   1.05    9    2.56   1.13    45   2.80   1.12    36   2.58   0.94
  Overall GPA            90   2.64   1.07    9    2.28   1.15    45   2.72   1.10    36   2.62   1.02
CAT scores(a)
  Computation                                                    44   99.42  16.79   30   101.26 14.29
  Concepts                                                       45   100.07 17.57   30   100.86 13.01
Teacher ratings
  Proficiency            98   3.68   0.86    9    3.44   1.01    49   3.84   0.66    40   3.55   1.01
  Communication          98   3.68   0.86    9    3.44   1.01    49   3.78   0.71    40   3.63   0.98
  Confidence             98   3.45   0.96    9    3.22   0.97    49   3.63   0.83    40   3.27   1.09
  Problem solving        98   3.36   0.94    9    3.22   1.09    49   3.49   0.92    40   3.23   0.95
  Reasoning              98   3.81   0.86    9    3.78   1.20    49   3.90   0.77    40   3.70   0.88
  Valuing mathematics    98   3.34   1.10    9    3.00   1.32    49   3.51   1.04    40   3.20   1.11

Note. CAT = California Achievement Test; BMOT = basic math operations task; BET = basic estimation task; MET = modified estimation task.
(a) CAT scores available for Grade 7 and Grade 8 students only.

Reliability of the General Outcome Measures

The reliability of the measures was evaluated by examining scoring reliability for the technology system and by computing estimates of the internal consistency, test-retest, and parallel forms reliability for each of the math measures. For this study, scoring reliability was determined by hand scoring students’ responses to a sample of the tasks administered on the technology system and comparing these results to those produced by the computer. For each of the four classes, one set of data from each of the two testing sessions (selected to represent each of the administered tasks) was rescored. These eight data sets represented 17% of the total number of tasks administered across the two sessions and consisted of 3,257 individual student responses. An agreement was counted when both the researcher and the computer scored the response as correct or incorrect. A disagreement was counted when the researcher and the computer differed in the scoring of a response. Scoring reliability was computed by dividing agreements by agreements plus disagreements. The scoring reliability of the technology system was found to be 99.7%.
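The scoring-reliability index is a point-by-point agreement ratio. A minimal sketch, assuming each response is coded 1 (scored correct) or 0 (scored incorrect) by each scorer:

```python
def scoring_agreement(hand_scores, computer_scores):
    """Agreements / (agreements + disagreements) across the same set of responses."""
    if len(hand_scores) != len(computer_scores):
        raise ValueError("both scorers must score the same responses")
    agreements = sum(h == c for h, c in zip(hand_scores, computer_scores))
    disagreements = len(hand_scores) - agreements
    return agreements / (agreements + disagreements)
```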

Internal consistency, test-retest, and parallel forms reliability estimates are presented in Table 4. Internal consistency reliability was estimated using Cronbach’s alpha (Mehrens & Lehmann, 1991). The BET and the BMOT were found to have somewhat higher levels of internal consistency than the METs. Test-retest reliability was estimated by computing Pearson product-moment correlation coefficients between scores obtained during the first testing session and those obtained during the second. For the BET, the coefficient represents the relation between scores obtained on the first day and scores obtained on the second. For measures with multiple forms (BMOT, MET-A, MET-B), test-retest coefficients are presented separately for each form and then for scores aggregated across the two forms. The aggregated scores consisted of the mean of the two forms of a measure administered on a single day. In every case, test-retest reliability coefficients increased when scores were aggregated.

TABLE 4. Reliability of the Measures

Measure   Internal consistency   Test-retest          Parallel forms

BET       .93                    .80                  –(a)
BMOT      Form 1  .92            Form 1      .80      Day 1       .79
          Form 2  .91            Form 2      .84      Day 2       .80
                                 Aggregated  .85      Aggregated  .82
MET-A     Form 1  .81            Form 1      .73      Day 1       .67
          Form 2  .78            Form 2      .67      Day 2       .73
                                 Aggregated  .81      Aggregated  .79
MET-B     Form 1  .77            Form 1      .77      Day 1       .77
          Form 2  .81            Form 2      .80      Day 2       .82
                                 Aggregated  .88      Aggregated  .86

Note. All correlations significant at p < .01 level. BET = basic estimation task; BMOT = basic math operations task; MET = modified estimation task.

(a) Correlation could not be computed. Students completed either Form A or Form B of the MET.

Pearson product-moment correlation coefficients were also used to examine the parallel forms reliability of scores produced by the measures. An estimate could not be computed for the BET because only a single form of the task was administered during each testing session. Aggregated scores consisted of the mean of the scores from a single form of a measure administered across two testing sessions. As with the test-retest coefficients, aggregation resulted in larger coefficients in all cases.
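The test-retest and parallel forms coefficients are Pearson product-moment correlations; aggregation replaces each student's single score with the mean of two scores before correlating. A minimal Python sketch of both steps follows. The helper names and the five-student scores are hypothetical and are not drawn from the study.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists of length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def aggregate(*score_lists):
    """Per-student mean of several administrations (e.g., Day 1 and Day 2)."""
    return [mean(scores) for scores in zip(*score_lists)]


# Hypothetical scores for five students on two forms across two days.
form1_day1 = [12, 18, 9, 22, 15]
form2_day1 = [14, 17, 10, 20, 13]
form1_day2 = [13, 20, 8, 23, 16]
form2_day2 = [15, 19, 11, 21, 16]

# Parallel forms on a single day: Form 1 vs. Form 2 on Day 1.
r_day1 = pearson_r(form1_day1, form2_day1)
# Parallel forms, aggregated: each form's mean across the two days.
r_agg = pearson_r(aggregate(form1_day1, form1_day2),
                  aggregate(form2_day1, form2_day2))
print(round(r_day1, 2), round(r_agg, 2))
```

The same two helpers cover test-retest reliability: correlate Day 1 with Day 2 for one form, or correlate the two forms' Day 1 mean with their Day 2 mean for the aggregated coefficient.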

Criterion Validity of the General Outcome Measures

The second step in the analysis addressed relations between students’ scores on the measures and other indicators of their mathematics proficiency. Correlation coefficients for the scores obtained from the four measures and the criterion variables are presented in Table 5. Because the number of scores available differed by task (two for the BET, four for the other measures), comparability of the scores used in the analyses was an issue. Although reliability of the scores increased when scores were aggregated, aggregating across all possible scores may create an advantage in reliability for the BMOT and MET tasks over the BET. To hold the level of aggregation constant across all tasks, we selected the mean of the first two administrations of each task as the score to be used for all analyses. For the BET, these administrations crossed both days of testing, whereas for the BMOT and the MET, both scores were obtained during the first day of testing.
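The score-selection rule described above (the mean of each task's first two administrations) can be written as a small helper. This Python sketch assumes each student's scores are stored in the order they were administered; the function name and the example values are hypothetical.

```python
from statistics import mean


def first_two_mean(administrations):
    """Mean of a student's first two administrations of a task.

    administrations: that student's scores in administration order.
    """
    if len(administrations) < 2:
        raise ValueError("need at least two administrations")
    return mean(administrations[:2])


# Hypothetical student. The BET yields two scores (one per day), so its first
# two administrations span both days; the BMOT yields four (two forms on each
# of two days), so its first two administrations both fall on Day 1.
bet_scores = [14, 17]            # Day 1, Day 2
bmot_scores = [30, 34, 33, 36]   # Form 1 Day 1, Form 2 Day 1, Form 1 Day 2, Form 2 Day 2
print(first_two_mean(bet_scores), first_two_mean(bmot_scores))
```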

TABLE 5. Relations Between the Measures and the Criterion Variables

                                        Measure
Criterion variable         BMOT       BET        MET-A      MET-B

Grade point average
  Math GPA                 .44(**)    .39(**)    .30(**)    .22
  Overall GPA              .62(**)    .53(**)    .41(**)    .50(**)
CAT scores
  Mathematics
    Computation            .63(**)    .56(**)    .47(**)    .55(**)
    Concepts               .44(**)    .45(**)    .29        .55(**)
  Reading
    Reading                .33(**)    .48(**)    .38(*)     .62(**)
    Vocabulary             .32(**)    .50(**)    .21        .65(**)
Teacher ratings
  Proficiency              .52(**)    .49(**)    .39(**)    .51(**)
  Communication            .42(**)    .45(**)    .38(**)    .45(**)
  Confidence               .33(**)    .29(**)    .31(**)    .19
  Problem solving          .54(**)    .54(**)    .42(**)    .50(**)
  Reasoning                .49(**)    .43(**)    .40(**)    .44(**)
  Valuing mathematics      .16        .15        .22        -.12

Note. BMOT = basic math operations task; BET = basic estimation task; MET = modified estimation task.

(*) p < .05; (**) p < .01.

Correlations between the measures and grade point average were in the moderate range, with performance on the math measures more strongly related to overall GPA than to math GPA.

Analyses involving standardized test scores were conducted using combined data for students in Grades 7 and 8. In general, coefficients between the measures and the math subtests were in the moderate range. As with GPA, relations involving the MET-A were lower than those for the other measures and, in some cases, unreliable. The strongest relations were those that involved either the MET-B or the Computation subtest. Strong relations between the BMOT and the Computation subtest were not surprising, given the similarity between the tasks, both of which were limited exclusively to computation. As anticipated, the estimation measures (which required reading word problems) were more highly related to the reading subtests than was the BMOT. The correlations involving the MET-B were particularly interesting: student performance on the MET-B was more closely related to the reading subtests of the CAT than to the mathematics subtests.

Reliable moderate relations were identified between student scores on the measures and the teacher’s ratings of students’ mathematical proficiency and status on four of the five NCTM curricular goals. The exception was the “Valuing Mathematics” rating, for which none of the relations was reliable. With the exception of the MET-A, for which the coefficients were somewhat lower, the BMOT, BET, and MET-B appeared to correspond to teacher ratings with comparable levels of accuracy. Differences were also observed across the six scales on which the teacher rated student performance: the measures were more strongly related to the Proficiency, Communication, Problem Solving, and Reasoning scales than to the Confidence and Valuing Mathematics scales.

Efficacy of Single vs. Combined Measures for Prediction of Criterion Variables

Using regression analyses, we examined the relative efficacy of using the measures separately and in combination to predict standardized test scores and teacher ratings. Our purpose was to determine the degree to which each of the measures contributed uniquely to the prediction of these criterion variables. Because each class was randomly assigned to one of the two Modified Estimation Tasks (either MET-A or MET-B), it was not possible to include both measures in the same regression analysis. Rather than double the number of analyses conducted, we decided to exclude the MET-A from further consideration. This decision was based on the technical adequacy data, which indicated the MET-B was a more reliable and valid measure than the MET-A. Because of the weak and often unreliable relations to the measures, we also excluded the teacher’s ratings of Confidence and Valuing Mathematics to minimize the potential error associated with a large number of analyses. The lapse in time between administration of the standardized test (spring of sixth grade) and collection of the measurement data (spring of seventh or eighth grade) might have diminished the likelihood that both measures were tapping comparable levels of performance.

Results of the regression analyses for standardized test scores and teacher ratings are presented in Tables 6 and 7, respectively. We conducted the analyses using a forced-entry method of regression. For each criterion variable, a series of three regressions was conducted. One measure was forced into the equation, and then the remaining two were entered in order of significance. The p-value criterion on the computer program was adjusted so that all variables would be entered into the regression equation. The results of the regression analyses are provided in two sections. First, we consider the prediction of the criterion measures from a single measure. Next, we address the relative contributions of the measures to predicting CAT scores in math and teacher ratings.
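The sequential procedure can be sketched as follows: fit ordinary least squares regressions with one, then two, then three predictors, and after each step compute the cumulative R² and the F change for the increment, F = (ΔR²/1) / ((1 − R²)/(n − k − 1)), where k is the number of predictors entered so far. This pure-Python sketch rests on simplifying assumptions. Predictors enter in a fixed, caller-supplied order rather than in order of significance, the data are hypothetical, and p values (which would come from an F distribution with 1 and n − k − 1 degrees of freedom, e.g., via `scipy.stats.f.sf`) are omitted. It is not the statistical program used in the study.

```python
def solve_linear(a, b):
    """Solve a x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(predictors, y):
    """R^2 from an OLS regression of y on the predictor columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[a] * row[b] for row in rows) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(rows, y)) for a in range(p)]
    beta = solve_linear(xtx, xty)
    y_bar = sum(y) / n
    sse = sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
              for row, yi in zip(rows, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


def sequential_entry(measures, y):
    """Enter (name, scores) pairs one at a time, in the order given.

    Returns (name, cumulative R^2, F change) for each step; F change tests the
    R^2 added by the single new predictor (df = 1 and n - k - 1).
    """
    n = len(y)
    entered, steps, prev_r2 = [], [], 0.0
    for name, scores in measures:
        entered.append(scores)
        r2 = r_squared(entered, y)
        df_resid = n - len(entered) - 1
        f_change = (r2 - prev_r2) / ((1 - r2) / df_resid)
        steps.append((name, r2, f_change))
        prev_r2 = r2
    return steps


# Hypothetical scores for eight students (not data from this study).
bmot = [30, 25, 40, 35, 20, 45, 28, 38]
bet = [12, 10, 15, 14, 9, 16, 13, 12]
computation = [55, 48, 70, 60, 40, 75, 52, 63]

# One sequence: BET forced in first, then BMOT.
for name, r2, f in sequential_entry([("BET", bet), ("BMOT", bmot)], computation):
    print(f"{name:5s} cumulative R^2 = {r2:.2f}  F change = {f:.2f}")
```

Running the same function three times, once with each measure listed first, mirrors the layout of Tables 6 and 7, in which each criterion variable appears three times.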

TABLE 6. Prediction of CAT Scores

Criterion variable   Measure(a)   Cumulative R²   F change   p value

CAT score
Computation          BMOT         .44             23.79      .000
                     MET-B        –               –          –
                     BET          –               –          –
Computation          BET          .35             16.10      .000
                     BMOT         .47              6.85      .014
                     MET-B        –               –          –
Computation          MET-B        .30             13.03      .001
                     BMOT         .48              9.62      .004
                     BET          –               –          –
Concepts             BMOT         .32             14.60      .001
                     MET-B        –               –          –
                     BET          –               –          –
Concepts             BET          .32             14.62      .001
                     BMOT         –               –          –
                     MET-B        –               –          –
Concepts             MET-B        .30             13.26      .001
                     BET          .38              4.12      .051
                     BMOT         –               –          –

Note. CAT = California Achievement Test. BMOT = basic math operations task; MET = modified estimation task; BET = basic estimation task.

TABLE 7. Prediction of Teacher Ratings

Criterion variable   Measure(a)   Cumulative R²   F change   p value

Teacher rating
Proficiency          BMOT         .42             28.25      .000
                     BET          –               –          –
                     MET-B        –               –          –
Proficiency          BET          .36             21.96      .000
                     BMOT         .45              6.59      .014
                     MET-B        –               –          –
Proficiency          MET-B        .26             13.39      .002
                     BMOT         .43             11.96      .001
                     BET          –               –          –
Communication        BMOT         .31             17.18      .000
                     BET          .39              5.34      .026
                     MET-B        –               –          –
Communication        BET          .36             22.34      .000
                     BMOT         –               –          –
                     MET-B        –               –          –
Communication        MET-B        .20             10.04      .003
                     BET          .37              9.88      .003
                     BMOT         –               –          –
Problem solving      BMOT         .45             32.11      .000
                     BET          .53              6.02      .019
                     MET-B        –               –          –
Problem solving      BET          .46             33.11      .000
                     BMOT         .53              5.41      .025
                     MET-B        –               –          –
Problem solving      MET-B        .25             13.18      .001
                     BET          .47             15.23      .000
                     BMOT         .54              5.93      .020
Reasoning            BMOT         .43             29.90      .000
                     BET          –               –          –
                     MET-B        –               –          –
Reasoning            BET          .36             22.10      .000
                     BMOT         .47              7.35      .010
                     MET-B        –               –          –
Reasoning            MET-B        .20              9.57      .003
                     BMOT         .43             15.98      .000
                     BET          –               –          –

Note. BMOT = basic math operations task; BET = basic estimation task; MET = modified estimation task.

(a) Variables are listed in the order in which they were entered into the analyses.

Prediction of Criterion Measures From a Single Measure. Results in Tables 6 and 7 can be used to identify the “best” measure for use in predicting CAT scores and teacher ratings. Each dependent variable is listed in the tables three times, once with each of the measures as the first entry in the regression equation. Comparing the R² values of the measures entered first identifies the single “best” predictor. These results parallel those of the correlation data presented in Table 5 but may differ slightly due to the reduced sample size. For the CAT scores, the BMOT was the strongest single predictor of the Computation subtest, accounting for 44% of the variance. The BMOT and the BET predicted scores on the Concepts subtest equally well but accounted for a smaller proportion (32%) of the variance.

Similar findings were obtained for regression analyses involving the teacher ratings. The BMOT was the best single predictor of Proficiency and Reasoning, accounting for slightly over 40% of the variance in each rating. Communication was best predicted by the BET, whereas the BMOT and the BET predicted Problem Solving equally well. Consistent with the slightly lower correlations obtained for Communication, the proportion of variance in this scale accounted for by the measures was 15% to 20% less than that of the other ratings. With the exception of the Communication ratings, the BMOT was found to be the single best predictor of the teacher ratings and CAT scores, although it could be used interchangeably with the BET in two instances. The MET-B did not prove to be useful for predicting CAT scores or teacher ratings. Next, we examined the regression data further to explore the extent to which use of the measures in combination increased the amount of variance accounted for in the criterion measures.

Prediction of Criterion Measures From Combined Measures. The regression data in Tables 6 and 7 can also be used to examine the amount of overlap in the variance accounted for by the measures and to determine whether using the measures in combination would improve prediction of CAT scores and teacher ratings. For the Computation subtest of the CAT, the BMOT alone accounted for a substantial portion of the variance (R² = .44). When either the BET or the MET-B was entered first, however, the BMOT accounted for a statistically significant amount of additional variance in Computation (increases of .12 and .18, respectively). The variance accounted for by the BMOT apparently included much of the variance accounted for by the other measures. For the Concepts subtest, the BMOT and the BET accounted for comparable proportions of the variance, with no statistically significant portion of the variance explained by adding a second measure. When the MET-B was entered first, however, the addition of the BET resulted in a slight increase in the proportion of variance explained (cumulative R² = .38, p = .051) over the use of either of the single measures.
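The increments reported above follow directly from the cumulative R² values in Table 6: each step's contribution is its cumulative R² minus the value before it. The short Python sketch below reproduces the .12 and .18 increments for the BMOT on the Computation subtest. Only the R² values come from the table; the helper name is illustrative.

```python
# Cumulative R^2 values for the Computation subtest, copied from Table 6.
computation_steps = {
    "BET first": [("BET", 0.35), ("BMOT", 0.47)],
    "MET-B first": [("MET-B", 0.30), ("BMOT", 0.48)],
}


def increments(steps):
    """Convert cumulative R^2 values into the R^2 added at each step."""
    out, prev = [], 0.0
    for name, cum_r2 in steps:
        out.append((name, round(cum_r2 - prev, 2)))
        prev = cum_r2
    return out


for label, steps in computation_steps.items():
    print(label, increments(steps))
```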

For each of the four teacher rating scales, any single measure accounted for a significant portion of the variance. Although the magnitude of the variance differed, each measure seemed to explain some portion of the teacher rating variable. A closer examination of the results for Proficiency, Communication, and Reasoning revealed that combinations of two measures often accounted for only slightly more of the variance than did any single measure. When the strongest predictor was entered first, the amount of additional variance accounted for by the remaining measures was nonsignificant. The pattern differed for the teacher rating of Problem Solving. Regardless of whether the BMOT or the BET was entered first, the addition of the other accounted for a significant portion of additional unique variance. When two measures were used in combination, increases in explained variance in Problem Solving were nearly double those obtained when combined measures were used to predict the other three ratings.

Discussion

This study explored relations between four potential indicators of growth in mathematics and other assessments of student proficiency in mathematics, including grades, standardized test scores, and teacher ratings. The basic research question was whether any of the four measures is sufficiently reliable and valid to be used for evaluating instruction.

Results support the conclusion that all four measures are reliable. Coefficients for internal consistency, test–retest, and parallel forms reliability are comparable with previous findings on the technical adequacy of CBM in math at the elementary level. When multiple samples were aggregated, the reliability of the measures increased. This finding parallels research by Fuchs, Deno, and Marston (1983), who found that although some academic behaviors can be measured precisely with a single administration of a task, others require more than one administration to obtain a reliable estimate of academic performance. In our study, the academic behaviors tapped by the mathematics measures appear to fall into the latter group.

Results of the criterion validity analyses provide a basis for concluding that the measures are promising indicators of math proficiency. Moderate relations exist between students’ scores on the measures and grades, standardized test scores, and teacher ratings. Although stronger correlations are always desirable, the present results are consistent with prior research. Previous studies of students in upper elementary and secondary settings (Espin & Deno, 1993; Espin & Foegen, 1996; Jenkins & Jewell, 1993) have identified a pattern of decreasing strength in relations between curriculum-based measures or general outcome measures and external criteria as students advance from elementary to secondary schools. In addition, CBM studies in mathematics at the elementary level have obtained lower criterion validity coefficients for math measures in comparison to reading measures (Marston, 1989). Our criterion validity results are similar to those reported for fifth- and sixth-grade students in previous studies of mathematics CBM probes. Skiba et al. (as cited in Marston, 1989) reported coefficients for computation probes ranging from .52 to .67. Fuchs and her colleagues reported coefficients ranging from .66 to .77 for computation probes (L. S. Fuchs et al., 1998) and from .67 to .81 for application probes (L. S. Fuchs et al., 1994; L. S. Fuchs, personal communication, February 27, 2000). Moreover, the results we obtained compare favorably with validity coefficients typically reported for standardized, norm-referenced tests of mathematics, which tend to be in the moderate range. The comparability of our findings is notable given the limited range of scores for the estimation measures and the teacher ratings, which likely suppressed the obtained correlations.

In all CBM and GOM validity research, the strength of the coefficients between the curriculum-based measures or general outcome measures and the criterion measures is higher for reading than for mathematics. It is not obvious why this should be. One hypothesis is that the “task” of reading is more holistic and, therefore, more easily measured as a generalized outcome than is possible for mathematics. Reaching consensus on what the more generalized “task” for mathematics is would be difficult. Mathematics is often thought of as a constellation of related, but not necessarily unified, constructs, skills, and content areas. Thus, any single measure of math proficiency, particularly one intended for middle school or high school students, might never reach the levels of criterion validity obtained by measures of reading ability.

Several empirical issues emerge from this study. First, the reliability of the measures increased when two samples of student performance were aggregated; however, the number of samples needed to maximize reliability is unknown. Future studies might involve gathering more samples of student performance; these data could also address students’ lack of familiarity with the probes’ content and format, which might have influenced the results of the present study.
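One conventional way to project how many samples might be needed is the Spearman-Brown prophecy formula, which predicts the reliability of the mean of k parallel samples from the reliability r of a single sample: r_k = kr / (1 + (k − 1)r). This analysis was not performed in the study, and the formula assumes strictly parallel samples, which repeated probe administrations may not be. The Python sketch below uses an illustrative single-sample value of .73 (the MET-A Form 1 test-retest coefficient in Table 4) and an assumed target of .90; the function names are hypothetical.

```python
from math import ceil


def spearman_brown(r_single, k):
    """Projected reliability of the mean of k parallel samples."""
    return k * r_single / (1 + (k - 1) * r_single)


def samples_needed(r_single, target):
    """Smallest whole number of parallel samples projected to reach the target reliability."""
    # Solve target = k r / (1 + (k - 1) r) for k.
    k = target * (1 - r_single) / (r_single * (1 - target))
    return ceil(k - 1e-9)  # small guard against floating-point noise above an integer


# Illustration only: single-sample reliability of .73, assumed target of .90.
print(round(spearman_brown(0.73, 2), 2))  # prints 0.84
print(samples_needed(0.73, 0.90))         # prints 4
```

Under these assumptions, two samples project to about .84 and four samples would be needed to reach .90; how closely actual probe administrations follow this projection remains the empirical question raised above.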

A second set of issues relates to probe formats and administration procedures. Although the facts and estimation probes developed for this study were found to have high levels of reliability and moderate validity, the impact of changes in format, response mode, and timing is unknown. Additional research might investigate the degree to which the technical adequacy of the measures could be improved by altering these design features.

When we examined the efficacy of using the measures singly or in combination to predict other indicators of mathematics proficiency, we concluded that single measures can often predict other indicators of mathematics competence about as well as combinations of measures. We find it curious that the estimation tasks, which represent more complex mathematical thinking than the facts task, did not seem to add as much to the prediction of mathematical performance as we had expected. Perhaps this is due to the types of processes on which many students rely when estimating (R. E. Reys et al., 1982); these strategies often depend heavily on mentally manipulating the numbers in the problem to create an approximation of a basic fact. The predictive ability of the estimation tasks used in this study also might have been limited by the characteristics of the task. The limited range of obtained scores may suggest that the time limits need to be extended. Moreover, the selected-response format (as opposed to the constructed responses used in the facts task) may result in increased measurement error associated with guessing.

The present results must be interpreted cautiously in light of two limitations in the study. First, although the data presented here support using the measures as indicators of student performance, this conclusion is limited to static levels of performance. Mean performance on the facts and estimation measures increased across grade levels, providing some limited evidence of sensitivity to growth. However, we cannot be sure that the scores on the measures will prove to be sensitive to changes in individual student performance across time. A collection of student performance data, gathered over time, would allow investigation of the sensitivity of the measures to short-term growth in mathematics achievement. Second, the teacher rating data must be interpreted cautiously because they were based on a single teacher’s evaluations of each of his students on six rating scales. Although the use of a single teacher increased the consistency with which the ratings were applied, the results may be biased by this teacher’s pattern of responding.

Our results lead us to conclude that the facts and estimation measures provide a reliable and valid indicator of student performance in mathematics; therefore, data drawn from these brief samples of student performance may prove to be useful to teachers as they make decisions related to mathematics instruction. Although the estimation tasks are likely to have higher face validity for middle school teachers, the degree of overlap in the variance accounted for by the measures suggests that the basic facts task may represent a more efficient alternative for teachers in terms of feasibility. The basic facts and estimation tasks explored here are not likely to be endorsed by the NCTM (W. F. Tate, personal communication, October 19, 1999), but they do offer several advantages over existing assessment options. Relative to traditional norm-referenced tests, these measures offer the added advantages of allowing for more frequent and repeated administration. They also can be tied more directly to general curricular goals and are more likely to produce data useful to teachers for instructional decision making. Measures of the type examined here are also more logistically feasible than the performance-based alternative assessment practices proposed by the NCTM and in the mathematics education literature in that they require minimal commitments of time for administration and scoring; they possess known (and acceptable) levels of technical adequacy; and, finally, it is possible to develop multiple forms of the measures so that repeated measurement of growth is possible.

The potential applications of growth indicators in middle and secondary schools are apparent in the work of other researchers exploring CBM and GOM. One such application is the use of curriculum-based measures in combination with more complex, time-consuming performance assessments. Fuchs and Fuchs (1996) described a system in which students’ growth in basic skills is monitored more frequently using CBM and their application of mathematics knowledge to more complex tasks is monitored intermittently using performance assessments. The development of a similar system for middle school and high school students merits further attention.

Growth indicators also hold promise with regard to high-stakes testing. In many states and districts, student performance on a particular assessment is linked to promotion or graduation (Jones et al., 1999; Stiggins, 1999). Growth indicators like those used in CBM or GOM can be used to predict student performance on high-stakes tests (Deno, Fuchs, & Marston, 1997). These data may prove useful in identifying those students in need of additional instructional supports to meet the required level of competency or in establishing target levels of student performance generally associated with passing marks on high-stakes tests. Tindal (in press) advocated a coordinated assessment system in which CBM is used (a) as an “early warning sign” to identify students who may be having difficulties, (b) as a means of determining appropriate assessment accommodations (see also L. S. Fuchs, Fuchs, Eaton, & Karns, 1999; L. S. Fuchs, Karns, Eaton, & Hamlett, 1999), and (c) as an alternate assessment for students who cannot participate in the typical testing program.

Our results suggest that the basic facts and estimation measures examined here may prove useful as growth indicators for middle school students. This finding adds to the current body of literature by exploring the performance of middle school students on measures that are likely to represent general outcomes of a variety of mathematics curricula. Given the success with which curriculum-based measures in mathematics at the elementary level have been used to support instructional decision making and improved student outcomes, the potential of these measures is high. The increasing emphasis on outcomes and student accountability, particularly at the secondary level, provides an environment in which monitoring students’ academic growth is essential. Future research should continue the development of these measures and the exploration of their use for supporting middle school teachers’ instructional decision making.

AUTHORS’ NOTE

This study was based on a doctoral dissertation submitted to the graduate school at the University of Minnesota. Parts of this study were presented at the 1997 annual meeting of the American Educational Research Association in Chicago.

REFERENCES

Baxter, G. P., Shavelson, R. J., Herman, S. J., Brown, K. A., & Valadez, J. R. (1993). Mathematics performance assessment: Technical quality and diverse student impact. Journal for Research in Mathematics Education, 24, 190-216.

Cain, R. W., & Kenney, P. A. (1992). A joint vision for classroom assessment. Mathematics Teacher, 85, 612-615.

Carnine, D. (1992). Expanding the notion of teachers’ rights: Access to tools that work. Journal of Applied Behavior Analysis, 25, 13-19.

Clarke, D. J. (1992). Activating assessment alternatives in mathematics. Arithmetic Teacher, 39(6), 24-29.

CTB/McGraw-Hill. (1986). California Achievement Tests: Forms E and F: Technical Bulletin 2. Monterey, CA: Author.

Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219-232.

Deno, S. L., Fuchs, L. S., & Marston, D. (1997). Alternative measures for predicting student performance on graduation standards. Presentation at the annual meeting of the Pacific Coast Research Conference, La Jolla, CA.

Deno, S. L., Marston, D., & Mirkin, P. (1982). Valid measurement procedures for continuous evaluation of written expression. Exceptional Children, 48, 368-371.

Deno, S. L., Mirkin, P., & Chiang, B. (1982). Identifying valid measures of reading. Exceptional Children, 49, 36-45.

Discourse [Computer groupware]. (1990). Milwaukee, WI: Discourse Technologies.

Espin, C. A. (1990). Reading aloud from text as an indicator of achievement in the content areas. Unpublished doctoral dissertation, University of Minnesota, Minneapolis.

Espin, C. A., & Deno, S. L. (1993). Performance in reading from content area text as an indicator of achievement. Remedial and Special Education, 14(6), 47-59.

Espin, C. A., & Foegen, A. (1996). Validity of GOMs for predicting secondary students’ performance on content-area tasks. Exceptional Children, 62, 497-514.

Espin, C. A., & Tindal, G. (1998). Curriculum-based measurement for secondary students. In M. R. Shinn (Ed.), Advanced applications of curriculum-based measurement (pp. 214-273). New York: Guilford.

Foegen, A., & Hargrave, C. P. (1997). Discourse: Groupware for interactive classroom communication [Review of the computer groupware Discourse]. Journal of Computing in Teacher Education, 13(2), 30-32.

Fuchs, L. S., & Deno, S. L. (1991). Paradigmatic distinctions between instructionally relevant measurement models. Exceptional Children, 57, 488-500.

Fuchs, L. S., Deno, S. L., & Marston, D. (1983). Improving the reliability of curriculum-based measures of academic skills for psychoeducational decision making. Diagnostique, 8(3), 135-149.

Fuchs, L. S., & Fuchs, D. (1996). Combining performance assessment and curriculum-based measurement to strengthen instructional planning. Learning Disabilities Research & Practice, 11, 183-192.

Fuchs, L. S., Fuchs, D., Eaton, S., & Karns, K. (1999, April). Test accommodations for students with disabilities: Teacher judgment vs. data-based decisions. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Fuchs, L. S., Fuchs, D., Hamlett, C. L., & Allinder, R. M. (1989). The reliability and validity of skills analysis within curriculum-based measurement. Diagnostique, 14(4), 203-221.

Fuchs, L. S., Fuchs, D., Hamlett, C. L., & Ferguson, C. (1992). Effects of expert system consultation within curriculum-based measurement using a reading maze task. Exceptional Children, 58, 436-450.

Fuchs, L. S., Fuchs, D., Hamlett, C. L., & Stecker, P. M. (1991). Effects of curriculum-based measurement and consultation on teacher planning and student achievement in mathematics operations. American Educational Research Journal, 28, 617-641.

Fuchs, L. S., Fuchs, D., Hamlett, C. L., Thompson, A., Roberts, P. H., Kubek, P., & Stecker, P. M. (1994). Technical features of a mathematics concepts and applications curriculum-based measurement system. Diagnostique, 19(4), 23-49.

Fuchs, L. S., Hamlett, C. L., & Fuchs, D. (1998). Monitoring basic skills progress: Basic math computational manual (2nd ed.). Austin, TX: PRO-ED.

Fuchs, L. S., Karns, K. M., Eaton, S. B., & Hamlett, C. L. (1999, April). Identifying fair, appropriate testing accommodations for students with learning disabilities. Paper presented at the annual meeting of the Council for Exceptional Children, Charlotte, NC.

Gagne, R. M. (1983). Some issues in the psychology of mathematics instruction. Journal for Research in Mathematics Education, 14, 7-18.

Glaser, R. (1987). The integration of instruction and testing: Implications from the study of human cognition. In D. C. Berliner & B. V. Rosenshine (Eds.), Talks to teachers (pp. 329-341). New York: Random House.

Goals 2000: Educate America Act, 20 U.S.C. [sections] 5801 et seq.

Hartocollis, A. (2000, April 27). The new flexible math meets parental rebellion [41 paragraphs]. The New York Times on the Web [On-line]. Available: http://www.nytimes.com/library/national/regional/042700nymath-edu.html

Hasselbring, T. S., Goin, L. I., & Bransford, J. D. (1988). Developing math automaticity in learning handicapped children: The role of computerized drill and practice. Focus on Exceptional Children, 20(6), 1-7.

Hoff, D. J. (1998, January 21). Controversial Goals 2000 to face new uncertainties. Education Week [On-line]. Available: http://www.edweek.org/htbin/fastweb?searchform+view4

Hofmeister, A. M. (1993). Elitism and reform in school mathematics. Remedial and Special Education, 14(6), 8-13.

Jenkins, J. R., & Jewell, M. (1993). Examining the validity of two measures for formative teaching: Reading aloud and maze. Exceptional Children, 59, 421-432.

Jones, M. G., Jones, B. D., Hardin, B., Chapman, L., Yarbrough, T., & Davis, M. (1999). The impact of high-stakes testing on teachers and students in North Carolina. Phi Delta Kappan, 81, 199-203.

Kulm, G. (1994). Mathematics assessment: What works in the classroom. San Francisco: Jossey-Bass.

LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic information processing in reading. Cognitive Psychology, 6, 293-323.

Lawton, M. (1997, August 6). Feds position national tests on fast track. Education Week, pp. 1, 34.

Lesh, R., & Lamon, S. J. (1992). Assessment of authentic performance in school mathematics. Washington, DC: AAAS Press.

Marston, D. B. (1989). A curriculum-based measurement approach to assessing academic performance: What it is and why do it. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 18-78). New York: Guilford.

Math wars. (2000, January 4). Wall Street Journal, p. A22.

Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and evaluation in education and psychology. Fort Worth, TX: Holt, Rinehart and Winston.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

National Council of Teachers of Mathematics. (1991). Professional standards for teaching mathematics. Reston, VA: Author.

National Council of Teachers of Mathematics. (1995). Assessment standards for school mathematics. Reston, VA: Author.

Nolet, V., & Tindal, G. (1994). Instruction and learning in middle school science classes: Implications for students with disabilities. The Journal of Special Education, 28, 166-187.

Resnick, L. B. (1989). Developing mathematical knowledge. American Psychologist, 44, 162-169.

Reys, B. (1992). Estimation. In T. R. Post (Ed.), Teaching mathematics in grades K-8: Research-based methods (pp. 279-301). Boston: Allyn & Bacon.

Reys, R. E. (1986). Evaluating computational estimation. Estimation and mental computation, 1986 yearbook of the National Council of Teachers of Mathematics (pp. 225-238). Reston, VA: National Council of Teachers of Mathematics.

Reys, R. E., Rybolt, J. F., Bestgen, B. J., & Wyatt, J. W. (1982). Processes used by good computational estimators. Journal for Research in Mathematics Education, 13, 183-201.

Robinson, S. L. (1991). Computer-based instruction in special education. In T. M. Schlecter (Ed.), Problems and promises of computer-based training (pp. 39-60). Norwood, NJ: Ablex.

Robinson, S. L. (1994). Classroom technology for the human touch: A mini case study using DISCOURSE™. Curriculum/Technology Quarterly, 4(1), 1-4.

Robinson, S. L., & Deno, S. L. (1992, November). Computer-enhanced inclusion: A model for computer-based instructional management in mainstream classrooms. Grant proposal submitted to the U.S. Department of Education, Office of Special Education and Rehabilitative Services.

Romberg, T. A. (Ed.). (1992). Mathematics assessment and evaluation: Imperatives for mathematics educators. Albany: State University of New York Press.

Sack, J. L. (1998, April 15). Competition heats up for federal education dollars. Education Week [On-line]. Available: http://www.edweek.org/htbin/fastweb?searchform+view4

Salvia, J., & Ysseldyke, J. E. (1991). Assessment (5th ed.). Boston: Houghton Mifflin.

Schoen, H. L., Blume, G., & Hoover, H. D. (1990). Outcomes and processes on estimation test items in different formats. Journal for Research in Mathematics Education, 21, 61-73.

Shinn, M. R. (Ed.). (1989). Curriculum-based measurement: Assessing special children. New York: Guilford.

Shinn, M. R. (Ed.). (1998). Advanced applications of curriculum-based measurement. New York: Guilford.

Shriner, J. G., Kim, D., Thurlow, M. L., & Ysseldyke, J. E. (1992). Experts’ opinions on national math standards for students with disabilities (Technical Report 3). Minneapolis, MN: National Center on Educational Outcomes, University of Minnesota, College of Education.

Shriner, J. G., Kim, D., Thurlow, M. L., & Ysseldyke, J. E. (1993). Experts’ opinions about the appropriateness and feasibility of national math standards (Technical Report 4). Minneapolis, MN: National Center on Educational Outcomes, University of Minnesota, College of Education.

Sowder, J. (1992). Estimation and number sense. In D. A. Grouws (Ed.), Handbook of research on mathematics teaching and learning (pp. 371-389). New York: Macmillan.

Stiggins, R. J. (1999). Assessment, student confidence, and school success. Phi Delta Kappan, 81, 191-198.

Tindal, G. (in press). How will assessments accommodate students with disabilities? In R. W. Lissitz & W. D. Schafer (Eds.), Assessments in educational reform. New York: Allyn & Bacon.

Tindal, G., & Nolet, V. (1995). Curriculum-based measurement in middle and high schools: Critical thinking in content areas. Focus on Exceptional Children, 27(7), 1-22.

Tindal, G., & Parker, R. (1989a). Assessment of written expression for students in compensatory and special education programs. The Journal of Special Education, 23, 169-183.

Tindal, G., & Parker, R. (1989b). Development of written retell as a curriculum-based measure in secondary programs. School Psychology Review, 18, 328-343.

Ysseldyke, J. E., Thurlow, M. L., & Shriner, J. G. (1992). Outcomes are for special educators too. Teaching Exceptional Children, 25(1), 36-50.

Anne Foegen, Iowa State University

Stanley L. Deno, University of Minnesota

Address: Anne Foegen, Department of Curriculum and Instruction, Iowa State University, N162D Lagomarcino Hall, Ames, IA 50011; e-mail: afoegen@iastate.edu

COPYRIGHT 2001 Pro-Ed