Assessing Students In A Graduate Tests And Measurement Course: Changing The Classroom Climate

Beverly M. Klecker

Four classroom assessment techniques were used in a graduate-level course in tests and measurements as an initial exploration of the relationships between classroom climate and classroom assessment. The subjects in this action research study were forty-three students in two class sections. The content of the course was divided into four areas: (1) legal and ethical issues of assessment; (2) using statistics to describe assessment data; (3) reliability and validity; and (4) types and uses of standardized tests. Classroom instruction included lecture/media presentations followed by content-reinforcing group activities. The criterion-referenced grades for the course were based on mastery of core concepts assessed by (1) individual multiple-choice tests; (2) collaborative group multiple-choice tests; (3) take-home open-response tests; and (4) take-home multiple-choice tests. At the end of the course, the students rated the four assessment events and provided written evaluations. Twenty-nine (67.4%) of the students preferred the group multiple-choice format; eleven (25.6%) preferred the take-home, open-response test; two (4.6%) preferred the standard-format, individual multiple-choice test; and one student (2.4%) preferred the take-home multiple-choice test. Student ratings of fairness of assessment in the course were 4.9 and 4.7 for the two classes.

“I hate tests … I am no good at taking tests … I can’t do math … I’ll never understand this … This will be my one ‘C’ in graduate school … I have saved this course until last.” These comments were frequently heard as graduate students entered the introductory class of “Tests and Measurements.” Typically, these graduate students have had no previous experience with tests and measurements other than taking classroom tests and the requisite admission tests (e.g., Graduate Record Examination).

Classroom assessment in postsecondary education provides feedback to students and information to professors about student achievement and course objectives. The main goal of classroom assessment is to obtain valid, reliable, meaningful, and appropriate information about student learning (Brookhart, 1998; Linn & Gronlund, 1995; Stiggins & Bridgeford, 1985). The relationship between classroom climate, classroom assessment, and student learning has been widely researched (Brookhart, 1997; Porter & Freeman, 1986; Stiggins & Conklin, 1992; Tittle, 1994). These studies have found that assessment that measured clearly identified learning targets and was fairly scored reduced student anxiety, relaxed the classroom climate, and led to increased student learning. However, the context for most of this research has been elementary and middle schools.

Theoretical Framework

Empirical studies of postsecondary classroom assessment have found that students prefer criterion-referenced grading to norm-referenced grading. Students would rather have their work compared with a standard of quality than with the work of their fellow students (Jacobsen, 1993). O’Sullivan and Johnson (1993) found that students who participated in an educational measurement class where grading was performance-based increased their learning. Stearns (1996) found that collaborative exams assisted learning and improved classroom climate. Rodabaugh and Kravitz (1994) reported that professors who had fair assessment and grading policies were rated more highly by students than professors who had unfair assessment practices. This was true even if the latter gave higher grades.

Purpose of the Study

The purpose of this study was to extend the research on the relationship between classroom assessment and classroom climate in the postsecondary context. The questions that drove the inquiry were: (1) “Which of four assessment strategies will students prefer?” and (2) “Will fair testing practices and criterion-referenced grading have an effect on classroom climate?” The four strategies were: (1) individual multiple-choice test; (2) collaborative group multiple-choice test; (3) take-home open-response test; and (4) take-home multiple-choice test.


The two questions were explored through a descriptive research study. There were four assessment conditions. There was no control group and no manipulated variable. All students received the same instruction through the same method of delivery. All students took the same assessments the same week of the semester.


Forty-three graduate students enrolled in two sections of “Tests and Measurements” in a mid-sized, mid-south, regional state university were the participants in the study. Twenty-five students were enrolled in Class 1 (Monday); eighteen students were enrolled in Class 2 (Wednesday). Each class met three hours once a week for sixteen weeks.

Course Content

The objectives for the course were outlined in the syllabus distributed to the students during the first class. Course content was presented in outline form with corresponding assigned textbook chapters. The semester was divided into four units of instruction: (1) legal and ethical issues of assessment; (2) using statistics to describe assessment data; (3) reliability and validity; and (4) types and uses of standardized tests.

Content Delivery

The content was delivered through assigned readings (completed prior to attending class) and lecture/media presentations (chalk, overheads, PowerPoint, video, and Internet), followed by group work with material designed to require participants to apply their knowledge of the content presented.

Group Membership

Students formed their own groups on the first night of class. Group size was limited to four or five. Shaw (1981) suggested that the larger the group, the less time the individual has to participate. The students stayed in the same group for the whole semester (except for assessment).

The Relation of Assessment to Grades

The grades in the course were based on mastery of the content and were criterion-referenced. Because the course was not graded pass-fail, the tests were designed to spread student scores. The four assessments were equally weighted with 60 points possible on each. The student’s raw score on each assessment was the numerator with the highest student grade as the denominator (cf. Sax, 1997, p. 567). Ninety percent to 100% was an “A,” 80% to 89% a “B,” 70% to 79% a “C,” 60% to 69% a “D,” and below 59% was failing.
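The grading arithmetic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the instructor's actual procedure: the function names and the sample scores are hypothetical, the letter "F" stands in for "failing," and treating every percentage under 60 as failing is an assumption about where the lowest cutoff falls.

```python
def percent_of_top(raw_score, top_score):
    """Express a raw score as a percentage of the highest score in the class."""
    if top_score <= 0:
        raise ValueError("top_score must be positive")
    return 100.0 * raw_score / top_score


def letter_grade(percent):
    """Map a percentage to a letter using the cutoffs given in the text."""
    # Assumption: anything under 60% is failing, recorded here as "F".
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return letter
    return "F"


# Hypothetical raw scores on one 60-point assessment.
raw_scores = {"student_a": 57, "student_b": 49, "student_c": 41}
top = max(raw_scores.values())
grades = {
    name: letter_grade(percent_of_top(score, top))
    for name, score in raw_scores.items()
}
```

Because the denominator is the highest score earned in the class rather than the 60 points possible, the top scorer always receives 100% under this scheme.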

The Four Assessments

All tests were power tests with ascending item difficulty. Each test was constructed from the content of its unit and was worth 60 points. The fourth assessment was given during finals week and contained core concepts that were woven throughout the course. Most of the course content was new to the students; hence, they were required to learn a new vocabulary. Each of the tests measured knowledge, comprehension, and application levels of Bloom’s (1964) cognitive taxonomy.

Assessment one: Individual 60-item multiple-choice test.

Each student took a 60-item multiple choice test (with four alternatives). The course content assessed with this first test was “ethical and legal issues of assessment.” Students marked responses on scantron sheets. It took approximately 75 minutes for all students to complete the test. The instructor then scanned the answer sheets (with item analysis) while the students took a “break.” The test was reviewed after the break.

Assessment two: Collaborative group, individual response, 60-item multiple-choice test.

Individual students took this test in randomly-assigned test groups of four or five students. The test contained 60 multiple-choice items (with four alternatives). The content measured by the test was, “using statistics to describe assessment data.” Students within (and between) groups discussed questions, calculations, and answers before marking their individual scantron sheets. Instructions were given: “You may discuss the questions within groups and between groups. You may ask anyone except the professor, but you are responsible for the answer that you record on your individual scantron sheet.” Tests were scanned during the break (with item analysis) and were reviewed the final 45 minutes of class.

Assessment three: Take-home, open-response test.

The content of assessment three was, “reliability and validity.” Before the students took the test home, the concept of a “scoring guide” or “rubric” was described. (Modeling this type of assessment and scoring had added meaning for the graduate students in this state currently using statewide open-response accountability testing in a reform environment.) Class was not held during the week of this take-home assessment. Students were encouraged to use any resources they could to complete the questions. Some of the questions (i.e., knowledge and comprehension cognitive level) could be answered from the textbook. The application-level questions required some thought. As the tests were returned, the professor-designed scoring guide was shared with the students. Tests were scored by the professor and returned the following week of class.

Assessment four: Take-home 60-item multiple-choice test.

Students were given the exams and scantron sheets the week before final exams and were permitted to take them home. They were free to use the textbook and any other resources available to them to answer the questions. The students returned on the night of the scheduled final exam; the answer sheets were scanned, an item analysis was performed, and the results were discussed.

Data Sources

Data sources for the study were (1) written evaluations from students, (2) the instructor’s observations, and (3) a question from the IDEA form, “The faculty member teaching this course evaluated my performance systematically and accurately” (Likert-type item scale, 1=strongly disagree to 5=strongly agree).

Results and Discussion

Written Evaluations from Students

All students returned the evaluation forms asking them to choose their favorite form of assessment (N=43). Twenty-nine (67.4%) of the students preferred the group multiple-choice format; eleven (25.6%) preferred the take-home, open-response test; two (4.6%) preferred the standard-format, individual multiple-choice test, and one student (2.4%) preferred the take-home multiple-choice test. These results reflect the findings of Stearns (1996).
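As a quick check on the arithmetic, the preference counts can be converted to percentages of the 43 respondents. The sketch below is illustrative only; the dictionary labels and the function name are hypothetical, and the counts are taken from the paragraph above. Rounding conventions can shift the last digit, so recomputed values for the smallest categories may not match the reported figures exactly.

```python
# Counts of students choosing each format as their favorite (from the text).
preferences = {
    "group multiple-choice": 29,
    "take-home open-response": 11,
    "individual multiple-choice": 2,
    "take-home multiple-choice": 1,
}


def preference_shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


shares = preference_shares(preferences)
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```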

Instructor’s Observations

In the review of assessment #one, three of the 60 multiple-choice items were disputed by students. One of the items, revealed by the item analysis to have been missed by all but one member of the class, was discussed at length and was discarded. Two of the items were retained; the difference between “bad” items and difficult items was discussed.
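The scanner's item analysis is not described in detail. As a rough sketch of the kind of statistic such an analysis usually reports, the following Python computes item difficulty (the proportion of students answering each item correctly) and flags very hard items for review. The answer key, the student responses, the function names, and the 0.2 threshold are all hypothetical.

```python
def item_difficulty(responses, key):
    """Proportion of students answering each item correctly.

    responses: one answer string per student, e.g. "ABCD"
    key: the string of correct answers, same length as each response
    """
    n_students = len(responses)
    return [
        sum(1 for answers in responses if answers[i] == correct) / n_students
        for i, correct in enumerate(key)
    ]


def flag_hard_items(difficulties, threshold=0.2):
    """Return 1-based numbers of items answered correctly by fewer than
    `threshold` of the students (an assumed cutoff for review)."""
    return [i + 1 for i, p in enumerate(difficulties) if p < threshold]


# Hypothetical four-item test taken by six students.
key = "ABCD"
responses = ["ABCD", "ABCA", "ABDA", "BBCA", "ABCA", "ABCB"]

difficulties = item_difficulty(responses, key)
to_review = flag_hard_items(difficulties)
```

A flag only marks an item for discussion. As in the review described above, deciding whether a flagged item is flawed or merely difficult remains a judgment call.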

During assessment #two, the collaborative multiple-choice test, much enthusiastic discussion of items took place within and among the groups. There was very little “consulting” between groups. There was much earnest discussion within groups. By randomly assigning students to groups for this assessment, the practice of dividing up the sections of the content for studying was avoided. Students did not know who would be in their assessment group until they came to class. Much learning was observed by the instructor during this assessment. It was a “high energy” event.

For assessment #three, the take-home, open-response test, students continued to collaborate. Some groups met to work on the test together; others consulted over the telephone. There was little disagreement with the scoring rubric. There was, however, a lively discussion on the relationship between “objectivity of scoring” and reliability (part of the content for this assessment).

The take-home, multiple-choice assessment was by far the least favorite (scores on this test were the lowest). The students had the most trouble with the application level questions. The open-book approach worked well for lower-level questions. This assessment just seemed to be an all around bad idea!

IDEA Question: “The Faculty Member Evaluated My Performance Systematically and Accurately”

The mean ratings on the five-point Likert-type rating scales (1=strongly disagree to 5=strongly agree) for the two classes were 4.9 and 4.7 respectively (all students responding). Clearly, the students thought the assessments in this tests and measurements course were systematic and accurate (but not perfect). These ratings reflect findings of Jacobsen (1993) and O’Sullivan and Johnson (1993).


The instructor learned a lot from this exploratory, descriptive study. Using the collaborative assessment for the course content that the students find most difficult, the statistical chapter, worked well. The students’ discussions helped clarify the questions, and they consulted on the answers. Scores were high on this test. Modeling objectivity of scoring for the reliability content of the course taught this in a way that lecture could not. Students came to a rapid realization that there could be more than one good answer to open-response questions. Students also came to realize the difference between unfair (and invalid) questions and difficult questions. The instructor’s favorite question required “knowledge” of four concepts before it could be answered (application level). The take-home, multiple-choice, final examination was a disaster! After this study, a group project of “Identifying an assessment in your field, checking the Mental Measurements Yearbook for information about the assessment, and making a presentation to the class,” was added to tap into students’ abilities to analyze, synthesize, and evaluate materials that directly related to their field of study.


Brookhart, S. M. (1998, October). Classroom assessment in postsecondary education. Paper presented at the annual meeting of the Mid-Western Educational Research Association, Chicago, IL.

Brookhart, S. M. (1997). A theoretical framework for the role of classroom assessment in motivating student effort and achievement. Applied Measurement in Education, 10 (2), 161-180.

Jacobsen, R. H. (1993). What is good testing? Perceptions of college students. College Teaching, 41, 153-156.

Linn, R. L., & Gronlund, N. E. (1995). Measurement and assessment in teaching (7th ed.). Columbus, OH: Merrill.

O’Sullivan, R. G., & Johnson, R. L. (1993). Using performance assessments to measure teachers’ competence in classroom assessment. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA. (ERIC Document Reproduction Services No. ED 358 156)

Porter, A. C., & Freeman, D. J. (1986). Professional orientations: An essential domain for teacher testing. Journal of Negro Education, 55, 284-292.

Rodabaugh, R. C., & Kravitz, D. A. (1994). Effects of procedural fairness on student judgments of professors. Journal on Excellence in College Teaching, 5, 67-83.

Sax, G. (1997). Principles of educational and psychological measurement and evaluation (4th ed.). Belmont, CA: Wadsworth Publishing Company.

Shaw, M. E. (1981). Group dynamics: The psychology of small group behavior (3rd ed.). New York: McGraw-Hill.

Stearns, S. A. (1996). Collaborative exams as learning tools. College Teaching, 44, 111-112.

Stiggins, R. J., & Conklin, N. F. (1992). In teachers’ hands: Investigating the practices of classroom assessment. Albany, NY: SUNY Press.

Stiggins, R. J., & Bridgeford, N. J. (1985). The ecology of classroom assessment. Journal of Educational Measurement, 22, 271-286.

Tittle, C. K. (1994). Toward an educational psychology of assessment for teaching and learning: Theories, contexts, and validation arguments. Educational Psychologist, 29, 149-162.

BEVERLY M. KLECKER Eastern Kentucky University

COPYRIGHT 2000 Project Innovation (Alabama)
