Improved interobserver variation after training of doctors in the Neer system: A randomised trial

Brorson, S


We investigated whether training doctors to classify proximal fractures of the humerus according to the Neer system could improve interobserver agreement. Fourteen doctors were randomised to two training sessions, or to no training, and asked to categorise 42 unselected pairs of plain radiographs of fractures of the proximal humerus according to the Neer system. The mean kappa difference between the training and control groups was 0.30 (95% CI 0.10 to 0.50, p = 0.006). In the training group the mean kappa value for interobserver variation improved from 0.27 (95% CI 0.24 to 0.31) to 0.62 (95% CI 0.57 to 0.67). The improvement was particularly notable for specialists in whom kappa increased from 0.30 (95% CI 0.23 to 0.37) to 0.79 (95% CI 0.70 to 0.88). These results suggest that formal training in the Neer system is a prerequisite for its use in clinical practice and research.

Fractures of the proximal humerus account for approximately 4% of all fractures.1,2 The incidence is approximately 70 per 100 000 and will probably increase;3,4 the fractures are age- and osteoporosis-related.3-6

Several systems for classifying them have been proposed in an attempt to improve the description of the fractures and to provide guidelines for treatment and outcome.7-15 The Neer system,16 introduced in 1970 (Fig. 1), has been the most widely used in clinical practice and research. It is based on the displacement and angulation of four anatomical parts of the proximal humerus, namely, the greater tuberosity, the lesser tuberosity, the articular segment and the shaft of the humerus (Fig. 2).

Within the last 15 years several studies have reported poor observer variation for the Neer system.17-25 Lack of consistency in classification is one possible reason for the discrepancies in outcome after treatment of complex fractures of the proximal humerus, especially of three- and four-part fractures.28

Previous studies have reported no improvement in observer variation using the Neer system despite the exclusive availability of radiographs of high quality,19,22 CT and three-dimensional reconstructions,22-25 and a reduction in the number of classification units.19,22 We have been unable to find any randomised trial of the effect of training on the interobserver variation of the Neer system and have therefore undertaken such a trial.

Patients and Methods

Between August 1 and October 15, 1999 we identified 55 patients with a fracture of the proximal humerus who had been discharged from the Bispebjerg University Hospital (Copenhagen). We excluded 14 patients: four children, two patients with pathological fractures, one with pseudarthrosis, one with a healed fracture, three who had been miscoded, and three whose radiographs had been lost. The final study group consisted of 41 patients with a total of 42 pairs of radiographs. There was no selection according to the quality of the radiographs. Plain anteroposterior (AP) and lateral radiographs were available for all the fractures.

More than six months after the last patient had had radiography we invited the orthopaedic medical staff to a teaching session. They had a brief introduction before being given a diagram of the Neer classification system, a written definition of displacement and angulation, a ruler and a goniometer. They were asked to study each pair of radiographs and to choose one of the 16 possibilities on the diagram. They were informed that their observations would be reported anonymously. No communication between observers was allowed during the classification. There was no time limit for making decisions.

They were then randomly allocated to a training group and a control group. The training group was taught the Neer system for 45 minutes. This was based on the original reports of the system.16,29 Two weeks later, the training group received a second teaching session of 45 minutes, before all participating doctors were presented with the same pairs of radiographs, but in a different and unstructured order. None of the authors was an observer.

The classification data were then transferred to a statistical program.

Statistical methods. Within each group of observers the kappa values for interobserver variation were calculated for each possible pair of observers, before calculating the mean kappa value.30
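
The pairwise calculation described above can be sketched as follows. This is a minimal illustration that assumes unweighted Cohen's kappa for each pair of observers; the function names and the ratings are hypothetical and are not the study data or the statistical program used in the trial.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two observers rating the same cases."""
    n = len(a)
    # Observed agreement: proportion of cases given the same category.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two observers' category proportions.
    freq_a, freq_b = Counter(a), Counter(b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both observers use a single category.
    return (p_obs - p_exp) / (1 - p_exp)


def mean_pairwise_kappa(ratings):
    """Mean kappa over every possible pair of observers in one group."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ratings: three observers, eight fractures, illustrative labels.
observers = [
    ["1", "2", "2", "3", "1", "4", "2", "1"],
    ["1", "2", "3", "3", "1", "4", "2", "2"],
    ["1", "1", "2", "3", "1", "4", "2", "1"],
]
print(round(mean_pairwise_kappa(observers), 2))
```

With seven observers in a group there are 21 such pairs, so the mean is taken over 21 kappa values.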

Kappa statistics adjust simple observed agreement for chance agreement. The kappa coefficients range from 1 (perfect agreement) to 0 (chance agreement) and

Before calculating the kappa values the 16 groups of the Neer system were reduced to six as follows: one-part (undisplaced), two-part, three-part and four-part fractures, fracture-dislocations, and articular surface fractures.

We assessed the effect of training on interobserver variation by comparing the change in the mean kappa values between the two groups of observers. The level of statistical significance was assessed by an unpaired t-test applied to the differences in kappa values for each pair of observers. A predefined subgroup analysis was performed to investigate if any change in interobserver variation was related to whether the observers were specialists.
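
The comparison of changes between the two groups can be sketched as an unpaired t-test. The sketch below assumes the equal-variance (pooled) form of the test, which the text does not specify, and uses hypothetical changes in kappa rather than the study data; the function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(x, y):
    """Student's unpaired t statistic with pooled variance; returns (t, df)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled estimate of the common variance (assumes equal variances).
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df


# Hypothetical changes in kappa (second session minus baseline).
training_change = [0.35, 0.42, 0.28, 0.31, 0.40]
control_change = [0.05, 0.02, 0.08, 0.01, 0.06]
t, df = unpaired_t(training_change, control_change)
print(f"t = {t:.2f}, df = {df}")
```

The p value follows from the t distribution with the returned degrees of freedom; a statistics library such as SciPy (scipy.stats.ttest_ind) could be used instead of this hand calculation.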

For all kappa values 95% confidence intervals (CI) were calculated.30 The prevalence of each Neer group was calculated in order to evaluate the practice of classification within the different levels of experience. The level of statistical significance of post-training differences in prevalences was assessed by a sign-rank test. Kappa values were calculated for agreement within each of the six Neer groups.


Results

Nineteen doctors attended the first session, were randomised, and carried out the baseline classification. There were ten in the training group and nine in the control group. In the second session 14 returned, eight specialists in orthopaedic surgery and six non-specialists. The five doctors who dropped out were two non-specialists and one specialist from the training group, and two non-specialists from the control group. The final distribution of observers was four specialists and three non-specialists in each group.

The mean kappa value at baseline was 0.27 (95% CI 0.23 to 0.31) in the training group and 0.28 (95% CI 0.24 to 0.31) in the control group (Table I). Thus, the two groups were initially comparable. The difference in the post-training mean kappa value between the two groups was 0.30 (95% CI 0.10 to 0.50, p = 0.006, df = 12). In the training group the overall mean kappa coefficient for pairwise agreement increased from 0.27 (95% CI 0.24 to 0.31) at baseline to 0.62 (95% CI 0.57 to 0.67) after training (Table I). The increase in agreement after teaching in the training group was approximately equally distributed among all pairs of observers, while agreement between all pairs of observers in the control group remained distributed around baseline kappa (Fig. 3). In the control group the overall mean kappa coefficient for pairwise agreement increased from 0.28 (95% CI 0.24 to 0.31) at baseline to 0.33 (95% CI 0.29 to 0.36) in the second classification session.

After teaching, the mean kappa value in the training group increased from 0.30 (95% CI 0.23 to 0.37) to 0.79 (95% CI 0.70 to 0.88) in the specialist group as compared with 0.26 (95% CI 0.16 to 0.36) to 0.51 (95% CI 0.39 to 0.62) in the non-specialist group.

In the training group the proportion of fractures classified as undisplaced increased from 0.19 to 0.45 (p = 0.016) (Table II). At both sessions the non-specialists tended to classify more fractures as displaced than the specialists. After teaching the mean kappa value for classifying fractures as undisplaced was 0.86 (95% CI 0.73 to 0.98) for the specialists and 0.56 (95% CI 0.39 to 0.74) for the non-specialists.


Discussion

We found that two teaching sessions improved the overall mean kappa value from 0.27 (95% CI 0.23 to 0.31) to 0.62 (95% CI 0.57 to 0.67) for the Neer system. The improvement was particularly noticeable for the specialists in whom kappa increased from 0.30 (95% CI 0.23 to 0.37) to 0.79 (95% CI 0.70 to 0.88).

The sample. One potential problem for interobserver studies is an insufficient number of observers and patients. We were unable to find sample-size formulae for observer-variation studies and therefore we followed the advice of authors of previous articles and included more than 40 patients.32 The number of patients and observers in our study was sufficient to detect a statistically significant rise in the mean kappa value after training.

Our sample of 14 observers consisted of doctors working in the Department of Orthopaedic Surgery. The distribution of specialists and non-specialists reflects a typical Danish university hospital.

Bias. The observer variation at the first classification session was similar in both groups, which were therefore comparable at baseline. The drop-out was 5/19 (26%), but was similar in both groups (three from the training group and two from the control group), and did not alter the distribution between specialists and non-specialists.

The radiographs included reflected the clinical population. The trial was performed more than six months after they were taken and therefore the risk of staff remembering them was small.

For pragmatic reasons we reduced the number of groups in the Neer system from 16 to six before carrying out statistical calculations. Two recent studies of the reduction of the number of classes showed no significant change in the kappa values.19,22

Most interobserver studies on the classification of fractures use settings which differ from clinical situations and this may affect the results. Observers may concentrate particularly when they participate in a study. However, the large number of cases and the lack of clinical consequence in an experimental situation could imply less rigorous attention to detail. We imitated several aspects of the real clinical situation by including a consecutive series of patients, an unselected sample of radiographs, and unselected and unprepared observers exclusively.

Kappa and statistics. Interobserver variation is usually reported as kappa statistics instead of crude observer agreement since the influence of chance agreement is accounted for. However, the level of chance agreement is estimated on the basis of the prevalence of the various categories of observations. This has two important implications for the interpretation of kappa values. First, even with a fixed level of agreement between observers, kappa will be reduced when the prevalence of a category is very high or very low. Secondly, prevalences of categories can differ from one clinical setting to another and crude comparisons of kappa values between different clinical studies may therefore be meaningless.33
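
The dependence of kappa on prevalence can be illustrated with a small worked example. The sketch below uses two hypothetical two-by-two tables for a single category (present or absent), each with 100 cases and the same crude agreement of 0.80; the counts and the function name are illustrative only.

```python
def kappa_2x2(both_yes, only_first, only_second, both_no):
    """Return (observed agreement, kappa) for two observers and one category."""
    n = both_yes + only_first + only_second + both_no
    p_obs = (both_yes + both_no) / n
    # Proportion of cases each observer assigns to the category.
    first_yes = (both_yes + only_first) / n
    second_yes = (both_yes + only_second) / n
    # Chance agreement from the observers' marginal proportions.
    p_exp = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts: category rated present in about half of the cases.
balanced = kappa_2x2(40, 10, 10, 40)
# Hypothetical counts: category rated present in most of the cases.
skewed = kappa_2x2(75, 10, 10, 5)
print(balanced, skewed)
```

Both tables have the same crude agreement, but kappa is about 0.60 for the balanced table and about 0.22 for the skewed one, because chance agreement rises from 0.50 to 0.745 as the prevalence moves away from one half.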

Imaging of fractures. We did not include CT scans or three-dimensional reconstructions in our study because their use is not part of standard daily routine and recent studies have shown no improvement in observer variation by adding such techniques.22-25

Difficulties in translating information about images into classification groups have been suggested as a major reason for the poor interobserver variation which has been reported in previous studies.22,25 The significant effect of training in our study indicates that basic difficulties in translating information cannot fully explain the poor previous results.

Specialists and non-specialists. Uncertain observers are probably more anxious not to overlook a fracture which requires surgery, than to misclassify a fracture which only needs conservative treatment. This could be a reason for the dramatic increase, after teaching, in the proportion of undisplaced fractures and the decrease in displaced fractures.

In the typical clinical situation, the non-specialist will inspect the radiographs and, if in doubt, consult a more experienced colleague. This element of doubt was not incorporated in our study. We found only a moderate interobserver agreement among non-specialists. However, we cannot exclude the possibility that non-specialists, when in doubt and having consulted a specialist, may achieve a level of interobserver agreement similar to that of specialists alone.

The low baseline kappa values reproduced earlier dismal reports for the Neer system.17-25 We found that when the Neer system is used by untrained doctors it is too unreliable for clinical practice and research. However, when specialists were trained in the system the kappa value increased to good or excellent levels. These results suggest that formal training in the Neer system should be a prerequisite for its use both in clinical practice and research.

We thank Thomas Gjorup, Peter Gotzsche and Lene Skovgaard for valuable comments on earlier versions of this paper. S. Brorson was supported by a grant from the Danish Medical Research Council.

No benefits in any form have been received or will be received from a commercial party related directly or indirectly to the subject of this article.


References

1. Buhr AJ, Cooke AM. Fracture patterns. Lancet 1959;1:531-6.

2. Knowelden J, Buhr AJ, Dunbar O. Incidence of fractures in persons over 35 years of age. Brit J Prev Soc Med 1964;18:130-41.

3. Kiar T, Larsen CF, Blicher J. Proximal fractures of the humerus: an epidemiological and descriptive investigation of proximal fractures of the humerus. Ugeskr Laeger 1986;148:1894-7.

4. Lind T, Kroner K, Jensen J. The epidemiology of fractures of the proximal humerus. Arch Orthop Trauma Surg 1989;108:285-7.

5. Horak J, Nilsson BE. Epidemiology of fracture of the upper end of the humerus. Clin Orthop 1975;112:250-3.

6. Bengner U, Johnell O, Redlund-Johnell I. Changes in incidence of fracture of the upper end of the humerus during a 30-year period: a study of 2125 fractures. Clin Orthop 1988;231:179-82.

7. Malgaigne JF. Traité des fractures et des luxations. Paris: Malgaigne, 1847:513-31.

8. Kocher T. Beiträge zur Kenntniss einiger praktisch wichtiger Fracturformen. Basel: Salman, 1896:7-91.

9. Matti H. Die Knochenbrüche und ihre Behandlung. Berlin: Springer, 1931:544-69.

10. Dehne E. Grundsätzliches über die Brüche des Oberarmkopfes. Archiv für Orthopädische und Unfall-Chirurgie 1939;39:435-64.

11. Watson-Jones R. Fractures and joint injuries. 3rd ed. Vol. II. Edinburgh: Livingstone, 1944.

12. Knight RA, Mayne JA. Comminuted fractures and fracture-dislocations involving the articular surface of the humeral head. J Bone Joint Surg [Am] 1957;39-A:1343-55.

13. Drapanas T, McDonald J, Hale HW. A rational approach to classification and treatment of fractures of the surgical neck of the humerus. Am J Surg 1960;99:617-24.

14. Duparc J, Largier A. Les luxations-fractures de l’extrémité supérieure de l’humérus. Rev Chir Orth 1976;62:91-110.

15. Müller ME, Nazarian S, Koch P, Schatzker J. The comprehensive classification of fractures of long bones. Berlin: Springer, 1990:54-64.

16. Neer CS. Displaced proximal humeral fractures. Part I. Classification and evaluation. J Bone Joint Surg [Am] 1970;52-A:1077-89.

17. Ackermann C, Lam Q, Linder P, Kull C, Regazzoni P. Problems in classification of fractures of the proximal humerus. Z Unfallchir Versicherungsmed Berufskr 1986;79:209-15.

18. Kristiansen B, Andersen ULS, Olsen CA, Varmarken JE. The Neer classification of fractures of the proximal humerus: an assessment of interobserver variation. Skeletal Radiol 1988;17:420-2.

19. Sidor ML, Zuckerman JD, Lyon T, et al. The Neer classification system for proximal humeral fractures: an assessment of interobserver reliability and intraobserver reproducibility. J Bone Joint Surg [Am] 1993;75-A:1745-50.

20. Siebenrock KA, Gerber C. The reproducibility of classification of fractures of the proximal end of the humerus. J Bone Joint Surg [Am] 1993;75-A:1751-5.

21. Brien H, Notfall F, MacMaster S, et al. Neer’s classification system: a critical appraisal. J Trauma 1995;38:257-60.

22. Bernstein J, Adler LM, Blank JE, et al. Evaluation of the Neer system of classification of proximal humeral fractures with computerized tomographic scans and plain radiographs. J Bone Joint Surg [Am] 1996;78-A:1371-5.

23. Sjoden GO, Movin T, Guntner P, et al. Poor reproducibility of classification of proximal humeral fractures: additional CT of minor value. Acta Orthop Scand 1997;68:239-42.

24. Sallay PI, Pedowitz RA, Mallon WJ, et al. Reliability and reproducibility of radiographic interpretation of proximal humeral fracture pathoanatomy. J Shoulder Elbow Surg 1997;6:60-9.

25. Sjoden GO, Movin T, Aspelin P, Guntner P, Shalabi A. 3D-radiographic analysis does not improve the Neer and AO classifications of proximal humeral fractures. Acta Orthop Scand 1999;70:325-8.

26. Burstein AH. Fracture classification systems: do they work and are they useful? J Bone Joint Surg [Am] 1993;75-A:1743-4.

27. Cowell HR. Patient care and scientific freedom. J Bone Joint Surg [Am] 1994;76-A:640-1.

28. Rees J, Hicks J, Ribbans W. Assessment and management of three- and four-part proximal humeral fractures. Clin Orthop 1998;353:18-29.

29. Neer CS. Displaced proximal humeral fractures. Part II. Treatment of three-part and four-part displacement. J Bone Joint Surg [Am] 1970;52-A:1090-1103.

30. Svanholm H, Starklint H, Gundersen HJC, et al. Reproducibility of histomorphologic diagnosis with special reference to the kappa statistic. APMIS 1989;97:689-98.

31. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.

32. Gjorup T. Reliability of diagnostic tests. Acta Obstet Gynecol Scand Suppl 1997;166:9-14.

33. Gjorup T. The kappa coefficient and the prevalence of a diagnosis. Meth Inform Med 1988;27:184-6.

S. Brorson, J. Bagger, A. Sylvest, A. Hobjartsson

From Bispebjerg University Hospital, Copenhagen, Denmark

J Bone Joint Surg [Br] 2002;84-B:950-4.

Received 14 November 2001; Accepted after revision 3 April 2002

S. Brorson, MD, Research Fellow

A. Hobjartsson, MD, Research Fellow

Department of Medical Philosophy and Clinical Theory, University of Copenhagen, Panum Institute, Blegdamsvej 3, DK-2200 Copenhagen NV, Denmark.

J. Bagger, MD, Consultant Orthopaedic Surgeon

A. Sylvest, MD, Consultant Orthopaedic Surgeon

Department of Orthopaedic Surgery, Bispebjerg University Hospital, DK-2400 Copenhagen NV, Denmark.

Correspondence should be sent to Dr S. Brorson.

Copyright British Editorial Society of Bone & Joint Surgery Sep 2002
