Effects of Free and Forced Retrieval Instructions on False Recall and Recognition
STUART J. McKELVIE
ABSTRACT. One hundred undergraduates heard 6 lists of 14 words that were each associated with 1 of 6 central concepts not on the lists (the DRMRS procedure). The participants were instructed to recall as many words as possible (free retrieval) or to fill all 14 spaces (forced retrieval) and were subsequently given a recognition test. False recall and recognition of the critical central concepts were higher with forced than with free retrieval instructions, but correct recall and recognition were not affected. Confidence was lower for false than for correct recall and recognition. Confidence was also lower with forced than with free retrieval instructions for false recall but not for false recognition. The DRMRS procedure easily elicited false memories, but confidence judgments helped more in detecting them in recall than in recognition. Theoretical and applied implications are discussed.
Key words: false recall, false recognition, retrieval instructions
OVER THE PAST 5 YEARS, there has been an upsurge of research using what Bruce and Winograd (1998) called the DRM (Deese-Roediger-McDermott) procedure, referred to here as the DRMRS (Deese-Roediger-McDermott-Read-Solso or “drummers”) procedure. People hear lists of 12 to 15 words associated with a central concept that is not itself presented (e.g., “hard,” “light,” “pillow,” “plush,” all associated with “soft”; Roediger & McDermott, 1995). Shortly after each list is given, memory is examined with a recall test, a recognition test, or both. For recall, participants write down as many items as they can. If a recalled item was on the studied list, it is scored as correct. If a recalled item was not on the list and is the critical non-studied central concept (hereafter referred to as the critical concept) for that list, it is scored as a false recall. If it is not on the list and is not the critical concept, it is scored as an intrusion. For recognition, participants are presented with a mixture of studied items, critical concepts, and non-studied items that are unrelated to the lists, and they judge each one as familiar (“yes,” it was on the list) or as not familiar (“no,” it was not on the list). Positive (yes) responses are scored as hits for studied items, as false recognitions for critical concepts, and as false alarms for unrelated items.
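The scoring rules just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the scoring software of any cited study: the function names are hypothetical, matching is assumed to be case-insensitive, and repeated words are assumed to be counted once.

```python
from collections import Counter

def score_recall(recalled, studied, critical):
    """Count correct recalls, false recalls, and intrusions for one list.

    Assumptions (not from the article): matching ignores case and a word
    written more than once is scored only once.
    """
    studied_set = {word.lower() for word in studied}
    critical = critical.lower()
    counts = Counter()
    # dict.fromkeys keeps first occurrences in order and drops repeats
    for word in dict.fromkeys(w.lower() for w in recalled):
        if word in studied_set:
            counts["correct"] += 1        # item was on the studied list
        elif word == critical:
            counts["false_recall"] += 1   # non-studied critical concept
        else:
            counts["intrusion"] += 1      # any other non-studied word
    return counts

def score_recognition(item_type, said_yes):
    """Label a 'yes' response by item type; 'no' responses return None."""
    if not said_yes:
        return None
    labels = {
        "studied": "hit",
        "critical": "false_recognition",
        "unrelated": "false_alarm",
    }
    return labels[item_type]

# Example with the four "soft" associates quoted above; "table" is a
# hypothetical unrelated word added only to show an intrusion.
soft_list = ["hard", "light", "pillow", "plush"]
example = score_recall(["pillow", "soft", "hard", "table"], soft_list, "soft")
```

Dividing each count by the number of opportunities (studied words for correct recall, one critical concept per list for false recall) would give proportions of the kind reported in this literature.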
Recall accuracy for studied items (correct) ranges from .44 (Payne, Elie, Blackwell, & Neuschatz, 1996) to .75 (Newstead & Newstead, 1998), and intrusions of unrelated items that did not appear on lists are rare (.24 or less, depending on how they are estimated). The finding of most interest is that critical concepts are falsely recalled more often than unrelated intrusions, at rates ranging from .24 (Robinson & Roediger, 1997) to .71 (Read, 1996). In addition, false recognition of critical concepts ranges from .44 (Norman & Schacter, 1997) to .93 (Tussing & Greene, 1997), which is often as high as correct recognition (hits), which ranges from .58 (Brainerd & Reyna, 1998) to .98 (Payne et al., 1996), and is always higher than false alarms to unrelated non-studied items (usually less than .30). Subjectively, participants report being fairly confident when the critical concepts are falsely recalled (Bredart, 2000; McKelvie, 1999; Payne et al., 1996; Read, 1996; Toglia, Neuschatz, & Goodwin, 1999) or falsely recognized (Mather, Henkel, & Johnson, 1997; Roediger & McDermott, 1995). Confidence ratings are usually closer to those for correct responses than to those for unrelated intrusions in recall or to false alarms in recognition.
Reasons for the Popularity of the DRMRS Procedure
Bruce and Winograd (1998) attribute the frequent use of the DRMRS procedure to the Zeitgeist: Traditional memory research emphasized accuracy, but the reliable occurrence of this kind of error was perceived as potentially relevant to the social controversy over false/recovered childhood memories of sexual abuse. If people could be induced to confidently recall or recognize words that had not been presented on the lists, then could they also be induced to mistakenly “remember” previous events that did not actually happen to them? The parallel is intriguing, but most DRMRS investigators who mention the issue of sexual abuse are understandably cautious (e.g., Gallo, Roberts, & Seamon, 1997; Platt, Lacey, Iobst, & Finkelman, 1998; Read, 1996; Roediger & McDermott, 1996) or even dismissive (Freyd & Gleaves, 1996) about such a direct link between laboratory and life. Students remembering lists of words recently heard in the laboratory are hardly comparable to people remembering childhood abuse.
Nevertheless, as Mook (1983) has eloquently argued, it may be theoretical principles rather than specific results that have external validity, and if behavior is affected by independent variables in the laboratory, then it may be similarly affected if these variables occur outside the laboratory (see also Newstead & Newstead, 1998; Payne et al., 1996). In fact, Platt et al. (1998) found that people who were less accurate in remembering their circumstances when they heard the verdict in the O.J. Simpson trial were more likely to falsely recall critical nonstudied concepts. Of even greater interest here are the findings of a study of four groups of women (Clancy, Schacter, McNally, & Pitman, 2000), which showed that false recognition of critical concepts was higher for those who had recovered memories of sexual abuse than for (a) women who believed themselves to have been abused but did not recall it, (b) women who were abused and never forgot, and (c) women who did not report abuse. These studies suggest that there may be common processes determining false memory in the DRMRS procedure and in real life. If so, it is important to discover what these processes are. The laboratory setting may provide some results that can be generalized, but the major virtue of the laboratory setting is that it permits experimental control over variables so that the important ones can be identified and theories can be tested.
Theoretical Accounts of Findings With the DRMRS Procedure
In his original paper, Deese (1959) observed that a critical concept was more likely to be falsely recalled if it was a higher frequency associate of the words on the list. Each set of connected words was also presented together in a block, a procedure that has been shown to enhance correct recall and the clustering of related items in recall compared with random presentation of different sets of connected words (Klatzky, 1975, p. 186). According to Bousfield (cited in Klatzky, 1975, p. 186), the common associate is evoked by the items on the list and becomes the cue around which recall is organized.
In the seminal paper that sparked most of the current interest in the DRMRS procedure, Roediger and McDermott (1995) suggested that false memory may occur because of factors during presentation, test, or both. While encoding a list, people may think of the non-studied central concept, or it may be activated unconsciously via spreading activation in an associative memory network. Later, the critical concept is erroneously recalled or recognized as having been on the list, reflecting a source-monitoring confusion between the internal inference and the external event (Lampinen, Neuschatz, & Payne, 1999; Winograd, Peluso, & Glover, 1998). This encoding account is consistent with Bousfield’s explanation of clustering and with the more recent fuzzy trace theory (Brainerd & Reyna, 1998), according to which items are represented in memory in two ways: (a) verbatim, which contains details about each item; and (b) gist, which contains general semantic content (in this case, the critical concept). Subsequent recall and recognition will be accurate if based on verbatim information but false if based on gist information. Also relevant is theorizing about prototype formation, in which an abstract representation is formed from the central tendency of a number of exemplars (Solso, Heck, & Means, 1993). If the non-studied prototype (the critical concept in the DRMRS task) is later presented, it may be recognized. This kind of pseudomemory has been demonstrated with a variety of materials, such as numbers, dot patterns, colored figures, and faces (Solso, 1995, pp. 109-115). 
In support of encoding-based accounts, a number of independent variables manipulated during presentation have been shown to affect the rate of false recall and recognition: for example, encoding strategies (Gallo, Roberts, & Seamon, 1997; Newstead & Newstead, 1998; Read, 1996; Tussing & Greene, 1997), number of list items associated with the central concept (Robinson & Roediger, 1997), and blocked versus random presentation of lists (Mather, Henkel, & Johnson, 1997; McDermott, 1996).
As noted earlier, Roediger and McDermott (1995) also theorized that false memory in the DRMRS procedure may be caused by processes that occur during testing. Because they found that the critical concepts appeared later in the list of recalled items, Roediger and McDermott (1995) proposed that false recall may be attributable to repeated retrieval attempts: Initially recalled items tend to be correct and, as participants continue to think about what they remember, the recalled items trigger the central concept (see also Mather, Henkel, & Johnson, 1997; McDermott, 1996). Similarly, studied items presented on the recognition test may prime the non-studied critical concepts.
The most radical testing account is that false recall and false recognition do not reflect a memory experience, but rather only a low response criterion (Miller & Wolford, 1999). That is, participants adopt a more relaxed standard for reporting critical concepts as memories than they do for studied list items. However, this position has been strongly challenged (Roediger & McDermott, 1999; Wickens & Hirshman, 2000; Wixted & Stretch, 2000). In particular, Payne et al. (1996) reported two results that weaken the criterion shift account and strengthen the repeated retrieval account of false recall. Payne et al. tested participants three times under one of two sets of instructions: recall as much as possible (free retrieval) or match the number of attempts to the number of studied items (forced retrieval). They assumed that the response criterion could vary with free retrieval instructions but was fixed with forced retrieval instructions. Consequently, if false recall was attributable to a relaxed response criterion, it would increase only over trials with free retrieval. In fact, false recall increased over trials in both conditions, contrary to the criterion shift account. False recall was also greater under forced retrieval instructions than under free retrieval instructions. Because participants made more retrieval attempts when they followed forced retrieval than free retrieval instructions, this provides direct support for Roediger and McDermott’s (1995) repeated retrieval account of false recall. The effect of retrieval instructions has been replicated in one experiment, but not in another (McKelvie, 1999). One reason for the negative result may be that, although participants in the forced retrieval condition made more attempts than those in the free retrieval condition (.89 vs. .78 of the total number possible), they did not match them perfectly to the number of presented items (.89 being less than 1.00).
Manipulation of retrieval instructions permits an evaluation of the theory that false memory is attributable to processes during testing, but it may also be relevant to the recovered memory controversy, because clinicians and interviewers sometimes use techniques that encourage people to “tell more” (Payne et al., 1996) or to go beyond explicit remembering (Brainerd & Reyna, 1998). This process could place a low premium on accuracy, which, according to Roediger and McDermott (1995), might lead to false memories. Read (1996) has also observed that therapeutic memory recovery techniques that encourage clients to imagine and rehearse abuse may lower their criterion for accepting these events as memories. Whether this theoretical interpretation is correct or not, the suggestion is consistent with the potential influence of directed retrieval on false memory. Because the effect of free versus forced instructions has both theoretical and applied significance, it was investigated further in the present experiment. Following previous studies (McKelvie, 1999; Payne et al., 1996), I predicted that false recall would be higher with forced retrieval instructions than with free retrieval instructions, but that correct recall would remain at a similar level.
Subjective Experience in the DRMRS Procedure
Theoretical accounts of false memory in the DRMRS task have been evaluated not only on the basis of objective performance but also on the basis of participants’ phenomenological experience. For example, it has been repeatedly found that both accurate and false recognition are more often accompanied by judgments that items are “remembered” (there is a vivid memory of the actual presentation) rather than simply “known” (the person is sure that the item was presented but does not have a vivid memory; e.g., Payne et al., 1996; Roediger & McDermott, 1995; Winograd et al., 1997). This finding suggests that participants may have consciously thought of the critical concept during encoding (Roediger & McDermott, 1995).
Phenomenological experience has also been examined with confidence judgments, which are particularly interesting from the applied perspective, because they may help to discriminate incorrect from correct responses. Outside the laboratory, we often cannot verify the accuracy of memory, because the original event was not manipulated. However, if confidence judgments vary with accuracy in the laboratory setting, they may be helpful as an aid to judging the veracity of recollection in other contexts. More specifically, if laboratory confidence is lower for incorrect than for correct responses, then we may doubt a real-life recollection if the person reports that he or she is less than sure about it. Unfortunately, and contrary to the popular belief that there is a positive relationship between confidence and accuracy, studies (usually of face memory) have shown mixed results (Busey, Tunnicliff, Loftus, & Loftus, 2000), although the relationship is stronger when accuracy is good (the optimality hypothesis; Deffenbacher, 1980; McKelvie, 1993).
In the case of false recall in the DRMRS procedure, confidence ratings have been (a) high: about 4 out of 5 = extremely confident or sure (Bredart, 2000; Read, 1996); 2.5 out of 3 = highly confident (Toglia et al., 1999); or (b) moderate: about 3 out of 5 = reasonably confident, 5 out of 5 = certain (McKelvie, 1999); but (c) reliably lower than for correct recall (e.g., 4.7 out of 5; McKelvie, 1999; see also Payne et al., 1996), indicating that they have potential for discriminating false from correct responses. However, it is not as clear whether false recall confidence ratings are higher than or similar to those for unrelated intrusions; both results have occurred (McKelvie, 1999; Toglia et al., 1999). Confidence data may also help in understanding the effects of forcing people to match the number of attempts to the number of presented items. If, on the one hand, such instructions place a low premium on accuracy (Roediger & McDermott, 1995) or lower the response criterion (Read, 1996), then confidence in false recall may be lower than in correct recall. In fact, with forced retrieval instructions, confidence ratings have systematically declined from correct (almost certain = 5), to false (reasonably confident = 3), to intrusion responses (just above guessing = 1; McKelvie, 1999). These findings indicate that confidence ratings may be useful for judging whether to accept responses in this condition. On the other hand, the effect of retrieval instructions on false recall confidence itself is not clear: In one experiment, false recall confidence ratings were higher when people wrote down as much as possible (free retrieval instructions) than when they maximized their attempts (forced retrieval instructions), and in another experiment there was no significant difference (McKelvie, 1999).
The second purpose of the present experiment was to further investigate confidence in correct recall, false recall, and other intrusions, and to reexamine the effect of retrieval instructions on confidence. Because some participants in McKelvie’s (1999) forced retrieval condition reported anecdotally that they sometimes knew that their responses were wrong, the confidence scale was extended to permit that judgment. In addition, the label “guessing” was altered to “no confidence” that the word was on the list. The term “guess” seems more appropriate when a choice is made between supplied alternatives, as in a recognition test, rather than when alternatives are generated by the participant. The modified scale might provide a more sensitive test of whether responses under free and forced false retrieval instructions can be discriminated with confidence judgments.
As noted earlier, critical concepts are falsely recalled, but they are also falsely recognized. In some studies, recognition testing has been preceded by recall and in others by a filler task, but the effect of free versus forced retrieval instructions on recognition has not been examined. This was the third purpose of the present experiment. Because forced retrieval instructions increase false recall, and because it is reasonable to expect that items falsely recalled are likely to be judged as familiar when they appear during recognition, it seems likely that false recognition would be higher following forced retrieval instructions than following free retrieval instructions.
The general effect of recall on recognition has been investigated (McDermott, 1996), but the most relevant evidence can be found in studies that compare recognition accuracy after recall and after a filler arithmetic task. Although false recognition was generally increased by prior recall, closer inspection of the condition that was similar to that of the present study (i.e., short delay) shows a slightly different picture. False recognition was enhanced by prior recall in two cases (Lampinen, Neuschatz, & Payne, 1999, Experiment 2; Roediger & McDermott, 1995) and was unaffected in two cases (Lampinen et al., 1999, Experiment 1; Payne et al., 1996). From these results, it was not possible to make an exact prediction for the effect of forced versus free retrieval instructions on false recognition. However, from previous research, I expected that the rate of false recognition would be as high or almost as high as the rate of correct recognition (hits), but considerably greater than false alarms for unrelated items.
As with recall, confidence judgments were also obtained for recognition. In Mather and colleagues’ (1997) condition that most closely resembled the present experiment (one speaker, associated items presented together on a list), confidence ratings were similar for hits and for false recognition (above 2 = fairly sure on a scale from 1 = guessing to 3 = very sure) and higher than for false alarms (lower than 2). However, recall did not precede recognition. Because two other studies with prior recall found that confidence was higher for hits (almost sure = close to 4 on a scale where 1 = sure new, 2 = almost sure new, 3 = almost sure old, 4 = sure old) than for false recognition (close to 3), and lowest for false alarms (close to 1; Platt et al., 1998; Roediger & McDermott, 1995), this rank order was predicted for both free and forced retrieval instructions. As for false recall, it was not clear whether confidence for false recognition following forced retrieval would be lower than or similar to that following free retrieval.
In summary, the major goal of the present experiment was to examine the effect of forced versus free retrieval instructions on false recall, on false recognition, and on corresponding confidence ratings. From previous research, I expected that false recall would be higher with forced than with free retrieval instructions. If false recognition is influenced by prior recall, it should also increase following forced retrieval. If confidence varies with accuracy, it should be lower for false recall and false recognition following forced rather than free retrieval instructions.
Members of two psychology research methods classes (Class 1, n = 53; Class 2, n = 47) were assigned randomly to the free and forced retrieval instruction conditions. For Class 1, there were 27 (free) and 26 (forced) participants in the two conditions, respectively; corresponding numbers in Class 2 were 24 and 23. Lists were read to Class 2 in the order opposite from that used in Class 1.
Gender was matched across classes and across experimental conditions. Approximately 80% of the participants were women. The vast majority of the participants (96%) were between 20 and 23 years of age.
Because most investigators of the DRMRS procedure have used six lists of 12 to 15 words, the materials consisted of six lists of 14 words associated with the central concepts (sleep, book, cold, eat, needle, and high). The sleep list has been used in many studies and was the only one in Read’s (1996) research. The book and eat lists were taken from McKelvie (1999), and the needle and high lists were from Roediger and McDermott (1995). The cold list was constructed for the present experiment.
In each class, the two experimental groups were given their retrieval instructions separately. Then they were immediately brought together for testing in a single session. Both groups were told that they would hear six lists of 14 words and they were to try to recall the items on each list in any order immediately after it was read. Those in the free retrieval condition were told to write down as many words as they could in the 14 spaces provided. Those in the forced retrieval condition were told to fill all 14 spaces.
In Class 1, the six lists (sleep, book, cold, eat, needle, and high) were read to the class in that order by the experimenter at a 3-s pace, with 2.5 min permitted for immediate written recall of each list. In Class 2, the lists were read in the opposite order. After each word was recalled, the participants immediately rated it for confidence by writing down a number from the following 6-point scale: 4 = certain, 3 = very confident, 2 = reasonably confident, 1 = slightly confident, 0 = word written but no confidence it was on the list, -1 = word written but you think it was not on the list.
After recall of the final list, participants were given an unannounced 18-word recognition test that contained 3 words pertinent to each of the six lists. On this test, 6 words were studied items that had been heard (1 from serial positions 4 to 10 on each list), 6 were the critical concepts, and 6 were unrelated to the lists. As each item was read, participants recorded “yes” if they thought it had been on a list and “no” if not. Then they indicated their confidence using scale points 4 to 0. The scale was shortened because -1 did not apply: a person would not say yes to a recognition test item if he or she did not think that it was on the list.
Recall accuracy data (correct, false recall, intrusions) were analyzed with 2 x 2 (Retrieval Instructions x Order of Presentation) factorial analyses of variance (ANOVAs). Recognition accuracy data (hits, false recognitions, false alarms) and both recall and recognition confidence data were analyzed with 2 x 2 x 3 (Retrieval Instructions x Order of Presentation x Response) mixed-model ANOVAs with repeated measures on response (hits, false recognitions, false alarms). In all cases, alpha was set at .05. However, p values are also reported for each inferential statistic.  Table 1 contains the results collapsed over order of presentation. Although there were four significant or almost significant effects, they were not considered to be important and are not discussed. Note also that accuracy data are reported as proportions correct or incorrect.
Accuracy. For correct responses, the only significant effect was order of presentation, F(1, 96) = 6.44, p < .02. Performance was slightly better with the second than the first order (.73 > .69). The proportions correct were very similar for forced (.72) and free (.70) retrieval instructions, F(1, 96) = 6.44, p < .02. For number of recall attempts, false recall, and intrusions, the only significant effect was retrieval instructions, Fs(1, 96) = 103.15, 56.62, 90.86, respectively, ps < .001. In all three cases, scores were higher with forced than with free recall instructions: .95 > .76 (attempts), .68 > .34 (false), and .17 > .03 (intrusions).
Confidence. For confidence, there were three significant effects: retrieval instructions, F(1, 82) = 32.15; response, F(2, 164) = 174.07; and the Retrieval Instructions x Response interaction, F(2, 164) = 21.38. Overall, confidence was higher for correct (3.64) than for false recall (1.47), and higher for false recall than for intrusions (0.99). However, although these distinctions were clear with forced retrieval instructions (3.64, 1.48, 0.21, respectively), confidence was only higher for correct (3.65) than for the two kinds of error (2.28 for false recall, 2.03 for intrusions) with free recall instructions. Participants were also generally less confident with forced than with free instructions, but only for false recall (1.48 < 2.28) and intrusions (0.21 < 2.03), not for correct recall (3.64, 3.65).
The Order of Presentation x Response interaction was almost significant, F(2, 164) = 3.02, p = .051. For the first order of presentation, confidence scores for correct, false recall, and intrusions were 3.61, 1.82, and 1.24, respectively. For the second order, scores were 3.67, 1.82, and 0.71, respectively. The last low value for intrusions was due mainly to the very low confidence (-0.04) for forced retrieval instructions.
Accuracy. There was a significant effect of response, F(2, 192) = 272.80, p < .001, with more hits (.83) than false recognitions (.56) and false alarms (.11). The Retrieval Instructions x Response interaction was almost significant, F(2, 192) = 2.58, p < .08. For free and forced instructions, hits were more frequent than false recognitions, which were in turn more frequent than false alarms (see Table 1). Both hits (.84, .82) and false alarms (.09, .12) were very similar for forced and for free instructions, respectively, but false recognitions were higher for forced (.61) than for free (.51) instructions.
Confidence. The effects of response, F(2, 62) = 27.16, p < .001, and of the Retrieval Instructions x Order of Presentation interaction were significant, F(1, 31) = 6.45, p < .02; but the effects of retrieval instructions, F(1, 31) = 1.40, p > .20, and of the Retrieval Instructions x Response interaction were not, F(2, 62) = 0.65, p > .50. In general, confidence declined from hits (3.70), to false recognitions (3.16), to false alarms (2.44). Because false alarms were rare, this analysis was based on only 35 participants (17 in free, 18 in forced) who said yes to all three kinds of item. To clarify the evaluation of any effect of retrieval instructions on false recognition, I conducted a 2 x 2 x 2 (Retrieval Instructions x Order of Presentation x Response) analysis of variance with false alarm confidence omitted. Here, the number of participants was 92 (48 in free and 44 in forced). Again, the effect of response was significant, F(1, 88) = 73.36, p < .001, with confidence lower on false recognition (3.10) than on hits (3.75); however, the effect of retrieval, F(1, 88) = 1.49, p > .20, and the Retrieval Instructions x Response interaction were not significant, F(1, 88) = 2.34, p > .10. The previous Retrieval Instructions x Order of Presentation interaction disappeared, but the Order of Presentation x Response interaction was significant, F(1, 86) = 5.30, p < .03. Confidence was lower on false recognition than on hits for both the first (3.20 < 3.63) and second orders of presentation (2.98 < 3.83), but the difference was greater in the second case.
Accuracy. With standard free retrieval instructions, the rate of false recall was .34. This is commensurate with the rates obtained in previous work with the DRMRS procedure, in which the range is .24 to .71. Similarly, the proportions of correct recall (.70) and intrusions (.03) are similar to those in other studies (ranges .44 to .75 and less than .24, respectively).
The major findings were that attempts, intrusions, and false recall were higher with forced than with free retrieval instructions, whereas correct recall was not affected. Payne et al. (1996) obtained similar results, with forced retrieval increasing false recall from .27 to .46 but having no effect on correct recall. Although McKelvie (1999, Experiment 1) found no significant effect of forced retrieval instructions on correct or false recall, attempts increased only from .78 to .89 between free and forced retrieval. In McKelvie (1999) Experiment 2, attempts increased more, from .80 to .96, and false recalls increased from .41 to .68. Again, correct recall did not change. In the present study, attempts increased from .76 to .95, and false recall from .34 to .68. In addition, intrusions were also higher with forced than with free retrieval, replicating two similar effects (McKelvie, 1999). The present results are generally consistent with previous demonstrations that forced retrieval increases false recall and intrusions but does not increase correct recall.
Beginning with Deese (1959) himself, previous investigators (e.g., Newstead & Newstead, 1998; Platt et al., 1998; Stadler et al., 1999) have noted that the rate of false recall varies among lists, and norms have been published (Stadler et al., 1999). One of the most popular lists (sleep) yields a consistently high false recall rate (.44, Deese, 1959; .53, Newstead & Newstead, 1998; .61, Stadler et al., 1999). In the present study, the highest rate of false recall under free retrieval instructions occurred for the sleep list (.50), and the lowest rate occurred for eat (.12). Other values were .46 (cold), .34 (needle), .32 (book), and .22 (high). With forced retrieval, the highest rate of false recall also occurred for sleep (.86), and the other rates were .84 (cold), .68 (needle), .62 (eat), .58 (book), and .46 (high). Thus, although rates varied from list to list, such variation is not unusual, and all of them increased with forced retrieval. Indeed, with the exception of eat, for which the rate of false recall increased fivefold from .12 to .62, the rates approximately doubled for each list. This finding shows that the lists contributed proportionately to the effect of retrieval instructions.
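The "approximately doubled" and "fivefold" descriptions can be checked directly from the list-level rates reported above. The short sketch below does only that arithmetic; the variable names are illustrative, and the rates are copied from the text rather than recomputed from raw data.

```python
# False recall rates per list, as reported in the text.
free_rates = {"sleep": .50, "cold": .46, "needle": .34,
              "book": .32, "high": .22, "eat": .12}
forced_rates = {"sleep": .86, "cold": .84, "needle": .68,
                "book": .58, "high": .46, "eat": .62}

# Ratio of forced to free false recall for each list.
ratios = {name: forced_rates[name] / free_rates[name] for name in free_rates}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>6}: {free_rates[name]:.2f} -> {forced_rates[name]:.2f} "
          f"(x{ratio:.2f})")
```

For five of the lists the ratio falls between roughly 1.7 and 2.1, and for eat it is about 5.2, which matches the verbal summary.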
Confidence. Participants were less confident with forced than with free retrieval instructions for false recall and for intrusions, but not for correct recall. The last result is consistent with the lack of effect of retrieval instructions on correct recall itself and replicates a previous finding (McKelvie, 1999). Similarly, forced retrieval instructions had an effect on both the number of intrusions and on the confidence reported in them: intrusions increased with forced compared with free retrieval, but confidence decreased and was very low indeed (0.21 on a scale from -1 to 4).
Retrieval instructions also affected the rate of false recall and confidence in it. As with intrusions, false recall increased with forced retrieval compared with free retrieval, but confidence decreased. This finding clarifies an ambiguous pattern of results in which false recall confidence was reduced by forced retrieval in one case but not in another (McKelvie, 1999). The present drop in false recall confidence with forced retrieval, together with the very low confidence for forced intrusions, may reflect the more sensitive rating scale, which did not contain the label “guess” and allowed participants to indicate that they felt some responses to be wrong (-1). Presumably, these responses were made only to satisfy the demands of the forced retrieval instructions. In fact, for the 51 participants in the free recall condition, a rating of -1 was given a total of 30 times, whereas for the 49 participants in the forced recall condition, it was given 467 times. Thus, -1 occurred much more often with forced than with free retrieval instructions. Finally, and replicating a result from McKelvie’s (1999) Experiment 2, forced retrieval confidence declined systematically from correct recall, to false recall, to intrusions, whereas free retrieval confidence declined from correct to false recall, for which it was similar to that for intrusions.
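Because the two conditions had slightly different numbers of participants, the raw totals of -1 ratings are easier to compare as rates. The sketch below converts the reported totals to ratings per participant and per participant per list; it uses only figures given in the text, and the per-list breakdown simply assumes the ratings are averaged over the six lists.

```python
# Group sizes by class, as reported in the participants description.
free_n = 27 + 24      # free retrieval: Class 1 + Class 2
forced_n = 26 + 23    # forced retrieval: Class 1 + Class 2
n_lists = 6

# Total number of -1 ratings ("think it was not on the list").
free_minus_one = 30
forced_minus_one = 467

rate_free = free_minus_one / free_n          # per participant
rate_forced = forced_minus_one / forced_n    # per participant

print(f"free:   {rate_free:.2f} per participant, "
      f"{rate_free / n_lists:.2f} per list")
print(f"forced: {rate_forced:.2f} per participant, "
      f"{rate_forced / n_lists:.2f} per list")
```

On these figures, a forced retrieval participant gave a -1 rating about 9.5 times across the session (roughly 1.6 per list), compared with about 0.6 times for a free retrieval participant.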
The confidence data show that, when participants were correct, they were almost certain (mean ratings were 3.65, 3.64 out of 4 for the free and forced retrieval conditions). This is very similar to the level of correct confidence found by other investigators (Bredart, 2000; McKelvie, 1999; Payne et al., 1996; Read, 1996; Toglia et al., 1999). When the present participants were wrong under free retrieval instructions, they felt reasonably but not very confident (ratings were 2.28 and 2.03 for false recall and intrusions, respectively). The level of false recall confidence (2.28) was similar to that (reasonably confident) of McKelvie (1999) and of Toglia et al. (1999) but somewhat lower than those (very confident) reported by others (Bredart, 2000; Payne et al., 1996; Read, 1996; Toglia et al., 1999). Although Read’s data were obtained with only one list (sleep), which often produces high estimates of false recall (e.g., Deese, 1959; Newstead & Newstead, 1998; Stadler et al., 1999), and the estimate of Payne et al. was obtained after three recall attempts rather than the single one here, the results of Bredart and Toglia et al. were obtained under conditions that were similar to those in the present experiment. Overall, the present false recall confidence was somewhat lower than previous estimates.
Accuracy. Overall, hits (.83) were significantly greater than false recognitions (.56), which, in turn, exceeded false alarms (.11). This pattern held true for both free and forced retrieval instructions. The size of the first contrast is unusual, because previous studies have generally shown that false recognition is either similar to hits or only slightly lower (see introduction). One factor that may account for this discrepancy is that the present test consisted of a similar number of studied items, critical concepts, and other new items, whereas previous tests have had relatively more studied and new items. This implies that the present hit rate may be unreliable compared with previous hit rates. However, the free retrieval value of .82 is similar to past estimates; it is the free retrieval false recognition rate of .51 that is lower than previous values. Nevertheless, that rate still represents substantial error: The false recall rate with free retrieval instructions was only .34.
Of particular importance, false recognition was higher (.61) for forced retrieval than for free retrieval instructions (.51). Although findings of previous studies examining the effect of recall on recognition (Lampinen et al., 1999; McDermott, 1996; Payne et al., 1996; Roediger & McDermott, 1995) were ambiguous, they did not specifically compare forced and free retrieval. Because forced retrieval instructions increased false recall, the corresponding increase in false recognition probably occurred when critical concepts that had been falsely recalled appeared on the recognition test and were judged to be familiar. This account is consistent with the results for correct recall and for hits, both of which were unaffected by retrieval instructions. Although forced retrieval increased intrusions on recall but did not affect false alarms on recognition, the extra recall intrusions were produced only by the participants and did not match the new items introduced by the experimenter on the recognition test.
Confidence. As predicted, confidence declined from hits, to false recognitions, to false alarms, replicating a pattern found in previous research (Mather et al., 1997; Platt et al., 1998; Roediger & McDermott, 1995). However, there was no effect of retrieval instructions on confidence. This finding is consistent with the findings for hits and for false alarms, which did not differ as a function of retrieval instructions. In contrast, false recognition was higher with forced retrieval, but confidence was not significantly affected. For recall, forced retrieval increased false recall and decreased confidence. Here, forced retrieval increased false recognition but did not decrease confidence; indeed, confidence ratings were slightly but not significantly higher with forced retrieval than with free retrieval instructions.
As with recall, the recognition confidence data show that, when participants were correct, they were almost certain about their responses (mean confidence ratings ranged from 3.69 to 3.75 out of 4 for the two conditions). These numbers are very similar to Roediger and McDermott’s (1995) confidence estimates (almost sure) for hits, and slightly higher than those obtained by Mather et al. (1997; above guessing but less than very sure). When the present participants were wrong with false recognition, they felt at least very confident (3; the four mean ratings ranged from 2.96 to 3.35), which was higher than the corresponding levels (approximately reasonably confident = 2; range 1.10 to 2.68) with false recall. It was also relatively higher than Mather and colleagues’ (1997) estimate (between guessing and very sure), but it was lower than Roediger and McDermott’s (1995; almost sure). When participants were wrong with false alarms, they felt slightly more than reasonably confident (2.30, 2.57, where 2 = reasonably confident), which is similar to the corresponding level (reasonably confident) with free retrieval intrusions. Like false recognition confidence, it was relatively higher than Mather and colleagues’ (1997) estimate (just above guessing) but lower than Roediger and McDermott’s (1995; probably correct).
Theoretical and Applied Implications
With forced retrieval instructions, when participants made repeated attempts to recall all list items, false recall and intrusions were higher than with free retrieval instructions, when participants simply recalled what they could. False recognition was also higher with forced retrieval than with free retrieval instructions, probably because people were more likely to judge critical concepts as familiar if they had already recalled them. In both cases, the rate of correct responses was unaffected by retrieval instructions.
These findings support the theoretical claim that false recall can be caused by factors occurring during testing (Roediger & McDermott, 1995). Although caution is required in generalizing from laboratory to life, the findings also imply that therapists who encourage clients to “tell more” (Payne et al., 1996) may elicit extra information, but it is more likely to be incorrect than correct. In particular, the false recognition results imply that if a therapist suggests an event that is consistent with other recalled information, then it may be accepted as having happened when it actually did not.
Previous investigators have debated whether confidence is related to accuracy, particularly for recognition memory (Busey, Tunnicliff, Loftus, & Loftus, 2000; Deffenbacher, 1980). Here, confidence ratings declined along with hits, false recognitions, and false alarms, and they also reflected the nonsignificant effect of retrieval instructions on hits and false alarms. In contrast, false recognition was higher with forced retrieval than with free retrieval instructions, but the corresponding confidence ratings did not differ significantly. For recall under free retrieval instructions, confidence ratings declined with correct and false recall responses, but not with intrusions. However, for free recall under forced retrieval instructions, confidence ratings declined systematically across all three scores. Finally, confidence ratings showed the same pattern as accuracy with forced and with free retrieval instructions: Correct recall was unaffected, but false recall and intrusions were higher with forced retrieval, and confidence in both responses was lower. Thus, with the exception of one case in recognition and one case in recall, confidence ratings varied in the same way as did accuracy. This implies that confidence judgments may be used as a guide to accuracy when the latter is not objectively known.
To develop some practical advice that might be further investigated, I calculated 95% confidence intervals for the confidence judgments (see Table 1). If people report themselves as very confident (3) to certain (4; range 3.5 to 3.8) on recall, they are likely to be correct; but if they report themselves as only reasonably confident (2; range 1.9 to 2.7), they are likely to be making false recalls. If they have been given forced retrieval instructions and are less than reasonably confident (range 1.1 to 1.8), they are likely to be making false recalls; if they are not confident at all (0; range 0.03 to 0.5), they are likely to be making intrusions. These results suggest that, if people outside the laboratory are pushed to tell more, many of the extra recall responses may be doubted if people feel less than reasonably confident.
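The rule of thumb above can be written as a simple interval lookup. The following Python sketch is only an illustration: it takes the 95% confidence intervals for recall confidence from Table 1 and reports which response types have an interval containing a given mean rating. The function name `likely_categories` is hypothetical, and because the intervals describe group means, applying them to a mean rating from a new set of responses is an assumption rather than a procedure used in the study.

```python
# 95% confidence intervals (lower, upper) for recall confidence, from Table 1.
RECALL_CI = {
    "free": {
        "correct": (3.49, 3.80),
        "false recall": (1.88, 2.68),
        "intrusion": (1.56, 2.50),
    },
    "forced": {
        "correct": (3.55, 3.72),
        "false recall": (1.10, 1.84),
        "intrusion": (-0.03, 0.46),
    },
}


def likely_categories(mean_rating, condition):
    """Return the response types whose interval contains mean_rating."""
    return [
        label
        for label, (low, high) in RECALL_CI[condition].items()
        if low <= mean_rating <= high
    ]


print(likely_categories(3.6, "free"))    # ['correct']
print(likely_categories(2.0, "free"))    # ['false recall', 'intrusion']
print(likely_categories(1.5, "forced"))  # ['false recall']
print(likely_categories(0.2, "forced"))  # ['intrusion']
```

Under free retrieval the intervals for false recall and intrusions overlap, so a rating near 2 signals an error without identifying its type. Under forced retrieval the three intervals are separate.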
On the recognition test, confidence declined across hits, false recognitions, and false alarms, but the difference between correct and incorrect was not as great as with recall. People reported themselves as very confident (3) or certain (4) on hits (range 3.6 to 3.9) and very confident on false recognitions (range 2.8 to 3.4), and they were usually at least reasonably confident (2) on false alarms (range 1.8 to 3.0). Furthermore, they were just as confident on false recognition with forced and free recall despite making more errors in the first case. These results suggest that confidence judgments are of less practical value in detecting errors on a recognition test than on a recall test.
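One rough way to see why confidence separates correct from false responses better in recall than in recognition is to compare the gap between the two intervals. The Python sketch below, offered as an illustration rather than as part of the original analysis, subtracts the upper bound for false responses from the lower bound for correct responses, using the Table 1 values. The `separation` helper is hypothetical.

```python
# 95% CI bounds (lower, upper) from Table 1 for correct and false responses.
INTERVALS = {
    ("recall", "free"): {"correct": (3.49, 3.80), "false": (1.88, 2.68)},
    ("recall", "forced"): {"correct": (3.55, 3.72), "false": (1.10, 1.84)},
    ("recognition", "free"): {"correct": (3.63, 3.87), "false": (2.77, 3.22)},
    ("recognition", "forced"): {"correct": (3.64, 3.86), "false": (2.99, 3.42)},
}


def separation(test, condition):
    """Gap between the lower bound for correct and the upper bound for false."""
    bounds = INTERVALS[(test, condition)]
    return round(bounds["correct"][0] - bounds["false"][1], 2)


for test, condition in INTERVALS:
    print(test, condition, separation(test, condition))
```

On these figures the gap is 0.81 (free) and 1.71 (forced) scale points for recall but only 0.41 (free) and 0.22 (forced) for recognition.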
This experiment replicated reports of substantial false recall and false recognition of critical concepts in the standard DRMRS procedure, and it showed that the errors increased with forced retrieval instructions. Theoretically, these results indicate that false memory is, at least in part, a retrieval process. Because correct responses were not affected by retrieval instructions, the results also imply that encouraging people to maximize their retrieval attempts will increase errors at the expense of accuracy. In general, the pattern of confidence judgments followed the pattern of memory scores, but the patterns were more useful for discriminating correct recall from false recall than for discriminating correct recognition from false recognition.
The author thanks reviewers Robert Solso and Susan Amato, who provided very helpful comments on an earlier version of this article.
(1.) Bruce and Winograd (1998) argued that Roediger and McDermott (1995) were primarily responsible for reviving Deese’s (1959) technique, and they suggested that it be termed the DRM (Deese-Roediger-McDermott) procedure. Stadler, Roediger, and McDermott (1999) also used the DRM nomenclature (pronouncing DRM as “dream”), but they claimed that the term was coined by Tulving. However, both Bruce and Winograd and Roediger and McDermott recognized that Read (1996) worked on the paradigm independently and that he originally presented his findings at a 1993 conference (Bruce & Winograd, 1998). Furthermore, Solso has drawn my attention to his use of a very similar task in a study of prototype formation (Solso, Heck, & Mearns, 1993), the results of which were presented at a 1987 conference. Although Solso et al. did not cite Deese, prototype formation is relevant to false memory. Roediger and McDermott’s (1995) contribution is certainly the most detailed and is probably the best known, but I think that both Read and Solso should be given credit and so will refer to the DRMRS (Deese-Roediger-McDermott-Read-Solso, “drummers”) procedure.
(2.) Setting the level of alpha indicates the level of Type I error that is tolerated in the research. In this “rejection region procedure,” it is not necessary to present p values for each inferential statistic (Herzberg, 1983, p. 229). However, because one reviewer felt that they would strengthen the paper, p values are included.
Brainerd, C. J., & Reyna, V. F. (1998). When things that were never experienced are easier to “remember” than things that were. Psychological Science, 9, 484-489.
Bredart, S. (2000). When false memories do not occur: Not thinking of the lure or remembering what is heard? Memory, 8, 123-128.
Bruce, D., & Winograd, E. (1998). Remembering Deese’s 1959 articles: The Zeitgeist, the sociology of science, and false memories. Psychonomic Bulletin & Review, 5, 615-624.
Busey, T. A., Tunnicliff, J., Loftus, G. R., & Loftus, E. (2000). Psychonomic Bulletin & Review, 7, 26-48.
Clancy, S. A., Schacter, D. L., McNally, R. J., & Pitman, R. K. (2000). False recognition in women reporting recovered memories of sexual abuse. Psychological Science, 11, 26-31.
Deese, J. (1959). On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of Experimental Psychology, 58, 17-22.
Deffenbacher, K. A. (1980). Eyewitness accuracy and confidence: Can we infer anything about their relationship? Law and Human Behavior, 4, 243-260.
Freyd, J. J., & Gleaves, D. H. (1996). “Remembering” words not presented in lists: Relevance to the current recovered/false memory controversy. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 811-813.
Gallo, D. A., Roberts, M. J., & Seamon, J. G. (1997). Remembering words not presented on lists: Can we avoid creating false memories? Psychonomic Bulletin & Review, 4, 271-276.
Herzberg, P. (1983). Principles of statistics. New York: Wiley.
Klatzky, R. L. (1975). Human memory: Structures and processes. San Francisco: W. H. Freeman.
Lampinen, J. M., Neuschatz, J. S., & Payne, D. G. (1999). Source attributions and false memories: A test of the demand characteristics account. Psychonomic Bulletin & Review, 6, 130-135.
Mather, M., Henkel, L. A., & Johnson, M. K. (1997). Evaluating characteristics of false memories: Remember/know judgments and memory characteristics questionnaire compared. Memory & Cognition, 25, 826-837.
McDermott, K. B. (1996). The persistence of false memories in list recall. Journal of Memory and Language, 35, 212-230.
McKelvie, S. J. (1993). Confidence and accuracy in facial memory: Further evidence for the optimality hypothesis. Perceptual and Motor Skills, 76, 1257-1258.
McKelvie, S. J. (1999). Effect of retrieval instructions on false recall. Perceptual and Motor Skills, 88, 876-878.
Miller, M. B., & Wolford, G. L. (1999). Theoretical commentary: The role of criterion shift in false memory. Psychological Review, 106, 398-405.
Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38, 379-387.
Newstead, N. A., & Newstead, S. E. (1998). False recall and false memory: The effect of instructions on memory errors. Applied Cognitive Psychology, 12, 67-79.
Norman, K. A., & Schacter, D. L. (1997). False recognition in younger and older adults: Exploring the characteristics of illusory memories. Memory & Cognition, 25, 838-848.
Payne, D. G., Elie, C. J., Blackwell, J. M., & Neuschatz, J. S. (1996). Memory illusions: Recalling, recognizing, and recollecting events that never occurred. Journal of Memory and Language, 35, 261-285.
Platt, R. D., Lacey, S. C., Iobst, A. D., & Finkelman, D. (1998). Absorption, dissociation, and fantasy-proneness as predictors of memory distortion in autobiographical and laboratory-generated memories. Applied Cognitive Psychology, 12, S77-S89.
Read, J. D. (1996). From a passing thought to a false memory in 2 minutes: Confusing real and illusory events. Psychonomic Bulletin & Review, 3, 105-111.
Robinson, K. J., & Roediger, H. L., III. (1997). Associative processes in false recall and recognition. Psychological Science, 8, 231-237.
Roediger, H. L., III, & McDermott, K. B. (1995). Creating false memories: Remembering words not presented on lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 803-814.
Roediger, H. L., III, & McDermott, K. B. (1996). False perception of false memories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 814-816.
Roediger, H. L., III, & McDermott, K. B. (1999). False alarms about false memories. Psychological Review, 106, 406-410.
Solso, R. L. (1995). Cognitive psychology (4th ed.). Boston: Allyn and Bacon.
Solso, R. L., Heck, M., & Mearns, C. (1993). Prototype formation in very short-term memory. Bulletin of the Psychonomic Society, 31, 185-188.
Stadler, M. A., Roediger, H. L., III, & McDermott, K. B. (1999). Norms for word lists that create false memories. Memory & Cognition, 27, 494-500.
Toglia, M. P., Neuschatz, J. S., & Goodwin, K. A. (1999). Recall accuracy and illusory memories: When more is less. Memory, 7, 233-256.
Tussing, A. A., & Greene, R. L. (1997). False recognition of associates: How robust is the effect? Psychonomic Bulletin & Review, 4, 572-576.
Wickens, T. D., & Hirshman, E. (2000). False memories and statistical decision theory: Comment on Miller and Wolford (1999) and Roediger and McDermott (1999). Psychological Review, 107, 377-383.
Winograd, E., Peluso, J. P., & Glover, T. A. (1998). Individual differences in susceptibility to memory illusions. Applied Cognitive Psychology, 12, S5-S27.
Wixted, J. T., & Stretch, V. (2000). The case against a criterion-shift account of false memory. Psychological Review, 107, 368-376.
TABLE 1
Mean Proportion of Accurate Responses and Mean Confidence in Each Condition

Measure                           Value

Free retrieval instructions
  Recall accuracy [a]
    False recall                  .34
  Recall confidence [b,d]
    Correct                       3.49 (3.65) 3.80
    False recall                  1.88 (2.28) 2.68
    Intrusions                    1.56 (2.03) 2.50
  Recognition accuracy [c]
    False recognition             .51
    False alarms                  .12
  Recognition confidence [b,e]
    Hits                          3.63 (3.69, 3.75) 3.87
    False recognition             2.77 (2.96, 3.00) 3.22
    False alarms                  1.78 (2.30) 2.83

Forced retrieval instructions
  Recall accuracy [a]
    False recall                  .68
  Recall confidence [b,d]
    Correct                       3.55 (3.64) 3.72
    False recall                  1.10 (1.48) 1.84
    Intrusions                    -0.03 (0.21) 0.46
  Recognition accuracy [c]
    False recognition             .61
    False alarms                  .09
  Recognition confidence [b,e]
    Hits                          3.64 (3.71, 3.75) 3.86
    False recognition             2.99 (3.35, 3.21) 3.42
    False alarms                  2.13 (2.57) 3.02

(a) For recall accuracy, raw score maxima = 6 for false recall (one for each list), 84 for attempts, correct, and intrusions (14 for each list).
(b) Range for confidence = -1 (word not on list) to 4 (certain) for recall and 0 (no confidence) to 4 (certain) for recognition.
(c) For recognition accuracy, raw score maxima = 6 (one hit, one false recognition, and one false alarm for each list).
(d) The 3 scores for confidence are the lower and upper bounds of the 95% confidence interval, with the mean in parentheses between them.
(e) For confidence in hits and false recognition, the two means (in parentheses) refer to the analyses with and without false alarms, respectively; 95% confidence intervals were calculated around the latter.
COPYRIGHT 2001 Heldref Publications
COPYRIGHT 2001 Gale Group