Effects of hypothesis generation on hypothesis testing in rule-discovery tasks

Dennis J. Adsit

Cognitive hypothesis testing is a process for reducing uncertainty in ill-defined problems (Abelson & Levi, 1985). It involves discovering an underlying rule governing objects in a given problem situation. For example, the rule may be for classifying objects into mutually exclusive categories (Bruner, Goodnow, & Austin, 1956), for governing the movement of objects in a computer-simulated universe (Mynatt, Doherty, & Tweney, 1977), or for determining how a certain function key on a robot works (Klahr & Dunbar, 1988). Inability to test hypotheses can result in the adoption of erroneous beliefs, including myths and superstitions (Einhorn & Hogarth, 1986). Having hypotheses to test allows the decision maker to replace, rather than merely disconfirm, a preferred belief.

Hypothesis-Testing Research

Bruner, Goodnow, and Austin (1956) observed that problem solvers prefer to confirm any hypothesis that they may be working on. However, Popper (1959) proposed that conclusive verification of hypotheses is not possible; only conclusive falsification is possible, and thus participants should seek to disprove favored explanations of various phenomena. Wason (1960) used an open-ended task to determine which of these two tendencies was more applicable to hypothesis-testing behavior. In his “2-4-6” task, participants are told that the string 2-4-6 is an example of a set of numbers that is consistent with a simple relational rule describing sets of three numbers. The participants’ task is to discover the rule by generating successive sets of three numbers and modifying their hypotheses based on feedback on each trial about whether the proposed string is consistent or inconsistent with the correct rule.

In the initial research by Wason and much of the subsequent research using Wason’s task, the simple rule being sought was “any three numbers in increasing order of magnitude.” However, the most typical initial hypotheses were “consecutive even numbers” or “numbers that increase by 2,” and participants worked from there, in attempts to confirm these wrong hypotheses. Wason’s results have been replicated several times (e.g., Gorman & Gorman, 1984; Mahoney & DeMonbreun, 1977; Tukey, 1986; Tweney et al., 1980). People make problem-solving errors because they focus on confirming their ideas rather than on looking for other ideas to test (Tversky & Kahneman, 1974).
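The feedback logic of the 2-4-6 task can be illustrated with a minimal Python sketch. The function names and the particular test triples are illustrative assumptions, not materials from Wason's study. The sketch shows why triples chosen to fit "numbers that increase by 2" always receive "consistent" feedback under the broader target rule, so that such tests cannot reveal that the hypothesis is too narrow.

```python
from typing import Callable, Tuple

Triple = Tuple[int, int, int]


def target_rule(t: Triple) -> bool:
    """Target rule from the text: any three numbers in increasing order."""
    a, b, c = t
    return a < b < c


def increase_by_two(t: Triple) -> bool:
    """A typical initial hypothesis: numbers that increase by 2."""
    a, b, c = t
    return b - a == 2 and c - b == 2


def feedback(t: Triple, rule: Callable[[Triple], bool] = target_rule) -> str:
    """Experimenter's response to a proposed triple."""
    return "consistent" if rule(t) else "inconsistent"


# Hypothetical positive tests: triples that fit the participant's hypothesis.
positive_tests = [(8, 10, 12), (1, 3, 5), (20, 22, 24)]
# Hypothetical negative test: a triple that violates the hypothesis.
negative_test = (1, 2, 10)

for t in positive_tests:
    print(t, feedback(t))  # every positive test is "consistent"

# The negative test is also "consistent", which falsifies "increase by 2".
print(negative_test, increase_by_two(negative_test), feedback(negative_test))
```

In this sketch only the triple that violates the working hypothesis yields information that forces the hypothesis to be revised.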

There may be something anomalous about the typical 2-4-6 design that prompts a confirmatory hypothesis-testing strategy. Wetherick (1962) claimed that the use of 2-4-6 as an initial string made the task misleading. When other initial instances were used, most participants correctly solved the task. Wetherick argued that participants given the 2-4-6 task may have been confirming favored hypotheses by simply comparing them with competing hypotheses. Gorman and Gorman (1984) found that teaching participants to disconfirm hypotheses led to improved performance. Tukey (1986) found that participants in the 2-4-6 task did not simply confirm or disconfirm their hypotheses one at a time but, rather, used the information from hypotheses tested to eliminate competing hypotheses, often seeking relevant dimensions or simply probing with no particular hypothesis in mind.

Other rule-discovery tasks used to study confirmation bias in cognitive hypothesis testing (cf. Snyder & Swann, 1978; Swann, Giuliano, & Wegner, 1982; Wason, 1966) have been criticized as being confounded by semantic, linguistic, and social factors and by ambiguous facts and hypotheses (Fischhoff & Beyth-Marom, 1983; Klayman & Ha, 1989; Mynatt, Doherty, & Tweney, 1978). In some cases, high task complexity makes generating new hypotheses difficult for participants, contributing to a reluctance to abandon disconfirmed hypotheses (Mynatt et al., 1978). In one study, the goal of hypothesis confirmation seemed to prevent participants from considering alternative hypotheses (Dunbar, 1993).

Fischhoff and Beyth-Marom (1983) argued that there is no confirmation bias. Rather, participants cannot always recognize diagnostically useful information. Klayman and Ha (1987) also argued against the existence of a confirmation “bias.” They demonstrated that the search strategy used by participants in hypothesis-testing tasks is actually a good way to determine the truth of a hypothesis under realistic conditions.

However, the typical, confirmatory search strategy can lead to systematic biases. Klayman and Ha (1987) highlighted three relationships between the initial hypothesis and the target hypothesis: embedded (every element of the initial hypothesis is an element of the target hypothesis), overlapping (some of the elements of the initial hypothesis are in common with the target hypothesis and others are not), and surrounding (all the elements of the target hypothesis are a subset of the elements of the initial hypothesis). Prescriptions about what is normative are a function of the relationship between the strategy of the participant and the structure of the task. In general, a disconfirmatory strategy is likely to be most useful in the embedded situation and a confirmatory strategy is likely to be most useful in the surrounding and overlapping situations.
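The three relationships can be made concrete by treating each rule as the set of instances it admits. The following Python sketch is an illustration only: the finite universe of triples, the helper names, and the "consecutive even numbers" target are assumptions chosen for the example, not part of Klayman and Ha's analysis.

```python
from itertools import product

# Finite stand-in for the infinite space of number triples (an assumption).
UNIVERSE = list(product(range(1, 13), repeat=3))


def extension(rule):
    """Set of triples in UNIVERSE that satisfy the rule."""
    return {t for t in UNIVERSE if rule(*t)}


def relationship(hypothesis, target):
    """Classify the initial hypothesis set relative to the target set."""
    if hypothesis == target:
        return "identical"
    if hypothesis < target:   # proper subset: hypothesis lies inside target
        return "embedded"
    if hypothesis > target:   # proper superset: hypothesis encloses target
        return "surrounding"
    if hypothesis & target:   # shared instances, but neither contains the other
        return "overlapping"
    return "disjoint"


increase_by_two = extension(lambda a, b, c: b - a == 2 and c - b == 2)
ascending = extension(lambda a, b, c: a < b < c)
single_digit = extension(lambda a, b, c: max(a, b, c) <= 9)
consecutive_even = extension(
    lambda a, b, c: a % 2 == 0 and b - a == 2 and c - b == 2
)

for name, target in [
    ("ascending order", ascending),
    ("single-digit numbers", single_digit),
    ("consecutive even numbers", consecutive_even),
]:
    print(name, "->", relationship(increase_by_two, target))
```

With "numbers that increase by 2" as the initial hypothesis, the three illustrative targets fall into the embedded, overlapping, and surrounding cases, respectively.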

Hypothesis testing is an inherently dynamic process of generating hypotheses, finding ways to test those hypotheses and then revising beliefs as a result of those tests. As Hogarth (1981) and Lopes (1986) have pointed out, those studying static decision-making problems in the lab have ignored important features of dynamic decision-making environments, leading to somewhat misleading conclusions about human judgment processes. Hypothesis testing is an iterative, adaptive process. Research needs to go beyond examining initial information-gathering strategies. Also, researchers should study hypothesis testing across a range of tasks. In addition, they should vary the structure of the tasks, because what is normative can change across task situations. To the extent that a particular task allows for varying the number of instances encompassed by a target rule, researchers should study performance across a range of task structures.

Hypothesis-Generation Research

To date, researchers have done little to isolate the possible effects of hypothesis generation on rule discovery. In the future, investigators might determine whether hypothesis-generation deficits limit success in hypothesis testing. Hypothesis generation is the process of creating possible, alternative explanations for a given set of information (Fisher, Gettys, Manning, Mehle, & Baca, 1983). Problem solvers tend to generate “impoverished” hypothesis sets, both in terms of quantity and quality, and then overestimate the completeness of their sets and the plausibility of their hypotheses (cf. Fisher et al., 1983; Gettys, Fisher, & Mehle, 1978; Gettys, Manning, Mehle, & Fisher, 1980; Mehle, Gettys, Manning, Baca, & Fisher, 1981).

Using the 2-4-6 task, Tukey (1986) had participants generate hypotheses before the hypothesis-testing process. Unfortunately, Tukey failed to use a control group that did not generate hypotheses, and thus it was impossible to determine what effect hypothesis generation had on hypothesis-testing performance. Using a different task, Klahr and Dunbar (1988) found that spending time generating hypotheses before one begins designing experiments and testing hypotheses has important effects on the hypothesis-testing process. Participants who generated hypotheses before beginning to test them solved the task correctly and arrived at the correct solution more quickly than participants who did not generate hypotheses before beginning hypothesis testing. Not having generated a set of alternative hypotheses resulted in taking longer to abandon hypotheses that had been refuted by experimentation.

Farris and Revlin (1989) argued that the apparent confirmation bias in hypothesis testing cited by Wason (1960) and others may be a function of the hypothesis-generation process. In the normal course of the rule-discovery task, participants generate and evaluate hypotheses culled from a potentially infinite set of hypotheses. To reduce the enormity of this task, they actually perform a counterfactual reasoning strategy. In its simplest form, participants generate one or more favored hypotheses and competing hypotheses and then compare each favored hypothesis with its competitors. This process might seem to imply a confirmatory bias, but actually it involves using a strong inference that is highly predictive of success. When the task involves only hypothesis evaluation, however, participants consider each hypothesis separately from a small set of predesignated hypotheses. The costs of disconfirmation are low, so they simply try to disconfirm one hypothesis at a time.

Koehler (1994) found that problem solvers who generated their own hypotheses expressed less confidence that the hypotheses were true and were more sensitive to the hypotheses’ accuracy than those who were presented with the same hypotheses for evaluation. This finding suggests that hypothesis generation leads people to test more alternative hypotheses than those who are asked to evaluate prespecified hypotheses.

Current Research

The present study involved four conditions designed to isolate the effects of alternative hypotheses on hypothesis testing: baseline, participant-generated, yoked, and experimenter-supplied. In the baseline condition, participants were simply given the tasks and asked to determine the correct rule. There were no special instructions on how to approach the task. In the participant-generated condition, participants generated a list of hypotheses before they were asked to do any testing. This structure allowed us to examine the number of hypotheses generated and their quality (whether the initial set of hypotheses contained the target rule), as well as the relationships between the quality and number of hypotheses generated and hypothesis-testing effectiveness. We expected significant correlations between both the number and quality measures and the hypothesis-testing performance variable.

Rule-discovery performance may also be affected by the source of hypotheses. Participants who generated their own hypotheses were expected to have trouble thinking of many additional hypotheses beyond those they had already generated (Gettys, Fisher, & Mehle, 1979). Those in a participant-generated condition were likely to consider the list they generated as fairly complete and therefore were less likely to consider alternative hypotheses. On the other hand, participants who were given a list of hypotheses should be able to think of some alternatives to the list, which should increase the likelihood of their solving the tasks correctly. We tested our hypotheses in a yoked condition, in which a list of hypotheses generated by one of the participants in the participant-generated condition was given to all of the participants in the yoked condition before they began to test hypotheses.

Researchers of rule discovery have concluded that participants have not performed well at hypothesis testing because of their tendency to confirm hypotheses. This conclusion about participants’ hypothesis-testing abilities is troubling for two reasons. First, participants may have been doing more than just hypothesis testing. They may also have been generating their own hypotheses. It is thus unclear whether participants have been ineffective at hypothesis generation, hypothesis testing, or both. Second, drawing conclusions and developing any real understanding about participants’ hypothesis-testing abilities demand that researchers look beyond the general level of measurement of correct/incorrect to more subtle measures, such as how effectively participants used the results of their hypothesis tests to design subsequent tests.

In this study, we addressed both of these concerns by including an experimenter-supplied condition in which participants received a list of hypotheses to test, including the target rule. They were told that the correct solution was in the set they had been given and that their task was to find it. This procedure allowed us to examine their hypothesis-testing performance independent of hypothesis generation. We expected the participants in the experimenter-supplied condition to propose tests that discriminated between hypotheses and to use the information from their experiments to eliminate hypotheses, thereby discovering the correct rule.
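The expected strategy of proposing discriminating tests and eliminating hypotheses can be sketched in a few lines of Python. The three candidate rules, the test triples, and the function names below are hypothetical and serve only to illustrate the logic; they are not the nine rules supplied to participants.

```python
from typing import Callable, Dict, Tuple

Triple = Tuple[int, int, int]
Rule = Callable[[Triple], bool]

# Hypothetical candidate list; one entry plays the role of the target rule.
CANDIDATES: Dict[str, Rule] = {
    "increase by 2": lambda t: t[1] - t[0] == 2 and t[2] - t[1] == 2,
    "ascending order": lambda t: t[0] < t[1] < t[2],
    "all even": lambda t: all(n % 2 == 0 for n in t),
}


def discriminates(candidates: Dict[str, Rule], triple: Triple) -> bool:
    """A test is informative only if the candidates disagree about it."""
    return len({rule(triple) for rule in candidates.values()}) > 1


def eliminate(
    candidates: Dict[str, Rule], triple: Triple, consistent: bool
) -> Dict[str, Rule]:
    """Keep only the candidates whose prediction matches the feedback."""
    return {
        name: rule for name, rule in candidates.items()
        if rule(triple) == consistent
    }


target = CANDIDATES["ascending order"]  # assumed target for the example

print(discriminates(CANDIDATES, (8, 10, 12)))  # all candidates agree
test = (1, 2, 10)
print(discriminates(CANDIDATES, test))         # candidates disagree
remaining = eliminate(CANDIDATES, test, consistent=target(test))
print(sorted(remaining))
```

In this example a single well-chosen triple separates the assumed target from both competitors, whereas a triple such as 8-10-12 leaves the candidate list unchanged.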

Thus, we expected a main effect between conditions for correct rule discovery, in the following order: baseline < participant-generated < yoked < experimenter-supplied.

To gain further insights into how having a list of hypotheses and the source of that list affected rule-discovery performance, we also examined two additional performance measures across the four experimental conditions: time spent hypothesis testing and the extent to which participants retained and retested a hypothesis that had already been eliminated by a previous hypothesis test. Klahr and Dunbar (1988) found that spending time generating hypotheses before actually testing them reduced the tendency to retain hypotheses. In the current study, for participants who solved a task correctly, we expected the time spent hypothesis testing in the four conditions to be significantly different and ordered in the same way as we expected for the correct/incorrect measure. We expected the average number of retained hypotheses to be significantly different between each of the three conditions in which participants had a list of hypotheses before they began testing and the baseline condition in which participants did not begin with a list.

We also examined rule-discovery process variables. As indicated from our review of the literature, the presence of alternative hypotheses decreased the use of confirmation and led to more tests of multiple hypotheses. So we also expected the predicted performance ordering to hold for the number of trials that were scored as disconfirmations and alternative tests across the four conditions.

Our literature review also suggested that rule-discovery performance is affected by task characteristics such as the target rule, the initial instance given to participants as being consistent with the rule, and the relationship between a participant’s initial guess and the target rule. We achieved variance on the task characteristics dimension by using three different tasks and three different target rules for each of those tasks to manipulate the relationship between a participant’s initial best guess and the target rule.

Method

Participants



One hundred eight students enrolled in introductory psychology classes at the University of Minnesota participated. They received extra credit toward their grade in the course. Sixty-one of the participants were male, and 47 were female. Their ages ranged from 18 to 46 years, with a mean of 20.6 years. All participants were native English speakers.

Design


In each of the four hypothesis-generation conditions, we used a repeated-measures Greco-Roman Latin square to block the task and rule variables. A 3 x 3 Greco-Roman Latin square for tasks (A, B, C) and types of rules (1, 2, 3) resulted in three task-rule “sequences”: (a) A1-C3-B2, (b) B3-A2-C1, and (c) C2-B1-A3. This design balanced the order of both tasks and rules across participants. The four conditions and three sequences of tasks and rules yielded 12 condition-sequence combinations. We randomly assigned participants to one of those combinations. Thus, each participant worked on all three tasks and all three types of rules but always within the same hypothesis-generation condition. Each set of 12 condition-sequence combinations can be considered a “replicate,” and 9 replicates were run, for a total of 108 participants. Each participant was run individually in an experimental session that lasted up to 2 hr.
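The balancing properties of the three task-rule sequences can be checked mechanically. The Python sketch below uses the sequences exactly as listed above; the function name and the checks are our own illustrative assumptions about what "balanced" means here (each task and each rule once per sequence and once per serial position, and each task-rule pairing exactly once).

```python
SEQUENCES = ["A1-C3-B2", "B3-A2-C1", "C2-B1-A3"]

# Each cell is a (task, rule) pair; rows are sequences, columns are positions.
square = [[(cell[0], cell[1]) for cell in seq.split("-")] for seq in SEQUENCES]


def is_greco_latin(square):
    """Check the row, column, and pairing constraints of a Greco-Latin square."""
    n = len(square)
    tasks = {t for row in square for t, _ in row}
    rules = {r for row in square for _, r in row}
    columns = [list(col) for col in zip(*square)]
    lines = square + columns
    each_line_complete = all(
        {t for t, _ in line} == tasks and {r for _, r in line} == rules
        for line in lines
    )
    pairs = {cell for row in square for cell in row}
    return (
        len(tasks) == n
        and len(rules) == n
        and each_line_complete
        and len(pairs) == n * n  # every task-rule pairing occurs exactly once
    )


CONDITIONS = 4
REPLICATES = 9
combinations = CONDITIONS * len(SEQUENCES)
participants = combinations * REPLICATES

print(is_greco_latin(square))
print(combinations, participants)
```

The counts reproduce the 12 condition-sequence combinations and the 108 participants reported for the design.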


Tasks and testing procedure. One of the three tasks was the Wason (1960) 2-4-6 task.(1) As in previous research, participants in the present study were told that the experimenter had a specific rule in mind for describing sets of three numbers and that their task was to determine what that rule was. The difference between the present study and previous research was that we used three different target rules. Although there were different target rules, the starting position was always the same. Participants were told that one string of numbers consistent with the experimenter’s rule was “2-4-6.” On each trial, participants recorded three pieces of information. The first was their “current best guess” as to what they thought the experimenter’s rule was. Then they recorded the next string of three numbers they wanted to test, followed by the reason why they wanted to test that string of numbers (i.e., what they hoped to determine by testing the proposed string). After all three pieces of information were recorded, participants were told whether their string of numbers was consistent or inconsistent with the correct rule. A final column on the record sheet allowed space for the participant to record the experimenter’s response on each trial.

The second task was a variation on the 2-4-6 problem. The task was similar to one described by Klayman and Ha (1989). It required that participants discover a rule that described sets of three cities. Participants were shown a 50 in. x 33 in. Rand McNally map of the world that detailed cities of various sizes. They were told that their task was to discover the experimenter’s rule for describing sets of cities from around the world. To help them get started, we highlighted three cities consistent with the experimenter’s rule with white plastic paper clips. The cities were Brasilia, Asuncion, and Cordoba, all three of which are in South America. Participants recorded three pieces of information on the record sheets: their current best guess at the experimenter’s rule, the next set of three cities that they wanted to test, and the reason they wanted to test that set of cities. Additionally, participants were required to use colored paper clips to mark the three cities they wanted to test on each trial. Because participants spent most of the time looking at the map, using the markers avoided the problem of participants’ being forced to look back and forth between the record sheet and the map to see what cities they had tested.

The third task was designed to be similar in form to the concept-attainment task of Bruner et al. (1956), but we tried to make the problem more meaningful. We told participants to imagine that they were advertising directors in a company that had just invented a new product. Their task was to design ads for the new product. Participants were told that an ad consisted of four parts – type of media, position, influence style, and emotional appeal – and that there were a number of elements within each of the four categories. Participants were then shown the “advertising menu,” and their choices were explained to them. Participants were informed that the experimenter had a specific rule for determining which ads were successful and that their goal was to determine what ad element or combination of elements resulted in successful ads. Participants were required to record three pieces of information: their best guess at the key ad elements, the next ad they wanted to test, and the reason they wanted to test that particular ad. Participants were told that they would be given feedback on each trial regarding whether the ad they had designed was “successful” or “not successful.” After learning how the task was to be performed, participants were given more specific information about the “new product” for which they would be designing ads (the “auto message” – a device that displays ads in the rear windows of cars). To help the participants get started, they were told that one ad had already been run (a humorous ad in a prime spot of a general interest magazine) and that this was a “successful” ad.

In all three tasks, participants had up to 18 trials or 30 min to test their hypotheses. After 18 trials (or 30 min) participants were asked to record their final guess at the rule. However, participants could end the task before completing the 18 trials by announcing a guess. Participants were told that once they announced their guess, they would not be allowed to test any more strings, they would not be given any feedback about the correctness of their guess, and the task would be over. This was explained to the participants before they began each task. When participants indicated that they wanted to announce their guess, they were again reminded that after recording their guess, the task was over and they would not receive any feedback until the debriefing session.

We introduced the manipulation of not giving participants immediate feedback about the correctness of their guesses to encourage them to reason as carefully as possible and to avoid having them offer “casual” guesses and then use the experimenter’s feedback to revise their hypotheses. Once a participant had recorded a final guess, the instructions and record sheets were collected and the next task was introduced.

Rules. The value of a search strategy depends on the relationship between the hypothesized set and the target set. Because it was impossible to know participants’ initial hypotheses, we tried to establish embedded, overlapping, and surrounding conditions by using three different rules for each task that varied the size of the target sets (Klayman & Ha, 1987). For the 2-4-6 task, the typical initial hypothesis is “even numbers” or “numbers that increase by two” (Klayman & Ha, 1989). The three target rules used in the present study were “any set of numbers in ascending order” (embedded), “any set of single-digit numbers” (overlapping), and “any set of increasing consecutive even numbers that end in 2-4-6” (surrounding). For the cities task, Klayman and Ha (1989) found that with a similar set of cities, the typical initial hypothesis was “South American cities.” The three rules we used were “any cities on the North, Central, or South American land mass” (embedded), “any cities south of the Equator” (overlapping), and “South American cities below the Equator” (surrounding). Finally, although there was no strong indication as to participants’ initial hypothesis in the advertising task, we established different target set sizes with the three rules. The largest target set was established by the rule “any ad that uses the print media.” The next largest was “any ad that uses humor/fun,” and the smallest was “any general magazine ad that uses humor.” Note that the third rule allowed only one of the four ad dimensions (influence style) to vary, and thus, when that rule was used, a participant was able to design only three ads, other than the one initially provided, that would be “successful.”

Each participant performed all three tasks. However, because participants performed each task only once, each participant was given three of the possible nine rules. To handle this situation, the Greco-Roman Latin square design blocked the tasks and rules variables to ensure that each participant saw all three tasks and all three rule “sizes.”

Hypothesis-generation conditions. The baseline condition did not include any special hypothesis-generation instructions, nor were the participants given lists of hypotheses to test. Thus, once the task was explained and the initial example of a number string, a set of cities, or an ad consistent with the target rule was given, participants immediately began the first trial.

In the participant-generated condition, once the task was explained and the initial example was given, participants were asked to generate as many rules as they possibly could that would be consistent with the example. A sheet was provided for participants to record their list of possible rules. Participants were given 5 min to generate hypotheses.

In the yoked condition, after the initial example was given, participants received a list of hypotheses generated by one of the students in the participant-generated condition. The yoked participants were told that the list of rules was generated by another participant. As they read over the list, participants were reminded that the correct rule might or might not be included in the list and that the participant who generated the list may or may not have solved the task correctly. Yoked participants were given 5 min to look over the list and think about the task before beginning the first trial.

Finally, in the experimenter-supplied condition, for each task, participants were given a list of nine rules, including the correct rule. Participants were told that one of the nine rules was correct and that the rest were decoys. Once they had read over the list, clarification for any of the rules that they had questions about was provided, and then they began the first trial. As the same list was given to each participant no matter what the target rule, three of the nine possible rules for each task were the ones listed in the rules section corresponding to embedded, overlapping, and surrounding. We designed the remaining six rules for each task to cover a wide range of target set sizes. Finally, all nine rules included the initial example given for each task.

Dependent measures. The principal dependent measure was simply whether or not the participant discovered the correct rule. The time and the number of trials taken to arrive at the correct solution were also measured. We used each participant’s record sheet to determine the number of “confirmations,” “disconfirmations,” and “alternatives.” Another process measure was the extent to which participants retained (i.e., tested again) a hypothesis that had previously been eliminated.


We computed the percentage of trials of each of the three types, because the number of trials for each participant varied. The experimental design was a hybrid, in that it combined aspects of whole-plot and split-plot designs. It consisted of thirty-six 3 x 3 “squares.” The whole-plot part of the experiment involved analyzing differences between the sets of nine squares that were randomly assigned to each of the four conditions. The 36 squares together represented the split-plot part of the experiment. Each square was a Greco-Roman Latin square, where the columns were 3 different participants, and the rows were three different sequences of repeated measures on each participant. The sequences varied the order of tasks and rules that were presented to each participant. Thus, participants were nested within squares, and sequences were crossed with squares. We isolated orthogonal degrees of freedom for analyses.
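The size of the design and the whole-plot degrees of freedom can be worked through as simple arithmetic. The Python sketch below assumes that the whole-plot error term is the condition-by-replicate interaction; that assumption is ours, adopted because it reproduces the 3, 8, and 24 degrees of freedom that appear in the F ratios reported for conditions and replications.

```python
CONDITIONS = 4
REPLICATES = 9
PARTICIPANTS_PER_SQUARE = 3
TASKS_PER_PARTICIPANT = 3

squares = CONDITIONS * REPLICATES                   # thirty-six 3 x 3 squares
participants = squares * PARTICIPANTS_PER_SQUARE    # one column per participant
observations = participants * TASKS_PER_PARTICIPANT # one score per task

# Whole-plot partition (assumed): squares are the units of analysis.
df_whole_plot_total = squares - 1
df_conditions = CONDITIONS - 1
df_replicates = REPLICATES - 1
df_error = df_conditions * df_replicates  # condition x replicate interaction

assert df_conditions + df_replicates + df_error == df_whole_plot_total

print(squares, participants, observations)
print(df_conditions, df_replicates, df_error)
```

Under this assumption the 35 degrees of freedom among the 36 squares divide into 3 for conditions, 8 for replicates, and 24 for error.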

In addition to looking at the main effect for conditions, we could also use the design to determine if there was any systematic variance associated with the nine replications. Determining whether there were any interactions between task characteristics and the four experimental conditions required using the split-plot part of the design. We conducted a series of analyses of variance (ANOVAs), to check on main effects for tasks, rules, sequences and the interactions between conditions and each of the three variables.

Results


The overall F ratio for percentage correct among the four conditions was significant, F(3, 24) = 9.09, p < .001. (For the means for conditions, replications, and the other main-effect variables of interest, see Table 1. For the full ANOVA, see Table 2.) Planned comparisons revealed that the percentage correct was highest in the experimenter-supplied condition (M = .86), compared with the three other hypothesis-generation conditions – all values of t(320) > 4.55, p < .001 – but other planned comparisons were not significant, contrary to expectations. Thus, generating hypotheses or receiving them from another participant did not affect performance.

In the participant-generated condition, the average number of hypotheses generated was 8.3, SD = 3.0. The mean number of hypotheses generated was not significantly different among tasks, despite task differences in the number of possible hypotheses. The target rule was generated during the initial 5-min period for only 20 of the 81 tasks performed by participants in the participant-generated condition. The correlation between the number of hypotheses generated and correct performance was .03. Thus, generating more hypotheses did not increase the likelihood of solving tasks correctly. However, as expected, there was a significant relationship between whether a participant generated the target hypothesis before beginning to test hypotheses and the likelihood of a correct solution (chi-square = 4.02, p < .05, phi = .25).

[TABULAR DATA FOR TABLE 1 OMITTED]

Retained hypotheses (those retested after they had already been eliminated) were very infrequent across all four conditions, with all means less than 1 (baseline = .40, participant-generated = .33, experimenter-supplied = .18, yoked = .32). Generating or having a list of hypotheses did have a significant facilitative effect on hypothesis-testing performance, as compared with the baseline condition. When time spent on actually testing hypotheses was examined for those participants who solved tasks correctly in the four conditions, participants in the participant-generated (M = 17.78 min), experimenter-supplied (M = 16.67 min), and yoked (M = 17.79 min) conditions all required less hypothesis-testing time to reach the correct solution than did successful participants in the baseline condition (M = 21.5 min; all ts ≥ 3.13, p < .005). However, there was no significant difference between the baseline, participant-generated, and yoked conditions if one added the 5 min spent generating hypotheses and used time on task as the dependent variable.

[TABULAR DATA FOR TABLE 2 OMITTED]

To examine cognitive processes, each trial was scored as either a confirmation, disconfirmation, or alternative. For percentage of disconfirmations, the main effect for conditions was significant, F(3, 24) = 6.46, p < .001. The planned comparisons resulted in the same pattern of differences as that obtained for the performance data. That is, the experimenter-supplied condition (M = .40) yielded more disconfirmations than any other condition, t(320) > 5.75, p < .01, whereas we found no significant differences for the other conditions. For the percentage of alternatives tested, we found no significant main effect, F(3, 24) = 1.10, p > .05. The replication main effect was not significant for either variable, F(8, 24) < 1.0.

None of the two-way interactions between hypothesis-generation conditions and tasks, rules, and sequences for percentage correct was significant (all Fs(6, 192) < 1.52, p > .05). This result implies that the relationships found for the effects of the various conditions on hypothesis-testing performance generalize across tasks and rules.

The main effect between tasks for percentage correct was significant, F(2, 192) = 10.76, p < .001. A Tukey honestly significant difference (HSD) post hoc comparison revealed a significant difference (p < .05) between the 2-4-6 task (M = .46) and the advertising task (M = .72), but the cities task (M = .57) was not significantly different from the other two. Thus, the advertising task appeared to be the easiest.

Across all three types of rules, most of the participants’ initial guesses overlapped the target rules. Hence, for Rule 2 (overlapping), over 80% of the initial guesses were overlapping, as expected. However, only 51% of the initial guesses for Rule 3 (surrounding) were in the surrounding category, and only 35% of the initial guesses for Rule 1 were embedded. The main effect for rules was significant, F(2, 192) = 5.23, p < .01. Tukey HSD post hoc comparisons revealed that Rules 1 and 2 (M = .64) were significantly different from Rule 3 (M = .48) at the .05 level. Thus, the most restrictive rules, in terms of the number of cases they encompassed, were the most difficult to solve.

Sequences represent the order of presentation of tasks and rules. There was no a priori reason for expecting any order effects. In fact, we used the Graeco-Latin square design to distribute the effects of order. There was a main effect for sequences, F(2, 192) = 3.53, p < .05, on percentage correct. However, a follow-up Tukey HSD did not reveal any differences at the .05 level; the largest difference was between Sequence 1 and Sequence 2. Thus, the order of the combination of task and rule presentation did not appear to affect hypothesis-testing performance.
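A design of this kind can be illustrated with a short sketch. The Python code below builds one 3 x 3 Graeco-Latin square in which rows stand for sequences and columns for presentation positions, so that every task and every rule appears once in each sequence and once in each position, and every task-rule pairing occurs exactly once. The particular assignment shown is an assumption made for illustration; the square actually used in the study is not reported here, and the function and variable names are hypothetical.

```python
# Illustrative sketch only: the actual square used in the study is not reported.
TASKS = ["2-4-6", "cities", "advertising"]
RULES = ["Rule 1", "Rule 2", "Rule 3"]


def graeco_latin_square_3():
    """Superimpose two orthogonal 3 x 3 Latin squares.

    Cell (i, j) holds the index pair ((i + j) mod 3, (2i + j) mod 3):
    the first index selects a task, the second selects a rule.
    """
    n = 3
    return [[((i + j) % n, (2 * i + j) % n) for j in range(n)] for i in range(n)]


def describe(square):
    """Return one line per sequence listing its task-rule combinations in order."""
    lines = []
    for s, row in enumerate(square, start=1):
        cells = [f"{TASKS[t]} / {RULES[r]}" for t, r in row]
        lines.append(f"Sequence {s}: " + ", ".join(cells))
    return lines


square = graeco_latin_square_3()
for line in describe(square):
    print(line)
```

Because each task and each rule occupies each position exactly once, any effect of position is spread evenly over tasks and rules rather than being confounded with one of them.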

Discussion
Contrary to expectations, generating a list of hypotheses before beginning to test hypotheses did not significantly improve rule discovery. The “source” of the list of hypotheses did not affect rule-discovery performance, in that participants in the yoked condition did not outperform those in the participant-generated condition. The experimenter-supplied condition resulted in the most disconfirmations and the highest percentage correct, compared with the other hypothesis-generation conditions. Also, generating more hypotheses did not increase the likelihood of correct rule discovery. However, as expected, generating the target hypothesis was positively related to finding the correct solution. These findings generalized across tasks and rules, although, not surprisingly, the most restrictive rules were the most difficult to solve. Unlike Klahr and Dunbar (1988), we found no significant differences among the conditions with respect to retained hypotheses, although the differences were in the hypothesized direction.

Ineffective hypothesis generation may underlie hypothesis-testing inadequacies. Overall, the participants were ineffective at testing hypotheses independent of having to generate them. However, they were capable of isolating the target rule from a list of possible rules: 86% of the tasks performed by participants in the experimenter-supplied condition were solved correctly. Although the percentage of disconfirmations was highest in the experimenter-supplied condition, it was not necessary for participants in the experimenter-supplied condition to use disconfirmation to isolate the correct rule. That is, participants could have used the often observed direct testing or confirmation strategy and still have identified the target hypothesis.

One of the reasons that performance was so much better for participants in the experimenter-supplied condition was that they had been told that the correct rule was on the list they had been given and that their task was to isolate the correct hypothesis. Participants in the experimenter-supplied condition seemed much more oriented than the participants in the other two conditions toward eliminating hypotheses from their list. Participants in the experimenter-supplied condition were observed crossing or checking off rules that had been eliminated from their list until only one uneliminated rule remained.
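The crossing-off behavior described above amounts to a simple elimination procedure, which can be sketched in Python. The candidate list, the test triples, and all names below are hypothetical examples built around the 2-4-6 task; they are not the list that was given to participants. The sketch assumes only that the target rule is on the list and that each test receives accurate yes/no feedback.

```python
# Hypothetical candidate list for a 2-4-6 style task (not the study's materials).
CANDIDATES = {
    "increases by 2": lambda s: s[1] - s[0] == 2 and s[2] - s[1] == 2,
    "all even numbers": lambda s: all(n % 2 == 0 for n in s),
    "increasing order": lambda s: s[0] < s[1] < s[2],
    "any three numbers": lambda s: True,
}


def eliminate(candidates, triple, fits_target):
    """Cross off every rule whose prediction for `triple` contradicts the feedback."""
    return {name: rule for name, rule in candidates.items()
            if rule(triple) == fits_target}


def run_elimination(candidates, target_name, tests):
    """Apply each test in turn and return the names of the surviving rules."""
    target = candidates[target_name]
    remaining = dict(candidates)
    for triple in tests:
        feedback = target(triple)  # experimenter's yes/no answer
        remaining = eliminate(remaining, triple, feedback)
    return list(remaining)


survivors = run_elimination(
    CANDIDATES,
    target_name="increasing order",
    tests=[(2, 4, 6), (1, 2, 3), (3, 2, 1)],
)
print(survivors)
```

In this toy run the first triple eliminates nothing because every candidate predicts a "yes"; the later triples are informative only because the candidates disagree about them.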

On the other hand, it was not uncommon to see participants in the participant-generated and yoked conditions push their lists aside and never look at them again, once they had begun the actual hypothesis-testing process. The task structure created in the experimenter-supplied condition may have fostered an “elimination orientation” that resulted in a dramatic increase in the frequency of the use of disconfirmations. Future research should be conducted to examine the effects of inducing an elimination orientation in a hypothesis-generation condition. That might be accomplished by having participants try to determine if the correct hypothesis is on their list before allowing them to test any hypothesis that is not on their list, which might increase the use of disconfirmations and, in turn, increase the number of correct solutions.

If researchers are interested in producing generalizable results, they should study hypothesis testing in a range of hypothesis-testing tasks. Also, researchers should carefully study the effects of the relationship between the initial best guess and the target rule. In the future, instead of just using rules encompassing a range of instances, as we did, the experimenter could wait to establish the target rule until a participant had made an initial guess. The target rule would be chosen from a list of possible hypotheses, such that it established one of the three desired relationships. This would provide for a better test of Klayman and Ha’s (1987) hypotheses about the optimal search strategies, given the initial task structure.
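The proposed procedure, in which the target rule is fixed only after the initial guess is known, can be sketched as follows. The Python code treats each rule as the set of instances it accepts over a small finite universe and labels the relationship from the point of view of the guess: "embedded" when the guess is a proper subset of the target, "surrounding" when it is a proper superset, and "overlapping" when the two share some but not all instances. These definitions, the universe of triples drawn from 1 to 10, and the candidate rules are assumptions made for illustration.

```python
from itertools import product

# Assumed finite universe: all ordered triples of integers from 1 to 10.
UNIVERSE = list(product(range(1, 11), repeat=3))


def instances(rule):
    """Return the set of triples in the universe that satisfy `rule`."""
    return {s for s in UNIVERSE if rule(s)}


def relationship(guess, target):
    """Classify how the guessed rule relates to the target rule."""
    g, t = instances(guess), instances(target)
    if g == t:
        return "identical"
    if g < t:
        return "embedded"      # guess is a special case of the target
    if g > t:
        return "surrounding"   # guess is more general than the target
    return "overlapping" if g & t else "disjoint"


def choose_target(guess, candidates, desired):
    """Pick the first candidate that stands in the desired relationship to the guess."""
    for name, rule in candidates.items():
        if relationship(guess, rule) == desired:
            return name
    return None


# Hypothetical initial guess and candidate targets.
by_two = lambda s: s[1] - s[0] == 2 and s[2] - s[1] == 2
TARGETS = {
    "increasing order": lambda s: s[0] < s[1] < s[2],
    "all even numbers": lambda s: all(n % 2 == 0 for n in s),
    "consecutive even numbers": lambda s: by_two(s) and s[0] % 2 == 0,
}

for desired in ("embedded", "overlapping", "surrounding"):
    print(desired, "->", choose_target(by_two, TARGETS, desired))
```

Fixing the relationship in this way would guarantee that each participant actually faces the intended task structure, instead of relying on initial guesses happening to fall in the expected category.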

Although the current results did not show that generating hypotheses before hypothesis testing improved hypothesis-testing performance, they do suggest that participants’ poor hypothesis-generation abilities might be a barrier to hypothesis-testing performance. Those who generated the correct hypotheses before beginning the testing were more likely to solve the problem.

The relationship between hypothesis generation and hypothesis testing should be explored further. If a relationship between quality of hypothesis generation and hypothesis-testing performance can be further documented, then one way of improving hypothesis-testing performance would be to improve the quality of hypothesis generation.

Another outcome of the present research is that because of the use of a range of tasks and rules, it can be claimed with higher confidence that disconfirmation plays an important role in hypothesis-testing effectiveness. If subsequent investigations continue to show similar results, researchers should turn their attention to the question of how the use of disconfirmation can be promoted to help individuals test hypotheses and solve ill-defined problems.

Dennis Adsit expresses appreciation to Dr. John Campbell for his invaluable role as dissertation advisor. Manuel London prepared the manuscript for publication.

1 Details about the three tasks can be obtained from the author.

References
Abelson, R. P., & Levi, A. (1985). Decision making and decision theory. In G. Lindzey & E. Aronson (Eds.), The handbook of social psychology. New York: Random House.

Bruner, J. S., Goodnow, J., & Austin, G. A. (1956). A study of thinking. New York: Wiley.

Cohn, C. M. G. (1984). Creativity training effectiveness: A research synthesis. Dissertation Abstracts International, 45, 2501-A. (Order No. DA8424639)

Dunbar, K. (1993). Concept discovery in a scientific domain. Cognitive Science, 17, 397-434.

Einhorn, H. J., & Hogarth, R. M. (1986). Judging probable cause. Psychological Bulletin, 99, 3-19.

Farris, H. H., & Revlin, R. (1989). Sensible reasoning in two tasks: Rule discovery and hypothesis evaluation. Memory & Cognition, 17, 221-232.

Fischhoff, B., & Beyth-Marom, R. (1983). Hypothesis evaluation from a Bayesian perspective. Psychological Review, 90, 239-260.

Fisher, S. D., Gettys, C. F., Manning, C., Mehle, T., & Baca, S. (1983). Consistency checking in hypothesis generation. Organizational Behavior and Human Performance, 31, 233-254.

Gettys, C. F., Fisher, S., & Mehle, T. (1978). Hypothesis generation and plausibility assessment. Technical Report 15-10-78. Decision Processes Laboratory, University of Oklahoma.

Gettys, C. F., Manning, C., Mehle, T., & Fisher, S. (1980). Hypothesis generation: A final report of three years of research. Technical Report 15-10-80. Decision Processes Laboratory, University of Oklahoma.

Gorman, M. E., & Gorman, M. E. (1984). A comparison of disconfirmatory, confirmatory, and a control strategy on Wason’s 2-4-6 task. Quarterly Journal of Experimental Psychology, 36A, 629-648.

Hogarth, R. M. (1981). Beyond discrete biases: Functional and dysfunctional aspects of judgmental heuristics. Psychological Bulletin, 90, 197-217.

Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12, 1-48.

Klayman, J., & Ha, Y. W. (1987). Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94, 211-228.

Klayman, J., & Ha, Y. W. (1989). Hypothesis testing in rule discovery: Strategy, structure, and content. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 596-604.

Koehler, D. J. (1994). Hypothesis generation and confidence in judgment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 461-469.

Lopes, L. L. (1986). Aesthetics and the decision sciences. IEEE Transactions on Systems, Man, and Cybernetics, SMC-16, 434-438.

Mahoney, M. J., & DeMonbreun, B. G. (1977). Psychology of the scientist: An analysis of problem-solving bias. Cognitive Therapy and Research, 1, 229-238.

Mehle, T., Gettys, C. F., Manning, C., Baca, S., & Fisher, S. (1981). The availability explanation of excessive plausibility assessments. Acta Psychologica, 49, 127-140.

Mynatt, C. R., Doherty, M. E., & Tweney, R. D. (1977). Confirmation bias in a simulated research environment: An experimental study of scientific inference. Quarterly Journal of Experimental Psychology, 29, 85-95.

Mynatt, C. R., Doherty, M. E., & Tweney, R. D. (1978). Consequences of confirmation and disconfirmation in a simulated research environment. Quarterly Journal of Experimental Psychology, 30, 395-406.

Snyder, M., & Swann, W. B., Jr. (1978). Hypothesis testing in social interaction. Journal of Personality and Social Psychology, 36, 1202-1212.

Swann, W. B., Jr., Giuliano, T., & Wegner, D. M. (1982). Where leading questions can lead: The power of conjecture in social interaction. Journal of Personality and Social Psychology, 42, 1025-1035.

Trabasso, T., & Bower, G. H. (1968). Attention in learning. New York: Wiley.

Tukey, D. D. (1986). A philosophical and empirical analysis of participants’ modes of inquiry in Wason’s 2-4-6 task. Quarterly Journal of Experimental Psychology, 38A, 5-33.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.

Tweney, R. D., Doherty, M. E., Worner, W. J., Pliske, D. B., Mynatt, C. R., Gross, K. A., & Arkkelin, D. L. (1980). Strategies of rule discovery in an inference task. Quarterly Journal of Experimental Psychology, 32, 109-123.

Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129-140.

Wason, P. C. (1966). Reasoning. In B. M. Foss (Ed.), New horizons in psychology. Harmondsworth: Penguin.

Wetherick, N. E. (1962). Eliminative and enumerative behavior in a conceptual task. Quarterly Journal of Experimental Psychology, 14, 246-249.

COPYRIGHT 1997 Heldref Publications