The Marshmallow Test: Procedure, Results & Legacy

In the late 1960s, in a small testing room at the Bing Nursery School on the Stanford University campus, the psychologist Walter Mischel placed a single marshmallow in front of a child of about four years old, told the child the rules, and walked out. If the child waited until Mischel returned, roughly fifteen minutes later, the child could have two marshmallows. If the child could not wait, the child could ring a bell to summon Mischel and receive the one marshmallow now. The footage from these sessions — children covering their eyes, smelling the treat, talking to themselves, looking away, occasionally lunging for the marshmallow — has been replayed so often that the protocol itself has become a piece of popular culture.

The marshmallow test occupies a curious position in psychology. The basic finding — that young children differ enormously in their willingness to wait, and that those differences predict at least some later outcomes — is robust and theoretically interesting. The popular story built around the test, however, has often been more sweeping than the underlying data supported. A large 2018 conceptual replication and a body of related work have considerably narrowed the claims that can be made. The actual mechanism turns out to be less about a child's intrinsic willpower and more about how the situation is mentally represented and how trustworthy the surrounding environment has been.

Quick Facts About the Marshmallow Test

Initiated by Walter Mischel at Stanford in the late 1960s, continuing into the 1970s and 1980s
Original participants were children aged roughly 3 to 6 years from the Bing Nursery School
Standard protocol: one treat now or two treats after waiting about 15 minutes
The treat was not always a marshmallow; cookies, pretzels, and other foods were also used
Average wait time across studies has typically been about 6–8 minutes
Original follow-ups linked wait time to SAT scores and adolescent self-report measures
The 2018 Watts, Duncan and Quan study found a much smaller effect when controls were added
Mischel and colleagues' final framework was the cognitive-affective processing system, or CAPS

1. Historical and Intellectual Context

Walter Mischel had grown up in Vienna and emigrated as a child after the Anschluss. He arrived in the United States with a sensitivity to the ways that personality theories of the day, particularly psychoanalytic theories, gave abstract accounts of inner traits that struggled to predict actual behavior. His 1968 book Personality and Assessment laid out a striking empirical claim: when you actually measured behavior across situations, the cross-situational consistency of personality traits was much weaker than common-sense and Freudian-influenced personality theory implied. People behaved differently in different situations, and the broad "trait" predictions of the era did not hold up well.

That argument provoked the so-called person-situation debate of the late 1960s and 1970s. Mischel was not denying that individuals had stable patterns; he was arguing that the patterns were specific to classes of situations rather than global traits. To make this position constructive, he needed a research program that could capture stable behavioral signatures in carefully specified situations. The delay-of-gratification work was part of that program. A child sitting alone with a marshmallow is a particular situation; how the child handles it might reveal something stable and meaningful precisely because the situation is so well defined.

Behind the testing room sat a second intellectual tradition. Postwar developmental psychology had become preoccupied with the emergence of self-regulation — the capacity of a young child to override impulses in service of a delayed reward. Earlier work by Iris Litke, Bandura, and others had begun mapping the developmental trajectory of impulse control. Mischel's contribution was to translate this question into a single, measurable, observable behavior in a well-controlled setting.

The 1960s was also a period when the National Institute of Mental Health and the Office of Economic Opportunity were investing heavily in research that might inform interventions for children at risk of poor educational and behavioral outcomes. The delay-of-gratification work fit naturally into that funding environment: if self-regulation in early childhood predicted later success, then interventions targeting self-regulation might be productive. Some of Mischel's earliest work on delay of gratification was actually conducted in Trinidad in the 1950s, before he arrived at Stanford, and explored cultural and family influences on the willingness to wait.

2. Research Questions

The research questions Mischel pursued across decades clustered into three groups. The first was descriptive: how long can preschool children wait for a delayed reward, and how does this vary by age, by reward, and by features of the situation such as whether the reward is visible? The second was mechanistic: what cognitive strategies allow children to wait longer — distraction, transformation of the reward in imagination, self-instruction, attention deployment? The third was predictive: do individual differences in delay of gratification at four years old predict differences in life outcomes — school achievement, social competence, mental health — many years later?

Underneath these formal questions was a theoretical conjecture that Mischel and Janet Metcalfe would later articulate fully in 1999 as the hot/cool system framework. The conjecture was that successful delay does not come from "willpower" in the lay sense of effortful resistance to a salient temptation, but from how the child mentally represents the situation. If the child attends to the appetitive, "hot" features of the marshmallow — its texture, smell, taste — waiting becomes very hard. If the child can shift to a "cool," abstract representation — picturing the marshmallow as a cloud, a circle, or as part of a future plan — waiting becomes much easier.

The third question, the long-term predictive one, was politically and educationally charged. If a brief preschool measure predicted SAT scores a decade later, the implication was that early self-regulation was a leverage point for life chances. That implication has driven both enthusiasm for the test and the most pointed criticisms of how it has been popularized.

3. Method and Procedure

The Standard Protocol

In the canonical version of the test, a child was brought into a small testing room with a table, a chair, and few distractions. The experimenter introduced a treat, most often a marshmallow but sometimes a pretzel, an Oreo cookie, or an animal cracker. The child was told, in words to this effect: "I'm going to leave the room. If you can wait until I come back, you can have two marshmallows. If you can't wait, you can ring this bell, and I will come back, and you can have one." The experimenter left for up to about fifteen minutes (the exact ceiling varied across studies and conditions). The child's behavior in the interval was filmed or observed through a one-way mirror, with wait time recorded in seconds.

Experimental Variations

The protocol was the foundation for a long series of variations designed to manipulate the cognitive demands of waiting. In some conditions the reward was visible on the table; in others it was covered. In some conditions the child was instructed to think about the food itself; in others to think about the food in different ways (its color, its shape as if it were a cloud); in others still to think about something else entirely. Some children were given distracting toys; some were given no distractions. Children were sometimes presented with both the smaller immediate reward and the larger delayed reward; sometimes only one was visible. Each variation produced a small experiment, and across dozens of small experiments Mischel and his collaborators built a picture of which mental moves helped children wait.

Longitudinal Follow-Up

In the late 1980s, Mischel and colleagues conducted follow-up studies on the children who had originally been tested at the Bing Nursery School. Yuichi Shoda, Mischel, and Philip Peake reported in Developmental Psychology in 1990 that preschool delay time correlated with parental ratings of adolescent academic and social competence and with SAT verbal and quantitative scores. Further follow-ups extended into adulthood, examining outcomes such as body mass index, drug use, and self-report measures of self-control.

Other Reward Paradigms

Mischel's group also worked with self-imposed delay paradigms, the use of imaginal reward representations, and conditions in which children were asked to keep waiting when no actual reward was present. The marshmallow itself was not the point. The point was the situation of a small immediate reward versus a larger delayed reward, in a context where the child had to manage attention and emotion to bridge the gap.

The Kidd, Palmeri & Aslin Reliability Study

In 2013, Celeste Kidd, Holly Palmeri, and Richard Aslin published an important variant in which children's expectations about the reliability of the experimenter were experimentally manipulated before the marshmallow test. In one condition, the experimenter promised art supplies and then delivered them. In another, the experimenter promised art supplies and did not deliver. Children in the reliable condition waited dramatically longer on the subsequent marshmallow trial than children in the unreliable condition. This showed that delay behavior is sensitive to a child's read of how trustworthy the environment is — not just to internal capacity.

4. Participants and Setting

The Bing Nursery School

The Bing Nursery School was a campus preschool serving primarily the children of Stanford faculty, graduate students, and affiliated personnel. This means the original Mischel sample was demographically narrow: predominantly white children of highly educated, high-socioeconomic-status parents, with substantial cultural and material resources at home. Sample size in the original studies was typically in the low hundreds, with subsamples for individual experiments often smaller.

Age Range

Most participants were three to six years old. The youngest children typically could not wait at all and were sometimes excluded from analyses requiring meaningful variation in wait time. The four- and five-year-olds produced the richest behavioral variation.

Later Samples

Subsequent work extended into more diverse populations. The 2018 Watts, Duncan, and Quan study (discussed below) used a larger and more demographically diverse sample, though not a nationally representative one, from the National Institute of Child Health and Human Development Study of Early Child Care and Youth Development, with 918 children in the analysis. International work has explored delay of gratification in non-Western settings, with some studies in Cameroon and elsewhere showing higher average wait times among children from cultural backgrounds emphasizing patience and adult deference.

The Setting

The testing rooms were deliberately Spartan: a small table, a chair scaled to the child, a bell to ring, the reward, and minimal other distractions. The minimalism mattered because the manipulation under study was the child's mental relationship to the reward; visual or auditory distractors would have been confounds. Researchers observed through one-way mirrors and recorded behavior unobtrusively.

5. Results

Wait Times

Across the canonical studies, mean wait times sat in the range of six to eight minutes out of the fifteen-minute ceiling, with very large individual variation. Some children gave in within seconds; others waited the full fifteen minutes. Several behavioral patterns repeated across children: covering eyes, looking away from the marshmallow, singing or talking to oneself, kicking the table, smelling the treat without eating it, and taking a tiny bite from the underside and replacing the marshmallow.

Effects of Cognitive Strategy

The variations in attentional focus produced large differences. Children instructed to think about the marshmallow as if it were a fluffy cloud waited substantially longer than children encouraged to think about how it would taste. Children given the option to look at a picture of the reward, but not the reward itself, waited longer than those who could see the actual reward. Distracting toys helped. Self-instruction to wait helped. The effects were large in absolute terms.

The Visibility Effect

When both the small immediate reward and the large delayed reward were visible, children's average wait time fell substantially below conditions in which one or both were covered. Visibility of the reward heightened its "hot" representation and shortened delay.

Long-Term Correlations

The 1990 Shoda, Mischel, and Peake follow-up of 185 of the original Bing children reported a correlation of about 0.18 between preschool delay time and combined SAT scores ten years later, with similar small-to-moderate correlations with parent ratings of academic and social competence. The associations were strongest among children who had been tested in conditions where the reward was visible, that is, where the situation taxed cognitive control most heavily.

Subsequent follow-ups by Mischel's group and collaborators reported associations between early delay and adult body mass index, self-reported drug use, prefrontal cortex function during go/no-go tasks in midlife (Casey and colleagues, 2011), and various self-report measures of self-control. None of these later correlations was very large in absolute terms, and many were based on shrinking subsamples of the original cohort.

The 2018 Watts, Duncan and Quan Reanalysis

The most important quantitative reassessment was published by Tyler Watts, Greg Duncan, and Haonan Quan in Psychological Science in 2018. Using the much larger and more demographically diverse Study of Early Child Care and Youth Development sample, they tested whether wait time at about age four, measured on a version of the task capped at seven minutes, predicted academic achievement at age fifteen. Without controls, they found a positive association, though one only about half the size reported in the original studies. After adding controls for family background, early cognitive ability, and home environment, the association shrank by about two thirds and was no longer statistically significant for most outcomes. Their interpretation was that earlier delay behavior is partly a proxy for socioeconomic and cognitive variables that themselves predict later outcomes, and that the unique predictive power of the marshmallow test is much smaller than the popular narrative suggested.

6. The Researchers' Interpretation

Mischel's mature interpretation, developed across several decades and synthesized in his 2014 book The Marshmallow Test: Mastering Self-Control, treats delay of gratification not as a global personality trait but as a domain of mental skills. The hot/cool system framework, formalized with Janet Metcalfe in 1999, distinguishes a fast, affective, stimulus-driven "hot" system from a slower, reflective, knowledge-based "cool" system. Successful delay reflects the cool system's ability to suppress, redirect, or reframe what the hot system finds compelling.

Within this framework, the individual differences observed in the marshmallow room are differences in habitual cognitive strategy, not differences in raw willpower. Children who waited longer were not stoic; they were skilled at distracting themselves, at re-representing the reward, at using internal speech to support the goal. These skills could in principle be taught.

Mischel embedded the marshmallow work in a broader theoretical structure he and Yuichi Shoda called the cognitive-affective processing system, or CAPS, articulated in a 1995 Psychological Review paper. CAPS proposed that personality is best understood as a network of cognitive and affective units — encodings, expectancies, goals, and self-regulatory plans — whose activation depends on features of the situation. The marshmallow test is then a particular situation that recruits a particular network of cognitive-affective units in each child, producing characteristic behavior.

The long-term predictive findings were interpreted cautiously by Mischel himself and more enthusiastically by popularizers. He emphasized that the test was a snapshot of strategy and skill in a particular situation, that the correlations were modest, and that the more important practical implication was that self-regulation skills could be taught at any age. His view was always more sophisticated than the "two marshmallows means a better life" reading that the public discussion produced.

7. Modern Reanalyses and Criticisms

The Replication Refinement

The 2018 Watts, Duncan and Quan study was not a failed replication in the usual sense. A raw association between preschool delay and later achievement did reappear, although at a smaller size than in the original studies, so the marshmallow test does predict later outcomes in raw terms. What the study showed is that much of the predictive power is attributable to the same family and cognitive variables that the marshmallow behavior itself partly reflects. Once those variables are controlled, the unique contribution of preschool delay is small.

This is now the standard interpretation. The marshmallow test is a real behavior that correlates with real later outcomes, but it is best understood as an index of an environment and capacities that include far more than the willpower of a four-year-old. A child in a household where adults reliably keep promises, where food is consistently available, where stimulation and cognitive support are high, will tend both to wait longer for the marshmallow and to score better on later achievement tests. The marshmallow does not cause the achievement; both are downstream of the same upstream conditions.
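The common-cause logic described above can be illustrated with a small simulation. This is only a sketch on invented data: the variable names, the coefficients (0.6 and 0.1), and the sample size are assumptions chosen for illustration, and a partial correlation stands in for the regression controls used in the published analysis. It shows how a raw correlation between delay and achievement can shrink sharply once a shared background factor is held constant.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(n=20000, seed=0):
    """Invented data: 'background' drives both delay and achievement.

    The coefficients are hypothetical. Delay has only a weak direct
    link (0.1) to achievement; most of the shared signal comes from
    the common background factor.
    """
    rng = random.Random(seed)
    background = [rng.gauss(0, 1) for _ in range(n)]
    delay = [0.6 * b + rng.gauss(0, 1) for b in background]
    achievement = [
        0.6 * b + 0.1 * d + rng.gauss(0, 1)
        for b, d in zip(background, delay)
    ]
    raw = pearson(delay, achievement)
    adjusted = partial_corr(delay, achievement, background)
    return raw, adjusted


raw_r, adjusted_r = simulate()
print(f"raw correlation:      {raw_r:.3f}")
print(f"controlling for bg:   {adjusted_r:.3f}")
```

In this toy setup the adjusted correlation comes out well under half the raw one, which mirrors the direction of the published finding without reproducing its numbers.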

The Reliability Mechanism

The Kidd, Palmeri and Aslin reliability study is the empirical centerpiece of the new interpretation. If children's behavior on the test is sensitive to the experimenter's recent track record of keeping promises, the test is partly a rational decision under uncertainty rather than a measure of impulse control. A child who has learned that adults rarely deliver on promised rewards is making a defensible choice in eating the marshmallow now. A child who has learned that adults reliably deliver is making a defensible choice in waiting.

This reframing has implications. It suggests that early environmental reliability — broadly construed to include food security, household stability, and adult follow-through — shapes the very behavior the test measures. Self-control, in this view, is not an internal asset that some children have and others lack. It is a strategy adapted to the environment a child has experienced.
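The "rational decision under uncertainty" reading can be made concrete with a simple expected-value comparison. The sketch below is an assumption-laden toy, not a model taken from the reliability study: the function names are hypothetical, it treats treats as the only payoff, it ignores the discomfort of waiting, and it assumes a broken promise leaves the child with nothing.

```python
def expected_treats_if_waiting(p_delivery: float, delayed: int = 2) -> float:
    """Expected number of treats from waiting.

    Assumes (for illustration only) that a broken promise yields nothing.
    """
    if not 0.0 <= p_delivery <= 1.0:
        raise ValueError("p_delivery must be a probability between 0 and 1")
    return p_delivery * delayed


def better_choice(p_delivery: float, immediate: int = 1, delayed: int = 2) -> str:
    """Return 'wait' when waiting has the higher expected payoff."""
    if expected_treats_if_waiting(p_delivery, delayed) > immediate:
        return "wait"
    return "eat now"


# A child who believes adults usually deliver versus one who does not.
for belief in (0.9, 0.3):
    print(f"P(adult delivers) = {belief:.1f} -> {better_choice(belief)}")
```

Under these assumptions the break-even point is a delivery probability of one half: below it, taking the single treat now is the better bet, which is the sense in which eating the marshmallow can be a defensible choice.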

Socioeconomic Confounds

Subsequent analyses have emphasized that performance on the marshmallow test correlates with family socioeconomic status. Children from lower-income families on average wait less time, which is consistent both with the reliability interpretation and with cognitive correlates of socioeconomic stress. This makes the test, in part, a measure of cumulative environmental advantage rather than a culture-fair measure of self-control.

Effect Size Concerns

Even the original Mischel correlations were modest by absolute standards. A correlation of 0.18 between preschool delay and SAT scores explains roughly three percent of the variance in SAT performance. This is interesting, but it is not the deterministic relationship implied by popular accounts. The marshmallow test is a contributor to a complex story, not the story itself.
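The "roughly three percent" figure follows from squaring the correlation. A minimal sketch of that arithmetic, using the 0.18 value quoted above (the function name is hypothetical):

```python
def variance_explained(r: float) -> float:
    """Share of variance accounted for by a linear relationship,
    given the Pearson correlation r between two variables."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


r_delay_sat = 0.18  # correlation quoted in the text
share = variance_explained(r_delay_sat)
print(f"r = {r_delay_sat:.2f} -> r^2 = {share:.4f} ({share:.1%} of variance)")
```

Put the other way around, about 97 percent of the variation in SAT scores in that sample is left unaccounted for by preschool delay time.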

What Has Not Been Overturned

It is worth being clear about what has held up. The basic cognitive findings about strategy — that distraction helps, that reframing the reward in cool terms helps, that visibility of the reward hurts — are robust and theoretically important. The CAPS framework remains a useful description of how cognitive-affective units interact with situations. The general claim that early self-regulation matters in development is well supported by other measures, including parent and teacher ratings, executive function tasks, and effortful control measures. What has been challenged is the specific claim that brief preschool delay behavior is a strong, independent predictor of later life outcomes after controlling for what else the behavior carries with it.

8. Ethical Considerations

Compared to many landmark studies, the marshmallow test is ethically modest. The intervention is brief, the worst outcome for a child is a few minutes of mild frustration followed by a small reward, and parental consent has been obtained throughout the longitudinal program. There has been no deception of substance: children know that they are waiting for a reward, and the experimenter does in fact return.

That said, there are subtler ethical questions. The first is the cultural framing of the test as a measure of "willpower" or "character." When the popular discussion attaches sweeping value judgments to a four-year-old's behavior in a fifteen-minute laboratory test, the moral consequences can outrun the science. Schools, parents, and policy makers may infer that children who eat the marshmallow are deficient in some essential virtue. The reliability literature suggests this inference is wrong — and stigmatizing children based on a single behavioral measure is itself an ethical concern.

A second concern is the use of brief behavioral measures for screening or selection. There have been proposals to use delay-of-gratification measures or related self-control measures as components of early educational assessment. The replication evidence suggests such use should be cautious; the test does not have the predictive purchase that selection use would require, and it carries the same confounds that make it socioeconomically loaded.

Third, the international and cross-cultural application of the test raises validity questions. Children in cultures with different norms about food, adult authority, and impulse expression may show systematically different behavior on the test for reasons unrelated to the construct of self-control. This is a measurement-fairness concern that has not always been respected in popular accounts.

9. Influence on Psychology

The Self-Control Literature

The marshmallow test became one of the central reference points for the broader psychological literature on self-control, which has grown into a major research area with substantial overlap with developmental, social, and health psychology. Roy Baumeister's work on ego depletion, Angela Duckworth's work on grit, and Terrie Moffitt and colleagues' Dunedin study findings on childhood self-control all sit downstream of the question Mischel made tractable.

Cognitive-Affective Processing

The CAPS framework has influenced how personality and social psychologists conceptualize the relationship between internal mental units and external situations. The general point — that personality is better understood as a network of context-sensitive cognitive-affective processes than as a collection of context-free traits — has been broadly assimilated, though it has also coexisted uneasily with the Big Five trait tradition that remains dominant in personality psychology.

Educational Implications

Educational programs have drawn on the marshmallow research to develop curricula targeting executive function and self-regulation in early childhood. Programs like Tools of the Mind, drawing on Vygotskian theory and self-regulation research, train children in attentional control, working memory, and impulse management. These programs do not depend on the predictive validity of the marshmallow test itself; they depend on the broader case that self-regulation is teachable and consequential.

Behavioral Economics

The marshmallow test paradigm has been used in behavioral economics work on intertemporal choice and discounting. Adults' patterns of discounting future rewards relative to immediate ones, measured in much more sophisticated ways than a single marshmallow, have been linked to substance use, savings behavior, and health outcomes. Part of the conceptual lineage runs back to Mischel.

Neuroscience

Functional neuroimaging studies of self-control and delay behavior have implicated a network including the prefrontal cortex, anterior cingulate, and striatum. Casey and colleagues' 2011 follow-up of Mischel's original children using fMRI showed that those who had been low delayers as preschoolers showed differences in ventral striatum response to social cues forty years later. The effect sizes were small and the sample had shrunk, but the integration of behavioral, developmental, and neural data is a model for the kind of long-arc psychological neuroscience the field aspires to.

10. What the Experiment Means Today

The marshmallow test in 2026 is a more modest, more interesting, and more accurate scientific object than it was in 2010. The popular narrative — that a child's ability to wait for a marshmallow at four predicts their success in life — has been substantially weakened. The actual story is more nuanced: delay behavior in young children reflects the environment they have been raised in, the cognitive strategies they have learned, and the reasonableness of their expectations that promises will be kept. It is a moment in a developmental conversation, not a verdict.

This shift is exactly the kind of refinement that good science is supposed to produce. The original studies identified a real behavioral phenomenon with real cognitive correlates. Subsequent replications and reanalyses identified the limits of the original interpretation. The result is not a discredited study, but a deeper understanding of what the study measures.

For parents and educators, the practical implication is to keep promises, to provide reliable environments, to teach attentional strategies, and to avoid placing too much moral or predictive weight on any single behavioral observation in a child. For researchers, the implication is to treat the marshmallow test as one of many converging measures of self-regulation, not as the gold standard. For policy makers, the implication is that early-childhood interventions that address the underlying conditions of life — food security, parenting support, stable caregiving — are likely to do more for later self-regulation than interventions targeting the marshmallow behavior itself.

And for the children whose behavior in the Bing Nursery School room launched a research program: the better account is not that some of them had character and others did not. The better account is that some of them, on that particular day, in that particular room, with that particular history of experience with adults, found a way to bridge fifteen minutes for the promise of an extra marshmallow. That is a remarkable feat of cognitive coordination in a four-year-old, and the others were not failing — they were responding sensibly to whatever world they had been living in.

Conclusion

The marshmallow test is one of the most photographed, most discussed, and most misunderstood experiments in modern psychology. Walter Mischel and his collaborators did not set out to produce a national parable about willpower. They set out to study how young children regulate attention and emotion in the face of a temptation they could not yet name. The behavior they captured was real, the cognitive findings about strategy were robust, and the framework they built around the work — the hot/cool system, the cognitive-affective processing system — has influenced personality and developmental psychology in lasting ways.

What has changed in the past decade is the size of the claim attached to the test. Modern reanalyses, particularly the 2018 Watts, Duncan, and Quan study, have shown that the unique predictive power of preschool delay behavior is much smaller than the popular narrative implied. The Kidd, Palmeri, and Aslin reliability study has shown that delay behavior is sensitive to whether the surrounding environment has earned the child's trust. Self-control is not a single internal resource installed at four; it is a strategy adapted to a world.

The marshmallow test now functions less as a verdict on individual children and more as a window onto an interaction. The interaction is between a young brain learning to manage its own attention, a particular situation that taxes that learning, and an environment that may or may not have given the child reasons to expect that waiting pays off. That is a less catchy story than the original headlines, but it is closer to the truth — and the truth, here, is more humane than the myth.