Medical AI · Clinical Breakthrough · Jun 21, 2026, 8:44 AM · 6 min read

AI Reasoning Model Outperforms Human Physicians in Complex Diagnostic Tests

A landmark study published in Science reveals that OpenAI's new reasoning model matched or exceeded expert doctors in emergency room triage and complex clinical decision-making.

By Factlen Editorial Team

Clinical AI Researchers 40% · Practicing Physicians 40% · Healthcare Administrators 20%

Clinical AI Researchers: Argue that the architectural shift to chain-of-thought reasoning represents a historic milestone that is ready for prospective clinical trials.
Practicing Physicians: Express cautious optimism about AI reducing documentation burdens, but warn that models can still converge prematurely on incorrect diagnoses.
Healthcare Administrators: Focus on the technology's potential to improve triage efficiency, reduce emergency room bottlenecks, and lower the rate of costly misdiagnoses.

What's not represented

· Medical Malpractice Insurers
· Patient Privacy Advocates

Why this matters

Medical misdiagnosis affects millions of patients annually, often due to the intense time pressure placed on emergency room staff. The validation of an AI system capable of reliable, step-by-step clinical reasoning paves the way for an always-available second opinion that could catch life-threatening conditions before it is too late.

Key points

A new study in Science shows OpenAI's o1 model outperformed human doctors in complex clinical reasoning tests.
The AI achieved 67.1% accuracy during initial emergency room triage, beating two expert attending physicians.
The model uses 'chain of thought' reasoning to break down symptoms step-by-step and correct its own logical errors.
In management reasoning tasks (deciding the next steps for patient care), the AI's median score was 89%, compared with 34% for physicians using conventional resources.
Researchers stress the AI is not meant to replace doctors, but to serve as a highly reliable second opinion.
The medical community is now calling for urgent prospective clinical trials to test the AI in live hospital environments.

67.1%: AI accuracy at initial ER triage
89%: AI median score on management reasoning
98%: Cases where AI achieved a perfect clinical logic score
78.3%: AI accuracy on complex NEJM medical puzzles

For more than six decades, the medical community has chased a singular technological holy grail: a computing system capable of the nuanced, step-by-step logic required to diagnose complex human diseases. That milestone appears to have been reached. According to a landmark study published this week in the journal Science, an advanced artificial intelligence model has successfully outperformed human physicians across a battery of rigorous clinical reasoning tests. The research, led by teams at Harvard University and Stanford University, demonstrated that OpenAI's o1-preview model can conduct real-world emergency room triage, recommend diagnostic tests, and manage patient care at a level that matches or exceeds expert attending physicians.[1][2]

The breakthrough represents a fundamental architectural shift in how artificial intelligence processes medical data. Previous generations of large language models, such as GPT-4, were optimized for rapid pattern recognition and information retrieval, essentially acting as highly sophisticated search engines. The o1 model, however, uses a chain-of-thought reasoning process. Instead of generating an immediate answer, the system generates hidden reasoning tokens, breaking down a patient's symptoms, exploring multiple diagnostic pathways in parallel, and actively backtracking if it identifies a logical flaw in its own assessment.[1][7]

To test this new capability, the researchers moved beyond theoretical puzzles and introduced the AI to the chaotic reality of emergency medicine. The team utilized blinded data from 76 real-world patient encounters at Beth Israel Deaconess Medical Center in Boston. The AI was fed information in the exact increments that a human doctor would receive it: first the initial intake notes from a triage nurse, then the physician's evaluation, and finally the decision to admit the patient to the medical floor or intensive care unit.[2][4]
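The staged disclosure described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the stage labels, the `staged_prompts` function, and the choice to make each stage cumulative (earlier notes stay visible) are assumptions made here for clarity.

```python
# Hypothetical stage labels, in the order the article says information arrived.
STAGES = ("triage_intake", "physician_evaluation", "admission_decision")


def staged_prompts(notes_by_stage: dict) -> list:
    """Return (stage, text_visible_so_far) pairs, one per stage.

    Assumes each stage adds to what was already known rather than replacing it.
    """
    visible = []
    prompts = []
    for stage in STAGES:
        visible.append(notes_by_stage[stage])
        prompts.append((stage, "\n".join(visible)))
    return prompts


# Placeholder text only; no real patient data.
example = {
    "triage_intake": "nurse intake note",
    "physician_evaluation": "physician evaluation note",
    "admission_decision": "admission note",
}
prompts = staged_prompts(example)
```

Scoring the answer given at each stage separately is what allows accuracy at triage to be compared with accuracy at admission.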

The results at the earliest stage of care were particularly striking. During initial triage—the high-stakes moment when medical professionals must make critical decisions with the least amount of information—the o1 model identified the exact or a very close diagnosis 67.1 percent of the time. In contrast, two expert attending physicians reviewing the exact same intake data achieved accuracies of 55.3 percent and 50.0 percent.[1][4]

The o1 reasoning model outperformed human physicians during the earliest and most uncertain stages of emergency triage.

As more clinical information became available later in the care pipeline, both human and machine performance improved, but the AI maintained its lead. Upon the patient's admission to the medical floor or intensive care unit, the o1 model reached an 81.6 percent accuracy rate, compared to 78.9 percent and 69.7 percent for the two human physicians. The researchers noted that the AI was particularly adept at catching subtle indicators that human clinicians, operating under the immense time pressures of a modern emergency department, occasionally overlooked.[3][5]
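Because the sample is only 76 encounters, it can help to translate the reported percentages back into case counts. The sketch below does that arithmetic. The counts are inferred from the percentages and the sample size quoted in this article, not taken from the study, and the rater labels (`physician_1`, `physician_2`) are hypothetical names used only to keep the two physicians apart.

```python
N_CASES = 76  # real-world encounters, as reported in the article

# Reported accuracy in percent. Rater labels are hypothetical; the article
# lists two physicians per stage without naming them.
REPORTED = {
    "triage": {"o1": 67.1, "physician_1": 55.3, "physician_2": 50.0},
    "admission": {"o1": 81.6, "physician_1": 78.9, "physician_2": 69.7},
}


def implied_count(percent, n=N_CASES):
    """Nearest whole number of cases matching a reported percentage."""
    return round(percent / 100 * n)


def round_trip(percent, n=N_CASES):
    """True if the implied count reproduces the percentage to one decimal."""
    return round(implied_count(percent, n) / n * 100, 1) == percent


# Inferred counts per stage and rater (an assumption, not study data).
counts = {
    stage: {who: implied_count(p) for who, p in row.items()}
    for stage, row in REPORTED.items()
}
consistent = all(round_trip(p) for row in REPORTED.values() for p in row.values())
```

Under these assumptions, the triage figures imply roughly 51 correct cases for the model against 42 and 38 for the two physicians, and the admission figures imply 62 against 60 and 53.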

In one notable case highlighted by the study authors, an immunosuppressed patient who had recently received an organ transplant presented to the emergency room with respiratory symptoms. While the treating physician initially pursued a standard respiratory diagnostic path, the AI model flagged the subtle presentation of a necrotizing soft tissue infection—a life-threatening condition requiring immediate surgical intervention—considerably earlier in the patient's stay.[3]

Beyond real-world triage, the researchers subjected the model to the ultimate academic stress test: the clinicopathological conferences published by the New England Journal of Medicine. These complex, multi-layered medical mysteries have served as the gold standard for evaluating expert medical computing systems since the 1950s. Across 143 of these notoriously difficult cases, the o1 model included the correct diagnosis in its differential list 78.3 percent of the time. When the criteria expanded to include potentially helpful or very close diagnoses, the model's accuracy soared to 97.9 percent.[1][3]
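The "included the correct diagnosis in its differential list" measure is an inclusion rate: a case counts as a hit if the reference diagnosis appears anywhere in the model's list. A minimal sketch follows, with loud caveats: the function names and the toy cases are invented for illustration, and exact string matching is a simplification. The "very close" and "potentially helpful" categories mentioned above imply graded judgments that a string comparison cannot capture.

```python
def includes_reference(differential, reference):
    """True if the reference diagnosis appears anywhere in the differential.

    Uses case-insensitive exact matching, which is a simplifying assumption.
    """
    wanted = reference.strip().lower()
    return any(item.strip().lower() == wanted for item in differential)


def inclusion_rate(cases):
    """Share of cases whose reference diagnosis is in the differential list."""
    hits = sum(includes_reference(d, r) for d, r in cases)
    return hits / len(cases)


# Toy cases with placeholder labels; not drawn from the study.
toy_cases = [
    (["condition A", "condition B", "condition C"], "condition B"),
    (["condition D", "condition E"], "Condition D"),
    (["condition F", "condition G"], "condition H"),
]
rate = inclusion_rate(toy_cases)  # two of the three toy cases are hits
```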

Diagnostic accuracy is only half of a physician's job; the other half is management reasoning, which involves deciding the next best step for a patient. This includes determining which laboratory tests to order, whether to prescribe antibiotics, and how to navigate sensitive end-of-life care conversations. Clinical fellows note that management reasoning is often far more complex than initial diagnosis, as it requires weighing the risks, benefits, and logistical realities of various interventions.[2][5]

In a series of five complex management reasoning vignettes, the performance gap between human and machine widened significantly. The o1 model achieved a median score of 89 percent. Human physicians relying on conventional medical resources and search engines scored a median of just 34 percent. Even physicians who were given access to the older GPT-4 model to assist them only managed a median score of 41 percent, underscoring the massive leap in capability between standard language models and dedicated reasoning engines.[1][5]

Management reasoning—deciding the next steps for patient care—saw the widest performance gap between the reasoning model and human baselines.

The AI also demonstrated an unprecedented ability to document and explain its clinical logic. Using the Revised-IDEA score—a validated 10-point scale for evaluating how well a clinician articulates their diagnostic thinking and justifies their next steps—the o1 model achieved a perfect score in 98 percent of the cases it reviewed. Attending physicians achieved a perfect score on the same metric only 35 percent of the time. This suggests that reasoning models could eventually alleviate the massive documentation burden that contributes to physician burnout.[3][4]

Despite the overwhelmingly positive data, the medical community remains cautious about the immediate deployment of autonomous AI in clinical settings. A separate study published recently in JAMA Network Open evaluated 21 different AI models across the diagnostic process and identified a consistent vulnerability: when faced with extreme uncertainty and competing possibilities, large language models can sometimes converge prematurely on a single diagnosis, failing to keep their differential broad enough.[3][7]

Unlike older language models, reasoning models break down symptoms step-by-step and can backtrack if they detect a logical flaw.

The authors of the Science study explicitly agree with this limitation, emphasizing that their findings do not suggest AI systems are ready to practice medicine independently or replace human doctors. Medicine is fundamentally a human endeavor that requires physical examination, empathy, and the ability to read non-verbal cues—inputs that a text-based reasoning model cannot perceive.[4][5]

Instead, researchers envision these advanced models serving as an always-available, highly reliable second opinion. In a crowded emergency room at 3:00 AM, an AI system running quietly in the background of an electronic health record could continuously analyze incoming lab results and triage notes, gently nudging a fatigued physician to consider a rare complication or order a specific blood test that might otherwise be delayed.[2][6]

Researchers emphasize that AI will not replace doctors, but rather run in the background of health records to catch subtle anomalies.

The immediate next step, according to the study's lead authors, is the urgent initiation of prospective, randomized clinical trials. While retrospective data on past cases is highly encouraging, the medical field needs to understand how human physicians and AI reasoning models interact in real time. Researchers must determine whether doctors will trust the AI's suggestions, whether the system will slow down or speed up clinical workflows, and how to safely integrate these tools into existing hospital infrastructure.[1][3]

For decades, the promise of artificial intelligence in healthcare has been characterized by high hopes and incremental realities. The validation of a reasoning model that can consistently match the diagnostic acumen of expert physicians marks a definitive turning point. As these systems move from academic benchmarks to clinical trials, the focus shifts from whether AI can understand medicine to how quickly it can be safely deployed to save lives.[2][6]

How we got here

1950s: The medical community begins searching for computational systems capable of complex diagnostic reasoning.
Late 2024: OpenAI releases the o1 model, introducing 'chain of thought' reasoning capabilities to the public.
April 2026: A JAMA Network Open study highlights that large language models can sometimes converge prematurely on incorrect diagnoses.
May 2026: Science publishes a landmark study showing the o1 model outperforming human physicians in real-world ER triage and complex medical puzzles.

Viewpoints in depth

Clinical AI Researchers

Focus on the architectural leap of reasoning tokens and the need for immediate clinical trials.

Computer scientists and medical researchers argue that the shift from pattern-matching language models to true reasoning engines represents a historic inflection point. By utilizing hidden 'reasoning tokens,' these new models can simulate the deductive logic of a human expert, exploring multiple hypotheses and backtracking when they encounter contradictory evidence. Researchers emphasize that the technology has now saturated retrospective benchmarks, making prospective, randomized clinical trials in live hospital environments the urgent next step.

Practicing Physicians

Cautiously optimistic about reducing burnout, while emphasizing the irreplaceable nature of human touch.

Frontline doctors acknowledge the immense potential of having a tireless, highly accurate second opinion available during grueling overnight shifts. Many are particularly excited by the model's perfect scores in clinical documentation, hoping the technology could eventually automate the crushing paperwork burden that drives physician burnout. However, they caution that medicine is not just a data problem; it requires physical examinations, reading non-verbal cues, and delivering empathetic care—skills no algorithm possesses.

Patient Safety Advocates

Highlight the potential to reduce misdiagnoses, while warning against the dangers of automation bias.

Advocacy groups focused on healthcare quality view the technology as a vital tool for reducing the millions of diagnostic errors that occur annually, particularly in high-stress emergency departments. They argue that an AI running in the background could serve as a crucial safety net, catching subtle anomalies that exhausted humans might miss. Yet, they also warn of 'automation bias'—the risk that overworked doctors might begin to blindly trust the AI's recommendations without applying their own critical judgment.

What we don't know

How the AI will perform in live, prospective clinical trials where its suggestions actively influence a doctor's workflow.
Whether the introduction of AI second opinions will speed up emergency room triage or slow it down due to the need for human verification.
How medical malpractice liability will be handled if a physician ignores a correct AI diagnosis, or follows an incorrect one.

Key terms

Chain of Thought Reasoning: An AI processing method where the model breaks down a complex problem into smaller, logical steps before providing a final answer.
Differential Diagnosis: A list of possible conditions or diseases that could be causing a patient's symptoms, ranked by probability.
Management Reasoning: The clinical decision-making process regarding how to treat a patient, including which tests to order and which therapies to apply.
Triage: The process of quickly examining patients who are taken to a hospital to decide which ones are the most seriously ill and must be treated first.

Frequently asked

Will this AI replace human doctors?

No. Researchers emphasize that medicine requires physical examination and empathy. They envision the AI acting as a highly reliable, always-available second opinion that supports doctors.

How is the o1 model different from older AI?

Unlike older models that act like fast search engines, the o1 model uses 'chain of thought' reasoning. It thinks step-by-step, explores multiple possibilities, and can correct its own logical errors before answering.

Was the AI tested on real patients?

Yes. In addition to academic medical puzzles, the AI was tested on blinded data from 76 real-world emergency room patients at a major Boston hospital.

What is management reasoning?

Management reasoning is the process of deciding what to do next after a diagnosis, such as ordering specific lab tests, prescribing medication, or planning long-term care.

Sources

[1] Science (Clinical AI Researchers). Performance of a large language model on the reasoning tasks of a physician.
[2] Harvard Magazine (Clinical AI Researchers). AI model outperforms doctors in clinical reasoning tests.
[3] MedPage Today (Practicing Physicians). AI Outperforms Physicians in Diagnostic Reasoning Study, Though Debate Continues.
[4] R&D World (Healthcare Administrators). Study shows LLMs can diagnose ER patients more accurately than physicians.
[5] ASCO AI (Practicing Physicians). A large language model outperformed physicians in multiple clinical tasks.
[6] News-Medical (Clinical AI Researchers). AI model outperforms doctors in clinical reasoning tests.
[7] JAMA Network Open (Practicing Physicians). Evaluation of Large Language Models in Diagnostic Reasoning.
