A new study has sparked significant discussion about the capabilities of large language models in medical settings, particularly in the emergency room: its findings indicate that an AI model outperformed human doctors in some instances.
The groundbreaking research, published in Science, was conducted by a collaborative team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. They undertook systematic experiments to evaluate the performance of OpenAI’s models against human physicians in a clinical environment.
In one particularly focused experiment, the researchers analyzed 76 patients who arrived at Beth Israel’s emergency room. They compared the diagnoses made by two attending physicians trained in internal medicine with those generated by OpenAI’s o1 and o4 models. These diagnoses were evaluated by two additional attending physicians who were blinded to which came from the humans and which were produced by the AI.
Results from the study revealed that o1 matched or slightly surpassed the accuracy of the human doctors at various stages of diagnosis. The disparity was especially notable during initial ER triage, a phase with minimal patient information in which swift, accurate decision-making is critical. In triage cases, o1 produced the exact diagnosis, or a very close one, 67% of the time; by comparison, one attending physician reached 55% accuracy and the other only 50%.
Arjun Manrai, who heads an AI lab at Harvard Medical School and is a lead author on the study, described the significance of the findings in a press release. “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” he remarked.
Despite these promising outcomes, the research team was quick to caution that the study should not be read as evidence that AI is ready to handle life-or-death decisions in emergency situations. They highlighted the pressing need for prospective trials to rigorously assess these technologies in real-world medical settings.
Moreover, the study evaluated the models on text-based information alone, and the researchers acknowledged that current AI models remain limited in their ability to interpret non-text inputs.
Adam Rodman, another lead author from Beth Israel, underscored the absence of a structured accountability framework for AI diagnoses and noted that patients typically prefer human involvement in critical healthcare decisions.
The study’s findings have also drawn caution from some medical professionals. Emergency physician Kristen Panthagani critiqued the study’s central comparison, suggesting that it could invite exaggerated interpretations in the media: the AI’s diagnoses were measured against internal medicine physicians rather than specialists in emergency medicine.
Panthagani emphasized the primary objective of emergency physicians: to rapidly discern life-threatening conditions rather than to pinpoint exact diagnoses. “As an ER doctor seeing a patient for the first time, my primary goal is not to guess your ultimate diagnosis,” she stated.
As debate over the potential of AI in clinical settings continues, research must still address ethical implications, accountability, and the role of human practitioners in guiding patient care. Engaging a range of specialists and stakeholders will be vital as the medical community navigates this evolving landscape.


