A recent study led by researchers from Mass General Brigham has shed light on the accuracy of ChatGPT in clinical decision making.

The research revealed that the large language model (LLM) chatbot achieved roughly 72% accuracy in overall clinical decisions, spanning tasks from generating possible diagnoses to reaching final diagnoses and making care management choices.

The study included various medical specialties and was conducted in both primary care and emergency settings. 

Comparable to a Newly Graduated Medical Professional

Lead author Marc Succi, MD, said ChatGPT's performance was comparable to that of a newly graduated medical professional, highlighting the potential of LLMs to serve as effective tools in medicine.

"No real benchmarks exists, but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident. This tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy," Succi said in a statement.

Despite rapid advances in artificial intelligence, the extent to which LLMs can contribute to comprehensive clinical care has remained largely unexplored.

This study sought to investigate ChatGPT's capabilities in advising and making clinical decisions across a complete patient encounter, including diagnostic workups, clinical management, and final diagnoses.

The research involved presenting segments of standardized clinical scenarios to ChatGPT, simulating real-world patient interactions. ChatGPT was tasked with generating differential diagnoses based on initial patient information, followed by making management decisions and arriving at a final diagnosis through successive iterations of data input.
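The paper's exact prompts and vignettes are not reproduced here, but the staged-input setup it describes can be illustrated with a short Python sketch. In the example below, the vignette segments, stage labels, and the "gpt-4" model name are hypothetical placeholders, not the study's actual materials or necessarily the interface the researchers used; the sketch only shows how a scenario can be revealed to the model in stages while carrying the conversation forward:

```python
# Minimal sketch of an iterative clinical-vignette loop, assuming the
# OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
# Vignette text, stage prompts, and model name are hypothetical.
from openai import OpenAI

client = OpenAI()

# A hypothetical scenario, revealed in stages as in a patient encounter.
stages = [
    ("differential diagnosis",
     "A 54-year-old presents with acute chest pain radiating to the left "
     "arm. List the most likely differential diagnoses."),
    ("clinical management",
     "ECG shows ST-segment elevation in leads II, III, and aVF. What "
     "diagnostic workup and management steps do you recommend?"),
    ("final diagnosis",
     "Troponin is elevated; angiography confirms an occluded right "
     "coronary artery. What is the final diagnosis?"),
]

# Keep the full conversation so each answer builds on earlier context.
messages = [{"role": "system",
             "content": "You are assisting with a clinical reasoning exercise."}]

for task, segment in stages:
    messages.append({"role": "user", "content": segment})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- {task} ---\n{answer}\n")
```

In the published study, the model's answers at each stage were scored against the standardized vignettes; that grading step is omitted from this sketch.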

The researchers discovered that ChatGPT's accuracy averaged around 72%, with its highest performance observed in making final diagnoses at 77%. However, its accuracy was lower in making differential diagnoses (60%) and clinical management decisions (68%).

Notably, the study revealed that ChatGPT's responses did not demonstrate gender bias and that its performance was consistent across primary and emergency care scenarios.

Succi emphasized that ChatGPT struggled most with differential diagnosis, an essential aspect of medicine that requires weighing possible courses of action when patient information is still limited. This finding points to the strengths of physicians in the early stages of patient care, where generating a list of possible diagnoses is pivotal.

ChatGPT in Clinical Practice

The study's authors acknowledge that further benchmark research and regulatory guidance are crucial before integrating tools like ChatGPT into clinical practice. The team's future work aims to explore whether AI tools can enhance patient care in resource-constrained healthcare settings.

Mass General Brigham, an integrated academic health system and innovation enterprise, is actively engaged in rigorous research to responsibly incorporate AI into care delivery, workforce support, and administrative processes.

Co-author Adam Landman, MD, MS, MIS, MHS, Chief Information Officer and Senior Vice President of Digital at Mass General Brigham, emphasized the importance of thorough studies like this in evaluating the accuracy, reliability, safety, and fairness of AI solutions before their integration into clinical care.

"Mass General Brigham sees great promise for LLMs to help improve care delivery and clinician experience," Landman said in a statement. The study's findings were published in the Journal of Medical Internet Research.
