ChatGPT has demonstrated its examination skills, scoring similarly to students on several degree courses and other tests, such as the Bar Exam for lawyers. But can it deliver satisfactory results on medical exams?
A group of pediatric doctors put ChatGPT, specifically the GPT-3.5 model, to the test.
They ran it on the neonatal-perinatal board exam, which is critical for pediatric students. The study, published in JAMA, found that ChatGPT version 3.5 answered only 46% of the questions correctly.
ChatGPT performed best on basic recall and clinical reasoning-themed questions, but its limitations were exposed by questions requiring multi-logic reasoning.
Specifically, the model scored its lowest, 37.5 percent, in the gastroenterology section and its highest, 78.5 percent, in ethics – perhaps ironically.
The study’s senior author, Andrew Beam, is an assistant professor of biomedical informatics at Harvard Medical School.
He pointed out that the rapid advancements in AI have been nothing short of remarkable. “There was this moment last year when, all of a sudden, five or six different models were all getting scores of 80 percent or higher,” he said, emphasizing the quick pace at which the field is evolving.
Beam’s wife, Kristyn, an instructor in pediatrics at Harvard Medical School, also participated in the study. “I wanted it not to do well, so from that perspective I was happy,” she confessed.
However, she acknowledged that AI will inevitably embed itself in healthcare, as we’ve already seen with AI-powered MRI scanning, eye disease diagnostics, and drug development, to name just a few of its growing applications.
“It is really important to figure out how to bring that into the clinical world and to bring it in safely,” she said.
The team plans to run the same tests with the more capable GPT-4, covering the neonatal-perinatal and anesthesiology board exams.
Andrew Beam also pointed out the importance of knowing which version of a large language model you’re using, noting that the newer GPT-4 is available on a subscription basis, while the older ChatGPT 3.5 is still freely available.
“Most users will likely be attracted to the free tool and should keep in mind its limitations,” he said. For many users around the world, $20 a month is far from negligible.
ChatGPT has been tested on various exams. One recent study pitted it against students across 32 degree-level topics and found that it matched or beat them on only 9 of the 32 exams.
The AI has also been tested on the bar exam for law, Graduate Record Examinations (GRE), SAT Reading and Writing, Advanced Placement exams, and many others, often scoring very highly.