Ischemic strokes, which occur when blood flow to the brain is blocked, are a major cause of death and disability.
A new pre-print study evaluated the potential of GPT-4 to assist doctors in making critical decisions in treating stroke patients.
The research team, comprising experts from the Technion-Israel Institute of Technology and the Mayo Clinic in the US, analyzed data from 100 patients who had shown acute stroke symptoms.
🚨New preprint alert. GPT-4 has been shown to work well for diagnostics, but rarely as a clinical decision-support tool. In joint work with @ShellyShahar from @RambamHCC we used it to decide how to treat patients with stroke. How did it do? Amazingly well! https://t.co/8q31Mu2PRB pic.twitter.com/Um4IzULJVS
— Dvir Aran (@dvir_a) January 27, 2024
The team compared GPT-4’s treatment recommendations with those given by experienced neurologists and the actual treatments administered to the patients.
The aim was to see how well the AI’s suggestions matched up with expert human judgment and real-world medical practice.
For each patient we asked GPT-4 whether to treat or monitor and with which treatment. We designed the prompt to return a scaled response (1-7) instead of yes/no. This had many advantages: identify uncertainty, use ROC-AUC, and reduce hallucinations by replicating the question. pic.twitter.com/n0la4Oqgfg
— Dvir Aran (@dvir_a) January 27, 2024
One of the key measures used in this study was the Area Under the Curve (AUC), where the curve in question is the receiver operating characteristic (ROC) curve.
Without getting too bogged down in the technicalities, the ROC curve is a way to visualize how well a diagnostic test performs.
It plots the rate of true positives (positive cases correctly identified) against the rate of false positives (negative cases wrongly flagged as positive) at various thresholds.
The AUC, then, is a single number that summarizes the test’s performance across all possible thresholds, with 1.0 representing a perfect test and 0.5 representing a random guess.
As a commonly cited rule of thumb in medicine, an AUC of 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is excellent, and above 0.9 is outstanding.
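To make the idea concrete, here is a minimal Python sketch of how an AUC can be computed from graded scores like the 1-7 responses described in the tweet above. It uses the fact that the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as half. The function name `roc_auc` and the eight-patient data set are made up for illustration; they are not the study's code or data.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for a positive case (e.g. the expert chose to treat), 0 otherwise.
    scores: the model's graded response, where higher means "more likely positive".
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: 8 patients, expert decision (1 = treat, 0 = monitor)
# and a 1-7 model score for each. These numbers are invented for illustration.
expert_treat = [1, 1, 1, 0, 0, 0, 1, 0]
model_score = [7, 6, 4, 2, 4, 1, 5, 3]

auc = roc_auc(expert_treat, model_score)
print(f"AUC = {auc:.3f}")  # AUC = 0.969
```

Because the score is graded rather than yes/no, every cut-off from 1 to 7 yields a different trade-off between true and false positives, and the AUC summarizes all of them in one number. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of a hand-rolled function.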
In this study, GPT-4 achieved an AUC of 0.85 when its recommendations were compared to the opinions of stroke specialists, indicating a high level of agreement and excellent performance by the AI.
Compared to the treatments given, the AUC was 0.80, showing that GPT-4’s suggestions were closely aligned with real-world medical practice.
So, to treat or not to treat? GPT-4 agreement with the expert was AUC=0.85, with 20 differences, and with the real-world the AUC was 0.8. These are similar to the agreement between the expert and the real-world, so pretty remarkable! pic.twitter.com/6dDbEa6ycv
— Dvir Aran (@dvir_a) January 27, 2024
These results are particularly promising because they suggest that GPT-4 can potentially provide valuable support in emergency rooms, especially when a neurology specialist might not be immediately available.
Moreover, GPT-4 showed a remarkable ability to predict the risk of mortality within 90 days post-stroke.
The AI model identified high-risk patients with an AUC of 0.89, outperforming two recent machine-learning models built for this task, which scored 0.77 and 0.70.
But what was really astonishing was that we also asked GPT-4 to estimate 90-day mortality and compared it with two recent traditional ML prediction models for this task. GPT-4 was significantly better (AUC=0.89 compared to 0.77 and 0.7). pic.twitter.com/FFhDQSmQc5
— Dvir Aran (@dvir_a) January 27, 2024
This could be incredibly useful for doctors in prioritizing treatments and managing resources more effectively.
This isn’t the first time LLMs have been used successfully for healthcare applications.
Google recently created Articulate Medical Intelligence Explorer (AMIE), which matched or even outperformed board-certified primary-care physicians in gathering patient information during medical interviews and scored higher in empathy.
Danish researchers even used LLMs to understand how life events affected mortality, with their model beating the next-best by 11%.
Other sophisticated machine learning models have discovered new antibiotics or therapeutic compounds in mere minutes compared to the months or years of traditional experimental techniques.