Google’s AI system, trained to conduct medical interviews, has surpassed human doctors in both bedside manner and diagnostic accuracy.
Developed by teams at DeepMind and Google Research and described in a preprint on arXiv, the chatbot, named the Articulate Medical Intelligence Explorer (AMIE), excelled at diagnosing respiratory and cardiovascular conditions, among others.
It matched or even outperformed board-certified primary-care physicians in gathering patient information during medical interviews and scored higher in empathy.
Delighted to introduce our new research paper on AMIE (Articulate Medical Intelligence Explorer), a step towards conversational diagnostic AI by @GoogleAI @GoogleHealth @GoogleDeepMind https://t.co/KIl1cYjgWO
— Mike Schaekermann (@HardyShakerman) January 12, 2024
One of the key challenges in developing AMIE was the scarcity of real-world medical conversations for training data.
To overcome this, the team at Google Health, including AI research scientist Vivek Natarajan, crafted a method allowing the chatbot to engage in simulated ‘conversations.’
The AI was trained to play the roles of a patient, an empathetic clinician, and a critic evaluating the doctor-patient interaction.
In tests involving 20 actors trained to simulate patients and 20 board-certified clinicians, AMIE consistently matched or surpassed the doctors’ diagnostic accuracy across six medical specialties.
It outperformed physicians on 24 of 26 conversation-quality criteria, such as politeness and clearly explaining conditions and treatments.
Alan Karthikesalingam, a clinical research scientist at Google Health in London and co-author of the study, noted, “To our knowledge, this is the first time that a conversational AI system has ever been designed optimally for diagnostic dialogue and taking the clinical history.”
Happy to introduce AMIE (Articulate Medical Intelligence Explorer) our research LLM for diagnostic conversations. AMIE surpassed Primary Care Drs in conversational quality & diagnostic accuracy in a “virtual OSCE”-style randomized study. Preprint ➡️ https://t.co/XZizS9PtDG (1/7)
— Alan Karthikesalingam (@alan_karthi) January 12, 2024
However, Karthikesalingam highlights that AMIE remains experimental and hasn’t been tested on real patients, only on actors portraying fictitious yet plausible medical conditions.
How the study worked
The study, titled “Towards Conversational Diagnostic AI”, introduces AMIE, an LLM designed for medical diagnostic interactions.
Here’s more information about how it works:
- Development of AMIE: The Articulate Medical Intelligence Explorer (AMIE) is an AI system based on a Large Language Model (LLM) created by Google. It’s optimized for diagnostic dialogue in medical contexts. AMIE was designed to emulate the complex process of clinical history-taking and diagnostic reasoning.
- Simulated dialogue training: Because real-world medical conversations suitable for training are scarce, the researchers developed a novel self-play simulated environment. It allowed AMIE to engage in simulated dialogues, playing different roles (patient, doctor, critic) to enhance learning, and the dialogues covered a range of medical conditions, specialties, and contexts (a minimal sketch of this loop appears after this list).
- Instruction fine-tuning and chain-of-reasoning strategy: AMIE underwent instruction fine-tuning on various real-world datasets, including medical question-answering, reasoning, summarization, and dialogue data. A chain-of-reasoning strategy had the model analyze the patient information gathered so far, formulate a response and next action, and then refine that response in light of the current conversation (this step is also illustrated in the sketch below).
- Remote objective structured clinical examination (OSCE) study: The researchers conducted a randomized, double-blind crossover study comparing AMIE with primary care physicians (PCPs). The study used text-based consultations with 149 simulated patients, portrayed by actors, covering a variety of clinical scenarios. Both specialist physicians and the patient actors assessed the performance of AMIE and the PCPs.
- Evaluation and results: The evaluation focused on diagnostic accuracy, management reasoning, communication skills, and empathy. AMIE demonstrated superior performance compared to PCPs in several areas, including diagnostic accuracy and empathy.
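The two training ideas above lend themselves to a short illustration. The Python below is a minimal, hypothetical sketch of the self-play loop (patient, doctor, and critic roles played by the same model) and of the chain-of-reasoning step (analyze, draft, refine) inside the doctor’s turn. None of the function names, prompts, or data structures come from the paper, and `llm` is a stand-in for whatever model backs AMIE.

```python
# Hypothetical sketch of AMIE-style self-play data generation.
# All names and prompts are illustrative assumptions, not Google's code.
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str  # "doctor" or "patient"
    text: str

@dataclass
class Dialogue:
    scenario: str                      # condition, specialty, context
    turns: list[Turn] = field(default_factory=list)
    critique: str = ""

def llm(prompt: str) -> str:
    """Stand-in for a call to the underlying language model."""
    raise NotImplementedError

def patient_reply(d: Dialogue) -> str:
    # The model plays a patient consistent with the scenario.
    return llm(f"You are a patient with: {d.scenario}\n"
               f"Conversation so far: {d.turns}\n"
               "Answer the clinician's last question in character.")

def doctor_reply(d: Dialogue) -> str:
    # Chain-of-reasoning: analyze what is known, draft a reply, then refine it.
    analysis = llm(f"Summarize what is known about the patient and what is "
                   f"still missing: {d.turns}")
    draft = llm(f"Given this analysis: {analysis}\n"
                "Draft the clinician's next question or explanation.")
    return llm(f"Conversation: {d.turns}\nDraft reply: {draft}\n"
               "Refine the reply for accuracy, empathy, and clarity.")

def critic_feedback(d: Dialogue) -> str:
    # A third role scores the finished consultation.
    return llm(f"Critique this doctor-patient dialogue for history-taking, "
               f"diagnostic reasoning, and empathy: {d.turns}")

def simulate(scenario: str, max_turns: int = 10) -> Dialogue:
    """Run one simulated consultation and attach the critic's feedback."""
    d = Dialogue(scenario)
    for _ in range(max_turns):
        d.turns.append(Turn("doctor", doctor_reply(d)))
        d.turns.append(Turn("patient", patient_reply(d)))
    d.critique = critic_feedback(d)
    return d  # dialogues plus critiques can then feed further fine-tuning
```

In this framing, the critique is attached to each simulated consultation because, per the paper’s description, the generated dialogues are fed back into fine-tuning rather than shown to users directly.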
The researchers caution that these results should be interpreted with care for the time being, noting the study’s limitations, such as the use of a text-chat interface and the lack of real-world patient interactions.
However, it marks progress towards developing AI systems capable of conducting medical interviews and diagnostic dialogues.
Generative AI in healthcare has already produced notable successes, with models helping to discover new antibiotics, improve colonoscopies, and simulate interactions between compounds and biological processes.
Might AI models now also adopt patient-facing roles?