Mount Sinai medical researchers say ChatGPT is ready to practice medicine

A team of medical researchers from the Icahn School of Medicine at Mount Sinai recently conducted a study on AI chatbots and determined that "generative large language models are autonomous practitioners of evidence-based medicine."

How did the test go?

According to pre-print research published on arXiv, the Mount Sinai team tested various consumer-facing large language models (LLMs), including ChatGPT 3.5, ChatGPT 4, and Gemini Pro, as well as the open-source models LLaMA v2 and Mixtral-8x7B.

The models were given prompts engineered with information such as "you are a medical professor" and then asked to follow evidence-based medical protocols to suggest the proper course of treatment for a series of test cases.

Once given a case, models were tasked with suggesting the next action — such as ordering tests or starting a treatment protocol. Then, they got the results of the action and were prompted to integrate this new information and suggest the next action, and so on.
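To make that loop concrete, here is a minimal sketch of what such an interaction could look like using OpenAI's Python client. This is not the researchers' actual code: the system prompt wording, the test case, the step limit, and the lookup_result helper that stands in for real clinical results are all assumptions for illustration.

```python
# Illustrative sketch only — NOT the Mount Sinai team's code. It mirrors the
# general shape of the loop described above: the model proposes one action,
# receives the result, and is asked for the next action.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def lookup_result(action: str) -> str:
    """Hypothetical stand-in for the clinical environment that returns the
    outcome of an ordered test or treatment (labs, imaging reports, etc.)."""
    return f"Result of '{action}': within normal limits."  # placeholder


# Role-priming system prompt, similar in spirit to "you are a medical professor".
messages = [
    {"role": "system", "content": (
        "You are a medical professor. Follow evidence-based guidelines and "
        "suggest exactly one next action per turn.")},
    {"role": "user", "content": (
        "Case: 58-year-old with acute chest pain. "  # hypothetical test case
        "What is the next action?")},
]

for _ in range(5):  # cap the number of management steps
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    action = reply.choices[0].message.content
    print("Model suggests:", action)
    # Feed the outcome of the suggested action back in and ask for the next step.
    messages.append({"role": "assistant", "content": action})
    messages.append({"role": "user",
                     "content": lookup_result(action) + " What is the next action?"})
```

In the actual study the models also had access to tooling for interacting with clinical infrastructure, so this sketch only captures the prompt-and-feedback cycle, not the full setup.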

According to the team, ChatGPT 4 was the most successful, reaching an accuracy of 74% across all cases and outperforming the next-best model (ChatGPT 3.5) by a margin of approximately 10%.

This performance led the team to the conclusion that such models can practice medicine.

"LLMs can be made to function as autonomous practitioners of evidence-based medicine," the paper says. "Their ability to utilize tooling can be harnessed to interact with the infrastructure of a real-world healthcare system and perform the tasks of patient management in a guideline directed manner."

Automating evidence-based medicine

Evidence-based medicine (EBM) uses the lessons learned from previous cases to dictate the trajectory of treatment for similar cases.

While EBM works somewhat like a flowchart in this way, the number of complications, permutations, and overall decisions can make the process unwieldy.

"Clinicians often face the challenge of information overload with the sheer number of possible interactions and treatment paths exceeding what they can feasibly manage or keep track of," the researchers write, adding that LLMs can mitigate this overload by performing tasks usually handled by human medical experts — such as "ordering and interpreting investigations, or issuing alarms," while humans focus on physical care.

"LLMs are versatile tools capable of understanding clinical context and generating possible downstream actions," the researchers add.

It works, but...

The researchers argue that the capacity of LLMs to reason is a profound ability with "implications far beyond treating such models as databases that can be queried using natural language."

On the other hand, there's no general consensus among computer scientists that LLMs have any capacity to reason.

The paper doesn't mention the ethical considerations involving the insertion of an unpredictable automated system into existing clinical workflows.

The issue with LLMs such as ChatGPT is that they generate new text every time they are queried, and in a clinical setting there is no reliable way to prevent them from occasionally fabricating plausible-sounding nonsense, a phenomenon referred to as "hallucination."

According to researchers from the Icahn School of Medicine at Mount Sinai, the hallucinations were minimal during their testing.

We can only hope that the technology keeps getting better and perhaps, one day, replaces doctors in treating common illnesses. That feat alone is worth pursuing.

