Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”
So, I can speak to this a little bit, as it touches two domains I'm involved in. TL;DR - LLMs bullshit and are unreliable, but there's a way to use them in this domain as a force multiplier of sorts.
For one of them, I've created a Python router that (rough sketch below):

- takes my (deidentified) clinical notes, extracts and compacts the input, and creates a summary, then
- benchmarks the summary against my (user-defined) gold standard and produces a management plan (again, based on a user-defined database), then
- drops the result into my on-device LLM for light editing and polishing to condense it, which I then eyeball, correct, and escalate to a supervisor for review.
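I'm not going to paste the actual router here, but the shape of it is roughly this - every name (the functions, the `gold_standard.csv` / `management_plans.csv` files, the column names) is an illustrative placeholder, not the real implementation:

```python
# Rough sketch of the router pipeline described above. All names are
# placeholders, not the actual code.
import csv
from dataclasses import dataclass

GOLD_STANDARD_CSV = "gold_standard.csv"       # user-defined reference terms
MANAGEMENT_PLAN_CSV = "management_plans.csv"  # user-defined plan database

@dataclass
class RouterResult:
    summary: str
    benchmark_score: float
    management_plan: str

def compact_note(note: str) -> str:
    """Deterministic extraction/compaction: strip empty lines, keep the rest."""
    lines = [l.strip() for l in note.splitlines() if l.strip()]
    return " ".join(lines)

def benchmark(summary: str, gold_path: str = GOLD_STANDARD_CSV) -> float:
    """Crude coverage score against the user-defined gold-standard terms."""
    with open(gold_path, newline="") as f:
        gold_terms = {row["term"].lower() for row in csv.DictReader(f)}
    hits = sum(1 for term in gold_terms if term in summary.lower())
    return hits / max(len(gold_terms), 1)

def lookup_plan(summary: str, plan_path: str = MANAGEMENT_PLAN_CSV) -> str:
    """Return the first plan whose keyword appears in the summary."""
    with open(plan_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["keyword"].lower() in summary.lower():
                return row["plan"]
    return "No matching plan found - review manually."

def run_router(deidentified_note: str) -> RouterResult:
    summary = compact_note(deidentified_note)
    return RouterResult(summary, benchmark(summary), lookup_plan(summary))
```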
Additionally, the LLM-generated note can be approved / denied by the Python router, in the first instance based on certain policy criteria I've defined.
It can also suggest probable DDx based on my databases (which are CSV-based) - a toy version of that lookup is below.
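The DDx suggestion is just a ranking over the CSVs, nothing clever. Column names here (`diagnosis`, `features`) are made up for the example:

```python
import csv

def suggest_ddx(summary: str, ddx_path: str = "ddx.csv", top_n: int = 5) -> list[str]:
    """Rank differential diagnoses by how many of their listed features
    appear in the summary. Column names are illustrative placeholders."""
    summary_lower = summary.lower()
    scored = []
    with open(ddx_path, newline="") as f:
        for row in csv.DictReader(f):
            features = [x.strip().lower() for x in row["features"].split(";")]
            score = sum(1 for feat in features if feat in summary_lower)
            if score:
                scored.append((score, row["diagnosis"]))
    scored.sort(reverse=True)
    return [dx for _, dx in scored[:top_n]]
```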
Finally, if the LLM output fails the policy check, the router tells me why it failed and just says "go look at the prior summary and edit it yourself". A rough version of that gate is sketched below.
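The approve/deny step is plain rule checking against the policy document. The actual criteria are mine and live elsewhere; these checks are stand-ins to show the shape:

```python
# Illustrative policy gate - the real criteria are defined in a policy
# document; required sections and word limit here are stand-ins.
REQUIRED_SECTIONS = ("history", "examination", "plan")
MAX_WORDS = 250

def policy_check(llm_note: str) -> tuple[bool, list[str]]:
    """Return (approved, reasons_for_failure)."""
    reasons = []
    lower = llm_note.lower()
    for section in REQUIRED_SECTIONS:
        if section not in lower:
            reasons.append(f"missing required section: {section}")
    if len(llm_note.split()) > MAX_WORDS:
        reasons.append(f"note exceeds {MAX_WORDS} words")
    return (not reasons, reasons)

def review(llm_note: str, prior_summary: str) -> str:
    """Approve the LLM note, or fall back to the deterministic summary."""
    approved, reasons = policy_check(llm_note)
    if approved:
        return llm_note
    print("Policy check failed:", "; ".join(reasons))
    print("Go look at the prior summary and edit it yourself.")
    return prior_summary
```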
This three-step process takes the tedium of paperwork from 15-20 mins down to about 1 minute of generation plus 2 mins of manual editing.
The reason why this is interesting:
All of this runs within the LLM session (it calls / invokes the Python tooling via a `>>` command) and is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.
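The `>>` mechanism is nothing exotic: the model emits a line starting with `>>`, and everything after that marker is parsed and executed by ordinary Python, so the tool calls themselves carry no randomness. A toy dispatcher (command names are made up; the real handlers would be the router functions sketched above):

```python
def dispatch(model_output: str, handlers: dict) -> list[str]:
    """Scan model output for lines like '>> ddx chest pain, dyspnoea'
    and run the matching deterministic Python handler."""
    results = []
    for line in model_output.splitlines():
        if not line.strip().startswith(">>"):
            continue
        command, _, arg = line.strip()[2:].strip().partition(" ")
        handler = handlers.get(command)
        results.append(handler(arg) if handler else f"unknown command: {command}")
    return results

# e.g. wired to the placeholder functions above:
# dispatch(text, {"ddx": lambda a: ", ".join(suggest_ddx(a)),
#                 "policy": lambda a: str(policy_check(a))})
```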
I've found that using a fairly "dumb" LLM (Qwen2.5-1.5B), with settings dialed down, produces consistently solid final notes (2 out of 3 are graded as passed by the router invoking the policy document and checking the output). It's too dumb to jazz, which is useful in this instance.
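"Settings dialed down" basically means greedy decoding and a short length cap, so the model can only condense what it's given. A minimal sketch of that polishing step using Hugging Face transformers, assuming the Instruct variant of the model (the actual runtime and prompt will differ):

```python
# Minimal sketch of the "dumb model, settings dialed down" polishing step.
# Assumes Qwen2.5-1.5B-Instruct via transformers; placeholder prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def polish(summary: str, plan: str) -> str:
    """Ask the small model only to tidy and condense - nothing creative."""
    messages = [
        {"role": "system",
         "content": "Condense and lightly edit the note. Do not add new clinical content."},
        {"role": "user", "content": f"Summary:\n{summary}\n\nPlan:\n{plan}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(
        inputs,
        max_new_tokens=300,
        do_sample=False,          # greedy decoding: no sampling jazz
        repetition_penalty=1.1,
    )
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```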
Would I trust the LLM, end to end? Well, I'd trust my system approx 80% of the time. I wouldn't trust ChatGPT … even though it's been more right than wrong in similar tests.
alzjim@lemmy.world 1 hour ago
Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.
/s