Gustavo Monnerat: Frontier AI Takes the Lead in Clinical Benchmarking, but Real-World Questions Persist
Gustavo Monnerat/LinkedIn

Gustavo Monnerat: Frontier AI Takes the Lead in Clinical Benchmarking, but Real-World Questions Persist

Gustavo Monnerat, Deputy Editor at The Lancet, shared a post on LinkedIn:

2 blinded clinicians. 6 AI systems. 100 real patient questions

New in Nature Medicine: 3 frontier LLMs vs 2 specialized clinical AI tools, across two automated benchmarks plus 100 real clinical queries judged by 12 blinded clinicians. The frontier models came out ahead on all three.

Good news on safety: all six systems hallucinated at the same low rate.

Caveats: the headline gap comes from public exam questions the models may have trained on, and the clinical tools were tested through their browser, not API. Latency and citation quality weren’t assessed.

My take: relevant study, but a benchmark won’t always reflect real-world use. I’d expect frontier models to win some tasks and specialized tools others.

The open questions are the ones that interest me most:

  • How do these tools perform with residency doctors versus experienced clinicians?
  • Should we rely on models that can’t open the full text of clinical trials?
  • How do they hold up across settings and countries with different patterns of epidemiology and disease burden?”

Title: General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

Authors: Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Ali Hage, Sean N. Neifert, Cordelia Orillac, Nataniel J. Mandelberg, Hammad A. Khan, Jin Vivian Lee, Jie J. Yao, William Robert Small, Aakaash Varma, D. Brock Hewitt, Yindalon Aphinyanaphongs, Daniel Alexander Alber, Eric Karl Oermann

Read the article here.

Gustavo Monnerat

Other articles featuring Gustavo Monnerat on OncoDaily.