Gustavo Monnerat, Deputy Editor at The Lancet, shared a post on LinkedIn:

“Frontier models pass benchmarks, even with the key inputs removed.

Nature Medicine published a study that stress-tested GPT-5, Gemini, and peers on health AI tasks.

Why It Matters: ‘State-of-the-art on MedQA’ is being used to justify clinical ambitions.

Key Findings

Inputs removed: Models still “got the right answer” without the image or key data they supposedly reasoned over.
Prompts slightly reworded: Same question, wildly different answers.

What’s Missing: High scores can reflect what’s memorizable in the test set, not what’s reasoned from the case.

Caveats: With models evolving faster than robust peer review, today’s published evidence is already chasing yesterday’s system.

Main Message: Benchmark performance ≠ clinical readiness. Implementation studies and real-world evidence are what should move the needle.

Ref: Gu et al., Evaluating the robustness and readiness of large frontier models in health AI applications. Nature Medicine, 2026.”



Gustavo Monnerat: Benchmark Performance Does Not Equal Clinical Readiness in Healthcare AI
