Gustavo Monnerat, Deputy Editor at The Lancet, shared a post on LinkedIn:
“Frontier models pass benchmarks, even with the key inputs removed.
Nature Medicine published a study that stress-tested GPT-5, Gemini, and peers on health AI tasks.
Why It Matters: ‘State-of-the-art on MedQA’ is being used to justify clinical ambitions.
Key Findings
- Inputs removed: Models still “got the right answer” without the image or key data they supposedly reasoned over.
- Prompts slightly reworded: Same question, wildly different answers.
What’s Missing: High scores can reflect what’s memorizable in the test set, not what’s reasoned from the case.
Caveats: With models evolving faster than robust peer review, today’s published evidence is already chasing yesterday’s system.
Main Message: Benchmark performance ≠ clinical readiness. Implementation studies and real-world evidence are what should move the needle.
Ref: Gu et al., Evaluating the robustness and readiness of large frontier models in health AI applications. Nature Medicine, 2026.”
