
Roupen Odabashian: Benchmarking LLM Performance on Breast Oncology Multiple-Choice Questions
Roupen Odabashian, Hematology/Oncology Fellow at the Karmanos Cancer Institute, posted on LinkedIn:
“New Study at ASCO2025.
Can large language models like GPT-4 and Claude Opus reason like oncologists?
The way we’re currently evaluating large language models—with those shiny journal titles touting multiple-choice exam benchmarks for accuracy—is just horribly WRONG.
Would you trust a fresh-out-of-med-school doctor to treat your cancer based solely on passing a multiple-choice test, without any real-world experience handling complex cases with multiple, difficult treatment options?
In our study at ASCO2025, we assessed large language models using multiple-choice questions, but we focused on their clinical reasoning, not just their accuracy. And the results? Shocking.
We benchmarked the clinical reasoning of AI models using 273 breast oncology multiple-choice questions from the ASCO QBank.
Key findings: GPT-4 and Claude Opus both started with high accuracy (81.3% and 79.5%, respectively).
After applying chain-of-thought prompting to simulate stepwise reasoning, Claude’s performance improved to 86.4%, while GPT-4’s accuracy slightly declined to 80.95%.
That’s where we looked at their clinical reasoning. Common AI errors included:
- Reliance on outdated guidelines
- Misinterpretation of clinical trial data
- Lack of individualized/multidisciplinary care reasoning
Conclusion: LLMs are promising tools, but still fall short in nuanced, real-world oncology decision-making. Human supervision remains essential.”
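
The post does not share the study’s actual prompts or evaluation code, so the following is only a minimal sketch of the two conditions being compared: direct answering versus chain-of-thought prompting on a multiple-choice question. It assumes the OpenAI Python SDK; the prompt wording, the `ask()` and `accuracy()` helpers, and the answer-extraction logic are illustrative assumptions, not the authors’ protocol.

```python
# Minimal sketch of the two prompting conditions compared in the study.
# Assumes the OpenAI Python SDK (openai>=1.0). The prompt wording, the
# ask()/accuracy() helpers, and the answer-extraction logic are
# illustrative assumptions, not the authors' actual protocol, and the
# ASCO QBank items themselves are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIRECT_PROMPT = (
    "You are taking an oncology board exam. Answer the multiple-choice "
    "question below with the letter of the single best option only.\n\n{q}"
)

COT_PROMPT = (
    "You are taking an oncology board exam. Reason step by step: summarize "
    "the clinical scenario, recall the relevant guideline or trial evidence, "
    "rule out each distractor, then finish with 'Answer: <letter>'.\n\n{q}"
)

def ask(question: str, template: str, model: str = "gpt-4") -> str:
    """Send one question under the given prompting condition."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep runs as repeatable as the API allows
        messages=[{"role": "user", "content": template.format(q=question)}],
    )
    return response.choices[0].message.content

def accuracy(items: list[tuple[str, str]], template: str,
             model: str = "gpt-4") -> float:
    """Fraction of (question, answer_key) pairs the model gets right."""
    correct = 0
    for question, key in items:
        reply = ask(question, template, model)
        # Crude check: a real harness would parse the final answer line.
        if f"Answer: {key}" in reply or reply.strip().upper().startswith(key):
            correct += 1
    return correct / len(items)
```

Scoring chain-of-thought output is itself a design decision: the final letter has to be parsed out of free-form reasoning, and the qualitative error analysis the post describes (outdated guidelines, misread trial data, lack of multidisciplinary framing) requires reading those intermediate steps, not just the extracted answer.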