Roupen Odabashian: Benchmarking LLM Performance on Breast Oncology Multiple-Choice Questions
Jun 2, 2025, 07:08

Roupen Odabashian, Hematology/Oncology Fellow at the Karmanos Cancer Institute, posted on LinkedIn:

“New Study at ASCO2025.

Can large language models like GPT-4 and Claude Opus reason like oncologists?

The way we’re currently evaluating large language models—with those shiny journal titles touting multiple-choice exam benchmarks for accuracy—is just horribly WRONG.

Would you trust a fresh-out-of-med-school doctor to treat your cancer based solely on passing a multiple-choice test, without any real-world experience handling complex cases with multiple, difficult treatment options?

In our study at ASCO2025, we assessed large language models using multiple-choice questions, but we focused on their clinical reasoning, not just their accuracy. And the results? Shocking.

We benchmarked the clinical reasoning of AI models using 273 breast oncology multiple-choice questions from the ASCO QBank.

Key findings: GPT-4 and Claude Opus both started with high accuracy (81.3% and 79.5%, respectively).

After applying chain-of-thought prompting to simulate stepwise reasoning, Claude's accuracy improved to 86.4%, while GPT-4's slightly declined to 80.95%.

That's where we looked at their clinical reasoning! Common AI errors included:

  • Reliance on outdated guidelines
  • Misinterpretation of clinical trial data
  • Lack of individualized/multidisciplinary care reasoning

Conclusion: LLMs are promising tools, but still fall short in nuanced, real-world oncology decision-making. Human supervision remains essential.

Read the abstract.”
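
For readers who want to see how such a benchmark is wired together, below is a minimal Python sketch of scoring multiple-choice questions under a direct prompt versus a chain-of-thought prompt. It is an illustration only: the ASCO QBank items are not publicly available, and the MCQ class, the ask_model() stub, and the prompt templates are assumptions, not the study's actual code.

    # Minimal sketch of an MCQ benchmark with direct vs. chain-of-thought prompting.
    # ASCO QBank questions are not public; MCQ, ask_model(), and the prompt templates
    # below are placeholders rather than the study's actual code.

    from dataclasses import dataclass


    @dataclass
    class MCQ:
        stem: str
        options: dict[str, str]   # e.g. {"A": "Tamoxifen", "B": "..."}
        answer: str               # correct option letter

    DIRECT_PROMPT = (
        "You are a breast oncology expert. Answer with the single best option "
        "letter.\n\n{question}"
    )

    COT_PROMPT = (
        "You are a breast oncology expert. Reason step by step about the patient, "
        "current guidelines, and trial evidence, then end with 'Answer: <letter>'."
        "\n\n{question}"
    )


    def format_question(q: MCQ) -> str:
        # Render the stem followed by the lettered options, one per line.
        opts = "\n".join(f"{letter}. {text}" for letter, text in q.options.items())
        return f"{q.stem}\n{opts}"


    def ask_model(prompt: str) -> str:
        """Placeholder for a call to GPT-4, Claude Opus, etc. via their APIs."""
        raise NotImplementedError


    def extract_choice(reply: str, options: dict[str, str]) -> str | None:
        # Take the last option letter mentioned; this covers both a bare "B" and a
        # chain-of-thought reply that ends with "Answer: B".
        letters = [c for c in reply.upper() if c in options]
        return letters[-1] if letters else None


    def accuracy(questions: list[MCQ], template: str) -> float:
        # Fraction of questions where the extracted choice matches the answer key.
        correct = sum(
            extract_choice(ask_model(template.format(question=format_question(q))),
                           q.options) == q.answer
            for q in questions
        )
        return correct / len(questions)

Comparing accuracy(questions, DIRECT_PROMPT) against accuracy(questions, COT_PROMPT) mirrors the before-and-after comparison described in the post, though the study's actual prompting and scoring details are not specified there.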

More posts featuring ASCO25.