This study evaluated confidence calibration in 48 large language models (LLMs) using 300 gastroenterology board exam-style multiple-choice questions. All models, regardless of accuracy, estimated their own certainty poorly. Even the best-calibrated systems (o1-preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15–0.2, AUROC ~ 0.6), and models maintained high confidence irrespective of question difficulty or answer correctness. In their current form, LLMs cannot be relied upon to communicate uncertainty accurately, and human oversight remains essential for safe use.
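The two calibration metrics cited above can be sketched as follows. This is an illustrative reimplementation with made-up data, not the study's code or dataset: the confidence and correctness values are hypothetical, chosen only to show how an overconfident model scores on each metric.

```python
def brier_score(confidences, correct):
    """Mean squared error between stated confidence and 0/1 correctness.
    0 is perfect; an always-overconfident model scores noticeably higher."""
    return sum((c - int(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def auroc(confidences, correct):
    """Probability that a randomly chosen correct answer received higher
    confidence than a randomly chosen incorrect one (ties count as 0.5).
    0.5 means stated confidence carries no signal about correctness."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical responses: a model states ~0.9 confidence on nearly
# every question, whether its answer is right or wrong.
conf = [0.95, 0.90, 0.90, 0.85, 0.90, 0.95, 0.80, 0.90]
correct = [True, True, False, True, False, True, False, True]

print(f"Brier score: {brier_score(conf, correct):.3f}")  # 0.288
print(f"AUROC:       {auroc(conf, correct):.3f}")        # 0.733
```

Because this model's confidence barely varies with correctness, its Brier score is inflated by the confidently wrong answers, mirroring the pattern the study reports.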