External validation: a model trained in Türkiye, tested in California

The test that counts

Atlas was developed on a national teleradiology dataset from Türkiye [1]. A strong internal number on that data is necessary but not sufficient. The question that decides whether a model is useful in practice is whether it holds up somewhere else: on different scanners, different protocols, and a different patient population, without being retuned to fit.

So we froze everything: the trained weights and the per-class operating thresholds chosen on the internal validation split. We then applied the model to a 280-patient cohort drawn from the Stanford AIMI Merlin abdominal CT dataset in the United States [2], each case adjudicated by a board-certified abdominal radiologist. No fine-tuning, no threshold re-tuning in the primary analysis.

0.879 external macro AUROC

6 / 6 classes at AUROC 0.80 or higher

0.991 AAA AUROC, near-intact transfer

Discrimination travels; thresholds do not

Discrimination held up well. External macro AUROC was 0.879, every class stayed at or above 0.80, and three were at or above 0.90. AAA barely moved, at 0.991 AUROC and 0.889 F1 at the original threshold. Because AUROC is threshold-independent, this is a clean statement about the model's ranking ability surviving a change of country.
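
To make the threshold-independence point concrete, here is a minimal sketch of AUROC as a pure ranking statistic: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The function names and the toy labels and scores are illustrative assumptions, not Atlas code or Merlin data. The sketch shows that a monotone rescaling of the scores, which would move every case relative to a fixed threshold, leaves AUROC unchanged.

```python
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(per_class: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-class AUROCs."""
    return sum(auroc(y, s) for y, s in per_class.values()) / len(per_class)


# Toy data (hypothetical): one class, six cases.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]

# A monotone rescaling changes every score but preserves their order.
rescaled = [s ** 3 for s in scores]

original_auroc = auroc(labels, scores)
rescaled_auroc = auroc(labels, rescaled)
toy_macro = macro_auroc({"class_a": (labels, scores), "class_b": (labels, rescaled)})
```

The pairwise loop is quadratic in the number of cases, which is fine for a cohort of a few hundred patients. A rank-based implementation would be the usual choice at larger scale.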

Operating points were a different story. At the frozen thresholds, macro F1 fell to 0.545, recovering to 0.648 after site-specific recalibration of the decision thresholds. That gap is the practical headline: a model can rank cases well in a new hospital while still needing its decision thresholds calibrated locally before the yes/no outputs are trustworthy.
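
The gap between frozen and recalibrated operating points can be illustrated with a small sketch. The text does not specify how the site-specific recalibration was performed, so the procedure here is an assumption: pick the threshold that maximizes F1 on a local calibration split, then score a separate local test split. All names and numbers are hypothetical toy values in which the new site's scores run lower than the frozen threshold expects.

```python
from typing import Sequence


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 for the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def recalibrate_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Assumed procedure: choose the observed score that maximizes F1 on local data."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


# Threshold chosen on the internal validation split and then frozen (toy value).
FROZEN_THRESHOLD = 0.50

# Hypothetical external site: ranking is intact, but scores are shifted downward.
cal_labels = [0, 0, 1, 1, 0, 1]
cal_scores = [0.05, 0.20, 0.30, 0.45, 0.25, 0.60]
test_labels = [0, 1, 1, 0, 1]
test_scores = [0.10, 0.35, 0.55, 0.28, 0.32]

frozen_f1 = f1_at_threshold(test_labels, test_scores, FROZEN_THRESHOLD)

# Recalibrate on the calibration split only, then evaluate on the held-out split.
local_threshold = recalibrate_threshold(cal_labels, cal_scores)
local_f1 = f1_at_threshold(test_labels, test_scores, local_threshold)
```

Keeping the calibration and test splits separate matters: choosing a threshold on the same cases used to report F1 would overstate the recalibrated figure.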

Why this is the norm, not a flaw

Imaging models are known to degrade when scanners, protocols, and populations shift, sometimes learning shortcuts that do not generalize [3]. Rigorous patient-level evaluation, with no leakage between training and test, is the baseline that makes any of these numbers meaningful [4]. Reporting both the frozen-threshold result and the recalibrated one, rather than only the better figure, follows current reporting guidance for medical-imaging AI [5].
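
As an illustration of what patient-level evaluation with no leakage means in practice, the sketch below assigns every study from a given patient to the same split. It is one common way to do this, not a description of how the Atlas or Merlin splits were built. The field name `patient_id`, the hash-based assignment, and the test fraction are all assumptions.

```python
import hashlib
from typing import Dict, List, Tuple


def split_of(patient_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a patient to 'train' or 'test' from a hash of the ID."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def patient_level_split(
    studies: List[Dict[str, str]], test_fraction: float = 0.2
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split studies so that no patient appears on both sides."""
    train, test = [], []
    for study in studies:
        if split_of(study["patient_id"], test_fraction) == "test":
            test.append(study)
        else:
            train.append(study)
    return train, test


# Hypothetical records: several patients have more than one study.
studies = [
    {"patient_id": f"P{i // 2:03d}", "study_id": f"S{i:04d}"} for i in range(40)
]
train, test = patient_level_split(studies)

train_patients = {s["patient_id"] for s in train}
test_patients = {s["patient_id"] for s in test}
```

Because the assignment depends only on the patient ID, a patient's later studies land in the same split as their earlier ones, which is the property that prevents leakage.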

References