External validation: a model trained in Türkiye, tested in California
Internal accuracy is the easy part. We froze the weights and thresholds and ran Atlas on a US cohort it had never seen. Here is what transferred, what did not, and why that distinction matters.

The test that counts
Atlas was developed on a national teleradiology dataset from Türkiye [1]. A strong internal number on that data is necessary but not sufficient. What decides whether a model is useful in practice is whether it holds up somewhere else, on different scanners, different protocols, and a different patient population, without being retuned to fit.
So we froze everything: the trained weights and the per-class operating thresholds chosen on the internal validation split. We then applied the model to a 280-patient cohort drawn from the Stanford AIMI Merlin abdominal CT dataset in the United States [2], each case adjudicated by a board-certified abdominal radiologist. No fine-tuning, no threshold re-tuning in the primary analysis.
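To make "frozen" concrete, here is a minimal sketch of applying fixed per-class thresholds to model outputs. It is an illustration under stated assumptions, not the Atlas code: the threshold values, the class name finding_b, and the decide helper are all hypothetical, since the article does not publish them.

```python
from types import MappingProxyType

# Hypothetical per-class thresholds, chosen once on the internal validation
# split. MappingProxyType gives a read-only view, so external-site code
# cannot retune them by accident.
FROZEN_THRESHOLDS = MappingProxyType({"AAA": 0.50, "finding_b": 0.35})


def decide(probabilities, thresholds=FROZEN_THRESHOLDS):
    """Turn per-class probabilities into yes/no calls at the frozen thresholds."""
    missing = set(thresholds) - set(probabilities)
    if missing:
        raise KeyError(f"no probability for classes: {sorted(missing)}")
    return {name: probabilities[name] >= cut for name, cut in thresholds.items()}


# One external case, scored by the unchanged model (made-up probabilities).
calls = decide({"AAA": 0.62, "finding_b": 0.20})
print(calls)
```

The read-only mapping is the point of the sketch: any attempt to adjust a threshold during external evaluation fails loudly instead of silently changing the result.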
Discrimination travels; thresholds do not
Discrimination held up well. External macro AUROC was 0.879, every class stayed at or above 0.80, and three reached 0.90 or higher. Abdominal aortic aneurysm (AAA) barely moved, at 0.991 AUROC and 0.889 F1 at the original threshold. Because AUROC is threshold-independent, this is a clean statement about the model's ranking ability surviving a change of country.
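Why AUROC is threshold-independent can be seen in a few lines. The sketch below computes AUROC as the fraction of positive/negative pairs ranked correctly and macro-averages it over classes. The data are toy values, finding_b is a hypothetical class name, and squaring the scores stands in for a site shift that changes score values without changing their order.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Only the ordering of scores matters.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(labels_by_class, scores_by_class):
    """Unweighted mean of per-class AUROC."""
    per_class = {
        name: auroc(labels_by_class[name], scores_by_class[name])
        for name in labels_by_class
    }
    return sum(per_class.values()) / len(per_class), per_class


# Toy data: 14 of the 15 positive/negative pairs are ordered correctly.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

# A monotone distortion of the scores: every value changes, the order does not.
shifted = [s ** 2 for s in scores]

print(auroc(labels, scores), auroc(labels, shifted))

macro, per_class = macro_auroc(
    {"AAA": labels, "finding_b": [1, 0, 1, 0]},
    {"AAA": scores, "finding_b": [0.8, 0.3, 0.4, 0.6]},
)
print(macro, per_class)
```

Both AUROC values printed on the first line are identical, because no pair changed order. A fixed cut-off such as 0.5 would give different yes/no calls on the two score lists.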
Operating points were a different story. At the frozen thresholds, macro F1 fell to 0.545, recovering to 0.648 after site-specific threshold recalibration. That gap is the practical headline: a model can rank cases well in a new hospital while still needing its decision thresholds calibrated locally before the yes/no outputs are trustworthy.
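The gap between a frozen and a locally chosen threshold can be sketched as follows. The article does not describe the recalibration procedure, so this assumes one simple option: pick the F1-maximizing threshold on a small local calibration set, then report F1 on separate local cases. All scores, the 0.50 frozen threshold, and the function names are illustrative.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 of the yes/no calls produced by score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def recalibrate_threshold(labels, scores):
    """Assumed procedure: the observed score that maximizes F1 on local data."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


FROZEN = 0.50  # hypothetical threshold carried over from the development site

# Local calibration cases: positives still rank high, but scores run lower
# than at the development site, so the frozen cut-off misses most of them.
cal_labels = [1, 1, 1, 0, 0, 0, 0, 0]
cal_scores = [0.60, 0.40, 0.30, 0.35, 0.20, 0.15, 0.10, 0.05]
local = recalibrate_threshold(cal_labels, cal_scores)

# Separate local cases for reporting, so the threshold is not scored on the
# same data it was chosen on.
test_labels = [1, 1, 0, 0, 0, 0]
test_scores = [0.55, 0.32, 0.33, 0.22, 0.12, 0.08]

f1_frozen = f1_at_threshold(test_labels, test_scores, FROZEN)
f1_local = f1_at_threshold(test_labels, test_scores, local)
print(local, f1_frozen, f1_local)
```

Keeping the calibration and reporting cases separate matters: a threshold tuned and scored on the same cases will look better than it is.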
Why this is the norm, not a flaw
Imaging models are known to degrade when scanners, protocols, and populations shift, sometimes learning shortcuts that do not generalize [3]. Rigorous patient-level evaluation, with no leakage between training and test, is the baseline that makes any of these numbers meaningful [4]. Reporting both the frozen-threshold result and the recalibrated one, rather than only the better figure, follows current reporting guidance for medical-imaging AI [5].
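Patient-level evaluation means every scan from a given patient lands on the same side of the split. A minimal sketch, assuming each record carries a patient_id field (a hypothetical schema) and using a hash of that ID for a deterministic assignment:

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2):
    """Map a patient ID to 'train' or 'test', identically on every run."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_patient(records, test_fraction=0.2):
    """Split records so that all scans of one patient share a split."""
    splits = {"train": [], "test": []}
    for record in records:
        splits[assign_split(record["patient_id"], test_fraction)].append(record)
    return splits


def leaked_patients(splits):
    """Patient IDs that appear in both splits; should always be empty."""
    train_ids = {r["patient_id"] for r in splits["train"]}
    test_ids = {r["patient_id"] for r in splits["test"]}
    return train_ids & test_ids


# Toy records: 50 patients with three scans each (made-up IDs).
records = [
    {"patient_id": f"P{p:03d}", "scan": f"P{p:03d}_s{s}"}
    for p in range(50)
    for s in range(3)
]
splits = split_by_patient(records)
print(len(splits["train"]), len(splits["test"]), leaked_patients(splits))
```

Splitting on the scan instead of the patient is the usual way leakage creeps in: two scans of the same person end up on opposite sides, and the test score rewards memorizing anatomy.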
References
[1] Koç U, et al. Elevating healthcare through AI: the abdominal emergencies dataset at TEKNOFEST-2022. Eur Radiol. 2024;34(6):3588-3597.
[2] Stanford AIMI. Merlin Abdominal CT Dataset (v1.0). Redivis; 2026.
[3] Zech JR, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs. PLoS Med. 2018;15(11):e1002683.
[4] Varoquaux G, Cheplygina V. Machine learning for medical imaging: methodological failures and recommendations. NPJ Digit Med. 2022;5(1):48.
[5] Tejani AS, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 update. Radiol Artif Intell. 2024;6(4):e240300.