Method · May 6, 2026 · 5 min read

Calibration is a deployment problem, not a training detail

A detector's confidence score is not a probability. If you want a clinical AI to support decisions, the gap between those two things is where the work is.


Confidence is not probability

Object detectors are trained to rank and localize, not to output calibrated probabilities. Atlas was no exception: its raw confidences were systematically miscalibrated, with a macro Brier score of 0.084 and a macro expected calibration error of 0.101. A score of 0.7 from such a model does not mean a 70 percent chance the finding is real.
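For readers who want to see what those two numbers measure, here is a minimal sketch of both metrics for a single class. It assumes binary labels (1 if the finding is real, 0 otherwise) and an equal-width, 10-bin version of expected calibration error; the article does not say which binning was used, and the function names are illustrative rather than taken from the Atlas codebase.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between the score and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |observed rate - mean score| per bin.

    The bin count and equal-width scheme are assumptions of this sketch.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_score = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(observed_rate - mean_score)
    return ece


def macro_average(per_class_values: Sequence[float]) -> float:
    """Unweighted mean across classes, so rare classes count as much as common ones."""
    return sum(per_class_values) / len(per_class_values)
```

Computing each metric per class and passing the results to `macro_average` gives the macro figures. A model whose 0.7 scores are right about 70 percent of the time contributes almost nothing to the ECE sum for that bin; one whose 0.7 scores are right half the time contributes a gap of 0.2.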

Fixing it after the fact

We fit a per-class isotonic regression on the validation set and applied it to held-out test scores. That cut macro expected calibration error to 0.042, a 58 percent relative reduction, and improved the Brier score, without touching the underlying detector. Calibration is a thin, honest layer on top of a frozen model, not a retraining exercise.
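To make the "thin layer" concrete, here is a sketch of per-class isotonic calibration using the pool-adjacent-violators algorithm in plain Python. It is an illustration under stated assumptions, not our production code: the function names and the dict-of-lists data layout are hypothetical, and the fitted map is a step function, whereas library implementations such as scikit-learn's `IsotonicRegression` interpolate between points.

```python
from bisect import bisect_left
from itertools import groupby


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step map from raw score to observed frequency."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for score, group in groupby(pairs, key=lambda pair: pair[0]):
        ys = [y for _, y in group]  # tied scores are pooled up front
        blocks.append([float(sum(ys)), len(ys), score])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def apply_isotonic(model, score):
    """Map one raw score to a calibrated probability; out-of-range scores are clipped."""
    uppers, values = model
    idx = min(bisect_left(uppers, score), len(values) - 1)
    return values[idx]


def fit_per_class(val_scores, val_labels):
    """Fit one calibrator per class. Inputs map class name -> list (hypothetical layout)."""
    return {c: fit_isotonic(val_scores[c], val_labels[c]) for c in val_scores}
```

The calibrators are fit once on validation scores and then applied to test scores with `apply_isotonic`; the detector's weights are never touched. The relative reduction quoted above is (0.101 − 0.042) / 0.101 ≈ 0.58.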

Does it help a decision?

Calibrated probabilities are only useful if they change decisions for the better. On decision-curve analysis across the prespecified threshold range, the recalibrated model showed higher net benefit than both the treat-all and treat-none strategies for all six classes. That is the relevant test for a triage aid: not just sharper probabilities, but probabilities that earn their place in a workflow.
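The comparison behind a decision curve can be sketched in a few lines. This uses the standard net-benefit formula, in which false positives are weighted by the odds of the threshold probability; the function names are illustrative, and the thresholds passed in stand in for the prespecified range, which is not restated here.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of flagging every case with probability >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # how many false positives one true positive is worth
    return tp / n - fp / n * odds


def net_benefit_treat_all(labels, threshold):
    """Net benefit of flagging every case regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def beats_defaults(probs, labels, thresholds):
    """True if the model beats treat-all and treat-none (net benefit 0) at every threshold."""
    return all(
        net_benefit(probs, labels, t) > max(net_benefit_treat_all(labels, t), 0.0)
        for t in thresholds
    )
```

Treat-none always has a net benefit of zero, so a model that cannot clear zero at a clinically plausible threshold adds nothing at that threshold, however well it ranks cases.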

The deployment lesson

Calibration and threshold selection are deployment-specific steps, not fixed properties of a trained model. The same weights that rank cases well will need site-specific calibration to produce trustworthy probabilities and yes/no flags, which is exactly why current reporting standards for medical-imaging AI ask authors to be explicit about how thresholds and probabilities were derived [1].
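As one illustration of what a site-specific step can look like, the sketch below picks an operating threshold from a site's own calibrated validation data so that a target sensitivity is met. The selection rule, the function name, and the idea of using a sensitivity target are assumptions made for the example; a site could just as reasonably anchor on specificity, workload, or net benefit.

```python
def threshold_for_sensitivity(probs, labels, target_sensitivity):
    """Return the highest threshold whose sensitivity on this data meets the target.

    Hypothetical helper: probs are calibrated probabilities from one site's
    validation set, labels are 0/1 ground truth for the same cases.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive case to estimate sensitivity")
    # Walk candidate thresholds from strictest to most permissive.
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        if tp / positives >= target_sensitivity:
            return t
    raise ValueError("target sensitivity must be at most 1.0")
```

Running the same function on data from two sites will generally return two different thresholds for the same weights, which is the point: the threshold belongs to the deployment, and how it was derived is worth writing down.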

References

  1. Tejani AS, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 update. Radiol Artif Intell. 2024;6(4):e240300.