Calibration is a deployment problem, not a training detail

Confidence is not probability

Object detectors are trained to rank and localize, not to output calibrated probabilities. Atlas was no exception: its raw confidences were systematically miscalibrated, with a macro Brier score of 0.084 and a macro expected calibration error of 0.101. A score of 0.7 from such a model does not mean a 70 percent chance the finding is real.
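For readers who want to see what those two numbers measure, here is a minimal sketch in plain Python. It assumes binary labels per class (1 if the finding is real, 0 otherwise), ten equal-width confidence bins, and an unweighted mean over classes for the macro figure. The source does not state the binning scheme used for Atlas, and the class name and scores below are made up for illustration.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(observed - mean_conf)
    return ece


def macro_average(metric, per_class):
    """Unweighted mean of a metric over classes; per_class maps name -> (probs, labels)."""
    return sum(metric(p, y) for p, y in per_class.values()) / len(per_class)


# Toy data for one hypothetical class: half of the 0.7-confidence detections are real.
toy = {
    "class_a": ([0.7, 0.7, 0.7, 0.7, 0.1, 0.1, 0.1, 0.1],
                [1, 1, 0, 0, 0, 0, 0, 0]),
}
print(macro_average(brier_score, toy))
print(macro_average(expected_calibration_error, toy))
```

In the toy data the detections scored 0.7 are real only half the time, which is exactly the kind of gap between confidence and observed frequency that the calibration error summarizes.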

Fixing it after the fact

We fit one isotonic regression per class on the validation set and applied the fitted mappings to held-out test scores. That cut macro expected calibration error from 0.101 to 0.042, a 58 percent relative reduction, and improved the Brier score, without touching the underlying detector. Calibration is a thin, honest layer on top of a frozen model, not a retraining exercise.
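The idea can be sketched without any dependencies using the pool-adjacent-violators algorithm, which is the standard way to fit an isotonic (non-decreasing) map. This is an illustration under assumptions, not the code used for Atlas: the function names, the record layout, and the step-function lookup are all hypothetical, and in practice a library implementation such as scikit-learn's IsotonicRegression would be the more likely choice.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw score to observed outcome rate."""
    # Pool tied scores first so each distinct score appears once.
    pooled = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        pooled[s][0] += y
        pooled[s][1] += 1
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for s in sorted(pooled):
        blocks.append([pooled[s][0], pooled[s][1], s])
        # Merge backwards while the fitted value would otherwise decrease.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, top = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def apply_isotonic(calibrator, score):
    """Look up the calibrated value; scores beyond the fitted range are clamped."""
    uppers, values = calibrator
    idx = min(bisect_left(uppers, score), len(values) - 1)
    return values[idx]


def fit_per_class(val_records):
    """val_records: iterable of (class_name, raw_score, label) from the validation set."""
    by_class = defaultdict(lambda: ([], []))
    for cls, s, y in val_records:
        by_class[cls][0].append(s)
        by_class[cls][1].append(y)
    return {cls: fit_isotonic(s, y) for cls, (s, y) in by_class.items()}


# Made-up validation records for one hypothetical class.
calibrators = fit_per_class([
    ("class_a", 0.2, 0), ("class_a", 0.4, 1),
    ("class_a", 0.6, 0), ("class_a", 0.8, 1),
])
print(apply_isotonic(calibrators["class_a"], 0.5))
```

The detector's weights never enter this code: the calibrators are fit on validation scores only and then applied to test scores as a lookup, which is what keeps the layer thin.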

Does it help a decision?

Calibrated probabilities are only useful if they change decisions for the better. On decision-curve analysis across the prespecified threshold range, the recalibrated model showed positive net benefit over both treat-all and treat-none strategies for all six classes. That is the relevant test for a triage aid: not just sharper probabilities, but probabilities that earn their place in a workflow.
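As a rough illustration of the comparison, the standard net-benefit formula from decision-curve analysis weighs true positives against false positives at the odds implied by the threshold. The sketch below assumes binary labels and calibrated probabilities for a single class. The data and the threshold are made up, and the prespecified threshold range used in the study is not reproduced here.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of flagging every case with probability >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # cost of a false positive relative to a true positive
    return tp / n - fp / n * odds


def net_benefit_treat_all(labels, threshold):
    """Net benefit of flagging every case regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Treat-none flags nothing, so its net benefit is zero at every threshold.
NET_BENEFIT_TREAT_NONE = 0.0

# Made-up calibrated probabilities for one hypothetical class.
probs = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0]
threshold = 0.5

model_nb = net_benefit(probs, labels, threshold)
print(model_nb, net_benefit_treat_all(labels, threshold), NET_BENEFIT_TREAT_NONE)
```

A model earns its place at a given threshold when its net benefit exceeds both reference strategies. Repeating the calculation across a range of thresholds gives the decision curve.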

The deployment lesson

Calibration and threshold selection are deployment-specific steps, not fixed properties of a trained model. The same weights that rank cases well will need site-specific calibration to produce trustworthy probabilities and yes/no flags, which is exactly why current reporting standards for medical-imaging AI ask authors to be explicit about how thresholds and probabilities were derived [1].
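To make the site-specific point concrete, here is one hypothetical way a yes/no threshold might be derived from local data: pick the highest threshold that still meets a target sensitivity. The source does not say how thresholds were chosen for Atlas, so the selection rule, the function name, and both sites' data below are assumptions for illustration only.

```python
import math


def threshold_for_sensitivity(probs, labels, target_sensitivity):
    """Highest threshold at which flagging p >= threshold meets the target sensitivity."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive case to set a threshold")
    # Number of positives that must be flagged to reach the target.
    needed = max(math.ceil(target_sensitivity * len(positives)), 1)
    return positives[needed - 1]


# Made-up local validation data from two hypothetical sites.
site_a = ([0.9, 0.8, 0.6, 0.4, 0.5, 0.2], [1, 1, 1, 1, 0, 0])
site_b = ([0.7, 0.5, 0.3, 0.2, 0.4, 0.1], [1, 1, 1, 1, 0, 0])

print(threshold_for_sensitivity(*site_a, target_sensitivity=0.75))
print(threshold_for_sensitivity(*site_b, target_sensitivity=0.75))
```

The same target gives a different cut-off at each site because the score distributions differ, which is why a threshold should be reported together with the data and rule that produced it.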

References