arxiv:2604.15950

TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Published on Apr 17

Submitted by Tristan on Apr 20

MIC at DKFZ

Authors:

Tristan Kirscher ,

Abstract

The TwinTrack framework addresses ambiguity in pancreatic cancer segmentation by calibrating ensemble probabilities post hoc to the empirical mean human response, which improves calibration metrics on the CURVAS-PDACVI multi-rater benchmark.

AI-generated summary

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.

Community

Kirscher (paper author, paper submitter), about 24 hours ago

Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and its segmentation on contrast-enhanced CT is fundamentally ambiguous: when experts disagree, that disagreement often reflects real uncertainty rather than annotation noise. TwinTrack is a simple post-hoc multi-rater calibration method that transforms ensemble segmentation probabilities into predictions aligned with the Mean Human Response, better capturing expert disagreement. In other words: not just better segmentation, but better-calibrated uncertainty for genuinely ambiguous clinical images.