CLIP-to-Clinic, MIMIC-CXR Reproduction
Weights for an academic re-implementation of "Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis" by Hanbin Ko & Chang-Min Park (CVPR 2025, arXiv:2505.22079).
Produced for METU CENG501 (Spring 2026) Deep Learning (course page), the term project where students reproduce a recent paper that ships no public code. Companion code: https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin.
Headline result
The full method (n_g_dtcg) recovers the paper's central CXR-Align numbers within 1 percentage point:
| Metric | Paper | Ours |
|---|---|---|
| CXR-Align Task A (Qwen-fair cross-generator) | 96.5 | 95.5 |
| CXR-Align Task B | 80.1 | 80.4 |
What's in this repo
Eight ablation checkpoints, one per row of the paper's component decomposition. All were trained for 10 epochs at batch size 64 on the same MIMIC-CXR PA+AP frontal split.
| File | Soft labels | Graph stream | Hard negatives | What it isolates |
|---|---|---|---|---|
| baseline_final.safetensors | none (InfoNCE) | no | no | Paper's CLIP baseline |
| dt_final.safetensors | textual | no | no | Textual soft labels alone |
| dt_c_final.safetensors | textual + clinical | no | no | + CheXbert label similarity |
| n_final.safetensors | none | no | yes | Negation hard negatives only |
| n_dt_c_final.safetensors | textual + clinical | no | yes | 2-modality full method |
| g_final.safetensors | graph | yes | no | Graph soft labels alone |
| g_dtcg_final.safetensors | textual + clinical + graph | yes | no | 3-modality, no negation |
| n_g_dtcg_final.safetensors | textual + clinical + graph | yes | yes | Paper's full CLIP^(N,G)-D_(t+c+g) |
results_summary.csv carries the 8 × 37 metric matrix across every benchmark in the paper we could replicate.
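A minimal sketch of how one of these checkpoints could be loaded, assuming the `safetensors` package (with PyTorch) is installed and the file has already been downloaded to a local directory. The helper names `checkpoint_filename` and `load_weights` and the `root` argument are hypothetical; only the `<variant>_final.safetensors` naming comes from the table above. The key layout inside each file is not described here, so the sketch stops at returning the raw tensor dictionary.

```python
from pathlib import Path

# Variant names as they appear in the checkpoint table.
VARIANTS = ("baseline", "dt", "dt_c", "n", "n_dt_c", "g", "g_dtcg", "n_g_dtcg")


def checkpoint_filename(variant: str) -> str:
    """Map a variant name to its file name, <variant>_final.safetensors."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"{variant}_final.safetensors"


def load_weights(variant: str, root: str = "."):
    """Return the raw tensor dict for one variant.

    `root` is an assumed local download directory, not a path defined by the repo.
    """
    # Imported lazily so the name helper works without safetensors installed.
    from safetensors.torch import load_file

    return load_file(str(Path(root) / checkpoint_filename(variant)), device="cpu")
```

Calling `load_weights("n_g_dtcg", root="./weights")` would then read the full-method checkpoint onto the CPU, provided that directory holds the downloaded file.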
Model architecture
- Image encoder: swin_tiny_patch4_window7_224 (timm), 224 × 224 input
- Text encoder: emilyalsentzer/Bio_ClinicalBERT
- Graph encoder: 2-layer GCN over RadGraph entity/relation graphs, 772 → 256 → 512 (used only by g, g_dtcg, n_g_dtcg)
- Projection: 512-d, L2-normalised
- Temperature: fixed τ = 0.1 (paper App. C.4)
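To illustrate how the L2-normalised projection and the fixed temperature interact, here is a small pure-Python sketch. It assumes the usual CLIP-style scoring, cosine similarity divided by the temperature, which this card does not spell out. The function names are illustrative, and toy 2-d vectors stand in for the 512-d projections.

```python
import math

TAU = 0.1  # fixed temperature from the architecture list


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, tau=TAU):
    """Cosine similarity between every image/text pair, divided by tau."""
    imgs = [l2_normalise(v) for v in image_embs]
    txts = [l2_normalise(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / tau for t in txts] for i in imgs]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one image embedding scored against two text embeddings.
logits = similarity_logits([[3.0, 4.0]], [[3.0, 4.0], [-4.0, 3.0]])
probs = softmax(logits[0])
```

Because the embeddings are unit length, each logit lies in [-1/τ, 1/τ]. With τ = 0.1, an identical direction scores 10 and an orthogonal one scores 0 before the softmax.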
Ablation summary
Within-our-setup numbers (matches paper Table 2 shape: single streams inert, combinations deliver):
| Task | baseline | dt | dt_c | n | n_dt_c | g | g_dtcg | n_g_dtcg |
|---|---|---|---|---|---|---|---|---|
| MIMIC ZS mean AUC | 0.562 | 0.563 | 0.618 | 0.581 | 0.620 | 0.564 | 0.615 | 0.622 |
| CXR-Align Task B | 0.739 | 0.737 | 0.791 | 0.710 | 0.818 | 0.766 | 0.778 | 0.804 |
| Retrieval F1 (t→i) | 0.198 | 0.200 | 0.234 | 0.199 | 0.218 | 0.199 | 0.214 | 0.206 |
| CheXpert 5×200 acc | 0.279 | 0.290 | 0.373 | 0.289 | 0.396 | 0.256 | 0.361 | 0.327 |
| SIIM ZS AUC | 0.613 | 0.616 | 0.790 | 0.558 | 0.742 | 0.611 | 0.791 | 0.752 |
Negation alone (n) pushes Task A near ceiling but drops Task B. Combining negation with dynamic soft labels (n_dt_c, n_g_dtcg) restores Task B and adds the ZS / disease-classification gain, exactly the paper's claim.
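The Task B pattern can be checked directly from the table. The sketch below copies the CXR-Align Task B row and reports each variant's change against the baseline in percentage points; the function name `delta_vs_baseline` is illustrative.

```python
# CXR-Align Task B accuracy per variant, copied from the ablation table.
TASK_B = {
    "baseline": 0.739, "dt": 0.737, "dt_c": 0.791, "n": 0.710,
    "n_dt_c": 0.818, "g": 0.766, "g_dtcg": 0.778, "n_g_dtcg": 0.804,
}


def delta_vs_baseline(scores, baseline="baseline"):
    """Return each variant's change relative to the baseline, in percentage points."""
    base = scores[baseline]
    return {k: round((v - base) * 100, 1) for k, v in scores.items() if k != baseline}


deltas = delta_vs_baseline(TASK_B)
# Print from the largest drop to the largest gain.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:>9}: {d:+.1f} pt")
```

On these numbers, n sits 2.9 pt below the baseline, while n_dt_c and n_g_dtcg sit 7.9 pt and 6.5 pt above it.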
Known gaps vs the paper
- Disease ZS / retrieval lag. CheXpert 5×200 32.7 vs paper 57.3, retrieval T→I F1 20.6 vs 50.6, SIIM ZS 75.2 vs 87.2. The within-setup ablation direction still matches; absolute values are below the paper.
- Report cleanup. We use regex-based temporal-phrase removal instead of the paper's Gemini-Flash single-entity splitting. Same cleanup is applied to every variant so it mostly cancels in the within-method delta.
- CXR-Align Task A protocol. The headline 95.5 is cross-generator: r^n regenerated by Qwen2.5-7B-Instruct so the eval scores against a different LLM than training. The same-generator template variant inflates to around 98 and is not the headline number.
- PAIR_STRICT. 14.2 vs paper 34.4. Negation training widens the gt minus hallu gap from 9.4 pt to 38.7 pt because the entity token dominates the "There is no X" embedding (mechanism documented in the companion repo, not a method bug).
- Training scope. PA+AP frontal only (lateral excluded matching the paper). DeepMCDD OOD filtering is documented separately in the companion repo and does not change the within-method direction.
Intended use & limitations
Research only. These weights are an academic reproduction. They are not a medical device, not approved for clinical use, and must not be used to drive any patient-care decision. The underlying MIMIC-CXR-JPG corpus is gated by PhysioNet's Credentialed Health Data License, so downloading these weights inherits that DUA's restrictions on re-identification attempts and downstream sharing.
Known failure modes:
- The n* variants are abnormality-eager: on normal queries they retrieve abnormal reports preferentially. For normal-vs-abnormal separation, prefer dt_c or g_dtcg.
- Classes absent from MIMIC training text (e.g. CXR14's Mass, Nodule, Infiltration, Hernia) sit near 0.5 AUC.
- Trained on PA+AP frontal only. Lateral views are out of distribution.
Citation
@inproceedings{ko2025clip2clinic,
title = {Bringing {CLIP} to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis},
author = {Ko, Hanbin and Park, Chang-Min},
booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
year = {2025},
eprint = {2505.22079},
archivePrefix = {arXiv},
}
@misc{horoz2026clip2clinic_repro,
title = {Reproduction of "Bringing CLIP to the Clinic" (CVPR 2025)},
author = {Horoz, Furkan and Ozaydin, Mehmet Alp},
year = {2026},
note = {METU CENG501 Spring 2026 term project, \url{https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin}},
}
Authors
- Furkan Horoz, furkanhoroz125@gmail.com
- Mehmet Alp Ozaydin, alpozaydin@gmail.com
Middle East Technical University, Computer Engineering. CENG501 Spring 2026, under Prof. Sinan Kalkan.