CLIP-to-Clinic, MIMIC-CXR Reproduction

Weights for an academic re-implementation of "Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis" by Hanbin Ko & Chang-Min Park (CVPR 2025, arXiv:2505.22079).

Produced for METU CENG501 Deep Learning (Spring 2026), whose term project asks students to reproduce a recent paper that ships no public code. Companion code: https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin.

Headline result

The full method (n_g_dtcg) recovers the paper's central CXR-Align numbers within 1 percentage point:

Metric (accuracy, %)	Paper	Ours
CXR-Align Task A (Qwen-fair cross-generator)	96.5	95.5
CXR-Align Task B	80.1	80.4
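
The "within 1 percentage point" claim can be checked directly from the two rows above. The snippet below is only a small arithmetic sketch; the dictionary layout and the `gaps` helper are illustrative names, not part of the released code.

```python
# Headline numbers copied from the table above (percent).
headline = {
    "CXR-Align Task A": {"paper": 96.5, "ours": 95.5},
    "CXR-Align Task B": {"paper": 80.1, "ours": 80.4},
}


def gaps(table):
    """Return ours minus paper, in percentage points, for each task."""
    return {task: round(v["ours"] - v["paper"], 1) for task, v in table.items()}


deltas = gaps(headline)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.1f} pt")
```

Task A lands 1.0 pt below the paper and Task B 0.3 pt above it.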

What's in this repo

Eight ablation checkpoints, one per row of the paper's component decomposition. All were trained for 10 epochs at batch size 64 on the same MIMIC-CXR PA+AP frontal split.

File	Soft labels	Graph stream	Hard negatives	What it isolates
`baseline_final.safetensors`	none (InfoNCE)	✗	✗	Paper's CLIP baseline
`dt_final.safetensors`	textual	✗	✗	Textual soft labels alone
`dt_c_final.safetensors`	textual + clinical	✗	✗	+ CheXbert label similarity
`n_final.safetensors`	none	✗	✓	Negation hard negatives only
`n_dt_c_final.safetensors`	textual + clinical	✗	✓	2-modality full method
`g_final.safetensors`	graph	✓	✗	Graph soft labels alone
`g_dtcg_final.safetensors`	textual + clinical + graph	✓	✗	3-modality, no negation
`n_g_dtcg_final.safetensors`	textual + clinical + graph	✓	✓	Paper's full `CLIP^(N,G)-D_(t+c+g)`

results_summary.csv holds the full 8 × 37 metric matrix (8 variants × 37 metrics), covering every benchmark from the paper that we could replicate.
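
For scripting over the checkpoints, the file table can be mirrored as a small registry. This is a minimal sketch: the `Variant` dataclass and the `checkpoint_file` and `select` helpers are hypothetical names introduced here for illustration, and only the variant names, flags, and the `_final.safetensors` suffix come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    soft_labels: tuple      # subset of ("textual", "clinical", "graph")
    graph_stream: bool
    hard_negatives: bool


# One entry per row of the checkpoint table.
VARIANTS = {
    "baseline": Variant((), False, False),
    "dt":       Variant(("textual",), False, False),
    "dt_c":     Variant(("textual", "clinical"), False, False),
    "n":        Variant((), False, True),
    "n_dt_c":   Variant(("textual", "clinical"), False, True),
    "g":        Variant(("graph",), True, False),
    "g_dtcg":   Variant(("textual", "clinical", "graph"), True, False),
    "n_g_dtcg": Variant(("textual", "clinical", "graph"), True, True),
}


def checkpoint_file(name):
    """Map a variant name to its checkpoint file name."""
    if name not in VARIANTS:
        raise KeyError(f"unknown variant: {name}")
    return f"{name}_final.safetensors"


def select(**flags):
    """List variants whose attributes match every given flag."""
    return [
        name for name, v in VARIANTS.items()
        if all(getattr(v, key) == value for key, value in flags.items())
    ]


print(select(hard_negatives=True))
print(checkpoint_file("n_g_dtcg"))
```

The resulting file name can then be passed to `safetensors.torch.load_file` to obtain a state dict; the tensor key names inside each file are not documented here, so inspect them before wiring them into a model.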

Model architecture

Image encoder: swin_tiny_patch4_window7_224 (timm), 224 × 224 input
Text encoder: emilyalsentzer/Bio_ClinicalBERT
Graph encoder: 2-layer GCN over RadGraph entity/relation graphs, 772 → 256 → 512 (used only by g, g_dtcg, n_g_dtcg)
Projection: 512-d, L2-normalised
Temperature: fixed τ = 0.1 (paper App. C.4)
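
To make the last two items concrete, the sketch below shows how L2-normalised projections and a fixed τ = 0.1 turn into contrastive logits, followed by a standard symmetric InfoNCE loss as used by the baseline row. It is written in plain Python so it runs without a deep-learning framework, uses toy 2-d vectors instead of 512-d projections, and assumes the usual symmetric formulation; the soft-label variants replace the one-hot target and are not reproduced here. All function names are illustrative.

```python
import math

TAU = 0.1  # fixed temperature listed above


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, tau=TAU):
    """Cosine similarity between every image/text pair, divided by tau."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / tau for t in txts] for i in imgs]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(logits):
    """Symmetric InfoNCE; matched pairs sit on the diagonal."""
    n = len(logits)
    i2t = -sum(math.log(softmax(logits[k])[k]) for k in range(n)) / n
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(math.log(softmax(cols[k])[k]) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: two perfectly aligned image/text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
aligned_loss = info_nce(similarity_logits(images, texts))
swapped_loss = info_nce(similarity_logits(images, texts[::-1]))
print(aligned_loss, swapped_loss)
```

With τ = 0.1 a cosine similarity of 1 becomes a logit of 10, so aligned pairs give a loss close to zero and swapped pairs a loss close to 10.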

Ablation summary

Numbers measured within our setup. The pattern matches the paper's Table 2: single streams alone barely move the metrics, while combinations deliver the gains:

Task	baseline	dt	dt_c	n	n_dt_c	g	g_dtcg	n_g_dtcg
MIMIC ZS mean AUC	0.562	0.563	0.618	0.581	0.620	0.564	0.615	0.622
CXR-Align Task B	0.739	0.737	0.791	0.710	0.818	0.766	0.778	0.804
Retrieval F1 (t→i)	0.198	0.200	0.234	0.199	0.218	0.199	0.214	0.206
CheXpert 5×200 acc	0.279	0.290	0.373	0.289	0.396	0.256	0.361	0.327
SIIM ZS AUC	0.613	0.616	0.790	0.558	0.742	0.611	0.791	0.752

Negation alone (n) pushes Task A near ceiling but drops Task B (0.710 vs 0.739 for the baseline). Combining negation with dynamic soft labels (n_dt_c, n_g_dtcg) restores Task B (0.818 and 0.804) and adds the zero-shot (ZS) and disease-classification gains, consistent with the paper's claim.
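
The deltas behind these statements can be recomputed from the ablation table. The following is a small sketch with the table values typed in by hand; `ABLATION`, `delta_vs_baseline`, and `best_variant` are illustrative names and do not refer to the layout of results_summary.csv.

```python
VARIANTS = ["baseline", "dt", "dt_c", "n", "n_dt_c", "g", "g_dtcg", "n_g_dtcg"]

# Rows copied from the ablation table, in VARIANTS order.
ABLATION = {
    "MIMIC ZS mean AUC":   [0.562, 0.563, 0.618, 0.581, 0.620, 0.564, 0.615, 0.622],
    "CXR-Align Task B":    [0.739, 0.737, 0.791, 0.710, 0.818, 0.766, 0.778, 0.804],
    "Retrieval F1 (t→i)":  [0.198, 0.200, 0.234, 0.199, 0.218, 0.199, 0.214, 0.206],
    "CheXpert 5×200 acc":  [0.279, 0.290, 0.373, 0.289, 0.396, 0.256, 0.361, 0.327],
    "SIIM ZS AUC":         [0.613, 0.616, 0.790, 0.558, 0.742, 0.611, 0.791, 0.752],
}


def delta_vs_baseline(task):
    """Score of each non-baseline variant minus the baseline score."""
    scores = dict(zip(VARIANTS, ABLATION[task]))
    base = scores["baseline"]
    return {v: round(s - base, 3) for v, s in scores.items() if v != "baseline"}


def best_variant(task):
    """Variant with the highest score on the given task."""
    scores = dict(zip(VARIANTS, ABLATION[task]))
    return max(scores, key=scores.get)


task_b = delta_vs_baseline("CXR-Align Task B")
print(task_b["n"], task_b["n_dt_c"], task_b["n_g_dtcg"])
print({task: best_variant(task) for task in ABLATION})
```

On Task B, negation alone sits 0.029 below the baseline, while n_dt_c and n_g_dtcg sit 0.079 and 0.065 above it.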

Known gaps vs the paper

Disease ZS / retrieval lag. For n_g_dtcg, CheXpert 5×200 accuracy is 32.7 vs the paper's 57.3, retrieval T→I F1 is 20.6 vs 50.6, and SIIM ZS AUC is 75.2 vs 87.2. The within-setup ablation direction still matches the paper; only the absolute values fall short.
Report cleanup. We use regex-based temporal-phrase removal instead of the paper's Gemini-Flash single-entity splitting. The same cleanup is applied to every variant, so its effect mostly cancels in the within-method deltas.
CXR-Align Task A protocol. The headline 95.5 is cross-generator: r^n is regenerated by Qwen2.5-7B-Instruct, so evaluation scores against text from a different LLM than the one used in training. The same-generator template variant scores around 98, which is inflated, so it is not the headline number.
PAIR_STRICT. 14.2 vs paper 34.4. Negation training widens the gt minus hallu gap from 9.4 pt to 38.7 pt because the entity token dominates the "There is no X" embedding. The mechanism is documented in the companion repo; it is not a method bug.
Training scope. PA+AP frontal views only (lateral views excluded, matching the paper). DeepMCDD OOD filtering is documented separately in the companion repo and does not change the within-method direction.
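
As an illustration of the regex-based temporal-phrase removal mentioned under Report cleanup, a minimal sketch might look like the following. The patterns and the `strip_temporal` name are hypothetical and chosen only for this example; the actual pattern list lives in the companion repo and may differ substantially.

```python
import re

# Hypothetical patterns for illustration only; not the companion repo's list.
_TEMPORAL = re.compile(
    r"\b(?:as\s+)?compared\s+(?:to|with)\s+(?:the\s+)?(?:prior|previous)\s+\w+\b,?"
    r"|\b(?:since|from)\s+(?:the\s+)?(?:prior|previous)\s+\w+\b,?"
    r"|\bagain\b",
    flags=re.IGNORECASE,
)


def strip_temporal(report: str) -> str:
    """Remove comparison-to-prior phrases and tidy the remaining sentence."""
    text = _TEMPORAL.sub(" ", report)
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"\s+([.,;:])", r"\1", text)   # no space before punctuation
    text = text.lstrip(" ,;:").rstrip()          # drop leftover leading debris
    return text[:1].upper() + text[1:]           # restore sentence capital


print(strip_temporal("Compared to the prior study, there is no pneumothorax."))
print(strip_temporal("Small left pleural effusion is again seen."))
```

Because a rule-based pass like this is applied identically to every variant, any phrases it misses affect all eight checkpoints in the same way.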

Intended use & limitations

Research only. These weights are an academic reproduction. They are not a medical device, not approved for clinical use, and must not be used to drive any patient-care decision. The underlying MIMIC-CXR-JPG corpus is gated by PhysioNet's Credentialed Health Data License, so downloading these weights inherits that DUA's restrictions on re-identification attempts and downstream sharing.

Known failure modes:

The n* variants are abnormality-eager: on normal queries they retrieve abnormal reports preferentially. For normal-vs-abnormal separation, prefer dt_c or g_dtcg.
Classes absent from MIMIC training text (e.g. CXR14's Mass, Nodule, Infiltration, Hernia) sit near 0.5 AUC.
Trained on PA+AP frontal only. Lateral views are out of distribution.

Citation

@inproceedings{ko2025clip2clinic,
  title     = {Bringing {CLIP} to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis},
  author    = {Ko, Hanbin and Park, Chang-Min},
  booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year      = {2025},
  eprint    = {2505.22079},
  archivePrefix = {arXiv},
}

@misc{horoz2026clip2clinic_repro,
  title  = {Reproduction of "Bringing CLIP to the Clinic" (CVPR 2025)},
  author = {Horoz, Furkan and Ozaydin, Mehmet Alp},
  year   = {2026},
  note   = {METU CENG501 Spring 2026 term project, \url{https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin}},
}

Authors

Furkan Horoz, furkanhoroz125@gmail.com
Mehmet Alp Ozaydin, alpozaydin@gmail.com

Middle East Technical University, Computer Engineering. CENG501 Spring 2026, under Prof. Sinan Kalkan.
