CLIP-to-Clinic, MIMIC-CXR Reproduction

Weights for an academic re-implementation of "Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis" by Hanbin Ko & Chang-Min Park (CVPR 2025, arXiv:2505.22079).

Produced for METU CENG501 Deep Learning (Spring 2026), whose term project has students reproduce a recent paper that ships no public code. Companion code: https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin.

Headline result

The full method (n_g_dtcg) recovers the paper's central CXR-Align numbers within 1 percentage point:

| Metric | Paper | Ours |
| --- | --- | --- |
| CXR-Align Task A (Qwen-fair cross-generator) | 96.5 | 95.5 |
| CXR-Align Task B | 80.1 | 80.4 |
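The "within 1 percentage point" claim can be checked directly from the two rows above. The snippet below is a minimal sketch; the dictionary and function names are illustrative and do not come from the companion code.

```python
# Headline CXR-Align numbers copied from the table above (percent).
HEADLINE = {
    "CXR-Align Task A": {"paper": 96.5, "ours": 95.5},
    "CXR-Align Task B": {"paper": 80.1, "ours": 80.4},
}


def gap_pp(paper: float, ours: float) -> float:
    """Signed gap in percentage points (ours minus paper), rounded to 0.1."""
    return round(ours - paper, 1)


gaps = {task: gap_pp(v["paper"], v["ours"]) for task, v in HEADLINE.items()}
for task, gap in gaps.items():
    print(f"{task}: {gap:+.1f} pp")

# The reproduction claim: every gap is at most 1 pp in magnitude.
within_one_pp = all(abs(g) <= 1.0 for g in gaps.values())
print("within 1 pp:", within_one_pp)
```

Task A sits exactly at the 1 pp boundary (below the paper), while Task B is 0.3 pp above it.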

What's in this repo

Eight ablation checkpoints, one per row of the paper's component decomposition. All were trained for 10 epochs at batch size 64 on the same MIMIC-CXR PA+AP frontal split.

| File | Soft labels | Graph stream | Hard negatives | What it isolates |
| --- | --- | --- | --- | --- |
| baseline_final.safetensors | none (InfoNCE) | ✗ | ✗ | Paper's CLIP baseline |
| dt_final.safetensors | textual | ✗ | ✗ | Textual soft labels alone |
| dt_c_final.safetensors | textual + clinical | ✗ | ✗ | + CheXbert label similarity |
| n_final.safetensors | none | ✗ | ✓ | Negation hard negatives only |
| n_dt_c_final.safetensors | textual + clinical | ✗ | ✓ | 2-modality full method |
| g_final.safetensors | graph | ✓ | ✗ | Graph soft labels alone |
| g_dtcg_final.safetensors | textual + clinical + graph | ✓ | ✗ | 3-modality, no negation |
| n_g_dtcg_final.safetensors | textual + clinical + graph | ✓ | ✓ | Paper's full CLIP^(N,G)-D_(t+c+g) |

results_summary.csv carries the 8 × 37 metric matrix across every benchmark in the paper we could replicate.
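For scripting over the eight checkpoints, the component table can be transcribed into a small lookup. This is a sketch only: the `Variant` tuple, `checkpoint_file`, and `select` are hypothetical helpers introduced here for illustration, and the flags are copied from the table rather than read from the companion code.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    soft_labels: tuple  # similarity sources that feed the soft labels
    graph_stream: bool
    hard_negatives: bool


# Transcribed from the checkpoint table; files are "<key>_final.safetensors".
VARIANTS = {
    "baseline": Variant((), False, False),
    "dt": Variant(("textual",), False, False),
    "dt_c": Variant(("textual", "clinical"), False, False),
    "n": Variant((), False, True),
    "n_dt_c": Variant(("textual", "clinical"), False, True),
    "g": Variant(("graph",), True, False),
    "g_dtcg": Variant(("textual", "clinical", "graph"), True, False),
    "n_g_dtcg": Variant(("textual", "clinical", "graph"), True, True),
}


def checkpoint_file(name: str) -> str:
    """File name of a variant's checkpoint, following the table's naming."""
    if name not in VARIANTS:
        raise KeyError(f"unknown variant: {name}")
    return f"{name}_final.safetensors"


def select(*, graph_stream=None, hard_negatives=None):
    """Variant names matching the given flags (None means 'do not filter')."""
    return [
        name
        for name, v in VARIANTS.items()
        if (graph_stream is None or v.graph_stream == graph_stream)
        and (hard_negatives is None or v.hard_negatives == hard_negatives)
    ]


print(select(hard_negatives=True))
print(checkpoint_file("n_g_dtcg"))
```

Filtering on `graph_stream=True` also identifies the three variants that need the graph encoder at load time.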

Model architecture

  • Image encoder: swin_tiny_patch4_window7_224 (timm), 224 × 224 input
  • Text encoder: emilyalsentzer/Bio_ClinicalBERT
  • Graph encoder: 2-layer GCN over RadGraph entity/relation graphs, 772 → 256 → 512 (used only by g, g_dtcg, n_g_dtcg)
  • Projection: 512-d, L2-normalised
  • Temperature: fixed τ = 0.1 (paper App. C.4)
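The last two items combine at scoring time in the usual CLIP fashion: embeddings are L2-normalised so that a dot product is a cosine similarity, and the similarities are divided by the temperature before a softmax. The sketch below illustrates that generic recipe in plain Python; it is not the companion repo's code, the function names are illustrative, and toy 3-d vectors stand in for the 512-d projections.

```python
import math

TAU = 0.1  # fixed temperature from the architecture list above


def l2_normalise(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, tau=TAU):
    """Softmax over cosine similarities divided by the temperature."""
    img = l2_normalise(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, l2_normalise(t))) / tau
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up toy embeddings: one image, two candidate text prompts.
image = [0.9, 0.1, 0.0]
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = zero_shot_probs(image, prompts)
print(probs)
```

A small τ sharpens the distribution: with τ = 0.1, a cosine gap of 0.1 between two prompts already becomes a logit gap of 1.0.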

Ablation summary

Numbers within our setup. The pattern matches the paper's Table 2: single streams alone are nearly inert, while combinations deliver the gains.

| Task | baseline | dt | dt_c | n | n_dt_c | g | g_dtcg | n_g_dtcg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIMIC ZS mean AUC | 0.562 | 0.563 | 0.618 | 0.581 | 0.620 | 0.564 | 0.615 | 0.622 |
| CXR-Align Task B | 0.739 | 0.737 | 0.791 | 0.710 | 0.818 | 0.766 | 0.778 | 0.804 |
| Retrieval F1 (t→i) | 0.198 | 0.200 | 0.234 | 0.199 | 0.218 | 0.199 | 0.214 | 0.206 |
| CheXpert 5×200 acc | 0.279 | 0.290 | 0.373 | 0.289 | 0.396 | 0.256 | 0.361 | 0.327 |
| SIIM ZS AUC | 0.613 | 0.616 | 0.790 | 0.558 | 0.742 | 0.611 | 0.791 | 0.752 |

Negation alone (n) pushes Task A near ceiling but drops Task B (0.739 to 0.710). Combining negation with dynamic soft labels (n_dt_c, n_g_dtcg) restores Task B (0.818 and 0.804) and adds the zero-shot and disease-classification gains, consistent with the paper's claim.
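The direction of each ablation is easiest to read as a delta against the baseline column. The following sketch recomputes those deltas from the table values; the variable and function names are illustrative, and nothing here is read from results_summary.csv.

```python
VARIANTS = ["baseline", "dt", "dt_c", "n", "n_dt_c", "g", "g_dtcg", "n_g_dtcg"]

# Rows transcribed from the ablation table above.
ABLATION = {
    "MIMIC ZS mean AUC": [0.562, 0.563, 0.618, 0.581, 0.620, 0.564, 0.615, 0.622],
    "CXR-Align Task B": [0.739, 0.737, 0.791, 0.710, 0.818, 0.766, 0.778, 0.804],
    "Retrieval F1 (t→i)": [0.198, 0.200, 0.234, 0.199, 0.218, 0.199, 0.214, 0.206],
    "CheXpert 5×200 acc": [0.279, 0.290, 0.373, 0.289, 0.396, 0.256, 0.361, 0.327],
    "SIIM ZS AUC": [0.613, 0.616, 0.790, 0.558, 0.742, 0.611, 0.791, 0.752],
}


def delta_vs_baseline(task):
    """Per-variant change relative to the baseline column, rounded to 0.001."""
    row = dict(zip(VARIANTS, ABLATION[task]))
    base = row["baseline"]
    return {name: round(value - base, 3) for name, value in row.items()}


def best_variant(task):
    """Name of the variant with the highest score on a task."""
    row = dict(zip(VARIANTS, ABLATION[task]))
    return max(row, key=row.get)


task_b = delta_vs_baseline("CXR-Align Task B")
for name in ("n", "n_dt_c", "n_g_dtcg"):
    print(f"Task B, {name}: {task_b[name]:+.3f}")
print("best on Task B:", best_variant("CXR-Align Task B"))
```

On Task B this gives a negative delta for n and positive deltas for both negation-plus-soft-label variants, which is the pattern described above.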

Known gaps vs the paper

  • Disease ZS / retrieval lag. CheXpert 5×200 accuracy is 32.7 vs the paper's 57.3, retrieval T→I F1 is 20.6 vs 50.6, and SIIM ZS AUC is 75.2 vs 87.2. The direction of the within-setup ablation still matches; the absolute values fall below the paper's.
  • Report cleanup. We use regex-based temporal-phrase removal instead of the paper's Gemini-Flash single-entity splitting. The same cleanup is applied to every variant, so its effect mostly cancels in the within-method deltas.
  • CXR-Align Task A protocol. The headline 95.5 is cross-generator: r^n regenerated by Qwen2.5-7B-Instruct so the eval scores against a different LLM than training. The same-generator template variant inflates to around 98 and is not the headline number.
  • PAIR_STRICT. 14.2 vs paper 34.4. Negation training widens the gt minus hallu gap from 9.4 pt to 38.7 pt because the entity token dominates the "There is no X" embedding (mechanism documented in the companion repo, not a method bug).
  • Training scope. PA+AP frontal only (lateral views excluded, matching the paper). DeepMCDD OOD filtering is documented separately in the companion repo and does not change the within-method direction.
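The frontal-only scope amounts to a filter on view position. The sketch below shows one way such a filter could look, assuming a metadata CSV with a `ViewPosition` column as in the MIMIC-CXR-JPG metadata file; the sample rows and IDs are made up, and the companion repo's actual filtering may differ.

```python
import csv
import io

FRONTAL_VIEWS = {"PA", "AP"}

# Tiny in-memory stand-in for a metadata CSV; all IDs are invented.
SAMPLE = """dicom_id,study_id,ViewPosition
img-001,s1,PA
img-002,s1,LATERAL
img-003,s2,AP
img-004,s3,LL
img-005,s4,
"""


def frontal_rows(handle):
    """Yield rows whose view position is PA or AP; blank views are dropped."""
    for row in csv.DictReader(handle):
        if row["ViewPosition"].strip().upper() in FRONTAL_VIEWS:
            yield row


kept = [row["dicom_id"] for row in frontal_rows(io.StringIO(SAMPLE))]
print(kept)  # ['img-001', 'img-003']
```

Rows with a missing view position are excluded here rather than guessed, which is a conservative choice for this illustration.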

Intended use & limitations

Research only. These weights are an academic reproduction. They are not a medical device, not approved for clinical use, and must not be used to drive any patient-care decision. The underlying MIMIC-CXR-JPG corpus is gated by PhysioNet's Credentialed Health Data License, so downloading these weights inherits that DUA's restrictions on re-identification attempts and downstream sharing.

Known failure modes:

  • The n* variants are abnormality-eager: on normal queries they retrieve abnormal reports preferentially. For normal-vs-abnormal separation, prefer dt_c or g_dtcg.
  • Classes absent from MIMIC training text (e.g. CXR14's Mass, Nodule, Infiltration, Hernia) sit near 0.5 AUC.
  • Trained on PA+AP frontal only. Lateral views are out of distribution.

Citation

@inproceedings{ko2025clip2clinic,
  title     = {Bringing {CLIP} to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis},
  author    = {Ko, Hanbin and Park, Chang-Min},
  booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year      = {2025},
  eprint    = {2505.22079},
  archivePrefix = {arXiv},
}

@misc{horoz2026clip2clinic_repro,
  title  = {Reproduction of "Bringing CLIP to the Clinic" (CVPR 2025)},
  author = {Horoz, Furkan and Ozaydin, Mehmet Alp},
  year   = {2026},
  note   = {METU CENG501 Spring 2026 term project, \url{https://github.com/alpozaydin/CENG501-Spring2026/tree/main/HorozOzaydin}},
}

Authors

  • Furkan Horoz, furkanhoroz125@gmail.com
  • Mehmet Alp Ozaydin, alpozaydin@gmail.com

Middle East Technical University, Computer Engineering. CENG501 Spring 2026, under Prof. Sinan Kalkan.
