---
license: apache-2.0
language:
  - en
library_name: pytorch
thumbnail: opengraph_card.jpeg
tags:
  - medical-imaging
  - ultrasound
  - keypoint-detection
  - fetal-biometry
  - cardiac-ultrasound
  - ijepa
  - vitpose
  - knowledge-distillation
pipeline_tag: keypoint-detection
datasets:
  - FM_UIA_2026
  - multicentre-fetal-biometry-2025
model-index:
  - name: Sonichu
    results:
      - task:
          type: keypoint-detection
          name: Multi-task ultrasound biometry (9 tasks)
        dataset:
          name: >-
            FM_UIA 2026 fair validation (original labels only, 15% split, seed
            42)
          type: private
        metrics:
          - name: Weighted MRE (pixels, TTA)
            type: mre
            value: 7.3
          - name: Unweighted per-task mean MRE (pixels, TTA)
            type: mre
            value: 18.07
          - name: FUGC MRE (pixels, TTA)
            type: mre
            value: 3.9
          - name: Foetal femur MRE (pixels, TTA)
            type: mre
            value: 7.1
metrics:
  - mae
---

# SONICHU-124M: a foundation model of ultrasound biometry

SONICHU-124M (Single One-shot Neural Inference of Coordinates in Human Ultrasound) is a foundation model for 9-task B-mode ultrasound biometry. It achieves 7.30 px weighted mean radial error on the FM_UIA tasks in a single forward pass (two passes with TTA). And it's tiny!

## What this model does

Given a 2D B-mode ultrasound image, SONICHU predicts anatomical keypoints for nine biometric measurements. Users specify which task they want at inference time, and the model returns normalised xy coordinates in [0, 1], which can be scaled back to pixel coordinates.

| task        | keypoints | anatomy                            | fair MRE (px, TTA) |
|-------------|-----------|------------------------------------|--------------------|
| AOP         | 4         | angle of progression (intrapartum) | 4.8                |
| FUGC        | 2         | foetal umbilical cord              | 3.9                |
| FA          | 4         | foetal abdomen biometry            | 7.6                |
| HC          | 4         | foetal head circumference          | 8.0                |
| IVC         | 2         | inferior vena cava                 | 29.0               |
| PLAX        | 22        | cardiac parasternal long-axis      | 15.7               |
| PSAX        | 4         | cardiac parasternal short-axis     | 22.8               |
| A4C         | 16        | apical four-chamber view           | 63.7               |
| fetal_femur | 2         | foetal femur length                | 7.1                |

Cardiac tasks (PLAX, PSAX, A4C, IVC) and fetal_femur have limited real-labelled training data (under 100 samples each for PLAX, PSAX and A4C); treat those numbers as indicative rather than clinical-grade. The FUGC result (3.9 px) is the best across all models we evaluated.
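As a worked example of turning predicted keypoints into a measurement, the two fetal_femur keypoints define a length. The helper below and the `mm_per_pixel` calibration are hypothetical; per-pixel spacing comes from the acquisition metadata, not from the model.

```python
import numpy as np

def femur_length_mm(kps_pixel: np.ndarray, mm_per_pixel: float) -> float:
    """Euclidean distance between the two fetal_femur keypoints, in mm.

    kps_pixel: (2, 2) array of (x, y) pixel coordinates, e.g. the model's
    normalised output scaled by the original image size.
    mm_per_pixel: device/acquisition-specific calibration (an assumption here).
    """
    p0, p1 = kps_pixel
    return float(np.linalg.norm(p1 - p0) * mm_per_pixel)

# endpoints 60 px apart at 0.5 mm/px -> 30 mm
print(femur_length_mm(np.array([[10.0, 20.0], [70.0, 20.0]]), 0.5))
```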

## Intended use

- Transfer-learning starting point for related ultrasound keypoint tasks
- The IJEPA-pretrained backbone alone is a useful domain-adapted feature extractor (160k ultrasound frames of self-supervised pretraining)

**Not for clinical use.** This model has not been clinically validated. It must not be used for patient diagnosis or treatment decisions.

## Out of scope

- Non-ultrasound imaging modalities (CT, MRI, optical)
- 3D volumes (this is a 2D frame-level model)

## Quick start

```python
import cv2
import torch
import numpy as np
from modeling_sonichu import SonichuModel, SonichuPreprocessor

model = SonichuModel.from_pretrained(".")
prep = SonichuPreprocessor.from_pretrained(".")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

img_bgr = cv2.imread("my_ultrasound.png")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

inputs = prep(img_rgb)
kps_norm = model.predict(inputs["pixel_values"].to(device),
                         task="fetal_femur", tta=True)
# kps_norm: (1, n_kp, 2) in [0, 1]

kps_pixel = kps_norm[0].cpu().numpy() * np.array([inputs["orig_w"], inputs["orig_h"]])
print(kps_pixel)
```

A complete inference script with overlay rendering is provided in `inference.py`:

```shell
python inference.py my_ultrasound.png fetal_femur
```

## Model architecture

| component    | spec                                                                                    |
|--------------|-----------------------------------------------------------------------------------------|
| backbone     | ViT-B/16, 86M params, 768-dim, 12 layers, 12 heads                                      |
| head         | ViTPose: 16×16 patch tokens → 2 deconv layers × 256 filters → soft-argmax per keypoint  |
| input        | 256×256 RGB, ImageNet normalisation, replicate single-channel inputs                    |
| output       | normalised xy coordinates in [0, 1] per keypoint                                        |
| total params | 128M                                                                                    |
| weights      | model.safetensors (511 MB, fp32)                                                        |
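The soft-argmax step at the end of the head can be sketched as follows. This is a generic implementation of the technique, not the repository's actual head code; the sharpening factor `beta` and the shapes are illustrative.

```python
import torch

def soft_argmax_2d(heatmaps: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable argmax over per-keypoint heatmaps.

    heatmaps: (B, K, H, W) raw logits, one map per keypoint.
    Returns (B, K, 2) normalised (x, y) coordinates in [0, 1]:
    a softmax turns each map into a distribution, and the coordinate
    is the expectation of a [0, 1] grid under that distribution.
    """
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.reshape(b, k, h * w), dim=-1)
    probs = probs.reshape(b, k, h, w)
    ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
    x = (probs * xs).sum(dim=(2, 3))  # expected x per keypoint
    y = (probs * ys).sum(dim=(2, 3))  # expected y per keypoint
    return torch.stack([x, y], dim=-1)

# a sharp peak at (row 16, col 48) of a 64x64 map -> close to (48/63, 16/63)
hm = torch.zeros(1, 1, 64, 64)
hm[0, 0, 16, 48] = 1.0
print(soft_argmax_2d(hm))
```

Because the output is an expectation rather than a hard argmax, gradients flow through coordinate-space losses back into the heatmaps.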

## How this model was trained

### Stage 1: IJEPA self-supervised pretraining

The ViT-B/16 backbone was pretrained using the I-JEPA objective on 160,486 unlabelled ultrasound frames (A4C, HC, FA, AOP views). Representation-space prediction is more robust to speckle noise than pixel-space methods such as MAE, as US-JEPA demonstrates.
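The pretraining code is not part of this repository; the shape-level sketch below only illustrates the I-JEPA idea. Toy linear layers stand in for the ViT context/target encoder pair and the predictor, and in the real recipe the target encoder is an EMA copy of the context encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real modules (names and sizes are illustrative).
D = 32
context_encoder = nn.Linear(D, D)
target_encoder = nn.Linear(D, D)   # in I-JEPA: an EMA copy, never backpropped
predictor = nn.Linear(D, D)

def ijepa_step(patches, ctx_idx, tgt_idx):
    """patches: (B, N, D) patch embeddings; returns representation-space loss."""
    with torch.no_grad():                       # target branch gets no gradients
        targets = target_encoder(patches)[:, tgt_idx]
    ctx = context_encoder(patches[:, ctx_idx])  # encode visible context only
    pred = predictor(ctx.mean(dim=1, keepdim=True)).expand_as(targets)
    return F.smooth_l1_loss(pred, targets)      # loss on representations, not pixels

patches = torch.randn(4, 16, D)
loss = ijepa_step(patches, ctx_idx=torch.arange(0, 12), tgt_idx=torch.arange(12, 16))
loss.backward()  # gradients reach the context encoder and predictor only
print(float(loss))
```

Predicting in representation space rather than pixel space is what makes the objective insensitive to speckle, since the target encoder has already abstracted it away.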

### Stage 2: Five-teacher ensemble

Five ViTPose models were trained separately from the IJEPA backbone, each with a different pseudo-label regime (R1, R2, R3, Selective/FUGC-capped, r3capped/balanced). The ensemble of their coordinate-wise medians achieved 6.30 px weighted MRE on fair validation.
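Coordinate-wise median ensembling is a one-liner; the predictions below are made up to show why the median is used rather than the mean.

```python
import numpy as np

# Five teachers' predictions for one keypoint: (n_teachers, n_kp, 2), normalised xy.
preds = np.array([
    [[0.50, 0.40]],
    [[0.52, 0.41]],
    [[0.51, 0.39]],
    [[0.90, 0.40]],  # one teacher is badly off on x
    [[0.49, 0.42]],
])
ensemble = np.median(preds, axis=0)  # per-coordinate median ignores the outlier
print(ensemble)  # -> [[0.51 0.4 ]]
```

A mean would have been dragged to x ≈ 0.58 by the outlying teacher; the median stays at 0.51.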

### Stage 3: Knowledge distillation to a single model

This published model is the distilled student:

- Same ViTPose architecture as each teacher
- On-the-fly teacher predictions during training: each batch runs all five teachers, and their coordinate-wise median becomes the teacher target
- Combined loss: `loss_real + 0.5 * loss_teacher`
- 100 epochs, AdamW, cosine schedule with 3-epoch warm-up
- Training set: 32,722 labelled + pseudo-labelled samples from the FM_UIA challenge
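The combined loss above can be sketched as follows. Only the `loss_real + 0.5 * loss_teacher` weighting and the median teacher target come from the recipe; the L1 penalty, the label mask handling, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_kps, real_kps, real_mask, teacher_kps_all):
    """Combined distillation loss, sketched under assumed shapes.

    student_kps:     (B, K, 2) student predictions (normalised xy)
    real_kps:        (B, K, 2) ground-truth keypoints where available
    real_mask:       (B,) 1.0 where a real label exists, else 0.0
    teacher_kps_all: (T, B, K, 2) on-the-fly predictions of the T=5 teachers
    """
    teacher_target = teacher_kps_all.median(dim=0).values  # coordinate-wise median
    loss_teacher = F.l1_loss(student_kps, teacher_target)
    per_sample = F.l1_loss(student_kps, real_kps, reduction="none").mean(dim=(1, 2))
    loss_real = (per_sample * real_mask).sum() / real_mask.sum().clamp(min=1)
    return loss_real + 0.5 * loss_teacher                  # weighting from the recipe

student = torch.rand(8, 4, 2, requires_grad=True)
loss = distillation_loss(student, torch.rand(8, 4, 2),
                         torch.ones(8), torch.rand(5, 8, 4, 2))
loss.backward()
```

Every sample contributes a teacher term, so the student also learns from frames that carry only pseudo-labels.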

Why distillation? A single model is 5× cheaper to run than the ensemble at inference time. Naive weight averaging (model soup) was destructive at 87 px: the teachers had diverged too far during supervised training with different pseudo-label distributions. Distillation was required.
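For reference, the naive soup that failed is just a uniform average of state dicts, something like:

```python
import torch

def naive_soup(state_dicts):
    """Uniform weight average of models with identical architectures.

    Averaging only helps when the endpoints sit in a shared low-loss basin;
    teachers fine-tuned on divergent pseudo-label distributions do not, which
    is why this degraded to ~87 px here.
    """
    soup = {}
    for key in state_dicts[0]:
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# toy check with two single-tensor "models"
a = {"w": torch.tensor([1.0, 3.0])}
b = {"w": torch.tensor([3.0, 5.0])}
print(naive_soup([a, b]))  # {'w': tensor([2., 4.])}
```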

## Performance

Fair validation = original labelled samples only (no pseudo-labels), 15% random split with seed 42.
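MRE here is mean radial error: the mean Euclidean distance between predicted and reference keypoints. A minimal sketch of the metric (the challenge's exact evaluation and weighting code may differ):

```python
import numpy as np

def mean_radial_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and reference keypoints.

    pred, gt: (N, K, 2) pixel coordinates for N images with K keypoints each.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

pred = np.array([[[10.0, 10.0], [20.0, 20.0]]])
gt   = np.array([[[13.0, 14.0], [20.0, 20.0]]])
print(mean_radial_error(pred, gt))  # -> 2.5 (radial errors 5.0 and 0.0)
```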

| model                                        | weighted MRE | unweighted mean | inference cost      |
|----------------------------------------------|--------------|-----------------|---------------------|
| Competition baseline (EfficientNet-B4 + FPN) | 67.43        | n/a             | 1 forward pass      |
| Sonichu distilled (this model)               | 7.30         | 18.07           | 2 passes (with TTA) |
| Best single teacher (R3)                     | 7.63         | 19.54           | 2 passes            |
| 4-model median ensemble                      | 6.81         | 17.29           | 8 passes            |
| 5-model median ensemble                      | 6.30         | 16.40           | 10 passes           |

The distilled model trades approximately 1 px of weighted MRE for a 5× inference speedup versus the full ensemble, making it the practical choice for deployment.
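The second pass in the TTA cost is assumed here to be a horizontal flip (the card does not specify the augmentation); flip TTA for keypoints means un-flipping the x coordinates before averaging. A minimal sketch with a toy grayscale predictor:

```python
import numpy as np

def flip_tta(predict, image):
    """Average keypoints over the original and a horizontally flipped image.

    `predict` maps a 2D grayscale image to (K, 2) normalised xy keypoints.
    Mirroring the image maps x to 1 - x, so the flipped prediction must be
    un-flipped before averaging. Tasks with left/right-paired keypoints would
    also need an index swap, which is omitted here.
    """
    kps = predict(image)
    kps_f = predict(image[:, ::-1])   # second forward pass on the mirror image
    kps_f[:, 0] = 1.0 - kps_f[:, 0]   # undo the flip in x
    return (kps + kps_f) / 2.0

# toy predictor: returns the brightest pixel, normalised to [0, 1]
def brightest(img):
    y, x = np.unravel_index(np.argmax(img), img.shape)
    return np.array([[x / (img.shape[1] - 1), y / (img.shape[0] - 1)]])

img = np.zeros((5, 5))
img[2, 1] = 1.0
print(flip_tta(brightest, img))  # -> [[0.25 0.5 ]]
```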

Note that the weighted MRE is dominated by AOP (60.7% of fair val samples), so for applications where per-task balance matters, the unweighted mean is more informative.
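To make the weighting effect concrete, here is a sketch using the per-task MREs from the table above. Only the 60.7% AOP share comes from the card; the remaining sample counts are invented for illustration.

```python
import numpy as np

# per-task fair-val MRE (px, TTA): AOP, FUGC, FA, HC, IVC, PLAX, PSAX, A4C, fetal_femur
mre = np.array([4.8, 3.9, 7.6, 8.0, 29.0, 15.7, 22.8, 63.7, 7.1])
# illustrative sample counts; only the ~60.7% AOP share reflects the card
counts = np.array([607, 50, 100, 100, 8, 16, 6, 50, 63])

unweighted = mre.mean()                           # every task counts equally
weighted = (mre * counts).sum() / counts.sum()    # dominated by the AOP figure
print(unweighted, weighted)
```

The unweighted mean of the table's nine MREs reproduces the card's 18.07 px, while any sample-weighted mean with AOP at ~60% of the data is pulled far toward AOP's 4.8 px.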

## Limitations

  1. A4C is the weakest task (63.7 px). Only 108 real A4C labels exist in the training set. Further improvements require external cardiac data (e.g. EchoNet-Dynamic).
  2. IVC (n=8 val), PSAX (n=6 val), PLAX (n=16 val) have very small validation counts. Their per-sample MRE has high variance; treat these numbers as trend indicators.
  3. Weighted vs unweighted: the weighted overall (7.30) is AOP-dominated. Clinical deployment should consider per-task performance.
  4. Population: training data comes from a specific set of clinical sites and devices. Performance on out-of-distribution populations is untested.
  5. Static frames only: this model does not use temporal information from ultrasound video sequences.

## Citation

If you use this model, please cite:

```bibtex
@misc{sonichu-2026,
  title  = {Sonichu: a distilled IJEPA-based model for multi-task ultrasound biometry},
  author = {von Csefalvay, Chris},
  year   = {2026},
  note   = {ISBI 2026 FM\_UIA challenge submission}
}
```

Key underlying references:

- Assran et al. 2023, I-JEPA (arXiv:2301.08243)
- Xu et al. 2022, ViTPose (arXiv:2204.12484)
- Radhachandran et al. 2026, US-JEPA (arXiv:2602.19322)
- Deng, Tang, Li 2026, FM_UIA 2026 baseline (arXiv:2602.01055)
- Hinton, Vinyals, Dean 2015, Distilling the Knowledge in a Neural Network (arXiv:1503.02531)

## Contents of this repository

| file                     | description                                         |
|--------------------------|-----------------------------------------------------|
| README.md                | this model card                                     |
| config.json              | model hyperparameters and task metadata             |
| preprocessor_config.json | image preprocessing parameters                      |
| model.safetensors        | model weights (128M params, 511 MB)                 |
| modeling_sonichu.py      | self-contained PyTorch model class and preprocessor |
| inference.py             | end-to-end inference example with overlay rendering |

## License

Apache 2.0. See LICENSE in the repository.

Training used the FM_UIA 2026 challenge dataset (competition terms of use) and the Multi-centre Fetal Biometry Benchmark Dataset (DOI 10.5522/04/30819911, CC BY-NC-SA 4.0). Downstream users should respect those licenses.