SONICHU-124M: a foundation model of ultrasound biometry
SONICHU-124M (Single One-shot Neural Inference of Coordinates in Human Ultrasound) is a foundation model for 9-task B-mode ultrasound biometry. It achieves 7.30 px weighted mean radial error on the FM_UIA tasks in a single forward pass (two passes with TTA). And it's tiny!
What this model does
Given a 2D B-mode ultrasound image, SONICHU predicts anatomical keypoints for nine biometric measurements. Users specify which task they want at inference time, and the model returns normalised xy coordinates in [0, 1] which can be scaled back to pixel coordinates.
| task | keypoints | anatomy | fair MRE (px, TTA) |
|---|---|---|---|
| AOP | 4 | angle of progression (intrapartum) | 4.8 |
| FUGC | 2 | foetal umbilical cord | 3.9 |
| FA | 4 | foetal abdomen biometry | 7.6 |
| HC | 4 | foetal head circumference | 8.0 |
| IVC | 2 | inferior vena cava | 29.0 |
| PLAX | 22 | cardiac parasternal long-axis | 15.7 |
| PSAX | 4 | cardiac parasternal short-axis | 22.8 |
| A4C | 16 | apical four-chamber view | 63.7 |
| fetal_femur | 2 | foetal femur length | 7.1 |
Cardiac tasks (PLAX, PSAX, A4C, IVC) and fetal_femur have limited real-labelled training data (under 100 samples for the first three); treat those numbers as indicative rather than clinical-grade. The FUGC result (3.9 px) is the best across all models we evaluated.
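The per-task numbers above are mean radial errors (MRE): the Euclidean distance in pixels between each predicted keypoint and its label, averaged over keypoints and samples. A minimal sketch of the metric (the function name is illustrative, not part of this repository):

```python
import numpy as np

def mean_radial_error(pred, gt):
    """Mean Euclidean distance in pixels between predicted and
    ground-truth keypoints.

    pred, gt: arrays of shape (n_samples, n_keypoints, 2) in pixel units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Two samples, two keypoints each; every prediction is off by (3, 4) px.
gt = np.zeros((2, 2, 2))
pred = gt + np.array([3.0, 4.0])
print(mean_radial_error(pred, gt))  # 5.0
```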
Intended use
- Transfer-learning starting point for related ultrasound keypoint tasks
- The IJEPA-pretrained backbone alone is a useful domain-adapted feature extractor (160k ultrasound frames of self-supervised pretraining)
Not for clinical use. This model has not been clinically validated. It must not be used for patient diagnosis or treatment decisions.
Out of scope
- Non-ultrasound imaging modalities (CT, MRI, optical)
- 3D volumes (this is a 2D frame-level model)
Quick start
```python
import cv2
import numpy as np
import torch

from modeling_sonichu import SonichuModel, SonichuPreprocessor

model = SonichuModel.from_pretrained(".")
prep = SonichuPreprocessor.from_pretrained(".")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

img_bgr = cv2.imread("my_ultrasound.png")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
inputs = prep(img_rgb)

kps_norm = model.predict(inputs["pixel_values"].to(device),
                         task="fetal_femur", tta=True)
# kps_norm: (1, n_kp, 2) in [0, 1]
kps_pixel = kps_norm[0].cpu().numpy() * np.array([inputs["orig_w"], inputs["orig_h"]])
print(kps_pixel)
```
A complete inference script with overlay rendering is provided in inference.py:
```shell
python inference.py my_ultrasound.png fetal_femur
```
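The card reports "2 passes with TTA"; `predict(..., tta=True)` handles this internally. A common two-pass scheme for keypoints is horizontal-flip averaging, sketched below purely to illustrate the idea — this is an assumption about the mechanism, not the model's actual implementation, and asymmetric keypoint sets would additionally need a left/right index swap:

```python
import torch

def flip_tta(model_fn, pixel_values, task):
    """Illustrative two-pass horizontal-flip TTA for normalised keypoints.

    model_fn(pixel_values, task) -> (1, n_kp, 2) coordinates in [0, 1].
    """
    kps = model_fn(pixel_values, task)
    # Second pass on the horizontally flipped image.
    kps_flip = model_fn(torch.flip(pixel_values, dims=[-1]), task)
    kps_flip[..., 0] = 1.0 - kps_flip[..., 0]  # mirror x back
    return (kps + kps_flip) / 2
```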
Model architecture
| component | spec |
|---|---|
| backbone | ViT-B/16, 86M params, 768-dim, 12 layers, 12 heads |
| head | ViTPose: 16×16 patch tokens → 2 deconv layers × 256 filters → soft-argmax per keypoint |
| input | 256×256 RGB, ImageNet normalisation, replicate single-channel inputs |
| output | Normalised xy coordinates in [0, 1] per keypoint |
| total params | 128M |
| weights | model.safetensors (511 MB, fp32) |
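The soft-argmax head turns each keypoint's heatmap into differentiable coordinates: a softmax over all spatial positions followed by an expectation over a coordinate grid. A minimal sketch (shapes and grid convention are illustrative):

```python
import torch

def soft_argmax_2d(heatmaps):
    """Differentiable 2D soft-argmax.

    heatmaps: (batch, n_kp, H, W) raw logits.
    Returns: (batch, n_kp, 2) expected xy coordinates in [0, 1].
    """
    b, k, h, w = heatmaps.shape
    # Softmax over all H*W positions of each keypoint's heatmap.
    probs = heatmaps.flatten(2).softmax(dim=-1).view(b, k, h, w)
    ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
    # Expected coordinate under the heatmap distribution.
    x = (probs * xs).sum(dim=(-2, -1))
    y = (probs * ys).sum(dim=(-2, -1))
    return torch.stack([x, y], dim=-1)
```

Unlike a hard argmax, this keeps the coordinate regression end-to-end differentiable.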
How this model was trained
Stage 1: IJEPA self-supervised pretraining
The ViT-B/16 backbone was pretrained using the I-JEPA objective on 160,486 unlabelled ultrasound frames (A4C, HC, FA, AOP views). Representation-space prediction is more robust to speckle noise than pixel-space methods such as MAE, as US-JEPA demonstrates.
Stage 2: Five-teacher ensemble
Five ViTPose models were trained separately from the IJEPA backbone, each with a different pseudo-label regime (R1, R2, R3, Selective/FUGC-capped, r3capped/balanced). The ensemble of their coordinate-wise medians achieved 6.30 px weighted MRE on fair validation.
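The coordinate-wise median is taken independently over each keypoint's x and y across the five teachers, which makes the ensemble robust to a single outlier teacher. A minimal sketch:

```python
import numpy as np

def median_ensemble(teacher_preds):
    """Coordinate-wise median across teachers.

    teacher_preds: (n_teachers, n_kp, 2) normalised keypoints.
    Returns: (n_kp, 2).
    """
    return np.median(teacher_preds, axis=0)

# One outlier teacher barely moves the median.
preds = np.array([[[0.50, 0.50]], [[0.51, 0.49]], [[0.49, 0.51]],
                  [[0.50, 0.50]], [[0.90, 0.10]]])
print(median_ensemble(preds))  # [[0.5 0.5]]
```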
Stage 3: Knowledge distillation to a single model
This published model is the distilled student:
- Same ViTPose architecture as each teacher
- On-the-fly teacher predictions during training: each batch runs all five teachers, their coordinate-wise median becomes the teacher target
- Combined loss: loss_real + 0.5 * loss_teacher
- 100 epochs, AdamW, cosine schedule with 3-epoch warm-up
- Training set: 32,722 labelled + pseudo-labelled samples from the FM_UIA challenge
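The combined objective can be sketched as follows. The card specifies only the 0.5 weighting; the L1 coordinate loss and the masking of samples without real labels are assumptions made for illustration:

```python
import torch

def distillation_loss(pred, real_target, teacher_target, real_mask,
                      teacher_weight=0.5):
    """loss_real + 0.5 * loss_teacher, sketched with an L1 coordinate loss.

    pred, real_target, teacher_target: (batch, n_kp, 2) normalised coords.
    real_mask: (batch,) bool, True where a real label exists.
    """
    # Teacher target = coordinate-wise median of the five teachers.
    loss_teacher = (pred - teacher_target).abs().mean()
    if real_mask.any():
        loss_real = (pred[real_mask] - real_target[real_mask]).abs().mean()
    else:
        loss_real = pred.new_zeros(())
    return loss_real + teacher_weight * loss_teacher
```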
Why distillation? A single model is 5× cheaper to run than the ensemble at inference time. Naive weight averaging (model soup) was destructive at 87 px; the teachers had diverged too far during supervised training with different pseudo-label distributions. Distillation was required.
Performance
Fair validation = original labelled samples only (no pseudo-labels), 15% random split with seed 42.
| model | weighted MRE | unweighted mean | inference cost |
|---|---|---|---|
| Competition baseline (EfficientNet-B4 + FPN) | 67.43 | n/a | 1 forward pass |
| Sonichu distilled (this model) | 7.30 | 18.07 | 2 passes (with TTA) |
| Best single teacher (R3) | 7.63 | 19.54 | 2 passes |
| 4-model median ensemble | 6.81 | 17.29 | 8 passes |
| 5-model median ensemble | 6.30 | 16.40 | 10 passes |
The distilled model trades approximately 1 px of weighted MRE for a 5× inference speedup versus the full ensemble, making it the practical choice for deployment.
Note that the weighted MRE is dominated by AOP (60.7% of fair val samples), so for applications where per-task balance matters, the unweighted mean is more informative.
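The distinction between the two summary numbers is just the choice of weights: weighted MRE weights each task by its fair-validation sample count (hence AOP's dominance), while the unweighted mean averages the nine per-task MREs equally. A sketch with toy counts (not the actual validation counts):

```python
def summarise(per_task_mre, per_task_n):
    """Weighted (by sample count) vs unweighted mean of per-task MREs."""
    total = sum(per_task_n.values())
    weighted = sum(per_task_mre[t] * per_task_n[t] for t in per_task_mre) / total
    unweighted = sum(per_task_mre.values()) / len(per_task_mre)
    return weighted, unweighted

# Toy example: a heavily sampled easy task drags the weighted mean down.
mre = {"AOP": 4.8, "A4C": 63.7}
n = {"AOP": 600, "A4C": 10}
w, u = summarise(mre, n)
print(round(w, 2), round(u, 2))  # 5.77 34.25
```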
Limitations
- A4C is the weakest task (63.7 px). Only 108 real A4C labels exist in the training set. Further improvements require external cardiac data (e.g. EchoNet-Dynamic).
- IVC (n=8 val), PSAX (n=6 val), PLAX (n=16 val) have very small validation counts. Their per-sample MRE has high variance; treat these numbers as trend indicators.
- Weighted vs unweighted: the weighted overall (7.30) is AOP-dominated. Clinical deployment should consider per-task performance.
- Population: training data comes from a specific set of clinical sites and devices. Performance on out-of-distribution populations is untested.
- Static frames only: this model does not use temporal information from ultrasound video sequences.
Citation
If you use this model, please cite:
```bibtex
@misc{sonichu-2026,
  title  = {Sonichu: a distilled IJEPA-based model for multi-task ultrasound biometry},
  author = {von Csefalvay, Chris},
  year   = {2026},
  note   = {ISBI 2026 FM\_UIA challenge submission}
}
```
Key underlying references:
- Assran et al. 2023 β I-JEPA (arXiv:2301.08243)
- Xu et al. 2022 β ViTPose (arXiv:2204.12484)
- Radhachandran et al. 2026 β US-JEPA (arXiv:2602.19322)
- Deng, Tang, Li 2026 β FM_UIA 2026 baseline (arXiv:2602.01055)
- Hinton, Vinyals, Dean 2015 β Distilling the Knowledge in a Neural Network (arXiv:1503.02531)
Contents of this repository
| file | description |
|---|---|
| README.md | this model card |
| config.json | model hyperparameters and task metadata |
| preprocessor_config.json | image preprocessing parameters |
| model.safetensors | model weights (128M params, 511 MB) |
| modeling_sonichu.py | self-contained PyTorch model class and preprocessor |
| inference.py | end-to-end inference example with overlay rendering |
License
Apache 2.0. See LICENSE in the repository.
Training used the FM_UIA 2026 challenge dataset (competition terms of use) and the Multi-centre Fetal Biometry Benchmark Dataset (DOI 10.5522/04/30819911, CC BY-NC-SA 4.0). Downstream users should respect those licenses.