SONICHU-124M: a foundation model of ultrasound biometry
SONICHU-124M (Single One-shot Neural Inference of Coordinates in Human Ultrasound) is a foundation model for 9-task B-mode ultrasound biometry. It achieves 7.30 px weighted mean radial error on the FM_UIA tasks in a single forward pass (two passes with TTA). And it's tiny!
What this model does
Given a 2D B-mode ultrasound image, SONICHU predicts anatomical keypoints for nine biometric measurements. Users specify which task they want at inference time, and the model returns normalised xy coordinates in [0, 1] which can be scaled back to pixel coordinates.
| task | keypoints | anatomy | fair MRE (px, TTA) |
|---|---|---|---|
| AOP | 4 | angle of progression (intrapartum) | 4.8 |
| FUGC | 2 | foetal umbilical cord | 3.9 |
| FA | 4 | foetal abdomen biometry | 7.6 |
| HC | 4 | foetal head circumference | 8.0 |
| IVC | 2 | inferior vena cava | 29.0 |
| PLAX | 22 | cardiac parasternal long-axis | 15.7 |
| PSAX | 4 | cardiac parasternal short-axis | 22.8 |
| A4C | 16 | apical four-chamber view | 63.7 |
| fetal_femur | 2 | foetal femur length | 7.1 |
Cardiac tasks (PLAX, PSAX, A4C, IVC) and fetal_femur have limited real-labelled training data (under 100 samples for the first three); treat those numbers as indicative rather than clinical-grade. The FUGC result (3.9 px) is the best across all models we evaluated.
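The per-task numbers above are mean radial errors (MRE): the Euclidean distance in pixels between each predicted keypoint and its label, averaged over keypoints and samples. A minimal sketch of the metric (the function name is illustrative, not part of this repository):

```python
import numpy as np

def mean_radial_error(pred, gt):
    """Mean Euclidean distance in pixels between predicted and
    ground-truth keypoints.

    pred, gt: arrays of shape (n_samples, n_keypoints, 2) in pixel units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Two samples, two keypoints each; every prediction is off by (3, 4) px.
gt = np.zeros((2, 2, 2))
pred = gt + np.array([3.0, 4.0])
print(mean_radial_error(pred, gt))  # 5.0
```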
Intended use
- Transfer-learning starting point for related ultrasound keypoint tasks
- The IJEPA-pretrained backbone alone is a useful domain-adapted feature extractor (160k ultrasound frames of self-supervised pretraining)
Not for clinical use. This model has not been clinically validated. It must not be used for patient diagnosis or treatment decisions.
Out of scope
- Non-ultrasound imaging modalities (CT, MRI, optical)
- 3D volumes (this is a 2D frame-level model)
Quick start
```python
import cv2
import numpy as np
import torch

from modeling_sonichu import SonichuModel, SonichuPreprocessor

model = SonichuModel.from_pretrained(".")
prep = SonichuPreprocessor.from_pretrained(".")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

img_bgr = cv2.imread("my_ultrasound.png")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
inputs = prep(img_rgb)

kps_norm = model.predict(inputs["pixel_values"].to(device),
                         task="fetal_femur", tta=True)
# kps_norm: (1, n_kp, 2) in [0, 1]
kps_pixel = kps_norm[0].cpu().numpy() * np.array([inputs["orig_w"], inputs["orig_h"]])
print(kps_pixel)
```
A complete inference script with overlay rendering is provided in inference.py:
```shell
python inference.py my_ultrasound.png fetal_femur
```
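The card reports "2 passes with TTA"; `predict(..., tta=True)` handles this internally. A common two-pass scheme for keypoints is horizontal-flip averaging, sketched below purely to illustrate the idea — this is an assumption about the mechanism, not the model's actual implementation, and asymmetric keypoint sets would additionally need a left/right index swap:

```python
import torch

def flip_tta(model_fn, pixel_values, task):
    """Illustrative two-pass horizontal-flip TTA for normalised keypoints.

    model_fn(pixel_values, task) -> (1, n_kp, 2) coordinates in [0, 1].
    """
    kps = model_fn(pixel_values, task)
    # Second pass on the horizontally flipped image.
    kps_flip = model_fn(torch.flip(pixel_values, dims=[-1]), task)
    kps_flip[..., 0] = 1.0 - kps_flip[..., 0]  # mirror x back
    return (kps + kps_flip) / 2
```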
Model architecture
| component | spec |
|---|---|
| backbone | ViT-B/16, 86M params, 768-dim, 12 layers, 12 heads |
| head | ViTPose: 16×16 patch tokens → 2 deconv layers × 256 filters → soft-argmax per keypoint |
| input | 256×256 RGB, ImageNet normalisation, replicate single-channel inputs |
| output | Normalised xy coordinates in [0, 1] per keypoint |
| total params | 128M |
| weights | model.safetensors (511 MB, fp32) |
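The soft-argmax head turns each keypoint's heatmap into differentiable coordinates: a softmax over all spatial positions followed by an expectation over a coordinate grid. A minimal sketch (shapes and grid convention are illustrative):

```python
import torch

def soft_argmax_2d(heatmaps):
    """Differentiable 2D soft-argmax.

    heatmaps: (batch, n_kp, H, W) raw logits.
    Returns: (batch, n_kp, 2) expected xy coordinates in [0, 1].
    """
    b, k, h, w = heatmaps.shape
    # Softmax over all H*W positions of each keypoint's heatmap.
    probs = heatmaps.flatten(2).softmax(dim=-1).view(b, k, h, w)
    ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
    # Expected coordinate under the heatmap distribution.
    x = (probs * xs).sum(dim=(-2, -1))
    y = (probs * ys).sum(dim=(-2, -1))
    return torch.stack([x, y], dim=-1)
```

Unlike a hard argmax, this keeps the coordinate regression end-to-end differentiable.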
How this model was trained
Stage 1: IJEPA self-supervised pretraining
The ViT-B/16 backbone was pretrained using the I-JEPA objective on 160,486 unlabelled ultrasound frames (A4C, HC, FA, AOP views). Representation-space prediction is more robust to speckle noise than pixel-space methods such as MAE, as US-JEPA demonstrates.
Stage 2: Five-teacher ensemble
Five ViTPose models were trained separately from the IJEPA backbone, each with a different pseudo-label regime (R1, R2, R3, Selective/FUGC-capped, r3capped/balanced). The ensemble of their coordinate-wise medians achieved 6.30 px weighted MRE on fair validation.
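The coordinate-wise median is taken independently over each keypoint's x and y across the five teachers, which makes the ensemble robust to a single outlier teacher. A minimal sketch:

```python
import numpy as np

def median_ensemble(teacher_preds):
    """Coordinate-wise median across teachers.

    teacher_preds: (n_teachers, n_kp, 2) normalised keypoints.
    Returns: (n_kp, 2).
    """
    return np.median(teacher_preds, axis=0)

# One outlier teacher barely moves the median.
preds = np.array([[[0.50, 0.50]], [[0.51, 0.49]], [[0.49, 0.51]],
                  [[0.50, 0.50]], [[0.90, 0.10]]])
print(median_ensemble(preds))  # [[0.5 0.5]]
```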
Stage 3: Knowledge distillation to a single model
This published model is the distilled student:
- Same ViTPose architecture as each teacher
- On-the-fly teacher predictions during training: each batch runs all five teachers, their coordinate-wise median becomes the teacher target
- Combined loss: loss_real + 0.5 * loss_teacher
- 100 epochs, AdamW, cosine schedule with 3-epoch warm-up
- Training set: 32,722 labelled + pseudo-labelled samples from the FM_UIA challenge
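The combined objective can be sketched as follows. The card specifies only the 0.5 weighting; the L1 coordinate loss and the masking of samples without real labels are assumptions made for illustration:

```python
import torch

def distillation_loss(pred, real_target, teacher_target, real_mask,
                      teacher_weight=0.5):
    """loss_real + 0.5 * loss_teacher, sketched with an L1 coordinate loss.

    pred, real_target, teacher_target: (batch, n_kp, 2) normalised coords.
    real_mask: (batch,) bool, True where a real label exists.
    """
    # Teacher target = coordinate-wise median of the five teachers.
    loss_teacher = (pred - teacher_target).abs().mean()
    if real_mask.any():
        loss_real = (pred[real_mask] - real_target[real_mask]).abs().mean()
    else:
        loss_real = pred.new_zeros(())
    return loss_real + teacher_weight * loss_teacher
```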
Why distillation? A single model is 5× cheaper to run than the ensemble at inference time. Naive weight averaging (model soup) was destructive at 87 px; the teachers had diverged too far during supervised training with different pseudo-label distributions. Distillation was required.
Performance
Fair validation = original labelled samples only (no pseudo-labels), 15% random split with seed 42.
| model | weighted MRE | unweighted mean | inference cost |
|---|---|---|---|
| Competition baseline (EfficientNet-B4 + FPN) | 67.43 | n/a | 1 forward pass |
| Sonichu distilled (this model) | 7.30 | 18.07 | 2 passes (with TTA) |
| Best single teacher (R3) | 7.63 | 19.54 | 2 passes |
| 4-model median ensemble | 6.81 | 17.29 | 8 passes |
| 5-model median ensemble | 6.30 | 16.40 | 10 passes |
The distilled model trades approximately 1 px of weighted MRE for a 5× inference speedup versus the full ensemble, making it the practical choice for deployment.
Note that the weighted MRE is dominated by AOP (60.7% of fair val samples), so for applications where per-task balance matters, the unweighted mean is more informative.
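The distinction between the two summary numbers is just the choice of weights: weighted MRE weights each task by its fair-validation sample count (hence AOP's dominance), while the unweighted mean averages the nine per-task MREs equally. A sketch with toy counts (not the actual validation counts):

```python
def summarise(per_task_mre, per_task_n):
    """Weighted (by sample count) vs unweighted mean of per-task MREs."""
    total = sum(per_task_n.values())
    weighted = sum(per_task_mre[t] * per_task_n[t] for t in per_task_mre) / total
    unweighted = sum(per_task_mre.values()) / len(per_task_mre)
    return weighted, unweighted

# Toy example: a heavily sampled easy task drags the weighted mean down.
mre = {"AOP": 4.8, "A4C": 63.7}
n = {"AOP": 600, "A4C": 10}
w, u = summarise(mre, n)
print(round(w, 2), round(u, 2))  # 5.77 34.25
```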
Limitations
- A4C is the weakest task (63.7 px). Only 108 real A4C labels exist in the training set. Further improvements require external cardiac data (e.g. EchoNet-Dynamic).
- IVC (n=8 val), PSAX (n=6 val), PLAX (n=16 val) have very small validation counts. Their per-sample MRE has high variance; treat these numbers as trend indicators.
- Weighted vs unweighted: the weighted overall (7.30) is AOP-dominated. Clinical deployment should consider per-task performance.
- Population: training data comes from a specific set of clinical sites and devices. Performance on out-of-distribution populations is untested.
- Static frames only: this model does not use temporal information from ultrasound video sequences.
Citation
If you use this model, please cite:
```bibtex
@misc{sonichu-2026,
  title  = {Sonichu: a distilled IJEPA-based model for multi-task ultrasound biometry},
  author = {von Csefalvay, Chris},
  year   = {2026},
  note   = {ISBI 2026 FM\_UIA challenge submission}
}
```
Key underlying references:
- Assran et al. 2023 β I-JEPA (arXiv:2301.08243)
- Xu et al. 2022 β ViTPose (arXiv:2204.12484)
- Radhachandran et al. 2026 β US-JEPA (arXiv:2602.19322)
- Deng, Tang, Li 2026 β FM_UIA 2026 baseline (arXiv:2602.01055)
- Hinton, Vinyals, Dean 2015 β Distilling the Knowledge in a Neural Network (arXiv:1503.02531)
Contents of this repository
| file | description |
|---|---|
| README.md | this model card |
| config.json | model hyperparameters and task metadata |
| preprocessor_config.json | image preprocessing parameters |
| model.safetensors | model weights (128M params, 511 MB) |
| modeling_sonichu.py | self-contained PyTorch model class and preprocessor |
| inference.py | end-to-end inference example with overlay rendering |
License
Apache 2.0. See LICENSE in the repository.
Training used the FM_UIA 2026 challenge dataset (competition terms of use) and the Multi-centre Fetal Biometry Benchmark Dataset (DOI 10.5522/04/30819911, CC BY-NC-SA 4.0). Downstream users should respect those licenses.