---
license: apache-2.0
language:
- en
library_name: pytorch
thumbnail: opengraph_card.jpeg
tags:
- medical-imaging
- ultrasound
- keypoint-detection
- fetal-biometry
- cardiac-ultrasound
- ijepa
- vitpose
- knowledge-distillation
pipeline_tag: keypoint-detection
datasets:
- FM_UIA_2026
- multicentre-fetal-biometry-2025
model-index:
- name: Sonichu
  results:
  - task:
      type: keypoint-detection
      name: Multi-task ultrasound biometry (9 tasks)
    dataset:
      name: FM_UIA 2026 fair validation (original labels only, 15% split, seed 42)
      type: private
    metrics:
    - name: Weighted MRE (pixels, TTA)
      type: mre
      value: 7.3
    - name: Unweighted per-task mean MRE (pixels, TTA)
      type: mre
      value: 18.07
    - name: FUGC MRE (pixels, TTA)
      type: mre
      value: 3.9
    - name: Foetal femur MRE (pixels, TTA)
      type: mre
      value: 7.1
metrics:
- mre
---

# SONICHU-124M: a foundation model for ultrasound biometry

SONICHU-124M (Single One-shot Neural Inference of Coordinates in Human Ultrasound) is a foundation model for 9-task B-mode ultrasound biometry. It achieves **7.30 px weighted mean radial error** on the FM_UIA tasks in a single forward pass (two passes with TTA). And it's tiny!

## What this model does

Given a 2D B-mode ultrasound image, SONICHU predicts anatomical keypoints for nine biometric measurements. Users specify the desired task at inference time, and the model returns normalised xy coordinates in [0, 1], which can be scaled back to pixel coordinates.
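As a minimal sketch of that coordinate convention (`denormalise` is a hypothetical helper for illustration, not part of the released API):

```python
import numpy as np

# Hypothetical helper illustrating the convention described above: the model
# emits (n_kp, 2) xy keypoints normalised to [0, 1]; multiplying by the
# original frame size recovers pixel coordinates.
def denormalise(kps_norm: np.ndarray, orig_w: int, orig_h: int) -> np.ndarray:
    """Scale (n_kp, 2) normalised xy keypoints back to pixel space."""
    return kps_norm * np.array([orig_w, orig_h], dtype=np.float64)

kps = np.array([[0.25, 0.5], [0.75, 0.1]])
print(denormalise(kps, orig_w=800, orig_h=600))
# the first keypoint maps to x=200, y=300 on an 800x600 frame
```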
| task | keypoints | anatomy | fair MRE (px, TTA) |
|------|-----------|---------|--------------------|
| AOP | 4 | angle of progression (intrapartum) | 4.8 |
| FUGC | 2 | foetal umbilical cord | **3.9** |
| FA | 4 | foetal abdomen biometry | 7.6 |
| HC | 4 | foetal head circumference | 8.0 |
| IVC | 2 | inferior vena cava | 29.0 |
| PLAX | 22 | cardiac parasternal long-axis | 15.7 |
| PSAX | 4 | cardiac parasternal short-axis | 22.8 |
| A4C | 16 | apical four-chamber view | 63.7 |
| fetal_femur | 2 | foetal femur length | 7.1 |

Cardiac tasks (PLAX, PSAX, A4C, IVC) and fetal_femur have limited real-labelled training data (under 100 samples for the first three); treat those numbers as indicative rather than clinical-grade. The FUGC result (3.9 px) is the best across all models we evaluated.

## Intended use

- Transfer-learning starting point for related ultrasound keypoint tasks
- The IJEPA-pretrained backbone alone is a useful domain-adapted feature extractor (160k ultrasound frames of self-supervised pretraining)

**Not for clinical use.** This model has not been clinically validated. It must not be used for patient diagnosis or treatment decisions.
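For reference, the per-task output shapes implied by the table above can be sketched as follows (an illustrative mapping only; the authoritative task metadata ships in `config.json`):

```python
# Illustrative task registry derived from the task table above; the
# authoritative per-task metadata lives in config.json.
TASK_KEYPOINTS = {
    "AOP": 4, "FUGC": 2, "FA": 4, "HC": 4, "IVC": 2,
    "PLAX": 22, "PSAX": 4, "A4C": 16, "fetal_femur": 2,
}

def output_shape(task: str, batch: int = 1) -> tuple[int, int, int]:
    """Shape of the (batch, n_kp, 2) coordinate array returned for a task."""
    return (batch, TASK_KEYPOINTS[task], 2)

print(output_shape("fetal_femur"))  # (1, 2, 2)
```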
## Out of scope

- Non-ultrasound imaging modalities (CT, MRI, optical)
- 3D volumes (this is a 2D frame-level model)

## Quick start

```python
import cv2
import torch
import numpy as np

from modeling_sonichu import SonichuModel, SonichuPreprocessor

model = SonichuModel.from_pretrained(".")
prep = SonichuPreprocessor.from_pretrained(".")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

img_bgr = cv2.imread("my_ultrasound.png")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

inputs = prep(img_rgb)
kps_norm = model.predict(inputs["pixel_values"].to(device), task="fetal_femur", tta=True)

# kps_norm: (1, n_kp, 2) in [0, 1]
kps_pixel = kps_norm[0].cpu().numpy() * np.array([inputs["orig_w"], inputs["orig_h"]])
print(kps_pixel)
```

A complete inference script with overlay rendering is provided in `inference.py`:

```bash
python inference.py my_ultrasound.png fetal_femur
```

## Model architecture

| component | spec |
|-----------|------|
| backbone | ViT-B/16, 86M params, 768-dim, 12 layers, 12 heads |
| head | ViTPose: 16×16 patch tokens → 2 deconv layers × 256 filters → soft-argmax per keypoint |
| input | 256×256 RGB, ImageNet normalisation, replicate single-channel inputs |
| output | normalised xy coordinates in [0, 1] per keypoint |
| total params | 128M |
| weights | `model.safetensors` (511 MB, fp32) |

## How this model was trained

### Stage 1: IJEPA self-supervised pretraining

The ViT-B/16 backbone was pretrained using the [I-JEPA](https://arxiv.org/abs/2301.08243) objective on 160,486 unlabelled ultrasound frames (A4C, HC, FA, AOP views). Representation-space prediction is more robust to speckle noise than pixel-space methods such as MAE, as [US-JEPA](https://arxiv.org/abs/2602.19322) demonstrates.

### Stage 2: Five-teacher ensemble

Five ViTPose models were trained separately from the IJEPA backbone, each with a different pseudo-label regime (R1, R2, R3, Selective/FUGC-capped, r3capped/balanced).
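The coordinate-wise median used to combine the five teachers can be sketched in a few lines of numpy (a minimal illustration, not the training code):

```python
import numpy as np

# Sketch of coordinate-wise median aggregation over teacher predictions.
# preds has shape (n_teachers, n_kp, 2); the median is taken independently
# per keypoint coordinate, so a single outlier teacher barely moves the result.
preds = np.array([
    [[0.40, 0.50]],  # teacher 1, a single keypoint
    [[0.42, 0.52]],
    [[0.41, 0.49]],
    [[0.90, 0.10]],  # an outlier teacher
    [[0.39, 0.51]],
])
ensemble = np.median(preds, axis=0)
print(ensemble)  # x median 0.41, y median 0.50
```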
The ensemble of their coordinate-wise medians achieved 6.30 px weighted MRE on fair validation.

### Stage 3: Knowledge distillation to a single model

This published model is the distilled student:

- Same ViTPose architecture as each teacher
- On-the-fly teacher predictions during training: each batch runs all five teachers, and their coordinate-wise median becomes the teacher target
- Combined loss: `loss_real + 0.5 * loss_teacher`
- 100 epochs, AdamW, cosine schedule with 3-epoch warm-up
- Training set: 32,722 labelled + pseudo-labelled samples from the FM_UIA challenge

**Why distillation?** A single model is 5× cheaper to run than the ensemble at inference time. Naive weight averaging (model soup) was destructive at 87 px: the teachers had diverged too far during supervised training on different pseudo-label distributions. Distillation was required.

## Performance

Fair validation = original labelled samples only (no pseudo-labels), 15% random split with seed 42.

| model | weighted MRE | unweighted mean | inference cost |
|-------|--------------|-----------------|----------------|
| Competition baseline (EfficientNet-B4 + FPN) | 67.43 | — | 1 forward pass |
| **Sonichu distilled (this model)** | **7.30** | **18.07** | **2 passes (with TTA)** |
| Best single teacher (R3) | 7.63 | 19.54 | 2 passes |
| 4-model median ensemble | 6.81 | 17.29 | 8 passes |
| 5-model median ensemble | 6.30 | 16.40 | 10 passes |

The distilled model trades approximately 1 px of weighted MRE for a 5× inference speedup versus the full ensemble, making it the practical choice for deployment. Note that the weighted MRE is dominated by AOP (60.7% of fair validation samples), so where per-task balance matters, the unweighted mean is more informative.

## Limitations

1. **A4C is the weakest task** (63.7 px). Only 108 real A4C labels exist in the training set. Further improvements require external cardiac data (e.g. EchoNet-Dynamic).
2. **IVC (n=8 val), PSAX (n=6 val), PLAX (n=16 val)** have very small validation counts. Their per-sample MRE has high variance; treat these numbers as trend indicators.
3. **Weighted vs unweighted**: the weighted overall score (7.30) is AOP-dominated. Clinical deployment should consider per-task performance.
4. **Population**: training data comes from a specific set of clinical sites and devices. Performance on out-of-distribution populations is untested.
5. **Static frames only**: this model does not use temporal information from ultrasound video sequences.

## Citation

If you use this model, please cite:

```bibtex
@misc{sonichu-2026,
  title  = {Sonichu: a distilled IJEPA-based model for multi-task ultrasound biometry},
  author = {von Csefalvay, Chris},
  year   = {2026},
  note   = {ISBI 2026 FM\_UIA challenge submission}
}
```

Key underlying references:

- Assran et al. 2023 — I-JEPA (arXiv:2301.08243)
- Xu et al. 2022 — ViTPose (arXiv:2204.12484)
- Radhachandran et al. 2026 — US-JEPA (arXiv:2602.19322)
- Deng, Tang, Li 2026 — FM_UIA 2026 baseline (arXiv:2602.01055)
- Hinton, Vinyals, Dean 2015 — Distilling the Knowledge in a Neural Network (arXiv:1503.02531)

## Contents of this repository

| file | description |
|------|-------------|
| `README.md` | this model card |
| `config.json` | model hyperparameters and task metadata |
| `preprocessor_config.json` | image preprocessing parameters |
| `model.safetensors` | model weights (128M params, 511 MB) |
| `modeling_sonichu.py` | self-contained PyTorch model class and preprocessor |
| `inference.py` | end-to-end inference example with overlay rendering |

## License

Apache 2.0. See LICENSE in the repository. Training used the FM_UIA 2026 challenge dataset (competition terms of use) and the Multi-centre Fetal Biometry Benchmark Dataset ([DOI 10.5522/04/30819911](https://doi.org/10.5522/04/30819911), CC BY-NC-SA 4.0). Downstream users should respect those licences.