---
license: apache-2.0
language:
- en
library_name: pytorch
thumbnail: opengraph_card.jpeg
tags:
- medical-imaging
- ultrasound
- keypoint-detection
- fetal-biometry
- cardiac-ultrasound
- ijepa
- vitpose
- knowledge-distillation
pipeline_tag: keypoint-detection
datasets:
- FM_UIA_2026
- multicentre-fetal-biometry-2025
model-index:
- name: Sonichu
  results:
  - task:
      type: keypoint-detection
      name: Multi-task ultrasound biometry (9 tasks)
    dataset:
      name: FM_UIA 2026 fair validation (original labels only, 15% split, seed 42)
      type: private
    metrics:
    - name: Weighted MRE (pixels, TTA)
      type: mre
      value: 7.3
    - name: Unweighted per-task mean MRE (pixels, TTA)
      type: mre
      value: 18.07
    - name: FUGC MRE (pixels, TTA)
      type: mre
      value: 3.9
    - name: Foetal femur MRE (pixels, TTA)
      type: mre
      value: 7.1
metrics:
- mre
---

# SONICHU-124M: a foundation model for ultrasound biometry

SONICHU-124M (Single One-shot Neural Inference of Coordinates in Human Ultrasound) is a foundation model for 9-task B-mode ultrasound biometry. It achieves **7.30 px weighted mean radial error** on the FM_UIA tasks in a single forward pass (two passes with TTA). And it's tiny!

## What this model does

Given a 2D B-mode ultrasound image, SONICHU predicts anatomical keypoints for nine biometric measurements. Users specify the desired task at inference time, and the model returns normalised xy coordinates in [0, 1], which can be scaled back to pixel coordinates.
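As a minimal sketch of that coordinate convention (`denormalise` is a hypothetical helper for illustration, not part of the released API):

```python
import numpy as np

# Hypothetical helper illustrating the convention described above: the model
# emits (n_kp, 2) xy keypoints normalised to [0, 1]; multiplying by the
# original frame size recovers pixel coordinates.
def denormalise(kps_norm: np.ndarray, orig_w: int, orig_h: int) -> np.ndarray:
    """Scale (n_kp, 2) normalised xy keypoints back to pixel space."""
    return kps_norm * np.array([orig_w, orig_h], dtype=np.float64)

kps = np.array([[0.25, 0.5], [0.75, 0.1]])
print(denormalise(kps, orig_w=800, orig_h=600))
# the first keypoint maps to x=200, y=300 on an 800x600 frame
```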
| task | keypoints | anatomy | fair MRE (px, TTA) |
|------|-----------|---------|--------------------|
| AOP | 4 | angle of progression (intrapartum) | 4.8 |
| FUGC | 2 | foetal umbilical cord | **3.9** |
| FA | 4 | foetal abdomen biometry | 7.6 |
| HC | 4 | foetal head circumference | 8.0 |
| IVC | 2 | inferior vena cava | 29.0 |
| PLAX | 22 | cardiac parasternal long-axis | 15.7 |
| PSAX | 4 | cardiac parasternal short-axis | 22.8 |
| A4C | 16 | apical four-chamber view | 63.7 |
| fetal_femur | 2 | foetal femur length | 7.1 |

Cardiac tasks (PLAX, PSAX, A4C, IVC) and fetal_femur have limited real-labelled training data (under 100 samples for the first three); treat those numbers as indicative rather than clinical-grade. The FUGC result (3.9 px) is the best across all models we evaluated.

## Intended use

- Transfer-learning starting point for related ultrasound keypoint tasks
- The IJEPA-pretrained backbone alone is a useful domain-adapted feature extractor (160k ultrasound frames of self-supervised pretraining)

**Not for clinical use.** This model has not been clinically validated. It must not be used for patient diagnosis or treatment decisions.
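For reference, the per-task output shapes implied by the table above can be sketched as follows (an illustrative mapping only; the authoritative task metadata ships in `config.json`):

```python
# Illustrative task registry derived from the task table above; the
# authoritative per-task metadata lives in config.json.
TASK_KEYPOINTS = {
    "AOP": 4, "FUGC": 2, "FA": 4, "HC": 4, "IVC": 2,
    "PLAX": 22, "PSAX": 4, "A4C": 16, "fetal_femur": 2,
}

def output_shape(task: str, batch: int = 1) -> tuple[int, int, int]:
    """Shape of the (batch, n_kp, 2) coordinate array returned for a task."""
    return (batch, TASK_KEYPOINTS[task], 2)

print(output_shape("fetal_femur"))  # (1, 2, 2)
```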
## Out of scope

- Non-ultrasound imaging modalities (CT, MRI, optical)
- 3D volumes (this is a 2D frame-level model)

## Quick start

```python
import cv2
import torch
import numpy as np

from modeling_sonichu import SonichuModel, SonichuPreprocessor

model = SonichuModel.from_pretrained(".")
prep = SonichuPreprocessor.from_pretrained(".")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

img_bgr = cv2.imread("my_ultrasound.png")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

inputs = prep(img_rgb)
kps_norm = model.predict(inputs["pixel_values"].to(device), task="fetal_femur", tta=True)

# kps_norm: (1, n_kp, 2) in [0, 1]
kps_pixel = kps_norm[0].cpu().numpy() * np.array([inputs["orig_w"], inputs["orig_h"]])
print(kps_pixel)
```

A complete inference script with overlay rendering is provided in `inference.py`:

```bash
python inference.py my_ultrasound.png fetal_femur
```

## Model architecture

| component | spec |
|-----------|------|
| backbone | ViT-B/16, 86M params, 768-dim, 12 layers, 12 heads |
| head | ViTPose: 16×16 patch tokens → 2 deconv layers × 256 filters → soft-argmax per keypoint |
| input | 256×256 RGB, ImageNet normalisation, replicate single-channel inputs |
| output | normalised xy coordinates in [0, 1] per keypoint |
| total params | 128M |
| weights | `model.safetensors` (511 MB, fp32) |

## How this model was trained

### Stage 1: IJEPA self-supervised pretraining

The ViT-B/16 backbone was pretrained using the [I-JEPA](https://arxiv.org/abs/2301.08243) objective on 160,486 unlabelled ultrasound frames (A4C, HC, FA, AOP views). Representation-space prediction is more robust to speckle noise than pixel-space methods such as MAE, as [US-JEPA](https://arxiv.org/abs/2602.19322) demonstrates.

### Stage 2: Five-teacher ensemble

Five ViTPose models were trained separately from the IJEPA backbone, each with a different pseudo-label regime (R1, R2, R3, Selective/FUGC-capped, r3capped/balanced).
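The coordinate-wise median used to combine the five teachers can be sketched in a few lines of numpy (a minimal illustration, not the training code):

```python
import numpy as np

# Sketch of coordinate-wise median aggregation over teacher predictions.
# preds has shape (n_teachers, n_kp, 2); the median is taken independently
# per keypoint coordinate, so a single outlier teacher barely moves the result.
preds = np.array([
    [[0.40, 0.50]],  # teacher 1, a single keypoint
    [[0.42, 0.52]],
    [[0.41, 0.49]],
    [[0.90, 0.10]],  # an outlier teacher
    [[0.39, 0.51]],
])
ensemble = np.median(preds, axis=0)
print(ensemble)  # x median 0.41, y median 0.50
```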
The ensemble of their coordinate-wise medians achieved 6.30 px weighted MRE on fair validation.

### Stage 3: Knowledge distillation to a single model

This published model is the distilled student:

- Same ViTPose architecture as each teacher
- On-the-fly teacher predictions during training: each batch runs all five teachers, and their coordinate-wise median becomes the teacher target
- Combined loss: `loss_real + 0.5 * loss_teacher`
- 100 epochs, AdamW, cosine schedule with 3-epoch warm-up
- Training set: 32,722 labelled + pseudo-labelled samples from the FM_UIA challenge

**Why distillation?** A single model is 5× cheaper to run than the ensemble at inference time. Naive weight averaging (model soup) was destructive at 87 px: the teachers had diverged too far during supervised training on different pseudo-label distributions. Distillation was required.

## Performance

Fair validation = original labelled samples only (no pseudo-labels), 15% random split with seed 42.

| model | weighted MRE | unweighted mean | inference cost |
|-------|--------------|-----------------|----------------|
| Competition baseline (EfficientNet-B4 + FPN) | 67.43 | — | 1 forward pass |
| **Sonichu distilled (this model)** | **7.30** | **18.07** | **2 passes (with TTA)** |
| Best single teacher (R3) | 7.63 | 19.54 | 2 passes |
| 4-model median ensemble | 6.81 | 17.29 | 8 passes |
| 5-model median ensemble | 6.30 | 16.40 | 10 passes |

The distilled model trades approximately 1 px of weighted MRE for a 5× inference speedup versus the full ensemble, making it the practical choice for deployment. Note that the weighted MRE is dominated by AOP (60.7% of fair validation samples), so where per-task balance matters, the unweighted mean is more informative.

## Limitations

1. **A4C is the weakest task** (63.7 px). Only 108 real A4C labels exist in the training set. Further improvements require external cardiac data (e.g. EchoNet-Dynamic).
2. **IVC (n=8 val), PSAX (n=6 val), PLAX (n=16 val)** have very small validation counts. Their per-sample MRE has high variance; treat these numbers as trend indicators.
3. **Weighted vs unweighted**: the weighted overall score (7.30) is AOP-dominated. Clinical deployment should consider per-task performance.
4. **Population**: training data comes from a specific set of clinical sites and devices. Performance on out-of-distribution populations is untested.
5. **Static frames only**: this model does not use temporal information from ultrasound video sequences.

## Citation

If you use this model, please cite:

```bibtex
@misc{sonichu-2026,
  title  = {Sonichu: a distilled IJEPA-based model for multi-task ultrasound biometry},
  author = {von Csefalvay, Chris},
  year   = {2026},
  note   = {ISBI 2026 FM\_UIA challenge submission}
}
```

Key underlying references:

- Assran et al. 2023 — I-JEPA (arXiv:2301.08243)
- Xu et al. 2022 — ViTPose (arXiv:2204.12484)
- Radhachandran et al. 2026 — US-JEPA (arXiv:2602.19322)
- Deng, Tang, Li 2026 — FM_UIA 2026 baseline (arXiv:2602.01055)
- Hinton, Vinyals, Dean 2015 — Distilling the Knowledge in a Neural Network (arXiv:1503.02531)

## Contents of this repository

| file | description |
|------|-------------|
| `README.md` | this model card |
| `config.json` | model hyperparameters and task metadata |
| `preprocessor_config.json` | image preprocessing parameters |
| `model.safetensors` | model weights (128M params, 511 MB) |
| `modeling_sonichu.py` | self-contained PyTorch model class and preprocessor |
| `inference.py` | end-to-end inference example with overlay rendering |

## License

Apache 2.0. See LICENSE in the repository. Training used the FM_UIA 2026 challenge dataset (competition terms of use) and the Multi-centre Fetal Biometry Benchmark Dataset ([DOI 10.5522/04/30819911](https://doi.org/10.5522/04/30819911), CC BY-NC-SA 4.0). Downstream users should respect those licences.