---
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- embeddings
- vision-transformer
- distillation
- canon
- maeb
- mteb
---

# HeAR-s1.1 model card

**Model documentation:** HeAR (Google Health Acoustic Representations)

## Model information

This package contains a distilled HeAR student model implemented in PyTorch with a ViT-S backbone and Canon layers.

### Description

This export is the `HeAR-s1.1` follow-on package to the original `HeAR-s` upload. It uses only the `student` head from the MAEB evaluation checkpoint and keeps the same custom Transformers packaging style as the prior release.

- Backbone: ViT-S (`vit_small_patch16_224`)
- Input: single-channel mel+PCEN spectrograms (`[B, 1, 192, 128]`) generated from 2-second audio clips at 16 kHz
- Canon setup: A/B/C/D enabled, 2D Canon, kernel size 4, positional encodings disabled
- Output embedding: `pooler_output` with shape `[B, 384]`
- Exported weights: `student` head only, no projection layer

### MAEB (beta, audio-only) evaluation

This checkpoint was evaluated on `MAEB(beta, audio-only)` with:

- embedding head: `student`
- window size: `2.0s`
- window hop: `2.0s`
- pooling over windows: mean
- benchmark mean: `0.42442`

Suggested Hub repo id: `matthewagi/HeAR-s1.1`

## Files in this package

- `config.json`: model config and `auto_map`
- `configuration_hear_canon.py`: custom `PretrainedConfig`
- `modeling_hear_canon.py`: custom `PreTrainedModel` with integrated audio preprocessing
- `model.safetensors`: distilled student weights
- `preprocessor_config.json`: preprocessing metadata
- `model_shapes.json`: structure and tensor shape inventory
- `training_args.json`: training/checkpoint args captured from the source checkpoint
- `maeb_summary.json`: saved MAEB task summary for this checkpoint
- `maeb_run_config.json`: saved MAEB inference settings
for that benchmark run
- `.gitattributes`: git/LFS attributes for model artifacts
- `smoke_test.py`: local verification script

## How to use

Install dependencies:

```bash
pip install -U "transformers>=4.50.0" timm torch scipy soundfile
```

Run the local smoke test:

```bash
python3 trained_model_hf_upload_maeb/smoke_test.py
```

### Inference from raw audio waveform

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "trained_model_hf_upload_maeb",
    trust_remote_code=True,
)
model.eval()

# 4 clips, each 2 seconds at 16 kHz => 32000 samples
raw_audio_batch = torch.rand((4, 32000), dtype=torch.float32)

with torch.inference_mode():
    out = model(input_values=raw_audio_batch, return_dict=True)

embeddings = out.pooler_output
print(embeddings.shape)  # torch.Size([4, 384])
```

### Inference from `.wav` file

```python
import torch
import soundfile as sf
from scipy import signal
from transformers import AutoModel


def load_wav_mono_16k(path: str, target_sr: int = 16000) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix to mono
    if sr != target_sr:
        new_len = int(round(audio.shape[0] * (target_sr / sr)))
        audio = signal.resample(audio, new_len)
    return torch.from_numpy(audio).float()


model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

waveform = load_wav_mono_16k("example.wav")

with torch.inference_mode():
    embedding = model.embed_audio(waveform)

print(embedding.shape)  # torch.Size([1, 384])
```

### Inference from preprocessed spectrograms

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

raw_audio = torch.rand((2, 32000), dtype=torch.float32)
spectrogram = model.preprocess_audio(raw_audio)

with torch.inference_mode():
    out = model(pixel_values=spectrogram, return_dict=True)

print(spectrogram.shape)  # torch.Size([2, 1, 192, 128])
print(out.pooler_output.shape)  # torch.Size([2, 384])
```

## Model architecture overview

- Student model parameters: `22,140,288`
- Embedding dimension: `384`
- Input shape: `[B, 1, 192, 128]`
- Output shape: `[B, 384]`

Detailed tensor shapes are provided in `model_shapes.json`.
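For recordings longer than 2 seconds, the MAEB run described above sliced audio into 2.0 s windows with a 2.0 s hop and mean-pooled the per-window embeddings. The sketch below illustrates that windowing and pooling scheme; the `window_waveform` helper is illustrative (not part of this package), and a random tensor stands in for the per-window embeddings, which in practice would come from `model(input_values=windows, return_dict=True).pooler_output`:

```python
import math

import torch
import torch.nn.functional as F


def window_waveform(waveform: torch.Tensor, sr: int = 16000,
                    window_s: float = 2.0, hop_s: float = 2.0) -> torch.Tensor:
    """Slice a mono waveform [T] into windows [N, window_samples],
    zero-padding so trailing audio shorter than one window is kept."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    n = math.ceil(max(waveform.numel() - win, 0) / hop) + 1
    total = win + (n - 1) * hop
    waveform = F.pad(waveform, (0, total - waveform.numel()))
    return waveform.unfold(0, win, hop)


# 5 seconds of audio at 16 kHz -> three 2-second windows (last one zero-padded)
audio = torch.rand(5 * 16000)
windows = window_waveform(audio)
print(windows.shape)  # torch.Size([3, 32000])

# Stand-in for per-window embeddings; the real call would be
# model(input_values=windows, return_dict=True).pooler_output
per_window = torch.rand(windows.shape[0], 384)
clip_embedding = per_window.mean(dim=0)  # mean pooling over windows
print(clip_embedding.shape)  # torch.Size([384])
```

With a 2.0 s hop equal to the 2.0 s window, windows are non-overlapping; a smaller hop would trade compute for smoother coverage of window boundaries.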