# HeAR-s1.1 model card

Model documentation: HeAR (Google Health Acoustic Representations)

## Model information
This package contains a distilled HeAR student model implemented in PyTorch with a ViT-S backbone and Canon layers.
### Description

This export is the HeAR-s1.1 follow-on package to the original HeAR-s upload. It contains only the student head from the MAEB evaluation checkpoint and keeps the same custom Transformers packaging style as the prior release.
- Backbone: ViT-S (`vit_small_patch16_224`)
- Input: single-channel mel+PCEN spectrograms (`[B, 1, 192, 128]`) generated from 2-second audio clips at 16 kHz
- Canon setup: A/B/C/D enabled, 2D Canon, kernel size 4, positional encodings disabled
- Output embedding: `pooler_output` with shape `[B, 384]`
- Exported weights: `student` head only, no projection layer
### MAEB (beta, audio-only) evaluation

This checkpoint was evaluated on MAEB (beta, audio-only) with:

- embedding head: `student`
- window size: `2.0s`
- window hop: `2.0s`
- pooling over windows: mean
- benchmark mean: `0.42442`
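The 2.0 s window / 2.0 s hop / mean-pooling scheme above can be sketched in plain PyTorch. Here `embed_fn` is a hypothetical stand-in for the model's per-window embedding call (with the real checkpoint you would use `model.embed_audio`, as shown in the usage examples):

```python
import torch

def window_and_pool(waveform: torch.Tensor, embed_fn,
                    sr: int = 16000, window_s: float = 2.0,
                    hop_s: float = 2.0) -> torch.Tensor:
    """Split a mono waveform into fixed windows, embed each, mean-pool."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    n = (waveform.shape[-1] - win) // hop + 1  # drop any trailing partial window
    windows = torch.stack([waveform[i * hop : i * hop + win] for i in range(n)])
    per_window = embed_fn(windows)   # [n, 384] with the real model
    return per_window.mean(dim=0)    # [384]

# stand-in embedder for illustration only; it just truncates flattened windows
fake_embed = lambda x: x.reshape(x.shape[0], -1)[:, :384]
pooled = window_and_pool(torch.rand(5 * 16000), fake_embed)
print(pooled.shape)  # torch.Size([384])
```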
Suggested Hub repo id: `matthewagi/HeAR-s1.1`
## Files in this package

- `config.json`: model config and `auto_map` configuration
- `configuration_hear_canon.py`: custom `PretrainedConfig`
- `modeling_hear_canon.py`: custom `PreTrainedModel` with integrated audio preprocessing
- `model.safetensors`: distilled student weights
- `preprocessor_config.json`: preprocessing metadata
- `model_shapes.json`: structure and tensor shape inventory
- `training_args.json`: training/checkpoint args captured from the source checkpoint
- `maeb_summary.json`: saved MAEB task summary for this checkpoint
- `maeb_run_config.json`: saved MAEB inference settings for that benchmark run
- `.gitattributes`: git/LFS attributes for model artifacts
- `smoke_test.py`: local verification script
## How to use

Install dependencies:

```shell
pip install -U "transformers>=4.50.0" timm torch scipy soundfile
```

Run the local smoke test:

```shell
python3 trained_model_hf_upload_maeb/smoke_test.py
```
### Inference from raw audio waveform

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "trained_model_hf_upload_maeb",
    trust_remote_code=True,
)
model.eval()

# 4 clips, each 2 seconds at 16 kHz => 32000 samples
raw_audio_batch = torch.rand((4, 32000), dtype=torch.float32)

with torch.inference_mode():
    out = model(input_values=raw_audio_batch, return_dict=True)

embeddings = out.pooler_output
print(embeddings.shape)  # torch.Size([4, 384])
```
### Inference from a .wav file

```python
import torch
import soundfile as sf
from scipy import signal
from transformers import AutoModel

def load_wav_mono_16k(path: str, target_sr: int = 16000) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
    if sr != target_sr:
        new_len = int(round(audio.shape[0] * (target_sr / sr)))
        audio = signal.resample(audio, new_len)
    return torch.from_numpy(audio).float()

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

waveform = load_wav_mono_16k("example.wav")
with torch.inference_mode():
    embedding = model.embed_audio(waveform)
print(embedding.shape)  # torch.Size([1, 384])
```
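The loader above uses FFT-based `scipy.signal.resample`; for rational rate ratios, polyphase filtering via `scipy.signal.resample_poly` is a common alternative that avoids FFT edge artifacts. A minimal sketch, separate from this package:

```python
import numpy as np
from scipy import signal

# 1 second of audio at 44.1 kHz, resampled to 16 kHz
audio = np.random.rand(44100).astype(np.float32)
resampled = signal.resample_poly(audio, up=16000, down=44100)
print(resampled.shape)  # (16000,)
```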
### Inference from preprocessed spectrograms

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

raw_audio = torch.rand((2, 32000), dtype=torch.float32)
spectrogram = model.preprocess_audio(raw_audio)

with torch.inference_mode():
    out = model(pixel_values=spectrogram, return_dict=True)

print(spectrogram.shape)        # torch.Size([2, 1, 192, 128])
print(out.pooler_output.shape)  # torch.Size([2, 384])
```
## Model architecture overview

- Student model parameters: 22,140,288
- Embedding dimension: 384
- Input shape: `[B, 1, 192, 128]`
- Output shape: `[B, 384]`

Detailed tensor shapes are provided in `model_shapes.json`.
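The parameter count above can be checked with a generic helper, shown here on a toy module (the real model would be loaded with `AutoModel` as in the examples above):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # total number of parameters, trainable or not
    return sum(p.numel() for p in model.parameters())

# toy module for illustration; apply to the loaded HeAR model the same way
toy = nn.Linear(384, 384)
print(count_params(toy))  # 147840 (384*384 weights + 384 biases)
```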