# HeAR-s1.1 model card

Model documentation: HeAR (Google Health Acoustic Representations)

## Model information
This package contains a distilled HeAR student model implemented in PyTorch with a ViT-S backbone and Canon layers.
### Description

This export is the HeAR-s1.1 follow-on package to the original HeAR-s upload. It contains only the student head from the MAEB evaluation checkpoint and keeps the same custom Transformers packaging style as the prior release.
- Backbone: ViT-S (`vit_small_patch16_224`)
- Input: single-channel mel+PCEN spectrograms (`[B, 1, 192, 128]`) generated from 2-second audio clips at 16 kHz
- Canon setup: A/B/C/D enabled, 2D Canon, kernel size 4, positional encodings disabled
- Output embedding: `pooler_output` with shape `[B, 384]`
- Exported weights: `student` head only, no projection layer
### MAEB (beta, audio-only) evaluation

This checkpoint was evaluated on MAEB (beta, audio-only) with:

- embedding head: `student`
- window size: `2.0s`
- window hop: `2.0s`
- pooling over windows: mean
- benchmark mean: `0.42442`
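The 2.0 s window / 2.0 s hop / mean-pooling scheme above can be sketched in plain PyTorch. Here `embed_fn` is a hypothetical stand-in for the model's per-window embedding call (with the real checkpoint you would use `model.embed_audio`, as shown in the usage examples):

```python
import torch

def window_and_pool(waveform: torch.Tensor, embed_fn,
                    sr: int = 16000, window_s: float = 2.0,
                    hop_s: float = 2.0) -> torch.Tensor:
    """Split a mono waveform into fixed windows, embed each, mean-pool."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    n = (waveform.shape[-1] - win) // hop + 1  # drop any trailing partial window
    windows = torch.stack([waveform[i * hop : i * hop + win] for i in range(n)])
    per_window = embed_fn(windows)   # [n, 384] with the real model
    return per_window.mean(dim=0)    # [384]

# stand-in embedder for illustration only; it just truncates flattened windows
fake_embed = lambda x: x.reshape(x.shape[0], -1)[:, :384]
pooled = window_and_pool(torch.rand(5 * 16000), fake_embed)
print(pooled.shape)  # torch.Size([384])
```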
Suggested Hub repo id: `matthewagi/HeAR-s1.1`
## Files in this package

- `config.json`: model config and `auto_map` configuration
- `configuration_hear_canon.py`: custom `PretrainedConfig`
- `modeling_hear_canon.py`: custom `PreTrainedModel` with integrated audio preprocessing
- `model.safetensors`: distilled student weights
- `preprocessor_config.json`: preprocessing metadata
- `model_shapes.json`: structure and tensor shape inventory
- `training_args.json`: training/checkpoint args captured from the source checkpoint
- `maeb_summary.json`: saved MAEB task summary for this checkpoint
- `maeb_run_config.json`: saved MAEB inference settings for that benchmark run
- `.gitattributes`: git/LFS attributes for model artifacts
- `smoke_test.py`: local verification script
## How to use

Install dependencies:

```shell
pip install -U "transformers>=4.50.0" timm torch scipy soundfile
```

Run the local smoke test:

```shell
python3 trained_model_hf_upload_maeb/smoke_test.py
```
### Inference from raw audio waveform

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "trained_model_hf_upload_maeb",
    trust_remote_code=True,
)
model.eval()

# 4 clips, each 2 seconds at 16 kHz => 32000 samples
raw_audio_batch = torch.rand((4, 32000), dtype=torch.float32)

with torch.inference_mode():
    out = model(input_values=raw_audio_batch, return_dict=True)

embeddings = out.pooler_output
print(embeddings.shape)  # torch.Size([4, 384])
```
### Inference from a .wav file

```python
import torch
import soundfile as sf
from scipy import signal
from transformers import AutoModel

def load_wav_mono_16k(path: str, target_sr: int = 16000) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
    if sr != target_sr:
        new_len = int(round(audio.shape[0] * (target_sr / sr)))
        audio = signal.resample(audio, new_len)
    return torch.from_numpy(audio).float()

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

waveform = load_wav_mono_16k("example.wav")
with torch.inference_mode():
    embedding = model.embed_audio(waveform)
print(embedding.shape)  # torch.Size([1, 384])
```
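The loader above uses FFT-based `scipy.signal.resample`; for rational rate ratios, polyphase filtering via `scipy.signal.resample_poly` is a common alternative that avoids FFT edge artifacts. A minimal sketch, separate from this package:

```python
import numpy as np
from scipy import signal

# 1 second of audio at 44.1 kHz, resampled to 16 kHz
audio = np.random.rand(44100).astype(np.float32)
resampled = signal.resample_poly(audio, up=16000, down=44100)
print(resampled.shape)  # (16000,)
```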
### Inference from preprocessed spectrograms

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

raw_audio = torch.rand((2, 32000), dtype=torch.float32)
spectrogram = model.preprocess_audio(raw_audio)

with torch.inference_mode():
    out = model(pixel_values=spectrogram, return_dict=True)

print(spectrogram.shape)        # torch.Size([2, 1, 192, 128])
print(out.pooler_output.shape)  # torch.Size([2, 384])
```
## Model architecture overview

- Student model parameters: 22,140,288
- Embedding dimension: 384
- Input shape: `[B, 1, 192, 128]`
- Output shape: `[B, 384]`

Detailed tensor shapes are provided in `model_shapes.json`.
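The parameter count above can be checked with a generic helper, shown here on a toy module (the real model would be loaded with `AutoModel` as in the examples above):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # total number of parameters, trainable or not
    return sum(p.numel() for p in model.parameters())

# toy module for illustration; apply to the loaded HeAR model the same way
toy = nn.Linear(384, 384)
print(count_params(toy))  # 147840 (384*384 weights + 384 biases)
```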