---
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- embeddings
- vision-transformer
- distillation
- canon
- maeb
- mteb
---

# HeAR-s1.1 model card

**Model documentation:** HeAR (Google Health Acoustic Representations)

## Model information

This package contains a distilled HeAR student model implemented in PyTorch with a ViT-S backbone and Canon layers.

### Description

This export is the `HeAR-s1.1` follow-on package to the original `HeAR-s` upload. It uses only the `student` head from the MAEB evaluation checkpoint and keeps the same custom Transformers packaging style as the prior release.

- Backbone: ViT-S (`vit_small_patch16_224`)
- Input: single-channel mel+PCEN spectrograms (`[B, 1, 192, 128]`) generated from 2-second audio clips at 16 kHz
- Canon setup: A/B/C/D enabled, 2D Canon, kernel size 4, positional encodings disabled
- Output embedding: `pooler_output` with shape `[B, 384]`
- Exported weights: `student` head only, no projection layer

### MAEB (beta, audio-only) evaluation

This checkpoint was evaluated on `MAEB(beta, audio-only)` with:

- embedding head: `student`
- window size: `2.0s`
- window hop: `2.0s`
- pooling over windows: mean
- benchmark mean: `0.42442`

Suggested Hub repo id: `matthewagi/HeAR-s1.1`

## Files in this package

- `config.json`: model config and `auto_map`
- `configuration_hear_canon.py`: custom `PretrainedConfig`
- `modeling_hear_canon.py`: custom `PreTrainedModel` with integrated audio preprocessing
- `model.safetensors`: distilled student weights
- `preprocessor_config.json`: preprocessing metadata
- `model_shapes.json`: structure and tensor shape inventory
- `training_args.json`: training/checkpoint args captured from the source checkpoint
- `maeb_summary.json`: saved MAEB task summary for this checkpoint
- `maeb_run_config.json`: saved MAEB inference settings
for that benchmark run
- `.gitattributes`: git/LFS attributes for model artifacts
- `smoke_test.py`: local verification script

## How to use

Install dependencies:

```bash
pip install -U "transformers>=4.50.0" timm torch scipy soundfile
```

Run the local smoke test:

```bash
python3 trained_model_hf_upload_maeb/smoke_test.py
```

### Inference from raw audio waveform

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "trained_model_hf_upload_maeb",
    trust_remote_code=True,
)
model.eval()

# 4 clips, each 2 seconds at 16 kHz => 32000 samples
raw_audio_batch = torch.rand((4, 32000), dtype=torch.float32)

with torch.inference_mode():
    out = model(input_values=raw_audio_batch, return_dict=True)

embeddings = out.pooler_output
print(embeddings.shape)  # torch.Size([4, 384])
```

### Inference from `.wav` file

```python
import torch
import soundfile as sf
from scipy import signal
from transformers import AutoModel


def load_wav_mono_16k(path: str, target_sr: int = 16000) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix to mono
    if sr != target_sr:
        new_len = int(round(audio.shape[0] * (target_sr / sr)))
        audio = signal.resample(audio, new_len)
    return torch.from_numpy(audio).float()


model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

waveform = load_wav_mono_16k("example.wav")

with torch.inference_mode():
    embedding = model.embed_audio(waveform)

print(embedding.shape)  # torch.Size([1, 384])
```

### Inference from preprocessed spectrograms

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("trained_model_hf_upload_maeb", trust_remote_code=True)
model.eval()

raw_audio = torch.rand((2, 32000), dtype=torch.float32)
spectrogram = model.preprocess_audio(raw_audio)

with torch.inference_mode():
    out = model(pixel_values=spectrogram, return_dict=True)

print(spectrogram.shape)  # torch.Size([2, 1, 192, 128])
print(out.pooler_output.shape)  # torch.Size([2, 384])
```

## Model architecture overview

- Student model parameters: `22,140,288`
- Embedding dimension: `384`
- Input shape: `[B, 1, 192, 128]`
- Output shape: `[B, 384]`

Detailed tensor shapes are provided in `model_shapes.json`.
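For recordings longer than 2 seconds, the MAEB run described above sliced audio into 2.0 s windows with a 2.0 s hop and mean-pooled the per-window embeddings. The sketch below illustrates that windowing and pooling scheme; the `window_waveform` helper is illustrative (not part of this package), and a random tensor stands in for the per-window embeddings, which in practice would come from `model(input_values=windows, return_dict=True).pooler_output`:

```python
import math

import torch
import torch.nn.functional as F


def window_waveform(waveform: torch.Tensor, sr: int = 16000,
                    window_s: float = 2.0, hop_s: float = 2.0) -> torch.Tensor:
    """Slice a mono waveform [T] into windows [N, window_samples],
    zero-padding so trailing audio shorter than one window is kept."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    n = math.ceil(max(waveform.numel() - win, 0) / hop) + 1
    total = win + (n - 1) * hop
    waveform = F.pad(waveform, (0, total - waveform.numel()))
    return waveform.unfold(0, win, hop)


# 5 seconds of audio at 16 kHz -> three 2-second windows (last one zero-padded)
audio = torch.rand(5 * 16000)
windows = window_waveform(audio)
print(windows.shape)  # torch.Size([3, 32000])

# Stand-in for per-window embeddings; the real call would be
# model(input_values=windows, return_dict=True).pooler_output
per_window = torch.rand(windows.shape[0], 384)
clip_embedding = per_window.mean(dim=0)  # mean pooling over windows
print(clip_embedding.shape)  # torch.Size([384])
```

With a 2.0 s hop equal to the 2.0 s window, windows are non-overlapping; a smaller hop would trade compute for smoother coverage of window boundaries.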