malper/abjadsr-he-ipa

Fine-tuned from outputs/pretrain_ipa_only on ILSpeech (~2 h of studio-quality Hebrew speech). Given Hebrew audio, the model outputs an IPA transcription only (no Hebrew text).

Stage 2 of 2 — use this model for inference.

Training

  • Dataset: ILSpeech (~2h Hebrew, studio quality) — train split, with 10% held out as dev
  • Base model: outputs/pretrain_ipa_only
  • Checkpoint: step 200 (best by dev token accuracy)
  • Dev token accuracy: 96.5%
  • Dev loss: 0.945
  • Learning rate: 5e-06, warmup 100 steps
  • Batch size: 1 × 64 grad-accum steps × 4 GPUs (effective 256)
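The hyperparameters above can be sketched as a transformers training configuration. This is a hypothetical reconstruction, not the published training script; the output_dir, eval_steps, and metric key are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the fine-tuning configuration from the
# hyperparameters listed above; output_dir, eval_steps, and the
# best-model metric name are assumptions.
args = Seq2SeqTrainingArguments(
    output_dir="outputs/finetune_he_ipa",    # assumed path
    learning_rate=5e-6,
    warmup_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,          # x 4 GPUs -> effective batch 256
    evaluation_strategy="steps",
    eval_steps=50,                           # assumed; best checkpoint was step 200
    load_best_model_at_end=True,
    metric_for_best_model="token_accuracy",  # assumed metric key
    greater_is_better=True,
)
```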

Output format

Plain ASCII IPA transcription, space-separated by word:

hex'lit j'uzem lena'tsel

IPA special characters are mapped to ASCII: ʃ→S, ʒ→Z, dʒ→dZ, tʃ→tS, ʔ→q, ˈ→', ʁ→r, χ→x, ɡ→g.
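If you need the true IPA glyphs back, the mapping above can be inverted with a small helper (a sketch; the two-character sequences dZ and tS must be replaced before the single letters they contain):

```python
# Sketch of inverting the ASCII mapping above back to IPA glyphs.
# Digraphs (dZ, tS) are handled before the single letters S and Z;
# dict insertion order is preserved in Python 3.7+.
ASCII_TO_IPA = {
    "dZ": "dʒ", "tS": "tʃ",
    "S": "ʃ", "Z": "ʒ", "q": "ʔ", "'": "ˈ", "r": "ʁ", "x": "χ", "g": "ɡ",
}

def ascii_to_ipa(text: str) -> str:
    for src, dst in ASCII_TO_IPA.items():
        text = text.replace(src, dst)
    return text

print(ascii_to_ipa("hex'lit j'uzem lena'tsel"))  # heχˈlit jˈuzem lenaˈtsel
```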

Usage

import torch
import soundfile as sf
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "malper/abjadsr-he-ipa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load audio (must be 16 kHz mono float32)
audio, sr = sf.read("audio.wav", dtype="float32", always_2d=False)
# resample if needed: torchaudio.functional.resample(torch.from_numpy(audio), sr, 16000).numpy()

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="he", task="transcribe")

with torch.no_grad():
    generated = model.generate(
        inputs.input_features,
        forced_decoder_ids=forced_ids,
        max_new_tokens=444,  # Whisper's decoder caps at 448 total positions
    )

output = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
print(output)
# e.g. "hex'lit j'uzem lena'tsel"
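Whisper's feature extractor processes at most 30 seconds of audio per call, so longer recordings should be split into chunks and transcribed piecewise. A minimal sketch of computing chunk boundaries (sample indices) at 16 kHz:

```python
# Sketch: split a long 16 kHz signal into <= 30 s chunks for Whisper.
# Each (start, end) pair indexes into the audio array loaded above.
def chunk_bounds(n_samples: int, sr: int = 16000, chunk_s: float = 30.0):
    step = int(sr * chunk_s)
    return [(s, min(s + step, n_samples)) for s in range(0, n_samples, step)]

print(chunk_bounds(75 * 16000))  # 75 s -> 3 chunks
```

Run the model on each chunk and join the decoded strings with spaces.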
Model size: 0.8B params (F32, Safetensors)