Granite-speech-4.1-2b-plus

Model Summary

Granite-speech-4.1-2b-plus has similar capabilities to the Granite-speech-4.1-2b model. The plus model adds two community-requested rich-transcription features that can be activated with a simple prompt change: speaker-attributed ASR (speaker labels on the word transcript) and word-level timing information. Unlike the base model, the plus model does not provide punctuation or capitalization.

The model was trained on corpora similar to those used for the Granite-speech-4.1-2b model, augmented with speaker-turn and word-level timestamp tags. This allows the model to offer different modes of functionality, each controlled by a different prompt.

Two additional model variants explore different capabilities and inference optimization:

  • granite-speech-4.1-2b for applications where accuracy is the primary concern; it supports punctuated, capitalized transcripts, AST, and keyword-biased recognition, and adds Japanese.
  • granite-speech-4.1-2b-nar introduces a novel non-autoregressive architecture for higher throughput.

ASR only mode

In this mode the model generates only the text transcript similar to the Granite-speech-4.1-2b model.

Speaker attributed ASR (SAA)

In this mode, the model adds speaker tags in the format of [Speaker N]: where $N$ is the speaker number, before each speaker turn. The speakers are numbered by their order of appearance so the first speaker will always be marked with [Speaker 1]: and the second with [Speaker 2]:, etc. For example: "[Speaker 1]: Hello how are you [Speaker 2]: I'm fine and how are you feeling [Speaker 1]: I feel wonderful".

For more information about SAA, see [1].
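A transcript in this format can be split back into (speaker, utterance) pairs with a small regex. The helper below is an illustrative sketch, not part of the model API:

```python
import re

def parse_saa(transcript: str) -> list[tuple[int, str]]:
    """Split a speaker-attributed transcript into (speaker_number, text) turns."""
    parts = re.split(r"\[Speaker (\d+)\]:", transcript)
    # parts = [leading_text, "1", text1, "2", text2, ...]
    return [(int(n), text.strip()) for n, text in zip(parts[1::2], parts[2::2])]

turns = parse_saa(
    "[Speaker 1]: Hello how are you "
    "[Speaker 2]: I'm fine and how are you feeling "
    "[Speaker 1]: I feel wonderful"
)
# turns[0] == (1, "Hello how are you")
```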

Word-level timestamps

In this mode, the model adds a timestamp tag after each word indicating where the word ends in the audio. Silences are transcribed as _ and also receive a timestamp tag marking their end. The tag format is [T:N], where $N$ is an integer giving the time in centiseconds (1/100th of a second). To reduce the number of generated tokens, only the last three digits of $N$ are emitted, which causes a rollover every 10 seconds.

The conversion from time $t$ in seconds to timestamp is $N = round(t*100) \mod 1000$. To convert back to seconds, use $t = N/100 + 10R$ where $R$ is the rollover counter. See the code below for an example implementation in Python.
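The round-trip conversion can be sketched as follows (helper names are illustrative, not part of the model API; unwrapping relies on word end times being non-decreasing):

```python
def time_to_tag(t: float) -> int:
    """Convert a time in seconds to the 3-digit centisecond tag value."""
    return round(t * 100) % 1000

def tag_to_time(n: int, prev_time: float = 0.0) -> float:
    """Convert a tag value back to seconds, unwrapping the 10 s rollover.

    prev_time is the absolute end time of the previous word; since word end
    times are non-decreasing, we add 10 s until the candidate time is no
    longer earlier than prev_time.
    """
    t = n / 100
    while t < prev_time:
        t += 10.0
    return t

# A word ending at 12.34 s is tagged [T:234]; with the previous word
# ending at 11.9 s, the tag unwraps back to 12.34 s.
```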

Incremental decoding

There are cases where we want to transcribe a new audio segment along with previous segments that we have already transcribed. This can be useful for giving the model longer context to improve transcription accuracy, or for maintaining the speaker numbering in SAA mode. To avoid re-decoding the previous segments, we can provide the previous transcription in the prefix_text field of the conversation template; the model will then decode only the portion that follows. See the code below for examples.

Keyword list biasing (KWB)

Keyword list biasing capability [16] is available to enhance the recognition of keywords, such as names and technical terms. This is particularly useful in tasks where complex terms may otherwise be misrecognized. Keyword biasing can be applied by including the keywords directly in the prompt; for example, in ASR mode: Can you transcribe the speech into a written format? Keywords: …

Users may provide either a single keyword or a list of keywords, which may also include terms that do not appear in the input audio, making them well suited for batch processing or recurring domain-specific use cases.
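Building such a prompt is plain string formatting. The sketch below uses the ASR-mode prompt from the usage examples; the helper name and the keyword list itself are made up for illustration:

```python
def make_kwb_prompt(keywords: list[str]) -> str:
    """Append a comma-separated keyword list to the base ASR prompt."""
    base = "<|audio|> can you transcribe the speech into a written format?"
    return f"{base} Keywords: {', '.join(keywords)}"

prompt = make_kwb_prompt(["Granite", "IBM", "centiseconds"])
# "...written format? Keywords: Granite, IBM, centiseconds"
```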

Evaluations

ASR

Performance on HuggingFace Open ASR leaderboard:

| Model | Average WER | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | Tedlium | Voxpopuli |
|---|---|---|---|---|---|---|---|---|---|
| ibm-granite/granite-speech-4.1-2b-plus | 5.71 | 8.63 | 8.68 | 10.38 | 1.44 | 3.06 | 3.72 | 3.89 | 5.9 |

(Using speculative decoding)

Keyword list biasing accuracy - Keyword F1 score (%, ↑ higher is better):

| Mode | Gigaspeech | LS-C | LS-O | SPGISpeech | VOX | TED_LIUM | Earnings22 | CV-en | CV-de | CV-es | CV-fr | CV-pt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Without KWB | 74.2 | 89.1 | 78.2 | 80.8 | 93.9 | 87.9 | 68.8 | 74.6 | 78.5 | 83.1 | 74.5 | 90.0 |
| With KWB | 84.1 | 96.1 | 93.0 | 92.5 | 96.3 | 94.9 | 81.5 | 91.5 | 92.9 | 93.9 | 90.6 | 95.0 |

Speaker Attributed ASR

Speaker Attributed ASR performance - WDER (%, ↓ lower is better):

| Model | FISHER | CALLHOME English | AMI-SDM | GALE [1] |
|---|---|---|---|---|
| VibeVoice ASR [17] | 2.8 | 7.1 | 27.4 | 44.8 |
| Granite-speech-4.1-2b-plus | 0.9 | 2.2 | 14.6 | 30.2 |

The results are averaged over 2-5 minute speech segments.
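WDER counts, among the words aligned between reference and hypothesis, the fraction attributed to the wrong speaker. The sketch below assumes the word alignment has already been done (the inputs are per-word speaker labels); a full evaluation also optimally maps hypothesis speaker numbers to reference speakers before counting:

```python
def wder(ref_speakers: list[int], hyp_speakers: list[int]) -> float:
    """Fraction of aligned words whose hypothesis speaker label is wrong."""
    assert len(ref_speakers) == len(hyp_speakers)
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / len(ref_speakers)

# 1 of 5 words attributed to the wrong speaker -> 20% WDER.
print(wder([1, 1, 2, 2, 2], [1, 1, 2, 1, 2]))  # 0.2
```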

Timestamps

Word-level timestamp accuracy - AAS (ms, ↓ lower is better):

| Model | AMI-I | AMI-S | LS-C | LS-O | VOX | CV | MLS | TMT | En Avg | MLS-fr | MLS-es | MLS-de | MLS-pt | CV-fr | CV-es | CV-de | CV-pt | ML Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-FA [12] | 48.1 | 82.5 | 27.8 | 29.3 | 41.0 | 48.4 | 34.3 | 29.9 | 42.7 | 38.1 | 27.0 | 31.2 | 26.3 | 30.3 | 40.0 | 29.4 | 34.2 | 33.3 |
| CrisperWhisper [13] | 55.7 | 64.3 | 35.9 | 40.1 | 47.2 | 97.4 | 46.4 | 42.7 | 53.7 | 35.6 | 28.0 | 31.2 | 36.8 | 62.9 | 58.9 | 60.9 | 83.8 | 50.1 |
| Canary-v2 [14] | 127.8 | 129.7 | 92.5 | 89.2 | 109.9 | 110.3 | 94.3 | 86.1 | 105.0 | 85.0 | 81.1 | 80.2 | 86.8 | 88.5 | 91.5 | | | |
| WhisperX [15] | 107.1 | 150.2 | 71.7 | 72.0 | 78.8 | 91.2 | 79.2 | 63.6 | 89.2 | 117.3 | 84.7 | 132.2 | 75.0 | 104.2 | 88.1 | 126.8 | 79.5 | 101.0 |
| Granite-speech-4.1-2b-plus | 43.4 | 69.0 | 11.4 | 14.6 | 80.2 | 43.3 | 24.3 | 24.5 | 38.8 | 45.4 | 23.0 | 41.3 | 47.1 | 18.6 | 19.3 | 19.5 | 24.2 | 29.8 |

Generation

The Granite Speech model is supported natively in transformers>=5.6. Below is a simple example of how to use the different modes of the model.

Usage with transformers

First, make sure to install a recent version of transformers:

pip install transformers torchaudio datasets

Setup — load the model and a test audio clip:

import re
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

SAMPLE_RATE = 16000
MODEL_NAME = "ibm-granite/granite-speech-4.1-2b-plus"

Load the model and define a general function for decoding the audio:

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, device_map=device, dtype=torch.bfloat16)

def transcribe(audio, prompt, max_new_tokens=2000, prefix_text=None):
    chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
    extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
    prompt_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
    inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
    new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

Load some example audio data from the AMI dataset:

ds = load_dataset("diarizers-community/ami", "ihm", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=SAMPLE_RATE, num_channels=1))

TEST_SAMPLE = 0
START_TIME, END_TIME = 5 * 60, 6 * 60
audio = ds["audio"][TEST_SAMPLE].get_samples_played_in_range(START_TIME, END_TIME)

Define the prompts used for the different tasks:

SYSTEM_PROMPT = "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
ASR_PROMPT = "<|audio|> can you transcribe the speech into a written format?"
SAA_PROMPT = "<|audio|> Speaker attribution: Transcribe and denote who is speaking by adding [Speaker 1]: and [Speaker 2]: tags before speaker turns."
TS_PROMPT = "<|audio|> Timestamps: Transcribe the speech. After each word, add a timestamp tag showing the end time in centiseconds, e.g. hello [T:45] world [T:82]"

Task 1: ASR — plain speech-to-text transcription:

asr_text = transcribe(audio.data, ASR_PROMPT)
print(asr_text)

Task 2: Speaker Attributed ASR — transcription with speaker labels:

saa_text = transcribe(audio.data, SAA_PROMPT)
for segment in re.split(r"(\[Speaker \d+\]:)", saa_text):
    print(segment.strip())

Task 3: Word-level timestamps — transcription with per-word timing:

The timestamps are given in centiseconds and are modulo 1000 (=10 seconds) so we need to unwrap them by adding multiples of 10 seconds.

ts_text = transcribe(audio.data, TS_PROMPT, max_new_tokens=10000)
ts_words = re.split(r"\[T:(\d+)\]", ts_text)
last_word_end_time = 0
offset_time = 0
for word, ts in zip(ts_words[::2], ts_words[1::2]):
    word_end_time = float(ts) / 100
    while word_end_time + offset_time < last_word_end_time:
        offset_time += 10
    last_word_end_time = word_end_time + offset_time
    print(f"{word}\t{last_word_end_time:.2f}s")

Task 4: Incremental decoding — transcribe segments while accumulating audio context:

NUM_SEGMENTS = 3
previous_transcript = ""
all_audio = None

for k in range(NUM_SEGMENTS):
    t1 = START_TIME + (END_TIME - START_TIME) * k / NUM_SEGMENTS
    t2 = START_TIME + (END_TIME - START_TIME) * (k + 1) / NUM_SEGMENTS
    new_audio = ds["audio"][TEST_SAMPLE].get_samples_played_in_range(t1, t2)
    all_audio = new_audio.data if all_audio is None else torch.cat([all_audio, new_audio.data], dim=-1)
    saa_text = transcribe(all_audio, SAA_PROMPT, prefix_text=previous_transcript)
    print(f"{t1:06.2f}-{t2:06.2f}:\t{saa_text}")
    previous_transcript = (previous_transcript + " " + saa_text).strip()

Model information

Release Date: April 28, 2026

License: Apache 2.0

Supported Languages

English, French, German, Spanish, Portuguese

Intended Use

The model is intended for enterprise applications that involve processing of speech input, especially when a rich transcript with speaker turns and timestamps is desired. In particular, the model is well-suited for English, French, German, Spanish, and Portuguese speech-to-text.

Model Architecture

The model shares the same architecture as the Granite-speech-4.1-2b model.

Training Data

The model was trained on the same datasets as Granite-speech-4.1-2b.

Additional training data for SAA was created using audio segments from datasets with speaker identification (e.g. Multilingual-Librispeech). Segments with alternating speakers were concatenated to create long multi-speaker samples.

Similarly, long samples with timestamps were generated from concatenation of short segments.

The model was trained on audio samples up to 10 minutes for ASR and SAA, and up to 5 minutes for timestamps.

Training Data for Timestamps

Word-level timestamping capabilities are achieved by using a combination of publicly available speech corpora: LibriSpeech [2], MLS (en, fr, de, pt, es) [3], CommonVoice (en, fr, de, pt, es) [4], VoxPopuli (en, fr, de, es) [5], AMI-IHM [6], Switchboard [7], TIMIT [8] and YODAS [9]. For AMI-IHM, Switchboard and TIMIT, we use the available timestamp annotations. For all other datasets, we obtain word-level alignments using the Montreal Forced Aligner (MFA) [10], a GMM-HMM based forced alignment tool. We also use MFA to insert silence boundaries into the manually annotated datasets.

To ensure high-quality training data, we validate the MFA-derived alignments using forced alignments with our CTC-based speech encoder. We compute the Accumulated Average Shift (AAS) [11], the mean absolute error between timestamps in milliseconds, for the CTC and MFA alignments and retain only samples with the lowest alignment error: the top 95% for English and top 70% for non-English data. For the larger datasets (YODAS and MLS-en), we cap the training data at 4M and 5M samples, respectively.
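As described above, AAS is the mean absolute error between corresponding word timestamps, expressed in milliseconds. A minimal sketch (the helper name is illustrative; both lists are assumed to cover the same words in order):

```python
def aas_ms(ref_times: list[float], hyp_times: list[float]) -> float:
    """Accumulated Average Shift: mean |ref - hyp| over word timestamps, in ms.

    Times are given in seconds; both lists must refer to the same words.
    """
    assert len(ref_times) == len(hyp_times)
    total = sum(abs(r - h) for r, h in zip(ref_times, hyp_times))
    return 1000 * total / len(ref_times)

print(aas_ms([1.00, 2.50, 4.10], [1.02, 2.45, 4.10]))  # ≈ 23.3 ms
```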

Infrastructure

We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. The training of this particular model was completed in about 5 days on 32 H100 GPUs.

Ethical Considerations and Limitations

The use of Large Speech and Language Models can trigger certain risks and ethical considerations. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models are more susceptible to hallucination due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses. This is currently an active area of research, and we anticipate more rigorous exploration, understanding, and mitigation in this domain.

IBM recommends using this model for automatic speech recognition and translation tasks. The model's design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply ignores it and performs transcription, which is the default fallback mode. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.

To enhance safety, we recommend using Granite-speech-4.1-2b-plus alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas.

Resources

References

[1] Hagai Aronowitz, Zvi Kons, Avihu Dekel, George Saon, Ron Hoory, "Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs", to be published in ICASSP 2026, arXiv

[2] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. ICASSP, 2015.

[3] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, "MLS: A large-scale multilingual dataset for speech research," in Proc. Interspeech, 2020.

[4] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proc. LREC, 2020.

[5] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proc. ACL-IJCNLP, 2021.

[6] W. Kraaij, T. Hain, M. Lincoln, and W. Post, "The AMI meeting corpus," in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005.

[7] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proc. ICASSP, 1992.

[8] J. W. Lyons, "DARPA TIMIT acoustic-phonetic continuous speech corpus," National Institute of Standards and Technology, 1993.

[9] X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota, and S. Watanabe, "YODAS: YouTube-oriented dataset for audio and speech," in Proc. IEEE ASRU, 2023.

[10] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proc. Interspeech, 2017.

[11] X. Shi, Y. Chen, S. Zhang, and Z. Yan, "Achieving timestamp prediction while recognizing with non-autoregressive end-to-end ASR model," in National Conference on Man-Machine Speech Communication, 2022.

[12] X. Shi et al., "Qwen3-ASR technical report," arXiv preprint arXiv:2601.21337, 2026. arXiv

[13] M. Zusag, L. Wagner, and B. Thallinger, "CrisperWhisper: Accurate timestamps on verbatim speech transcriptions," in Proc. Interspeech, 2024.

[14] M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, "Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and high-performance models for multilingual ASR and AST," 2025. arXiv

[15] M. Bain, J. Huh, T. Han, and A. Zisserman, "WhisperX: Time-accurate speech transcription of long-form audio," 2023. arXiv

[16] S. Novitasari, T. Fukuda, G. Kurata, and G. Saon, "Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction," 2026. arXiv

[17] VibeVoice-ASR (Transformers-compatible version). Available online: https://huggingface.co/microsoft/VibeVoice-ASR-HF.
