---
license: apache-2.0
language:
- ar
- de
- el
- en
- es
- fr
- it
- ja
- ko
- nl
- pl
- pt
- vi
- zh
pipeline_tag: automatic-speech-recognition
tags:
- audio
- hf-asr-leaderboard
- speech-recognition
- transcription
library_name: transformers
---

# Cohere Transcribe

Cohere Transcribe is an open source release of a 2B parameter dedicated audio-in, text-out automatic speech recognition (ASR) model. The model supports 14 languages.

Developed by: [Cohere](https://cohere.com) and [Cohere Labs](https://cohere.com/research). Point of Contact: [Cohere Labs](https://cohere.com/research).

| Name | cohere-transcribe-03-2026 |
|---|---|
| Architecture | conformer-based encoder-decoder |
| Input | audio waveform → log-Mel spectrogram. Audio is automatically resampled to 16 kHz if necessary during preprocessing. Similarly, multi-channel (stereo) inputs are averaged to produce a single-channel signal. |
| Output | transcribed text |
| Model size | 2B |
| Model | a large Conformer encoder extracts acoustic representations, followed by a lightweight Transformer decoder for token generation |
| Training objective | supervised cross-entropy on output tokens; trained from scratch |
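| Languages | Trained on 14 languages: Arabic (ar), German (de), Greek (el), English (en), Spanish (es), French (fr), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Vietnamese (vi), Chinese (zh) |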
| License | Apache 2.0 |
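
The Input row describes two preprocessing steps: averaging multi-channel audio into one channel and resampling to 16 kHz. The following is a minimal sketch of those two steps in NumPy, not the model's actual preprocessing code. The function names are hypothetical, the `(channels, samples)` layout is an assumption, and linear interpolation stands in for a proper band-limited resampler (it does no anti-alias filtering). The log-Mel step is omitted because the card does not give its parameters.

```python
import numpy as np

TARGET_SR = 16_000  # target sampling rate stated in the model card


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average channels. Assumes shape (samples,) or (channels, samples)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=0)


def resample_linear(audio: np.ndarray, orig_sr: int,
                    target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample by linear interpolation (illustrative stand-in only)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Position of each output sample, measured in input-sample indices.
    src_pos = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(src_pos, np.arange(len(audio)), audio).astype(np.float32)


def prepare(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono, then bring the signal to 16 kHz."""
    return resample_linear(to_mono(audio), sr)


# One second of 44.1 kHz stereo becomes 16,000 mono samples.
stereo = np.stack([np.ones(44_100), 3 * np.ones(44_100)])
print(prepare(stereo, 44_100).shape)
```

Since the card says resampling and downmixing happen automatically, a sketch like this is mainly useful for understanding what the model sees, or for preparing audio outside the provided pipeline.
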

| Model | Average WER | AMI | Earnings 22 | Gigaspeech | LS clean | LS other | SPGISpeech | Tedlium | Voxpopuli |
|---|---|---|---|---|---|---|---|---|---|
| Cohere Transcribe | 5.42 | 8.15 | 10.84 | 9.33 | 1.25 | 2.37 | 3.08 | 2.49 | 5.87 |
| Zoom Scribe v1 | 5.47 | 10.03 | 9.53 | 9.61 | 1.63 | 2.81 | 1.59 | 3.22 | 5.37 |
| IBM Granite 4.0 1B Speech | 5.52 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 | 3.89 | 3.10 | 5.84 |
| NVIDIA Canary Qwen 2.5B | 5.63 | 10.19 | 10.45 | 9.43 | 1.61 | 3.10 | 1.90 | 2.71 | 5.66 |
| Qwen3-ASR-1.7B | 5.76 | 10.56 | 10.25 | 8.74 | 1.63 | 3.40 | 2.84 | 2.28 | 6.35 |
| ElevenLabs Scribe v2 | 5.83 | 11.86 | 9.43 | 9.11 | 1.54 | 2.83 | 2.68 | 2.37 | 6.80 |
| Kyutai STT 2.6B | 6.40 | 12.17 | 10.99 | 9.81 | 1.70 | 4.32 | 2.03 | 3.35 | 6.79 |
| OpenAI Whisper Large v3 | 7.44 | 15.95 | 11.29 | 10.02 | 2.01 | 3.91 | 2.94 | 3.86 | 9.54 |
| Voxtral Mini 4B Realtime 2602 | 7.68 | 17.07 | 11.84 | 10.38 | 2.08 | 5.52 | 2.42 | 3.79 | 8.34 |
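
The card does not say how the Average WER column is computed. The values are consistent with an unweighted mean of the eight per-dataset scores, which the short check below illustrates for three rows copied from the table. The helper name `macro_average` is hypothetical, and the unweighted mean is an inference from the numbers rather than a documented procedure.

```python
from statistics import mean

# (reported average, per-dataset WERs) copied from the table above, in
# column order: AMI, Earnings 22, Gigaspeech, LS clean, LS other,
# SPGISpeech, Tedlium, Voxpopuli.
ROWS = {
    "Cohere Transcribe": (5.42, [8.15, 10.84, 9.33, 1.25, 2.37, 3.08, 2.49, 5.87]),
    "Zoom Scribe v1": (5.47, [10.03, 9.53, 9.61, 1.63, 2.81, 1.59, 3.22, 5.37]),
    "OpenAI Whisper Large v3": (7.44, [15.95, 11.29, 10.02, 2.01, 3.91, 2.94, 3.86, 9.54]),
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean over datasets, rounded to two decimals."""
    return round(mean(scores), 2)


for name, (reported, scores) in ROWS.items():
    print(f"{name}: computed {macro_average(scores):.2f}, reported {reported:.2f}")
```

Under this reading every dataset counts equally regardless of its size, so a large gap on one benchmark (AMI, for example) moves the average as much as the same gap on any other.
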

_Figure: Human preference evaluation of model transcripts. In a head-to-head comparison, annotators were asked to prefer transcripts that primarily preserved meaning, but that also avoided hallucination, correctly identified named entities, and provided verbatim transcripts with appropriate formatting. A score of 50% or higher indicates that Cohere Transcribe was preferred on average in the comparison._

_Figure: Per-language error rate averaged over the FLEURS, Common Voice 17.0, MLS and Wenet test sets (where relevant for a given language). CER for zh, ja and ko; WER otherwise._
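
The caption above distinguishes character error rate (CER) for zh, ja and ko from word error rate (WER) for the other languages. The sketch below shows one way that choice could be made per utterance. It is an illustration under stated assumptions, not the evaluation code behind the figure: the function names are hypothetical, no text normalization is applied, and real evaluations usually pool errors over a whole test set instead of scoring single sentences.

```python
CER_LANGS = {"zh", "ja", "ko"}  # languages scored by character, per the caption


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def tokenize(text: str, lang: str) -> list[str]:
    """Characters (whitespace dropped) for CER languages, words otherwise."""
    if lang in CER_LANGS:
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def error_rate(reference: str, hypothesis: str, lang: str) -> float:
    """Edit distance divided by reference length, in the unit chosen for lang."""
    ref = tokenize(reference, lang)
    hyp = tokenize(hypothesis, lang)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


print(error_rate("the cat sat", "the cat sit", "en"))  # one of three words wrong
print(error_rate("今日は晴れ", "今日は雨", "ja"))          # two of five characters wrong
```

Character-level scoring is the usual choice for languages written without spaces between words, where a word-level score would depend on an external segmenter.
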