---
language:
- bn
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
datasets:
- SUST-CSE-Speech/SUBAK.KO
tags:
- asr
- audio
- speech
- qwen
- stt
model_type: asr
pipeline_tag: automatic-speech-recognition
model-index:
- name: Qwen3-ASR-1.7B-Bengali
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: SUBAK.KO
      type: SUST-CSE-Speech/SUBAK.KO
      split: validation
    metrics:
    - type: wer
      value: 23.91
      name: Word Error Rate
    - type: cer
      value: 9.57
      name: Character Error Rate
---

# Qwen3-ASR-1.7B-Bengali (Specialized)

This repository contains a fine-tuned version of the [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) model, specialized for **Bengali Automatic Speech Recognition (ASR)**.

The model was trained with full-parameter Supervised Fine-Tuning (SFT) on 241 hours of Bangladeshi Bengali audio. This addresses the "language bleeding" and script hallucination of the base model (emitting Devanagari instead of Bengali script) and substantially lowers transcription error on the SUBAK.KO validation sample.

## Comparative Evaluation Results

The model was evaluated on a **fixed, reproducible random sample of 1,000 audio files** from the [SUBAK.KO](https://huggingface.co/datasets/SUST-CSE-Speech/SUBAK.KO) validation set. Metrics were computed with the `jiwer` library.

*Note: For fairness, the outputs of the baseline models (Qwen Base and Whisper) were normalized with `bnunicodenormalizer`, so that visually identical but differently encoded Unicode sequences were not counted as errors. The fine-tuned model's output was scored raw, without normalization, which suggests that it already emits text in the same Unicode form as the references.*
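
The evaluation protocol above can be sketched without external dependencies. The snippet below draws a reproducible sample with a fixed seed and computes WER and CER as corpus-level edit distance over words and characters, which is the standard definition of both metrics. The seed value, the helper names, and the toy strings are assumptions for illustration and are not the actual evaluation script; Unicode normalization is omitted, and a library such as `jiwer` should be preferred in practice.

```python
import random

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def error_rate(refs, hyps, tokenize):
    """Corpus-level rate: total edits divided by total reference tokens."""
    edits = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return edits / total

def wer(refs, hyps):
    # Tokens are whitespace-separated words
    return error_rate(refs, hyps, str.split)

def cer(refs, hyps):
    # Tokens are Unicode code points, spaces included
    return error_rate(refs, hyps, list)

def sample_files(paths, k, seed=42):
    """Reproducible sample: the same seed always selects the same files."""
    return random.Random(seed).sample(sorted(paths), k)

# Toy example (hypothetical strings, not taken from SUBAK.KO)
refs = ["আমি ভাত খাই"]
hyps = ["আমি ভাত খাব"]
print(f"WER: {wer(refs, hyps):.2%}")
print(f"CER: {cer(refs, hyps):.2%}")
```

Sorting the paths before sampling keeps the selection independent of the order in which the files were listed.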

| Model | WER (%) | CER (%) | Script error rate (%, Devanagari output) |
|---|---|---|---|
| **Qwen3-ASR-1.7B-Bengali (FT)** | **23.91** | **9.57** | **0.00** |
| Qwen3-ASR-1.7B (Base) | 71.17 | 40.26 | 80.20 |
| OpenAI Whisper Large-v3 | 72.73 | 28.73 | 29.40 |

### Technical Interpretations & Breakthroughs
1. **Elimination of Language Bleeding:** The base Qwen3-ASR model is poorly aligned to Bengali. When asked to transcribe Bengali audio, it had an 80.20% script error rate, emitting Devanagari (Hindi) characters instead of Bengali script. After fine-tuning, no Devanagari output was observed on the evaluation sample (0.00%).
2. **Outperforming Generalist Models:** Whisper Large-v3 struggles with the SUBAK.KO dataset due to regional Bangladeshi dialects, broadcast noise, and the formatting of English loanwords. By specializing its weights on local data, this 1.7B model reaches a WER of 23.91% against 72.73% for the comparably sized Whisper Large-v3 (about 1.5B parameters), a roughly **3x reduction in Word Error Rate (WER)**.
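
This card does not spell out how the script error rate is computed. One plausible definition, assumed here, counts a transcript as a script error if it contains any character from the Devanagari Unicode block (U+0900 to U+097F) and reports the share of such transcripts. The function names and sample strings are illustrative.

```python
def has_devanagari(text):
    """True if any character falls in the Devanagari block (U+0900..U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)

def script_error_rate(transcripts):
    """Share of transcripts containing at least one Devanagari character."""
    if not transcripts:
        return 0.0
    return sum(has_devanagari(t) for t in transcripts) / len(transcripts)

outputs = [
    "আমি ভাত খাই",        # Bengali script (U+0980..U+09FF)
    "मैं चावल खाता हूँ",   # Devanagari script
]
print(f"Script error rate: {script_error_rate(outputs):.2%}")
```

Under this assumed definition, a rate of 80.20% on 1,000 files would correspond to 802 transcripts containing Devanagari characters.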

## Training Configuration
- **Methodology**: Full-parameter Supervised Fine-Tuning (SFT) using a custom PyTorch/Transformers pipeline. No parameter-efficient fine-tuning method (such as LoRA) was used; all model weights were updated.
- **Dataset**: SUBAK.KO (সুবাক্য), a 241-hour annotated Bangladeshi Bengali speech corpus (229 h read speech; 12 h broadcast speech).
- **Optimization**: Trained on NVIDIA A100 GPUs in `bfloat16` precision with native Hugging Face gradient accumulation. Prompt tokens were masked in the labels (set to `-100`) so that the loss is computed only on the Bengali transcription tokens.
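
The prompt masking described above can be illustrated in plain Python. Positions that belong to the prompt receive the label `-100`, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so only the transcription tokens contribute to the loss. The token IDs and the helper name `build_labels` are made up for illustration; the actual training pipeline is not shown in this card.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

# Hypothetical token IDs: 4 prompt tokens followed by 3 transcription tokens
input_ids = [101, 7, 7, 102, 5001, 5002, 5003]
labels = build_labels(input_ids, prompt_len=4)
print(labels)

# Number of positions that contribute to the loss
supervised = sum(label != IGNORE_INDEX for label in labels)
print(supervised)
```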


## Usage

### Method 1: Hugging Face Transformers
Because Qwen3-ASR is a multimodal LLM, the text prompt must contain the `<|audio_pad|>` token so that the model knows where in the sequence to "listen" to the audio.

```python
import torch
import librosa
from transformers import AutoModel, AutoProcessor

model_id = "amugoodbad229/Qwen3-ASR-Bengali-FT" 

# 1. Initialize Processor and Model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, 
    dtype=torch.bfloat16, 
    device_map="auto", 
    trust_remote_code=True
)

# 2. Load audio (16kHz mono)
audio, _ = librosa.load("path/to/audio.wav", sr=16000)

# 3. Prepare Input with Prompt
# The <|audio_pad|> token is mandatory for the processor
prompt = "<|im_start|>user\n<|audio_pad|>Please transcribe.<|im_end|>\n<|im_start|>assistant\n"
inputs = processor(text=prompt, audio=audio, sampling_rate=16000, return_tensors="pt").to(model.device)

# Ensure floating point inputs match model precision
inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v for k, v in inputs.items()}

# 4. Generate
generated_ids = model.generate(**inputs, max_new_tokens=256)

# 5. Decode
transcription = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)[0]

print(f"Result: {transcription}")
```

### Method 2: Qwen-ASR Official Wrapper (Recommended)
The official wrapper handles chunking of long-form audio itself, which makes it the more convenient option for long audio files.

```python
from qwen_asr import Qwen3ASRModel

# Load the model using the official wrapper
# Ensure your model repo has the chat_template defined in tokenizer_config.json
model = Qwen3ASRModel.from_pretrained("amugoodbad229/Qwen3-ASR-Bengali-FT")

# Transcribe audio (Handles long-form audio chunking natively)
results = model.transcribe(audio=["path/to/your/audio.wav"], language=[None])
print(results[0].text)
```
