---
language:
- bn
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
datasets:
- SUST-CSE-Speech/SUBAK.KO
tags:
- asr
- audio
- speech
- qwen
- sst
model_type: asr
pipeline_tag: automatic-speech-recognition
model-index:
- name: Qwen3-ASR-1.7B-Bengali
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: SUBAK.KO
      type: SUST-CSE-Speech/SUBAK.KO
      split: validation
    metrics:
    - type: wer
      value: 23.91
      name: Word Error Rate
    - type: cer
      value: 9.57
      name: Character Error Rate
---

# Qwen3-ASR-1.7B-Bengali (Specialized)

This repository contains a fine-tuned, specialized version of the [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) model, optimized for **Bengali Automatic Speech Recognition (ASR)**.

After full-parameter Supervised Fine-Tuning (SFT) on 241 hours of Bangladeshi Bengali audio, this model overcomes the "language bleeding" and script-hallucination limitations of the base foundation model and produces accurate transcriptions in native Bengali script.

## Comparative Evaluation Results

The model was evaluated on **1,000 audio files** drawn as a deterministic random sample from the [SUBAK.KO](https://huggingface.co/datasets/SUST-CSE-Speech/SUBAK.KO) validation set. Metrics were calculated with the widely used `jiwer` library.

*Note: For fairness, the outputs of the baseline models (Qwen Base and Whisper) were normalized with `bnunicodenormalizer`, so that visually identical but differently encoded Unicode sequences are not penalized. The fine-tuned model's output was evaluated raw, without normalization, indicating that it already matches the reference encoding.*

| Model | WER (%) | CER (%) | Devanagari Script Error (%) |
|---|---|---|---|
| **Qwen3-ASR-1.7B-Bengali (FT)** | **23.91** | **9.57** | **0.00** |
| Qwen3-ASR-1.7B (Base) | 71.17 | 40.26 | 80.20 |
| OpenAI Whisper Large-v3 | 72.73 | 28.73 | 29.40 |

### Technical Interpretations & Breakthroughs

1. **Elimination of Language Bleeding:** The base Qwen3 model lacked deep alignment for the Bengali language. When prompted to transcribe Bengali, it showed an 80.2% script error rate, outputting Devanagari (Hindi) characters. This fine-tune eliminated the problem on the evaluation sample (0.00% script error), mapping the acoustic features directly to the Bengali script.
2. **Outperforming Generalist Models:** Whisper Large-v3 struggles heavily with the SUBAK.KO dataset due to regional Bangladeshi dialects, broadcast noise, and English loanword formatting. By specializing the weights on local data, this 1.7B model achieves a **~3x reduction in Word Error Rate (WER)** compared to Whisper Large-v3 (72.73% to 23.91%).

## Training Configuration

- **Methodology**: Full-parameter Supervised Fine-Tuning (SFT) using a custom PyTorch/Transformers pipeline. No parameter-efficient method such as LoRA was used, so all model weights were updated to fit the new linguistic representations.
- **Dataset**: SUBAK.KO (সুবাক্য): a 241-hour annotated Bangladeshi Bengali corpus (229h Read Speech; 12h Broadcast Speech).
- **Optimization**: Trained on NVIDIA A100 GPUs in `bfloat16` precision with native HF gradient accumulation. Prompt tokens were masked with the label `-100` so that the loss is computed only on the Bengali transcript tokens.

## Usage

### Method 1: Hugging Face Transformers

Since Qwen3-ASR is a multimodal LLM, you must provide a text prompt containing the `<|audio_pad|>` token so the model knows where to "listen."

```python
import torch
import librosa
from transformers import AutoModel, AutoProcessor

model_id = "amugoodbad229/Qwen3-ASR-Bengali-FT"

# 1. Initialize Processor and Model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# 2. Load audio (16kHz mono)
audio, _ = librosa.load("path/to/audio.wav", sr=16000)

# 3. Prepare Input with Prompt
# The <|audio_pad|> token is mandatory for the processor
prompt = "<|im_start|>user\n<|audio_pad|>Please transcribe.<|im_end|>\n<|im_start|>assistant\n"
inputs = processor(text=prompt, audio=audio, sampling_rate=16000, return_tensors="pt").to(model.device)

# Ensure floating point inputs match model precision
inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v for k, v in inputs.items()}

# 4. Generate
generated_ids = model.generate(**inputs, max_new_tokens=256)

# 5. Decode (slice off the prompt tokens so only the new text is decoded)
transcription = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)[0]

print(f"Result: {transcription}")
```

### Method 2: Qwen-ASR Official Wrapper (Recommended)

This is the recommended method for long audio files, because the wrapper handles chunking natively.

```python
from qwen_asr import Qwen3ASRModel

# Load the model using the official wrapper
# Ensure your model repo has the chat_template defined in tokenizer_config.json
model = Qwen3ASRModel.from_pretrained("amugoodbad229/Qwen3-ASR-Bengali-FT")

# Transcribe audio (Handles long-form audio chunking natively)
results = model.transcribe(audio=["path/to/your/audio.wav"], language=[None])

print(results[0].text)
```

***
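The metrics in the Comparative Evaluation Results table can be illustrated with a small, dependency-free sketch. It is not the evaluation script behind the reported numbers, which used `jiwer`. The function names are hypothetical, and the script-error definition (the share of transcripts containing at least one Devanagari code point, U+0900 to U+097F) is an assumption, since the card does not spell out how that column was computed.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER (%): total word edits over total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * edits / words


def corpus_cer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level CER (%): total character edits over total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * edits / chars


def has_devanagari(text: str) -> bool:
    """True if any code point falls in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def script_error_rate(hyps: List[str]) -> float:
    """Assumed definition: share (%) of transcripts containing Devanagari."""
    return 100.0 * sum(has_devanagari(h) for h in hyps) / len(hyps)


# Toy data: the second hypothesis "bleeds" into Devanagari script.
refs = ["আমি ভাত খাই", "তুমি কেমন আছো"]
hyps = ["আমি ভাত খাই", "तुम कैसे हो"]

print(corpus_wer(refs, hyps))        # 50.0 (3 wrong words out of 6)
print(script_error_rate(hyps))       # 50.0 (1 of 2 transcripts)
```

Corpus-level rates pool the edits across all files before dividing, so long utterances weigh more than short ones. Averaging per-file rates would give different numbers.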