Whisper Small - Merged Hindi & English (Task Arithmetic)

This model is a bilingual variant of OpenAI's Whisper Small. It is optimized for Hindi transcription and uses Task Arithmetic merging to recover the base model's English capability.

Model Details

Model Description

Standard fine-tuning on a single language often causes "catastrophic forgetting": the model loses accuracy on languages it handled before fine-tuning, such as English. This model mitigates that by merging a Hindi fine-tuned model back into the original base model with Task Arithmetic.

Developed by: specialv
Model type: Speech-to-Text (ASR)
Language(s): Hindi (hi), English (en)
Finetuned from model: openai/whisper-small
Finetuning Dataset: google/fleurs (Hindi subset)
Merging Method: Task Arithmetic ($W_{new} = W_{base} + \lambda(W_{fine} - W_{base})$)

Model Sources

Base Model: openai/whisper-small
Hindi Task Vector source: specialv/whisper-small-hi-fleurs

Uses

Direct Use

Bilingual Transcription: Accurately transcribes both pure Hindi and pure English audio.
Hinglish Support: Improved performance on code-switched speech (mixed Hindi and English) compared to the base or purely fine-tuned versions.
Transcription of Long-form Audio: Supports chunking for files longer than 30 seconds.
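
Long-form handling can be pictured as slicing the waveform into overlapping windows of at most 30 seconds and transcribing each window in turn. In the transformers ASR pipeline this behaviour is switched on with the `chunk_length_s` argument. The helper below is only a simplified illustration of the windowing idea, not the pipeline's actual implementation; the name `chunk_bounds`, the `overlap_s` parameter, and the 5-second overlap are assumptions made for this sketch.

```python
def chunk_bounds(n_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Illustrative only: consecutive windows share `overlap_s` seconds so that
    words cut at a boundary appear whole in at least one window.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")

    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:  # last window reached the end of the audio
            break
        start += step
    return bounds


# A 70-second clip sampled at 16 kHz (Whisper's input rate).
windows = chunk_bounds(70 * 16_000)
for start, end in windows:
    print(f"{start / 16_000:.0f}s to {end / 16_000:.0f}s")
```

Audio shorter than one window yields a single (start, end) pair, so the same code path covers short and long files.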

How to Get Started with the Model

Use the code below to transcribe audio in either Hindi or English:

from transformers import pipeline
import torch

pipe = pipeline(
    "automatic-speech-recognition",
    model="specialv/whisper-small-merged-hi-en",
    device="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)

# For Hindi Transcription
result_hi = pipe("audio_hi.wav", generate_kwargs={"language": "hindi"})
print(result_hi["text"])

# For English Transcription
result_en = pipe("audio_en.wav", generate_kwargs={"language": "english"})
print(result_en["text"])

Training Details

Training Data

The task vector was derived from fine-tuning on the Google FLEURS Hindi (hi_in) dataset.

Merging Procedure

The model was created using Task Arithmetic. This involves calculating the "Hindi task vector" (the difference between the fine-tuned weights and the base weights) and adding it back to the original English-capable base model.

Scaling factor: $\lambda = 0.6$, so $W_{new} = W_{base} + 0.6 \times (W_{fine} - W_{base})$, where $W_{base}$ is openai/whisper-small and $W_{fine}$ is specialv/whisper-small-hi-fleurs.
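
The merge formula can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script used to build this checkpoint: the function name `task_arithmetic_merge` is hypothetical, and the weights are assumed to be dictionaries with identical parameter names whose values support `+`, `-`, and scalar `*` (plain floats here; framework tensors behave the same way).

```python
def task_arithmetic_merge(base, fine, lam=0.6):
    """Return W_base + lam * (W_fine - W_base) for every parameter name."""
    if base.keys() != fine.keys():
        raise KeyError("base and fine-tuned weights must share parameter names")
    merged = {}
    for name in base:
        task_vector = fine[name] - base[name]  # what fine-tuning changed
        merged[name] = base[name] + lam * task_vector
    return merged


# Toy weights standing in for the two checkpoints.
base_weights = {"layer.weight": 1.0, "layer.bias": 0.0}
fine_weights = {"layer.weight": 2.0, "layer.bias": -1.0}

merged = task_arithmetic_merge(base_weights, fine_weights, lam=0.6)
print(merged)
```

Setting `lam=0` returns the base weights unchanged and `lam=1` returns the fine-tuned weights, so the scaling factor interpolates between English retention and Hindi specialization.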

Evaluation

Quantitative Comparison

Model	Hindi WER (Word Error Rate) ↓	Relative Improvement
OpenAI Whisper Small (Base)	80.15%	Baseline
Bilingual Merged Model (specialv)	24.30%	+69.68%
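
For readers who want to check the numbers, the sketch below shows one common way to compute WER (word-level edit distance divided by the reference length) and the relative reduction reported in the table. The card does not state which text normalization or tooling produced the reported scores, so this is an illustration rather than the evaluation script; both function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percentage by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Reproduces the "Relative Improvement" column from the two WER values above.
print(round(relative_wer_reduction(80.15, 24.30), 2))  # 69.68
```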

Qualitative Comparison (Sample Transcription)

Source	Text
Ground Truth	होस्टल खास तौर पर युवा लोगों के लिए होते हैं इनमें ज़्यादातर बीस साल की उम्र के लोग रुकते हैं हालांकि आपको अक्सर यहां बड़ी उम्र के यात्री भी मिल सकते हैं
Our Merged Model	होस्टल खास तौर पर युवा लोगों के लिए होते हैं इन में ज़ादातर बिस साल की उम्र के लोग रुकते हैं हालांकि आपको अक्सर या बड़ी उम्र के यात्री भी में ल सकते हैं (High Accuracy)
Base Model	अस्टल खाश्टर पर यूगा लोग के लिए होते हैं इन में यादा तर भिस साल की उमर के लोग रुकते हैं हालांकि आपको अक्सर या बडी उमर के याद्टे भी में रिए रहाते हैं (Poor Accuracy)

Language	Accuracy	Note
Hindi	High	Retains ~90% of fine-tuned accuracy
English	High	Significantly improved over the pure Hindi model
Hinglish	Improved	Better at recognizing English loanwords in Hindi

Environmental Impact

Hardware Type: Tesla T4 GPU (Google Colab)
Hours used for fine-tuning: ~1.5 hours
Merging Time: < 5 minutes
