---
language:
- ja
tags:
- automatic-speech-recognition
- whisper
- japanese
- distillation
- common-voice
- asr
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-ja-distill
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 (ja)
      type: mozilla-foundation/common_voice_17_0
      args: ja
    metrics:
    - name: cer
      type: cer
      value: 0.25858
library_name: transformers
metrics:
- cer
---

# faster-whisper-ja-distill
This model was distilled by feeding the output from inferencing the teacher model (Whisper) with Japanese datasets into the student model.
Model details are described below.

**Teacher**: [deepdml/faster-whisper-large-v3-turbo-ct2 (CT2)](https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2)

**Student**: [openai/whisper-small](https://huggingface.co/openai/whisper-small)

※ Distilled whisper-base model of deepdml/faster-whisper-large-v3-turbo-ct2 is comming soon. It's not available now. 

## Training
- Data: Common Voice 17.0 (ja), CC-BY-4.0
- Distillation: hard
- Precision: FP32, gradient checkpointing 
- hardware:
  - 24GB VRAM (RTX A5000)
  - AMD EPYC 7H12
  - 54GB DRAM

The Others draining detail is published in [wandb project](https://wandb.ai/zary-zawatti/huggingface/runs/b14g7vo4?nw=nwuserzary)

## Evaluation

**Conditions**

* **Dataset**: Common Voice 17.0 (ja) validation
* **Sample size**: **N=1000**
* **Audio**: 16 kHz / mono
* **Decoding**: `num_beams=2`, `max_length=225`
* **Language/Task**: `language="ja"`, `task="transcribe"`
* **Chunking**: `chunk_length_s=30`, `stride_length_s=(5,5)`（Transformers ASR pipeline）

**Metrics**
**W\&B**: [Run / Comparison](https://wandb.ai/zary-zawatti/whisper-compare/runs/4ikrxg1i?nw=nwuserzary)

## Usage
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # tokenizer+feature_extractor
model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill")

#model.eval()

forced_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

model.generation_config.forced_decoder_ids = forced_ids
model.config.forced_decoder_ids = forced_ids

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,            
    stride_length_s=(5, 5),        
    return_timestamps=False,     
    device=0 if torch.cuda.is_available() else -1,
    generate_kwargs={
        "max_length": 225,
        "num_beams": 1,
        "forced_decoder_ids": forced_ids,    
    },
)

result = asr("sample.mp3")
print(result["text"])
```

## Convert CT2
It's [here](https://huggingface.co/zary0/faster-whisper-ja-distill-ct2)

## License / Attribution
Model: Apache-2.0

Data: Common Voice 17.0 (ja) © Mozilla, CC-BY-4.0. Provide attribution.