--- language: - ja tags: - automatic-speech-recognition - whisper - japanese - distillation - common-voice - asr license: apache-2.0 datasets: - mozilla-foundation/common_voice_17_0 pipeline_tag: automatic-speech-recognition model-index: - name: whisper-small-ja-distill results: - task: type: automatic-speech-recognition dataset: name: Common Voice 17.0 (ja) type: mozilla-foundation/common_voice_17_0 args: ja metrics: - name: cer type: cer value: 0.25858 library_name: transformers metrics: - cer --- # faster-whisper-ja-distill This model was distilled by feeding the output from inferencing the teacher model (Whisper) with Japanese datasets into the student model. Model details are described below. **Teacher**: [deepdml/faster-whisper-large-v3-turbo-ct2 (CT2)](https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2) **Student**: [openai/whisper-small](https://huggingface.co/openai/whisper-small) ※ Distilled whisper-base model of deepdml/faster-whisper-large-v3-turbo-ct2 is comming soon. It's not available now. ## Training - Data: Common Voice 17.0 (ja), CC-BY-4.0 - Distillation: hard - Precision: FP32, gradient checkpointing - hardware: - 24GB VRAM (RTX A5000) - AMD EPYC 7H12 - 54GB DRAM The Others draining detail is published in [wandb project](https://wandb.ai/zary-zawatti/huggingface/runs/b14g7vo4?nw=nwuserzary) ## Evaluation **Conditions** * **Dataset**: Common Voice 17.0 (ja) validation * **Sample size**: **N=1000** * **Audio**: 16 kHz / mono * **Decoding**: `num_beams=2`, `max_length=225` * **Language/Task**: `language="ja"`, `task="transcribe"` * **Chunking**: `chunk_length_s=30`, `stride_length_s=(5,5)`(Transformers ASR pipeline) **Metrics** **W\&B**: [Run / Comparison](https://wandb.ai/zary-zawatti/whisper-compare/runs/4ikrxg1i?nw=nwuserzary) ## Usage ```python import torch import torchaudio from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline processor = WhisperProcessor.from_pretrained("openai/whisper-small") # tokenizer+feature_extractor model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill") #model.eval() forced_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe") model.generation_config.forced_decoder_ids = forced_ids model.config.forced_decoder_ids = forced_ids asr = pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, chunk_length_s=30, stride_length_s=(5, 5), return_timestamps=False, device=0 if torch.cuda.is_available() else -1, generate_kwargs={ "max_length": 225, "num_beams": 1, "forced_decoder_ids": forced_ids, }, ) result = asr("sample.mp3") print(result["text"]) ``` ## Convert CT2 It's [here](https://huggingface.co/zary0/faster-whisper-ja-distill-ct2) ## License / Attribution Model: Apache-2.0 Data: Common Voice 17.0 (ja) © Mozilla, CC-BY-4.0. Provide attribution.