Hungarian SpeechT5 ASR Model
Model Description
This model is an experimental Hungarian automatic speech recognition (ASR) model based on SpeechT5, a unified encoder–decoder architecture for speech and text processing.
The base SpeechT5-ASR model was adapted to Hungarian using a multi-phase fine-tuning strategy, combining selective freezing and unfreezing of model components to ensure training stability, efficient convergence, and improved Word Error Rate (WER).
The model transcribes raw speech audio into Hungarian text.
The model was trained on the Common Voice Scripted Speech 24.0 – Hungarian dataset.
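A minimal inference sketch, assuming the standard Hugging Face SpeechT5 API (the repository name is taken from this card; the generation settings are illustrative defaults, not values confirmed by the authors):

```python
def transcribe(waveform, sampling_rate=16000):
    """Transcribe a 1-D float waveform to Hungarian text.

    Heavy imports and model loading happen lazily inside the function;
    the checkpoint name follows this model card, the max_length value
    is an illustrative default.
    """
    from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

    repo = "GaborMadarasz/SpeechT5-asr-hungarian_V1"
    processor = SpeechT5Processor.from_pretrained(repo)
    model = SpeechT5ForSpeechToText.from_pretrained(repo)

    # Convert raw audio to model inputs (16 kHz mono is assumed).
    inputs = processor(audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=450)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

For file-based input, the waveform can first be loaded with any audio library that yields a 16 kHz mono float array.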
Base Architecture
The model is based on SpeechT5-ASR, which consists of:
A Transformer-based speech encoder for acoustic feature modeling
A Transformer-based text decoder for autoregressive transcription
Cross-attention between encoder and decoder
Shared embedding space for text representations
The overall architecture (number of layers, attention heads, hidden sizes) was preserved from the original SpeechT5 model.
Architectural Adaptations
To enable effective Hungarian ASR, the following adaptations were applied:
Hungarian-specific tokenizer and vocabulary
Resizing of decoder embeddings and output projection layers
Partial re-initialization of text decoder heads when necessary
Training hyperparameters adjusted for longer utterances and higher morphological variability
No structural changes were made to the attention mechanisms or Transformer blocks.
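The embedding-resize step above can be sketched in plain PyTorch; the vocabulary and hidden sizes here are illustrative stand-ins, not the model's actual dimensions:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real Hungarian vocabulary size is not stated here.
old_vocab, new_vocab, hidden = 80, 120, 16

old_emb = nn.Embedding(old_vocab, hidden)   # pretrained embedding table
new_emb = nn.Embedding(new_vocab, hidden)   # resized table for Hungarian vocab

# Copy the pretrained rows; the extra rows keep their fresh random init,
# mirroring the partial re-initialization of new Hungarian tokens.
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight
```

The same pattern applies to the tied output projection (LM head) when the vocabulary grows.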
Freeze / Unfreeze Strategy
A three-phase training strategy with progressively increasing model flexibility was applied. Each phase had a distinct optimization goal and carefully controlled parameter updates.
Phase 1 – Decoder & Language Adaptation
Objective: adapt the text generation components to Hungarian while keeping acoustic representations stable.
Speech encoder: fully frozen
Text decoder: unfrozen
Token embeddings: unfrozen
LM head: unfrozen
In this phase, the acoustic encoder was kept fixed to preserve robust pretrained speech representations. Training focused on adapting the text decoder, embeddings, and output projection to Hungarian linguistic structure and vocabulary.
This step establishes a stable Hungarian language generation capability before any acoustic re-adaptation is attempted.
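The Phase 1 freeze pattern can be expressed as a few lines over `named_parameters()`; the toy module below only mimics the described component layout (the real SpeechT5 module names may differ):

```python
import torch.nn as nn

# Toy stand-in for the described components: acoustic encoder, text
# decoder, token embeddings, and LM head.
class ToyASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)           # speech encoder (frozen)
        self.decoder = nn.Linear(8, 8)           # text decoder (trainable)
        self.embed_tokens = nn.Embedding(50, 8)  # embeddings (trainable)
        self.lm_head = nn.Linear(8, 50)          # LM head (trainable)

model = ToyASR()

# Phase 1: freeze everything under the encoder, train the rest.
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith("encoder")
```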
Phase 2 – Joint ASR Optimization
Objective: optimize end-to-end transcription quality and reduce Word Error Rate (WER).
Speech encoder: partially unfrozen
Text decoder: unfrozen
Cross-attention layers: unfrozen
Token embeddings: unfrozen
LM head: unfrozen
During this phase, upper layers of the speech encoder were unfrozen, while lower layers remained frozen. This allowed the model to jointly optimize acoustic–linguistic alignment without destabilizing low-level speech representations.
At this stage, the model effectively operates in an end-to-end ASR regime, enabling coordinated improvement of both acoustic modeling and text generation.
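The partial encoder unfreezing of Phase 2 amounts to selecting the top layers of the encoder stack; the layer counts below are illustrative, since the card does not state the exact split point:

```python
import torch.nn as nn

# Toy encoder stack; only the top layers become trainable in Phase 2.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
n_unfrozen = 2  # illustrative: unfreeze the top 2 of 6 layers

for i, layer in enumerate(layers):
    trainable = i >= len(layers) - n_unfrozen
    for p in layer.parameters():
        p.requires_grad = trainable
```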
Phase 3 – Fine-Tuning and Generalization
Objective: improve generalization and reduce residual substitution and deletion errors.
Speech encoder: partially trainable
Lower encoder layers: frozen
Upper encoder layers: trainable
Text decoder: fully trainable
LM head: trainable
This phase uses a conservative fine-tuning setup with a reduced learning rate. By freezing lower encoder layers and fine-tuning higher-level acoustic representations together with the decoder, the model refines error-prone regions while minimizing catastrophic forgetting.
This setup is particularly effective for reducing systematic WER errors that persist after full end-to-end training.
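In Phase 3, only the still-trainable parameters are handed to the optimizer at a reduced learning rate; the sketch below uses an illustrative value (1e-5), which is not stated in the card:

```python
import torch
import torch.nn as nn

# Toy model: the first block stands in for frozen lower encoder layers.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
for p in model[0].parameters():
    p.requires_grad = False

# Conservative fine-tuning: reduced LR, only trainable params included.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
)
```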
Language Adaptation (Hungarian)
Hungarian ASR presents specific challenges:
Agglutinative morphology
High number of inflected word forms
Frequent compound words
Mitigation strategies:
Training exclusively on Hungarian-language speech
Hungarian-specific text normalization
Subword tokenization tuned for Hungarian morphology
WER-focused evaluation with awareness of morphological penalty effects
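A minimal sketch of what Hungarian-specific text normalization might look like; the exact rules used for this model are not published, so the function below is an assumption (lowercasing, punctuation stripping, whitespace collapsing), not the actual pipeline:

```python
import re

def normalize_hu(text: str) -> str:
    """Illustrative normalization: lowercase, strip punctuation, collapse
    whitespace. Accented Hungarian letters (á, é, í, ó, ö, ő, ú, ü, ű)
    are deliberately preserved."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text
```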
Training Procedure
Training was conducted in three phases:
Phase 1: decoder and language adaptation with a frozen speech encoder
Phase 2: joint end-to-end ASR optimization with a partially unfrozen encoder
Phase 3: low-learning-rate fine-tuning with partial freezing
General settings:
Optimizer: AdamW
Learning rate schedule: linear decay
Gradient clipping enabled
Mixed precision training (fp16)
Checkpoint-based continuation between phases
Final checkpoints were selected based on validation WER rather than training loss.
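The general settings above can be sketched as a bare training loop; the learning rate, step count, and decay schedule here are illustrative, and mixed-precision autocasting is omitted for brevity:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
total_steps = 10
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear decay from the initial LR down to zero over total_steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 - step / total_steps
)

for step in range(total_steps):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    # Gradient clipping, as mentioned in the settings.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```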
Evaluation
Primary metric:
Word Error Rate (WER)
Interpretation note:
Due to Hungarian morphology, a single suffix error often counts as a full word error. As a result, relatively high WER values may still correspond to semantically understandable transcriptions.
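The morphological penalty is easy to see with a standard word-level WER computation (edit distance over words, divided by reference length); the helper below is a generic implementation, not the evaluation script used for this model:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a single wrong suffix ("házban" for "házakban") counts as one full word error out of three, i.e. a WER of about 33% for a transcript that remains understandable.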
Intended Use
Suitable for:
Hungarian speech transcription
Research on low-resource and morphologically rich languages
Further fine-tuning and domain adaptation
Offline batch ASR pipelines
Not intended for:
Medical or legal transcription without human verification
Streaming or real-time applications without additional optimization
Limitations
Reduced robustness to strong dialects
Sensitivity to out-of-domain noise
No explicit punctuation or capitalization modeling
No integrated language model rescoring
Ethical Considerations
Trained on legally obtained datasets
No intentional memorization of personal data
Users are responsible for ensuring lawful use
Citation

@misc{hungarian-speecht5-asr,
  title={Hungarian ASR Fine-Tuning of SpeechT5 with Progressive Freezing},
  author={Gábor Madarász},
  year={2025},
  note={Hugging Face model repository}
}
Model: GaborMadarasz/SpeechT5-asr-hungarian_V1
Base model: microsoft/speecht5_asr