Hungarian SpeechT5 ASR Model
Model Description
This model is an experimental Hungarian automatic speech recognition (ASR) model based on SpeechT5, a unified encoder–decoder architecture for speech and text processing.
The base SpeechT5-ASR model was adapted to Hungarian using a multi-phase fine-tuning strategy, combining selective freezing and unfreezing of model components to ensure training stability, efficient convergence, and improved Word Error Rate (WER).
The model transcribes raw speech audio into Hungarian text.
The model was trained on the Common Voice Scripted Speech 24.0 – Hungarian dataset.
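A minimal inference sketch, assuming the standard Hugging Face SpeechT5 API (the repository name is taken from this card; the generation settings are illustrative defaults, not values confirmed by the authors):

```python
def transcribe(waveform, sampling_rate=16000):
    """Transcribe a 1-D float waveform to Hungarian text.

    Heavy imports and model loading happen lazily inside the function;
    the checkpoint name follows this model card, the max_length value
    is an illustrative default.
    """
    from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

    repo = "GaborMadarasz/SpeechT5-asr-hungarian_V1"
    processor = SpeechT5Processor.from_pretrained(repo)
    model = SpeechT5ForSpeechToText.from_pretrained(repo)

    # Convert raw audio to model inputs (16 kHz mono is assumed).
    inputs = processor(audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=450)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

For file-based input, the waveform can first be loaded with any audio library that yields a 16 kHz mono float array.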
Base Architecture
The model is based on SpeechT5-ASR, which consists of:
A Transformer-based speech encoder for acoustic feature modeling
A Transformer-based text decoder for autoregressive transcription
Cross-attention between encoder and decoder
Shared embedding space for text representations
The overall architecture (number of layers, attention heads, hidden sizes) was preserved from the original SpeechT5 model.
Architectural Adaptations
To enable effective Hungarian ASR, the following adaptations were applied:
Hungarian-specific tokenizer and vocabulary
Resizing of decoder embeddings and output projection layers
Partial re-initialization of text decoder heads when necessary
Training hyperparameters adjusted for longer utterances and higher morphological variability
No structural changes were made to the attention mechanisms or Transformer blocks.
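The embedding-resize step above can be sketched in plain PyTorch; the vocabulary and hidden sizes here are illustrative stand-ins, not the model's actual dimensions:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real Hungarian vocabulary size is not stated here.
old_vocab, new_vocab, hidden = 80, 120, 16

old_emb = nn.Embedding(old_vocab, hidden)   # pretrained embedding table
new_emb = nn.Embedding(new_vocab, hidden)   # resized table for Hungarian vocab

# Copy the pretrained rows; the extra rows keep their fresh random init,
# mirroring the partial re-initialization of new Hungarian tokens.
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight
```

The same pattern applies to the tied output projection (LM head) when the vocabulary grows.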
Freeze / Unfreeze Strategy
A three-phase training strategy with progressively increasing model flexibility was applied. Each phase had a distinct optimization goal and carefully controlled parameter updates.
Phase 1 – Decoder & Language Adaptation
Objective: adapt the text generation components to Hungarian while keeping acoustic representations stable.
Speech encoder: fully frozen
Text decoder: unfrozen
Token embeddings: unfrozen
LM head: unfrozen
In this phase, the acoustic encoder was kept fixed to preserve robust pretrained speech representations. Training focused on adapting the text decoder, embeddings, and output projection to Hungarian linguistic structure and vocabulary.
This step establishes a stable Hungarian language generation capability before any acoustic re-adaptation is attempted.
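The Phase 1 freeze pattern can be expressed as a few lines over `named_parameters()`; the toy module below only mimics the described component layout (the real SpeechT5 module names may differ):

```python
import torch.nn as nn

# Toy stand-in for the described components: acoustic encoder, text
# decoder, token embeddings, and LM head.
class ToyASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)           # speech encoder (frozen)
        self.decoder = nn.Linear(8, 8)           # text decoder (trainable)
        self.embed_tokens = nn.Embedding(50, 8)  # embeddings (trainable)
        self.lm_head = nn.Linear(8, 50)          # LM head (trainable)

model = ToyASR()

# Phase 1: freeze everything under the encoder, train the rest.
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith("encoder")
```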
Phase 2 – Joint ASR Optimization
Objective: optimize end-to-end transcription quality and reduce Word Error Rate (WER).
Speech encoder: partially unfrozen
Text decoder: unfrozen
Cross-attention layers: unfrozen
Token embeddings: unfrozen
LM head: unfrozen
During this phase, upper layers of the speech encoder were unfrozen, while lower layers remained frozen. This allowed the model to jointly optimize acoustic–linguistic alignment without destabilizing low-level speech representations.
At this stage, the model effectively operates in an end-to-end ASR regime, enabling coordinated improvement of both acoustic modeling and text generation.
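The partial encoder unfreezing of Phase 2 amounts to selecting the top layers of the encoder stack; the layer counts below are illustrative, since the card does not state the exact split point:

```python
import torch.nn as nn

# Toy encoder stack; only the top layers become trainable in Phase 2.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
n_unfrozen = 2  # illustrative: unfreeze the top 2 of 6 layers

for i, layer in enumerate(layers):
    trainable = i >= len(layers) - n_unfrozen
    for p in layer.parameters():
        p.requires_grad = trainable
```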
Phase 3 – Fine-Tuning and Generalization
Objective: improve generalization and reduce residual substitution and deletion errors.
Speech encoder: partially trainable
Lower encoder layers: frozen
Upper encoder layers: trainable
Text decoder: fully trainable
LM head: trainable
This phase uses a conservative fine-tuning setup with a reduced learning rate. By freezing lower encoder layers and fine-tuning higher-level acoustic representations together with the decoder, the model refines error-prone regions while minimizing catastrophic forgetting.
This setup is particularly effective for reducing systematic WER errors that persist after full end-to-end training.
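In Phase 3, only the still-trainable parameters are handed to the optimizer at a reduced learning rate; the sketch below uses an illustrative value (1e-5), which is not stated in the card:

```python
import torch
import torch.nn as nn

# Toy model: the first block stands in for frozen lower encoder layers.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
for p in model[0].parameters():
    p.requires_grad = False

# Conservative fine-tuning: reduced LR, only trainable params included.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
)
```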
Language Adaptation (Hungarian)
Hungarian ASR presents specific challenges:
Agglutinative morphology
High number of inflected word forms
Frequent compound words
Mitigation strategies:
Training exclusively on Hungarian-language speech
Hungarian-specific text normalization
Subword tokenization tuned for Hungarian morphology
WER-focused evaluation with awareness of morphological penalty effects
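A minimal sketch of what Hungarian-specific text normalization might look like; the exact rules used for this model are not published, so the function below is an assumption (lowercasing, punctuation stripping, whitespace collapsing), not the actual pipeline:

```python
import re

def normalize_hu(text: str) -> str:
    """Illustrative normalization: lowercase, strip punctuation, collapse
    whitespace. Accented Hungarian letters (á, é, í, ó, ö, ő, ú, ü, ű)
    are deliberately preserved."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text
```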
Training Procedure
Training was conducted in three phases:
Phase 1: decoder and language adaptation with a frozen speech encoder
Phase 2: joint end-to-end ASR optimization with a partially unfrozen encoder
Phase 3: low-learning-rate fine-tuning with partial freezing
General settings:
Optimizer: AdamW
Learning rate schedule: linear decay
Gradient clipping enabled
Mixed precision training (fp16)
Checkpoint-based continuation between phases
Final checkpoints were selected based on validation WER rather than training loss.
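The general settings above can be sketched as a bare training loop; the learning rate, step count, and decay schedule here are illustrative, and mixed-precision autocasting is omitted for brevity:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
total_steps = 10
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear decay from the initial LR down to zero over total_steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 - step / total_steps
)

for step in range(total_steps):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    # Gradient clipping, as mentioned in the settings.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```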
Evaluation
Primary metric:
Word Error Rate (WER)
Interpretation note:
Due to Hungarian morphology, a single suffix error often counts as a full word error. As a result, relatively high WER values may still correspond to semantically understandable transcriptions.
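The morphological penalty is easy to see with a standard word-level WER computation (edit distance over words, divided by reference length); the helper below is a generic implementation, not the evaluation script used for this model:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a single wrong suffix ("házban" for "házakban") counts as one full word error out of three, i.e. a WER of about 33% for a transcript that remains understandable.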
Intended Use
Suitable for:
Hungarian speech transcription
Research on low-resource and morphologically rich languages
Further fine-tuning and domain adaptation
Offline batch ASR pipelines
Not intended for:
Medical or legal transcription without human verification
Streaming or real-time applications without additional optimization
Limitations
Reduced robustness to strong dialects
Sensitivity to out-of-domain noise
No explicit punctuation or capitalization modeling
No integrated language model rescoring
Ethical Considerations
Trained on legally obtained datasets
No intentional memorization of personal data
Users are responsible for ensuring lawful use
Citation

@misc{hungarian-speecht5-asr,
  title={Hungarian ASR Fine-Tuning of SpeechT5 with Progressive Freezing},
  author={Gábor Madarász},
  year={2025},
  note={Hugging Face model repository}
}
Model: GaborMadarasz/SpeechT5-asr-hungarian_V1
Base model: microsoft/speecht5_asr