NAMAA-Space/NAMAA-Saudi-TTS-V2


A voice-cloning text-to-speech model for Najdi Arabic (Saudi Arabic), fully fine-tuned from SWivid/Habibi-TTS's Saudi-specialized (SAU) checkpoint.

Given 5–8 seconds of a reference voice speaking Arabic, this model generates arbitrary new Arabic text in that same voice, with Najdi-dialect prosody and pronunciation.

Model Details

Architecture

F5-TTS Diffusion Transformer (DiT), v1 Base configuration:

| Parameter | Value |
|---|---|
| Model family | F5-TTS (flow-matching text-to-speech) |
| Backbone | DiT (Diffusion Transformer) |
| Embedding dim | 1024 |
| Depth (layers) | 22 |
| Attention heads | 16 |
| FFN multiplier | 2 |
| Text embedding dim | 512 |
| Conv text encoder layers | 4 |
| Total parameters | ~335M |
| Tokenizer | Character-level (vocab size 2704) |
| Vocoder | charactr/vocos-mel-24khz |
| Mel features | 100 mels, 24 kHz, hop 256, win 1024 |

See config.json for the full architecture spec.

Base model

This is a full fine-tune (not LoRA) warm-started from SWivid/Habibi-TTS/Specialized/SAU/model_200000.safetensors. That base checkpoint was already trained on Arabic speech for 200,000 updates by SWivid; this fine-tune runs an additional ~2,180 updates on Najdi-specific data on top of those weights. All ~335M parameters are trainable: no frozen layers, no adapters.

Why full fine-tuning, not LoRA?

LoRA with low rank (r=8-16) on messy conditional generation tasks ends up averaging noise and signal together in its low-rank update, producing artifacts that weren't in either the base model or the training data. Full fine-tuning has the capacity to properly partition the data distribution, at the cost of not being able to toggle the adaptation on/off at inference.
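As a back-of-envelope illustration of the capacity gap (using dim=1024 from the architecture table; the rank value is illustrative, not a claim about what was tried):

```python
# Trainable parameters: LoRA update vs. full fine-tuning, for a single
# dim x dim attention projection (dim = 1024, from the architecture table).
dim = 1024

def lora_params(d_in, d_out, r):
    # LoRA factorizes the weight update as B @ A, with A: (r, d_in), B: (d_out, r)
    return r * (d_in + d_out)

full = dim * dim                    # every weight in the projection is trainable
low_rank = lora_params(dim, dim, r=8)

print(full, low_rank, full // low_rank)   # 1048576 16384 64
```

A rank-8 adapter on this layer trains ~1.6% of the weights a full fine-tune does, which is the capacity argument above in numbers.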

Why flow matching / DiT for TTS?

F5-TTS uses flow matching (a diffusion-family method) rather than autoregressive generation. Instead of generating audio token by token, it denoises a full mel-spectrogram in 32 ODE steps. The DiT transformer conditions on both the text and the masked reference audio; it is the same architecture family used for text-to-image diffusion, adapted for audio. See the original F5-TTS paper.
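The denoising loop can be sketched as a toy Euler integration of the flow ODE. The one-dimensional velocity field below is a stand-in for the trained ~335M-parameter DiT, purely to show the step structure:

```python
def velocity(x, t, target):
    # Stand-in for the trained DiT velocity field: flows x toward the target.
    # The real model predicts this from noisy mels + text + reference audio.
    return target - x

def sample(target, nfe_step=32):
    x = 5.0                       # pretend Gaussian-noise starting point
    dt = 1.0 / nfe_step           # uniform time grid on [0, 1]
    for i in range(nfe_step):
        x = x + dt * velocity(x, i * dt, target)   # one Euler ODE step
    return x

print(round(sample(0.0), 3))      # close to the target after 32 steps
```

More steps (the `nfe_step` knob in the usage section) means a finer integration of the same ODE, which is why quality saturates rather than growing without bound.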

Training

Data

Combined ~18.4 hours of Najdi and Saudi Arabic speech from five HuggingFace datasets, filtered to clips of 3–10 seconds in duration:

  • Total training clips: ~13,614
  • Total audio duration: ~18.4 hours
  • Sample rate: resampled to 24 kHz mono
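The 3–10 second duration filter can be sketched as follows (the clip list is illustrative dummy data, not the actual source datasets):

```python
def keep_clip(num_samples, sr):
    """Data-prep duration filter: keep clips between 3 and 10 seconds."""
    duration = num_samples / sr
    return 3.0 <= duration <= 10.0

# Illustrative clips at various source sample rates
# (everything kept is then resampled to 24 kHz mono).
clips = [
    (44_100 * 2, 44_100),   # 2 s  -> too short
    (16_000 * 5, 16_000),   # 5 s  -> kept
    (24_000 * 12, 24_000),  # 12 s -> too long
]
print([keep_clip(n, sr) for n, sr in clips])   # [False, True, False]
```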

Hyperparameters

| Parameter | Value |
|---|---|
| Mixed precision | bf16 |
| Batch type | frame-packed |
| Frames per GPU | 38,400 |
| Max samples per batch | 64 |
| Gradient accumulation steps | 2 |
| Learning rate | 5e-5 |
| Epochs | 20 |
| Warmup updates | 300 |
| Max grad norm | 1.0 |
| Save interval | every 500 updates |
| Total updates | ~2,180 |
| Hardware | NVIDIA A100 80GB |

See training_config.json for the exact config used.
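Using the mel settings from the architecture section (24 kHz, hop 256), the frame-packed batch size translates into seconds of audio per optimizer update. This is rough arithmetic, not a figure from the training logs:

```python
# Frame-packed batching: how much audio one optimizer update sees.
SR, HOP = 24_000, 256                          # mel settings from the architecture
FRAMES_PER_GPU, GRAD_ACCUM = 38_400, 2         # from the hyperparameter table

frames_per_sec = SR / HOP                      # 93.75 mel frames per second
secs_per_batch = FRAMES_PER_GPU / frames_per_sec
secs_per_update = secs_per_batch * GRAD_ACCUM  # gradient accumulation doubles it

print(frames_per_sec, secs_per_batch, secs_per_update)  # 93.75 409.6 819.2
```

So each update covers roughly 13–14 minutes of packed audio, which is why only ~2,180 updates are needed to cover 20 epochs of an 18.4-hour corpus.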

Usage

Quick start

from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_model, load_vocoder, preprocess_ref_audio_text, infer_process
import torch, soundfile as sf

# Download files
ckpt_path  = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="model_last.pt")
vocab_path = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="vocab.txt")

# Load model
V1_BASE_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
model = load_model(DiT, V1_BASE_CFG, ckpt_path, vocab_file=vocab_path, device="cuda")
model = model.to(torch.float32).eval()
vocoder = load_vocoder()

# Voice clone
REF_AUDIO = "path/to/najdi_reference_clip.wav"   # 5-8s of clean Najdi speech
REF_TEXT  = "exact transcript of that clip"
ref_audio, ref_text = preprocess_ref_audio_text(REF_AUDIO, REF_TEXT)

# Generate
wave, sr, _ = infer_process(
    ref_audio, ref_text,
    "مرحبا، كيف حالك اليوم؟",    # text to speak ("Hello, how are you today?")
    model, vocoder,
    nfe_step=32, speed=1.0,
)
sf.write("output.wav", wave, sr)

Or use the inference.py script included in this repo.

Reference clip guidelines

Quality of the generated output is dominated by the quality of the reference clip.

  • Duration: 5–8 seconds. Less than 3s gives too little prosody; more than 10s costs VRAM with no quality gain.
  • Clean audio: no background music, no overlapping speakers, no heavy room reverb. Clipped or noisy references produce clipped or noisy outputs.
  • Single speaker, full phrases: the reference should contain one person speaking complete sentences, not isolated words.
  • Accurate transcript: REF_TEXT must be what's actually said in REF_AUDIO. Even small drift here degrades the output.
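A quick pre-flight check for a candidate reference clip, following the guidelines above (a hedged sketch, not part of the model's API; thresholds taken from the bullets, operating on a decoded audio array):

```python
import numpy as np

def check_reference(audio, sr):
    """Rough sanity checks on a decoded reference clip (frames-first array)."""
    problems = []
    duration = audio.shape[0] / sr
    if not 3.0 <= duration <= 10.0:
        problems.append(f"duration {duration:.1f}s outside the 3-10s window")
    if audio.ndim > 1:
        problems.append("not mono (multiple channels)")
    if np.abs(audio).max() >= 0.999:
        problems.append("clipping detected")
    return problems

# A clean 6-second mono clip at 24 kHz passes all checks
clean = 0.5 * np.sin(np.linspace(0, 100, 24_000 * 6))
print(check_reference(clean, 24_000))   # -> []
```

Anything this flags (too short, stereo, clipped) is worth fixing before blaming the model for poor output.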

Generation parameters

  • nfe_step=32: diffusion ODE steps. 32 is the quality/speed sweet spot; raise to 64 for a marginal quality gain (2x slower), or lower to 16 for faster iteration.
  • speed=1.0: speech speed multiplier. 0.8 for slower, 1.2 for faster.

Intended Use

  • Text-to-speech for Najdi Arabic content creation where you have a clean reference clip of the target voice.
  • Research on Arabic dialect TTS.

Out-of-Scope / Limitations

  • Not a multi-dialect model. Fine-tuned on Najdi; will produce Najdi-flavored output even for other Arabic dialect inputs. Use vanilla Habibi-TTS for other dialects.
  • Voice clone requires reference audio. This is not a "read arbitrary text in a fixed voice" model. Without a reference clip, there is no output.
  • Training data includes podcast audio. Some inherited background characteristics (room tone, occasional distant music) may appear in outputs, especially when the reference clip is itself podcast-sourced.
  • Non-commercial. License is CC-BY-NC-SA-4.0 (inherited from the base Habibi-TTS model). You may not use this for commercial purposes without working out separate licensing with SWivid.
  • No safeguards against voice cloning misuse. Do not clone a person's voice without their permission. Do not generate deceptive or impersonating audio.

Evaluation

No formal evaluation metrics are reported. Subjective quality improvement over vanilla Habibi-TTS SAU was observed for Najdi-accented prompts. A/B testing against the base model is recommended for your specific use case.

Files

| File | Description |
|---|---|
| model_last.pt | Checkpoint used for inference; also carries the full training state (weights + optimizer) for resuming |
| vocab.txt | Character-level tokenizer vocab (2704 tokens) |
| config.json | Architecture hyperparameters |
| training_config.json | Training hyperparameters used |
| inference.py | Standalone inference script |

Citation

If you use this model, please cite both this fine-tune and the base Habibi-TTS work:

@misc{habibi-tts-najdi-ft,
  author = {namaa community},
  title = {Habibi-TTS Najdi Fine-Tuned},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/NAMAA-Saudi-TTS-V2}},
}

@misc{habibi-tts,
  author = {SWivid},
  title = {Habibi-TTS},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SWivid/Habibi-TTS}},
}

@article{f5tts,
  title = {F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author = {Chen, Yushen and others},
  journal = {arXiv:2410.06885},
  year = {2024},
}

Acknowledgments

  • SWivid for the Habibi-TTS SAU pretrained base
  • F5-TTS authors for the underlying architecture and training framework
  • charactr/vocos-mel-24khz for the neural vocoder