
Shona F5-TTS Voice

Shona F5-TTS Voice is a Shona (sna) text-to-speech model built on SWivid/F5-TTS. This repository replaces an earlier checkpoint with a stronger one fine-tuned on a 150-sample dataset, which performed best in listening evaluations against the other Shona TTS systems tested.

Model Details

Overview

The model is adapted from the F5-TTS base model for Shona speech synthesis in two stages. The released checkpoint first adapts the base model on a broader Shona male speech corpus, then applies a LoRA identity pass on a curated 150-sample single-speaker Shona dataset. This package includes the validated final checkpoint, the tokenizer vocabulary, training metadata, and held-out evaluation samples for research and for testing downstream voice applications.

Training Pipeline

  1. Phase 1: full adaptation of the F5-TTS base model on the broader Shona dataset.
  2. Phase 2: LoRA identity adaptation on the 150-sample single-speaker dataset to improve speaker similarity, stability, and robustness.
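
To illustrate why a LoRA pass suits a dataset of only 150 samples, the sketch below compares how many weights a single linear layer exposes to training under full fine-tuning and under LoRA, and shows how a low-rank update is merged back into the base weight. This card does not state the LoRA rank, scaling factor, or target layers, so the layer size, `rank`, and `alpha` values here are purely illustrative assumptions, not the settings used for this checkpoint.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Weights updated when a d_out x d_in linear layer is fully fine-tuned."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Weights trained by LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_merge(W, A, B, alpha: float, rank: int):
    """Return W + (alpha / rank) * (B @ A) using plain nested lists."""
    scale = alpha / rank
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]


# Hypothetical layer size and rank, chosen only for illustration.
print(full_param_count(1024, 1024))      # weights touched by full fine-tuning
print(lora_param_count(1024, 1024, 16))  # weights trained by a rank-16 LoRA
```

With these assumed numbers, the LoRA pass trains a small fraction of the layer's weights, which is one reason it is a common choice for small single-speaker datasets.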

Files

  • model.pt: full compatibility checkpoint exported from the validated run
  • model.safetensors: inference-oriented weight export
  • vocab.txt: tokenizer vocabulary used for training and inference
  • research/train_config.yaml: generated training configuration for this run
  • research/summary.json: run summary and artifact paths
  • research/prep_summary.json: prepared dataset summary
  • samples/: held-out evaluation generations when included (final_eval_ref01_nfe32)
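
A quick way to confirm that a local copy of the repository is complete is to check it against the file list above. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and `samples/` is left out of the check because the list describes it as optional.

```python
from pathlib import Path

# Relative paths taken from the file list in this card.
EXPECTED_FILES = [
    "model.pt",
    "model.safetensors",
    "vocab.txt",
    "research/train_config.yaml",
    "research/summary.json",
    "research/prep_summary.json",
]


def missing_files(repo_dir):
    """Return the expected files that are absent from a local copy of the repo."""
    root = Path(repo_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("path/to/local/copy")` returns an empty list when every listed file is present.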

Intended Use

This model is intended for:

  • Shona TTS research
  • voice agent prototyping
  • single-speaker adaptation experiments
  • comparative benchmarking against Spark-TTS and other Shona TTS systems

It is not positioned as a production-hardened commercial speech API.

Compatibility

This repository does not follow the standard transformers text-to-speech layout. It is intended for the F5-TTS / sna-f5-tts inference stack used in this project.
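
As a rough illustration of what running the checkpoint through the upstream F5-TTS tooling might look like, the sketch below assembles a command line for the `f5-tts_infer-cli` entry point. Everything here is an assumption rather than the project's documented procedure: the flag names are given as recalled from the upstream CLI and should be checked against its `--help` output, the helper `build_infer_command` is hypothetical, and the model architecture setting that matches this checkpoint is not stated in this card, so it is deliberately omitted. The default of 32 NFE steps mirrors the `nfe32` suffix on the evaluation samples.

```python
import shlex


def build_infer_command(ckpt_file, vocab_file, ref_audio, ref_text, gen_text,
                        nfe_step=32, output_dir="outputs"):
    """Assemble an argument list for the upstream F5-TTS inference CLI.

    Flag names are assumptions based on the upstream tool; verify them locally.
    """
    return [
        "f5-tts_infer-cli",
        "--ckpt_file", ckpt_file,
        "--vocab_file", vocab_file,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--nfe_step", str(nfe_step),
        "--output_dir", output_dir,
    ]


cmd = build_infer_command(
    ckpt_file="model.safetensors",
    vocab_file="vocab.txt",
    ref_audio="reference.wav",              # hypothetical reference clip
    ref_text="Mangwanani shamwari yangu.",  # must match the reference clip
    gen_text="Ini ndakasvika zvakanaka chose.",
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` once the flags have been confirmed against the installed version of the tool.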

Inference Notes

  • This checkpoint works best with a short, clean reference clip and accurate reference text.
  • Long-form synthesis is still best handled by chunking.
  • Lowering the number of function evaluations (NFE steps) speeds up inference, at some cost in quality.
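
The chunking advice above can be sketched as a simple sentence packer: split the input at sentence boundaries, then greedily group sentences until a character budget is reached, and synthesize each chunk separately. This is a minimal illustration, not the chunking logic used by this project; the function name `chunk_text` and the `max_chars` budget are assumptions.

```python
import re


def chunk_text(text, max_chars=200):
    """Greedily pack sentences into chunks of roughly max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized with the same reference clip and the resulting audio concatenated in order.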

Samples

The table below points at the uploaded evaluation WAVs. Inline audio players are included with direct links as a fallback.

File | Text | Audio
sample_01.wav | Mangwanani shamwari yangu, ndafara kukuwona nhasi. Ndaida kukubvunza kuti zvinhu zvirisei kubasa uye mhuri yakasimba here, tinogona kusangana manheru here? tigoronga svondo nemafaro patinenge tapedza kunamata. | WAV
sample_02.wav | Nezuro ndakaenda kumusika mangwanani, ndikawana miriwo, madomasi, nehanyanisi, asi mitengo yacho yakanga yakwira zvishoma. Ndakazobika sadza nenyama kumba, vana vakati chikafu chainaka kwazvo. | WAV
sample_03.wav | Kana uchida kuti chirongwa ichi chifambe zvakanaka, tinofanira kutanga taronga kupimana kwebasa, topatsanura nguva yekudzidza, tobva taziva zvinotarisirwa pakupera kwemwedzi. Kana tikashanda pamwe chete, tinokwanisa kusvika pazvinangwa zvedu. | WAV
sample_04.wav | Mumugwagwa mune motokari dzakawanda nhasi, ndozvaiita kuti ndinonoke kusvika, ndaedza kutsvaga imwe nzira iriclear ndokuzosvika. Dzimwe nguva kufamba muguta kunotoda patience nekuti pa peak hour munenge makazara. | WAV
sample_05.wav | Mamukasei, mhuri yakadini, makazofamba mushe here takazorasana paye. Ini ndakasvika zvakanaka chose. Mugokwazisa baba vaLinda ne rimwe team rese. | WAV

Training Provenance

Limitations

  • This is a research checkpoint, and output quality may still vary when the prompt and the reference are mismatched.
  • Code-switching performance can depend heavily on how much multilingual material was present in the fine-tuning data.
  • Live conversational use may still need chunked delivery or optimized runtime serving for best latency.

Citation

If you use this model, please also credit the upstream F5-TTS project (SWivid/F5-TTS).

Safetensors model size: 0.3B params (tensor type: F32)
Model tree for manassehzw/sna-f5-tts: fine-tuned from the base model SWivid/F5-TTS.