Shona F5-TTS Voice
Shona F5-TTS Voice is a Shona (sna) text-to-speech model built on top of SWivid/F5-TTS. This repository replaces an earlier checkpoint with the stronger 150-sample full fine-tune, which performed best in listening evaluations against the other Shona TTS systems tested.
Model Details
- Author: Manasseh Changachirere (Harare Institute of Technology)
- Base model: SWivid/F5-TTS
- Phase 1 training dataset: Shekharmeena/Shona-Male-Audio-Dataset
- Phase 2 training dataset: manassehzw/sna-manasseh-150-raw
- Language: Shona
- Model family: F5-TTS
- Prepared training rows: 150
- Prepared training duration: 0.2483 hours
- Configured epochs: 50
- Learning rate: 1.50e-05
- Run started: 2026-05-07T18:55:28.784481+00:00
- Run finished: 2026-05-07T19:09:15.967322+00:00
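The figures above can be cross-checked with a little arithmetic. The sketch below uses only the values listed in Model Details. The variable names are illustrative, and the card does not say which training phase the timestamps cover, so the result is best read as the wall-clock length of the logged run.

```python
from datetime import datetime

# Values copied from the Model Details list.
RUN_STARTED = "2026-05-07T18:55:28.784481+00:00"
RUN_FINISHED = "2026-05-07T19:09:15.967322+00:00"
PREPARED_HOURS = 0.2483
PREPARED_ROWS = 150

# Wall-clock length of the logged run, in seconds.
run_seconds = (
    datetime.fromisoformat(RUN_FINISHED) - datetime.fromisoformat(RUN_STARTED)
).total_seconds()

# Total prepared audio and the mean clip length, in seconds.
audio_seconds = PREPARED_HOURS * 3600
mean_clip_seconds = audio_seconds / PREPARED_ROWS

print(f"run length:       {run_seconds / 60:.1f} min")
print(f"prepared audio:   {audio_seconds / 60:.1f} min")
print(f"mean clip length: {mean_clip_seconds:.2f} s")
```

The logged run lasted a little under 14 minutes, and the 150 prepared clips average just under 6 seconds each.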
Overview
This model is adapted from the F5-TTS base model in two stages for Shona speech synthesis. The winning release first adapts the base model on a broader Shona male speech corpus, then applies a LoRA identity pass on a curated 150-sample single-speaker Shona dataset. This package includes the validated final checkpoint, tokenizer vocabulary, training metadata, and held-out evaluation samples for research and downstream voice application testing.
Training Pipeline
- Phase 1: full adaptation of the F5-TTS base model on the broader Shona dataset.
- Phase 2: LoRA identity adaptation on the 150-sample single-speaker dataset to improve speaker similarity, stability, and robustness.
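For readers unfamiliar with LoRA, the sketch below shows the general idea behind Phase 2: the base weight matrix stays frozen and a low-rank product is added on top of it. This is the generic LoRA formulation written in plain Python, not the training code of this project. The rank, the scaling factor `alpha`, and the matrix sizes are illustrative assumptions, since the card does not state them.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), the merged weight after a LoRA update.

    w: frozen base weight, shape d x k
    b: trainable factor, shape d x r
    a: trainable factor, shape r x k
    """
    rank = len(a)  # r is the number of rows of A
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, rank):
    """Compare parameter counts: (full fine-tune, LoRA factors)."""
    return d * k, rank * (d + k)


# Toy example with a 2 x 2 weight and rank 1 (hypothetical numbers).
merged = lora_merge(
    w=[[1.0, 0.0], [0.0, 1.0]],
    a=[[3.0, 4.0]],
    b=[[1.0], [2.0]],
    alpha=2.0,
)
full_count, lora_count = trainable_params(d=1024, k=1024, rank=8)
```

Because only the two small factors are trained, an adapter of this kind can be fitted on a small single-speaker set while leaving the Phase 1 weights unchanged.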
Files
- model.pt: full compatibility checkpoint exported from the validated run
- model.safetensors: inference-oriented weight export
- vocab.txt: tokenizer vocabulary used for training and inference
- research/train_config.yaml: generated training configuration for this run
- research/summary.json: run summary and artifact paths
- research/prep_summary.json: prepared dataset summary
- samples/: held-out evaluation generations when included (final_eval_ref01_nfe32)
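One way to fetch the inference files listed above is the `hf_hub_download` function from the `huggingface_hub` package. The sketch below assumes that package is installed and that the repository id is `manassehzw/sna-f5-tts`, as shown on the model page. The helper name `fetch_inference_files` is hypothetical.

```python
REPO_ID = "manassehzw/sna-f5-tts"  # repository id as shown on the model page
INFERENCE_FILES = ["model.safetensors", "vocab.txt"]


def fetch_inference_files(repo_id=REPO_ID, filenames=INFERENCE_FILES):
    """Download each file into the local Hugging Face cache and return its path.

    The import is deferred so this module can be loaded without
    huggingface_hub installed; calling the function requires it.
    """
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_inference_files()` returns a dictionary mapping each filename to its cached local path. Those paths can then be passed to whichever F5-TTS inference entry point is in use.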
Intended Use
This model is intended for:
- Shona TTS research
- voice agent prototyping
- single-speaker adaptation experiments
- comparative benchmarking against Spark-TTS and other Shona TTS systems
It is not positioned as a production-hardened commercial speech API.
Compatibility
This repository does not follow the standard transformers text-to-speech layout. It is intended for the F5-TTS / sna-f5-tts inference stack used in this project.
Inference Notes
- This checkpoint works best with a short, clean reference clip and accurate reference text.
- Long-form synthesis is still best handled by chunking.
- Faster inference is possible by lowering NFE steps, with some quality tradeoff.
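The chunking advice can be sketched as a small sentence-based splitter. This is a minimal illustration, not the chunker used by this project or by F5-TTS. The function name `chunk_text` and the 80-character limit are assumptions, and a sensible limit depends on the reference clip and the runtime.

```python
import re


def chunk_text(text, max_chars=200):
    """Group sentences into chunks that stay within max_chars where possible.

    Sentences are never split, so a single sentence longer than max_chars
    becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Text of sample_05 from the evaluation set.
sample = (
    "Mamukasei, mhuri yakadini, makazofamba mushe here takazorasana paye. "
    "Ini ndakasvika zvakanaka chose. "
    "Mugokwazisa baba vaLinda ne rimwe team rese."
)
chunks = chunk_text(sample, max_chars=80)
for index, chunk in enumerate(chunks, start=1):
    print(index, chunk)
```

Each chunk would be synthesized separately with the same reference clip and the resulting audio concatenated. Breaking at sentence boundaries keeps the joins at natural pauses.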
Samples
The table below points at the uploaded evaluation WAVs. Inline audio players are included with direct links as a fallback.
| File | Text | Audio |
|---|---|---|
| sample_01.wav | Mangwanani shamwari yangu, ndafara kukuwona nhasi. Ndaida kukubvunza kuti zvinhu zvirisei kubasa uye mhuri yakasimba here, tinogona kusangana manheru here? tigoronga svondo nemafaro patinenge tapedza kunamata. | WAV |
| sample_02.wav | Nezuro ndakaenda kumusika mangwanani, ndikawana miriwo, madomasi, nehanyanisi, asi mitengo yacho yakanga yakwira zvishoma. Ndakazobika sadza nenyama kumba, vana vakati chikafu chainaka kwazvo. | WAV |
| sample_03.wav | Kana uchida kuti chirongwa ichi chifambe zvakanaka, tinofanira kutanga taronga kupimana kwebasa, topatsanura nguva yekudzidza, tobva taziva zvinotarisirwa pakupera kwemwedzi. Kana tikashanda pamwe chete, tinokwanisa kusvika pazvinangwa zvedu. | WAV |
| sample_04.wav | Mumugwagwa mune motokari dzakawanda nhasi, ndozvaiita kuti ndinonoke kusvika, ndaedza kutsvaga imwe nzira iriclear ndokuzosvika. Dzimwe nguva kufamba muguta kunotoda patience nekuti pa peak hour munenge makazara. | WAV |
| sample_05.wav | Mamukasei, mhuri yakadini, makazofamba mushe here takazorasana paye. Ini ndakasvika zvakanaka chose. Mugokwazisa baba vaLinda ne rimwe team rese. | WAV |
Training Provenance
- Base model: SWivid/F5-TTS
- Phase 1 dataset: Shekharmeena/Shona-Male-Audio-Dataset
- Phase 2 dataset: manassehzw/sna-manasseh-150-raw
- Checkpoint path used for publication: /root/project/runs/shona_f5_tts_identity/shona_f5_tts_identity-m150-n150-lora-lr2em5-ep50-May-07-2026_06+55PM-2fecae2/final_model/model_last.pt
- Research metadata: available under research/
Limitations
- This is a research checkpoint and may still vary with prompt/reference mismatch.
- Code-switching performance can depend heavily on how much multilingual material was present in the fine-tuning data.
- Live conversational use may still need chunked delivery or optimized runtime serving for best latency.
Citation
If you use this model, please also credit the upstream F5-TTS project (SWivid/F5-TTS).