# Veda TTS - LibriTTS Base v1
A neural Text-to-Speech model using the CGNv2 (Chain-of-thought Generative Network v2) architecture, trained on LibriTTS-R clean-360 (~360 hours of clean audiobook speech).
## Architecture
- Model: CGNv2, an autoregressive language-model-style TTS with prosodic chain-of-thought planning
- Parameters: 206.6M
- Vocoder: SNAC @ 24kHz (3-level hierarchical codec, 4096-entry codebooks, 84 tokens/sec)
- G2P: Flite (phoneme-based input, CMU ARPAbet)
- Framework: PyTorch (custom training via HuggingFace Trainer)
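The 84 tokens/sec figure follows from summing SNAC's per-level frame rates. As a rough sanity check, here is a sketch assuming the commonly cited level rates of ~12, 24, and 48 Hz for the 24 kHz SNAC codec (these exact per-level rates are an assumption, not stated in this card):

```python
# Assumed per-level token rates (Hz) for the 24 kHz SNAC codec,
# ordered coarse -> fine across the 3 hierarchical codebook levels.
level_rates_hz = [12, 24, 48]

# Total tokens emitted per second of audio across all levels.
total_tokens_per_sec = sum(level_rates_hz)
print(total_tokens_per_sec)  # matches the 84 tokens/sec quoted above
```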
## Training
| Setting | Value |
|---|---|
| Dataset | LibriTTS-R clean-360 |
| Speakers | 902 |
| Batch size (effective) | 128 |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.1) |
| Warmup steps | 5000 |
| Precision | BF16 |
| Best checkpoint | step 15000 (eval_loss = 2.8932) |
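The effective batch size of 128 is normally the product of per-device batch size, gradient accumulation steps, and device count. A sketch with assumed values (only the product, 128, comes from this card; the actual decomposition is not stated):

```python
# Hypothetical decomposition of the effective batch size; only the
# product (128) is documented in the training table above.
per_device_batch_size = 16
gradient_accumulation_steps = 8
num_devices = 1

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
assert effective_batch == 128
print(effective_batch)
```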
## Evaluation Metrics
| Step | WER (Whisper base) | DNSMOS |
|---|---|---|
| 15000 | 0.2697 | 3.2631 |
| 17500 (best WER) | 0.1124 | 3.2436 |
| 35000 | 0.2022 | 3.3033 |
Note: Step 17500 had the best WER but was not saved to disk. This repo contains the step 15000 checkpoint (the nearest saved one), which is the `best_model_checkpoint` according to the trainer state.
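The WER column is computed from Whisper-base transcripts of the synthesized audio. WER itself is just word-level edit distance divided by reference length; a minimal pure-Python version for reference (text normalization, which Whisper-based evals usually apply first, is omitted here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```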
## Usage
```python
import torch
import soundfile as sf
import yaml
from pathlib import Path
from huggingface_hub import snapshot_download

# Download model
model_dir = Path(snapshot_download("vijayavedartham/veda-tts-libritts"))

# Install veda-tts: pip install git+https://github.com/srallaba/veda-tts.git
from vedatts.models.cgn_v2.config import CGNv2Config
from vedatts.models.cgn_v2.model import CGNv2
from vedatts.models.cgn_v2.tokenizer import CGNv2Tokenizer
from vedatts.models.cgn_v2.generate import synthesize, tokens_to_audio
from vedatts.codec.snac import SNACCodec

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
with open(model_dir / "config.yaml") as f:
    config = CGNv2Config(**yaml.safe_load(f))
tokenizer = CGNv2Tokenizer.load(model_dir / "tokenizer.json")
model = CGNv2(config).to(device)
state_dict = torch.load(model_dir / "model.pt", map_location=device, weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load SNAC codec
codec = SNACCodec(device=device)

# Synthesize
text = "The quick brown fox jumps over the lazy dog."
generated, token_ids = synthesize(model, tokenizer, text, device=device)
waveform = tokens_to_audio(token_ids, tokenizer, codec, codec_type="snac")

# Save
sf.write("output.wav", waveform.squeeze().cpu().numpy(), samplerate=24000)
```
## Requirements
```bash
pip install veda-tts snac
# Also requires flite for G2P: apt install flite (or brew install flite)
```
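Since G2P depends on the system `flite` binary, it can help to verify it is on `PATH` before synthesizing. A minimal check (the helper name is illustrative, not part of the veda-tts API):

```python
import shutil


def check_flite() -> bool:
    """Return True if the flite binary is available on PATH."""
    return shutil.which("flite") is not None


if not check_flite():
    print("flite not found; install it with 'apt install flite' or 'brew install flite'")
```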
## Demo
Try it live on the HuggingFace Space.
## License
Apache-2.0