Veda TTS – LibriTTS Base v1

A neural Text-to-Speech model using the CGNv2 (Chain-of-thought Generative Network v2) architecture, trained on LibriTTS-R clean-360 (~360 hours of clean audiobook speech).

Architecture

  • Model: CGNv2 – autoregressive language-model-style TTS with prosodic chain-of-thought planning
  • Parameters: 206.6M
  • Vocoder: SNAC @ 24kHz (3-level hierarchical codec, 4096 codebook, 84 tokens/sec)
  • G2P: Flite (phoneme-based input, CMU ARPAbet)
  • Framework: PyTorch (custom training via HuggingFace Trainer)
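
The 84 tokens/sec figure follows from SNAC's hierarchical layout: the three codebook levels run at increasing rates and the per-second token count is their sum. A quick sanity check in Python, assuming a 1:2:4 rate ratio across levels (the exact per-level rates depend on the codec's hop sizes and are an illustration here, not values read from the codec):

```python
# SNAC emits tokens at three hierarchical levels (coarse -> fine).
# Assuming a 1:2:4 split of the 84 tokens/sec total, the per-level
# rates come out to 12 + 24 + 48. This split is an assumption for
# illustration, not a value reported by the codec itself.
level_rates = [12, 24, 48]  # tokens per second per level (assumed split)

total_rate = sum(level_rates)
print(total_rate)  # 84 tokens/sec, matching the figure above

# For a 5-second utterance this means roughly:
tokens_for_5s = 5 * total_rate
print(tokens_for_5s)  # 420 tokens
```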

Training

Setting                   Value
Dataset                   LibriTTS-R clean-360
Speakers                  902
Batch size (effective)    128
Optimizer                 AdamW (lr=1e-4, weight_decay=0.1)
Warmup steps              5000
Precision                 BF16
Best checkpoint           step 15000 (eval_loss = 2.8932)
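
The table lists lr=1e-4 with 5000 warmup steps. Assuming a standard linear warmup (the card does not state the schedule used after warmup, so holding the peak rate below is an assumption), the learning rate at a given step can be sketched as:

```python
def warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 5000) -> float:
    """Linear warmup to peak_lr; held constant afterwards.

    The post-warmup behavior (constant here) is an assumption --
    the model card does not specify a decay schedule."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

print(warmup_lr(2500))   # halfway through warmup -> 5e-05
print(warmup_lr(15000))  # past warmup -> 0.0001 (the peak)
```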

Evaluation Metrics

Step               WER (Whisper base)   DNSMOS
15000              0.2697               3.2631
17500 (best WER)   0.1124               3.2436
35000              0.2022               3.3033
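
The WER column comes from transcribing generated audio with Whisper base and scoring it against the input text. A minimal word-level WER implementation for reference (illustrative only; the actual eval pipeline and its text normalization are not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quack brown"))      # 0.5 (1 sub + 1 del over 4 words)
```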

Note: step 17500 achieved the best WER, but that checkpoint was not saved to disk. This repo ships the step 15000 checkpoint (the nearest one that was saved), which is also the best_model_checkpoint by eval_loss according to the trainer state.
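
The best_model_checkpoint selection lives in the HuggingFace Trainer's trainer_state.json, which the Trainer writes next to each checkpoint. If that file ships with this repo (an assumption; the repo layout may differ), the relevant fields look like this stand-in, which mirrors the values reported above:

```python
import json

# Stand-in for trainer_state.json contents (values mirror the numbers
# reported in this card; the real file is produced by the HF Trainer).
state = json.loads("""
{
  "best_metric": 2.8932,
  "best_model_checkpoint": "checkpoints/checkpoint-15000",
  "global_step": 35000
}
""")

print(state["best_model_checkpoint"])  # checkpoints/checkpoint-15000
print(state["best_metric"])            # 2.8932
```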

Usage

import torch
import soundfile as sf
import yaml
from pathlib import Path
from huggingface_hub import snapshot_download

# veda-tts must be installed first:
#   pip install git+https://github.com/srallaba/veda-tts.git
from vedatts.models.cgn_v2.config import CGNv2Config
from vedatts.models.cgn_v2.model import CGNv2
from vedatts.models.cgn_v2.tokenizer import CGNv2Tokenizer
from vedatts.models.cgn_v2.generate import synthesize, tokens_to_audio
from vedatts.codec.snac import SNACCodec

# Download the model files from the Hub
model_dir = Path(snapshot_download("vijayavedartham/veda-tts-libritts"))

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
with open(model_dir / "config.yaml") as f:
    config = CGNv2Config(**yaml.safe_load(f))

tokenizer = CGNv2Tokenizer.load(model_dir / "tokenizer.json")
model = CGNv2(config).to(device)
state_dict = torch.load(model_dir / "model.pt", map_location=device, weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load SNAC codec
codec = SNACCodec(device=device)

# Synthesize
text = "The quick brown fox jumps over the lazy dog."
generated, token_ids = synthesize(model, tokenizer, text, device=device)
waveform = tokens_to_audio(token_ids, tokenizer, codec, codec_type="snac")

# Save
sf.write("output.wav", waveform.squeeze().cpu().numpy(), samplerate=24000)

Requirements

pip install veda-tts snac
# Also requires flite for G2P: apt install flite (or brew install flite)

Demo

Try it live on the HuggingFace Space.

License

Apache-2.0
