Veda TTS – LibriTTS Base v1

A neural Text-to-Speech model using the CGNv2 (Chain-of-thought Generative Network v2) architecture, trained on LibriTTS-R clean-360 (~360 hours of clean audiobook speech).

Architecture

  • Model: CGNv2 – autoregressive language-model-style TTS with prosodic chain-of-thought planning
  • Parameters: 206.6M
  • Vocoder: SNAC @ 24kHz (3-level hierarchical codec, 4096 codebook, 84 tokens/sec)
  • G2P: Flite (phoneme-based input, CMU ARPAbet)
  • Framework: PyTorch (custom training via HuggingFace Trainer)
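
The 84 tokens/sec figure follows from SNAC's hierarchical layout: the three codebook levels run at increasing rates and the per-second token count is their sum. A quick sanity check in Python, assuming a 1:2:4 rate ratio across levels (the exact per-level rates depend on the codec's hop sizes and are an illustration here, not values read from the codec):

```python
# SNAC emits tokens at three hierarchical levels (coarse -> fine).
# Assuming a 1:2:4 split of the 84 tokens/sec total, the per-level
# rates come out to 12 + 24 + 48. This split is an assumption for
# illustration, not a value reported by the codec itself.
level_rates = [12, 24, 48]  # tokens per second per level (assumed split)

total_rate = sum(level_rates)
print(total_rate)  # 84 tokens/sec, matching the figure above

# For a 5-second utterance this means roughly:
tokens_for_5s = 5 * total_rate
print(tokens_for_5s)  # 420 tokens
```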

Training

Setting                   Value
Dataset                   LibriTTS-R clean-360
Speakers                  902
Batch size (effective)    128
Optimizer                 AdamW (lr=1e-4, weight_decay=0.1)
Warmup steps              5000
Precision                 BF16
Best checkpoint           step 15000 (eval_loss = 2.8932)
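
The table lists lr=1e-4 with 5000 warmup steps. Assuming a standard linear warmup (the card does not state the schedule used after warmup, so holding the peak rate below is an assumption), the learning rate at a given step can be sketched as:

```python
def warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 5000) -> float:
    """Linear warmup to peak_lr; held constant afterwards.

    The post-warmup behavior (constant here) is an assumption --
    the model card does not specify a decay schedule."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

print(warmup_lr(2500))   # halfway through warmup -> 5e-05
print(warmup_lr(15000))  # past warmup -> 0.0001 (the peak)
```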

Evaluation Metrics

Step               WER (Whisper base)   DNSMOS
15000              0.2697               3.2631
17500 (best WER)   0.1124               3.2436
35000              0.2022               3.3033
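
The WER column comes from transcribing generated audio with Whisper base and scoring it against the input text. A minimal word-level WER implementation for reference (illustrative only; the actual eval pipeline and its text normalization are not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quack brown"))      # 0.5 (1 sub + 1 del over 4 words)
```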

Note: step 17500 achieved the best WER, but that checkpoint was not saved to disk. This repo ships the step 15000 checkpoint (the nearest one that was saved), which is also the best_model_checkpoint by eval_loss according to the trainer state.
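
The best_model_checkpoint selection lives in the HuggingFace Trainer's trainer_state.json, which the Trainer writes next to each checkpoint. If that file ships with this repo (an assumption; the repo layout may differ), the relevant fields look like this stand-in, which mirrors the values reported above:

```python
import json

# Stand-in for trainer_state.json contents (values mirror the numbers
# reported in this card; the real file is produced by the HF Trainer).
state = json.loads("""
{
  "best_metric": 2.8932,
  "best_model_checkpoint": "checkpoints/checkpoint-15000",
  "global_step": 35000
}
""")

print(state["best_model_checkpoint"])  # checkpoints/checkpoint-15000
print(state["best_metric"])            # 2.8932
```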

Usage

import torch
import soundfile as sf
import yaml
from pathlib import Path
from huggingface_hub import snapshot_download

# veda-tts must be installed first:
#   pip install git+https://github.com/srallaba/veda-tts.git
from vedatts.models.cgn_v2.config import CGNv2Config
from vedatts.models.cgn_v2.model import CGNv2
from vedatts.models.cgn_v2.tokenizer import CGNv2Tokenizer
from vedatts.models.cgn_v2.generate import synthesize, tokens_to_audio
from vedatts.codec.snac import SNACCodec

# Download the model files from the Hub
model_dir = Path(snapshot_download("vijayavedartham/veda-tts-libritts"))

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
with open(model_dir / "config.yaml") as f:
    config = CGNv2Config(**yaml.safe_load(f))

tokenizer = CGNv2Tokenizer.load(model_dir / "tokenizer.json")
model = CGNv2(config).to(device)
state_dict = torch.load(model_dir / "model.pt", map_location=device, weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load SNAC codec
codec = SNACCodec(device=device)

# Synthesize
text = "The quick brown fox jumps over the lazy dog."
generated, token_ids = synthesize(model, tokenizer, text, device=device)
waveform = tokens_to_audio(token_ids, tokenizer, codec, codec_type="snac")

# Save
sf.write("output.wav", waveform.squeeze().cpu().numpy(), samplerate=24000)

Requirements

pip install veda-tts snac
# Also requires flite for G2P: apt install flite (or brew install flite)

Demo

Try it live on the HuggingFace Space.

License

Apache-2.0
