Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Paper: arXiv:2106.06103
A single-speaker English text-to-speech model trained on LJSpeech, featuring a novel MCEP-based decoder that replaces the standard HiFi-GAN upsampling decoder with a compact spectral decoder based on Mel-Cepstral coefficients (MCEPs) and inverse STFT.
MCEP-VITS modifies the standard VITS architecture by replacing the decoder with a spectral approach:

- The decoder predicts Mel-Cepstral coefficients (MCEPs), which are mapped to a smooth log-magnitude spectral envelope through an mc2sp basis matrix.
- A bounded residual branch (`0.3 * tanh`) adds fine spectral detail on top of the smooth MCEP envelope.
- The waveform is reconstructed with an inverse STFT instead of learned upsampling.

This design provides a strong spectral inductive bias (MCEPs parameterize smooth spectral envelopes), requiring fewer parameters than standard decoders while maintaining good quality.
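The idea behind the mc2sp basis can be illustrated with a small NumPy sketch: a cosine basis evaluated on an all-pass-warped frequency axis maps cepstral coefficients to a log-magnitude envelope. The warping constant `alpha=0.455` and the one-sided cepstrum convention here are illustrative assumptions, not values read from this repo:

```python
import numpy as np

def mc2sp_basis(order, n_fft, alpha=0.455):
    """Cosine basis mapping mel-cepstral coefficients to a
    log-magnitude spectrum on a warped frequency axis."""
    omega = np.linspace(0, np.pi, n_fft // 2 + 1)
    # frequency warping induced by a first-order all-pass filter
    warped = omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                      1.0 - alpha * np.cos(omega))
    m = np.arange(order + 1)
    basis = np.cos(np.outer(warped, m))  # (n_bins, order + 1)
    basis[:, 1:] *= 2.0                  # one-sided cepstrum: double m > 0 terms
    return basis

# Toy MCEP frame (order 24); real coefficients would come from the decoder.
mc = np.zeros(25)
mc[0] = 1.0                              # overall log gain only
B = mc2sp_basis(order=24, n_fft=1024)
log_mag = B @ mc                         # (513,) log-magnitude envelope
mag = np.exp(log_mag)                    # flat envelope, since only c0 is set
```

Because the basis is a fixed matrix, the decoder's MCEP output turns into a spectrum with a single matrix multiply, which is where the parameter savings come from.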
| Metric | Value |
|---|---|
| WER (Whisper-medium) | 12.2% |
| UTMOS (predicted MOS) | 3.33 |
| RTF (RTX 4090) | 0.007 |
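RTF (real-time factor) is wall-clock synthesis time divided by the duration of the generated audio, so 0.007 means roughly 7 ms of compute per second of speech. A trivial helper (the example numbers below are illustrative, not measured):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. 10 s of speech generated in 0.07 s gives the reported RTF of 0.007
rtf = real_time_factor(0.07, 10.0)
```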
| Component | Parameters |
|---|---|
| Text Encoder (6-layer transformer) | ~4.0M |
| Duration Predictor | ~1.0M |
| Posterior Encoder (WaveNet) | ~2.5M |
| Normalizing Flow (4 coupling layers) | ~4.0M |
| MCEP Decoder (v5) | ~3.9M |
| Total Generator | ~15.4M |
For comparison, the standard VITS HiFi-GAN decoder alone is ~33M parameters.
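Parameter counts like those in the table can be reproduced with a one-liner over any submodule (e.g. `count_params(net_g.dec)` for the MCEP decoder, once the model below is loaded); the `nn.Linear` check is just a toy sanity test:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Trainable parameter count, in the same units as the table above."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Toy check: a 10->5 linear layer has 10*5 weights + 5 biases = 55 parameters.
assert count_params(nn.Linear(10, 5)) == 55
```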
```bash
pip install torch pysptk scipy numpy soundfile phonemizer unidecode

# espeak-ng is required for phonemization
# Ubuntu: sudo apt-get install espeak-ng
```
```python
import os
import sys

import soundfile as sf
import torch

# Clone or download this repo, then point model_dir at it
model_dir = "path/to/mcep-vits-ljspeech"
sys.path.insert(0, model_dir)

from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols
from commons import intersperse
from utils import get_hparams_from_file, load_checkpoint

# Load model
hps = get_hparams_from_file(os.path.join(model_dir, "config.json"))
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model.__dict__,
).eval()
load_checkpoint(os.path.join(model_dir, "G_125000.pth"), net_g)
net_g.dec.remove_weight_norm()

# Synthesize
text = "Scientists at the CERN laboratory made a remarkable discovery last week."
text_norm = intersperse(text_to_sequence(text, hps.data.text_cleaners), 0)
x = torch.LongTensor([text_norm])
x_lengths = torch.LongTensor([len(text_norm)])
with torch.no_grad():
    audio, *_ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)
audio = audio.squeeze().numpy()
sf.write("output.wav", audio, 22050)
```
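The `intersperse` call above inserts a blank token (id 0) between and around the phoneme ids before they reach the text encoder. A minimal self-contained reimplementation, matching the upstream VITS helper:

```python
def intersperse(seq, item):
    """Place `item` between and around every element:
    [a, b, c] -> [item, a, item, b, item, c, item]."""
    out = [item] * (len(seq) * 2 + 1)
    out[1::2] = seq
    return out

assert intersperse([5, 9, 7], 0) == [0, 5, 0, 9, 0, 7, 0]
```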
```bash
cd mcep-vits-ljspeech
python inference.py --text "Hello, this is a test of the MCEP VITS model." --output hello.wav
```
If you use this model, please cite the original VITS paper:
```bibtex
@inproceedings{kim2021vits,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={ICML},
  year={2021}
}
```
And the MB-iSTFT-VITS paper:
```bibtex
@inproceedings{kawamura2023mbistvits,
  title={Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform},
  author={Kawamura, Masaya and Shirahata, Yuma and Yamamoto, Ryuichi and Tachibana, Kentaro},
  booktitle={ICASSP},
  year={2023}
}
```
MIT License