MCEP-VITS: VITS with MCEP-based iSTFT Decoder

A single-speaker English text-to-speech model trained on LJSpeech. It replaces the standard HiFi-GAN upsampling decoder in VITS with a novel, compact spectral decoder built on mel-cepstral coefficients (MCEPs) and an inverse STFT.

Model Description

MCEP-VITS modifies the standard VITS architecture by replacing the decoder with a spectral approach:

  1. Deep ResBlock Backbone -- 5 ResBlock1 layers extract features from the VITS latent space
  2. MCEP Prediction Head -- Predicts 40-dim mel-cepstral coefficients, converted to log-magnitude spectrum via a fixed (non-learnable) mc2sp basis matrix
  3. Magnitude Refinement -- Bounded residual (0.3 * tanh) adds fine spectral detail on top of the smooth MCEP envelope
  4. Minimum Phase + Learned Phase -- Analytically computed minimum phase from log-magnitude, plus a learned phase residual
  5. Full-resolution iSTFT -- Direct waveform synthesis at 22050 Hz (no subband decomposition, no conv upsampling)

This design provides a strong spectral inductive bias (MCEPs parameterize smooth spectral envelopes) and requires fewer parameters than standard decoders while maintaining good quality.
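Steps 2 and 4 above can be sketched in NumPy: a fixed cosine basis over all-pass-warped frequencies maps the 40 mel-cepstral coefficients to a log-magnitude spectrum, and a folded real cepstrum yields the analytic minimum phase. This is a minimal illustration of the standard mc2sp and homomorphic minimum-phase constructions; the function names and n_fft value are assumptions, not the repo's actual API.

```python
import numpy as np

def mc2sp_basis(n_fft=1024, order=39, alpha=0.455):
    """Fixed (non-learnable) basis mapping MCEPs to a log-magnitude spectrum."""
    omega = np.pi * np.arange(n_fft // 2 + 1) / (n_fft // 2)
    # First-order all-pass frequency warping used by mel-cepstral analysis
    warped = omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                      1.0 - alpha * np.cos(omega))
    basis = np.cos(np.outer(warped, np.arange(order + 1)))
    basis[:, 1:] *= 2.0  # symmetric cepstrum: non-zero quefrencies count twice
    return basis         # shape: (n_fft // 2 + 1, order + 1)

def minimum_phase(log_mag, n_fft=1024):
    """Analytic minimum phase from one log-magnitude frame (homomorphic method)."""
    cep = np.fft.irfft(log_mag, n=n_fft)   # real cepstrum
    cep[1:n_fft // 2] *= 2.0               # fold into a causal cepstrum
    cep[n_fft // 2 + 1:] = 0.0
    return np.imag(np.fft.rfft(cep))       # phase in radians

# A flat envelope (only c_0 set) gives a constant log-magnitude and zero phase.
basis = mc2sp_basis()
mcep = np.zeros(40); mcep[0] = 1.0
log_mag = basis @ mcep
phase = minimum_phase(log_mag)
```

In the model, the learned phase residual of step 4 and the bounded magnitude residual of step 3 are added on top of these analytic quantities before the iSTFT.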

Evaluation Results (Step 125K, 100 val samples)

Metric                  Value
WER (Whisper-medium)    12.2%
UTMOS (predicted MOS)   3.33
RTF (RTX 4090)          0.007
  • WER median: 8.2%, 25% of samples have zero WER
  • UTMOS range: 2.58 -- 4.12
  • Most WER errors come from number-to-word normalization mismatches (e.g., "1837" vs "eighteen thirty-seven")
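The normalization mismatch noted above inflates the reported WER. A minimal word-error-rate function (Levenshtein distance over word tokens, a standard formulation rather than the exact evaluation script used here) scores the "1837" example as 100% error even though the audio is intelligible:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row: distance between r[:i] and h[:j]
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution / match
            prev = cur
    return d[-1] / max(len(r), 1)

print(wer("eighteen thirty seven", "1837"))  # 1.0: every reference word counts as an error
```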

Architecture Details

Component                             Parameters
Text Encoder (6-layer transformer)    ~4.0M
Duration Predictor                    ~1.0M
Posterior Encoder (WaveNet)           ~2.5M
Normalizing Flow (4 coupling layers)  ~4.0M
MCEP Decoder (v5)                     ~3.9M
Total Generator                       ~9.3M

For comparison, the standard VITS HiFi-GAN decoder alone is ~33M parameters.

Usage

Prerequisites

pip install torch pysptk scipy numpy soundfile phonemizer unidecode
# espeak-ng is required for phonemization
# Ubuntu: sudo apt-get install espeak-ng

Quick Start

import sys, os, torch, soundfile as sf

# Clone or download this repo
model_dir = "path/to/mcep-vits-ljspeech"
sys.path.insert(0, model_dir)

from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols
from commons import intersperse
from utils import get_hparams_from_file, load_checkpoint

# Load model
hps = get_hparams_from_file(os.path.join(model_dir, "config.json"))
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model.__dict__,
).eval()
load_checkpoint(os.path.join(model_dir, "G_125000.pth"), net_g)
net_g.dec.remove_weight_norm()

# Synthesize
text = "Scientists at the CERN laboratory made a remarkable discovery last week."
text_norm = intersperse(text_to_sequence(text, hps.data.text_cleaners), 0)
x = torch.LongTensor([text_norm])
x_lengths = torch.LongTensor([len(text_norm)])

with torch.no_grad():
    audio, _, _, _, _ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)
    audio = audio.squeeze().numpy()

sf.write("output.wav", audio, 22050)

CLI

cd mcep-vits-ljspeech
python inference.py --text "Hello, this is a test of the MCEP VITS model." --output hello.wav

Checkpoints

  • G_125000.pth -- Primary checkpoint (evaluated, recommended)
  • G_175000.pth -- Latest available checkpoint

Training Details

  • Dataset: LJSpeech (12,969 train / 131 eval utterances)
  • Hardware: NVIDIA RTX 4090, ~180K steps
  • Batch size: 48
  • Learning rate: 2e-4 with 0.999875 decay
  • Precision: Mixed (fp16)
  • MCEP config: order=39, alpha=0.455 (warping for 22050 Hz)
  • Decoder: 192 channels, 5 ResBlock1 layers
  • Loss: Multi-resolution STFT loss (c_mel=45) + KL divergence + GAN loss
  • Framework: PyTorch, based on MB-iSTFT-VITS
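The multi-resolution STFT loss above can be sketched as a sum of spectral-convergence and log-magnitude terms over several FFT resolutions. This NumPy illustration uses common default resolutions that are assumptions, not necessarily the repo's values, and the actual training code is in PyTorch:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Hann-windowed magnitude STFT (minimal, no padding or centering)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_res_stft_loss(y_hat, y,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of spectral-convergence + log-magnitude L1 over several resolutions."""
    loss = 0.0
    for n_fft, hop in resolutions:
        s_hat, s = stft_mag(y_hat, n_fft, hop), stft_mag(y, n_fft, hop)
        sc = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + 1e-8)
        mag = np.mean(np.abs(np.log(s + 1e-7) - np.log(s_hat + 1e-7)))
        loss += sc + mag
    return loss

# Sanity check: identical waveforms incur zero loss.
t = np.arange(4096) / 22050.0
y = np.sin(2 * np.pi * 220.0 * t)
print(multi_res_stft_loss(y, y))  # 0.0
```

Comparing magnitudes at multiple window sizes penalizes both coarse envelope errors and fine harmonic errors, which suits a decoder whose magnitude is built from a smooth MCEP envelope plus a bounded residual.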

Limitations

  • Single speaker (LJSpeech female voice) only
  • English text input only
  • Requires espeak-ng for phonemization
  • Phase quality is the main bottleneck -- some utterances may have subtle artifacts
  • Number normalization is not built in (e.g., "1837" may not be read as "eighteen thirty-seven")

Citation

If you use this model, please cite the original VITS paper:

@inproceedings{kim2021vits,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={ICML},
  year={2021}
}

And the MB-iSTFT-VITS paper:

@inproceedings{kawamura2023mbistvits,
  title={Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform},
  author={Kawamura, Masaya and Shirahata, Yuma and Yamamoto, Ryuichi and Tachibana, Kentaro},
  booktitle={ICASSP},
  year={2023}
}

License

MIT License
