MCEP-VITS: VITS with MCEP-based iSTFT Decoder

A single-speaker English text-to-speech model trained on LJSpeech. It replaces the standard HiFi-GAN upsampling decoder in VITS with a novel, compact spectral decoder built on mel-cepstral coefficients (MCEPs) and an inverse STFT.

Model Description

MCEP-VITS modifies the standard VITS architecture by replacing the decoder with a spectral approach:

  1. Deep ResBlock Backbone -- 5 ResBlock1 layers extract features from the VITS latent space
  2. MCEP Prediction Head -- Predicts 40-dim mel-cepstral coefficients, converted to log-magnitude spectrum via a fixed (non-learnable) mc2sp basis matrix
  3. Magnitude Refinement -- Bounded residual (0.3 * tanh) adds fine spectral detail on top of the smooth MCEP envelope
  4. Minimum Phase + Learned Phase -- Analytically computed minimum phase from log-magnitude, plus a learned phase residual
  5. Full-resolution iSTFT -- Direct waveform synthesis at 22050 Hz (no subband decomposition, no conv upsampling)

This design provides a strong spectral inductive bias (MCEPs parameterize smooth spectral envelopes) and requires fewer parameters than standard decoders while maintaining good quality.
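Steps 2 and 4 above can be sketched in NumPy: a fixed cosine basis over all-pass-warped frequencies maps the 40 mel-cepstral coefficients to a log-magnitude spectrum, and a folded real cepstrum yields the analytic minimum phase. This is a minimal illustration of the standard mc2sp and homomorphic minimum-phase constructions; the function names and n_fft value are assumptions, not the repo's actual API.

```python
import numpy as np

def mc2sp_basis(n_fft=1024, order=39, alpha=0.455):
    """Fixed (non-learnable) basis mapping MCEPs to a log-magnitude spectrum."""
    omega = np.pi * np.arange(n_fft // 2 + 1) / (n_fft // 2)
    # First-order all-pass frequency warping used by mel-cepstral analysis
    warped = omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                      1.0 - alpha * np.cos(omega))
    basis = np.cos(np.outer(warped, np.arange(order + 1)))
    basis[:, 1:] *= 2.0  # symmetric cepstrum: non-zero quefrencies count twice
    return basis         # shape: (n_fft // 2 + 1, order + 1)

def minimum_phase(log_mag, n_fft=1024):
    """Analytic minimum phase from one log-magnitude frame (homomorphic method)."""
    cep = np.fft.irfft(log_mag, n=n_fft)   # real cepstrum
    cep[1:n_fft // 2] *= 2.0               # fold into a causal cepstrum
    cep[n_fft // 2 + 1:] = 0.0
    return np.imag(np.fft.rfft(cep))       # phase in radians

# A flat envelope (only c_0 set) gives a constant log-magnitude and zero phase.
basis = mc2sp_basis()
mcep = np.zeros(40); mcep[0] = 1.0
log_mag = basis @ mcep
phase = minimum_phase(log_mag)
```

In the model, the learned phase residual of step 4 and the bounded magnitude residual of step 3 are added on top of these analytic quantities before the iSTFT.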

Evaluation Results (Step 125K, 100 val samples)

Metric                  Value
WER (Whisper-medium)    12.2%
UTMOS (predicted MOS)   3.33
RTF (RTX 4090)          0.007
  • WER median: 8.2%, 25% of samples have zero WER
  • UTMOS range: 2.58 -- 4.12
  • Most WER errors come from number-to-word normalization mismatches (e.g., "1837" vs "eighteen thirty-seven")
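The normalization mismatch noted above inflates the reported WER. A minimal word-error-rate function (Levenshtein distance over word tokens, a standard formulation rather than the exact evaluation script used here) scores the "1837" example as 100% error even though the audio is intelligible:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row: distance between r[:i] and h[:j]
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution / match
            prev = cur
    return d[-1] / max(len(r), 1)

print(wer("eighteen thirty seven", "1837"))  # 1.0: every reference word counts as an error
```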

Architecture Details

Component                             Parameters
Text Encoder (6-layer transformer)    ~4.0M
Duration Predictor                    ~1.0M
Posterior Encoder (WaveNet)           ~2.5M
Normalizing Flow (4 coupling layers)  ~4.0M
MCEP Decoder (v5)                     ~3.9M
Total Generator                       ~9.3M

For comparison, the standard VITS HiFi-GAN decoder alone is ~33M parameters.

Usage

Prerequisites

pip install torch pysptk scipy numpy soundfile phonemizer unidecode
# espeak-ng is required for phonemization
# Ubuntu: sudo apt-get install espeak-ng

Quick Start

import sys, os, torch, soundfile as sf

# Clone or download this repo
model_dir = "path/to/mcep-vits-ljspeech"
sys.path.insert(0, model_dir)

from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols
from commons import intersperse
from utils import get_hparams_from_file, load_checkpoint

# Load model
hps = get_hparams_from_file(os.path.join(model_dir, "config.json"))
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model.__dict__,
).eval()
load_checkpoint(os.path.join(model_dir, "G_125000.pth"), net_g)
net_g.dec.remove_weight_norm()

# Synthesize
text = "Scientists at the CERN laboratory made a remarkable discovery last week."
text_norm = intersperse(text_to_sequence(text, hps.data.text_cleaners), 0)
x = torch.LongTensor([text_norm])
x_lengths = torch.LongTensor([len(text_norm)])

with torch.no_grad():
    audio, _, _, _, _ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)
    audio = audio.squeeze().numpy()

sf.write("output.wav", audio, 22050)

CLI

cd mcep-vits-ljspeech
python inference.py --text "Hello, this is a test of the MCEP VITS model." --output hello.wav

Checkpoints

  • G_125000.pth -- Primary checkpoint (evaluated, recommended)
  • G_175000.pth -- Latest available checkpoint

Training Details

  • Dataset: LJSpeech (12,969 train / 131 eval utterances)
  • Hardware: NVIDIA RTX 4090, ~180K steps
  • Batch size: 48
  • Learning rate: 2e-4 with 0.999875 decay
  • Precision: Mixed (fp16)
  • MCEP config: order=39, alpha=0.455 (warping for 22050 Hz)
  • Decoder: 192 channels, 5 ResBlock1 layers
  • Loss: Multi-resolution STFT loss (c_mel=45) + KL divergence + GAN loss
  • Framework: PyTorch, based on MB-iSTFT-VITS
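The multi-resolution STFT loss above can be sketched as a sum of spectral-convergence and log-magnitude terms over several FFT resolutions. This NumPy illustration uses common default resolutions that are assumptions, not necessarily the repo's values, and the actual training code is in PyTorch:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Hann-windowed magnitude STFT (minimal, no padding or centering)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_res_stft_loss(y_hat, y,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of spectral-convergence + log-magnitude L1 over several resolutions."""
    loss = 0.0
    for n_fft, hop in resolutions:
        s_hat, s = stft_mag(y_hat, n_fft, hop), stft_mag(y, n_fft, hop)
        sc = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + 1e-8)
        mag = np.mean(np.abs(np.log(s + 1e-7) - np.log(s_hat + 1e-7)))
        loss += sc + mag
    return loss

# Sanity check: identical waveforms incur zero loss.
t = np.arange(4096) / 22050.0
y = np.sin(2 * np.pi * 220.0 * t)
print(multi_res_stft_loss(y, y))  # 0.0
```

Comparing magnitudes at multiple window sizes penalizes both coarse envelope errors and fine harmonic errors, which suits a decoder whose magnitude is built from a smooth MCEP envelope plus a bounded residual.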

Limitations

  • Single speaker (LJSpeech female voice) only
  • English text input only
  • Requires espeak-ng for phonemization
  • Phase quality is the main bottleneck -- some utterances may have subtle artifacts
  • Number normalization is not built in (e.g., "1837" may not be read as "eighteen thirty-seven")

Citation

If you use this model, please cite the original VITS paper:

@inproceedings{kim2021vits,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={ICML},
  year={2021}
}

And the MB-iSTFT-VITS paper:

@inproceedings{kawamura2023mbistvits,
  title={Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform},
  author={Kawamura, Masaya and Shirahata, Yuma and Yamamoto, Ryuichi and Tachibana, Kentaro},
  booktitle={ICASSP},
  year={2023}
}

License

MIT License
