LaionBox v0.2-wip: Fine-tuned DramaBox with Multi-Auxiliary Losses
LaionBox is a fine-tuned version of DramaBox (a 22B-parameter, DiT-based TTS model) that produces more natural, emotionally expressive speech with improved voice cloning fidelity.
Key Improvements over Vanilla DramaBox
- Higher naturalness: +11% CLAP naturalness score
- Better voice cloning: +2% speaker similarity (WavLM-SV cosine)
- Reduced artifacts: Quality probability 0.99+ (vs ~0.85 baseline)
- Less background noise: Trained on enhanced, high-quality reference audio
- More expressive: Trained on diverse voice acting scenarios, emotions, and archetypes
Model Details
| Property | Value |
|---|---|
| Base Model | ResembleAI/Dramabox (LTX-2.3-22B Audio-Only) |
| Fine-tuning Method | LoRA (rank=128, alpha=128) |
| Trainable Parameters | |
| Training Data | DramaBox voice acting (50%) + Podcasts (50%) |
| Auxiliary Losses | 6 (naturalness, quality, centroid, speaker sim, comb filter, artifact v2) |
| Best Checkpoint | Step 80 (best flow matching loss = 0.418) |
| Training Hardware | 8× A100 80GB |
| Precision | bfloat16 |
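The LoRA row above can be read as a low-rank update added to each adapted weight matrix. The sketch below assumes the standard LoRA formulation, W' = W + (alpha / rank) * (B @ A), under which rank=128 and alpha=128 give a scale of 1.0. The function names and the toy layer sizes are hypothetical; this card does not state which layers are adapted.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A) for nested-list matrices."""
    scale = lora_scale(rank, alpha)
    delta = matmul(b, a)  # shape: d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example: a 2x2 weight with a rank-1 update (sizes are illustrative only).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # rank x d_in
b = [[0.5], [0.0]]  # d_out x rank
merged = merge_lora(w, a, b, rank=1, alpha=1.0)
```

With alpha equal to rank, the update is applied unscaled, so the adapter's effective strength is set entirely by the learned A and B matrices.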
Auxiliary Losses
This checkpoint was trained with 6 differentiable auxiliary losses that guide the LoRA adapter toward producing more natural audio:
- CLAP Naturalness (ratio=3.0): Maximizes cosine similarity between generated audio's CLAP embedding and positive text prompts ("realistic, genuine, spontaneous, authentic, natural") while minimizing similarity to negative prompts ("distorted, unnatural, robotic")
- Quality MLP (included with naturalness): Binary classifier (768→256→64→1) trained on VoiceCLAP embeddings to distinguish real from synthetic audio
- Centroid Real/Fake (ratio=3.0): Pushes audio embeddings toward the centroid of real speech embeddings and away from synthetic speech centroid
- Speaker Similarity (ratio=3.0): WavLM-SV cosine similarity between reference and generated speaker embeddings
- Comb Filter Detector (ratio=3.0): CNN operating on latent space to detect comb-filter artifacts
- Artifact Detector V2 (ratio=3.0): Larger residual CNN for general artifact detection in latent space
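The embedding-based losses in this list share one primitive, cosine similarity. The following is a minimal sketch of how the naturalness, centroid, and speaker-similarity terms could be written. The exact formulas, the averaging over prompts, and all function names are assumptions, not the training script's actual code. Plain Python lists are used for readability; the real losses would operate on tensors that carry gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_cosine(x, prompts):
    """Average cosine similarity between x and each prompt embedding."""
    return sum(cosine(x, p) for p in prompts) / len(prompts)


def naturalness_loss(audio_emb, pos_text_embs, neg_text_embs):
    """Assumed form: lower when audio is closer to positive prompts than to negative ones."""
    return mean_cosine(audio_emb, neg_text_embs) - mean_cosine(audio_emb, pos_text_embs)


def centroid_loss(audio_emb, real_centroid, fake_centroid):
    """Assumed form: pull toward the real-speech centroid, push from the synthetic one."""
    return cosine(audio_emb, fake_centroid) - cosine(audio_emb, real_centroid)


def speaker_similarity_loss(ref_emb, gen_emb):
    """Assumed form: zero when reference and generated speaker embeddings align."""
    return 1.0 - cosine(ref_emb, gen_emb)


# Toy 2-D embeddings, for illustration only.
nat = naturalness_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
cen = centroid_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
spk = speaker_similarity_loss([0.0, 2.0], [0.0, 5.0])
```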
Critical Innovation: Differentiable Reward Chain
The key breakthrough enabling these auxiliary losses is a fully differentiable reward chain:
LoRA params → velocity prediction → x₀ recovery → VAE decoder → waveform → CLAP embedding → loss
This allows gradients to flow from perceptual quality metrics back to the LoRA parameters, directly steering the model toward higher-quality audio. Previous attempts with non-differentiable rewards (modulating loss magnitude only) produced no improvement.
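The clean-sample recovery step in this chain is what lets a single forward pass be decoded to audio. As a sketch, assume the common rectified-flow convention in which the noisy latent is x_t = (1 - t) * x_0 + t * noise and the model predicts the velocity v = noise - x_0; then x_0 = x_t - t * v. Both the convention and the function name `recover_x0` are assumptions, since the card does not state which parameterization DramaBox uses.

```python
def recover_x0(x_t, v_pred, t):
    """One-step estimate of the clean latent from a velocity prediction.

    Assumes x_t = (1 - t) * x0 + t * noise and v = noise - x0.
    """
    return [xt - t * v for xt, v in zip(x_t, v_pred)]


# Toy check with a known clean latent and noise sample.
x0 = [0.5, -1.0, 2.0]
noise = [1.0, 0.0, -1.0]
t = 0.3
x_t = [(1 - t) * a + t * n for a, n in zip(x0, noise)]
v_true = [n - a for a, n in zip(x0, noise)]
x0_hat = recover_x0(x_t, v_true, t)
```

Because the recovery is plain arithmetic on the model's prediction, it is differentiable with respect to that prediction when written with tensors, which is what allows a loss computed on decoded audio to reach the LoRA parameters.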
Files
.
├── lora_weights.safetensors                      # Main LoRA checkpoint (865 MB)
├── training_config.yaml                          # Training configuration
├── training_args.yaml                            # Full training arguments
├── discriminators/
│   ├── quality_classifier.pt                     # Quality/MOS prediction MLP
│   ├── real_fake_classifier.pt                   # Real vs AI detection MLP
│   ├── best_artifact_detector_v2.pt              # CNN artifact detector (latent space)
│   ├── best_comb_detector.pt                     # CNN comb filter detector (latent space)
│   └── best_clap_medium.pt                       # CLAP-based artifact MLP
├── scripts/
│   ├── dramabox_finetune_train_multi_aux.py      # Main training script (6-aux)
│   ├── dramabox_finetune_train.py                # Base training script
│   ├── dramabox_finetune_prepare.py              # Data preparation
│   ├── train_artifact_detector_v2.py             # Artifact detector training
│   ├── train_artifact_detector_clap.py           # CLAP artifact detector training
│   ├── train_comb_filter_detector.py             # Comb filter detector training
│   └── train_binary_classifiers.py               # Quality/real-fake classifier training
└── configs/
    └── *.yaml                                    # All training configurations
Discriminator Models
Quality Classifier (discriminators/quality_classifier.pt)
- Architecture: MLP (768→128→32→1, sigmoid)
- Input: VoiceCLAP-small embeddings (768-dim)
- Training: Binary classification (real=1, synthetic=0)
- Accuracy: 100% on validation set
- Purpose: Provides quality probability signal during training
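To make the layer widths listed for this classifier concrete, here is a small sketch that counts the parameters of such a stack and runs a forward pass. The hidden activation is not stated in this card, so ReLU is an assumption, and the weights are random placeholders, not the trained checkpoint. All function names are hypothetical.

```python
import math
import random


def mlp_param_count(dims):
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


def init_mlp(dims, seed=0):
    """Random small weights, for shape illustration only (not the trained checkpoint)."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.05, 0.05) for _ in range(d_in)] for _ in range(d_out)],
         [0.0] * d_out)
        for d_in, d_out in zip(dims, dims[1:])
    ]


def forward(layers, x):
    """ReLU on hidden layers (assumed), sigmoid on the single output logit."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return 1.0 / (1.0 + math.exp(-x[0]))


dims = [768, 128, 32, 1]
n_params = mlp_param_count(dims)
layers = init_mlp(dims)
prob = forward(layers, [0.1] * 768)  # a probability in (0, 1)
```

With these widths the stack has 102,593 parameters, so the classifier is tiny next to the base model and adds little cost to each training step.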
Real/Fake Classifier (discriminators/real_fake_classifier.pt)
- Architecture: Same as quality classifier
- Training: Distinguishes real recordings from TTS outputs
- Purpose: Centroid-based distribution matching
Artifact Detector V2 (discriminators/best_artifact_detector_v2.pt)
- Architecture: Residual CNN (8→64→128→256→384→512, ~10.6M params)
- Input: Latent tensors [B, 8, T, 16]
- Training: Binary classification of clean vs artifacted latents
- Purpose: Detects vocoder artifacts directly in latent space (no decoding needed)
Comb Filter Detector (discriminators/best_comb_detector.pt)
- Architecture: CNN (8→32→64→128→128, ~1M params)
- Input: Latent tensors [B, 8, T, 16]
- Training: Detects comb-filter interference patterns
- Purpose: Penalizes latent configurations that produce comb artifacts
CLAP Artifact MLP (discriminators/best_clap_medium.pt)
- Architecture: MLP (768→512→256→64→1, LayerNorm, GELU)
- Input: VoiceCLAP embeddings (768-dim)
- Training: Distinguishes clean from artifacted audio in embedding space
- Purpose: Perceptual artifact detection via CLAP representations
Usage
This LoRA checkpoint is designed to be loaded on top of the base DramaBox model:
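from safetensors.torch import load_file

# NOTE: load_dramabox_model and apply_lora are not defined in this snippet.
# They are assumed to be provided by the DramaBox codebase or the scripts in this repository.

# Load base DramaBox model
model = load_dramabox_model("ResembleAI/Dramabox")

# Apply LoRA weights (rank and alpha match the values in Model Details)
lora_weights = load_file("lora_weights.safetensors")
apply_lora(model, lora_weights, rank=128, alpha=128)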
Known Issues
- Subtle metallic/ringing artifacts: BigVGAN v2 vocoder produces harmonic artifacts when processing LoRA-modified mel spectrograms that are slightly out-of-distribution
- Not from stereo interference: L-R channel correlation is >0.999; artifacts are per-channel, intrinsic to the vocoder
- Root cause: Snake/SnakeBeta activations amplify small spectral deviations into audible harmonics
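The stereo check mentioned above can be reproduced with a Pearson correlation between the left and right channels. This is a minimal sketch with a synthetic tone standing in for decoded audio; `channel_correlation` is a hypothetical helper name, and how the reported figure was actually measured is not stated in this card.

```python
import math


def channel_correlation(left, right):
    """Pearson correlation coefficient between two equal-length channels."""
    n = len(left)
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    return cov / math.sqrt(var_l * var_r)


# Synthetic stand-in: a 220 Hz tone at 16 kHz with a slightly attenuated right channel.
sample_rate = 16000
left = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
right = [0.98 * s for s in left]
corr = channel_correlation(left, right)
```

A value near 1 means both channels carry essentially the same signal, which is consistent with the conclusion that the artifacts arise within each channel and not from differences between them.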
Training Data Sources
- DramaBox Voice Acting (~3,247 samples): Procedurally generated prompts from EmoNet/voice acting taxonomies, expanded by Gemma-4 LLM, covering diverse emotional scenarios and character archetypes
- Podcast Data (~11,966 samples): High-quality conversational speech from TTS-AGI/podcast-tokenized, decoded via DAC-VAE and annotated with Whisper
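Model Details lists a 50/50 mix of the two sources, while the source sizes above differ by a wide margin. One way to reconcile the two is source-balanced sampling, sketched below. Whether the training script samples this way is an assumption, and the function and variable names are hypothetical.

```python
import random


def balanced_weights(source_sizes, source_share):
    """Per-sample sampling weight so each source receives its target share of draws."""
    return {name: source_share[name] / size for name, size in source_sizes.items()}


sizes = {"voice_acting": 3247, "podcast": 11966}
share = {"voice_acting": 0.5, "podcast": 0.5}
weights = balanced_weights(sizes, share)

# Draw source labels for a batch; a real loader would then pick a sample within the source.
rng = random.Random(0)
batch_sources = rng.choices(list(share), weights=list(share.values()), k=8)

# How much more often one voice acting sample is drawn than one podcast sample.
oversample = weights["voice_acting"] / weights["podcast"]
```

Under this assumption, each voice acting sample would be drawn about 3.7 times as often as each podcast sample.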
License
Apache 2.0
Citation
@misc{laionbox2026,
title={LaionBox: Fine-tuning DramaBox TTS with Multi-Auxiliary Differentiable Losses},
author={LAION},
year={2026},
url={https://huggingface.co/laion/laionbox-v0.2-wip}
}