LaionBox v0.2-wip — Fine-tuned DramaBox with Multi-Auxiliary Losses

LaionBox is a fine-tuned version of DramaBox (a 22B-parameter DiT-based TTS model) that produces more natural, emotionally expressive speech with improved voice cloning fidelity.

Key Improvements over Vanilla DramaBox

Higher naturalness: +11% CLAP naturalness score
Better voice cloning: +2% speaker similarity (WavLM-SV cosine)
Reduced artifacts: Quality probability 0.99+ (vs ~0.85 baseline)
Less background noise: Trained on enhanced, high-quality reference audio
More expressive: Trained on diverse voice acting scenarios, emotions, and archetypes

Model Details

Property	Value
Base Model	ResembleAI/Dramabox (LTX-2.3-22B Audio-Only)
Fine-tuning Method	LoRA (rank=128, alpha=128)
Trainable Parameters	~15M / 3.2B total (~0.5%)
Training Data	DramaBox voice acting (50%) + Podcasts (50%)
Auxiliary Losses	6 (naturalness, quality, centroid, speaker sim, comb filter, artifact v2)
Best Checkpoint	Step 80 (best flow matching loss = 0.418)
Training Hardware	8× A100 80GB
Precision	bfloat16
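
For reference, under the standard LoRA formulation the settings in the table imply a scaling factor of alpha / rank = 1.0. The sketch below computes that factor and the number of parameters a single adapter adds to one linear layer. The 4096-wide layer is a hypothetical example, not a dimension taken from DramaBox, and the helper names are illustrative only.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params_per_layer(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# rank and alpha come from the table; the layer width is a hypothetical example.
scale = lora_scale(rank=128, alpha=128)
added = lora_params_per_layer(d_in=4096, d_out=4096, rank=128)
print(scale, added)
```

With alpha equal to rank, the low-rank update is added to the frozen weight without any extra rescaling.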

Auxiliary Losses

This checkpoint was trained with 6 differentiable auxiliary losses that guide the LoRA adapter toward producing more natural audio:

CLAP Naturalness (ratio=3.0): Maximizes cosine similarity between generated audio's CLAP embedding and positive text prompts ("realistic, genuine, spontaneous, authentic, natural") while minimizing similarity to negative prompts ("distorted, unnatural, robotic")
Quality MLP (included with naturalness): Binary classifier (768→256→64→1) trained on VoiceCLAP embeddings to distinguish real from synthetic audio
Centroid Real/Fake (ratio=3.0): Pushes audio embeddings toward the centroid of real speech embeddings and away from synthetic speech centroid
Speaker Similarity (ratio=3.0): WavLM-SV cosine similarity between reference and generated speaker embeddings
Comb Filter Detector (ratio=3.0): CNN operating on latent space to detect comb-filter artifacts
Artifact Detector V2 (ratio=3.0): Larger residual CNN for general artifact detection in latent space
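
The CLAP naturalness objective can be pictured as a contrast between similarity to positive prompts and similarity to negative prompts. The following is a minimal sketch of that idea in plain Python with toy 3-dimensional vectors. The exact loss form used in the training script is not given in this card, so `contrastive_prompt_loss` and its "negative minus positive" formulation are assumptions, and real training would operate on framework tensors so that gradients can flow.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_prompt_loss(audio_emb, positive_embs, negative_embs):
    """Assumed form: mean similarity to negatives minus mean similarity to positives.

    Lower values mean the audio embedding sits closer to the positive prompts.
    """
    pos = sum(cosine(audio_emb, p) for p in positive_embs) / len(positive_embs)
    neg = sum(cosine(audio_emb, n) for n in negative_embs) / len(negative_embs)
    return neg - pos


# Toy stand-ins for text-prompt embeddings (real CLAP embeddings are much larger).
positive_embs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
negative_embs = [[0.0, 1.0, 0.0]]

natural_like = [1.0, 0.1, 0.0]
robotic_like = [0.1, 1.0, 0.0]

loss_natural = contrastive_prompt_loss(natural_like, positive_embs, negative_embs)
loss_robotic = contrastive_prompt_loss(robotic_like, positive_embs, negative_embs)
print(loss_natural < loss_robotic)
```

The speaker similarity term uses the same cosine measure, applied to reference and generated speaker embeddings instead of text prompts.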

Critical Innovation: Differentiable Reward Chain

These auxiliary losses are made possible by a fully differentiable reward chain:

LoRA params → velocity prediction → x₀ recovery → VAE decoder → waveform → CLAP embedding → loss

This allows gradients to flow from perceptual quality metrics back to the LoRA parameters, directly steering the model toward higher-quality audio. Previous attempts with non-differentiable rewards (modulating loss magnitude only) produced no improvement.
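
The x₀ recovery step can be illustrated under a common rectified-flow convention. This convention is an assumption, since the card does not state the parameterization: the noisy latent is x_t = (1 - t) x₀ + t ε and the velocity target is v = ε - x₀, which gives x₀ = x_t - t v. The sketch below checks that identity on toy values; `recover_x0` is a hypothetical helper name.

```python
def recover_x0(x_t, v_pred, t):
    """One-step clean-latent estimate x0 = x_t - t * v (assumed convention)."""
    return [xt - t * v for xt, v in zip(x_t, v_pred)]


# Toy check with a known clean latent and a known noise sample.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.3, -0.7]
t = 0.4

x_t = [(1 - t) * a + t * e for a, e in zip(x0, eps)]  # noisy latent
v_true = [e - a for a, e in zip(x0, eps)]             # ideal velocity

x0_hat = recover_x0(x_t, v_true, t)
print(x0_hat)
```

Under this convention the estimate is linear in the predicted velocity, so a gradient with respect to x₀ reaches the velocity prediction scaled by -t, and from there the LoRA parameters.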

Files

.
├── lora_weights.safetensors          # Main LoRA checkpoint (865 MB)
├── training_config.yaml              # Training configuration
├── training_args.yaml                # Full training arguments
├── discriminators/
│   ├── quality_classifier.pt         # Quality/MOS prediction MLP
│   ├── real_fake_classifier.pt       # Real vs AI detection MLP
│   ├── best_artifact_detector_v2.pt  # CNN artifact detector (latent space)
│   ├── best_comb_detector.pt         # CNN comb filter detector (latent space)
│   └── best_clap_medium.pt           # CLAP-based artifact MLP
├── scripts/
│   ├── dramabox_finetune_train_multi_aux.py  # Main training script (6-aux)
│   ├── dramabox_finetune_train.py            # Base training script
│   ├── dramabox_finetune_prepare.py          # Data preparation
│   ├── train_artifact_detector_v2.py         # Artifact detector training
│   ├── train_artifact_detector_clap.py       # CLAP artifact detector training
│   ├── train_comb_filter_detector.py         # Comb filter detector training
│   └── train_binary_classifiers.py           # Quality/real-fake classifier training
└── configs/
    └── *.yaml                                # All training configurations

Discriminator Models

Quality Classifier (`discriminators/quality_classifier.pt`)

Architecture: MLP (768→128→32→1, sigmoid)
Input: VoiceCLAP-small embeddings (768-dim)
Training: Binary classification (real=1, synthetic=0)
Accuracy: 100% on validation set
Purpose: Provides quality probability signal during training
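
As a rough size check, the parameter count of a fully connected stack follows directly from its layer widths. The sketch below assumes every layer has a bias term, which the card does not state; `mlp_param_count` is an illustrative helper.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected stack with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Widths taken from the architecture line of the quality classifier.
quality_classifier_params = mlp_param_count([768, 128, 32, 1])
print(quality_classifier_params)  # 102593
```

At roughly 0.1M parameters, the classifier is tiny compared with the TTS backbone.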

Real/Fake Classifier (`discriminators/real_fake_classifier.pt`)

Architecture: Same as quality classifier
Training: Distinguishes real recordings from TTS outputs
Purpose: Centroid-based distribution matching
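
Centroid-based matching can be sketched as follows: compute the mean embedding of real speech and of synthetic speech, then score a generated embedding by how much closer it is to the real centroid. This is a minimal illustration with toy 2-dimensional vectors. The use of cosine similarity and the "fake minus real" form of `centroid_loss` are assumptions, not details confirmed by this card.

```python
import math


def centroid(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid_loss(emb, real_centroid, fake_centroid):
    """Assumed form: lower when emb is near the real centroid and far from the fake one."""
    return cosine(emb, fake_centroid) - cosine(emb, real_centroid)


# Toy embeddings standing in for real and synthetic speech.
real_centroid = centroid([[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]])
fake_centroid = centroid([[0.1, 1.0], [0.0, 0.9], [0.2, 1.1]])

loss_near_real = centroid_loss([1.0, 0.2], real_centroid, fake_centroid)
loss_near_fake = centroid_loss([0.2, 1.0], real_centroid, fake_centroid)
print(loss_near_real < loss_near_fake)
```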

Artifact Detector V2 (`discriminators/best_artifact_detector_v2.pt`)

Architecture: Residual CNN (8→64→128→256→384→512, ~10.6M params)
Input: Latent tensors [B, 8, T, 16]
Training: Binary classification of clean vs artifacted latents
Purpose: Detects vocoder artifacts directly in latent space (no decoding needed)

Comb Filter Detector (`discriminators/best_comb_detector.pt`)

Architecture: CNN (8→32→64→128→128, ~1M params)
Input: Latent tensors [B, 8, T, 16]
Training: Detects comb-filter interference patterns
Purpose: Penalizes latent configurations that produce comb artifacts

CLAP Artifact MLP (`discriminators/best_clap_medium.pt`)

Architecture: MLP (768→512→256→64→1, LayerNorm, GELU)
Input: VoiceCLAP embeddings (768-dim)
Training: Distinguishes clean from artifacted audio in embedding space
Purpose: Perceptual artifact detection via CLAP representations

Usage

This LoRA checkpoint is designed to be loaded on top of the base DramaBox model:

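from safetensors.torch import load_file

# NOTE: load_dramabox_model and apply_lora are placeholder helpers that this card
# does not define; substitute the model-loading and LoRA-injection routines from
# your DramaBox inference code.

# Load base DramaBox model
model = load_dramabox_model("ResembleAI/Dramabox")

# Load the LoRA tensors and apply them with the training-time rank and alpha
lora_weights = load_file("lora_weights.safetensors")
apply_lora(model, lora_weights, rank=128, alpha=128)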
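
For readers unfamiliar with what applying a LoRA adapter involves, the conventional operation is to add a scaled low-rank product to each adapted weight: W' = W + (alpha / rank) · B · A. The sketch below shows this on a toy 2×2 layer in plain Python. Whether the DramaBox loading code merges weights this way or keeps the adapter separate is not stated in this card, and `merge_lora` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A) without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, d_in)
lora_b = [[0.5], [0.0]]  # shape (d_out, rank)

merged = merge_lora(weight, lora_a, lora_b, rank=1, alpha=1)
print(merged)
```

With rank=128 and alpha=128, as used for this checkpoint, the scale factor in this formulation is 1.0.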

Known Issues

Subtle metallic/ringing artifacts: BigVGAN v2 vocoder produces harmonic artifacts when processing LoRA-modified mel spectrograms that are slightly out-of-distribution
Not from stereo interference: L-R channel correlation is >0.999; artifacts are per-channel, intrinsic to the vocoder
Root cause: Snake/SnakeBeta activations amplify small spectral deviations into audible harmonics
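
The stereo check mentioned above can be reproduced with a plain Pearson correlation between the left and right channels. The sketch below uses a synthetic sine wave with a very small per-sample difference as a stand-in for real audio. It illustrates the measurement only, not the script that produced the >0.999 figure.

```python
import math


def pearson(left, right):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(left)
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    return cov / math.sqrt(var_l * var_r)


# Toy stereo signal: the right channel is the left plus a tiny deterministic offset.
left = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
right = [s + 0.001 * ((i % 3) - 1) for i, s in enumerate(left)]

corr = pearson(left, right)
print(corr > 0.999)
```

A correlation this close to 1 means the two channels carry nearly identical content, which is consistent with artifacts that are present in each channel rather than created by differences between them.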

Training Data Sources

DramaBox Voice Acting (~3,247 samples): Procedurally generated prompts from EmoNet/voice acting taxonomies, expanded by Gemma-4 LLM, covering diverse emotional scenarios and character archetypes
Podcast Data (~11,966 samples): High-quality conversational speech from TTS-AGI/podcast-tokenized, decoded via DAC-VAE and annotated with Whisper
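
Because the two sources differ in size but are mixed 50/50, one straightforward way to achieve that ratio is to pick the source first and then draw a sample from it. The sketch below illustrates this with stand-in sample IDs. How the training script actually balances the mix is not described in this card, so `sample_balanced` is an assumption.

```python
import random


def sample_balanced(drama, podcasts, n, seed=0):
    """Draw n items, choosing each source with probability 0.5 before sampling from it."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        source = drama if rng.random() < 0.5 else podcasts
        batch.append(rng.choice(source))
    return batch


# Stand-in IDs sized like the two sources listed above.
drama = [("drama", i) for i in range(3247)]
podcasts = [("podcast", i) for i in range(11966)]

batch = sample_balanced(drama, podcasts, n=10000)
drama_share = sum(1 for name, _ in batch if name == "drama") / len(batch)
print(round(drama_share, 2))
```

Under this scheme each DramaBox sample is drawn roughly 3.7 times as often as each podcast sample, since the podcast set is about 3.7 times larger.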

License

Apache 2.0

Citation

@misc{laionbox2026,
  title={LaionBox: Fine-tuning DramaBox TTS with Multi-Auxiliary Differentiable Losses},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.2-wip}
}
