LaionBox v0.2-wip: Fine-tuned DramaBox with Multi-Auxiliary Losses

LaionBox is a fine-tuned version of DramaBox (a 22B-parameter DiT-based TTS model). Compared with the base model, it produces more natural, emotionally expressive speech with improved voice-cloning fidelity.

Key Improvements over Vanilla DramaBox

  • Higher naturalness: +11% CLAP naturalness score
  • Better voice cloning: +2% speaker similarity (WavLM-SV cosine)
  • Reduced artifacts: Quality probability 0.99+ (vs ~0.85 baseline)
  • Less background noise: Trained on enhanced, high-quality reference audio
  • More expressive: Trained on diverse voice acting scenarios, emotions, and archetypes

Model Details

Property              Value
Base Model            ResembleAI/Dramabox (LTX-2.3-22B Audio-Only)
Fine-tuning Method    LoRA (rank=128, alpha=128)
Trainable Parameters  15M / 3.2B total (0.5%)
Training Data         DramaBox voice acting (50%) + Podcasts (50%)
Auxiliary Losses      6 (naturalness, quality, centroid, speaker sim, comb filter, artifact v2)
Best Checkpoint       Step 80 (best flow matching loss = 0.418)
Training Hardware     8× A100 80GB
Precision             bfloat16
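
The trainable-parameter figure in the table can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the helper name `lora_param_count` and the 4096-wide layer are assumptions made for the example, not values taken from the training scripts.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Figures quoted in the Model Details table.
trainable = 15e6
total = 3.2e9
fraction_pct = 100 * trainable / total  # about 0.47, reported as 0.5%

# LoRA scales its update by alpha / rank; with rank=128 and alpha=128 this is 1.0.
rank, alpha = 128, 128
scale = alpha / rank

# Hypothetical 4096 x 4096 linear layer, used only to show the formula.
per_layer = lora_param_count(4096, 4096, rank)

print(f"trainable fraction: {fraction_pct:.2f}%")
print(f"LoRA scale: {scale}")
print(f"params for one 4096x4096 adapter: {per_layer}")
```

With alpha equal to rank, the adapter update is applied at unit scale, so the learned low-rank matrices are added to the base weights without extra rescaling.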

Auxiliary Losses

This checkpoint was trained with six differentiable auxiliary losses that guide the LoRA adapter toward producing more natural audio:

  1. CLAP Naturalness (ratio=3.0): Maximizes cosine similarity between generated audio's CLAP embedding and positive text prompts ("realistic, genuine, spontaneous, authentic, natural") while minimizing similarity to negative prompts ("distorted, unnatural, robotic")
  2. Quality MLP (included with naturalness): Binary classifier (768→256→64→1) trained on VoiceCLAP embeddings to distinguish real from synthetic audio
  3. Centroid Real/Fake (ratio=3.0): Pushes audio embeddings toward the centroid of real speech embeddings and away from the centroid of synthetic speech embeddings
  4. Speaker Similarity (ratio=3.0): WavLM-SV cosine similarity between reference and generated speaker embeddings
  5. Comb Filter Detector (ratio=3.0): CNN operating on latent space to detect comb-filter artifacts
  6. Artifact Detector V2 (ratio=3.0): Larger residual CNN for general artifact detection in latent space
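
The CLAP naturalness and centroid losses in the list above share one contrastive shape: raise similarity to a "good" target and lower similarity to a "bad" one. The following is a minimal sketch of that idea, not the formula from the training script. The function names are hypothetical, toy 2-D vectors stand in for 768-dim embeddings, and the role of `ratio=3.0` is not modeled because the card does not define it.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(audio: Vector,
                     positives: Sequence[Vector],
                     negatives: Sequence[Vector]) -> float:
    """Lower is better: high similarity to positives, low similarity to negatives."""
    pos = sum(cosine(audio, p) for p in positives) / len(positives)
    neg = sum(cosine(audio, n) for n in negatives) / len(negatives)
    return neg - pos


# Toy stand-ins for embeddings (assumed values, for illustration only).
audio_embedding = [1.0, 0.0]

# Naturalness: text-prompt embeddings for positive and negative descriptions.
positive_prompts = [[1.0, 0.0], [0.8, 0.6]]
negative_prompts = [[0.0, 1.0]]
naturalness_loss = contrastive_loss(audio_embedding, positive_prompts, negative_prompts)

# Centroid: the same shape with one real-speech and one synthetic-speech centroid.
real_centroid = [0.6, 0.8]
fake_centroid = [-1.0, 0.0]
centroid_loss = contrastive_loss(audio_embedding, [real_centroid], [fake_centroid])

print(f"naturalness loss: {naturalness_loss:.3f}")
print(f"centroid loss:    {centroid_loss:.3f}")
```

In actual training the same expression would be written with tensor operations so that it stays differentiable with respect to the generated audio.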

Critical Innovation: Differentiable Reward Chain

The key breakthrough enabling these auxiliary losses is a fully differentiable reward chain:

LoRA params → velocity prediction → x₀ recovery → VAE decoder → waveform → CLAP embedding → loss

This allows gradients to flow from perceptual quality metrics back to the LoRA parameters, directly steering the model toward higher-quality audio. Previous attempts with non-differentiable rewards (modulating loss magnitude only) produced no improvement.
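
The x₀ recovery step in the chain can be sketched under a common flow-matching convention. This is an assumption, since the card does not state the parameterization: the noisy sample is x_t = (1 - t)·x₀ + t·ε and the velocity target is v = ε - x₀, which gives x₀ = x_t - t·v. All function names below are hypothetical.

```python
from typing import List


def interpolate(x0: List[float], noise: List[float], t: float) -> List[float]:
    """Noisy sample on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]


def velocity_target(x0: List[float], noise: List[float]) -> List[float]:
    """Velocity of the straight path: noise minus data."""
    return [e - a for a, e in zip(x0, noise)]


def recover_x0(x_t: List[float], v: List[float], t: float) -> List[float]:
    """Invert the path: x0 = x_t - t * v (exact when v is the true velocity)."""
    return [xt - t * vi for xt, vi in zip(x_t, v)]


# Toy latent values (assumed, for illustration only).
clean = [0.5, -1.0, 2.0]
noise = [1.5, 0.0, -2.0]
t = 0.25

x_t = interpolate(clean, noise, t)
v = velocity_target(clean, noise)
estimate = recover_x0(x_t, v, t)

print(estimate)
```

Because the recovery is a single affine expression in the predicted velocity, a tensor version of it passes gradients straight through to whatever produced that velocity, which is what lets the later stages of the chain reach the LoRA parameters.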

Files

.
├── lora_weights.safetensors          # Main LoRA checkpoint (865 MB)
├── training_config.yaml              # Training configuration
├── training_args.yaml                # Full training arguments
├── discriminators/
│   ├── quality_classifier.pt         # Quality/MOS prediction MLP
│   ├── real_fake_classifier.pt       # Real vs AI detection MLP
│   ├── best_artifact_detector_v2.pt  # CNN artifact detector (latent space)
│   ├── best_comb_detector.pt         # CNN comb filter detector (latent space)
│   └── best_clap_medium.pt           # CLAP-based artifact MLP
├── scripts/
│   ├── dramabox_finetune_train_multi_aux.py  # Main training script (6-aux)
│   ├── dramabox_finetune_train.py            # Base training script
│   ├── dramabox_finetune_prepare.py          # Data preparation
│   ├── train_artifact_detector_v2.py         # Artifact detector training
│   ├── train_artifact_detector_clap.py       # CLAP artifact detector training
│   ├── train_comb_filter_detector.py         # Comb filter detector training
│   └── train_binary_classifiers.py           # Quality/real-fake classifier training
└── configs/
    └── *.yaml                                # All training configurations

Discriminator Models

Quality Classifier (discriminators/quality_classifier.pt)

  • Architecture: MLP (768→128→32→1, sigmoid)
  • Input: VoiceCLAP-small embeddings (768-dim)
  • Training: Binary classification (real=1, synthetic=0)
  • Accuracy: 100% on validation set
  • Purpose: Provides quality probability signal during training
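
To give a sense of scale for the quality classifier, the sketch below counts the parameters of a 768→128→32→1 MLP and relates a sigmoid output to its pre-activation logit. It assumes a bias on every layer and no normalization layers, which the card does not confirm. The helper names are hypothetical.

```python
import math
from typing import List


def mlp_param_count(widths: List[int]) -> int:
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))


def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


quality_widths = [768, 128, 32, 1]
n_params = mlp_param_count(quality_widths)
print(f"quality classifier parameters: {n_params}")  # 102593

# A quality probability of 0.99 corresponds to a logit of ln(0.99 / 0.01).
logit_for_099 = math.log(0.99 / 0.01)
print(f"logit for p=0.99: {logit_for_099:.3f}")
print(f"round trip: {sigmoid(logit_for_099):.2f}")
```

At roughly 0.1M parameters under these assumptions, the classifier is a small head compared with the base model.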

Real/Fake Classifier (discriminators/real_fake_classifier.pt)

  • Architecture: Same as quality classifier
  • Training: Distinguishes real recordings from TTS outputs
  • Purpose: Centroid-based distribution matching

Artifact Detector V2 (discriminators/best_artifact_detector_v2.pt)

  • Architecture: Residual CNN (8→64→128→256→384→512, ~10.6M params)
  • Input: Latent tensors [B, 8, T, 16]
  • Training: Binary classification of clean vs artifacted latents
  • Purpose: Detects vocoder artifacts directly in latent space (no decoding needed)

Comb Filter Detector (discriminators/best_comb_detector.pt)

  • Architecture: CNN (8→32→64→128→128, ~1M params)
  • Input: Latent tensors [B, 8, T, 16]
  • Training: Detects comb-filter interference patterns
  • Purpose: Penalizes latent configurations that produce comb artifacts

CLAP Artifact MLP (discriminators/best_clap_medium.pt)

  • Architecture: MLP (768→512→256→64→1, LayerNorm, GELU)
  • Input: VoiceCLAP embeddings (768-dim)
  • Training: Distinguishes clean from artifacted audio in embedding space
  • Purpose: Perceptual artifact detection via CLAP representations

Usage

This LoRA checkpoint is designed to be loaded on top of the base DramaBox model:

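from safetensors.torch import load_file

# NOTE: load_dramabox_model and apply_lora are placeholders for the loading
# helpers of your DramaBox setup; this snippet does not define them.

# Load base DramaBox model
model = load_dramabox_model("ResembleAI/Dramabox")

# Apply LoRA weights (rank and alpha match the Model Details table)
lora_weights = load_file("lora_weights.safetensors")  # dict: tensor name -> torch.Tensor
apply_lora(model, lora_weights, rank=128, alpha=128)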
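
The snippet above does not show what `apply_lora` does internally. A common way to apply a LoRA adapter, assumed here rather than confirmed for this checkpoint, is to merge each low-rank pair into its base weight as W' = W + (alpha / rank)·B·A. The sketch below uses plain Python lists and tiny shapes to stay dependency-free. `matmul` and `merge_lora` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny illustrative shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               rank: int, alpha: float) -> Matrix:
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 base weight with a rank-1 adapter (assumed values, for illustration only).
base = [[1.0, 0.0],
        [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, d_in)
lora_b = [[0.5], [0.0]]  # shape (d_out, rank)

merged = merge_lora(base, lora_a, lora_b, rank=1, alpha=1.0)
print(merged)
```

With rank=128 and alpha=128, as in this checkpoint, the scale factor is 1.0, so the merge reduces to adding B·A to the base weight.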

Known Issues

  • Subtle metallic/ringing artifacts: BigVGAN v2 vocoder produces harmonic artifacts when processing LoRA-modified mel spectrograms that are slightly out-of-distribution
  • Not from stereo interference: L-R channel correlation is >0.999; artifacts are per-channel, intrinsic to the vocoder
  • Root cause: Snake/SnakeBeta activations amplify small spectral deviations into audible harmonics
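
The L-R correlation figure above can be reproduced for any stereo clip with a Pearson correlation between the two channels. The sketch below uses a synthetic signal in place of real model output. The 16 kHz sample rate, the tone frequencies, and the `pearson` helper are assumptions made for the example.

```python
import math
from typing import Sequence


def pearson(left: Sequence[float], right: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length channels."""
    n = len(left)
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    return cov / math.sqrt(var_l * var_r)


# Synthetic stand-in for a stereo clip: a 220 Hz tone on the left channel and a
# near-copy on the right with a very small 3 kHz component added.
sample_rate = 16000
left = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
right = [0.999 * s + 0.0005 * math.cos(2 * math.pi * 3000 * i / sample_rate)
         for i, s in enumerate(left)]

corr = pearson(left, right)
print(f"L-R correlation: {corr:.6f}")
```

A correlation this close to 1 means the two channels are nearly identical, which is consistent with artifacts that arise within each channel rather than from interference between them.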

Training Data Sources

  1. DramaBox Voice Acting (~3,247 samples): Procedurally generated prompts from EmoNet/voice acting taxonomies, expanded by Gemma-4 LLM, covering diverse emotional scenarios and character archetypes
  2. Podcast Data (~11,966 samples): High-quality conversational speech from TTS-AGI/podcast-tokenized, decoded via DAC-VAE and annotated with Whisper
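
The two sources differ in size, yet the Model Details table lists a 50/50 training mix. One way to get that balance, assuming the percentages refer to per-draw sampling probability (the card does not say), is to weight each sample inversely to the size of its source. The variable names below are hypothetical.

```python
import random
from collections import Counter

# Approximate sample counts from the list above, and the target mix.
source_sizes = {"dramabox_voice_acting": 3247, "podcasts": 11966}
target_share = {"dramabox_voice_acting": 0.5, "podcasts": 0.5}

# Per-sample weight chosen so that each source's weights sum to its target share.
per_sample_weight = {name: target_share[name] / size
                     for name, size in source_sizes.items()}

# How much more often a voice-acting sample is drawn than a podcast sample.
oversampling = (per_sample_weight["dramabox_voice_acting"]
                / per_sample_weight["podcasts"])

# Draw source labels with a fixed seed to check the empirical mix.
rng = random.Random(0)
names = list(source_sizes)
draws = rng.choices(names, weights=[target_share[n] for n in names], k=10_000)
counts = Counter(draws)

print(f"oversampling factor: {oversampling:.2f}")
print(dict(counts))
```

Under this assumption each voice-acting sample is seen about 3.7 times as often as each podcast sample, which is worth keeping in mind when judging overfitting on the smaller set.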

License

Apache 2.0

Citation

@misc{laionbox2026,
  title={LaionBox: Fine-tuning DramaBox TTS with Multi-Auxiliary Differentiable Losses},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.2-wip}
}