+---
+license: mit
+language:
+  - en
+tags:
+  - audio
+  - music
+  - vae
+  - autoencoder
+  - ace-step
+  - acestep
+  - decoder
+  - oobleck
+  - music-generation
+library_name: diffusers
+pipeline_tag: audio-to-audio
+base_model: ACE-Step/ace-step-v1.5-1d-vae-stable-audio-format
+---
+# ScragVAE — Improved VAE Decoder for ACE-Step 1.5
+A fine-tuned **AutoencoderOobleck** decoder intended to improve audio fidelity in the [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5) music generation pipeline. It is drop-in compatible with all existing ACE-Step DiT checkpoints.
+## What is this?
+ACE-Step 1.5 uses a VAE (Variational Autoencoder) to convert between audio waveforms and the latent space that the DiT diffusion model operates in. The original VAE decoder attenuates high-frequency content, resulting in audio with reduced clarity and detail above 6kHz.
+ScragVAE retrains the decoder half of the VAE to better reconstruct upper harmonics, transient detail, and spectral "air" — while keeping the encoder frozen so all existing DiT models remain fully compatible.
+## Benchmarks
+Objective spectral analysis comparing ScragVAE with the original ACE-Step 1.5 VAE decoder on identical latents (same seed, same DiT output):
+| Metric | ScragVAE | Original VAE | Improvement |
+|--------|----------|-------------|-------------|
+| Dynamic range | 85.8 dB | 56.5 dB | **+29.3 dB** |
+| HF energy ratio (>8kHz) | 1.17% | 0.85% | **+38%** |
+| HF energy ratio (>12kHz) | 0.21% | 0.12% | **+83%** |
+| Band: brilliance (6–12kHz) | 43.0 dB | 42.4 dB | **+0.6 dB** |
+| Band: air (12–24kHz) | 30.5 dB | 28.2 dB | **+2.3 dB** |
+| Spectral rolloff (95%) | 3326 Hz | 2901 Hz | **+425 Hz** |
+| Spectral centroid | 3662 Hz | 3447 Hz | +214 Hz (brighter) |
+> **Summary:** On these measurements, ScragVAE preserves more high-frequency content (+38% energy above 8 kHz, +2.3 dB in the 12–24 kHz band) and shows 29.3 dB more dynamic range than the original decoder, resulting in clearer vocals, crisper transients, and more natural-sounding audio.
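
The analysis script behind these numbers is not included in this card. As a rough illustration of what the spectral metrics mean, the sketch below computes an HF energy ratio, a power-weighted spectral centroid, and a 95% rolloff on a synthetic two-tone signal. The function names, the power-weighted definitions, and the naive pure-Python DFT are assumptions chosen to keep the example dependency-free; they are not the author's measurement code.

```python
import cmath
import math

def power_spectrum(samples, sample_rate):
    """Return (frequencies_hz, power) for the non-negative DFT bins (naive O(N^2) DFT)."""
    n = len(samples)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power

def hf_energy_ratio(freqs, power, cutoff_hz):
    """Fraction of total spectral power strictly above cutoff_hz."""
    total = sum(power)
    return sum(p for f, p in zip(freqs, power) if f > cutoff_hz) / total

def spectral_centroid(freqs, power):
    """Power-weighted mean frequency in Hz."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)

def spectral_rolloff(freqs, power, fraction=0.95):
    """Lowest frequency at which cumulative power reaches `fraction` of the total."""
    threshold = fraction * sum(power)
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= threshold:
            return f
    return freqs[-1]

# Synthetic signal: a strong 1 kHz tone plus a weak 10 kHz tone at 48 kHz.
SAMPLE_RATE = 48_000
N = 480  # 10 ms of audio, so DFT bins fall on exact multiples of 100 Hz
signal = [
    math.sin(2 * math.pi * 1_000 * i / SAMPLE_RATE)
    + 0.1 * math.sin(2 * math.pi * 10_000 * i / SAMPLE_RATE)
    for i in range(N)
]
freqs, power = power_spectrum(signal, SAMPLE_RATE)
hf_ratio = hf_energy_ratio(freqs, power, 8_000)   # share of power above 8 kHz
centroid = spectral_centroid(freqs, power)        # in Hz
rolloff = spectral_rolloff(freqs, power)          # in Hz
```

With NumPy available, `numpy.fft.rfft` would replace the naive DFT. Real music would also need windowing and averaging across frames, so values from this sketch are not directly comparable to the table.
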
+## Files
+| File | Format | Size | Use with |
+|------|--------|------|----------|
+| `diffusion_pytorch_model.safetensors` | F32 safetensors | 644 MB | Python / Diffusers / HOT-Step 9000 |
+| `scragvae-BF16.gguf` | BF16 GGUF | 322 MB | [acestep.cpp](https://github.com/ace-step/acestep.cpp) / HOT-Step CPP |
+| `config.json` | JSON | <1 KB | Architecture config (required for both) |
+## Usage
+### Python / Diffusers
+ScragVAE is a drop-in replacement for the ACE-Step VAE. Replace the VAE checkpoint path in your pipeline:
+```python
+from diffusers import AutoencoderOobleck
+# Load ScragVAE instead of the default VAE
+vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")
+# Use with your existing ACE-Step pipeline
+# (replace the vae in your pipeline config or checkpoint directory)
+```
+Or manually swap the decoder weights in an existing setup:
+```python
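+from safetensors.torch import load_file
+# `your_vae` is assumed to be an AutoencoderOobleck instance you have already loaded
+# Load ScragVAE weights
+scrag_weights = load_file("diffusion_pytorch_model.safetensors")
+# Only decoder.* keys differ; encoder.* keys are identical to the original
+decoder_keys = {k: v for k, v in scrag_weights.items() if k.startswith("decoder.")}
+# strict=False tolerates the encoder.* keys that are absent from decoder_keys
+result = your_vae.load_state_dict(decoder_keys, strict=False)
+# Fail loudly if any decoder key does not exist in the target model
+assert not result.unexpected_keys, result.unexpected_keys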
+```
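
The statement that only `decoder.*` keys differ can be checked before swapping weights. The sketch below compares two state-dict-like mappings and groups the differing keys by their top-level prefix. `differing_prefixes` is a hypothetical helper, not part of Diffusers or safetensors, and the toy dictionaries stand in for real checkpoints; for real tensors, pass `equal=torch.equal`.

```python
from collections import Counter

def differing_prefixes(original, candidate, equal=lambda a, b: a == b):
    """Count keys whose values differ, grouped by the text before the first dot.

    Keys present in only one mapping are counted as differing.
    """
    counts = Counter()
    for key in sorted(set(original) | set(candidate)):
        same = key in original and key in candidate and equal(original[key], candidate[key])
        if not same:
            counts[key.split(".", 1)[0]] += 1
    return dict(counts)

# Toy stand-ins for two checkpoints; real state dicts map names to tensors.
original = {"encoder.conv1.bias": [0.1, 0.2], "decoder.conv1.bias": [0.3, 0.4]}
candidate = {"encoder.conv1.bias": [0.1, 0.2], "decoder.conv1.bias": [0.5, 0.6]}

report = differing_prefixes(original, candidate)
print(report)  # {'decoder': 1}
```

A report containing any `encoder` entries would suggest the two checkpoints do not share an encoder, in which case the swap would not be latent-compatible.
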
+### acestep.cpp / HOT-Step CPP
+Place `scragvae-BF16.gguf` in your models directory alongside the other GGUF files:
+```
+models/
+├── acestep-v15-turbo-BF16.gguf    # DiT
+├── acestep-5Hz-lm-BF16.gguf       # LM
+├── Qwen3-Embedding-BF16.gguf      # Text encoder
+├── vae-BF16.gguf                  # Original VAE
+└── scragvae-BF16.gguf             # ← ScragVAE (add this)
+```
+The engine auto-discovers all VAE GGUFs at startup. In HOT-Step CPP, select **ScragVAE** from the **VAE Decoder** dropdown in the Models & Adapters panel.
+For acestep.cpp's built-in web UI or API, pass `"vae_model": "scragvae-BF16.gguf"` in your synth request JSON.
+### Converting from safetensors to GGUF yourself
+If you need to reconvert (e.g. after further fine-tuning):
+```bash
+python engine/convert.py  # scans checkpoints/ and outputs to models/
+```
+Or use the converter directly:
+```python
+from convert import convert_model
+convert_model("scragvae", "/path/to/scragvae/", "scragvae-BF16.gguf", "vae")
+```
+## Architecture
+ScragVAE uses the same **AutoencoderOobleck** architecture as the original ACE-Step VAE — no structural changes. Only the decoder weights differ.
+| Parameter | Value |
+|-----------|-------|
+| Architecture | AutoencoderOobleck |
+| Audio channels | 2 (stereo) |
+| Sample rate | 48,000 Hz |
+| Latent dim | 64 |
+| Decoder channels | 128 |
+| Channel multiples | [1, 2, 4, 8, 16] |
+| Downsampling ratios | [2, 4, 4, 6, 10] |
+| Total ratio | 1920× |
+| Activation | Snake |
+| Weight normalization | Yes (fused at load in GGUF) |
+| Parameters | 168.7M (encoder + decoder) |
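
The total ratio in the table, and the latent frame rate it implies, follow directly from the downsampling ratios and the sample rate. A small check, with variable and function names chosen for illustration and ignoring any padding the model may apply:

```python
import math

DOWNSAMPLING_RATIOS = [2, 4, 4, 6, 10]  # from the table above
SAMPLE_RATE = 48_000                    # Hz

total_ratio = math.prod(DOWNSAMPLING_RATIOS)  # audio samples per latent frame
latent_rate = SAMPLE_RATE / total_ratio       # latent frames per second

def latent_frames(duration_s):
    """Number of whole latent frames covering `duration_s` seconds of audio."""
    return int(duration_s * SAMPLE_RATE) // total_ratio

print(total_ratio)        # 1920
print(latent_rate)        # 25.0
print(latent_frames(10))  # 250
```

So a 10 second clip maps to roughly 250 latent frames of 64 channels each.
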
+### Compatibility
+- ✅ All ACE-Step 1.5 DiT checkpoints (turbo, SFT, XL)
+- ✅ All LoRA/adapter models
+- ✅ Both Python (PyTorch/Diffusers) and C++ (ggml/acestep.cpp) runtimes
+- ✅ Encoder weights are identical — no retraining of upstream models needed
+## Training
+### Strategy
+**Freeze encoder → train decoder only.** The DiT operates in latent space; by only improving the decoder, all existing DiT checkpoints remain compatible without retraining.
+### Two-phase training
+| Parameter | Phase 1 (Warm-up) | Phase 2 (Quality) |
+|-----------|-------------------|-------------------|
+| Steps | ~3,000 | ~98,000 |
+| Learning rate | 3e-5 | 3e-5 |
+| Adversarial weight | 0.5 | **1.5** |
+| Feature matching | 5.0 | **3.0** |
+| Perceptual weighting | On | **Off** |
+| L1 time domain | 0.0 | **0.05** |
+| Discriminator FFT sizes | 6 | **6 (+4096)** |
+| Spectral loss FFT sizes | — | **9 (32–8192)** |
+| Multi-res mel loss | — | **4 scales** |
+| Precision | bf16-mixed | bf16-mixed |
+| Effective batch | 16 (8×2 accum) | 16 (8×2 accum) |
+| Gradient clip | 1.0 | 1.0 |
+### Key changes vs original training
+- **Disabled perceptual weighting** in the spectral loss — the original's perceptual curve de-emphasizes high frequencies, actively suppressing HF reconstruction
+- **Increased adversarial weight** (0.5 → 1.5) — forces the decoder to produce more realistic spectral detail
+- **Reduced feature matching** (5.0 → 3.0) — less over-smoothing from discriminator feature constraints
+- **Added L1 time-domain loss** (0.05) — preserves transient attacks and waveform fidelity
+- **Added 4096-point FFT** to discriminator: gives the discriminator finer frequency resolution (about 11.7 Hz per bin at 48 kHz) for harmonic content in the 2–8 kHz range
+- **Added multi-resolution mel-spectrogram loss** at 4 scales — captures perceptually relevant frequency content
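
To make the FFT-size choices concrete, the sketch below lists the frequency resolution and window duration for each size at the model's 48 kHz sample rate. The list of sizes assumes the nine spectral-loss FFT sizes are the powers of two from 32 to 8192; the training table gives only the count and the range, so treat that list as an assumption.

```python
SAMPLE_RATE = 48_000  # Hz, from the architecture table

# Assumption: powers of two from 32 to 8192 (nine values, matching the stated count).
fft_sizes = [2 ** e for e in range(5, 14)]

def bin_width_hz(n_fft, sample_rate=SAMPLE_RATE):
    """Spacing between adjacent FFT bins in Hz."""
    return sample_rate / n_fft

def window_ms(n_fft, sample_rate=SAMPLE_RATE):
    """Duration of one analysis window in milliseconds."""
    return 1000 * n_fft / sample_rate

for n_fft in fft_sizes:
    print(f"{n_fft:>5}  {bin_width_hz(n_fft):9.3f} Hz/bin  {window_ms(n_fft):8.3f} ms")
```

A 4096-point FFT gives about 11.7 Hz per bin at the cost of an 85 ms window, which is the usual trade-off between harmonic resolution and transient localization. Small sizes such as 32 sit at the other end of that trade-off.
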
+### Hardware
+- **GPU:** NVIDIA RTX 5090 (32GB)
+- **Training time:** ~8 hours total (Phase 1 + Phase 2)
+- **Framework:** PyTorch + stable-audio-tools
+## License
+MIT License — same as ACE-Step 1.5.
+## Citation
+If you use ScragVAE in your work:
+```bibtex
+@misc{scragvae2026,
+  title={ScragVAE: Improved VAE Decoder for ACE-Step 1.5},
+  author={Scragnog},
+  year={2026},
+  url={https://huggingface.co/scragnog/Ace-Step-1.5-ScragVAE}
+}
+```
+## Acknowledgements
+- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5) — the base model and VAE architecture
+- [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) — training framework
+- [acestep.cpp](https://github.com/ace-step/acestep.cpp) — C++ inference engine with GGUF support