scragnog committed on
Commit
0547ba3
·
verified ·
1 Parent(s): 8eed100

Update README.md

Files changed (1)
  1. README.md +200 -3
README.md CHANGED
@@ -1,3 +1,200 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - audio
7
+ - music
8
+ - vae
9
+ - autoencoder
10
+ - ace-step
11
+ - acestep
12
+ - decoder
13
+ - oobleck
14
+ - music-generation
15
+ library_name: diffusers
16
+ pipeline_tag: audio-to-audio
17
+ base_model: ACE-Step/ace-step-v1.5-1d-vae-stable-audio-format
18
+ ---
19
+
20
+ # ScragVAE: Improved VAE Decoder for ACE-Step 1.5
21
+
22
+ A fine-tuned **AutoencoderOobleck** decoder intended to improve audio fidelity in the [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5) music generation pipeline. It is drop-in compatible with all existing ACE-Step DiT checkpoints.
23
+
24
+ ## What is this?
25
+
26
+ ACE-Step 1.5 uses a VAE (Variational Autoencoder) to convert between audio waveforms and the latent space that the DiT diffusion model operates in. The original VAE decoder attenuates high-frequency content, resulting in audio with reduced clarity and detail above 6kHz.
27
+
28
+ ScragVAE retrains the decoder half of the VAE to better reconstruct upper harmonics, transient detail, and spectral "air", while keeping the encoder frozen so that all existing DiT models remain fully compatible.
29
+
30
+ ## Benchmarks
31
+
32
+ Objective spectral analysis comparing ScragVAE vs the original ACE-Step 1.5 VAE decoder on identical latents (same seed, same DiT output):
33
+
34
+ | Metric | ScragVAE | Original VAE | Improvement |
35
+ |--------|----------|-------------|-------------|
36
+ | Dynamic range | 85.8 dB | 56.5 dB | **+29.3 dB** |
37
+ | HF energy ratio (>8kHz) | 1.17% | 0.85% | **+38%** |
38
+ | HF energy ratio (>12kHz) | 0.21% | 0.12% | **+83%** |
39
+ | Band: brilliance (6–12kHz) | 43.0 dB | 42.4 dB | **+0.6 dB** |
39
+ | Band: air (12–24kHz) | 30.5 dB | 28.2 dB | **+2.3 dB** |
41
+ | Spectral rolloff (95%) | 3326 Hz | 2901 Hz | **+425 Hz** |
42
+ | Spectral centroid | 3662 Hz | 3447 Hz | +214 Hz (brighter) |
43
+
44
+ > **Summary:** In this comparison ScragVAE preserves more high-frequency content (especially 10–20kHz) and measures 29.3 dB more dynamic range, resulting in clearer vocals, crisper transients, and more natural-sounding audio.
45
+
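The card does not say how the spectral metrics above were computed. As a rough illustration only, the sketch below assumes a single FFT over a mono signal and power-weighted definitions of the high-frequency energy ratio, 95% rolloff, and centroid; the function name `spectral_metrics` and the synthetic test tone are hypothetical and not part of the original benchmark.

```python
import numpy as np


def spectral_metrics(signal, sample_rate):
    """Return HF energy ratios, 95% rolloff and centroid for a mono signal.

    Assumption: metrics are derived from one power spectrum of the whole clip.
    """
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()

    def hf_ratio(cutoff_hz):
        # Share of total spectral energy located above the cutoff frequency
        return power[freqs > cutoff_hz].sum() / total

    # Rolloff: lowest frequency below which 95% of the energy is contained
    cumulative = np.cumsum(power)
    rolloff_idx = min(np.searchsorted(cumulative, 0.95 * total), len(freqs) - 1)

    return {
        "hf_ratio_8k": hf_ratio(8000.0),
        "hf_ratio_12k": hf_ratio(12000.0),
        "rolloff_95_hz": freqs[rolloff_idx],
        "centroid_hz": (freqs * power).sum() / total,
    }


# Synthetic check: a strong 1 kHz tone plus a weak 10 kHz tone, one second long
sr = 48000
t = np.arange(sr) / sr
test_signal = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.sin(2 * np.pi * 10000 * t)
metrics = spectral_metrics(test_signal, sr)
```

Running the same function on the two decoders' outputs for one latent would give a like-for-like comparison, although the absolute values depend on the analysis settings and need not match the table.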
46
+ ## Files
47
+
48
+ | File | Format | Size | Use with |
49
+ |------|--------|------|----------|
50
+ | `diffusion_pytorch_model.safetensors` | F32 safetensors | 644 MB | Python / Diffusers / HOT-Step 9000 |
51
+ | `scragvae-BF16.gguf` | BF16 GGUF | 322 MB | [acestep.cpp](https://github.com/ace-step/acestep.cpp) / HOT-Step CPP |
52
+ | `config.json` | JSON | <1 KB | Architecture config (required for both) |
53
+
54
+ ## Usage
55
+
56
+ ### Python / Diffusers
57
+
58
+ ScragVAE is a drop-in replacement for the ACE-Step VAE. Replace the VAE checkpoint path in your pipeline:
59
+
60
+ ```python
61
+ from diffusers import AutoencoderOobleck
62
+
63
+ # Load ScragVAE instead of the default VAE
64
+ vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")
65
+
66
+ # Use with your existing ACE-Step pipeline
67
+ # (replace the vae in your pipeline config or checkpoint directory)
68
+ ```
69
+
70
+ Or manually swap the decoder weights in an existing setup:
71
+
72
+ ```python
73
+ import torch
74
+ from safetensors.torch import load_file
75
+
76
+ # Load ScragVAE weights
77
+ scrag_weights = load_file("diffusion_pytorch_model.safetensors")
78
+
79
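+ # Only decoder.* keys differ; encoder.* keys are identical to the original
79
+ decoder_keys = {k: v for k, v in scrag_weights.items() if k.startswith("decoder.")}
80
+ your_vae.load_state_dict(decoder_keys, strict=False)  # your_vae: an AutoencoderOobleck you have already loaded; strict=False because only decoder.* keys are supplied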
82
+ ```
83
+
84
+ ### acestep.cpp / HOT-Step CPP
85
+
86
+ Place `scragvae-BF16.gguf` in your models directory alongside the other GGUF files:
87
+
88
+ ```
89
+ models/
90
+ ├── acestep-v15-turbo-BF16.gguf # DiT
90
+ ├── acestep-5Hz-lm-BF16.gguf # LM
91
+ ├── Qwen3-Embedding-BF16.gguf # Text encoder
92
+ ├── vae-BF16.gguf # Original VAE
94
+ └── scragvae-BF16.gguf # ← ScragVAE (add this)
95
+ ```
96
+
97
+ The engine auto-discovers all VAE GGUFs at startup. In HOT-Step CPP, select **ScragVAE** from the **VAE Decoder** dropdown in the Models & Adapters panel.
98
+
99
+ For acestep.cpp's built-in web UI or API, pass `"vae_model": "scragvae-BF16.gguf"` in your synth request JSON.
100
+
101
+ ### Converting from safetensors to GGUF yourself
102
+
103
+ If you need to reconvert (e.g. after further fine-tuning):
104
+
105
+ ```bash
106
+ python engine/convert.py # scans checkpoints/ and outputs to models/
107
+ ```
108
+
109
+ Or use the converter directly:
110
+
111
+ ```python
112
+ from convert import convert_model
113
+ convert_model("scragvae", "/path/to/scragvae/", "scragvae-BF16.gguf", "vae")
114
+ ```
115
+
116
+ ## Architecture
117
+
118
+ ScragVAE uses the same **AutoencoderOobleck** architecture as the original ACE-Step VAE, with no structural changes. Only the decoder weights differ.
119
+
120
+ | Parameter | Value |
121
+ |-----------|-------|
122
+ | Architecture | AutoencoderOobleck |
123
+ | Audio channels | 2 (stereo) |
124
+ | Sample rate | 48,000 Hz |
125
+ | Latent dim | 64 |
126
+ | Decoder channels | 128 |
127
+ | Channel multiples | [1, 2, 4, 8, 16] |
128
+ | Downsampling ratios | [2, 4, 4, 6, 10] |
129
+ | Total ratio | 1920× |
130
+ | Activation | Snake |
131
+ | Weight normalization | Yes (fused at load in GGUF) |
132
+ | Parameters | 168.7M (encoder + decoder) |
133
+
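The figures in the table can be cross-checked with a little arithmetic. The sketch below multiplies the downsampling ratios to get the total ratio, derives the latent frame rate at 48 kHz, and estimates the weight size from the parameter count. The helper names are hypothetical, the frame count ignores any padding the model may apply, and the size estimate assumes the listed file sizes are binary megabytes with negligible metadata overhead.

```python
import math

SAMPLE_RATE_HZ = 48_000
DOWNSAMPLING_RATIOS = [2, 4, 4, 6, 10]
TOTAL_PARAMS = 168.7e6  # encoder + decoder, from the table above

# Product of the per-stage ratios: audio samples per latent frame
total_ratio = math.prod(DOWNSAMPLING_RATIOS)

# Latent frames produced per second of 48 kHz audio
latent_rate_hz = SAMPLE_RATE_HZ / total_ratio


def latent_frames(num_samples):
    """Whole latent frames for a clip, ignoring any padding (assumption)."""
    return num_samples // total_ratio


def weight_size_mib(bytes_per_param):
    """Approximate weight size in MiB, ignoring file metadata (assumption)."""
    return TOTAL_PARAMS * bytes_per_param / 2**20


print(total_ratio, latent_rate_hz)          # 1920 25.0
print(latent_frames(30 * SAMPLE_RATE_HZ))   # 750
print(round(weight_size_mib(4)), round(weight_size_mib(2)))  # 644 322
```

Under these assumptions the F32 and BF16 estimates line up with the 644 MB and 322 MB files listed earlier.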
134
+ ### Compatibility
135
+
136
+ - ✅ All ACE-Step 1.5 DiT checkpoints (turbo, SFT, XL)
136
+ - ✅ All LoRA/adapter models
137
+ - ✅ Both Python (PyTorch/Diffusers) and C++ (ggml/acestep.cpp) runtimes
138
+ - ✅ Encoder weights are identical, so no retraining of upstream models is needed
140
+
141
+ ## Training
142
+
143
+ ### Strategy
144
+
145
+ **Freeze encoder → train decoder only.** The DiT operates in latent space; by only improving the decoder, all existing DiT checkpoints remain compatible without retraining.
146
+
147
+ ### Two-phase training
148
+
149
+ | Parameter | Phase 1 (Warm-up) | Phase 2 (Quality) |
150
+ |-----------|-------------------|-------------------|
151
+ | Steps | ~3,000 | ~98,000 |
152
+ | Learning rate | 3e-5 | 3e-5 |
153
+ | Adversarial weight | 0.5 | **1.5** |
154
+ | Feature matching | 5.0 | **3.0** |
155
+ | Perceptual weighting | On | **Off** |
156
+ | L1 time domain | 0.0 | **0.05** |
157
+ | Discriminator FFT sizes | 6 | **6 (+4096)** |
158
+ | Spectral loss FFT sizes | - | **9 (32–8192)** |
158
+ | Multi-res mel loss | - | **4 scales** |
159
+ | Precision | bf16-mixed | bf16-mixed |
160
+ | Effective batch | 16 (8×2 accum) | 16 (8×2 accum) |
162
+ | Gradient clip | 1.0 | 1.0 |
163
+
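To make the phase differences concrete, the sketch below encodes the loss weights from the table and combines hypothetical per-term loss values into one generator objective. This is not the stable-audio-tools configuration format: the class and function names are invented for illustration, and the table gives no weights for the spectral and mel terms, so a weight of 1.0 is assumed for both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseWeights:
    """Loss weights for one training phase (values taken from the table)."""
    adversarial: float
    feature_matching: float
    l1_time: float


PHASE_1 = PhaseWeights(adversarial=0.5, feature_matching=5.0, l1_time=0.0)
PHASE_2 = PhaseWeights(adversarial=1.5, feature_matching=3.0, l1_time=0.05)

# Effective batch size: micro-batch of 8 with 2 gradient accumulation steps
MICRO_BATCH = 8
ACCUM_STEPS = 2
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS


def weighted_generator_loss(weights, adv, fm, l1, spectral, mel=0.0):
    """Weighted sum of loss terms.

    Assumption: spectral and mel terms use weight 1.0, which the card
    does not specify.
    """
    return (
        weights.adversarial * adv
        + weights.feature_matching * fm
        + weights.l1_time * l1
        + spectral
        + mel
    )


# With every raw term equal to 1.0, the phases weight the objective differently
phase_1_total = weighted_generator_loss(PHASE_1, 1.0, 1.0, 1.0, 1.0)
phase_2_total = weighted_generator_loss(PHASE_2, 1.0, 1.0, 1.0, 1.0, mel=1.0)
```

The point of the example is only the relative emphasis: Phase 2 shifts weight from feature matching toward the adversarial term and adds a small time-domain L1 term.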
164
+ ### Key changes vs original training
165
+
166
+ - **Disabled perceptual weighting** in the spectral loss: the original's perceptual curve de-emphasizes high frequencies, actively suppressing HF reconstruction
166
+ - **Increased adversarial weight** (0.5 → 1.5): forces the decoder to produce more realistic spectral detail
167
+ - **Reduced feature matching** (5.0 → 3.0): less over-smoothing from discriminator feature constraints
168
+ - **Added L1 time-domain loss** (0.05): preserves transient attacks and waveform fidelity
169
+ - **Added 4096-point FFT** to discriminator: gives the discriminator finer frequency resolution for harmonic content in the 2–8kHz range
170
+ - **Added multi-resolution mel-spectrogram loss** at 4 scales: captures perceptually relevant frequency content
172
+
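The idea behind a multi-resolution spectral loss with no perceptual weighting can be sketched in a few lines of NumPy. This is a simplified stand-in, not the loss used in training: it assumes the nine FFT sizes are the powers of two from 32 to 8192, a Hann window, a hop of one quarter of the FFT size, and a plain L1 distance on linear magnitudes. All function names are hypothetical.

```python
import numpy as np

# Assumption: the nine sizes spanning 32-8192 are consecutive powers of two
FFT_SIZES = [2 ** k for k in range(5, 14)]


def stft_magnitude(signal, n_fft):
    """Magnitude STFT with a Hann window and hop = n_fft // 4 (assumed)."""
    hop = n_fft // 4
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_spectral_l1(reference, estimate, fft_sizes=FFT_SIZES):
    """Mean L1 distance between magnitude spectra, averaged over resolutions.

    Every frequency bin contributes equally: no perceptual weighting curve.
    """
    per_resolution = [
        np.mean(np.abs(stft_magnitude(reference, n) - stft_magnitude(estimate, n)))
        for n in fft_sizes
    ]
    return float(np.mean(per_resolution))


# Toy signals: the estimate is an attenuated copy of the reference
rng = np.random.default_rng(0)
reference = rng.standard_normal(16384)
loss_identical = multi_resolution_spectral_l1(reference, reference)
loss_attenuated = multi_resolution_spectral_l1(reference, 0.5 * reference)
```

Small FFT sizes give fine time resolution for transients, while large ones give fine frequency resolution for harmonics, which is why several sizes are averaged.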
173
+ ### Hardware
174
+
175
+ - **GPU:** NVIDIA RTX 5090 (32GB)
176
+ - **Training time:** ~8 hours total (Phase 1 + Phase 2)
177
+ - **Framework:** PyTorch + stable-audio-tools
178
+
179
+ ## License
180
+
181
+ MIT License, the same as ACE-Step 1.5.
182
+
183
+ ## Citation
184
+
185
+ If you use ScragVAE in your work:
186
+
187
+ ```bibtex
188
+ @misc{scragvae2026,
189
+ title={ScragVAE: Improved VAE Decoder for ACE-Step 1.5},
190
+ author={Scragnog},
191
+ year={2026},
192
+ url={https://huggingface.co/scragnog/Ace-Step-1.5-ScragVAE}
193
+ }
194
+ ```
195
+
196
+ ## Acknowledgements
197
+
198
+ - [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5): the base model and VAE architecture
198
+ - [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools): training framework
199
+ - [acestep.cpp](https://github.com/ace-step/acestep.cpp): C++ inference engine with GGUF support