Sync model card with upstream GitHub inference README
README.md
CHANGED
|
@@ -20,7 +20,7 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
|
|
| 20 |
16 kHz speech, designed to run on commodity CPUs in real time.
|
| 21 |
|
| 22 |
- 1.3 M parameters (~5 MB F32)
|
| 23 |
-
- ~1.
|
| 24 |
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
|
| 25 |
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
|
| 26 |
PyTorch reference included for verification and research
|
|
@@ -31,8 +31,9 @@ This page is the Hugging Face model card; it hosts the published weights.
|
|
| 31 |
Source code, build system, tests, and training pipeline live in the GitHub
|
| 32 |
repository: <https://github.com/localai-org/LocalVQE>.
|
| 33 |
|
| 34 |
-
The current release is **v1.
|
| 35 |
-
|
| 36 |
|
| 37 |
The technical report describing the architecture, streaming-state contract,
|
| 38 |
and streaming-causal normalisation operator is included in this repo as
|
|
@@ -103,8 +104,11 @@ LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
|
|
| 103 |
|
| 104 |
| File | Size | Description |
|
| 105 |
|---|---|---|
|
| 106 |
-
| `localvqe-v1.
|
| 107 |
-
| `localvqe-v1.
|
| 108 |
|
| 109 |
Only F32 GGUF is published today. A `quantize` tool is included in the
|
| 110 |
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
|
|
@@ -118,11 +122,18 @@ Full 800-clip eval on the
|
|
| 118 |
|
| 119 |
| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|
| 120 |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
|
| 121 |
-
| doubletalk | 115 | 4.
|
| 122 |
-
| doubletalk-with-movement | 185 | 4.
|
| 123 |
-
| farend-singletalk | 107 |
|
| 124 |
-
| farend-singletalk-with-movement | 193 |
|
| 125 |
-
| nearend-singletalk | 200 |
|
| 126 |
|
| 127 |
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
|
| 128 |
quality predictor. "Echo" rates how well echo was removed; "degradation"
|
|
@@ -178,6 +189,23 @@ glslc`/`shaderc`).
|
|
| 178 |
Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
|
| 179 |
full `ggml_backend_graph_compute`.
|
| 180 |
|
| 181 |
| Backend | Threads | p50 | p99 | max |
|
| 182 |
|-----------------------------|--------:|--------:|--------:|--------:|
|
| 183 |
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
|
|
@@ -194,14 +222,14 @@ range.
|
|
| 194 |
|
| 195 |
## Running Inference
|
| 196 |
|
| 197 |
-
Download `localvqe-v1.
|
| 198 |
via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
|
| 199 |
`huggingface_hub`. Then:
|
| 200 |
|
| 201 |
### CLI
|
| 202 |
|
| 203 |
```bash
|
| 204 |
-
./ggml/build/bin/localvqe localvqe-v1.
|
| 205 |
--in-wav mic.wav ref.wav \
|
| 206 |
--out-wav enhanced.wav
|
| 207 |
```
|
|
@@ -211,7 +239,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.
|
|
| 211 |
### Benchmark
|
| 212 |
|
| 213 |
```bash
|
| 214 |
-
./ggml/build/bin/bench localvqe-v1.
|
| 215 |
--in-wav mic.wav ref.wav --iters 10 --profile
|
| 216 |
```
|
| 217 |
|
|
@@ -233,7 +261,7 @@ tool in the C++ build can produce GGUF variants from the F32 reference
|
|
| 233 |
for experimentation:
|
| 234 |
|
| 235 |
```bash
|
| 236 |
-
./ggml/build/bin/quantize localvqe-v1.
|
| 237 |
```
|
| 238 |
|
| 239 |
Expect end-to-end quality loss until proper per-tensor selection and
|
|
@@ -241,7 +269,7 @@ calibration have been worked through.
|
|
| 241 |
|
| 242 |
## PyTorch Reference
|
| 243 |
|
| 244 |
-
`localvqe-v1.
|
| 245 |
It is provided for verification, ablation, and downstream research, not
|
| 246 |
for end-user inference, which should go through the GGML build above. The
|
| 247 |
model definition lives under `pytorch/` in the
|
|
|
|
| 20 |
16 kHz speech, designed to run on commodity CPUs in real time.
|
| 21 |
|
| 22 |
- 1.3 M parameters (~5 MB F32)
|
| 23 |
+
- ~1.56 ms per 16 ms frame on Zen4 (4 threads): **≈10× realtime**
|
| 24 |
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
|
| 25 |
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
|
| 26 |
PyTorch reference included for verification and research
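
The headline figures in the list above follow from simple arithmetic, sketched here as a sanity check. The constant names are illustrative, and the parameter count is the rounded "1.3 M" from the list, so the size comes out approximate.

```python
# Back-of-the-envelope check of the headline numbers.
SAMPLE_RATE_HZ = 16_000    # 16 kHz speech
HOP_SAMPLES = 256          # streaming hop size
PARAM_COUNT = 1_300_000    # "1.3 M parameters" (rounded, so the size is approximate)
BYTES_PER_F32 = 4          # one 32-bit float per weight


def hop_duration_ms(hop_samples, sample_rate_hz):
    """Duration of one hop in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def f32_weight_size_mb(param_count):
    """Approximate size of the F32 weights in megabytes (10^6 bytes)."""
    return param_count * BYTES_PER_F32 / 1e6


print(hop_duration_ms(HOP_SAMPLES, SAMPLE_RATE_HZ))  # 16.0
print(f32_weight_size_mb(PARAM_COUNT))               # 5.2
```

A 256-sample hop at 16 kHz is exactly 16 ms, and 1.3 M F32 weights occupy about 5.2 MB, consistent with the "~5 MB F32" figure.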
|
|
|
|
| 31 |
Source code, build system, tests, and training pipeline live in the GitHub
|
| 32 |
repository: <https://github.com/localai-org/LocalVQE>.
|
| 33 |
|
| 34 |
+
The current release is **v1.2**. It doubles the supported delay
|
| 35 |
+
window from 500 ms to 1 second at a ~20 % per-hop CPU cost. It also
|
| 36 |
+
avoids over-suppressing voices that are close to the noise floor.
|
| 37 |
|
| 38 |
The technical report describing the architecture, streaming-state contract,
|
| 39 |
and streaming-causal normalisation operator is included in this repo as
|
|
|
|
| 104 |
|
| 105 |
| File | Size | Description |
|
| 106 |
|---|---|---|
|
| 107 |
+
| `localvqe-v1.2-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
|
| 108 |
+
| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export; this is what the C++ inference engine loads. |
|
| 109 |
+
| `localvqe-v1.1-1.3M.pt` | 11 MB | Previous release. |
|
| 110 |
+
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Previous release (F32 GGUF). |
|
| 111 |
+
| `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |
|
| 112 |
|
| 113 |
Only F32 GGUF is published today. A `quantize` tool is included in the
|
| 114 |
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
|
|
|
|
| 122 |
|
| 123 |
| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|
| 124 |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
|
| 125 |
+
| doubletalk | 115 | 4.72 | 2.37 | 8.4 dB | 2.83 |
|
| 126 |
+
| doubletalk-with-movement | 185 | 4.65 | 2.30 | 8.1 dB | 2.79 |
|
| 127 |
+
| farend-singletalk | 107 | 3.78 | 4.91 | 45.7 dB | 1.80 |
|
| 128 |
+
| farend-singletalk-with-movement | 193 | 4.12 | 4.96 | 40.6 dB | 1.75 |
|
| 129 |
+
| nearend-singletalk | 200 | 5.00 | 4.16 | 2.1 dB | 3.17 |
|
| 130 |
+
|
| 131 |
+
v1.2 vs v1.1 deltas: AECMOS echo MOS +0.80 / +0.72 on farend-singletalk (FE-ST) and
|
| 132 |
+
FE-ST-with-movement (the primary release goal; these scenarios are
|
| 133 |
+
where echo leaks through), near-end deg MOS +0.11, double-talk
|
| 134 |
+
roughly unchanged. FE-ST-with-movement raw ERLE drops 4.4 dB; v1.2
|
| 135 |
+
is less aggressive when the echo path is moving, trading raw
|
| 136 |
+
cancellation for fewer near-end gating artefacts.
|
| 137 |
|
| 138 |
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
|
| 139 |
quality predictor. "Echo" rates how well echo was removed; "degradation"
|
|
|
|
| 189 |
Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
|
| 190 |
full `ggml_backend_graph_compute`.
|
| 191 |
|
| 192 |
+
**v1.2** (current, 1024 ms echo-search window):
|
| 193 |
+
|
| 194 |
+
| Backend | Threads | p50 | p99 | max |
|
| 195 |
+
|-----------------------------|--------:|--------:|--------:|--------:|
|
| 196 |
+
| CPU | 1 | 4.15 ms | 4.53 ms | 6.23 ms |
|
| 197 |
+
| CPU | 4 | 1.56 ms | 1.73 ms | 4.57 ms |
|
| 198 |
+
| CPU | 8 | 1.89 ms | 2.15 ms | 6.91 ms |
|
| 199 |
+
| CPU | 16 | 2.12 ms | 2.17 ms | 6.43 ms |
|
| 200 |
+
| Vulkan, AMD iGPU (RADV) | - | 4.88 ms | 5.06 ms | 6.24 ms |
|
| 201 |
+
| Vulkan, NVIDIA RTX 5070 Ti | - | 1.79 ms | 3.42 ms | 5.42 ms |
|
| 202 |
+
|
| 203 |
+
Beyond ≈4 threads, the model is small enough that thread-launch and
|
| 204 |
+
synchronisation overhead dominate; **four threads is the sweet spot
|
| 205 |
+
on Zen4**.
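
To make the thread-scaling claim concrete, the sketch below converts the v1.2 CPU p50 latencies from the table into realtime factors (hop duration divided by per-hop compute time) and picks the fastest thread count. The p50 values are copied from the table; the variable and function names are illustrative only.

```python
# Realtime factor per CPU thread count, using the v1.2 p50 figures above.
HOP_MS = 16.0  # one 256-sample hop at 16 kHz

# Threads -> p50 latency in milliseconds (Zen4 desktop, CPU backend).
P50_MS_BY_THREADS = {1: 4.15, 4: 1.56, 8: 1.89, 16: 2.12}


def realtime_factor(p50_ms, hop_ms=HOP_MS):
    """How many times faster than realtime one hop is processed."""
    return hop_ms / p50_ms


# The thread count with the lowest median latency.
best_threads = min(P50_MS_BY_THREADS, key=P50_MS_BY_THREADS.get)

for threads, p50_ms in sorted(P50_MS_BY_THREADS.items()):
    print(f"{threads:>2} threads: {realtime_factor(p50_ms):.1f}x realtime")
print(f"fastest median latency at {best_threads} threads")
```

With these numbers the four-thread configuration comes out at roughly 10x realtime, and both 8 and 16 threads are slower than 4.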
|
| 206 |
+
|
| 207 |
+
**v1.1** (previous, 512 ms echo-search window) for comparison:
|
| 208 |
+
|
| 209 |
| Backend | Threads | p50 | p99 | max |
|
| 210 |
|-----------------------------|--------:|--------:|--------:|--------:|
|
| 211 |
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
|
|
|
|
| 222 |
|
| 223 |
## Running Inference
|
| 224 |
|
| 225 |
+
Download `localvqe-v1.2-1.3M-f32.gguf` from this repository (the file list above)
|
| 226 |
via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
|
| 227 |
`huggingface_hub`. Then:
|
| 228 |
|
| 229 |
### CLI
|
| 230 |
|
| 231 |
```bash
|
| 232 |
+
./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
|
| 233 |
--in-wav mic.wav ref.wav \
|
| 234 |
--out-wav enhanced.wav
|
| 235 |
```
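
Because the CLI expects 16 kHz mono PCM for both the mic and far-end reference files, it can help to check the WAV headers before running inference. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the script layout are illustrative and not part of the LocalVQE tooling.

```python
import sys
import wave


def check_wav_format(path, expected_rate_hz=16_000, expected_channels=1):
    """Return a list of problems found; an empty list means the header looks fine."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate_hz:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate_hz} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_wav.py mic.wav ref.wav
    for wav_path in sys.argv[1:]:
        issues = check_wav_format(wav_path)
        print(f"{wav_path}: {'ok' if not issues else '; '.join(issues)}")
```

The check only inspects the header; it does not resample or downmix, so files that fail need converting with an audio tool of your choice first.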
|
|
|
|
| 239 |
### Benchmark
|
| 240 |
|
| 241 |
```bash
|
| 242 |
+
./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
|
| 243 |
--in-wav mic.wav ref.wav --iters 10 --profile
|
| 244 |
```
|
| 245 |
|
|
|
|
| 261 |
for experimentation:
|
| 262 |
|
| 263 |
```bash
|
| 264 |
+
./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
|
| 265 |
```
|
| 266 |
|
| 267 |
Expect end-to-end quality loss until proper per-tensor selection and
|
|
|
|
| 269 |
|
| 270 |
## PyTorch Reference
|
| 271 |
|
| 272 |
+
`localvqe-v1.2-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
|
| 273 |
It is provided for verification, ablation, and downstream research, not
|
| 274 |
for end-user inference, which should go through the GGML build above. The
|
| 275 |
model definition lives under `pytorch/` in the
|