LocalAI-io/LocalVQE

@@ -20,7 +20,7 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
 16 kHz speech, designed to run on commodity CPUs in real time.
 - 1.3 M parameters (~5 MB F32)
-- ~1.66 ms per 16 ms frame on Zen4 (24 threads) — **≈9.6× realtime**
 - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
 - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
   PyTorch reference included for verification and research
@@ -31,8 +31,9 @@ This page is the Hugging Face model card — it hosts the published weights.
 Source code, build system, tests, and training pipeline live in the GitHub
 repository: <https://github.com/localai-org/LocalVQE>.
-The current release is **v1.1**, which fixes intermittent crackling the
-previous release produced under heavy background noise.
 The technical report describing the architecture, streaming-state contract,
 and streaming-causal normalisation operator is included in this repo as
@@ -103,8 +104,11 @@ LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
 | File | Size | Description |
 |---|---|---|
-| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
-| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |
 Only F32 GGUF is published today. A `quantize` tool is included in the
 C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
@@ -118,11 +122,18 @@ Full 800-clip eval on the
 | Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
 |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
-| doubletalk                        | 115 |          4.70 |         2.35 |       8.4 dB |          2.85 |
-| doubletalk-with-movement          | 185 |          4.63 |         2.35 |       8.3 dB |          2.80 |
-| farend-singletalk                 | 107 |          2.98 |         4.91 |      44.7 dB |          1.93 |
-| farend-singletalk-with-movement   | 193 |          3.40 |         4.95 |      45.0 dB |          1.91 |
-| nearend-singletalk                | 200 |          4.99 |         4.05 |       2.5 dB |          3.13 |
 - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
   quality predictor. "Echo" rates how well echo was removed; "degradation"
@@ -178,6 +189,23 @@ glslc`/`shaderc`).
 Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
 full `ggml_backend_graph_compute`.
 | Backend                     | Threads | p50     | p99     | max     |
 |-----------------------------|--------:|--------:|--------:|--------:|
 | CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
@@ -194,14 +222,14 @@ range.
 ## Running Inference
-Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (the file list above)
 either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
 `huggingface_hub`. Then:
 ### CLI
 ```bash
-./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
     --in-wav mic.wav ref.wav \
     --out-wav enhanced.wav
 ```
@@ -211,7 +239,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.
 ### Benchmark
 ```bash
-./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
     --in-wav mic.wav ref.wav --iters 10 --profile
 ```
@@ -233,7 +261,7 @@ tool in the C++ build can produce GGUF variants from the F32 reference
 for experimentation:
 ```bash
-./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
 ```
 Expect end-to-end quality loss until proper per-tensor selection and
@@ -241,7 +269,7 @@ calibration have been worked through.
 ## PyTorch Reference
-`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
 It is provided for verification, ablation, and downstream research — not
 for end-user inference, which should go through the GGML build above. The
 model definition lives under `pytorch/` in the

 16 kHz speech, designed to run on commodity CPUs in real time.
 - 1.3 M parameters (~5 MB F32)
+- ~1.56 ms per 16 ms frame on Zen4 (4 threads) — **≈10× realtime**
 - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
 - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
   PyTorch reference included for verification and research
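The headline figures in the list above follow from simple arithmetic. The sketch below recomputes the hop duration and the realtime factor using only numbers quoted in this card (256-sample hop, 16 kHz audio, 1.56 ms of compute per hop). The function names are illustrative and are not part of LocalVQE.

```python
SAMPLE_RATE_HZ = 16_000   # from the card: 16 kHz speech
HOP_SAMPLES = 256         # from the card: 256-sample hop
COMPUTE_MS_PER_HOP = 1.56 # from the card: p50 at 4 threads on Zen4


def hop_duration_ms(hop_samples: int, sample_rate_hz: int) -> float:
    """Duration of one hop of audio in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(compute_ms: float, hop_ms: float) -> float:
    """How many times faster than real time one hop is processed."""
    return hop_ms / compute_ms


hop_ms = hop_duration_ms(HOP_SAMPLES, SAMPLE_RATE_HZ)
rtf = realtime_factor(COMPUTE_MS_PER_HOP, hop_ms)
print(f"hop = {hop_ms:.1f} ms, realtime factor = {rtf:.2f}x")
```

A factor above 1 means each hop is processed faster than it arrives, which is the condition for real-time streaming.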
 Source code, build system, tests, and training pipeline live in the GitHub
 repository: <https://github.com/localai-org/LocalVQE>.
+The current release is **v1.2**. It doubles the echo-search (delay)
+window from 512 ms to 1024 ms at a ~20 % per-hop CPU cost, and it
+avoids oversuppressing voices close to the noise floor.
 The technical report describing the architecture, streaming-state contract,
 and streaming-causal normalisation operator is included in this repo as
 | File | Size | Description |
 |---|---|---|
+| `localvqe-v1.2-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |
+| `localvqe-v1.1-1.3M.pt` | 11 MB | Previous release. |
+| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Previous release (F32 GGUF). |
+| `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |
 Only F32 GGUF is published today. A `quantize` tool is included in the
 C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
 | Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
 |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
+| doubletalk                        | 115 |          4.72 |         2.37 |       8.4 dB |          2.83 |
+| doubletalk-with-movement          | 185 |          4.65 |         2.30 |       8.1 dB |          2.79 |
+| farend-singletalk                 | 107 |          3.78 |         4.91 |      45.7 dB |          1.80 |
+| farend-singletalk-with-movement   | 193 |          4.12 |         4.96 |      40.6 dB |          1.75 |
+| nearend-singletalk                | 200 |          5.00 |         4.16 |       2.1 dB |          3.17 |
+v1.2 vs v1.1 deltas: AECMOS echo MOS rises by 0.80 on
+farend-singletalk (FE-ST) and by 0.72 on FE-ST-with-movement. This was
+the primary release goal, since these are the scenarios where echo
+leaks through. Near-end degradation MOS rises by 0.11, and double-talk
+is roughly unchanged. Blind ERLE on FE-ST-with-movement drops 4.4 dB:
+v1.2 is less aggressive when the echo path is moving, trading raw
+cancellation for fewer near-end gating artefacts.
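The deltas quoted above can be recomputed from the two evaluation tables. The sketch below is a minimal check, assuming the v1.1 and v1.2 rows are transcribed exactly as they appear in this diff. The dictionary layout and the `deltas` helper are illustrative, not part of the evaluation pipeline.

```python
# Columns: AECMOS echo, AECMOS deg, blind ERLE (dB), DNSMOS OVRL.
METRICS = ("echo", "deg", "erle_db", "ovrl")

V1_1 = {
    "doubletalk":                      (4.70, 2.35, 8.4, 2.85),
    "doubletalk-with-movement":        (4.63, 2.35, 8.3, 2.80),
    "farend-singletalk":               (2.98, 4.91, 44.7, 1.93),
    "farend-singletalk-with-movement": (3.40, 4.95, 45.0, 1.91),
    "nearend-singletalk":              (4.99, 4.05, 2.5, 3.13),
}

V1_2 = {
    "doubletalk":                      (4.72, 2.37, 8.4, 2.83),
    "doubletalk-with-movement":        (4.65, 2.30, 8.1, 2.79),
    "farend-singletalk":               (3.78, 4.91, 45.7, 1.80),
    "farend-singletalk-with-movement": (4.12, 4.96, 40.6, 1.75),
    "nearend-singletalk":              (5.00, 4.16, 2.1, 3.17),
}


def deltas(new: dict, old: dict) -> dict:
    """Per-scenario, per-metric difference (new minus old), rounded to 2 dp."""
    return {
        scenario: {
            metric: round(n - o, 2)
            for metric, n, o in zip(METRICS, new[scenario], old[scenario])
        }
        for scenario in new
    }


d = deltas(V1_2, V1_1)
for scenario, row in d.items():
    print(f"{scenario:34s} {row}")
```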
 - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
   quality predictor. "Echo" rates how well echo was removed; "degradation"
 Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
 full `ggml_backend_graph_compute`.
+**v1.2** (current, 1024 ms echo-search window):
+| Backend                     | Threads | p50     | p99     | max     |
+|-----------------------------|--------:|--------:|--------:|--------:|
+| CPU                         |       1 | 4.15 ms | 4.53 ms | 6.23 ms |
+| CPU                         |       4 | 1.56 ms | 1.73 ms | 4.57 ms |
+| CPU                         |       8 | 1.89 ms | 2.15 ms | 6.91 ms |
+| CPU                         |      16 | 2.12 ms | 2.17 ms | 6.43 ms |
+| Vulkan — AMD iGPU (RADV)    |       — | 4.88 ms | 5.06 ms | 6.24 ms |
+| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.79 ms | 3.42 ms | 5.42 ms |
+The model is small enough that, beyond ≈4 threads, thread-launch and
+synchronisation overhead dominate; **four threads is the sweet spot
+on Zen4**.
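The sweet-spot claim can be read straight off the v1.2 CPU rows. The sketch below picks the thread count with the lowest p50, checks that even the worst observed hop fits inside the 16 ms hop budget, and compares the single-thread p50 against the v1.1 single-thread row (3.40 ms) to illustrate the roughly 20 % per-hop cost of the larger window. The variable names are illustrative; the numbers are copied from the tables in this card.

```python
HOP_BUDGET_MS = 16.0  # one 256-sample hop at 16 kHz

# threads -> (p50, p99, max) in milliseconds, v1.2 CPU rows
V1_2_CPU_MS = {
    1:  (4.15, 4.53, 6.23),
    4:  (1.56, 1.73, 4.57),
    8:  (1.89, 2.15, 6.91),
    16: (2.12, 2.17, 6.43),
}
V1_1_SINGLE_THREAD_P50_MS = 3.40  # v1.1 CPU row, 1 thread

# Thread count with the lowest median per-hop latency.
best_threads = min(V1_2_CPU_MS, key=lambda t: V1_2_CPU_MS[t][0])

# Does every configuration stay inside the hop budget, even at its max?
all_within_budget = all(mx < HOP_BUDGET_MS for _, _, mx in V1_2_CPU_MS.values())

# Relative single-thread cost of v1.2 over v1.1, in percent.
overhead_pct = (V1_2_CPU_MS[1][0] / V1_1_SINGLE_THREAD_P50_MS - 1.0) * 100.0

print(f"best thread count: {best_threads}")
print(f"all configurations within {HOP_BUDGET_MS} ms: {all_within_budget}")
print(f"v1.2 single-thread overhead vs v1.1: {overhead_pct:.0f} %")
```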
+**v1.1** (previous, 512 ms echo-search window) for comparison:
 | Backend                     | Threads | p50     | p99     | max     |
 |-----------------------------|--------:|--------:|--------:|--------:|
 | CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
 ## Running Inference
+Download `localvqe-v1.2-1.3M-f32.gguf` from this repository (the file list above)
 either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
 `huggingface_hub`. Then:
 ### CLI
 ```bash
+./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
     --in-wav mic.wav ref.wav \
     --out-wav enhanced.wav
 ```
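The download step mentioned before the CLI command can also be scripted. The sketch below uses `hf_hub_download` from `huggingface_hub` and then assembles the same command line shown above. The repository id `LocalAI-io/LocalVQE` is an assumption taken from this page's header, and the helper names are hypothetical; adjust both to your setup.

```python
REPO_ID = "LocalAI-io/LocalVQE"            # assumption: Hub repo id from the page header
FILENAME = "localvqe-v1.2-1.3M-f32.gguf"   # from the file list in this card
CLI_BINARY = "./ggml/build/bin/localvqe"   # path used in the CLI example


def download_model(repo_id: str = REPO_ID, filename: str = FILENAME) -> str:
    """Fetch the GGUF file into the local Hub cache and return its path."""
    # Imported lazily so the command builder works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def build_cli_command(model_path: str, mic_wav: str, ref_wav: str, out_wav: str) -> list:
    """Argument list mirroring the CLI invocation in this card."""
    return [
        CLI_BINARY, model_path,
        "--in-wav", mic_wav, ref_wav,
        "--out-wav", out_wav,
    ]
```

To run it end to end, pass the result to `subprocess.run(build_cli_command(download_model(), "mic.wav", "ref.wav", "enhanced.wav"), check=True)`.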
 ### Benchmark
 ```bash
+./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
     --in-wav mic.wav ref.wav --iters 10 --profile
 ```
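Both commands take a mic and a far-end reference WAV, and the card states that 16 kHz mono PCM is expected. The sketch below is one way to check a file before running the tools, using only Python's standard `wave` module. The helper name is hypothetical, and the check covers sample rate and channel count only, since the card does not state a required bit depth.

```python
import wave

EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1


def wav_problems(path: str) -> list:
    """Return a list of human-readable format problems (empty if none)."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ} Hz"
            )
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS} (mono)"
            )
    return problems
```

Calling `wav_problems("mic.wav")` and `wav_problems("ref.wav")` before inference gives an early, readable error instead of a failure inside the tool.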
 for experimentation:
 ```bash
+./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
 ```
 Expect end-to-end quality loss until proper per-tensor selection and
 ## PyTorch Reference
+`localvqe-v1.2-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
 It is provided for verification, ablation, and downstream research — not
 for end-user inference, which should go through the GGML build above. The
 model definition lives under `pytorch/` in the