richiejp committed
Commit 66c7198 · verified · 1 Parent(s): f0dadbf

Sync model card with upstream GitHub inference README

Files changed (1)
  1. README.md +43 -15
README.md CHANGED
@@ -20,7 +20,7 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
  16 kHz speech, designed to run on commodity CPUs in real time.

  - 1.3 M parameters (~5 MB F32)
- - ~1.66 ms per 16 ms frame on Zen4 (24 threads) — **≈9.6× realtime**
  - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
  - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
@@ -31,8 +31,9 @@ This page is the Hugging Face model card — it hosts the published weights.
  Source code, build system, tests, and training pipeline live in the GitHub
  repository: <https://github.com/localai-org/LocalVQE>.

- The current release is **v1.1**, which fixes intermittent crackling the
- previous release produced under heavy background noise.

  The technical report describing the architecture, streaming-state contract,
  and streaming-causal normalisation operator is included in this repo as
@@ -103,8 +104,11 @@ LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters

  | File | Size | Description |
  |---|---|---|
- | `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
- | `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |

  Only F32 GGUF is published today. A `quantize` tool is included in the
  C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
@@ -118,11 +122,18 @@ Full 800-clip eval on the

  | Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
  |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
- | doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
- | doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
- | farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
- | farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
- | nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |

  - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"
@@ -178,6 +189,23 @@ glslc`/`shaderc`).
  Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
  full `ggml_backend_graph_compute`.

  | Backend | Threads | p50 | p99 | max |
  |-----------------------------|--------:|--------:|--------:|--------:|
  | CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
@@ -194,14 +222,14 @@ range.

  ## Running Inference

- Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (the file list above)
  either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
  `huggingface_hub`. Then:

  ### CLI

  ```bash
- ./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
  ```
@@ -211,7 +239,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.
  ### Benchmark

  ```bash
- ./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
  ```
 
@@ -233,7 +261,7 @@ tool in the C++ build can produce GGUF variants from the F32 reference
  for experimentation:

  ```bash
- ./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
  ```

  Expect end-to-end quality loss until proper per-tensor selection and
241
 
242
  ## PyTorch Reference
243
 
244
- `localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
245
  It is provided for verification, ablation, and downstream research β€” not
246
  for end-user inference, which should go through the GGML build above. The
247
  model definition lives under `pytorch/` in the
 
  16 kHz speech, designed to run on commodity CPUs in real time.

  - 1.3 M parameters (~5 MB F32)
+ - ~1.56 ms per 16 ms frame on Zen4 (4 threads) — **≈10× realtime**
  - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
  - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
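
The headline figures in the bullets above follow from simple arithmetic: a 256-sample hop at 16 kHz lasts 16 ms, and the real-time factor is that hop duration divided by the per-hop compute time. The sketch below is only a sanity check; the function and constant names are illustrative and do not come from the LocalVQE code base.

```python
# Illustrative names only; the numbers come from the README bullets.
SAMPLE_RATE_HZ = 16_000  # model operates on 16 kHz speech
HOP_SAMPLES = 256        # samples consumed per streaming hop


def hop_duration_ms(hop_samples: int = HOP_SAMPLES,
                    sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one hop of audio in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(compute_ms: float) -> float:
    """How many times faster than real time one hop is processed."""
    return hop_duration_ms() / compute_ms


if __name__ == "__main__":
    print(f"hop duration: {hop_duration_ms():.1f} ms")
    print(f"v1.2, 4 threads (1.56 ms/hop): {realtime_factor(1.56):.2f}x realtime")
    print(f"old bullet (1.66 ms/hop): {realtime_factor(1.66):.2f}x realtime")
```

With the values quoted in the diff, 1.56 ms per hop gives roughly 10.3x and the previous 1.66 ms gives roughly 9.6x, consistent with the rounded figures in the old and new bullets.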
 
  Source code, build system, tests, and training pipeline live in the GitHub
  repository: <https://github.com/localai-org/LocalVQE>.

+ The current release is **v1.2**. It doubles the supported delay
+ window from 500 ms to 1 second at a ~20 % per-hop CPU cost. It also
+ avoids oversuppression of voices that are near to the noise floor.

  The technical report describing the architecture, streaming-state contract,
  and streaming-causal normalisation operator is included in this repo as
 

  | File | Size | Description |
  |---|---|---|
+ | `localvqe-v1.2-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+ | `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |
+ | `localvqe-v1.1-1.3M.pt` | 11 MB | Previous release. |
+ | `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Previous release (F32 GGUF). |
+ | `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |

  Only F32 GGUF is published today. A `quantize` tool is included in the
  C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
 

  | Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
  |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
+ | doubletalk | 115 | 4.72 | 2.37 | 8.4 dB | 2.83 |
+ | doubletalk-with-movement | 185 | 4.65 | 2.30 | 8.1 dB | 2.79 |
+ | farend-singletalk | 107 | 3.78 | 4.91 | 45.7 dB | 1.80 |
+ | farend-singletalk-with-movement | 193 | 4.12 | 4.96 | 40.6 dB | 1.75 |
+ | nearend-singletalk | 200 | 5.00 | 4.16 | 2.1 dB | 3.17 |
+
+ v1.2 vs v1.1 deltas: AECMOS echo MOS +0.80 / +0.72 on FE-ST and
+ FE-ST-with-movement (the primary release goal — these scenarios are
+ where echo leaks through), near-end deg MOS +0.11, double-talk
+ roughly unchanged. FE-ST-with-movement raw ERLE drops 4.4 dB; v1.2
+ is less aggressive when the echo path is moving, trading raw
+ cancellation for fewer near-end gating artefacts.
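
The deltas quoted in the added paragraph can be recomputed from the removed (v1.1) and added (v1.2) table rows in this diff. The snippet below is a small cross-check, not part of the project's evaluation tooling; the dictionary layout and the `delta` helper are illustrative.

```python
# Rows copied from the v1.1 (removed) and v1.2 (added) tables in this diff.
# Tuple order: (AECMOS echo, AECMOS deg, blind ERLE in dB, DNSMOS OVRL).
V1_1 = {
    "doubletalk": (4.70, 2.35, 8.4, 2.85),
    "doubletalk-with-movement": (4.63, 2.35, 8.3, 2.80),
    "farend-singletalk": (2.98, 4.91, 44.7, 1.93),
    "farend-singletalk-with-movement": (3.40, 4.95, 45.0, 1.91),
    "nearend-singletalk": (4.99, 4.05, 2.5, 3.13),
}
V1_2 = {
    "doubletalk": (4.72, 2.37, 8.4, 2.83),
    "doubletalk-with-movement": (4.65, 2.30, 8.1, 2.79),
    "farend-singletalk": (3.78, 4.91, 45.7, 1.80),
    "farend-singletalk-with-movement": (4.12, 4.96, 40.6, 1.75),
    "nearend-singletalk": (5.00, 4.16, 2.1, 3.17),
}
METRICS = ("echo", "deg", "erle_db", "ovrl")


def delta(scenario: str, metric: str) -> float:
    """v1.2 minus v1.1 for one scenario and metric, rounded to 2 decimals."""
    i = METRICS.index(metric)
    return round(V1_2[scenario][i] - V1_1[scenario][i], 2)


if __name__ == "__main__":
    for scenario in V1_2:
        row = ", ".join(f"{m} {delta(scenario, m):+.2f}" for m in METRICS)
        print(f"{scenario}: {row}")
```

The recomputed values agree with the paragraph: +0.80 and +0.72 echo MOS on the two far-end single-talk scenarios, +0.11 near-end degradation MOS, and a 4.4 dB ERLE drop on far-end single-talk with movement.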

  - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"
 
  Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
  full `ggml_backend_graph_compute`.

+ **v1.2** (current, 1024 ms echo-search window):
+
+ | Backend | Threads | p50 | p99 | max |
+ |-----------------------------|--------:|--------:|--------:|--------:|
+ | CPU | 1 | 4.15 ms | 4.53 ms | 6.23 ms |
+ | CPU | 4 | 1.56 ms | 1.73 ms | 4.57 ms |
+ | CPU | 8 | 1.89 ms | 2.15 ms | 6.91 ms |
+ | CPU | 16 | 2.12 ms | 2.17 ms | 6.43 ms |
+ | Vulkan — AMD iGPU (RADV) | — | 4.88 ms | 5.06 ms | 6.24 ms |
+ | Vulkan — NVIDIA RTX 5070 Ti | — | 1.79 ms | 3.42 ms | 5.42 ms |
+
+ Beyond ≈4 threads the model is small enough that thread-launch and
+ synchronisation overhead dominate; **four threads is the sweet spot
+ on Zen4**.
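
The v1.2 CPU rows above can be read against the 16 ms hop budget. The following sketch picks the thread count with the lowest median latency, checks that the worst observed hop still fits in the budget, and compares the single-thread median with the v1.1 figure (3.40 ms) from the comparison table. The helper names are illustrative; the numbers are copied from the tables in this diff.

```python
from typing import Dict, Tuple

HOP_BUDGET_MS = 16.0  # 256 samples at 16 kHz

# v1.2 CPU rows from the table above: threads -> (p50, p99, max) in ms.
V1_2_CPU: Dict[int, Tuple[float, float, float]] = {
    1: (4.15, 4.53, 6.23),
    4: (1.56, 1.73, 4.57),
    8: (1.89, 2.15, 6.91),
    16: (2.12, 2.17, 6.43),
}
V1_1_P50_1_THREAD_MS = 3.40  # single-thread p50 from the v1.1 table


def best_thread_count(rows: Dict[int, Tuple[float, float, float]]) -> int:
    """Thread count with the lowest median (p50) per-hop latency."""
    return min(rows, key=lambda threads: rows[threads][0])


def worst_case_headroom_ms(rows: Dict[int, Tuple[float, float, float]],
                           threads: int) -> float:
    """Budget left over after the slowest observed hop."""
    return HOP_BUDGET_MS - rows[threads][2]


def relative_cost(new_ms: float, old_ms: float) -> float:
    """Fractional increase of new_ms over old_ms (0.2 means +20 %)."""
    return new_ms / old_ms - 1.0


if __name__ == "__main__":
    best = best_thread_count(V1_2_CPU)
    print(f"lowest p50 at {best} threads")
    print(f"worst-case headroom there: {worst_case_headroom_ms(V1_2_CPU, best):.2f} ms")
    extra = relative_cost(V1_2_CPU[1][0], V1_1_P50_1_THREAD_MS)
    print(f"v1.2 vs v1.1 single-thread p50: {extra:+.0%}")
```

On these numbers the lowest median is at four threads, every configuration's maximum stays well under 16 ms, and the single-thread median rises by about 22 %, in line with the "~20 % per-hop CPU cost" stated for v1.2.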
+
+ **v1.1** (previous, 512 ms echo-search window) for comparison:
+
  | Backend | Threads | p50 | p99 | max |
  |-----------------------------|--------:|--------:|--------:|--------:|
  | CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
 

  ## Running Inference

+ Download `localvqe-v1.2-1.3M-f32.gguf` from this repository (the file list above)
  either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
  `huggingface_hub`. Then:

  ### CLI

  ```bash
+ ./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
  ```
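
The download step that precedes the CLI call can also be scripted with `hf_hub_download`, which the README names as one option. This is a minimal sketch: the diff does not state the Hub repository id, so `repo_id` is left as a required argument rather than guessed, and `gguf_filename` and `download_weights` are hypothetical helper names. It assumes the third-party `huggingface_hub` package is installed.

```python
def gguf_filename(version: str = "1.2") -> str:
    """File name of the F32 GGUF export, following the file table's pattern."""
    return f"localvqe-v{version}-1.3M-f32.gguf"


def download_weights(repo_id: str, version: str = "1.2") -> str:
    """Fetch the GGUF file from the Hub and return its local cache path.

    repo_id is the id of this model repository on the Hub; it is not given
    in the diff, so the caller has to supply it.
    """
    # Imported lazily so the file-name helper works without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=gguf_filename(version))
```

The returned path can then be passed as the first argument to `./ggml/build/bin/localvqe` in place of the bare file name.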
 
  ### Benchmark

  ```bash
+ ./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
  ```
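
Both commands take `mic.wav` and `ref.wav`, and the README states that they must be 16 kHz mono PCM. A quick pre-flight check with Python's standard `wave` module might look like the sketch below. `check_wav` is a hypothetical helper, not part of LocalVQE; it checks only sample rate and channel count, since the diff does not specify a bit depth.

```python
import wave
from typing import List


def check_wav(path: str, expected_rate: int = 16_000,
              expected_channels: int = 1) -> List[str]:
    """Return a list of problems found in a PCM WAV file (empty if none).

    wave.open raises wave.Error for files it cannot parse as WAV, which is
    left to propagate to the caller.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wf.getframerate()} Hz, expected {expected_rate}")
        if wf.getnchannels() != expected_channels:
            problems.append(
                f"{wf.getnchannels()} channels, expected {expected_channels}")
    return problems


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        issues = check_wav(wav_path)
        print(f"{wav_path}: {'ok' if not issues else '; '.join(issues)}")
```

Running it as `python check_wav.py mic.wav ref.wav` (the script name is arbitrary) before invoking `localvqe` or `bench` flags files that would need resampling or downmixing first.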
 
 
  for experimentation:

  ```bash
+ ./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
  ```
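
For a rough idea of what a Q8_0 export would save, the sketch below estimates weight storage from the parameter count. It relies on GGML's Q8_0 block layout (32 int8 weights plus one 16-bit scale, 34 bytes per block) and assumes, purely for illustration, that every tensor is quantized and that GGUF metadata is negligible. Which tensors this project's `quantize` tool actually converts is not stated in the diff, so treat the result as a lower bound on file size.

```python
import math

Q8_0_BLOCK_WEIGHTS = 32      # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 2 + 32    # one fp16 scale + 32 int8 values


def f32_bytes(n_params: int) -> int:
    """Storage for n_params weights at 4 bytes each."""
    return 4 * n_params


def q8_0_bytes(n_params: int) -> int:
    """Storage if all weights were packed into Q8_0 blocks (an assumption)."""
    return math.ceil(n_params / Q8_0_BLOCK_WEIGHTS) * Q8_0_BLOCK_BYTES


if __name__ == "__main__":
    n = 1_300_000  # "1.3 M parameters" from the README
    print(f"F32:  {f32_bytes(n) / 1e6:.2f} MB")
    print(f"Q8_0: {q8_0_bytes(n) / 1e6:.2f} MB")
```

The F32 estimate of about 5.2 MB matches the "~5 MB F32" figure in the README; the all-Q8_0 estimate comes to roughly 1.4 MB.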
266
 
267
  Expect end-to-end quality loss until proper per-tensor selection and
 

  ## PyTorch Reference

+ `localvqe-v1.2-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
  It is provided for verification, ablation, and downstream research — not
  for end-user inference, which should go through the GGML build above. The
  model definition lives under `pytorch/` in the