juanquivilla committed on
Commit d3e73f6 · verified · 1 Parent(s): f4b3407

v4: 5-bit MLX, 233MB, long transcript support

Files changed (2)
  1. README.md +3 -123
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,127 +1,7 @@
  ---
- license: other
- license_name: lfm1.0
- license_link: https://www.liquid.ai/license
- base_model: juanquivilla/sotto-cleanup-lfm25-350m
- datasets:
- - juanquivilla/sotto-transcript-cleanup
- tags:
- - speech-to-text
- - transcript-cleanup
- - disfluency-correction
- - sotto-asr
- - lfm2
- - liquid-ai
- - mlx
- - apple-silicon
- - quantized
- - 5-bit
  library_name: mlx
  pipeline_tag: text-generation
- language:
- - en
  ---
-
- # SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit ⭐ Recommended
-
- <p align="center">
- <a href="https://sotto.app">sotto.app</a> ·
- <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m">Full precision (bf16)</a> ·
- <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit">MLX 4-bit (smaller)</a> ·
- <a href="https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup">Training Dataset</a>
- </p>
-
- ## Overview
-
- **5-bit MLX-quantized** version of the [SottoASR transcript cleanup model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m), optimized for inference on **Apple Silicon** (M1/M2/M3/M4). This is the **recommended deployment variant** — it delivers near-full-precision quality at 3x smaller size.
-
- This model powers on-device transcript cleanup in [**SottoASR**](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS.
-
- ## Key Specs
-
- | Property | Value |
- |----------|-------|
- | **Size** | **233 MB** (3x smaller than bf16) |
- | **ROUGE-L** | **0.926** (only 0.5% below full precision) |
- | **Exact Match** | **56.3%** (actually higher than bf16) |
- | **Filler-Free** | **99.3%** (vs 83% bf16 — quantization improves decisiveness) |
- | **Latency** | **129 ms** average per transcript |
- | **Quantization** | 5-bit affine, group_size=64 |
- | **Framework** | MLX (Apple Silicon optimized) |
- | **Architecture** | LFM2.5-350M hybrid (10 conv + 6 GQA attention layers) |
- | **Context** | 32,768 tokens |
-
- ## Why 5-bit?
-
- We benchmarked 4-bit, 5-bit, and 6-bit quantizations:
-
- | Variant | Size | ROUGE-L | Exact Match | Filler-Free | Quality Loss |
- |---------|------|---------|-------------|-------------|-------------|
- | bf16 | 676MB | 0.931 | 55.6% | 83.0% | — |
- | 6-bit | 275MB | 0.924 | 54.1% | 100% | -0.7% |
- | **5-bit** | **233MB** | **0.926** | **56.3%** | **99.3%** | **-0.5%** |
- | 4-bit | 190MB | 0.897 | 44.4% | 99.3% | -3.4% |
-
- **5-bit is the sweet spot:** minimal quality loss (-0.5%), 3x compression, and paradoxically improved filler removal (99.3% vs 83%). The quantization sharpens the model's decision boundaries for removing verbal noise.
-
- ## What It Does
-
- Cleans raw speech-to-text transcripts by removing disfluencies and fixing formatting:
-
- ```
- uh the server is uh running low on memory → The server is running low on memory.
- use redis wait no memcached is better → Use Memcached.
- send the email to john period → Send the email to John.
- lets go ahead and deploy this to staging → Let's go ahead and deploy this to staging.
- me and him was debugging all day → He and I were debugging all day.
- ```
-
- ## Usage
-
- ```python
- from mlx_lm import load, generate
- from mlx_lm.sample_utils import make_sampler
-
- # Load model (downloads ~233MB on first use)
- model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
- sampler = make_sampler(temp=0.0) # greedy for deterministic output
-
- # Clean a transcript
- raw = "uh the server is uh running low on memory"
- prompt = f"### Input:\n{raw}\n\n### Output:\n"
- output = generate(model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler)
- print(output.strip())
- # → "The server is running low on memory."
- ```
-
- **Requirements:** `pip install mlx-lm` and Apple Silicon Mac (M1 or later).
-
- ## Quantization Recipe
-
- Generated from the [bf16 model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) using:
-
- ```bash
- pip install mlx-lm
- mlx_lm.convert \
- --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
- --mlx-path sotto-cleanup-mlx-5bit \
- -q --q-bits 5 --q-group-size 64 \
- --trust-remote-code
- ```
-
- ## Training
-
- The base model was trained in two stages on 124K synthetic transcript cleanup pairs:
-
- 1. **Stage 1:** Full fine-tune of [LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) on 124K dataset → ROUGE-L 0.930
- 2. **Stage 2:** Concentrated hard-pattern FT on 14K examples → ROUGE-L 0.931
-
- See the [full training research document](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for details.
-
- ## Part of SottoASR
-
- [**SottoASR**](https://sottoasr.app) is a local, privacy-first speech-to-text application for macOS. Press a hotkey, speak, and clean text appears at your cursor. All audio processing and transcript cleanup happen entirely on-device — nothing is ever sent to a cloud service. This model is the transcript cleanup component.
-
- ## License
-
- Inherits the [LFM 1.0 license](https://www.liquid.ai/license) from the base model.
 
  ---
+ language: en
  library_name: mlx
  pipeline_tag: text-generation
+ tags:
+ - mlx
  ---
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e793a0abfd1cba3751cd834ce3c7a0566a211c09ce83a88ff089e21a1caffa1f
  size 243830226
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:74bb76432b46e9eaa27a0ad95b1c706855cab65fdaca5efb8550fad901578425
  size 243830226
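
The old and new pointers record the same `size` (243830226 bytes), so only the `oid` tells the two weight files apart. The sketch below is not part of the commit. It shows one way to check a locally downloaded `model.safetensors` against the new pointer; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is an assumption.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the new side of the diff above.
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:74bb76432b46e9eaa27a0ad95b1c706855cab65fdaca5efb8550fad901578425
size 243830226
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so a large weights file is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(NEW_POINTER)
    local_file = Path("model.safetensors")  # assumed download location
    if local_file.exists():
        print("matches new pointer:", matches_pointer(local_file, pointer))
```

Because the byte count is unchanged between the two versions, a size check alone cannot tell whether a cached file predates this commit; the digest comparison does.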