Commit 74443d8 (verified) · Parent(s): b303997 · committed by juanquivilla

v36: full-FT GRPO with substantive-deletion-aware reward — filler-free 96.9%, sub-del-15-long 0.64%
README.md CHANGED
@@ -17,39 +17,38 @@ datasets:
   - juanquivilla/sotto-transcript-cleanup
  ---
 
- # SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v23 + Paragraphs)
 
  [sottoasr.app](https://sottoasr.app) · [Full precision (bf16)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
 
  ## Overview
 
- **MLX 5-bit affine quantization** of [juanquivilla/sotto-cleanup-lfm25-350m](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for on-device deployment on Apple Silicon. **This is the recommended variant** for SottoASR's on-device transcript cleanup: minimal quality loss vs full precision, fits in 237 MB, runs at ~85 ms per typical transcript on M-series chips.
 
- This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and **(new in v23) restructures long dictations into paragraph-formatted prose**, all locally with zero cloud dependency.
 
- ## What's new in v23
 
- v23 adds **paragraph emission** for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples**, teaching the model to insert `\n\n` paragraph breaks at natural topic / time-reference / discourse-marker boundaries.
 
- | Capability | v22 (previous prod) | **v23 R6 (this model)** |
  |---|---|---|
- | Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** |
- | ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9784** |
- | ROUGE-L on standard val set | 0.9539 | 0.9537 |
- | **Filler-Free rate on standard val set** | 90.3 % | **91.0 %** ⭐ |
 
  ## Key Specs
 
  | Property | Value |
  |----------|-------|
- | **Size** | **237 MB** |
  | **Quantization** | 5-bit affine, group_size=64 |
  | **Effective bits/weight** | 5.502 |
- | **ROUGE-L (val set)** | ~0.9505 (≈ bf16) |
- | **Paragraph rate (long inputs)** | ~89.5 % |
  | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
  | **Latency** | ~85 ms average per transcript (M-series) |
 
  ## Quantization Recipe
 
  ```bash
@@ -93,21 +92,17 @@ For long dictation that may need paragraph formatting, raise `max_tokens` to 1024
  | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
  | uh yes | Yes. |
 
- ### NEW in v23: Paragraph emission on long dictations
-
- Multi-topic input is now restructured into paragraphed prose with `\n\n` breaks at natural topic boundaries. See the [bf16 model card](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for a full example.
-
- ## Benchmark Results
 
- Identical to the [bf16 model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) within MLX 5-bit quantization noise.
 
  ## All Variants
 
  | Variant | Size | Use Case |
  |---------|------|----------|
  | [Full precision (bf16)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) | 676 MB | Training, GPU inference |
- | **[MLX 5-bit (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | **237 MB** | **Recommended for Apple Silicon** |
- | [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off |
 
  ## License
 
 
  - juanquivilla/sotto-transcript-cleanup
  ---
 
+ # SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v36 + Preservation)
 
  [sottoasr.app](https://sottoasr.app) · [Full precision (bf16)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
 
  ## Overview
 
+ **MLX 5-bit affine quantization** of [juanquivilla/sotto-cleanup-lfm25-350m](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m). The recommended variant for most Apple Silicon users: best size/quality trade-off.
 
+ This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, restructures long dictations into paragraph-formatted prose, and **(new in v36) preserves substantive content reliably even on long inputs**, all locally with zero cloud dependency.
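
For a quick smoke test with [mlx-lm](https://github.com/ml-explore/mlx-lm), a minimal sketch: it assumes the bundled chat template, and the instruction string is illustrative. The card's own usage section (not shown in this diff) is authoritative.

```python
# Minimal sketch: pip install mlx-lm, then load the 5-bit variant and clean
# one transcript. The raw input below is an invented example.
from mlx_lm import load, generate

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")

raw = "um okay so basically we uh we need to ship the the build by friday"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": raw}],
    add_generation_prompt=True,
)
# For long dictation that may need paragraph breaks, raise max_tokens to 1024.
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```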
 
+ ## What's new in v36
 
+ v36 fixes the **aggressive-edits failure mode** that earlier checkpoints occasionally exhibited: on long inputs, the model would sometimes delete substantive content along with the fillers. v36 is a GRPO **full fine-tune** (all 354M params trainable, no LoRA) with a substantive-deletion-aware reward. Result: the incidence of substantive deletion above 15% on long inputs drops from **3.85% → 0.64%**, while the filler-free rate climbs from **50.9% → 96.9%**.
 
+ | Capability | v23 baseline | **v36 (this model)** |
  |---|---|---|
+ | Filler-Free rate | 50.9 % | **96.9 %** |
+ | Substantive deletion >15% on long inputs | 3.85 % | **0.64 %** |
+ | ROUGE-L F1 on long inputs (>100 words) | 0.9242 | **0.9425** |
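
The card does not publish the exact reward terms, so the sketch below is only a plausible shape for a substantive-deletion-aware reward: a filler-free bonus plus a graded penalty on the fraction of non-filler source words missing from the output, with a large penalty past the 15% cutoff tracked above. All names, weights, and the filler list are assumptions.

```python
# Hypothetical sketch of a substantive-deletion-aware GRPO reward. This is
# NOT the actual v36 reward, whose exact terms are not published here.
import re

FILLERS = {"um", "uh", "like", "basically", "literally", "actually"}  # illustrative

def words(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def substantive_deletion_ratio(source: str, output: str) -> float:
    """Fraction of non-filler source words that are missing from the output."""
    substantive = [w for w in words(source) if w not in FILLERS]
    if not substantive:
        return 0.0
    kept = set(words(output))
    return sum(w not in kept for w in substantive) / len(substantive)

def reward(source: str, output: str) -> float:
    # Bonus for filler-free output, mild penalty otherwise.
    r = 1.0 if all(w not in FILLERS for w in words(output)) else -0.5
    ratio = substantive_deletion_ratio(source, output)
    # The ">15% substantive deletion" failure tracked in the table above gets
    # a large penalty; smaller deletions apply graded pressure.
    return r - (2.0 if ratio > 0.15 else ratio)
```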
 
  ## Key Specs
 
  | Property | Value |
  |----------|-------|
+ | **Size** | **~237 MB** |
  | **Quantization** | 5-bit affine, group_size=64 |
  | **Effective bits/weight** | 5.502 |
  | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
  | **Latency** | ~85 ms average per transcript (M-series) |
 
+ Quality at this quantization tracks the bf16 model closely. See the [bf16 model card](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for full benchmark numbers.
+
  ## Quantization Recipe
 
  ```bash
 
  | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
  | uh yes | Yes. |
 
+ ### Paragraph emission on long dictations (inherited from v23)
 
+ Multi-topic input is restructured into paragraphed prose with `\n\n` breaks at natural topic boundaries. See the [bf16 model card](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for a full example.
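
Illustration only (the official worked example is on the bf16 card): the double newline is the paragraph delimiter downstream code can rely on. The output string below is invented.

```python
# Invented sample output: paragraphs are separated by "\n\n".
cleaned = (
    "We're about twenty percent over on the budget, so I'll revise the forecast.\n\n"
    "On hiring, we still need two more engineers before the March deadline."
)
paragraphs = [p.strip() for p in cleaned.split("\n\n") if p.strip()]
assert len(paragraphs) == 2  # one paragraph per topic
```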
 
  ## All Variants
 
  | Variant | Size | Use Case |
  |---------|------|----------|
  | [Full precision (bf16)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) | 676 MB | Training, GPU inference |
+ | **[MLX 5-bit (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | ~237 MB | **Recommended for Apple Silicon** |
+ | [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | ~195 MB | Smallest, slight quality trade-off |
 
  ## License
 
config.json CHANGED
@@ -21,6 +21,7 @@
  "eos_token_id": [
    7
  ],
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 6656,
@@ -64,9 +65,9 @@
  "rope_theta": 1000000.0,
  "rope_type": "default"
  },
- "tie_embedding": true,
  "tie_word_embeddings": true,
- "transformers_version": "5.3.0",
  "use_cache": false,
  "use_pos_enc": true,
  "vocab_size": 65536
 
  "eos_token_id": [
    7
  ],
+ "full_attn_idxs": null,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 6656,
@@ -64,9 +65,9 @@
  "rope_theta": 1000000.0,
  "rope_type": "default"
  },
+ "rope_theta": 1000000.0,
  "tie_word_embeddings": true,
+ "transformers_version": "5.6.2",
  "use_cache": false,
  "use_pos_enc": true,
  "vocab_size": 65536
generation_config.json CHANGED
@@ -5,5 +5,5 @@
    7
  ],
  "pad_token_id": 0,
- "transformers_version": "5.3.0"
  }
 
    7
  ],
  "pad_token_id": 0,
+ "transformers_version": "5.6.2"
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bb28992c3112af29230a97abc161d8cc4fc2e4b3b7736695696d5f9842b4ea01
- size 243830226
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:8fdff7017e6929cb5d3bb90e3136da4be6bfa0897109286532486dc92d57ab33
+ size 243830312
tokenizer_config.json CHANGED
@@ -6,6 +6,7 @@
  "extra_special_tokens": [],
  "is_local": true,
  "legacy": false,
  "model_input_names": [
    "input_ids",
    "attention_mask"
 
  "extra_special_tokens": [],
  "is_local": true,
  "legacy": false,
+ "local_files_only": false,
  "model_input_names": [
    "input_ids",
    "attention_mask"