juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit

@@ -1,127 +1,7 @@
 ---
-license: other
-license_name: lfm1.0
-license_link: https://www.liquid.ai/license
-base_model: juanquivilla/sotto-cleanup-lfm25-350m
-datasets:
-  - juanquivilla/sotto-transcript-cleanup
-tags:
-  - speech-to-text
-  - transcript-cleanup
-  - disfluency-correction
-  - sotto-asr
-  - lfm2
-  - liquid-ai
-  - mlx
-  - apple-silicon
-  - quantized
-  - 5-bit
 library_name: mlx
 pipeline_tag: text-generation
-language:
-  - en
 ---
-# SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit ⭐ Recommended
-<p align="center">
-  <a href="https://sotto.app">sotto.app</a> ·
-  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m">Full precision (bf16)</a> ·
-  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit">MLX 4-bit (smaller)</a> ·
-  <a href="https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup">Training Dataset</a>
-</p>
-## Overview
-**5-bit MLX-quantized** version of the [SottoASR transcript cleanup model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m), optimized for inference on **Apple Silicon** (M1/M2/M3/M4). This is the **recommended deployment variant** — it delivers near-full-precision quality at 3x smaller size.
-This model powers on-device transcript cleanup in [**SottoASR**](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS.
-## Key Specs
-| Property | Value |
-|----------|-------|
-| **Size** | **233 MB** (3x smaller than bf16) |
-| **ROUGE-L** | **0.926** (only 0.5% below full precision) |
-| **Exact Match** | **56.3%** (actually higher than bf16) |
-| **Filler-Free** | **99.3%** (vs 83% bf16 — quantization improves decisiveness) |
-| **Latency** | **129 ms** average per transcript |
-| **Quantization** | 5-bit affine, group_size=64 |
-| **Framework** | MLX (Apple Silicon optimized) |
-| **Architecture** | LFM2.5-350M hybrid (10 conv + 6 GQA attention layers) |
-| **Context** | 32,768 tokens |
-## Why 5-bit?
-We benchmarked 4-bit, 5-bit, and 6-bit quantizations:
-| Variant | Size | ROUGE-L | Exact Match | Filler-Free | Quality Loss |
-|---------|------|---------|-------------|-------------|-------------|
-| bf16 | 676MB | 0.931 | 55.6% | 83.0% | — |
-| 6-bit | 275MB | 0.924 | 54.1% | 100% | -0.7% |
-| **5-bit** | **233MB** | **0.926** | **56.3%** | **99.3%** | **-0.5%** |
-| 4-bit | 190MB | 0.897 | 44.4% | 99.3% | -3.4% |
-**5-bit is the sweet spot:** minimal quality loss (-0.5%), 3x compression, and paradoxically improved filler removal (99.3% vs 83%). The quantization sharpens the model's decision boundaries for removing verbal noise.
-## What It Does
-Cleans raw speech-to-text transcripts by removing disfluencies and fixing formatting:
-```
-uh the server is uh running low on memory  →  The server is running low on memory.
-use redis wait no memcached is better      →  Use Memcached.
-send the email to john period              →  Send the email to John.
-lets go ahead and deploy this to staging   →  Let's go ahead and deploy this to staging.
-me and him was debugging all day           →  He and I were debugging all day.
-```
-## Usage
-```python
-from mlx_lm import load, generate
-from mlx_lm.sample_utils import make_sampler
-# Load model (downloads ~233MB on first use)
-model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
-sampler = make_sampler(temp=0.0)  # greedy for deterministic output
-# Clean a transcript
-raw = "uh the server is uh running low on memory"
-prompt = f"### Input:\n{raw}\n\n### Output:\n"
-output = generate(model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler)
-print(output.strip())
-# → "The server is running low on memory."
-```
-**Requirements:** `pip install mlx-lm` and Apple Silicon Mac (M1 or later).
-## Quantization Recipe
-Generated from the [bf16 model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) using:
-```bash
-pip install mlx-lm
-mlx_lm.convert \
-  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
-  --mlx-path sotto-cleanup-mlx-5bit \
-  -q --q-bits 5 --q-group-size 64 \
-  --trust-remote-code
-```
-## Training
-The base model was trained in two stages on 124K synthetic transcript cleanup pairs:
-1. **Stage 1:** Full fine-tune of [LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) on 124K dataset → ROUGE-L 0.930
-2. **Stage 2:** Concentrated hard-pattern FT on 14K examples → ROUGE-L 0.931
-See the [full training research document](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m) for details.
-## Part of SottoASR
-[**SottoASR**](https://sottoasr.app) is a local, privacy-first speech-to-text application for macOS. Press a hotkey, speak, and clean text appears at your cursor. All audio processing and transcript cleanup happen entirely on-device — nothing is ever sent to a cloud service. This model is the transcript cleanup component.
-## License
-Inherits the [LFM 1.0 license](https://www.liquid.ai/license) from the base model.

 ---
+language: en
 library_name: mlx
 pipeline_tag: text-generation
+tags:
+- mlx
 ---
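The removed card reports file sizes and a "Quality Loss" column without showing how either was derived. The sketch below reproduces them from numbers already in this commit. It relies on assumptions that the card does not state: MLX affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights, so a 5-bit weight costs 5 + 32/64 = 5.5 bits; every weight is quantized; "MB" means MiB; and "Quality Loss" is the absolute ROUGE-L difference from bf16, expressed in points.

```python
# Sanity-check arithmetic for the figures in the removed model card.
# ASSUMPTIONS (not stated in the card): MLX affine quantization keeps one
# 16-bit scale and one 16-bit bias per group of 64 weights, every weight is
# quantized, and the card's "MB" means MiB (2**20 bytes).

GROUP_SIZE = 64
SCALE_BIAS_BITS = 2 * 16  # one 16-bit scale + one 16-bit bias per group
MIB = 2**20


def bits_per_weight(q_bits: int, group_size: int = GROUP_SIZE) -> float:
    """Effective storage cost per weight, including per-group scale and bias."""
    return q_bits + SCALE_BIAS_BITS / group_size


# Size of model.safetensors, taken from the Git LFS pointer in this commit.
FILE_BYTES_5BIT = 243_830_226

# Back out the approximate number of weights from the 5-bit file size.
n_weights = FILE_BYTES_5BIT * 8 / bits_per_weight(5)


def estimated_mib(bits: float) -> float:
    """Estimated file size in MiB if every weight costs `bits` bits."""
    return n_weights * bits / 8 / MIB


sizes = {
    "bf16": estimated_mib(16),
    "6-bit": estimated_mib(bits_per_weight(6)),
    "5-bit": estimated_mib(bits_per_weight(5)),
    "4-bit": estimated_mib(bits_per_weight(4)),
}

# "Quality Loss" read as the absolute ROUGE-L difference from bf16, in points.
# This is an inference: it reproduces every row of the card's table.
rouge_l = {"bf16": 0.931, "6-bit": 0.924, "5-bit": 0.926, "4-bit": 0.897}
quality_loss = {
    name: round((score - rouge_l["bf16"]) * 100, 1)
    for name, score in rouge_l.items()
    if name != "bf16"
}

print(f"implied weights: {n_weights / 1e6:.1f}M")
for name, mib in sizes.items():
    print(f"{name:>5}: ~{mib:.0f} MiB")
print(quality_loss)
```

Under these assumptions, the 5-bit file size implies roughly 354.7M weights. The resulting estimates round to the card's 676, 275, 233, and 190 MB. If "Quality Loss" were a relative change instead, the 4-bit row would be about -3.7% rather than -3.4%, so the absolute-points reading is the more likely one.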

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e793a0abfd1cba3751cd834ce3c7a0566a211c09ce83a88ff089e21a1caffa1f
 size 243830226

 version https://git-lfs.github.com/spec/v1
+oid sha256:74bb76432b46e9eaa27a0ad95b1c706855cab65fdaca5efb8550fad901578425
 size 243830226
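The LFS pointer shows that `model.safetensors` was replaced by a file of exactly the same size (243,830,226 bytes). Size alone therefore cannot tell the old and new weights apart; only the SHA-256 `oid` can. Below is a minimal standard-library sketch for checking a local copy against a pointer. The helper names are illustrative, and the local path in the usage note is hypothetical.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse a Git LFS pointer file (one 'key value' pair per line)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: dict[str, str]) -> bool:
    """Return True if the file at `path` has the pointer's size and SHA-256."""
    algo, _, expected_hex = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    # Cheap check first: a size mismatch rules the file out immediately.
    if path.stat().st_size != int(pointer["size"]):
        return False
    # Hash in 1 MiB chunks so large weight files are not loaded into memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex


# Pointers copied from this commit: before and after the change.
OLD_POINTER = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:e793a0abfd1cba3751cd834ce3c7a0566a211c09ce83a88ff089e21a1caffa1f
size 243830226
""")
NEW_POINTER = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:74bb76432b46e9eaa27a0ad95b1c706855cab65fdaca5efb8550fad901578425
size 243830226
""")
```

For example, `matches_pointer(Path("model.safetensors"), NEW_POINTER)` should return `True` only for the weights from after this commit. `OLD_POINTER` can be checked the same way to confirm whether a cached copy predates it.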