SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v45 + Numbers)

sottoasr.app · Full precision (bf16) · MLX 4-bit (smaller) · Training Dataset

Overview

MLX 5-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m. The recommended variant for most Apple Silicon users — best size/quality trade-off.

This model powers on-device transcript cleanup in SottoASR, a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, restructures long dictations into paragraph-formatted prose, and reliably preserves substantive content even on long inputs. New in v45, it also converts spoken-form numbers to digit form (inverse text normalization). Everything runs locally with zero cloud dependency.

What's new in v45

v45 adds inverse text normalization (ITN): when users dictate compound spoken numbers like "talk about server three sixty," v45 reliably produces "Talk about server 360." Earlier versions (v36 and prior) either preserved the spoken form (looks unprofessional) or attempted the conversion incorrectly. v45 covers all common ITN categories — compound numbers, hundreds, four-digit years, times, decimals, percentages, currency, ordinals, dates — while continuing to preserve cardinals in idioms ("I'll be there in five" stays as written).

| Capability | v36 (preservation) | v45 (this model) |
|---|---|---|
| Number accuracy (171-sample stratified set) | 12.9% | 95.9% |
| Filler-free rate | 96.9% | 97.0% |
| Substantive deletion >15% on long inputs† | 13.3% | 13.7% (~tied) |
| Word-retention median | 0.884 | 0.922 |

† Measured on all 241 long inputs (>100 words) from data_v23_paragraphs/val.jsonl — a stricter metric than v36's published 0.64% (which was on a 350-sample mix). v45 inherits v36's deletion-aware behavior on the same eval.

Key Specs

| Property | Value |
|---|---|
| Size | ~237 MB |
| Quantization | 5-bit affine, group_size=64 |
| Effective bits/weight | 5.502 |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Latency | ~85 ms average per transcript (M-series) |

Quality at this quantization tracks the bf16 model closely. See the bf16 model card for full benchmark numbers, training pipeline, and reward shape.
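
The 5.502 effective bits/weight figure is consistent with the overhead of group-wise affine quantization: each group of weights stores a scale and a bias alongside the quantized values. A quick back-of-envelope check, assuming an fp16 scale and fp16 bias per group (an assumption; the small remainder above 5.5 likely comes from tensors left unquantized, such as norms):

```python
def effective_bits(bits: int, group_size: int,
                   scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight for group-wise affine quantization.

    Assumes one fp16 scale and one fp16 bias per group (an assumption,
    not confirmed by the model card).
    """
    return bits + (scale_bits + bias_bits) / group_size

print(effective_bits(5, 64))  # → 5.5, close to the reported 5.502
```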

Quantization Recipe

```bash
mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --trust-remote-code
```

Usage

Python (mlx_lm)

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
sampler = make_sampler(temp=0.0)  # greedy decoding

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"

output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
if "###" in output:
    output = output[:output.index("###")].strip()
print(output)
# → "Talk about server 360."
```

For long dictation that may need paragraph formatting, raise max_tokens to 1024–2048.
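
The prompt template and the `###` stop-marker trimming are worth factoring into small helpers so callers cannot get the format wrong. A minimal sketch (helper names are illustrative; the string handling is pure Python, so it can be unit-tested without loading the model):

```python
def build_prompt(text: str) -> str:
    """Wrap raw ASR text in the model's Input/Output template."""
    return f"### Input:\n{text}\n\n### Output:\n"

def extract_output(raw: str) -> str:
    """Trim at the first '###' marker in case the model over-generates."""
    return raw.split("###", 1)[0].strip()

prompt = build_prompt("talk about server three sixty")
# Pass `prompt` to generate(...) as shown above, then:
print(extract_output("Talk about server 360.\n\n### Input:"))
# → "Talk about server 360."
```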

What It Does

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| talk about server three sixty | Talk about server 360. |
| schedule it for three fifteen pm | Schedule it for 3:15 PM. |
| we hit ninety eight percent uptime last month | We hit 98% uptime last month. |
| transfer fifty dollars to billing | Transfer $50 to billing. |
| i'll be there in five | I'll be there in five. |
| we run twenty four seven | We run 24/7. |
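
The pairs above also make a handy smoke-test fixture after re-quantizing or updating the model. In this sketch, `clean_transcript` is a placeholder for your wrapper around the `generate` call from the Usage section, not a real API:

```python
# Expected input → output pairs, taken from the table above.
FIXTURE = [
    ("so uh basically we need to fix the deployment pipeline",
     "We need to fix the deployment pipeline."),
    ("talk about server three sixty", "Talk about server 360."),
    ("schedule it for three fifteen pm", "Schedule it for 3:15 PM."),
    ("i'll be there in five", "I'll be there in five."),
]

def run_smoke_test(clean_transcript) -> list:
    """Return mismatch descriptions; an empty list means all cases pass."""
    failures = []
    for raw, expected in FIXTURE:
        got = clean_transcript(raw)
        if got != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {got!r}")
    return failures

# A passthrough stub fails every case, as expected:
print(len(run_smoke_test(lambda s: s)))  # → 4
```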

Paragraph emission on long dictations (inherited from v23)

Multi-topic input is restructured into paragraphed prose with \n\n breaks at natural topic boundaries. See the bf16 model card for a full example.
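
If downstream code needs the paragraphs individually, the `\n\n` breaks are straightforward to split on (a sketch with placeholder text standing in for real model output):

```python
# Placeholder for cleaned long-dictation output with paragraph breaks.
cleaned = "First topic, cleaned into prose.\n\nSecond topic, cleaned into prose."
paragraphs = [p for p in cleaned.split("\n\n") if p.strip()]
print(len(paragraphs))  # → 2
```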

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit (this model) | ~237 MB | Recommended for Apple Silicon |
| MLX 4-bit | ~195 MB | Smallest; slight quality trade-off |

License

MIT
