SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v45 + Numbers)

sottoasr.app · Full precision (bf16) · MLX 4-bit (smaller) · Training Dataset

Overview

MLX 5-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m. The recommended variant for most Apple Silicon users — best size/quality trade-off.

This model powers on-device transcript cleanup in SottoASR, a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, restructures long dictations into paragraph-formatted prose, and reliably preserves substantive content even on long inputs. New in v45, it also converts spoken-form numbers to digit form (inverse text normalization). Everything runs locally with zero cloud dependency.

What's new in v45

v45 adds inverse text normalization (ITN): when users dictate compound spoken numbers like "talk about server three sixty," v45 reliably produces "Talk about server 360." Earlier versions (v36 and prior) either preserved the spoken form (looks unprofessional) or attempted the conversion incorrectly. v45 covers all common ITN categories — compound numbers, hundreds, four-digit years, times, decimals, percentages, currency, ordinals, dates — while continuing to preserve cardinals in idioms ("I'll be there in five" stays as written).

| Capability | v36 (preservation) | v45 (this model) |
|---|---|---|
| Number accuracy (171-sample stratified set) | 12.9% | 95.9% |
| Filler-free rate | 96.9% | 97.0% |
| Substantive deletion >15% on long inputs† | 13.3% | 13.7% (~tied) |
| Word-retention median | 0.884 | 0.922 |

† Measured on all 241 long inputs (>100 words) from data_v23_paragraphs/val.jsonl — a stricter metric than v36's published 0.64% (which was on a 350-sample mix). v45 inherits v36's deletion-aware behavior on the same eval.

Key Specs

| Property | Value |
|---|---|
| Size | ~237 MB |
| Quantization | 5-bit affine, group_size=64 |
| Effective bits/weight | 5.502 |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Latency | ~85 ms average per transcript (M-series) |

Quality at this quantization tracks the bf16 model closely. See the bf16 model card for full benchmark numbers, training pipeline, and reward shape.
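
The 5.502 effective bits/weight figure is consistent with the overhead of group-wise affine quantization: each group of weights stores a scale and a bias alongside the quantized values. A quick back-of-envelope check, assuming an fp16 scale and fp16 bias per group (an assumption; the small remainder above 5.5 likely comes from tensors left unquantized, such as norms):

```python
def effective_bits(bits: int, group_size: int,
                   scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight for group-wise affine quantization.

    Assumes one fp16 scale and one fp16 bias per group (an assumption,
    not confirmed by the model card).
    """
    return bits + (scale_bits + bias_bits) / group_size

print(effective_bits(5, 64))  # → 5.5, close to the reported 5.502
```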

Quantization Recipe

```bash
mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --trust-remote-code
```

Usage

Python (mlx_lm)

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
sampler = make_sampler(temp=0.0)  # greedy decoding

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"

output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
if "###" in output:
    output = output[:output.index("###")].strip()
print(output)
# → "Talk about server 360."
```

For long dictation that may need paragraph formatting, raise max_tokens to 1024–2048.
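
The prompt template and the `###` stop-marker trimming are worth factoring into small helpers so callers cannot get the format wrong. A minimal sketch (helper names are illustrative; the string handling is pure Python, so it can be unit-tested without loading the model):

```python
def build_prompt(text: str) -> str:
    """Wrap raw ASR text in the model's Input/Output template."""
    return f"### Input:\n{text}\n\n### Output:\n"

def extract_output(raw: str) -> str:
    """Trim at the first '###' marker in case the model over-generates."""
    return raw.split("###", 1)[0].strip()

prompt = build_prompt("talk about server three sixty")
# Pass `prompt` to generate(...) as shown above, then:
print(extract_output("Talk about server 360.\n\n### Input:"))
# → "Talk about server 360."
```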

What It Does

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| talk about server three sixty | Talk about server 360. |
| schedule it for three fifteen pm | Schedule it for 3:15 PM. |
| we hit ninety eight percent uptime last month | We hit 98% uptime last month. |
| transfer fifty dollars to billing | Transfer $50 to billing. |
| i'll be there in five | I'll be there in five. |
| we run twenty four seven | We run 24/7. |
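
The pairs above also make a handy smoke-test fixture after re-quantizing or updating the model. In this sketch, `clean_transcript` is a placeholder for your wrapper around the `generate` call from the Usage section, not a real API:

```python
# Expected input → output pairs, taken from the table above.
FIXTURE = [
    ("so uh basically we need to fix the deployment pipeline",
     "We need to fix the deployment pipeline."),
    ("talk about server three sixty", "Talk about server 360."),
    ("schedule it for three fifteen pm", "Schedule it for 3:15 PM."),
    ("i'll be there in five", "I'll be there in five."),
]

def run_smoke_test(clean_transcript) -> list:
    """Return mismatch descriptions; an empty list means all cases pass."""
    failures = []
    for raw, expected in FIXTURE:
        got = clean_transcript(raw)
        if got != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {got!r}")
    return failures

# A passthrough stub fails every case, as expected:
print(len(run_smoke_test(lambda s: s)))  # → 4
```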

Paragraph emission on long dictations (inherited from v23)

Multi-topic input is restructured into paragraphed prose with \n\n breaks at natural topic boundaries. See the bf16 model card for a full example.
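
If downstream code needs the paragraphs individually, the `\n\n` breaks are straightforward to split on (a sketch with placeholder text standing in for real model output):

```python
# Placeholder for cleaned long-dictation output with paragraph breaks.
cleaned = "First topic, cleaned into prose.\n\nSecond topic, cleaned into prose."
paragraphs = [p for p in cleaned.split("\n\n") if p.strip()]
print(len(paragraphs))  # → 2
```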

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit (this model) | ~237 MB | Recommended for Apple Silicon |
| MLX 4-bit | ~195 MB | Smallest; slight quality trade-off |

License

MIT
