Yooz-Quality-v2-Qwen3.5-0.8B-LoRA

Speech-to-text touchup model from Yooz Labs. Takes raw STT output and returns a cleaned-up version, either lightly proofread or more aggressively rewritten, as a single JSON object: {"result": "..."}.

This is the second-generation Quality tier of the Yooz Engine touchup stack. It runs fully on-device on Apple Silicon via MLX, in the privacy-first Yooz tradition: no cloud, no logging, no leak.

What it is

  • Task: post-process automatic speech recognition output. Two modes:
    • Proofread -- minimal edits: spelling, punctuation, capitalization, obvious word-choice errors. Filler words preserved.
    • Rewrite -- larger edits: removes filler, fixes grammar, splits run-on sentences, normalizes spoken numbers. Meaning preserved.
  • Output contract: always a single JSON object of the form {"result": "<cleaned text>"}. No explanation, no markdown.
  • Footprint: 4-bit MLX quantized weights, ~424 MB on disk.

Lineage

  • Base: mlx-community/Qwen3.5-0.8B-MLX-4bit (Apache 2.0).
  • Method: LoRA fine-tune via mlx-lm, then fused with mlx_lm.fuse into a standalone 4-bit checkpoint.
  • Training data: yooz-touchup synthetic dataset, 5,936 samples covering:
    • 2 modes: proofread, rewrite
    • 4 domains: casual, technical, business, dictation
    • 3 difficulties: easy, medium, hard
  • Adapter: rank-8 LoRA on q/k/v/o projections. 2,500 iters, AdamW.
  • Why "v2": v1 of the Yooz Engine Quality tier was Qwen2.5-0.5B. v2 upgrades to Qwen3.5-0.8B with substantially better instruction adherence and strict JSON compliance. The single-model architecture decision lives in yooz-engine#74.

Evaluation

Benchmarked on the gold-standard test split of yooz-touchup (n=5,936), combining both proofread and rewrite prompts.

| Metric | Value |
| --- | --- |
| Exact match (combined) | 19.4% |
| JSON parse rate | 99.4% |
| Avg CER | 0.263 |
| Avg semantic similarity | 0.902 |
| Avg latency (M-series, batch=1) | 311 ms |

By mode:

| Mode | EM | CER | Sim |
| --- | --- | --- | --- |
| Proofread | 32.4% | 0.063 | 0.944 |
| Rewrite | 6.4% | 0.463 | 0.860 |

By difficulty:

| Difficulty | n | EM | CER | Sim |
| --- | --- | --- | --- | --- |
| easy | 2,234 | 35.7% | 0.286 | 0.918 |
| medium | 2,290 | 12.6% | 0.233 | 0.900 |
| hard | 1,412 | 4.6% | 0.275 | 0.881 |

By domain:

| Domain | n | EM | CER | Sim |
| --- | --- | --- | --- | --- |
| casual | 706 | 41.4% | 0.401 | 0.910 |
| technical | 3,318 | 14.2% | 0.249 | 0.891 |
| business | 1,866 | 20.0% | 0.237 | 0.920 |
| dictation | 46 | 32.6% | 0.221 | 0.871 |

Latency was measured on Apple Silicon M-series with mlx-lm, batch size 1, max_tokens=320. Full ablation in yooz-engine#75.
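
To score the model on your own transcripts, the sketch below computes exact match and CER for (hypothesis, reference) pairs. It is not the evaluation code behind the numbers above: the card does not define its metrics precisely, so this assumes CER is character-level Levenshtein distance divided by reference length and EM is string equality after stripping surrounding whitespace. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypothesis: str, reference: str) -> bool:
    """Assumed definition: equality after stripping surrounding whitespace."""
    return hypothesis.strip() == reference.strip()


def summarize(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average EM and CER over (hypothesis, reference) pairs."""
    n = len(pairs)
    return {
        "em": sum(exact_match(h, r) for h, r in pairs) / n,
        "cer": sum(cer(h, r) for h, r in pairs) / n,
    }
```

Semantic similarity is left out because it needs an embedding model, and the card does not say which one was used.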

Prompt modes

Both modes use a system prompt + the raw STT transcription as the user message. Use these system prompts verbatim -- the model was trained on them and behavior degrades with paraphrases.

Proofread

You are a copy editor. Proofread the speech-to-text transcription provided by the user. Fix spelling, punctuation, capitalization, and obvious word-choice errors caused by the speech-to-text engine. Convert spoken numbers to digits where it improves clarity (e.g. "nine am" -> "9 AM"). Keep the speaker's voice; do NOT rephrase or remove filler words (um, uh, like). Output ONLY a single JSON object of the form {"result": "<corrected>"} where <corrected> is the actual corrected text. No explanation, no markdown.

Rewrite

You are an editor. Rewrite the speech-to-text transcription provided by the user for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler words (um, uh, like, you know, basically). Convert spoken numbers to digits. Split run-on sentences. Preserve the original meaning and intent. Output ONLY a single JSON object of the form {"result": "<rewritten>"} where <rewritten> is the actual rewritten text. No explanation, no markdown.
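
One way to keep the two prompts from drifting is to store them once as constants and select by mode. The sketch below copies both prompts from the text above; `SYSTEM_PROMPTS` and `build_messages` are illustrative names, not part of any Yooz API.

```python
PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    'improves clarity (e.g. "nine am" -> "9 AM"). '
    "Keep the speaker's voice; do NOT rephrase or remove filler words (um, uh, like). "
    'Output ONLY a single JSON object of the form {"result": "<corrected>"} where '
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

REWRITE_SYSTEM = (
    "You are an editor. Rewrite the speech-to-text transcription provided by the user "
    "for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler "
    "words (um, uh, like, you know, basically). Convert spoken numbers to digits. "
    "Split run-on sentences. Preserve the original meaning and intent. "
    'Output ONLY a single JSON object of the form {"result": "<rewritten>"} where '
    "<rewritten> is the actual rewritten text. No explanation, no markdown."
)

# Hypothetical lookup table keyed by the two mode names used in this card.
SYSTEM_PROMPTS = {"proofread": PROOFREAD_SYSTEM, "rewrite": REWRITE_SYSTEM}


def build_messages(mode: str, raw_stt: str) -> list[dict[str, str]]:
    """Return chat messages: the verbatim system prompt plus the raw transcript."""
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SYSTEM_PROMPTS)}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": raw_stt},
    ]
```

The returned list can be passed straight to `tokenizer.apply_chat_template`, as in the MLX-LM usage example.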

Usage

Option A -- MLX-LM (Python, fused checkpoint)

from mlx_lm import load, generate

model, tokenizer = load("YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA")

PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    "improves clarity (e.g. \"nine am\" -> \"9 AM\"). Keep the speaker's voice; do NOT "
    "rephrase or remove filler words (um, uh, like). "
    "Output ONLY a single JSON object of the form {\"result\": \"<corrected>\"} where "
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

raw_stt = "um so the meeting is at nine am tomorow with sarah from product"

messages = [
    {"role": "system", "content": PROOFREAD_SYSTEM},
    {"role": "user", "content": raw_stt},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=320, verbose=False)
# response is a JSON string: {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}

import json
cleaned = json.loads(response)["result"]
print(cleaned)

For rewrite mode, swap the system prompt for the Rewrite prompt listed under "Prompt modes" above.

Option B -- Yooz Engine HTTP API

If the Yooz Engine menu bar service is running, it serves this model through its local API on localhost:19920:

curl -s http://localhost:19920/v1/touchup \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "um so the meeting is at nine am tomorow with sarah from product",
    "mode": "proofread"
  }'
# {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}

The engine handles model loading, batching, and JSON enforcement. See the engine's API docs.
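
The same call can be made from Python using only the standard library. This sketch takes the URL, payload fields, and response shape from the curl example above. Passing `"rewrite"` as the mode is an assumption based on the mode names in this card, and the 10-second timeout is an arbitrary illustrative value.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust it if your engine listens elsewhere.
TOUCHUP_URL = "http://localhost:19920/v1/touchup"


def build_touchup_request(
    text: str, mode: str = "proofread", url: str = TOUCHUP_URL
) -> urllib.request.Request:
    """Build the POST request without sending it, so it is easy to inspect."""
    payload = json.dumps({"text": text, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def touchup(text: str, mode: str = "proofread", timeout: float = 10.0) -> str:
    """Send the request and return the cleaned text from {"result": "..."}."""
    request = build_touchup_request(text, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]
```

With the engine running, `touchup("um so the meeting is at nine am tomorow")` returns the cleaned string. When the service is down, `urllib.error.URLError` is raised and the caller needs to handle it.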

Option C -- Adapter-only (load base + adapter at runtime)

If you'd rather keep the base model unmodified and apply the LoRA adapter at load time, the adapter weights live under adapters/ in this repo:

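from huggingface_hub import snapshot_download
from mlx_lm import load

# adapter_path is expected to be a local directory, so download adapters/ first.
repo_dir = snapshot_download(
    "YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA", allow_patterns=["adapters/*"]
)
model, tokenizer = load(
    "mlx-community/Qwen3.5-0.8B-MLX-4bit",
    adapter_path=f"{repo_dir}/adapters",
)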

Files:

  • adapters/adapters.safetensors -- LoRA weights (~7 MB)
  • adapters/adapter_config.json -- LoRA hyperparams
  • adapters/lora_params.yaml -- training config snapshot

Limitations

  • Rewrite is harder than proofread. Combined EM is 19.4%, but rewrite EM is only 6.4% (vs. 32.4% for proofread). The high character error rate on rewrite (0.463) reflects that the model is meant to change wording substantially; semantic similarity stays at 0.860. Use Sim and a downstream judge, not EM, to evaluate rewrite quality.
  • Hard samples have low EM (4.6%) -- long, garbled, multi-topic STT output is the worst case. Consider falling back to a larger model when CER exceeds a threshold.
  • English only. Training data is English; behavior on other languages is undefined.
  • Output is expected to be wrapped in {"result": "..."}. Unwrap it with json.loads(response)["result"]. The JSON parse rate is 99.4%, so handle the remaining 0.6% with try/except plus a regex fallback.
  • Apple Silicon only for the MLX format. Use the adapter-only path with the upstream Qwen/Qwen3.5-0.8B weights if you need CUDA / CPU.
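
The try/except plus regex fallback mentioned in the list above could look like the sketch below. It is one possible approach, not the engine's implementation. `extract_result` is an illustrative name, and returning the caller-supplied fallback (for example the raw STT text) when nothing can be recovered is a design assumption.

```python
import json
import re

# Captures the string value after "result": even if the closing quote or brace is
# missing (truncated output) or extra text surrounds the JSON object.
_RESULT_RE = re.compile(r'"result"\s*:\s*"((?:[^"\\]|\\.)*)', re.DOTALL)


def extract_result(response: str, fallback: str) -> str:
    """Return the cleaned text, tolerating malformed model output."""
    # Happy path: the response is exactly one JSON object with a string "result".
    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict) and isinstance(parsed.get("result"), str):
            return parsed["result"]
    except json.JSONDecodeError:
        pass

    # Fallback: pull the value out with a regex, then decode JSON escapes if possible.
    match = _RESULT_RE.search(response)
    if match:
        try:
            return json.loads('"' + match.group(1) + '"')
        except json.JSONDecodeError:
            return match.group(1)

    # Nothing recoverable: hand back whatever the caller chose as a safe default.
    return fallback
```

Passing the original transcript as `fallback` means a parse failure degrades to "no touchup" rather than an exception or empty text.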

Citation

@misc{yooz-quality-v2-2026,
  title = {Yooz-Quality-v2-Qwen3.5-0.8B-LoRA: On-device speech-to-text touchup model},
  author = {Yooz Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA},
}

@misc{qwen3-2025,
  title = {Qwen3 Technical Report},
  author = {Qwen Team},
  year = {2025},
  publisher = {Alibaba},
}

Built on top of the Qwen3.5 base model from the Qwen team at Alibaba. LoRA fine-tune, dataset, and packaging by Yooz Labs.

License

Apache 2.0, matching the Qwen3.5 base model. See LICENSE upstream.

Sovereign Intelligence. Built for the skeptical.
