Yooz-Quality-v2-Qwen3.5-0.8B-LoRA
Speech-to-text touchup model from Yooz Labs.
Takes raw STT output and returns a cleaned-up version, either lightly proofread
or more aggressively rewritten, as a single JSON object: {"result": "..."}.
This is the second-generation Quality tier of the Yooz Engine touchup stack. It runs fully on-device on Apple Silicon via MLX, in the privacy-first Yooz tradition: no cloud, no logging, no leak.
What it is
- Task: post-process automatic speech recognition output. Two modes:
  - Proofread -- minimal edits: spelling, punctuation, capitalization, obvious word-choice errors. Filler words preserved.
  - Rewrite -- larger edits: removes filler, fixes grammar, splits run-on sentences, normalizes spoken numbers. Meaning preserved.
- Output contract: always a single JSON object of the form {"result": "<cleaned text>"}. No explanation, no markdown.
- Footprint: 4-bit MLX quantized weights, ~424 MB on disk.
Lineage
- Base: mlx-community/Qwen3.5-0.8B-MLX-4bit (Apache 2.0).
- Method: LoRA fine-tune via mlx-lm, then fused with mlx_lm.fuse into a standalone 4-bit checkpoint.
- Training data: yooz-touchup synthetic dataset, 5,936 samples covering:
  - 2 modes: proofread, rewrite
  - 4 domains: casual, technical, business, dictation
  - 3 difficulties: easy, medium, hard
- Adapter: rank-8 LoRA on q/k/v/o projections. 2,500 iters, AdamW.
- Why "v2": v1 of the Yooz Engine Quality tier was Qwen2.5-0.5B. v2 upgrades to Qwen3.5-0.8B with substantially better instruction adherence and strict JSON compliance. The single-model architecture decision lives in yooz-engine#74.
Evaluation
Benchmarked on the gold-standard test split of yooz-touchup (n=5,936),
combining both proofread and rewrite prompts.
| Metric | Value |
|---|---|
| Exact match (combined) | 19.4% |
| JSON parse rate | 99.4% |
| Avg CER | 0.263 |
| Avg semantic similarity | 0.902 |
| Avg latency (M-series, batch=1) | 311 ms |
By mode:
| Mode | EM | CER | Sim |
|---|---|---|---|
| Proofread | 32.4% | 0.063 | 0.944 |
| Rewrite | 6.4% | 0.463 | 0.860 |
By difficulty:
| Difficulty | n | EM | CER | Sim |
|---|---|---|---|---|
| easy | 2,234 | 35.7% | 0.286 | 0.918 |
| medium | 2,290 | 12.6% | 0.233 | 0.900 |
| hard | 1,412 | 4.6% | 0.275 | 0.881 |
By domain:
| Domain | n | EM | CER | Sim |
|---|---|---|---|---|
| casual | 706 | 41.4% | 0.401 | 0.910 |
| technical | 3,318 | 14.2% | 0.249 | 0.891 |
| business | 1,866 | 20.0% | 0.237 | 0.920 |
| dictation | 46 | 32.6% | 0.221 | 0.871 |
Latency was measured on Apple Silicon M-series with mlx-lm, batch size 1,
max_tokens=320. Full ablation in
yooz-engine#75.
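The card does not spell out how EM and CER were computed. As a rough guide to reading the tables, here is a minimal sketch assuming the conventional definitions: exact match is plain string equality, and CER is the character-level Levenshtein distance divided by the reference length. The function names and the absence of any text normalization are assumptions for illustration, not the Yooz evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Assumed CER definition: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed EM definition: strict string equality, no normalization."""
    return prediction == reference


print(cer("tomorow", "tomorrow"))          # one missing character out of eight
print(exact_match("9 AM.", "9 AM."))
```

Under these definitions a single differing character is enough to fail exact match, which is why EM drops quickly on longer or freer outputs while CER and similarity move more gradually.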
Prompt modes
Both modes use a system prompt + the raw STT transcription as the user message. Use these system prompts verbatim -- the model was trained on them and behavior degrades with paraphrases.
Proofread
You are a copy editor. Proofread the speech-to-text transcription provided by the user. Fix spelling, punctuation, capitalization, and obvious word-choice errors caused by the speech-to-text engine. Convert spoken numbers to digits where it improves clarity (e.g. "nine am" -> "9 AM"). Keep the speaker's voice; do NOT rephrase or remove filler words (um, uh, like). Output ONLY a single JSON object of the form {"result": "<corrected>"} where <corrected> is the actual corrected text. No explanation, no markdown.
Rewrite
You are an editor. Rewrite the speech-to-text transcription provided by the user for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler words (um, uh, like, you know, basically). Convert spoken numbers to digits. Split run-on sentences. Preserve the original meaning and intent. Output ONLY a single JSON object of the form {"result": "<rewritten>"} where <rewritten> is the actual rewritten text. No explanation, no markdown.
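Since both prompts need to be reproduced verbatim, it can help to keep them as constants and pick one by mode. The sketch below copies the two prompts from above; the names `SYSTEM_PROMPTS` and `build_messages` are hypothetical helpers introduced here for illustration and are not part of mlx-lm or Yooz Engine.

```python
PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    'improves clarity (e.g. "nine am" -> "9 AM"). '
    "Keep the speaker's voice; do NOT rephrase or remove filler words (um, uh, like). "
    'Output ONLY a single JSON object of the form {"result": "<corrected>"} where '
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

REWRITE_SYSTEM = (
    "You are an editor. Rewrite the speech-to-text transcription provided by the user "
    "for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler "
    "words (um, uh, like, you know, basically). Convert spoken numbers to digits. "
    "Split run-on sentences. Preserve the original meaning and intent. "
    'Output ONLY a single JSON object of the form {"result": "<rewritten>"} where '
    "<rewritten> is the actual rewritten text. No explanation, no markdown."
)

# Hypothetical lookup table: mode name -> verbatim system prompt.
SYSTEM_PROMPTS = {"proofread": PROOFREAD_SYSTEM, "rewrite": REWRITE_SYSTEM}


def build_messages(mode: str, raw_stt: str) -> list[dict]:
    """Return chat messages: the system prompt for `mode`, then the raw STT text."""
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SYSTEM_PROMPTS)}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": raw_stt},
    ]


messages = build_messages("rewrite", "um so like the meeting is at nine am")
```

The resulting list can be passed to `tokenizer.apply_chat_template` in the same way as the hand-built `messages` list in the usage section.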
Usage
Option A -- MLX-LM (Python, fused checkpoint)
import json

from mlx_lm import load, generate

# Load the fused 4-bit checkpoint from the Hub.
model, tokenizer = load("YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA")

# Proofread system prompt, verbatim from "Prompt modes".
PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    "improves clarity (e.g. \"nine am\" -> \"9 AM\"). Keep the speaker's voice; do NOT "
    "rephrase or remove filler words (um, uh, like). "
    "Output ONLY a single JSON object of the form {\"result\": \"<corrected>\"} where "
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

raw_stt = "um so the meeting is at nine am tomorow with sarah from product"
messages = [
    {"role": "system", "content": PROOFREAD_SYSTEM},
    {"role": "user", "content": raw_stt},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=320, verbose=False)

# response is a JSON string: {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}
cleaned = json.loads(response)["result"]
print(cleaned)
For rewrite mode, replace PROOFREAD_SYSTEM with the Rewrite system prompt given under "Prompt modes".
Option B -- Yooz Engine HTTP API
If you have the Yooz Engine menu
bar service running, this model is consumed via the local API on
localhost:19920:
curl -s http://localhost:19920/v1/touchup \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "um so the meeting is at nine am tomorow with sarah from product",
    "mode": "proofread"
  }'
# {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}
The engine handles model loading, batching, and JSON enforcement. See the engine's API docs.
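The same call can be made from Python with only the standard library. This is a minimal sketch based on the curl example above: the URL, the `text` and `mode` fields, and the `result` key are taken from it, while the helper names, the timeout, and the assumption that `"rewrite"` is accepted as a `mode` value are illustrative and not confirmed by the engine's API docs.

```python
import json
import urllib.request

TOUCHUP_URL = "http://localhost:19920/v1/touchup"


def build_touchup_request(text: str, mode: str = "proofread",
                          url: str = TOUCHUP_URL) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    # Assumption: the API accepts the same two mode names the model was trained on.
    if mode not in ("proofread", "rewrite"):
        raise ValueError(f"unsupported mode: {mode!r}")
    body = json.dumps({"text": text, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def touchup(text: str, mode: str = "proofread", timeout: float = 10.0) -> str:
    """Send the request to a locally running Yooz Engine and return the cleaned text."""
    request = build_touchup_request(text, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]


if __name__ == "__main__":
    # Requires the Yooz Engine service to be listening on localhost:19920.
    print(touchup("um so the meeting is at nine am tomorow with sarah from product"))
```

Splitting request construction from sending keeps the payload easy to inspect or unit-test without the service running.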
Option C -- Adapter-only (load base + adapter at runtime)
If you'd rather keep the base model unmodified and apply the LoRA adapter at
load time, the adapter weights live under adapters/ in this repo:
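from huggingface_hub import snapshot_download
from mlx_lm import load
# mlx-lm expects adapter_path to be a local directory, so fetch adapters/ from the Hub first.
repo_dir = snapshot_download("YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA", allow_patterns=["adapters/*"])
model, tokenizer = load(
    "mlx-community/Qwen3.5-0.8B-MLX-4bit",
    adapter_path=f"{repo_dir}/adapters",
)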
Files:
- adapters/adapters.safetensors -- LoRA weights (~7 MB)
- adapters/adapter_config.json -- LoRA hyperparams
- adapters/lora_params.yaml -- training config snapshot
Limitations
- Rewrite is harder than proofread. Combined EM is 19.4%, but rewrite EM is only 6.4% (vs 32.4% for proofread). The high character error rate on rewrite reflects that the model is meant to change wording substantially; semantic similarity stays at 0.86. Use Sim and a downstream judge, not EM, to evaluate rewrite quality.
- Hard samples have low EM (4.6%) -- long, garbled, multi-topic STT output is the worst case. Consider falling back to a larger model when CER exceeds a threshold.
- English only. Training data is English; behavior on other languages is undefined.
- Output is wrapped in {"result": "..."}. Unwrap it with json.loads(response)["result"]. JSON parse rate is 99.4%; handle the remaining 0.6% with try/except plus a regex fallback.
- Apple Silicon only for the MLX format. Use the adapter-only path with the upstream Qwen/Qwen3.5-0.8B weights if you need CUDA / CPU.
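The try/except plus regex fallback mentioned in the list above could look roughly like the following. This is a sketch under assumptions: `extract_result` is a hypothetical helper, and the choice to return the stripped raw response when nothing can be recovered is one possible policy rather than what Yooz Engine does.

```python
import json
import re

# Captures the string value after "result": and tolerates a missing closing
# quote, so truncated generations can still be partially recovered.
_RESULT_RE = re.compile(r'"result"\s*:\s*"((?:[^"\\]|\\.)*)', re.DOTALL)


def extract_result(response: str) -> str:
    """Return the cleaned text from a model response, with a regex fallback."""
    # Happy path: the response is exactly one JSON object with a string "result".
    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict) and isinstance(parsed.get("result"), str):
            return parsed["result"]
    except json.JSONDecodeError:
        pass

    # Fallback: pull the value out of surrounding chatter or truncated JSON.
    match = _RESULT_RE.search(response)
    if match:
        body = match.group(1)
        try:
            # Re-decode as a JSON string so escapes such as \" and \n are resolved.
            return json.loads(f'"{body}"')
        except json.JSONDecodeError:
            return body

    # Last resort (assumed policy): hand back the raw text, trimmed.
    return response.strip()


print(extract_result('{"result": "Um, so the meeting is at 9 AM tomorrow."}'))
print(extract_result('{"result": "The meeting is at 9 AM tomor'))
```

Callers that would rather fail loudly than accept a partial result can replace the last two return paths with an exception.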
Citation
@misc{yooz-quality-v2-2026,
title = {Yooz-Quality-v2-Qwen3.5-0.8B-LoRA: On-device speech-to-text touchup model},
author = {Yooz Labs},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA},
}
@misc{qwen3-2025,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
publisher = {Alibaba},
}
Built on top of the Qwen3.5 base model from the Qwen team at Alibaba. LoRA fine-tune, dataset, and packaging by Yooz Labs.
Contact & issues
- Questions / contact: dev@yooz.info
- Bug reports / feature requests: open an issue at github.com/yooz-labs/yooz-engine
License
Apache 2.0, matching the Qwen3.5 base model. See LICENSE upstream.
Links
- Yooz Engine -- the menu bar service that consumes this model.
- Tracking issue #74 -- single-model Quality tier decision.
- Latency ablation #75 -- max_tokens vs latency study.
- Base model -- mlx-community/Qwen3.5-0.8B-MLX-4bit.
Sovereign Intelligence. Built for the skeptical.