Yooz-Quality-v2-Qwen3.5-0.8B-LoRA

Speech-to-text touchup model from Yooz Labs. Takes raw STT output and returns a cleaned-up version, either lightly proofread or more aggressively rewritten, as a single JSON object: {"result": "..."}.

This is the second-generation Quality tier of the Yooz Engine touchup stack. It runs fully on-device on Apple Silicon via MLX, in the privacy-first Yooz tradition: no cloud, no logging, no leak.

What it is

Task: post-process automatic speech recognition output. Two modes:
- Proofread -- minimal edits: spelling, punctuation, capitalization, obvious word-choice errors. Filler words preserved.
- Rewrite -- larger edits: removes filler, fixes grammar, splits run-on sentences, normalizes spoken numbers. Meaning preserved.
Output contract: always a single JSON object of the form {"result": "<cleaned text>"}. No explanation, no markdown.
Footprint: 4-bit MLX quantized weights, ~424 MB on disk.

Lineage

Base: mlx-community/Qwen3.5-0.8B-MLX-4bit (Apache 2.0).
Method: LoRA fine-tune via mlx-lm, then fused with mlx_lm.fuse into a standalone 4-bit checkpoint.
Training data: yooz-touchup synthetic dataset, 5,936 samples covering:
- 2 modes: proofread, rewrite
- 4 domains: casual, technical, business, dictation
- 3 difficulties: easy, medium, hard
Adapter: rank-8 LoRA on the q/k/v/o attention projections, trained for 2,500 iterations with AdamW.
Why "v2": v1 of the Yooz Engine Quality tier was Qwen2.5-0.5B. v2 upgrades to Qwen3.5-0.8B with substantially better instruction adherence and strict JSON compliance. The single-model architecture decision lives in yooz-engine#74.

Evaluation

Benchmarked on the gold-standard test split of yooz-touchup (n=5,936), combining both proofread and rewrite prompts.

Metric	Value
Exact match (combined)	19.4%
JSON parse rate	99.4%
Avg CER	0.263
Avg semantic similarity	0.902
Avg latency (M-series, batch=1)	311 ms

By mode:

Mode	EM	CER	Sim
Proofread	32.4%	0.063	0.944
Rewrite	6.4%	0.463	0.860

By difficulty:

Difficulty	n	EM	CER	Sim
easy	2,234	35.7%	0.286	0.918
medium	2,290	12.6%	0.233	0.900
hard	1,412	4.6%	0.275	0.881

By domain:

Domain	n	EM	CER	Sim
casual	706	41.4%	0.401	0.910
technical	3,318	14.2%	0.249	0.891
business	1,866	20.0%	0.237	0.920
dictation	46	32.6%	0.221	0.871
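As a quick consistency check on the tables above, the per-slice exact-match figures, weighted by their sample counts, should reproduce the combined 19.4%. The sketch below uses only numbers from this card; the names `DIFFICULTY`, `DOMAIN`, and `weighted_mean` are illustrative and not part of any Yooz tooling.

```python
# Per-slice (n, EM %) pairs copied from the tables above.
DIFFICULTY = {"easy": (2234, 35.7), "medium": (2290, 12.6), "hard": (1412, 4.6)}
DOMAIN = {
    "casual": (706, 41.4),
    "technical": (3318, 14.2),
    "business": (1866, 20.0),
    "dictation": (46, 32.6),
}


def weighted_mean(slices: dict[str, tuple[int, float]]) -> float:
    """Average the slice values, weighting each slice by its sample count."""
    total = sum(n for n, _ in slices.values())
    return sum(n * value for n, value in slices.values()) / total


if __name__ == "__main__":
    for name, slices in (("difficulty", DIFFICULTY), ("domain", DOMAIN)):
        n = sum(count for count, _ in slices.values())
        print(f"{name}: n={n}, weighted EM={weighted_mean(slices):.1f}%")
```

Both breakdowns sum to n=5,936 and round to the combined EM of 19.4%. Note how small the dictation slice is (46 samples), so its per-domain figures carry more noise than the others.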

Latency was measured on Apple Silicon M-series with mlx-lm, batch size 1, max_tokens=320. Full ablation in yooz-engine#75.
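For readers who want to reproduce the EM and CER columns on their own data, here is a minimal sketch of both metrics. It assumes CER is the character-level Levenshtein distance divided by the reference length and that exact match compares whitespace-stripped strings; this card does not state the exact normalization used in the benchmark, so treat these definitions and the function names as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length (assumed)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypothesis: str, reference: str) -> bool:
    """Exact match after stripping surrounding whitespace (assumed normalization)."""
    return hypothesis.strip() == reference.strip()


if __name__ == "__main__":
    print(cer("tomorow", "tomorrow"))  # one missing character out of eight
```

Under these definitions a rewrite that is faithful but worded differently from the reference still gets a high CER, which is why the rewrite rows look worse on CER than on semantic similarity.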

Prompt modes

Both modes send a system prompt plus the raw STT transcription as the user message. Use these system prompts verbatim -- the model was trained on them, and behavior degrades with paraphrases.

Proofread

You are a copy editor. Proofread the speech-to-text transcription provided by the user. Fix spelling, punctuation, capitalization, and obvious word-choice errors caused by the speech-to-text engine. Convert spoken numbers to digits where it improves clarity (e.g. "nine am" -> "9 AM"). Keep the speaker's voice; do NOT rephrase or remove filler words (um, uh, like). Output ONLY a single JSON object of the form {"result": "<corrected>"} where <corrected> is the actual corrected text. No explanation, no markdown.

Rewrite

You are an editor. Rewrite the speech-to-text transcription provided by the user for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler words (um, uh, like, you know, basically). Convert spoken numbers to digits. Split run-on sentences. Preserve the original meaning and intent. Output ONLY a single JSON object of the form {"result": "<rewritten>"} where <rewritten> is the actual rewritten text. No explanation, no markdown.
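One way to keep both prompts verbatim and switch between them is a small lookup keyed by mode. The prompt strings below are copied from this section; `SYSTEM_PROMPTS` and `build_messages` are hypothetical helper names, not part of mlx-lm or the Yooz Engine.

```python
PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    "improves clarity (e.g. \"nine am\" -> \"9 AM\"). Keep the speaker's voice; do NOT "
    "rephrase or remove filler words (um, uh, like). "
    "Output ONLY a single JSON object of the form {\"result\": \"<corrected>\"} where "
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

REWRITE_SYSTEM = (
    "You are an editor. Rewrite the speech-to-text transcription provided by the user "
    "for clarity and readability. Fix grammar, spelling, and punctuation. Remove filler "
    "words (um, uh, like, you know, basically). Convert spoken numbers to digits. "
    "Split run-on sentences. Preserve the original meaning and intent. "
    "Output ONLY a single JSON object of the form {\"result\": \"<rewritten>\"} where "
    "<rewritten> is the actual rewritten text. No explanation, no markdown."
)

# Hypothetical lookup table: mode name -> verbatim system prompt.
SYSTEM_PROMPTS = {"proofread": PROOFREAD_SYSTEM, "rewrite": REWRITE_SYSTEM}


def build_messages(mode: str, raw_stt: str) -> list[dict[str, str]]:
    """Return chat messages for the given mode, with the raw STT text as the user turn."""
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SYSTEM_PROMPTS)}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": raw_stt},
    ]
```

The returned list has the same shape as the `messages` passed to `tokenizer.apply_chat_template` in the Usage section.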

Usage

Option A -- MLX-LM (Python, fused checkpoint)

from mlx_lm import load, generate

model, tokenizer = load("YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA")

PROOFREAD_SYSTEM = (
    "You are a copy editor. Proofread the speech-to-text transcription provided by the "
    "user. Fix spelling, punctuation, capitalization, and obvious word-choice errors "
    "caused by the speech-to-text engine. Convert spoken numbers to digits where it "
    "improves clarity (e.g. \"nine am\" -> \"9 AM\"). Keep the speaker's voice; do NOT "
    "rephrase or remove filler words (um, uh, like). "
    "Output ONLY a single JSON object of the form {\"result\": \"<corrected>\"} where "
    "<corrected> is the actual corrected text. No explanation, no markdown."
)

raw_stt = "um so the meeting is at nine am tomorow with sarah from product"

messages = [
    {"role": "system", "content": PROOFREAD_SYSTEM},
    {"role": "user", "content": raw_stt},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=320, verbose=False)
# response is a JSON string, e.g.:
# {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}

import json

# Unwrap the {"result": ...} envelope to get the cleaned text.
cleaned = json.loads(response)["result"]
print(cleaned)

For rewrite mode, use the Rewrite system prompt from the Prompt modes section above in place of PROOFREAD_SYSTEM.

Option B -- Yooz Engine HTTP API

If you have the Yooz Engine menu bar service running, you can call this model through the local API at localhost:19920:

curl -s http://localhost:19920/v1/touchup \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "um so the meeting is at nine am tomorow with sarah from product",
    "mode": "proofread"
  }'
# {"result": "Um, so the meeting is at 9 AM tomorrow with Sarah from Product."}

The engine handles model loading, batching, and JSON enforcement. See the engine's API docs.
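The same call can be made from Python with only the standard library. This is a sketch under assumptions: the URL and request fields mirror the curl example above, the response is assumed to be the `{"result": ...}` object shown there, and `"rewrite"` is assumed to be accepted as a `mode` value. The function names are hypothetical.

```python
import json
import urllib.request

TOUCHUP_URL = "http://localhost:19920/v1/touchup"  # endpoint from the curl example


def build_touchup_request(
    text: str, mode: str = "proofread", url: str = TOUCHUP_URL
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"text": text, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def touchup(text: str, mode: str = "proofread", timeout: float = 10.0) -> str:
    """Send the text to a running Yooz Engine and return the cleaned string."""
    request = build_touchup_request(text, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["result"]  # assumes the {"result": ...} response shape
```

With the menu bar service running, `touchup("um so the meeting is at nine am tomorow", mode="proofread")` would return the cleaned text; without it, `urlopen` raises `urllib.error.URLError`.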

Option C -- Adapter-only (load base + adapter at runtime)

If you'd rather keep the base model unmodified and apply the LoRA adapter at load time, the adapter weights live under adapters/ in this repo:

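from huggingface_hub import snapshot_download
from mlx_lm import load

# mlx-lm expects adapter_path to be a local directory, so download adapters/ first.
repo_dir = snapshot_download(
    "YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA", allow_patterns=["adapters/*"]
)
model, tokenizer = load(
    "mlx-community/Qwen3.5-0.8B-MLX-4bit",
    adapter_path=f"{repo_dir}/adapters",
)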

Files:

adapters/adapters.safetensors -- LoRA weights (~7 MB)
adapters/adapter_config.json -- LoRA hyperparams
adapters/lora_params.yaml -- training config snapshot

Limitations

Rewrite is harder than proofread. Combined EM is 19.4%, but rewrite EM is only 6.4% (vs. 32.4% for proofread). The high character error rate on rewrite (0.463) reflects that the model is meant to change wording substantially; semantic similarity stays at 0.860. Use Sim and a downstream judge, not EM, to evaluate rewrite quality.
Hard samples have low EM (4.6%) -- long, garbled, multi-topic STT output is the worst case. Consider falling back to a larger model when CER exceeds a threshold.
English only. Training data is English; behavior on other languages is undefined.
Output is wrapped in {"result":"..."}. Unwrap it with json.loads(response)["result"]. The JSON parse rate is 99.4%, so handle the remaining 0.6% with try/except plus a regex fallback.
Apple Silicon only for the MLX format. Use the adapter-only path with the upstream Qwen/Qwen3.5-0.8B weights if you need CUDA / CPU.
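The try/except plus regex fallback mentioned in the limitations above could look like the sketch below. It is one possible approach, not the engine's actual JSON enforcement: `extract_result` is a hypothetical helper, and returning the caller-supplied `fallback` (for example the raw STT text) when nothing can be recovered is an assumed policy.

```python
import json
import re

# Matches "result": "<string>" while allowing escaped characters inside the string.
_RESULT_RE = re.compile(r'"result"\s*:\s*"((?:[^"\\]|\\.)*)"', re.DOTALL)


def extract_result(response: str, fallback: str) -> str:
    """Return the cleaned text from a model response, or `fallback` if unrecoverable."""
    # Happy path: the response is exactly one JSON object with a string "result".
    try:
        obj = json.loads(response)
        if isinstance(obj, dict) and isinstance(obj.get("result"), str):
            return obj["result"]
    except json.JSONDecodeError:
        pass

    # Fallback: pull the "result" string out of surrounding chatter or stray markup.
    match = _RESULT_RE.search(response)
    if match:
        try:
            return json.loads('"' + match.group(1) + '"')  # decode JSON escapes
        except json.JSONDecodeError:
            return match.group(1)

    # Nothing usable (e.g. output truncated before the closing quote).
    return fallback
```

Passing the original transcription as `fallback` means a parse failure degrades to "no touchup" rather than an exception in the caller.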

Citation

@misc{yooz-quality-v2-2026,
  title = {Yooz-Quality-v2-Qwen3.5-0.8B-LoRA: On-device speech-to-text touchup model},
  author = {Yooz Labs},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/YoozLabs/Yooz-Quality-v2-Qwen3.5-0.8B-LoRA},
}

@misc{qwen3-2025,
  title = {Qwen3 Technical Report},
  author = {Qwen Team},
  year = {2025},
  publisher = {Alibaba},
}

Built on top of the Qwen3.5 base model from the Qwen team at Alibaba. LoRA fine-tune, dataset, and packaging by Yooz Labs.

Contact & issues

Questions / contact: dev@yooz.info
Bug reports / feature requests: open an issue at github.com/yooz-labs/yooz-engine

License

Apache 2.0, matching the Qwen3.5 base model. See LICENSE upstream.
