Instructions to use annnnnnnd/Qwen3.6-27B-Reflect with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use annnnnnnd/Qwen3.6-27B-Reflect with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="annnnnnnd/Qwen3.6-27B-Reflect",
	filename="Qwen3.6-27b-Reflect-Q6_K.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use annnnnnnd/Qwen3.6-27B-Reflect with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
# Run inference directly in the terminal:
llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
# Run inference directly in the terminal:
llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
# Run inference directly in the terminal:
./llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
# Run inference directly in the terminal:
./build/bin/llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Use Docker

docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

LM Studio
Jan
Ollama
How to use annnnnnnd/Qwen3.6-27B-Reflect with Ollama:
```
ollama run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
```

Unsloth Studio

How to use annnnnnnd/Qwen3.6-27B-Reflect with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting

How to use annnnnnnd/Qwen3.6-27B-Reflect with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "annnnnnnd/Qwen3.6-27B-Reflect:Q6_K"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use annnnnnnd/Qwen3.6-27B-Reflect with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use annnnnnnd/Qwen3.6-27B-Reflect with Docker Model Runner:
```
docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
```

Lemonade

How to use annnnnnnd/Qwen3.6-27B-Reflect with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull annnnnnnd/Qwen3.6-27B-Reflect:Q6_K

Run and chat with the model

lemonade run user.Qwen3.6-27B-Reflect-Q6_K

List all available models

lemonade list

Qwen3.6-27B-Reflect

A fine-tuned Qwen3.6-27B focused on anti-sycophancy and an honest, self-correcting reasoning voice — built on aggressive dataset curation rather than data volume.

What is Reflect?

Reflect is a fine-tuned family built on the principle that less training data, better curated, produces a cleaner voice without degrading capability. Rather than tens of thousands of examples, Reflect uses 1,400 aggressively cleaned examples to reshape how the model reasons and talks.

The name describes what it does: it reflects and reconsiders rather than performing confidence.

What the training actually changed

Across four independent test sets, the consistent effect of the SFT+DPO recipe is best summed up as say less, think more — it compresses the answer channel and expands the reasoning channel, without altering underlying capability:

Less filler (no-think mode): same accuracy as base, ~12% fewer output tokens — more concise worked solutions, not shorter answers. (GPQA-Diamond, full 198: 1,645 vs 1,880 mean tokens, 71.7% vs 72.2%.)
Thinks more (thinking mode): reasons ~2.1x longer than base on hard recovery questions and uses it to recover more previously-failed problems (+6.0 points on a 215-question shared-failure set).
Finishes instead of timing out. On hard GPQA-diamond items where the base model spirals into its context limit, Reflect runs its reasoning to a clean stop (10-question hard probe: base hit the token cap once and never answered; Reflect zero times).
Capability-neutral on hard science. Matches base Qwen3.6-27B on GPQA-Diamond in both regimes (think 89.4 vs 87.8; no-think 71.7 vs 72.2) — the voice fine-tune did not damage the underlying weights.
Tradeoff: it can overthink easy items, and pays a small format-following tax (see IFEval below).
Preserved tool use: native Qwen tool-calling, function calling, and structured output retained.

Note on an earlier version of this card: a prior release claimed "3x token efficiency / shorter reasoning." That number came from a contaminated run in which Reflect was served in **no-think** mode while the base model used full thinking — so it was comparing no-think against think, not the two models fairly. Run correctly with thinking enabled for both, Reflect reasons **longer**. The corrected results are below.

Training Methodology

SFT (Supervised Fine-Tuning)

Dataset: 1,400 curated examples
LoRA: r32 / a32 (1:1 alpha-to-rank)
Learning rate: 1e-4
Epochs: 1
Precision: Q4 (forces reconstruction)

DPO (Direct Preference Optimization)

1,400 preference pairs
LoRA: r16
Learning rate: 1e-6
Beta: 0.1
Epochs: 1
Method: voice distillation — model's own outputs as rejected, curated outputs as chosen.

Benchmarks

1. No-think accuracy (full sets, thinking disabled for both)

A clean no-think A/B to gauge base-weight similarity.

Benchmark	N	Base Qwen3.6	Reflect	Delta
MMLU	1000	87.40%	87.60%	+0.20%
GSM8K	400	96.25%	96.75%	+0.50%
HumanEval	164	93.29%	92.07%	-1.22%
IFEval	192	81.25%	77.08%	-4.17%
ARC Challenge	400	96.75%	96.25%	-0.50%
TruthfulQA	200	89.50%	87.50%	-2.00%
Average		90.74%	89.54%	-1.20%

Reading this honestly:

MMLU/GSM8K deltas (+0.2 / +0.5) are within noise at these sample sizes — not evidence of a knowledge gain.
HumanEval, ARC, TruthfulQA: within noise; no catastrophic forgetting.
IFEval -4.0% (78% -> 74%; 14 regressions vs 6 gains) is a voice-vs-format tradeoff Transcript analysis shows the failures cluster on multi-constraint prompts (avg 1.79 instructions vs 1.53 overall) and on purely mechanical format rules — exact sentence counts, all-lowercase, mandatory keywords/placeholders, capital-word frequency. The model is not refusing instructions; it answers fluently and its distilled voice rounds off the rigid sub-constraint (e.g. drops a required literal keyword, uses an unnumbered list, changes sentence granularity). It is not truncation — responses are full-length, often longer than base. This is a predictable cost of voice distillation and shrinks if format-adherence examples are added to the SFT/DPO mix.

2. Thinking-mode recovery (corrected — thinking enabled for both)

Both models retested on the 215 questions both failed in the no-think run. 3 samples/question, identical settings, both emitting real <think> traces.

Benchmark	N	Base pass@3	Reflect pass@3	Base think (chars)	Reflect think (chars)
MMLU	138	46.4%	55.1%	2,670	6,977
GSM8K	18	61.1%	61.1%	5,165	5,994
ARC Challenge	16	50.0%	43.8%	2,066	6,247
TruthfulQA	28	46.4%	46.4%	1,114	5,161
HumanEval	15	80.0%	93.3%	9,806	6,209
Overall	215	50.2%	56.3%	3,129	6,550

Reading this honestly:

Reflect recovers +6.0 points overall (~13 more questions of 215) while thinking 2.09x longer. The gain is real; the mechanism is more reasoning, not less.
MMLU (+8.7, N=138) is the trustworthy result.
HumanEval: near-tie (Reflect 93% vs base 80%, ~2 questions at N=15). An earlier run showed a +33 gap; that was a trace-capture bug that recorded base's thinking as ~0 chars and under-counted its passes. With capture fixed, base actually thinks more than Reflect here (9,806 vs 6,209 chars). Both sides need a same-harness rerun before any HumanEval claim — currently a wash.
GSM8K / TruthfulQA: ties. (The GSM8K shared-failure pool also contains several mislabeled gold answers, so its "tie" is low-confidence.)
ARC (-6.25, N=16): the only regression, smallest sample. A mild option-position bias under low-sample MCQ; footnote, not a trend.

3. GPQA-Diamond (full set, N=198) — capability preservation + concision

Same harness, both models, both regimes.

Mode	Base Qwen3.6	Reflect	Delta
Thinking on	87.8%*	89.4%	+1.6 (noise)
Thinking off	72.2%	71.7%	-0.5 (noise)

*Base thinking-on is Qwen's published figure (their scaffold); 89.4 is on this harness — cross-scaffold, so treat the think-on delta as indicative only. The no-think row is a clean same-harness A/B.

Findings:

Capability-neutral. Reflect matches base in both regimes. Critically, the no-think match (71.7 vs 72.2) shows the science knowledge is intact in the weights — Reflect is not leaning on extra thinking to paper over fine-tune damage. This is the result you want from a voice fine-tune: voice changed, capability untouched.
Says less, no-think (real, same-harness token counts): matched accuracy at ~12% fewer output tokens (Reflect mean 1,645 vs base 1,880; 325,789 vs 372,208 total over 198 generations). On the 126 questions both answered correctly, Reflect was the shorter response 71% of the time — consistent concision, not an outlier effect. This is "more concise worked solution," not "terse answer" — both models still write full ~1,200-token solutions before the boxed answer.
Thinking buys ~+16 points on this set for both models (72 -> 88), so any GPQA-Diamond figure should be quoted with its thinking budget. Reflect's correct-answer thinking on hard items runs into the tens of thousands of tokens.

Truncation note: base hit the 4,096-token output cap more often than Reflect (42 vs 29 of 198) — itself consistent with base being more verbose. This mildly widens the measured token gap (some base answers were clipped, not freely longer), so the true no-think concision delta is likely a few points under 12%. All 198 still parsed a valid letter, so accuracy is unaffected.

4. GPQA-Diamond stamina probe (hard subset, N=10)

	Correct	Avg thinking (chars)	Avg "wait" count
Base Qwen3.6	6/10	42,766	33
Reflect	6/10	67,615	44

Same accuracy; Reflect thinks ~58% longer. On base's worst spiral (109k-char wrong answer) Reflect ground to ~220k chars and got it right; on the item where base hit its context cap and never answered, Reflect resolved cleanly. The difference is fewer runaway-reasoning failures, not higher accuracy.

Summary

Reflect is a "say less, think more" edit — and capability-neutral underneath:

No-think: same accuracy, ~12% more concise output (compresses the answer channel).
Thinking: reasons ~2x longer, recovers more hard previously-failed questions, exhausts its context far less often (expands the reasoning channel).
Capability: unchanged on GPQA-Diamond in both regimes — the fine-tune reshaped behavior without touching the weights' knowledge.
Costs: higher token spend in thinking mode, occasional overthinking of easy items, and a small IFEval format-compliance tradeoff (~-4 pts).

The Reflect Family

Model	Base	Status
Reflect 27B	Qwen3.6-27B	✅ Released
Reflect 9B	Qwen3.5-9B	Coming soon
Reflect 4B	Qwen3.5-4B	Coming soon

Recommended Settings

Temperature: 0.6–0.7
Context: up to 262K tokens
Quantization: Q6_K
System prompt: (add yours here — left blank intentionally; the model was not trained against a fixed system prompt)

Technical Details

Base model: Qwen/Qwen3.6-27B
Architecture: dense transformer, 27B
Format: GGUF Q6_K (~22GB)
Training hardware: RTX Pro 6000
Training framework: Unsloth

About

Curation over volume: 1,400 carefully chosen examples reshape voice and reasoning behavior without collapsing capability. The honest finding across four test sets: Reflect says less and thinks more — leaner output at matched accuracy, longer reasoning where it pays, and underlying capability left intact.

License

Apache 2.0 / Qwen license (same as base).

Model tree for annnnnnnd/Qwen3.6-27B-Reflect

Base model

Qwen/Qwen3.6-27B

Quantized

(471)

this model

annnnnnnd
/

Qwen3.6-27B-Reflect