Instructions to use annnnnnnd/Qwen3.6-27B-Reflect with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use annnnnnnd/Qwen3.6-27B-Reflect with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="annnnnnnd/Qwen3.6-27B-Reflect", filename="Qwen3.6-27b-Reflect-Q6_K.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use annnnnnnd/Qwen3.6-27B-Reflect with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: ./llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: ./build/bin/llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Use Docker
docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- LM Studio
- Jan
- Ollama
How to use annnnnnnd/Qwen3.6-27B-Reflect with Ollama:
ollama run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- Unsloth Studio
How to use annnnnnnd/Qwen3.6-27B-Reflect with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
- Pi
How to use annnnnnnd/Qwen3.6-27B-Reflect with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "annnnnnnd/Qwen3.6-27B-Reflect:Q6_K" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use annnnnnnd/Qwen3.6-27B-Reflect with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use annnnnnnd/Qwen3.6-27B-Reflect with Docker Model Runner:
docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- Lemonade
How to use annnnnnnd/Qwen3.6-27B-Reflect with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Run and chat with the model
lemonade run user.Qwen3.6-27B-Reflect-Q6_K
List all available models
lemonade list
Qwen3.6-27B-Reflect
A fine-tuned Qwen3.6-27B focused on anti-sycophancy and an honest, self-correcting reasoning voice โ built on aggressive dataset curation rather than data volume.
What is Reflect?
Reflect is a fine-tuned family built on the principle that less training data, better curated, produces a cleaner voice without degrading capability. Rather than tens of thousands of examples, Reflect uses 1,400 aggressively cleaned examples to reshape how the model reasons and talks.
The name describes what it does: it reflects and reconsiders rather than performing confidence.
What the training actually changed
Across four independent test sets, the consistent effect of the SFT+DPO recipe is best summed up as say less, think more โ it compresses the answer channel and expands the reasoning channel, without altering underlying capability:
- Less filler (no-think mode): same accuracy as base, ~12% fewer output tokens โ more concise worked solutions, not shorter answers. (GPQA-Diamond, full 198: 1,645 vs 1,880 mean tokens, 71.7% vs 72.2%.)
- Thinks more (thinking mode): reasons ~2.1x longer than base on hard recovery questions and uses it to recover more previously-failed problems (+6.0 points on a 215-question shared-failure set).
- Finishes instead of timing out. On hard GPQA-diamond items where the base model spirals into its context limit, Reflect runs its reasoning to a clean stop (10-question hard probe: base hit the token cap once and never answered; Reflect zero times).
- Capability-neutral on hard science. Matches base Qwen3.6-27B on GPQA-Diamond in both regimes (think 89.4 vs 87.8; no-think 71.7 vs 72.2) โ the voice fine-tune did not damage the underlying weights.
- Tradeoff: it can overthink easy items, and pays a small format-following tax (see IFEval below).
- Preserved tool use: native Qwen tool-calling, function calling, and structured output retained.
Note on an earlier version of this card: a prior release claimed "3x token efficiency / shorter reasoning." That number came from a contaminated run in which Reflect was served in **no-think** mode while the base model used full thinking โ so it was comparing no-think against think, not the two models fairly. Run correctly with thinking enabled for both, Reflect reasons **longer**. The corrected results are below.
Training Methodology
SFT (Supervised Fine-Tuning)
- Dataset: 1,400 curated examples
- LoRA: r32 / a32 (1:1 alpha-to-rank)
- Learning rate: 1e-4
- Epochs: 1
- Precision: Q4 (forces reconstruction)
DPO (Direct Preference Optimization)
- 1,400 preference pairs
- LoRA: r16
- Learning rate: 1e-6
- Beta: 0.1
- Epochs: 1
- Method: voice distillation โ model's own outputs as rejected, curated outputs as chosen.
Benchmarks
1. No-think accuracy (full sets, thinking disabled for both)
A clean no-think A/B to gauge base-weight similarity.
| Benchmark | N | Base Qwen3.6 | Reflect | Delta |
|---|---|---|---|---|
| MMLU | 1000 | 87.40% | 87.60% | +0.20% |
| GSM8K | 400 | 96.25% | 96.75% | +0.50% |
| HumanEval | 164 | 93.29% | 92.07% | -1.22% |
| IFEval | 192 | 81.25% | 77.08% | -4.17% |
| ARC Challenge | 400 | 96.75% | 96.25% | -0.50% |
| TruthfulQA | 200 | 89.50% | 87.50% | -2.00% |
| Average | 90.74% | 89.54% | -1.20% |
Reading this honestly:
- MMLU/GSM8K deltas (+0.2 / +0.5) are within noise at these sample sizes โ not evidence of a knowledge gain.
- HumanEval, ARC, TruthfulQA: within noise; no catastrophic forgetting.
- IFEval -4.0% (78% -> 74%; 14 regressions vs 6 gains) is a voice-vs-format tradeoff Transcript analysis shows the failures cluster on multi-constraint prompts (avg 1.79 instructions vs 1.53 overall) and on purely mechanical format rules โ exact sentence counts, all-lowercase, mandatory keywords/placeholders, capital-word frequency. The model is not refusing instructions; it answers fluently and its distilled voice rounds off the rigid sub-constraint (e.g. drops a required literal keyword, uses an unnumbered list, changes sentence granularity). It is not truncation โ responses are full-length, often longer than base. This is a predictable cost of voice distillation and shrinks if format-adherence examples are added to the SFT/DPO mix.
2. Thinking-mode recovery (corrected โ thinking enabled for both)
Both models retested on the 215 questions both failed in the no-think run. 3 samples/question, identical settings, both emitting real <think> traces.
| Benchmark | N | Base pass@3 | Reflect pass@3 | Base think (chars) | Reflect think (chars) |
|---|---|---|---|---|---|
| MMLU | 138 | 46.4% | 55.1% | 2,670 | 6,977 |
| GSM8K | 18 | 61.1% | 61.1% | 5,165 | 5,994 |
| ARC Challenge | 16 | 50.0% | 43.8% | 2,066 | 6,247 |
| TruthfulQA | 28 | 46.4% | 46.4% | 1,114 | 5,161 |
| HumanEval | 15 | 80.0% | 93.3% | 9,806 | 6,209 |
| Overall | 215 | 50.2% | 56.3% | 3,129 | 6,550 |
Reading this honestly:
- Reflect recovers +6.0 points overall (~13 more questions of 215) while thinking 2.09x longer. The gain is real; the mechanism is more reasoning, not less.
- MMLU (+8.7, N=138) is the trustworthy result.
- HumanEval: near-tie (Reflect 93% vs base 80%, ~2 questions at N=15). An earlier run showed a +33 gap; that was a trace-capture bug that recorded base's thinking as ~0 chars and under-counted its passes. With capture fixed, base actually thinks more than Reflect here (9,806 vs 6,209 chars). Both sides need a same-harness rerun before any HumanEval claim โ currently a wash.
- GSM8K / TruthfulQA: ties. (The GSM8K shared-failure pool also contains several mislabeled gold answers, so its "tie" is low-confidence.)
- ARC (-6.25, N=16): the only regression, smallest sample. A mild option-position bias under low-sample MCQ; footnote, not a trend.
3. GPQA-Diamond (full set, N=198) โ capability preservation + concision
Same harness, both models, both regimes.
| Mode | Base Qwen3.6 | Reflect | Delta |
|---|---|---|---|
| Thinking on | 87.8%* | 89.4% | +1.6 (noise) |
| Thinking off | 72.2% | 71.7% | -0.5 (noise) |
*Base thinking-on is Qwen's published figure (their scaffold); 89.4 is on this harness โ cross-scaffold, so treat the think-on delta as indicative only. The no-think row is a clean same-harness A/B.
Findings:
- Capability-neutral. Reflect matches base in both regimes. Critically, the no-think match (71.7 vs 72.2) shows the science knowledge is intact in the weights โ Reflect is not leaning on extra thinking to paper over fine-tune damage. This is the result you want from a voice fine-tune: voice changed, capability untouched.
- Says less, no-think (real, same-harness token counts): matched accuracy at ~12% fewer output tokens (Reflect mean 1,645 vs base 1,880; 325,789 vs 372,208 total over 198 generations). On the 126 questions both answered correctly, Reflect was the shorter response 71% of the time โ consistent concision, not an outlier effect. This is "more concise worked solution," not "terse answer" โ both models still write full ~1,200-token solutions before the boxed answer.
- Thinking buys ~+16 points on this set for both models (72 -> 88), so any GPQA-Diamond figure should be quoted with its thinking budget. Reflect's correct-answer thinking on hard items runs into the tens of thousands of tokens.
Truncation note: base hit the 4,096-token output cap more often than Reflect (42 vs 29 of 198) โ itself consistent with base being more verbose. This mildly widens the measured token gap (some base answers were clipped, not freely longer), so the true no-think concision delta is likely a few points under 12%. All 198 still parsed a valid letter, so accuracy is unaffected.
4. GPQA-Diamond stamina probe (hard subset, N=10)
| Correct | Avg thinking (chars) | Avg "wait" count | |
|---|---|---|---|
| Base Qwen3.6 | 6/10 | 42,766 | 33 |
| Reflect | 6/10 | 67,615 | 44 |
Same accuracy; Reflect thinks ~58% longer. On base's worst spiral (109k-char wrong answer) Reflect ground to ~220k chars and got it right; on the item where base hit its context cap and never answered, Reflect resolved cleanly. The difference is fewer runaway-reasoning failures, not higher accuracy.
Summary
Reflect is a "say less, think more" edit โ and capability-neutral underneath:
- No-think: same accuracy, ~12% more concise output (compresses the answer channel).
- Thinking: reasons ~2x longer, recovers more hard previously-failed questions, exhausts its context far less often (expands the reasoning channel).
- Capability: unchanged on GPQA-Diamond in both regimes โ the fine-tune reshaped behavior without touching the weights' knowledge.
- Costs: higher token spend in thinking mode, occasional overthinking of easy items, and a small IFEval format-compliance tradeoff (~-4 pts).
The Reflect Family
| Model | Base | Status |
|---|---|---|
| Reflect 27B | Qwen3.6-27B | โ Released |
| Reflect 9B | Qwen3.5-9B | Coming soon |
| Reflect 4B | Qwen3.5-4B | Coming soon |
Recommended Settings
- Temperature: 0.6โ0.7
- Context: up to 262K tokens
- Quantization: Q6_K
- System prompt: (add yours here โ left blank intentionally; the model was not trained against a fixed system prompt)
Technical Details
- Base model: Qwen/Qwen3.6-27B
- Architecture: dense transformer, 27B
- Format: GGUF Q6_K (~22GB)
- Training hardware: RTX Pro 6000
- Training framework: Unsloth
About
Curation over volume: 1,400 carefully chosen examples reshape voice and reasoning behavior without collapsing capability. The honest finding across four test sets: Reflect says less and thinks more โ leaner output at matched accuracy, longer reasoning where it pays, and underlying capability left intact.
License
Apache 2.0 / Qwen license (same as base).
Links
- Base model: Qwen/Qwen3.6-27B
- Downloads last month
- 90
6-bit
Model tree for annnnnnnd/Qwen3.6-27B-Reflect
Base model
Qwen/Qwen3.6-27B