Qwen3.6-27B-Reflect

A fine-tuned Qwen3.6-27B focused on anti-sycophancy and an honest, self-correcting reasoning voice โ€” built on aggressive dataset curation rather than data volume.

What is Reflect?

Reflect is a fine-tuned family built on the principle that less training data, better curated, produces a cleaner voice without degrading capability. Rather than tens of thousands of examples, Reflect uses 1,400 aggressively cleaned examples to reshape how the model reasons and talks.

The name describes what it does: it reflects and reconsiders rather than performing confidence.

What the training actually changed

Across four independent test sets, the consistent effect of the SFT+DPO recipe is best summed up as say less, think more โ€” it compresses the answer channel and expands the reasoning channel, without altering underlying capability:

  • Less filler (no-think mode): same accuracy as base, ~12% fewer output tokens โ€” more concise worked solutions, not shorter answers. (GPQA-Diamond, full 198: 1,645 vs 1,880 mean tokens, 71.7% vs 72.2%.)
  • Thinks more (thinking mode): reasons ~2.1x longer than base on hard recovery questions and uses it to recover more previously-failed problems (+6.0 points on a 215-question shared-failure set).
  • Finishes instead of timing out. On hard GPQA-diamond items where the base model spirals into its context limit, Reflect runs its reasoning to a clean stop (10-question hard probe: base hit the token cap once and never answered; Reflect zero times).
  • Capability-neutral on hard science. Matches base Qwen3.6-27B on GPQA-Diamond in both regimes (think 89.4 vs 87.8; no-think 71.7 vs 72.2) โ€” the voice fine-tune did not damage the underlying weights.
  • Tradeoff: it can overthink easy items, and pays a small format-following tax (see IFEval below).
  • Preserved tool use: native Qwen tool-calling, function calling, and structured output retained.

Note on an earlier version of this card: a prior release claimed "3x token efficiency / shorter reasoning." That number came from a contaminated run in which Reflect was served in **no-think** mode while the base model used full thinking โ€” so it was comparing no-think against think, not the two models fairly. Run correctly with thinking enabled for both, Reflect reasons **longer**. The corrected results are below.

Training Methodology

SFT (Supervised Fine-Tuning)

  • Dataset: 1,400 curated examples
  • LoRA: r32 / a32 (1:1 alpha-to-rank)
  • Learning rate: 1e-4
  • Epochs: 1
  • Precision: Q4 (forces reconstruction)

DPO (Direct Preference Optimization)

  • 1,400 preference pairs
  • LoRA: r16
  • Learning rate: 1e-6
  • Beta: 0.1
  • Epochs: 1
  • Method: voice distillation โ€” model's own outputs as rejected, curated outputs as chosen.

Benchmarks

1. No-think accuracy (full sets, thinking disabled for both)

A clean no-think A/B to gauge base-weight similarity.

Benchmark N Base Qwen3.6 Reflect Delta
MMLU 1000 87.40% 87.60% +0.20%
GSM8K 400 96.25% 96.75% +0.50%
HumanEval 164 93.29% 92.07% -1.22%
IFEval 192 81.25% 77.08% -4.17%
ARC Challenge 400 96.75% 96.25% -0.50%
TruthfulQA 200 89.50% 87.50% -2.00%
Average 90.74% 89.54% -1.20%

Reading this honestly:

  • MMLU/GSM8K deltas (+0.2 / +0.5) are within noise at these sample sizes โ€” not evidence of a knowledge gain.
  • HumanEval, ARC, TruthfulQA: within noise; no catastrophic forgetting.
  • IFEval -4.0% (78% -> 74%; 14 regressions vs 6 gains) is a voice-vs-format tradeoff Transcript analysis shows the failures cluster on multi-constraint prompts (avg 1.79 instructions vs 1.53 overall) and on purely mechanical format rules โ€” exact sentence counts, all-lowercase, mandatory keywords/placeholders, capital-word frequency. The model is not refusing instructions; it answers fluently and its distilled voice rounds off the rigid sub-constraint (e.g. drops a required literal keyword, uses an unnumbered list, changes sentence granularity). It is not truncation โ€” responses are full-length, often longer than base. This is a predictable cost of voice distillation and shrinks if format-adherence examples are added to the SFT/DPO mix.

2. Thinking-mode recovery (corrected โ€” thinking enabled for both)

Both models retested on the 215 questions both failed in the no-think run. 3 samples/question, identical settings, both emitting real <think> traces.

Benchmark N Base pass@3 Reflect pass@3 Base think (chars) Reflect think (chars)
MMLU 138 46.4% 55.1% 2,670 6,977
GSM8K 18 61.1% 61.1% 5,165 5,994
ARC Challenge 16 50.0% 43.8% 2,066 6,247
TruthfulQA 28 46.4% 46.4% 1,114 5,161
HumanEval 15 80.0% 93.3% 9,806 6,209
Overall 215 50.2% 56.3% 3,129 6,550

Reading this honestly:

  • Reflect recovers +6.0 points overall (~13 more questions of 215) while thinking 2.09x longer. The gain is real; the mechanism is more reasoning, not less.
  • MMLU (+8.7, N=138) is the trustworthy result.
  • HumanEval: near-tie (Reflect 93% vs base 80%, ~2 questions at N=15). An earlier run showed a +33 gap; that was a trace-capture bug that recorded base's thinking as ~0 chars and under-counted its passes. With capture fixed, base actually thinks more than Reflect here (9,806 vs 6,209 chars). Both sides need a same-harness rerun before any HumanEval claim โ€” currently a wash.
  • GSM8K / TruthfulQA: ties. (The GSM8K shared-failure pool also contains several mislabeled gold answers, so its "tie" is low-confidence.)
  • ARC (-6.25, N=16): the only regression, smallest sample. A mild option-position bias under low-sample MCQ; footnote, not a trend.

3. GPQA-Diamond (full set, N=198) โ€” capability preservation + concision

Same harness, both models, both regimes.

Mode Base Qwen3.6 Reflect Delta
Thinking on 87.8%* 89.4% +1.6 (noise)
Thinking off 72.2% 71.7% -0.5 (noise)

*Base thinking-on is Qwen's published figure (their scaffold); 89.4 is on this harness โ€” cross-scaffold, so treat the think-on delta as indicative only. The no-think row is a clean same-harness A/B.

Findings:

  • Capability-neutral. Reflect matches base in both regimes. Critically, the no-think match (71.7 vs 72.2) shows the science knowledge is intact in the weights โ€” Reflect is not leaning on extra thinking to paper over fine-tune damage. This is the result you want from a voice fine-tune: voice changed, capability untouched.
  • Says less, no-think (real, same-harness token counts): matched accuracy at ~12% fewer output tokens (Reflect mean 1,645 vs base 1,880; 325,789 vs 372,208 total over 198 generations). On the 126 questions both answered correctly, Reflect was the shorter response 71% of the time โ€” consistent concision, not an outlier effect. This is "more concise worked solution," not "terse answer" โ€” both models still write full ~1,200-token solutions before the boxed answer.
  • Thinking buys ~+16 points on this set for both models (72 -> 88), so any GPQA-Diamond figure should be quoted with its thinking budget. Reflect's correct-answer thinking on hard items runs into the tens of thousands of tokens.

Truncation note: base hit the 4,096-token output cap more often than Reflect (42 vs 29 of 198) โ€” itself consistent with base being more verbose. This mildly widens the measured token gap (some base answers were clipped, not freely longer), so the true no-think concision delta is likely a few points under 12%. All 198 still parsed a valid letter, so accuracy is unaffected.

4. GPQA-Diamond stamina probe (hard subset, N=10)

Correct Avg thinking (chars) Avg "wait" count
Base Qwen3.6 6/10 42,766 33
Reflect 6/10 67,615 44

Same accuracy; Reflect thinks ~58% longer. On base's worst spiral (109k-char wrong answer) Reflect ground to ~220k chars and got it right; on the item where base hit its context cap and never answered, Reflect resolved cleanly. The difference is fewer runaway-reasoning failures, not higher accuracy.

Summary

Reflect is a "say less, think more" edit โ€” and capability-neutral underneath:

  • No-think: same accuracy, ~12% more concise output (compresses the answer channel).
  • Thinking: reasons ~2x longer, recovers more hard previously-failed questions, exhausts its context far less often (expands the reasoning channel).
  • Capability: unchanged on GPQA-Diamond in both regimes โ€” the fine-tune reshaped behavior without touching the weights' knowledge.
  • Costs: higher token spend in thinking mode, occasional overthinking of easy items, and a small IFEval format-compliance tradeoff (~-4 pts).

The Reflect Family

Model Base Status
Reflect 27B Qwen3.6-27B โœ… Released
Reflect 9B Qwen3.5-9B Coming soon
Reflect 4B Qwen3.5-4B Coming soon

Recommended Settings

  • Temperature: 0.6โ€“0.7
  • Context: up to 262K tokens
  • Quantization: Q6_K
  • System prompt: (add yours here โ€” left blank intentionally; the model was not trained against a fixed system prompt)

Technical Details

  • Base model: Qwen/Qwen3.6-27B
  • Architecture: dense transformer, 27B
  • Format: GGUF Q6_K (~22GB)
  • Training hardware: RTX Pro 6000
  • Training framework: Unsloth

About

Curation over volume: 1,400 carefully chosen examples reshape voice and reasoning behavior without collapsing capability. The honest finding across four test sets: Reflect says less and thinks more โ€” leaner output at matched accuracy, longer reasoning where it pays, and underlying capability left intact.

License

Apache 2.0 / Qwen license (same as base).

Links

Downloads last month
90
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for annnnnnnd/Qwen3.6-27B-Reflect

Base model

Qwen/Qwen3.6-27B
Quantized
(471)
this model