DJLougen committed
Commit eb3dc88 · verified · 1 Parent(s): 74d0d45

README: note that thinking model needs bundled chat_template.jinja + --jinja --reasoning-format deepseek to hide <think> tags

Files changed (1)
  1. README.md +17 -0
README.md CHANGED
@@ -35,6 +35,23 @@ This release ships as **BF16 GGUF re-converted with per-layer `layer_types` bake
  The patch is a single commit on top of `ggml-org/llama.cpp@d00685831`. It is backward-compatible — stock (non-RYS) Qwen3.5 GGUFs still load normally.
 
+ ## Note: thinking model — use the bundled `chat_template.jinja`
+
+ This is a Qwen3-Thinking derivative and emits its reasoning inside `<think>...</think>` tags. If you see raw `<think>thinking text</think>` blocks appearing inline in every response from `llama-server` (or any OpenAI-compatible client), apply the Qwen3-Thinking chat template that ships in this repo:
+
+ ```bash
+ llama-server \
+   -m Ornstein3.6-35B-A3B-RYS-SABER-BF16.gguf \
+   --jinja \
+   --chat-template-file chat_template.jinja \
+   --reasoning-format deepseek \
+   -ngl 99 -c 8192
+ ```
+
+ - `--jinja` enables Jinja chat-template parsing.
+ - `--chat-template-file chat_template.jinja` overrides whatever template is embedded in the GGUF with the correct Qwen3-Thinking one from this repo.
+ - `--reasoning-format deepseek` makes llama-server split `<think>...</think>` out into a separate `reasoning_content` field on the response instead of leaving it inline in `content`. Without this flag the tags will still appear in the response body even with the right template.
+
  ## Support This Work
 
  I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
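
The `reasoning_content` split that `--reasoning-format deepseek` produces can be sketched from the client side. This is a minimal illustration, not captured server output: the payload below is a hand-written example of the OpenAI-compatible response shape, and the field names (`choices`, `message`, `content`, `reasoning_content`) follow the standard chat-completions schema.

```python
# Minimal sketch of how a client separates reasoning from the final answer
# when llama-server runs with --jinja and --reasoning-format deepseek.
# NOTE: sample_response is an illustrative, hand-written payload,
# not real model output.

sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                # With --reasoning-format deepseek, the <think>...</think>
                # block is stripped out of content and surfaced here:
                "reasoning_content": "The user asked for 2+2; that is 4.",
                "content": "2 + 2 = 4.",
            }
        }
    ]
}

msg = sample_response["choices"][0]["message"]
answer = msg["content"]
reasoning = msg.get("reasoning_content", "")

# Without the flag, the tags would instead appear inline in content and
# clients would have to strip "<think>...</think>" themselves.
assert "<think>" not in answer
print(answer)
```

Clients that do not understand `reasoning_content` simply ignore it, which is why this split keeps the visible answer clean without losing the reasoning trace.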