README: note that thinking model needs bundled chat_template.jinja + --jinja --reasoning-format deepseek to hide <think> tags
README.md (changed)

@@ -35,6 +35,23 @@ This release ships as **BF16 GGUF re-converted with per-layer `layer_types` bake
The patch is a single commit on top of `ggml-org/llama.cpp@d00685831`. It is backward-compatible — stock (non-RYS) Qwen3.5 GGUFs still load normally.
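
If you want to build the patched server yourself, the steps are the usual llama.cpp checkout, patch, and CMake build. A minimal sketch, assuming the patch is shipped as a standalone `.patch` file alongside the GGUFs in this repo; `rys-qwen3.patch` below is a placeholder name, so substitute whatever the file is actually called:

```bash
# Sketch only: build llama.cpp at the pinned commit with the RYS patch applied.
# "rys-qwen3.patch" is a placeholder; use the actual patch file from this repo.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout d00685831
git apply /path/to/rys-qwen3.patch
cmake -B build
cmake --build build --config Release -j
```
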
## Note: thinking model — use the bundled `chat_template.jinja`
This is a Qwen3-Thinking derivative and emits its reasoning inside `<think>...</think>` tags. If you see raw `<think>thinking text</think>` blocks appearing inline in every response from `llama-server` (or any OpenAI-compatible client), you need to apply the Qwen3 thinking chat template that ships in this repo:
```bash
llama-server \
  -m Ornstein3.6-35B-A3B-RYS-SABER-BF16.gguf \
  --jinja \
  --chat-template-file chat_template.jinja \
  --reasoning-format deepseek \
  -ngl 99 -c 8192
```
- `--jinja` enables Jinja chat-template parsing.
- `--chat-template-file chat_template.jinja` overrides whatever template is embedded in the GGUF with the correct Qwen3-Thinking one from this repo.
- `--reasoning-format deepseek` makes llama-server split `<think>...</think>` out into a separate `reasoning_content` field in the response instead of leaving it inline in `content`. Without this flag the tags still appear in the response body even with the correct template (see the example below).
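
A quick way to confirm the split is to query llama-server's OpenAI-compatible endpoint and inspect both fields. A minimal sketch, assuming the server is on the default port 8080 and that `jq` is available; with the flags above, the `<think>` text should land in `reasoning_content` and only the final answer in `content`:

```bash
# Sketch: check that reasoning is separated from the visible answer.
# Assumes llama-server's default port 8080; adjust the URL if you set --port.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 17 * 23?"}]}' \
  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'
```
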
## Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.