YTan2000 committed on
Commit 07a4dc7 · verified · 1 Parent(s): 5d0ccf6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +5 -53
README.md CHANGED
@@ -24,23 +24,20 @@ language:
 
 ## TQ3_4S Release
 
- This repository packages the model as a TurboQuant `TQ3_4S` GGUF for local deployment. It can be used in two modes:
-
- - **Text-only chat / coding:** use `Qwen3.6-27B-TQ3_4S.gguf` only.
- - **Image-to-text / multimodal:** use `Qwen3.6-27B-TQ3_4S.gguf` together with `mmproj.gguf`.
 
 ## Runtime Compatibility
 
 This quant requires a TurboQuant-capable runtime. For llama.cpp, use the `turbo-tan/llama.cpp-tq3` fork rather than stock upstream llama.cpp if you want native `TQ3_4S` support.
 
 - TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)
 
 ## Files
 
 | File | Quant | Size |
 | --- | --- | ---: |
- | `Qwen3.6-27B-TQ3_4S.gguf` | TQ3_4S text model | ~13.0 GB |
- | `mmproj.gguf` | Qwen3.6-27B vision projector | ~889 MB |
 | `chat_template.jinja` | chat template | text |
 | `thumbnail.png` | model card image | png |
 
@@ -66,44 +63,11 @@ Prompt processing:
 
 - Use a TurboQuant-capable llama.cpp build for best performance.
 - For llama.cpp, the intended runtime is the `turbo-tan/llama.cpp-tq3` fork.
- - Text-only usage does not need `mmproj.gguf`.
- - Image-to-text usage requires `mmproj.gguf`; pass it with `--mmproj mmproj.gguf` when using `llama-server` or other compatible llama.cpp tools.
 - For llama.cpp chat usage, keep `--jinja` enabled so the bundled chat template is honored.
 - Upstream guidance recommends keeping at least `128K` context when possible for reasoning-heavy workloads. On smaller local GPUs, reduce context as needed to fit memory.
 - Upstream default sampling guidance differs between thinking and non-thinking mode; follow the official Qwen card if you are trying to reproduce base-model behavior.
 
-
- ## Text-Only vs Image-To-Text
-
- ### Text-only
-
- For normal chat, coding, and text generation, load only the main model:
-
- ```bash
- llama-server \
- -m Qwen3.6-27B-TQ3_4S.gguf \
- -ngl 99 -c 4096 -np 1 \
- -ctk q4_0 -ctv tq3_0 -fa on \
- --jinja --reasoning off --reasoning-budget 0
- ```
-
- ### Image-to-text
-
- For vision/image prompts, also load the projector:
-
- `mmproj.gguf` was smoke-tested with the `turbo-tan/llama.cpp-tq3` `llama-server` runtime on RTX 5060 Ti. The server loaded the projector as a Qwen-VL multimodal model and `/health` returned `ok`.
-
- Validated smoke-test settings:
-
- ```bash
- llama-server \
- -m Qwen3.6-27B-TQ3_4S.gguf \
- --mmproj mmproj.gguf \
- -ngl 99 -c 2048 -np 1 \
- -ctk q4_0 -ctv tq3_0 -fa on \
- --jinja --reasoning off --reasoning-budget 0
- ```
-
 ## Recommended llama.cpp Settings
 
 Default prompt-processing settings on 16 GB:
@@ -118,23 +82,11 @@ llama-bench \
 -p 2048 -n 0 -r 3
 ```
 
- Default text-only chat/server settings:
-
- ```bash
- llama-server \
- -m Qwen3.6-27B-TQ3_4S.gguf \
- --host 127.0.0.1 --port 8080 \
- -ngl 99 -c 4096 -np 1 \
- -ctk q4_0 -ctv tq3_0 -fa on \
- --jinja
- ```
-
- Image-to-text server settings:
 
 ```bash
 llama-server \
 -m Qwen3.6-27B-TQ3_4S.gguf \
- --mmproj mmproj.gguf \
 --host 127.0.0.1 --port 8080 \
 -ngl 99 -c 4096 -np 1 \
 -ctk q4_0 -ctv tq3_0 -fa on \
 
## TQ3_4S Release

+ This repository packages the model as a TurboQuant `TQ3_4S` GGUF for local deployment.

## Runtime Compatibility

This quant requires a TurboQuant-capable runtime. For llama.cpp, use the `turbo-tan/llama.cpp-tq3` fork rather than stock upstream llama.cpp if you want native `TQ3_4S` support.

- TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)
+ - LM Studio setup: [docs/backend/LMStudio.md](https://github.com/turbo-tan/llama.cpp-tq3/blob/main/docs/backend/LMStudio.md)
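The fork can presumably be built the same way as upstream llama.cpp. The steps below are a sketch under that assumption, so check the fork's own README before relying on them; `-DGGML_CUDA=ON` is upstream llama.cpp's CUDA switch and, like the rest of the recipe, is not taken from this card.

```bash
# Sketch only: assumes the turbo-tan fork keeps upstream llama.cpp's CMake layout.
git clone https://github.com/turbo-tan/llama.cpp-tq3
cd llama.cpp-tq3
cmake -B build -DGGML_CUDA=ON   # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
```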
 
## Files

| File | Quant | Size |
| --- | --- | ---: |
+ | `Qwen3.6-27B-TQ3_4S.gguf` | TQ3_4S | ~13.0 GB |
| `chat_template.jinja` | chat template | text |
| `thumbnail.png` | model card image | png |
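To pull just these files locally, one option is the `huggingface_hub` CLI. This is a sketch: `<user>/<repo>` is a placeholder for this repository's actual id, not a value taken from the card.

```bash
# Sketch: substitute the real repository id for <user>/<repo>.
huggingface-cli download <user>/<repo> \
  Qwen3.6-27B-TQ3_4S.gguf chat_template.jinja \
  --local-dir .
```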
 
 
- Use a TurboQuant-capable llama.cpp build for best performance.
- For llama.cpp, the intended runtime is the `turbo-tan/llama.cpp-tq3` fork.
+ - The upstream family is multimodal-capable, but the public 27B repos used here do not currently expose a separate GGUF `mmproj` artifact.
- For llama.cpp chat usage, keep `--jinja` enabled so the bundled chat template is honored.
- Upstream guidance recommends keeping at least `128K` context when possible for reasoning-heavy workloads. On smaller local GPUs, reduce context as needed to fit memory (see the sketch after this list).
- Upstream default sampling guidance differs between thinking and non-thinking mode; follow the official Qwen card if you are trying to reproduce base-model behavior.
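A sketch of the context-size trade-off mentioned above, reusing the flags from the server example further down; `-c 131072` corresponds to the recommended `128K`, and the value should simply be lowered (for example to `8192`) when the KV cache does not fit in VRAM:

```bash
# Sketch: long-context serving; shrink -c if the KV cache does not fit in memory.
llama-server \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  -ngl 99 -c 131072 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja
```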
## Recommended llama.cpp Settings

Default prompt-processing settings on 16 GB:

-p 2048 -n 0 -r 3
```
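Only the tail of that `llama-bench` block falls inside this diff hunk. A full invocation along the following lines should be close, but treat it as a sketch: only `-p 2048 -n 0 -r 3` comes from the card, while the model path and `-ngl` value are assumptions.

```bash
# Sketch: only -p 2048 -n 0 -r 3 is taken from the card; the other flags are assumed.
llama-bench \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  -ngl 99 \
  -p 2048 -n 0 -r 3
```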
+ Default chat/server settings:

```bash
llama-server \
-m Qwen3.6-27B-TQ3_4S.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 99 -c 4096 -np 1 \
-ctk q4_0 -ctv tq3_0 -fa on \