2x4060 Reporting - 22tok/s on UD-Q4_K_XL
#9
by mrchuy
podman run -d \
--name llama-qwen36-27b-gguf \
--device nvidia.com/gpu=all \
-v /data/models:/root/.cache/huggingface:Z \
-v /data/vllm_cache:/cache:Z \
-p 8001:8080 \
--env NVIDIA_VISIBLE_DEVICES=all \
--ipc=host \
--restart=unless-stopped \
ghcr.io/ggml-org/llama.cpp:server-cuda13 \
--hf-repo unsloth/Qwen3.6-27B-GGUF \
--hf-file Qwen3.6-27B-UD-Q4_K_XL.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers all \
--ctx-size 125000 \
--batch-size 4096 \
--mmap \
--ubatch-size 2048 \
--flash-attn on \
--split-mode tensor \
--ctx-checkpoints -1 \
--tensor-split 1,1 \
--threads 16 \
--threads-batch 20 \
--cache-ram 4096 \
--parallel 2 \
--jinja \
--reasoning on \
--reasoning-budget 1000 \
--metrics
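Once the container is up, a quick smoke test (my sketch, not from the post) looks like this. The container maps host port 8001 to the server's 8080, and llama-server exposes a /health endpoint plus an OpenAI-compatible chat completions API; the JSON fields shown are standard OpenAI-style parameters:

```shell
# Check the server is alive (host port 8001 -> container port 8080)
curl -s http://localhost:8001/health

# Send a minimal chat completion request to the OpenAI-compatible endpoint
curl -s http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'
```

With --metrics enabled, Prometheus-format counters are also served at /metrics on the same port.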
podman logs llama-qwen36-27b-gguf --follow
prompt eval time = 4576.02 ms / 2945 tokens ( 1.55 ms per token, 643.57 tokens per second)
eval time = 43278.35 ms / 965 tokens ( 44.85 ms per token, 22.30 tokens per second)
total time = 47854.37 ms / 3910 tokens
prompt eval time = 5234.37 ms / 3430 tokens ( 1.53 ms per token, 655.28 tokens per second)
eval time = 82262.34 ms / 1830 tokens ( 44.95 ms per token, 22.25 tokens per second)
total time = 87496.71 ms / 5260 tokens
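The per-token numbers in the log are internally consistent; for example, the decode speed of the first run can be recomputed from the raw figures:

```shell
# 965 generated tokens over 43278.35 ms of eval time
# should reproduce the logged ~22.30 tokens per second.
awk 'BEGIN { printf "%.2f tok/s\n", 965 / (43278.35 / 1000) }'
```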
Note: --split-mode tensor is very fresh (active PRs over the past two weeks), and --ctx-checkpoints -1 is only needed with it. "row" is more stable, at ~17 tok/s, but GPU utilization sits around 60% instead of 80-90%.
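To compare utilization between split modes without eyeballing the full nvidia-smi table, the standard query flags can poll just the relevant fields (these are stock nvidia-smi options, not specific to this setup):

```shell
# Print per-GPU utilization and memory use once per second
nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
  --format=csv -l 1
```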
Wed Apr 22 13:44:32 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03 Driver Version: 595.58.03 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:04:00.0 Off | N/A |
| 33% 61C P2 113W / 165W | 15524MiB / 16380MiB | 87% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4060 Ti Off | 00000000:0A:00.0 Off | N/A |
| 41% 70C P2 119W / 165W | 14386MiB / 16380MiB | 84% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 219901 C /app/llama-server 15514MiB |
| 1 N/A N/A 219901 C /app/llama-server 14376MiB |
+-----------------------------------------------------------------------------------------+
Let's see how much I can push these gpu-poor beauties.