2x4060 Reporting - 22tok/s on UD-Q4_K_XL

#9
by mrchuy
podman run -d \
  --name llama-qwen36-27b-gguf \
  --device nvidia.com/gpu=all \
  -v /data/models:/root/.cache/huggingface:Z \
  -v /data/vllm_cache:/cache:Z \
  -p 8001:8080 \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --ipc=host \
  --restart=unless-stopped \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --hf-repo unsloth/Qwen3.6-27B-GGUF \
  --hf-file Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers all \
  --ctx-size 125000 \
  --batch-size 4096 \
  --mmap \
  --ubatch-size 2048 \
  --flash-attn on \
  --split-mode tensor \
  --ctx-checkpoints -1 \
  --tensor-split 1,1 \
  --threads 16 \
  --threads-batch 20 \
  --cache-ram 4096 \
  --parallel 2 \
  --jinja \
  --reasoning on \
  --reasoning-budget 1000 \
  --metrics
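Once the container is up, the server can be exercised through the mapped host port 8001. A quick sketch (llama-server exposes a /health endpoint and an OpenAI-compatible /v1/chat/completions endpoint; the prompt text here is just an example):

```shell
# Health check: returns {"status":"ok"} once the model has loaded
curl -s http://localhost:8001/health

# OpenAI-compatible chat completion against the mapped port
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```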

podman logs llama-qwen36-27b-gguf --follow

prompt eval time =    4576.02 ms /  2945 tokens (    1.55 ms per token,   643.57 tokens per second)
       eval time =   43278.35 ms /   965 tokens (   44.85 ms per token,    22.30 tokens per second)
      total time =   47854.37 ms /  3910 tokens

prompt eval time =    5234.37 ms /  3430 tokens (    1.53 ms per token,   655.28 tokens per second)
       eval time =   82262.34 ms /  1830 tokens (   44.95 ms per token,    22.25 tokens per second)
      total time =   87496.71 ms /  5260 tokens
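The reported decode rate can be sanity-checked from the raw numbers in the log line above; e.g. for the second run, 965 generated tokens over 43278.35 ms:

```shell
# Recompute eval throughput: tokens / (milliseconds / 1000)
awk 'BEGIN { printf "%.2f tokens per second\n", 965 / (43278.35 / 1000) }'
# prints "22.30 tokens per second", matching the server's report
```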

Note: split-mode "tensor" is very fresh (active PRs over the past two weeks), and --ctx-checkpoints -1 is only needed with it. "row" is more stable (~17 tok/s here), but GPU utilization stays at 60% instead of 80-90%.

Wed Apr 22 13:44:32 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:04:00.0 Off |                  N/A |
| 33%   61C    P2            113W /  165W |   15524MiB /  16380MiB |     87%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:0A:00.0 Off |                  N/A |
| 41%   70C    P2            119W /  165W |   14386MiB /  16380MiB |     84%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          219901      C   /app/llama-server                     15514MiB |
|    1   N/A  N/A          219901      C   /app/llama-server                     14376MiB |
+-----------------------------------------------------------------------------------------+
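The one-off nvidia-smi snapshot above only catches a moment in time; utilization during generation can be streamed instead, which makes the tensor-vs-row utilization difference easier to see (standard nvidia-smi options, assuming the usual NVIDIA driver tooling):

```shell
# Stream per-GPU utilization and memory use once per second (Ctrl-C to stop)
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```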

Let's see how far I can push these GPU-poor beauties.
