Report: 30 t/s on RTX 4090D (48GB VRAM) with UD-Q6_K_XL
#7
by SlavikF
System:
- Nvidia RTX 4090D 48GB VRAM
- Intel Xeon W5-3425 with 12 cores
- DDR5-4800 RAM
Speed:
- PP: starts at ~3000 t/s on small contexts and drops below 2000 t/s on long contexts
- TG: 30 t/s
```
prompt eval time = 19 s / 40387 tokens ( 0.49 ms per token, 2052.53 tokens per second)
       eval time = 49 s /  1549 tokens (32.10 ms per token,   31.15 tokens per second)
```
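For readers who want to relate these log lines to total wall time, here is a quick sketch (plain Python, numbers copied straight from the log above; nothing measured independently):

```python
# Sanity-check the llama-server timing log above: derive prefill and
# generation durations from the reported token counts and throughputs.

prompt_tokens = 40387          # prompt eval: 40387 tokens
prefill_tps = 2052.53          # reported prompt-processing speed (t/s)
gen_tokens = 1549              # eval: 1549 tokens generated
gen_tps = 31.15                # reported generation speed (t/s)

prefill_s = prompt_tokens / prefill_tps   # ~19.7 s
gen_s = gen_tokens / gen_tps              # ~49.7 s

print(f"prefill: {prefill_s:.1f} s, generation: {gen_s:.1f} s, "
      f"total: {prefill_s + gen_s:.1f} s")
# -> roughly 19.7 s + 49.7 s = ~69 s for this 40k-token request
```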
My docker compose:
```yaml
services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8882
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/huggingface/hub:/root/.cache/huggingface/hub:ro
      - ./models.ini:/app/models.ini:ro
    entrypoint: ["./llama-server"]
    command: >
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0 --port 8080
```
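Once the container is up, a quick way to confirm it is serving is to hit the server on port 8080. A minimal sketch, assuming the host name is localhost and using llama-server's standard /health and OpenAI-compatible /v1/models endpoints:

```python
# Smoke test for the container above (assumes it is reachable on localhost:8080).
import json
import urllib.request

BASE = "http://localhost:8080"

# Health check: llama-server answers on /health once a model is loaded.
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status, resp.read().decode())

# List the models the server currently exposes.
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    models = json.load(resp)
    for m in models.get("data", []):
        print("model:", m.get("id"))
```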
My INI file:
```ini
version = 1

[unsloth/Qwen3.6-27B-GGUF:Q6_K_XL]
ctx-size = 262144
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00
```
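The same sampling settings can also be sent per request through the OpenAI-compatible chat endpoint. A hedged sketch (top_k and min_p are llama.cpp-specific request fields; adjust the model id to whatever /v1/models reports on your setup):

```python
# Send the preset's sampling settings per request to /v1/chat/completions.
import json
import urllib.request

payload = {
    "model": "unsloth/Qwen3.6-27B-GGUF:Q6_K_XL",   # id as named in the preset
    "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,    # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,   # llama.cpp extension to the OpenAI schema
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```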
Using nvtop, I see 42 GB of VRAM used.
It feels like the model is a bit heavy on the thinking.
Use vLLM and you should get at least 3 times the speed.
Have you seen this Reddit post?
https://www.reddit.com/r/LocalLLaMA/comments/1rianwb/running_qwen35_27b_dense_with_170k_context_at/
As the author of that post: I also have 2x RTX 3090, and I've never had a coding AI this fast and this high-quality. (I used to run GGUF Q8, but the AWQ setup in the post is far better quality.)
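For anyone wanting to try the vLLM/AWQ route suggested above, here is a minimal sketch using vLLM's offline LLM API (the `vllm serve` server route works the same way). The model id is a placeholder, not a real repo; substitute the AWQ quant from the linked post, and tune max_model_len and tensor_parallel_size to your hardware:

```python
# Hedged sketch of running an AWQ quant with vLLM across two GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/repo-of-the-AWQ-quant",  # placeholder, not a real repo id
    quantization="awq",
    tensor_parallel_size=2,     # e.g. split across 2x RTX 3090
    max_model_len=131072,       # long context; reduce if you hit OOM
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain KV-cache quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```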