Report: 30 t/s on RTX 4090D (48GB VRAM) with UD-Q6_K_XL

#7 by SlavikF

System:

  • Nvidia RTX 4090D 48GB VRAM
  • Intel Xeon W5-3425 with 12 cores
  • DDR5-4800 RAM

Speed:

  • PP: starts around 3000 t/s on short contexts and drops below 2000 t/s on long contexts
  • TG: 30 t/s
prompt eval time =   19 s / 40387 tokens (    0.49 ms per token,  2052.53 tokens per second)
       eval time =   49 s /  1549 tokens (   32.10 ms per token,    31.15 tokens per second)
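
The seconds in that log are rounded, but the figures are redundant, so they can be cross-checked against each other. A quick sanity check in Python:

# Cross-check the llama-server timing log above. The printed seconds
# are rounded; the t/s and ms/token figures are not, so derive the
# elapsed times from tokens / (tokens per second) instead.
prompt_tokens, prompt_tps = 40387, 2052.53
gen_tokens, gen_ms_per_tok = 1549, 32.10

print(prompt_tokens / prompt_tps)          # ~19.7 s prompt eval time
print(1000 / prompt_tps)                   # ~0.49 ms per prompt token
print(gen_tokens * gen_ms_per_tok / 1000)  # ~49.7 s eval time
print(1000 / gen_ms_per_tok)               # ~31.2 t/s generation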

My docker compose:

services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8882
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/huggingface/hub:/root/.cache/huggingface/hub:ro
      - ./models.ini:/app/models.ini:ro
    entrypoint: ["./llama-server"]
    command: >
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0  --port 8080
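
As a quick smoke test (a minimal sketch, assuming this build exposes llama-server's usual /health endpoint and the OpenAI-compatible /v1/models listing), you can confirm the router is up and see which presets it serves:

import json, urllib.request

BASE = "http://localhost:8080"  # port mapped in the compose file above

# /health returns 200 once the server is ready (503 while a model is
# still loading, which urllib surfaces as an HTTPError).
with urllib.request.urlopen(f"{BASE}/health") as r:
    print(r.status, r.read().decode())

# The OpenAI-compatible model list should reflect the models.ini presets.
with urllib.request.urlopen(f"{BASE}/v1/models") as r:
    for m in json.load(r)["data"]:
        print(m["id"])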

My INI file:

version = 1

[unsloth/Qwen3.6-27B-GGUF:Q6_K_XL]
ctx-size=262144
temp=0.6
top-p=0.95
top-k=20
min-p=0.00
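
These sampler settings can also be sent per request. A minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint and that the router matches the model field against the preset section name; top_k and min_p are llama.cpp extensions to the standard OpenAI schema:

import json, urllib.request

payload = {
    # Preset section name from models.ini above:
    "model": "unsloth/Qwen3.6-27B-GGUF:Q6_K_XL",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    # Same sampler settings as the INI preset:
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,  # llama.cpp extension to the OpenAI schema
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["choices"][0]["message"]["content"])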

Using nvtop, I see 42 GB of VRAM used.
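
If you'd rather log that VRAM figure than eyeball nvtop, nvidia-smi's query mode can be scripted; a small sketch (assumes nvidia-smi is on PATH):

import subprocess

# Query used/total VRAM in MiB per GPU via nvidia-smi's CSV output.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    used, total = (int(x) for x in line.split(","))
    print(f"GPU {i}: {used} / {total} MiB")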

It feels like the model is a bit heavy on thinking.

Use vLLM and you should get at least 3x the speed.
Have you seen this Reddit post?
https://www.reddit.com/r/LocalLLaMA/comments/1rianwb/running_qwen35_27b_dense_with_170k_context_at/
As the author of that post: I also have 2x RTX 3090, and I've never had a coding AI this fast and this high quality. (I used to run GGUF Q8, but the AWQ setup in the post is far better quality.)
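
For anyone wanting to try that route, here is a minimal vLLM sketch in the spirit of the Reddit post. The model ID is a placeholder (substitute the actual AWQ checkpoint from the post), tensor_parallel_size=2 assumes the dual RTX 3090 setup, and the 170k context figure is the post's claim, not something verified here:

from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/some-27b-awq",  # placeholder, not a real repo ID
    quantization="awq",
    tensor_parallel_size=2,        # split across the two RTX 3090s
    max_model_len=170_000,         # the ~170k context claimed in the post
)

# Same sampler settings as the llama.cpp preset above.
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0)
outputs = llm.generate(["Explain KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)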
