Report: 30 t/s on RTX 4090D (48GB VRAM) with UD-Q6_K_XL

#7 by SlavikF

System:

  • Nvidia RTX 4090D 48GB VRAM
  • Intel Xeon W5-3425 with 12 cores
  • DDR5-4800 RAM

Speed:

  • PP: starts around 3000 t/s on short contexts and drops below 2000 t/s on long contexts
  • TG: 30 t/s
prompt eval time =   19 s / 40387 tokens (    0.49 ms per token,  2052.53 tokens per second)
       eval time =   49 s /  1549 tokens (   32.10 ms per token,    31.15 tokens per second)
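
The seconds in that log are rounded, but the figures are redundant, so they can be cross-checked against each other. A quick sanity check in Python:

# Cross-check the llama-server timing log above. The printed seconds
# are rounded; the t/s and ms/token figures are not, so derive the
# elapsed times from tokens / (tokens per second) instead.
prompt_tokens, prompt_tps = 40387, 2052.53
gen_tokens, gen_ms_per_tok = 1549, 32.10

print(prompt_tokens / prompt_tps)          # ~19.7 s prompt eval time
print(1000 / prompt_tps)                   # ~0.49 ms per prompt token
print(gen_tokens * gen_ms_per_tok / 1000)  # ~49.7 s eval time
print(1000 / gen_ms_per_tok)               # ~31.2 t/s generation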

My docker compose:

services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8882
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/huggingface/hub:/root/.cache/huggingface/hub:ro
      - ./models.ini:/app/models.ini:ro
    entrypoint: ["./llama-server"]
    command: >
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0  --port 8080
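
As a quick smoke test (a minimal sketch, assuming this build exposes llama-server's usual /health endpoint and the OpenAI-compatible /v1/models listing), you can confirm the router is up and see which presets it serves:

import json, urllib.request

BASE = "http://localhost:8080"  # port mapped in the compose file above

# /health returns 200 once the server is ready (503 while a model is
# still loading, which urllib surfaces as an HTTPError).
with urllib.request.urlopen(f"{BASE}/health") as r:
    print(r.status, r.read().decode())

# The OpenAI-compatible model list should reflect the models.ini presets.
with urllib.request.urlopen(f"{BASE}/v1/models") as r:
    for m in json.load(r)["data"]:
        print(m["id"])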

My INI file:

version = 1

[unsloth/Qwen3.6-27B-GGUF:Q6_K_XL]
ctx-size=262144
temp=0.6
top-p=0.95
top-k=20
min-p=0.00
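
These sampler settings can also be sent per request. A minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint and that the router matches the model field against the preset section name; top_k and min_p are llama.cpp extensions to the standard OpenAI schema:

import json, urllib.request

payload = {
    # Preset section name from models.ini above:
    "model": "unsloth/Qwen3.6-27B-GGUF:Q6_K_XL",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    # Same sampler settings as the INI preset:
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,  # llama.cpp extension to the OpenAI schema
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["choices"][0]["message"]["content"])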

Using nvtop, I see 42 GB of VRAM used.
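
If you'd rather log that VRAM figure than eyeball nvtop, nvidia-smi's query mode can be scripted; a small sketch (assumes nvidia-smi is on PATH):

import subprocess

# Query used/total VRAM in MiB per GPU via nvidia-smi's CSV output.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for i, line in enumerate(out.strip().splitlines()):
    used, total = (int(x) for x in line.split(","))
    print(f"GPU {i}: {used} / {total} MiB")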

It feels like the model is a bit heavy on thinking.

Use vLLM and you should get at least 3x the speed.
Have you seen this Reddit post?
https://www.reddit.com/r/LocalLLaMA/comments/1rianwb/running_qwen35_27b_dense_with_170k_context_at/
As the author of that post: I also have 2x RTX 3090, and I've never had a coding AI this fast and this high quality. (I used to run GGUF Q8, but the AWQ setup in the post is far better quality.)
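
For anyone wanting to try that route, here is a minimal vLLM sketch in the spirit of the Reddit post. The model ID is a placeholder (substitute the actual AWQ checkpoint from the post), tensor_parallel_size=2 assumes the dual RTX 3090 setup, and the 170k context figure is the post's claim, not something verified here:

from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/some-27b-awq",  # placeholder, not a real repo ID
    quantization="awq",
    tensor_parallel_size=2,        # split across the two RTX 3090s
    max_model_len=170_000,         # the ~170k context claimed in the post
)

# Same sampler settings as the llama.cpp preset above.
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0)
outputs = llm.generate(["Explain KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)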
