DFlash Qwen3.5-27B Uncensored

27B hybrid linear-attention model | BF16 full-precision | Vision + Text | DFlash speculative decoding

Performance (DGX Spark GB10, NVFP4 version)

|               | Without DFlash | With DFlash | Speedup |
|---------------|----------------|-------------|---------|
| Single-stream | 12.2 tok/s     | 33.2 tok/s  | 2.7x    |
| 4 concurrent  | 48.1 tok/s     | 85.5 tok/s  | 1.8x    |

| Metric     | Value                           |
|------------|---------------------------------|
| Model Size | ~52 GB (BF16) / ~20 GB (NVFP4)  |
| TTFT       | 98-138 ms                       |

Quick Links

| Resource | Details |
|----------|---------|
| Get Started | Step-by-step quick start guide on DGX Spark |
| Docker Image | `ghcr.io/aeon-7/vllm-dflash:latest` |
| NVFP4 Version | AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 — use this if you have an NVIDIA Blackwell or later GPU (see "Why NVFP4 on Blackwell" below) |
| DFlash Drafter | z-lab/Qwen3.5-27B-DFlash |
| Base Model | Qwen/Qwen3.5-27B |
| DFlash Paper | arXiv 2602.06036 |

Quick Start (DGX Spark)

1. Download the model

```shell
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored
```

2. Create your environment file

```shell
# Auto-generate an API key while writing .env. The unquoted heredoc lets
# $(openssl ...) and $HOME expand immediately, so no post-processing is
# needed. Use $HOME rather than "~": docker compose does not expand "~"
# when it reads values from an env file.
cat > .env.dflash << EOF
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path
MODEL_HOST_PATH=$HOME/models/DFlash-Qwen3.5-27B-Uncensored

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (BF16, 64K context)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=2
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_BATCHED_TOKENS=65536
EOF

echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"
```

3. Save docker-compose.dflash-bf16.yml

```yaml
services:
  vllm-dflash-bf16:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash-bf16
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:
```

4. Launch

```shell
docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml up -d

# Watch startup (~5-8 min for weight loading + compilation)
docker compose -f docker-compose.dflash-bf16.yml logs -f
```

5. Test

```shell
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
```

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| MODEL_HOST_PATH | — | Host path to model weights |
| DFLASH_DRAFTER | z-lab/Qwen3.5-27B-DFlash | HF repo ID for drafter (auto-downloaded). Set `off` to disable. |
| DFLASH_NUM_SPEC_TOKENS | 15 | Tokens per draft step |
| VLLM_API_KEY | — | API key for LAN authentication |
| HF_TOKEN | — | HuggingFace token for gated models |
| GPU_MEMORY_UTILIZATION | 0.85 | GPU memory fraction (higher for BF16) |

Why This Model

Why Dense Over MoE

Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:

  • Higher quality per FLOP — Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts are undertrained and routing decisions introduce noise. Dense models don't have this problem.
  • No routing overhead — MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
  • Predictable latency — No variance from different experts being selected per token. Every forward pass costs the same.
  • Simpler deployment — No expert parallelism concerns, no load imbalance, fits on a single GPU with NVFP4.

The tradeoff has always been speed: a 27B dense model moves 27B parameters through memory per token, while the 122B MoE only moves ~10B active parameters. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant the dense model was slow — 12 tok/s baseline.

DFlash changes this equation entirely. See below.

Why DFlash Makes Dense Practical on DGX Spark

The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, loading 20 GB of NVFP4 weights per token limits you to ~12 tok/s. Every dense model hits this wall.
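The ceiling quoted above is simple arithmetic; a quick back-of-envelope check with the sizes from the table (a rough model that only counts weight traffic, ignoring KV-cache and activation reads):

```python
# Back-of-envelope decode ceiling: each generated token must stream the full
# weight set through memory once, so bandwidth / weight size bounds tok/s.
bandwidth_gb_s = 273                      # DGX Spark (GB10) memory bandwidth
weights_gb = {"NVFP4": 20, "BF16": 52}

ceilings = {name: bandwidth_gb_s / size for name, size in weights_gb.items()}
for name, tps in ceilings.items():
    print(f"{name}: <= {tps:.1f} tok/s")  # NVFP4 ~13.7, BF16 ~5.2
```

The measured 12.2 tok/s sits just under the ~13.7 tok/s NVFP4 ceiling, confirming that decode is bandwidth-bound rather than compute-bound.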

DFlash block-diffusion speculative decoding breaks through it:

  1. The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
  2. The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
  3. Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
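The accept-and-amortize loop above can be sketched as a toy simulation. The per-token acceptance probability of 0.75 is an illustrative assumption, chosen only to land near the observed 3-4 accepted tokens per pass; the real acceptance rate depends on the drafter and the prompt:

```python
import random

def simulate(num_steps=10_000, k=15, p_accept=0.75):
    """Toy model of draft-then-verify throughput: each target forward pass
    checks k drafted tokens; each is accepted independently with probability
    p_accept until the first rejection, and the target always contributes one
    token of its own (the correction, or a bonus token if all k pass)."""
    total = 0
    for _ in range(num_steps):
        accepted = 0
        for _ in range(k):
            if random.random() < p_accept:
                accepted += 1
            else:
                break
        total += accepted + 1  # +1 for the target's own token
    return total / num_steps

random.seed(0)
print(f"~{simulate():.2f} tokens per target forward pass")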

The result on DGX Spark:

|                                 | Without DFlash            | With DFlash           |
|---------------------------------|---------------------------|-----------------------|
| Single-stream                   | 12.2 tok/s                | 33.2 tok/s            |
| Effective bandwidth utilization | 1 token per pass          | ~3.5 tokens per pass  |
| Practical feel                  | Sluggish, noticeable delay | Responsive, fluid    |

This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."

Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

  • Linear attention (GDN) — Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
  • Full attention — Standard multi-head attention every 4th layer for global context capture (16 layers)

This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
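The interleaving can be sketched as follows. The 48/16 split and "every 4th layer" pattern come from the description above; the exact offset of the full-attention layers within each group of four is an assumption:

```python
# Sketch of the hybrid layer stack: full attention every 4th layer, GDN
# (linear attention) otherwise, across 64 layers.
layers = ["full" if (i + 1) % 4 == 0 else "gdn" for i in range(64)]
print(layers.count("gdn"), layers.count("full"))  # 48 16
```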

Vision + Text

Includes a 27-layer ViT vision encoder (460M params) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.

DFlash Block-Diffusion Speculative Decoding

Pair with z-lab/Qwen3.5-27B-DFlash — a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step. The container auto-downloads and configures this.

Abliteration

Created using the orthogonal projection abliteration technique:

  1. Measures refusal directions across harmful/harmless prompt pairs
  2. Analyzes layer-by-layer activation patterns to identify the refusal direction
  3. Abliterates by projecting out the refusal direction from weight matrices

Modifies weights directly (not LoRA/adapter). Standalone BF16 model with no built-in refusal behavior.
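The projection in step 3 can be illustrated with a toy weight matrix. This is a minimal sketch, not the actual extraction pipeline: `r` stands in for a refusal direction that would in practice be measured from activation differences over harmful/harmless prompt pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))   # toy weight matrix (writes into the residual stream)
r = rng.standard_normal(d)
r /= np.linalg.norm(r)            # unit-norm "refusal direction" (illustrative)

# Orthogonal projection: remove the component of W's outputs along r,
# i.e. W_abl = (I - r r^T) W
W_abl = W - np.outer(r, r) @ W

# After ablation, no input can produce output along the refusal direction:
x = rng.standard_normal(d)
print(abs(r @ (W_abl @ x)))       # ~0
```

Because the projection is applied to the weights themselves, the change is baked into the checkpoint, which is why the result ships as a standalone model rather than a LoRA/adapter.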

Model Details

| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Precision | BF16 |
| Model Size | ~52 GB |

Why NVFP4 on Blackwell

If you have an NVIDIA Blackwell GPU (B200, GB200, GB10/DGX Spark, or later), you should use the NVFP4 version instead. Here's why:

NVFP4 is effectively lossless on Blackwell. The FP4 (E2M1) format is a native tensor core datatype on Blackwell's SM 12.x architecture. Unlike older INT4/GPTQ quantization that introduces significant degradation, NVFP4 with AWQ_FULL calibration preserves model quality while giving you:

  • 3x memory reduction — 20 GB vs 52 GB, freeing memory for longer context and more concurrent requests
  • Hardware-accelerated FP4 GEMM — Blackwell tensor cores execute FP4 matrix multiplies natively via FlashInfer CUTLASS, not through dequantize-then-compute
  • Higher throughput — The smaller weight footprint means less memory bandwidth consumed per token, directly translating to faster inference
  • Same quality — AWQ_FULL uses exhaustive grid search (10 scaling factors per layer) plus clipping optimization. The vision encoder, embeddings, norms, and lm_head remain in full BF16

This is a free performance boost — you get the same model quality at 3x less memory and measurably faster inference. The BF16 version here is primarily for non-Blackwell hardware or research workflows that need full-precision weights.
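For reference, E2M1 packs each value into 4 bits: 1 sign, 2 exponent (bias 1), 1 mantissa. Enumerating all 16 code points shows the 8 representable magnitudes; the per-block scale factor that NVFP4 layers on top is omitted here:

```python
def decode_e2m1(bits: int) -> float:
    """Decode a 4-bit FP4 E2M1 code point (1 sign, 2 exponent, 1 mantissa)."""
    sign = -1.0 if bits & 0b1000 else 1.0
    e = (bits >> 1) & 0b11
    m = bits & 0b1
    if e == 0:
        mag = 0.5 * m                   # subnormal: no implicit leading 1
    else:
        mag = (1 + 0.5 * m) * 2 ** (e - 1)  # normal: implicit 1, bias 1
    return sign * mag

mags = sorted({abs(decode_e2m1(b)) for b in range(16)})
print(mags)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only 8 magnitudes per element, the per-block scaling is what lets the format track each weight block's dynamic range, which is why calibration (AWQ_FULL here) matters so much for quality.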


Alternative Deployment

vLLM (Manual)

```shell
vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AEON-7/DFlash-Qwen3.5-27B-Uncensored"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Hello, tell me about yourself."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Credits

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.
