# DFlash Qwen3.5-27B Uncensored

27B hybrid linear-attention model | BF16 full-precision | Vision + Text | DFlash speculative decoding
## Performance (DGX Spark GB10, NVFP4 version)

| | Without DFlash | With DFlash | Speedup |
|---|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s | 2.7x |
| 4 concurrent | 48.1 tok/s | 85.5 tok/s | 1.8x |
| Metric | Value |
|---|---|
| Model Size | ~52 GB (BF16) / ~20 GB (NVFP4) |
| TTFT | 98-138 ms |
## Quick Links

| Link | Description |
|---|---|
| Get Started | Step-by-step quick start guide on DGX Spark |
| Docker Image | ghcr.io/aeon-7/vllm-dflash:latest |
| NVFP4 Version | AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 — use this on an NVIDIA Blackwell or later GPU (see "Why NVFP4 on Blackwell" below) |
| DFlash Drafter | z-lab/Qwen3.5-27B-DFlash |
| Base Model | Qwen/Qwen3.5-27B |
| DFlash Paper | arXiv 2602.06036 |
## Quick Start (DGX Spark)

### 1. Download the model

```bash
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored
```
### 2. Create your environment file

The heredoc below is unquoted, so `$(openssl rand -hex 32)` and `$HOME` expand when the file is written. Note that docker compose does not expand `~` in env files, so an absolute path is required for `MODEL_HOST_PATH`.

```bash
# Generate an API key and create .env.dflash
cat > .env.dflash << EOF
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path (absolute; docker compose does not expand ~)
MODEL_HOST_PATH=$HOME/models/DFlash-Qwen3.5-27B-Uncensored

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (BF16, 64K context)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=2
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_BATCHED_TOKENS=65536
EOF

echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"
```
### 3. Save docker-compose.dflash-bf16.yml

```yaml
services:
  vllm-dflash-bf16:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash-bf16
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:
```
### 4. Launch

```bash
docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml up -d

# Watch startup (~5-8 min for weight loading + compilation)
docker compose -f docker-compose.dflash-bf16.yml logs -f
```
### 5. Test

```bash
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'
```

```bash
# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
```
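For programmatic access, the server accepts the same OpenAI-style payloads. A minimal sketch of building the vision request body in Python (it only constructs the JSON, mirroring the curl example above; sending it requires the running server and any HTTP client):

```python
import json

# OpenAI-compatible chat payload with one image part and one text part.
payload = {
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
            {"type": "text", "text": "What do you see?"},
        ],
    }],
    "max_tokens": 200,
}

# POST this body to http://localhost:8000/v1/chat/completions with the
# Authorization: Bearer <VLLM_API_KEY> header.
body = json.dumps(payload)
```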
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| MODEL_HOST_PATH | — | Host path to model weights |
| DFLASH_DRAFTER | z-lab/Qwen3.5-27B-DFlash | HF repo ID for drafter (auto-downloaded). Set `off` to disable. |
| DFLASH_NUM_SPEC_TOKENS | 15 | Tokens per draft step |
| VLLM_API_KEY | — | API key for LAN authentication |
| HF_TOKEN | — | HuggingFace token for gated models |
| GPU_MEMORY_UTILIZATION | 0.85 | GPU memory fraction (higher for BF16) |
## Why This Model

### Why Dense Over MoE
Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:
- Higher quality per FLOP — Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts are undertrained and routing decisions introduce noise. Dense models don't have this problem.
- No routing overhead — MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
- Predictable latency — No variance from different experts being selected per token. Every forward pass costs the same.
- Simpler deployment — No expert parallelism concerns, no load imbalance, fits on a single GPU with NVFP4.
The tradeoff has always been speed: a 27B dense model moves 27B parameters through memory per token, while the 122B MoE only moves ~10B active parameters. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant the dense model was slow — 12 tok/s baseline.
DFlash changes this equation entirely. See below.
### Why DFlash Makes Dense Practical on DGX Spark
The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, loading 20 GB of NVFP4 weights per token limits you to ~12 tok/s. Every dense model hits this wall.
DFlash block-diffusion speculative decoding breaks through it:
- The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
- The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
- Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
The result on DGX Spark:
| | Without DFlash | With DFlash |
|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s |
| Effective bandwidth utilization | 1 token per pass | ~3.5 tokens per pass |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |
This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."
### Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

- Linear attention (GDN) — 48 Gated Delta Network layers for efficient long-context processing with O(1) per-token state
- Full attention — 16 standard multi-head attention layers, one every 4th layer, for global context capture
This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
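The interleaving can be sketched as follows (the exact placement of the full-attention layers is an assumption for illustration; only the 48/16 split and the every-4th-layer cadence are stated above):

```python
NUM_LAYERS = 64

# Assume every 4th layer (0-indexed layers 3, 7, 11, ..., 63) is full attention.
full_attention = [i for i in range(NUM_LAYERS) if (i + 1) % 4 == 0]
gdn = [i for i in range(NUM_LAYERS) if (i + 1) % 4 != 0]

print(len(gdn), len(full_attention))  # 48 16
```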
### Vision + Text
Includes a 27-layer ViT vision encoder (460M params) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.
### DFlash Block-Diffusion Speculative Decoding
Pair with z-lab/Qwen3.5-27B-DFlash — a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step. The container auto-downloads and configures this.
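The verification step works like standard speculative decoding: the target accepts the longest draft prefix that matches its own predictions, then substitutes its own token at the first disagreement. A toy sketch of greedy acceptance (the function name and token-id-list interface are illustrative, not the vLLM API):

```python
def greedy_accept(draft_tokens, target_tokens):
    """Accept the longest prefix of the draft that the target agrees with;
    at the first mismatch, emit the target's own token and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction ends this step
            break
    return accepted

# Draft proposes 5 tokens; target agrees with the first 3 and corrects the 4th.
print(greedy_accept([11, 22, 33, 44, 55], [11, 22, 33, 99, 55]))  # [11, 22, 33, 99]
```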
### Abliteration
Created using the orthogonal projection abliteration technique:
- Measures refusal directions across harmful/harmless prompt pairs
- Analyzes layer-by-layer activation patterns to identify the refusal direction
- Abliterates by projecting out the refusal direction from weight matrices
Modifies weights directly (not LoRA/adapter). Standalone BF16 model with no built-in refusal behavior.
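The projection step can be sketched with NumPy (a minimal illustration of removing one direction from a weight matrix's output space, assuming a single refusal direction `r`; the real pipeline extracts `r` from activations and applies this per layer):

```python
import numpy as np

def ablate(W, r):
    """Project the refusal direction out of W's outputs: W' = (I - r r^T) W for unit r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy weight matrix (output dim 8)
r = rng.standard_normal(8)         # toy refusal direction in output space

W_abl = ablate(W, r)
# Outputs of the ablated matrix have zero component along r:
print(np.allclose(r @ W_abl, 0.0))  # True
```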
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Precision | BF16 |
| Model Size | ~52 GB |
## Why NVFP4 on Blackwell
If you have an NVIDIA Blackwell GPU (B200, GB200, GB10/DGX Spark, or later), you should use the NVFP4 version instead. Here's why:
NVFP4 is effectively lossless on Blackwell. The FP4 (E2M1) format is a native tensor core datatype on Blackwell's SM 12.x architecture. Unlike older INT4/GPTQ quantization that introduces significant degradation, NVFP4 with AWQ_FULL calibration preserves model quality while giving you:
- 3x memory reduction — 20 GB vs 52 GB, freeing memory for longer context and more concurrent requests
- Hardware-accelerated FP4 GEMM — Blackwell tensor cores execute FP4 matrix multiplies natively via FlashInfer CUTLASS, not through dequantize-then-compute
- Higher throughput — The smaller weight footprint means less memory bandwidth consumed per token, directly translating to faster inference
- Same quality — AWQ_FULL uses exhaustive grid search (10 scaling factors per layer) plus clipping optimization. The vision encoder, embeddings, norms, and lm_head remain in full BF16
This is a free performance boost — you get the same model quality at 3x less memory and measurably faster inference. The BF16 version here is primarily for non-Blackwell hardware or research workflows that need full-precision weights.
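For intuition, E2M1 (1 sign, 2 exponent, 1 mantissa bits) represents only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A hypothetical round-to-nearest sketch (NVFP4 additionally uses per-block FP8 scale factors and calibrated clipping, which this toy omits):

```python
# All magnitudes representable in FP4 E2M1.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * m for s in (-1.0, 1.0) for m in E2M1})

def quantize_e2m1(x, scale=1.0):
    """Round x/scale to the nearest representable E2M1 value, then rescale."""
    return scale * min(GRID, key=lambda g: abs(g - x / scale))

print(quantize_e2m1(2.4))   # 2.0  (nearest of 2.0 and 3.0)
print(quantize_e2m1(-0.3))  # -0.5
print(quantize_e2m1(5.1))   # 6.0
```

The coarse grid is why the per-block scale factors matter: they keep each 16-value block of weights centered on the grid's dense region.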
## Alternative Deployment

### vLLM (Manual)

```bash
vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code
```
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AEON-7/DFlash-Qwen3.5-27B-Uncensored"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Hello, tell me about yourself."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Credits
- Base model by Qwen Team
- DFlash speculative decoding by z-lab (paper)
- Abliteration using llm-abliteration
- Release by AEON-7
## Legal Disclaimer
THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.