lordx64 committed
Commit 8b84e8e · verified · 1 Parent(s): 913a3a4

Upload app.py with huggingface_hub

Files changed (1): app.py (+4 −2)
app.py CHANGED
@@ -27,7 +27,7 @@ from transformers import (
 )
 
 MODEL_ID = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
-MAX_NEW_TOKENS = 8192  # room for thinking + answer; long problems may truncate
+MAX_NEW_TOKENS = 2048  # demo latency cap; raise to 8192 for full reasoning
 # First call includes lazy model load (~2 min to 4-bit on GPU); subsequent
 # calls just generate. ZeroGPU keeps the loaded model resident across calls.
 GEN_DURATION_SECONDS = 300
@@ -37,7 +37,9 @@ DESCRIPTION = """\
 
 **A 35B-parameter MoE (with only ~3B active per token) fine-tuned to imitate the chain-of-thought style of Claude Opus 4.7.** The model thinks in explicit `<think>…</think>` blocks before producing the final answer, same as frontier reasoning systems.
 
-> Running in 4-bit (NF4) on ZeroGPU. First message per session may pause a few seconds while the GPU attaches. Long reasoning can take 30–60s; the model genuinely uses thousands of tokens of thinking on hard problems.
+> Running in 4-bit (NF4) on ZeroGPU. **First message of a session may take 2–3 minutes** (lazy model load of ~70GB on GPU). **Subsequent messages take 30–90 seconds** because the model genuinely emits thousands of thinking tokens; that's the reasoning distillation working as intended, not a bug.
+>
+> Responses capped at 2048 tokens for demo latency. For full-length reasoning, run the model locally with vLLM at 64k context.
 
 Model: [lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled)
 """
 