Upload app.py with huggingface_hub
app.py CHANGED
```diff
@@ -27,7 +27,7 @@ from transformers import (
 )
 
 MODEL_ID = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
-MAX_NEW_TOKENS =
+MAX_NEW_TOKENS = 2048  # demo latency cap; raise to 8192 for full reasoning
 # First call includes lazy model load (~2 min to 4-bit on GPU); subsequent
 # calls just generate. ZeroGPU keeps the loaded model resident across calls.
 GEN_DURATION_SECONDS = 300
@@ -37,7 +37,9 @@ DESCRIPTION = """\
 
 **A 35B-parameter MoE (with only ~3B active per token) fine-tuned to imitate the chain-of-thought style of Claude Opus 4.7.** The model thinks in explicit `<think>…</think>` blocks before producing the final answer, same as frontier reasoning systems.
 
-> Running in 4-bit (NF4) on ZeroGPU. First message
+> Running in 4-bit (NF4) on ZeroGPU. **First message of a session may take 2–3 minutes** (lazy model load of ~70GB on GPU). **Subsequent messages take 30–90 seconds** because the model genuinely emits thousands of thinking tokens — that's the reasoning distillation working as intended, not a bug.
+>
+> Responses capped at 2048 tokens for demo latency. For full-length reasoning, run the model locally with vLLM at 64k context.
 
 Model: [lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled)
 """
```
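The description points users at vLLM for full-length reasoning outside the demo cap. A minimal sketch of that local path using vLLM's offline `LLM.chat` API (available in recent vLLM releases); the prompt and sampling values are illustrative assumptions.

```python
# Sketch: local full-length reasoning with vLLM at 64k context, as the
# Space description suggests. Sampling values are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    max_model_len=65536,  # the "64k context" from the description
)

# No 2048-token demo cap here: leave room for the full reasoning trace.
params = SamplingParams(max_tokens=8192, temperature=0.6)

out = llm.chat(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    sampling_params=params,
)
# The raw text contains the <think>…</think> block followed by the final answer.
print(out[0].outputs[0].text)
```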