lordx64 committed
Commit 8b84e8e · verified · 1 Parent(s): 913a3a4

Upload app.py with huggingface_hub

Files changed (1): app.py (+4 −2)
app.py CHANGED
@@ -27,7 +27,7 @@ from transformers import (
 )
 
 MODEL_ID = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
-MAX_NEW_TOKENS = 8192  # room for thinking + answer; long problems may truncate
+MAX_NEW_TOKENS = 2048  # demo latency cap; raise to 8192 for full reasoning
 # First call includes lazy model load (~2 min to 4-bit on GPU); subsequent
 # calls just generate. ZeroGPU keeps the loaded model resident across calls.
 GEN_DURATION_SECONDS = 300
@@ -37,7 +37,9 @@ DESCRIPTION = """\
 
 **A 35B-parameter MoE (with only ~3B active per token) fine-tuned to imitate the chain-of-thought style of Claude Opus 4.7.** The model thinks in explicit `<think>…</think>` blocks before producing the final answer, same as frontier reasoning systems.
 
-> Running in 4-bit (NF4) on ZeroGPU. First message per session may pause a few seconds while the GPU attaches. Long reasoning can take 30–60s; the model genuinely uses thousands of tokens of thinking on hard problems.
+> Running in 4-bit (NF4) on ZeroGPU. **First message of a session may take 2–3 minutes** (lazy model load of ~70GB on GPU). **Subsequent messages take 30–90 seconds** because the model genuinely emits thousands of thinking tokens; that's the reasoning distillation working as intended, not a bug.
+>
+> Responses capped at 2048 tokens for demo latency. For full-length reasoning, run the model locally with vLLM at 64k context.
 
 Model: [lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled)
 """
 