Explyt committed on
Commit
5d2db2a
·
verified ·
1 Parent(s): 6b4e78d

Add files using upload-large-folder tool

Files changed (50)
  1. .gitattributes +2 -0
  2. README.md +225 -0
  3. __pycache__/glm47_moe_tool_parser_fixed.cpython-312.pyc +0 -0
  4. chat_template.jinja +86 -0
  5. config.json +75 -0
  6. expert_id_remap.json +0 -0
  7. expert_keep_map.json +0 -0
  8. expert_prune_report.json +83 -0
  9. generation_config.json +12 -0
  10. glm47_moe_tool_parser_fixed.py +532 -0
  11. model-00001-of-00141.safetensors +3 -0
  12. model-00002-of-00141.safetensors +3 -0
  13. model-00003-of-00141.safetensors +3 -0
  14. model-00004-of-00141.safetensors +3 -0
  15. model-00005-of-00141.safetensors +3 -0
  16. model-00006-of-00141.safetensors +3 -0
  17. model-00007-of-00141.safetensors +3 -0
  18. model-00008-of-00141.safetensors +3 -0
  19. model-00009-of-00141.safetensors +3 -0
  20. model-00010-of-00141.safetensors +3 -0
  21. model-00011-of-00141.safetensors +3 -0
  22. model-00012-of-00141.safetensors +3 -0
  23. model-00013-of-00141.safetensors +3 -0
  24. model-00014-of-00141.safetensors +3 -0
  25. model-00015-of-00141.safetensors +3 -0
  26. model-00016-of-00141.safetensors +3 -0
  27. model-00017-of-00141.safetensors +3 -0
  28. model-00018-of-00141.safetensors +3 -0
  29. model-00019-of-00141.safetensors +3 -0
  30. model-00020-of-00141.safetensors +3 -0
  31. model-00021-of-00141.safetensors +3 -0
  32. model-00022-of-00141.safetensors +3 -0
  33. model-00023-of-00141.safetensors +3 -0
  34. model-00024-of-00141.safetensors +3 -0
  35. model-00025-of-00141.safetensors +3 -0
  36. model-00026-of-00141.safetensors +3 -0
  37. model-00027-of-00141.safetensors +3 -0
  38. model-00028-of-00141.safetensors +3 -0
  39. model-00029-of-00141.safetensors +3 -0
  40. model-00031-of-00141.safetensors +3 -0
  41. model-00125-of-00141.safetensors +3 -0
  42. model-00127-of-00141.safetensors +3 -0
  43. model-00129-of-00141.safetensors +3 -0
  44. model-00130-of-00141.safetensors +3 -0
  45. model-00133-of-00141.safetensors +3 -0
  46. model-00135-of-00141.safetensors +3 -0
  47. model-00136-of-00141.safetensors +3 -0
  48. model-00141-of-00141.safetensors +3 -0
  49. tokenizer.json +3 -0
  50. tokenizer_config.json +33 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
37
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,225 @@
1
+ ---
2
+ library_name: transformers
3
+ license: mit
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - vLLM
7
+ - AWQ
8
+ base_model:
9
+ - zai-org/GLM-5
10
+ base_model_relation: quantized
11
+
12
+ ---
13
+ # GLM-5-AWQ
14
+ Base model: [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)
15
+
16
+ This repo quantizes the model using data-free quantization (no calibration dataset required).
17
+
18
+ ### 【Dependencies / Installation】
19
+
20
+ ```python
21
+ # NOTE:
22
+ # vllm==0.16.0rc2 will NOT work.
23
+ # Must upgrade to >=0.16.1rc1
24
+ vllm>=0.16.1rc1.dev7
25
+ transformers>=5.3.0.dev0
26
+ ```
27
+
28
+ As of **2026-02-26**, make sure your system has CUDA 12.8 installed.
29
+
30
+ Then, create a fresh Python environment (e.g. python3.12 venv) and run:
31
+ ```bash
32
+ pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
33
+ pip install git+https://github.com/huggingface/transformers.git
34
+ pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation
35
+ ```
36
+ [vLLM Official Guide](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html)
37
+
38
+
39
+ ### 【vLLM Startup Command】
40
+ <i>Note: When launching with TP=8, include `--enable-expert-parallel`;
41
+ otherwise the expert tensors wouldn’t be evenly sharded across GPU devices.</i>
42
+
43
+ ```
44
+ export VLLM_USE_DEEP_GEMM=0
45
+ export VLLM_USE_FLASHINFER_MOE_FP16=1
46
+ export VLLM_USE_FLASHINFER_SAMPLER=0
47
+ export OMP_NUM_THREADS=4
48
+
49
+ vllm serve \
50
+ __YOUR_PATH__/QuantTrio/GLM-5-AWQ \
51
+ --served-model-name MY_MODEL \
52
+ --swap-space 16 \
53
+ --max-num-seqs 32 \
54
+ --max-model-len 32768 \
55
+ --gpu-memory-utilization 0.9 \
56
+ --tensor-parallel-size 8 \
57
+ --enable-expert-parallel \
58
+ --enable-auto-tool-choice \
59
+ --tool-call-parser glm47 \
60
+ --reasoning-parser glm45 \
61
+ --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
62
+ --trust-remote-code \
63
+ --host 0.0.0.0 \
64
+ --port 8000
65
+ ```
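Once the server is running, it can be queried over HTTP. The sketch below is a minimal client using only the Python standard library. It assumes the `--port 8000` and `--served-model-name MY_MODEL` values from the command above, a server reachable on `localhost`, and vLLM's OpenAI-compatible `/v1/chat/completions` route. The helper names `build_chat_request` and `send` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumed values: they mirror --port and --served-model-name in the command above.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "MY_MODEL"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Sampling values copied from generation_config.json in this commit.
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server started as shown above:
#   print(send(build_chat_request("Say hello in one sentence.")))
```

Building the request separately from sending it makes the payload easy to inspect before any network traffic happens.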
66
+
67
+ ### 【Logs】
68
+ ```
69
+ 2026-02-26
70
+ 1. Initial commit
71
+ ```
72
+
73
+ ### 【Model Files】
74
+ | File Size | Last Updated |
75
+ |-----------|--------------|
76
+ | `392 GiB` | `2026-02-26` |
77
+
78
+ ### 【Model Download】
79
+ ```python
80
+ from huggingface_hub import snapshot_download
81
+ snapshot_download('QuantTrio/GLM-5-AWQ', cache_dir="your_local_path")
82
+ ```
83
+
84
+ ### 【Overview】
85
+
86
+ # GLM-5
87
+
88
+ <div align="center">
89
+ <img src=https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg width="15%"/>
90
+ </div>
91
+ <p align="center">
92
+ 👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
93
+ <br>
94
+ 📖 Check out the GLM-5 <a href="https://z.ai/blog/glm-5" target="_blank">technical blog</a>.
95
+ <br>
96
+ 📍 Use GLM-5 API services on <a href="https://docs.z.ai/guides/llm/glm-5">Z.ai API Platform. </a>
97
+ <br>
98
+ 👉 One click to <a href="https://chat.z.ai">GLM-5</a>.
99
+ </p>
100
+
101
+ ## Introduction
102
+
103
+ We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling remains one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), substantially reducing deployment cost while preserving long-context capacity.
104
+
105
+ Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is challenging because RL training is inefficient. To this end, we developed [slime](https://github.com/THUDM/slime), a novel **asynchronous RL infrastructure** that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 improves significantly over GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among open-source models on reasoning, coding, and agentic tasks, closing the gap with frontier models.
106
+
107
+ ## Benchmark
108
+
109
+ | | GLM-5 | GLM-4.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (xhigh) |
110
+ | -------------------------------- | ---------------------- | --------- | ------------- |-----------| --------------- | ------------ | --------------- |
111
+ | HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
112
+ | HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
113
+ | AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
114
+ | HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
115
+ | IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
116
+ | GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
117
+ | SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
118
+ | SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
119
+ | Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
120
+ | Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
121
+ | CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
122
+ | BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
123
+ | BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
124
+ | BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
125
+ | τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
126
+ | MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
127
+ | Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
128
+ | Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |
129
+
130
+ > *: Scores reported on the full set.
131
+ >
132
+ > †: A verified version of Terminal-Bench 2.0 that fixes some ambiguous instructions.
133
+ See footnote for more evaluation details.
134
+
135
+ ### Footnote
136
+
137
+ * **Humanity’s Last Exam (HLE) & other reasoning tasks**: We evaluate with a maximum generation length of 131,072 tokens (`temperature=1.0, top_p=0.95, max_new_tokens=131072`). By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
138
+ * **SWE-bench & SWE-bench Multilingual**: We run the SWE-bench suite with OpenHands using a tailored instruction prompt. Settings: `temperature=0.7, top_p=0.95, max_new_tokens=16384`, with a 200K context window.
139
+ * **BrowseComp**: Without context management, we retain details from the most recent 5 turns. With context management, we use the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5.
140
+ * **Terminal-Bench 2.0 (Terminus 2)**: We evaluate with the Terminus framework using `timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192`, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
141
+ * **Terminal-Bench 2.0 (Claude Code)**: We evaluate in Claude Code 2.1.14 (think mode, default effort) with `temperature=1.0, top_p=0.95, max_new_tokens=65536`. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: [https://huggingface.co/datasets/zai-org/terminal-bench-2-verified](https://huggingface.co/datasets/zai-org/terminal-bench-2-verified)).
142
+ * **CyberGym**: We evaluate in Claude Code 2.1.18 (think mode, no web tools) with (`temperature=1.0, top_p=1.0, max_new_tokens=32000`) and a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
143
+ * **MCP-Atlas**: All models are evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini 3 Pro as the judge model.
144
+ * **τ²-Bench**: We add a small prompt adjustment in Retail and Telecom to avoid failures caused by premature user termination. For Airline, we apply the domain fixes proposed in the Claude Opus 4.5 system card.
145
+ * **Vending Bench 2**: Runs are conducted independently by [Andon Labs](https://andonlabs.com/evals/vending-bench-2).
146
+
147
+
148
+ ## Serve GLM-5 Locally
149
+
150
+ ### Prepare environment
151
+
152
+ vLLM, SGLang, and xLLM all support local deployment of GLM-5. A simple deployment guide is provided here.
153
+
154
+ + vLLM
155
+
156
+ Using Docker:
157
+
158
+ ```shell
159
+ docker pull vllm/vllm-openai:nightly
160
+ ```
161
+
162
+ or using pip:
163
+
164
+ ```shell
165
+ pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
166
+ ```
167
+
168
+ then upgrade transformers:
169
+
170
+ ```
171
+ pip install git+https://github.com/huggingface/transformers.git
172
+ ```
173
+
174
+ + SGLang
175
+
176
+ Using Docker:
177
+ ```bash
178
+ docker pull lmsysorg/sglang:glm5-hopper # For Hopper GPU
179
+ docker pull lmsysorg/sglang:glm5-blackwell # For Blackwell GPU
180
+ ```
181
+
182
+ ### Deploy
183
+
184
+ + vLLM
185
+
186
+ ```shell
187
+ vllm serve zai-org/GLM-5-FP8 \
188
+ --tensor-parallel-size 8 \
189
+ --gpu-memory-utilization 0.85 \
190
+ --speculative-config.method mtp \
191
+ --speculative-config.num_speculative_tokens 1 \
192
+ --tool-call-parser glm47 \
193
+ --reasoning-parser glm45 \
194
+ --enable-auto-tool-choice \
195
+ --served-model-name glm-5-fp8
196
+ ```
197
+
198
+ Check the [recipes](https://github.com/vllm-project/recipes/blob/main/GLM/GLM5.md) for more details.
199
+
200
+ + SGLang
201
+
202
+ ```shell
203
+ python3 -m sglang.launch_server \
204
+ --model-path zai-org/GLM-5-FP8 \
205
+ --tp-size 8 \
206
+ --tool-call-parser glm47 \
207
+ --reasoning-parser glm45 \
208
+ --speculative-algorithm EAGLE \
209
+ --speculative-num-steps 3 \
210
+ --speculative-eagle-topk 1 \
211
+ --speculative-num-draft-tokens 4 \
212
+ --mem-fraction-static 0.85 \
213
+ --served-model-name glm-5-fp8
214
+ ```
215
+
216
+ Check the [sglang cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5) for more details.
217
+
218
+ + xLLM and other Ascend NPU
219
+
220
+ Please check the deployment guide [here](https://github.com/zai-org/GLM-5/blob/main/example/ascend.md).
221
+
222
+
223
+ ## Citation
224
+
225
+ Our technical report is coming soon.
__pycache__/glm47_moe_tool_parser_fixed.cpython-312.pyc ADDED
Binary file (23.5 kB).
 
chat_template.jinja ADDED
@@ -0,0 +1,86 @@
1
+ [gMASK]<sop>
2
+ {%- if tools -%}
3
+ <|system|>
4
+ # Tools
5
+
6
+ You may call one or more functions to assist with the user query.
7
+
8
+ You are provided with function signatures within <tools></tools> XML tags:
9
+ <tools>
10
+ {% for tool in tools %}
11
+ {{ tool | tojson(ensure_ascii=False) }}
12
+ {% endfor %}
13
+ </tools>
14
+
15
+ For each function call, output the function name and arguments within the following XML format:
16
+ <tool_call>{function-name}<arg_key>{arg-key-1}</arg_key><arg_value>{arg-value-1}</arg_value><arg_key>{arg-key-2}</arg_key><arg_value>{arg-value-2}</arg_value>...</tool_call>{%- endif -%}
17
+ {%- macro visible_text(content) -%}
18
+ {%- if content is string -%}
19
+ {{- content }}
20
+ {%- elif content is iterable and content is not mapping -%}
21
+ {%- for item in content -%}
22
+ {%- if item is mapping and item.type == 'text' -%}
23
+ {{- item.text }}
24
+ {%- elif item is string -%}
25
+ {{- item }}
26
+ {%- endif -%}
27
+ {%- endfor -%}
28
+ {%- else -%}
29
+ {{- content }}
30
+ {%- endif -%}
31
+ {%- endmacro -%}
32
+ {%- set ns = namespace(last_user_index=-1) %}
33
+ {%- for m in messages %}
34
+ {%- if m.role == 'user' %}
35
+ {% set ns.last_user_index = loop.index0 -%}
36
+ {%- endif %}
37
+ {%- endfor %}
38
+ {% for m in messages %}
39
+ {%- if m.role == 'user' -%}<|user|>{{ visible_text(m.content) }}
40
+ {%- elif m.role == 'assistant' -%}
41
+ <|assistant|>
42
+ {%- set reasoning_content = '' %}
43
+ {%- set content = visible_text(m.content) %}
44
+ {%- if m.reasoning_content is string %}
45
+ {%- set reasoning_content = m.reasoning_content %}
46
+ {%- else %}
47
+ {%- if '</think>' in content %}
48
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
49
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
50
+ {%- endif %}
51
+ {%- endif %}
52
+ {%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
53
+ {{ '<think>' + reasoning_content.strip() + '</think>'}}
54
+ {%- else -%}
55
+ {{ '</think>' }}
56
+ {%- endif -%}
57
+ {%- if content.strip() -%}
58
+ {{ content.strip() }}
59
+ {%- endif -%}
60
+ {% if m.tool_calls %}
61
+ {% for tc in m.tool_calls %}
62
+ {%- if tc.function %}
63
+ {%- set tc = tc.function %}
64
+ {%- endif %}
65
+ {{- '<tool_call>' + tc.name -}}
66
+ {% set _args = tc.arguments %}{% for k, v in _args.items() %}<arg_key>{{ k }}</arg_key><arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>{% endfor %}</tool_call>{% endfor %}
67
+ {% endif %}
68
+ {%- elif m.role == 'tool' -%}
69
+ {%- if m.content is string -%}
70
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
71
+ {{- '<|observation|>' }}
72
+ {%- endif %}
73
+ {{- '<tool_response>' }}
74
+ {{- m.content }}
75
+ {{- '</tool_response>' }}
76
+ {%- else -%}
77
+ <|observation|>{% for tr in m.content %}
78
+ <tool_response>{{ tr.output if tr.output is defined else tr }}</tool_response>{% endfor -%}
79
+ {% endif -%}
80
+ {%- elif m.role == 'system' -%}
81
+ <|system|>{{ visible_text(m.content) }}
82
+ {%- endif -%}
83
+ {%- endfor -%}
84
+ {%- if add_generation_prompt -%}
85
+ <|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}
86
+ {%- endif -%}
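The tool-call wire format that this template emits can be illustrated outside Jinja. The sketch below renders one call as `<tool_call>name<arg_key>…</arg_key><arg_value>…</arg_value></tool_call>` and parses it back with two regular expressions copied from `glm47_moe_tool_parser_fixed.py` in this commit. It is an approximation under stated assumptions: it ignores the template's whitespace control, uses the standard-library `re` instead of the `regex` package the parser imports, and the names `render_tool_call` and `parse_tool_call` are hypothetical.

```python
import json
import re
from typing import Any


def render_tool_call(name: str, arguments: dict[str, Any]) -> str:
    """Render one call roughly the way the template's tool_calls loop does."""
    parts = [f"<tool_call>{name}"]
    for key, value in arguments.items():
        # Strings are emitted verbatim; every other value is JSON-encoded.
        text = value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)
        parts.append(f"<arg_key>{key}</arg_key><arg_value>{text}</arg_value>")
    parts.append("</tool_call>")
    return "".join(parts)


# Patterns copied from the parser file in this commit.
CALL = re.compile(r"<tool_call>\s*(\S+?)\s*(<arg_key>.*)?</tool_call>", re.DOTALL)
ARG_PAIR = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_tool_call(text: str) -> tuple[str, dict[str, str]]:
    """Return (function name, raw string arguments) for the first call in text."""
    match = CALL.search(text)
    if match is None:
        raise ValueError("no tool call found")
    name = match.group(1)
    arg_section = match.group(2) or ""  # empty for zero-argument calls
    pairs = ARG_PAIR.findall(arg_section)
    return name, {key.strip(): value.strip() for key, value in pairs}


rendered = render_tool_call("get_weather", {"city": "Paris", "days": 3})
name, raw_args = parse_tool_call(rendered)
```

Parsed values come back as raw strings. The parser in this commit then deserializes any argument whose schema type is not `string`, which is why the sketch leaves `days` as text.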
config.json ADDED
@@ -0,0 +1,75 @@
1
+ {
2
+ "architectures": [
3
+ "GlmMoeDsaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "dtype": "bfloat16",
8
+ "eos_token_id": [
9
+ 154820,
10
+ 154827,
11
+ 154829
12
+ ],
13
+ "ep_size": 1,
14
+ "first_k_dense_replace": 3,
15
+ "head_dim": 64,
16
+ "hidden_act": "silu",
17
+ "hidden_size": 6144,
18
+ "index_head_dim": 128,
19
+ "index_n_heads": 32,
20
+ "index_topk": 2048,
21
+ "indexer_rope_interleave": true,
22
+ "initializer_range": 0.02,
23
+ "intermediate_size": 12288,
24
+ "kv_lora_rank": 512,
25
+ "max_position_embeddings": 202752,
26
+ "model_type": "glm_moe_dsa",
27
+ "moe_intermediate_size": 2048,
28
+ "moe_layer_freq": 1,
29
+ "n_group": 1,
30
+ "n_routed_experts": 205,
31
+ "n_shared_experts": 1,
32
+ "name_or_path": "tclf90/GLM-5-AWQ",
33
+ "norm_topk_prob": true,
34
+ "num_attention_heads": 64,
35
+ "num_experts_per_tok": 8,
36
+ "num_hidden_layers": 78,
37
+ "num_key_value_heads": 64,
38
+ "num_nextn_predict_layers": 1,
39
+ "pad_token_id": 154820,
40
+ "pretraining_tp": 1,
41
+ "q_lora_rank": 2048,
42
+ "qk_head_dim": 256,
43
+ "qk_nope_head_dim": 192,
44
+ "qk_rope_head_dim": 64,
45
+ "quantization_config": {
46
+ "bits": 4,
47
+ "group_size": 128,
48
+ "modules_to_not_convert": [
49
+ "self_attn",
50
+ "shared_expert",
51
+ "mlp.gate",
52
+ "model.layers.0.",
53
+ "model.layers.1.",
54
+ "model.layers.2."
55
+ ],
56
+ "quant_method": "awq",
57
+ "version": "gemm",
58
+ "zero_point": true
59
+ },
60
+ "rms_norm_eps": 1e-05,
61
+ "rope_interleave": true,
62
+ "rope_parameters": {
63
+ "rope_theta": 1000000,
64
+ "rope_type": "default"
65
+ },
66
+ "routed_scaling_factor": 2.5,
67
+ "scoring_func": "sigmoid",
68
+ "tie_word_embeddings": false,
69
+ "topk_group": 1,
70
+ "topk_method": "noaux_tc",
71
+ "transformers_version": "5.0.2.dev0",
72
+ "use_cache": true,
73
+ "v_head_dim": 256,
74
+ "vocab_size": 154880
75
+ }
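The `modules_to_not_convert` list in `quantization_config` decides which layers stay in `bfloat16`. The sketch below assumes a pattern applies when it occurs as a substring of a module's dotted name. That matching rule is an assumption about how loaders interpret the list, not something this commit states, and the module names in the usage lines are hypothetical.

```python
# Patterns copied from quantization_config.modules_to_not_convert above.
MODULES_TO_NOT_CONVERT = [
    "self_attn",
    "shared_expert",
    "mlp.gate",
    "model.layers.0.",
    "model.layers.1.",
    "model.layers.2.",
]


def is_kept_unquantized(module_name: str, patterns=MODULES_TO_NOT_CONVERT) -> bool:
    """Assumed rule: skip quantization if any pattern is a substring of the name."""
    return any(pattern in module_name for pattern in patterns)


# Hypothetical module names, for illustration only.
examples = [
    "model.layers.1.mlp.down_proj",             # first dense layers stay unquantized
    "model.layers.10.mlp.experts.5.down_proj",  # routed expert: quantized
    "model.layers.40.self_attn.q_a_proj",       # attention stays unquantized
    "model.layers.7.mlp.gate.weight",           # router gate stays unquantized
]
decisions = {name: is_kept_unquantized(name) for name in examples}
```

Under this rule the trailing dot in `model.layers.1.` matters: without it, the pattern would also match layers 10 through 19.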
expert_id_remap.json ADDED
The diff for this file is too large to render. See raw diff
 
expert_keep_map.json ADDED
The diff for this file is too large to render. See raw diff
 
expert_prune_report.json ADDED
@@ -0,0 +1,83 @@
1
+ {
2
+ "mode": "physical",
3
+ "new_num_experts": 205,
4
+ "old_num_experts": 256,
5
+ "per_layer_keep_counts": {
6
+ "10": 205,
7
+ "11": 205,
8
+ "12": 205,
9
+ "13": 205,
10
+ "14": 205,
11
+ "15": 205,
12
+ "16": 205,
13
+ "17": 205,
14
+ "18": 205,
15
+ "19": 205,
16
+ "20": 205,
17
+ "21": 205,
18
+ "22": 205,
19
+ "23": 205,
20
+ "24": 205,
21
+ "25": 205,
22
+ "26": 205,
23
+ "27": 205,
24
+ "28": 205,
25
+ "29": 205,
26
+ "3": 205,
27
+ "30": 205,
28
+ "31": 205,
29
+ "32": 205,
30
+ "33": 205,
31
+ "34": 205,
32
+ "35": 205,
33
+ "36": 205,
34
+ "37": 205,
35
+ "38": 205,
36
+ "39": 205,
37
+ "4": 205,
38
+ "40": 205,
39
+ "41": 205,
40
+ "42": 205,
41
+ "43": 205,
42
+ "44": 205,
43
+ "45": 205,
44
+ "46": 205,
45
+ "47": 205,
46
+ "48": 205,
47
+ "49": 205,
48
+ "5": 205,
49
+ "50": 205,
50
+ "51": 205,
51
+ "52": 205,
52
+ "53": 205,
53
+ "54": 205,
54
+ "55": 205,
55
+ "56": 205,
56
+ "57": 205,
57
+ "58": 205,
58
+ "59": 205,
59
+ "6": 205,
60
+ "60": 205,
61
+ "61": 205,
62
+ "62": 205,
63
+ "63": 205,
64
+ "64": 205,
65
+ "65": 205,
66
+ "66": 205,
67
+ "67": 205,
68
+ "68": 205,
69
+ "69": 205,
70
+ "7": 205,
71
+ "70": 205,
72
+ "71": 205,
73
+ "72": 205,
74
+ "73": 205,
75
+ "74": 205,
76
+ "75": 205,
77
+ "76": 205,
78
+ "77": 205,
79
+ "8": 205,
80
+ "9": 205
81
+ },
82
+ "warnings": []
83
+ }
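The report above can be summarized in a few lines. The sketch below rebuilds the same structure in memory rather than reading the file, and the function name `summarize_prune_report` is hypothetical. Layer keys are strings that sort lexicographically in the JSON (`"10"` before `"3"`), so they are converted to integers before comparison.

```python
def summarize_prune_report(report: dict) -> dict:
    """Derive headline numbers from an expert_prune_report.json-style dict."""
    old = report["old_num_experts"]
    new = report["new_num_experts"]
    counts = report["per_layer_keep_counts"]
    layers = sorted(int(key) for key in counts)  # keys are strings in the JSON
    return {
        "moe_layers": len(layers),
        "first_moe_layer": layers[0],
        "last_moe_layer": layers[-1],
        "experts_removed_per_layer": old - new,
        "kept_fraction": new / old,
        "uniform": all(count == new for count in counts.values()),
    }


# In-memory copy of the report shown above: layers 3..77 each keep 205 experts.
report = {
    "mode": "physical",
    "new_num_experts": 205,
    "old_num_experts": 256,
    "per_layer_keep_counts": {str(layer): 205 for layer in range(3, 78)},
    "warnings": [],
}
summary = summarize_prune_report(report)
```

For this report the summary shows 75 MoE layers starting at layer 3, which matches `first_k_dense_replace: 3` in `config.json`, with 51 of 256 routed experts removed per layer (about 80% kept).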
generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "eos_token_id": [
4
+ 154820,
5
+ 154827,
6
+ 154829
7
+ ],
8
+ "pad_token_id": 154820,
9
+ "temperature": 1.0,
10
+ "top_p": 0.95,
11
+ "transformers_version": "5.0.2.dev0"
12
+ }
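The `temperature` and `top_p` defaults above control nucleus sampling. The sketch below is a plain-Python illustration of that technique over a small list of logits. It is not the sampler used by vLLM or Transformers, and the function names `top_p_filter` and `sample` are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=0.95):
    """Return {token_index: renormalized probability} for the nucleus."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.0, top_p=0.95, rng=random):
    """Draw one token index from the nucleus distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Four tokens with probabilities 0.6, 0.3, 0.08 and 0.02.
logits = [math.log(p) for p in (0.6, 0.3, 0.08, 0.02)]
nucleus = top_p_filter(logits, temperature=1.0, top_p=0.95)
```

With these inputs the least likely token falls outside the 0.95 nucleus, so it can never be sampled. A lower `top_p` shrinks the candidate set further.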
glm47_moe_tool_parser_fixed.py ADDED
@@ -0,0 +1,532 @@
1
+ # SPDX-License-Identifier: Apache-2.0
2
+ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
3
+ """
4
+ GLM-4 Tool Call Parser with incremental string streaming support.
5
+
6
+ This parser fixes the streaming issue reported in Issue #32829 where long string
7
+ parameters (e.g., file content with 4000+ characters of code) are buffered until
8
+ complete, causing multi-second delays before the user sees any content.
9
+
10
+ The fix streams string values incrementally as they arrive, providing a true
11
+ streaming experience for long content.
12
+ """
13
+
14
+ import ast
15
+ import json
16
+ from collections.abc import Sequence
17
+ from typing import Any
18
+
19
+ import regex as re
20
+
21
+ from vllm.entrypoints.chat_utils import make_tool_call_id
22
+ from vllm.entrypoints.openai.chat_completion.protocol import (
23
+ ChatCompletionRequest,
24
+ ChatCompletionToolsParam,
25
+ )
26
+ from vllm.entrypoints.openai.engine.protocol import (
27
+ DeltaFunctionCall,
28
+ DeltaMessage,
29
+ DeltaToolCall,
30
+ ExtractedToolCallInformation,
31
+ FunctionCall,
32
+ ToolCall,
33
+ )
34
+ from vllm.logger import init_logger
35
+ from vllm.tokenizers import TokenizerLike
36
+ from vllm.tool_parsers.abstract_tool_parser import (
37
+ ToolParser,
38
+ ToolParserManager,
39
+ )
40
+
41
+ logger = init_logger(__name__)
42
+
43
+
44
+ @ToolParserManager.register_module("glm47_fixed")
45
+ class Glm47MoeModelToolParser(ToolParser):
46
+ """Tool parser for GLM-4 models with incremental string streaming.
47
+
48
+ This parser emits tool-call deltas incrementally as arguments arrive.
49
+ For string-type parameters, content is streamed character-by-character
50
+ rather than waiting for the complete </arg_value> tag.
51
+ """
52
+
53
+ def __init__(self, tokenizer: TokenizerLike):
54
+ super().__init__(tokenizer)
55
+ # Stateful streaming fields
56
+ self.current_tool_name_sent: bool = False
57
+ self.prev_tool_call_arr: list[dict[str, Any]] = []
58
+ self.current_tool_id: int = -1
59
+ self.streamed_args_for_tool: list[str] = []
60
+
61
+ self.tool_call_start_token: str = "<tool_call>"
62
+ self.tool_call_end_token: str = "</tool_call>"
63
+ self.arg_key_start: str = "<arg_key>"
64
+ self.arg_key_end: str = "</arg_key>"
65
+ self.arg_val_start: str = "<arg_value>"
66
+ self.arg_val_end: str = "</arg_value>"
67
+
68
+ self.tool_calls_start_token = self.tool_call_start_token
69
+
70
+ self.func_call_regex = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
71
+
72
+ # GLM-4.7 format: <tool_call>func_name[<arg_key>...]*</tool_call>
73
+ # The function name can be followed by a newline, whitespace, or
74
+ # directly by <arg_key> tags (no separator). The arg section is
75
+ # optional so that zero-argument calls are supported.
76
+ self.func_detail_regex = re.compile(
77
+ r"<tool_call>\s*(\S+?)\s*(<arg_key>.*)?</tool_call>", re.DOTALL
78
+ )
79
+ self.func_arg_regex = re.compile(
80
+ r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>",
81
+ re.DOTALL,
82
+ )
83
+
84
+ if not self.model_tokenizer:
85
+ raise ValueError(
86
+ "The model tokenizer must be passed to the ToolParser "
87
+ "constructor during construction."
88
+ )
89
+
90
+ self.tool_call_start_token_id = self.vocab.get(self.tool_call_start_token)
91
+ self.tool_call_end_token_id = self.vocab.get(self.tool_call_end_token)
92
+ self._buffer: str = ""
93
+
94
+ # Streaming state for incremental tool-call streaming
95
+ self._in_tool_call: bool = False
96
+ self._current_tool_name: str | None = None
97
+ self._pending_key: str | None = None
98
+ self._streaming_string_value: bool = False
99
+ self._tool_call_ids: list[str] = []
100
+ self._args_started: list[bool] = []
101
+ self._args_closed: list[bool] = []
102
+ self._seen_keys: list[set[str]] = []
103
+
104
+ @staticmethod
105
+ def _deserialize(value: str) -> Any:
106
+ try:
107
+ return json.loads(value)
108
+ except json.JSONDecodeError:
109
+ pass
110
+
111
+ try:
112
+ return ast.literal_eval(value)
113
+ except (ValueError, SyntaxError):
114
+ pass
115
+
116
+ return value
117
+
118
+ @staticmethod
119
+ def _json_escape_string_content(s: str) -> str:
120
+ """JSON-escape string content for incremental streaming.
121
+
122
+ This escapes the content that goes INSIDE a JSON string (between quotes),
123
+ not including the surrounding quotes themselves.
124
+ """
125
+ if not s:
126
+ return ""
127
+ return json.dumps(s, ensure_ascii=False)[1:-1]
128
+
129
+ @staticmethod
130
+ def _is_string_type(
131
+ tool_name: str,
132
+ arg_name: str,
133
+ tools: list[ChatCompletionToolsParam] | None,
134
+ ) -> bool:
135
+ if tools is None:
136
+ return False
137
+ for tool in tools:
138
+ if tool.function.name != tool_name:
139
+ continue
140
+ if tool.function.parameters is None:
141
+ return False
142
+ arg_type = (
143
+ tool.function.parameters.get("properties", {})
144
+ .get(arg_name, {})
145
+ .get("type", None)
146
+ )
147
+ return arg_type == "string"
148
+ logger.debug("No tool named '%s'.", tool_name)
149
+ return False
150
+
151
+ @staticmethod
152
+ def _tools_enabled(request: ChatCompletionRequest) -> bool:
153
+ """Return whether tool parsing should be applied for this request."""
154
+ try:
155
+ tools = getattr(request, "tools", None)
156
+ tool_choice = getattr(request, "tool_choice", None)
157
+ return bool(tools) and tool_choice != "none"
158
+ except Exception:
159
+ logger.exception("Failed to determine if tools are enabled.")
160
+ return False
161
+
162
+ def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
163
+ """Adjust request parameters for tool call token handling."""
164
+ request = super().adjust_request(request)
165
+ if request.tools and request.tool_choice != "none":
166
+ # Ensure tool call tokens (<tool_call>, </tool_call>) are not skipped
167
+ # during decoding. Even though they are not marked as special tokens,
168
+ # setting skip_special_tokens=False ensures proper handling in
169
+ # transformers 5.x where decoding behavior may have changed.
170
+ request.skip_special_tokens = False
171
+ return request
172
+
173
+ def extract_tool_calls(
174
+ self,
175
+ model_output: str,
176
+ request: ChatCompletionRequest,
177
+ ) -> ExtractedToolCallInformation:
178
+ matched_tool_calls = self.func_call_regex.findall(model_output)
179
+ logger.debug("model_output: %s", model_output)
180
+ try:
181
+ tool_calls: list[ToolCall] = []
182
+ for match in matched_tool_calls:
183
+ tc_detail = self.func_detail_regex.search(match)
184
+ if not tc_detail:
185
+ logger.warning(
186
+ "Failed to parse tool call details from: %s",
187
+ match,
188
+ )
189
+ continue
190
+ tc_name = tc_detail.group(1).strip()
191
+ tc_args = tc_detail.group(2)
192
+ pairs = self.func_arg_regex.findall(tc_args) if tc_args else []
193
+ arg_dct: dict[str, Any] = {}
194
+ for key, value in pairs:
195
+ arg_key = key.strip()
196
+ arg_val = value.strip()
197
+ if not self._is_string_type(tc_name, arg_key, request.tools):
198
+ arg_val = self._deserialize(arg_val)
199
+ logger.debug("arg_key = %s, arg_val = %s", arg_key, arg_val)
200
+ arg_dct[arg_key] = arg_val
201
+ tool_calls.append(
202
+ ToolCall(
203
+ type="function",
204
+ function=FunctionCall(
205
+ name=tc_name,
206
+ arguments=json.dumps(arg_dct, ensure_ascii=False),
207
+ ),
208
+ )
209
+ )
210
+ except Exception:
211
+ logger.exception("Failed to extract tool call spec")
212
+ return ExtractedToolCallInformation(
213
+ tools_called=False, tool_calls=[], content=model_output
214
+ )
215
+ else:
216
+ if len(tool_calls) > 0:
217
+ content: str | None = model_output[
218
+ : model_output.find(self.tool_calls_start_token)
219
+ ]
220
+ # Normalize empty/whitespace-only content to None
221
+ if not content or not content.strip():
222
+ content = None
223
+ return ExtractedToolCallInformation(
224
+ tools_called=True, tool_calls=tool_calls, content=content
225
+ )
226
+ return ExtractedToolCallInformation(
227
+ tools_called=False, tool_calls=[], content=model_output
228
+ )
229
+
230
+ def extract_tool_calls_streaming(
231
+ self,
232
+ previous_text: str,
233
+ current_text: str,
234
+ delta_text: str,
235
+ previous_token_ids: Sequence[int],
236
+ current_token_ids: Sequence[int],
237
+ delta_token_ids: Sequence[int],
238
+ request: ChatCompletionRequest,
239
+ ) -> DeltaMessage | None:
240
+ if not self._tools_enabled(request):
241
+ return DeltaMessage(content=delta_text) if delta_text else None
242
+
+        self._buffer += delta_text
+
+        while True:
+            if not self._in_tool_call:
+                start_idx = self._buffer.find(self.tool_call_start_token)
+                if start_idx == -1:
+                    # Check for partial start token at end of buffer
+                    for i in range(1, len(self.tool_call_start_token)):
+                        if self._buffer.endswith(self.tool_call_start_token[:i]):
+                            out = self._buffer[:-i]
+                            self._buffer = self._buffer[-i:]
+                            return DeltaMessage(content=out) if out else None
+                    out = self._buffer
+                    self._buffer = ""
+                    return DeltaMessage(content=out) if out else None
+
+                if start_idx > 0:
+                    out = self._buffer[:start_idx]
+                    self._buffer = self._buffer[start_idx:]
+                    return DeltaMessage(content=out) if out else None
+
+                self._buffer = self._buffer[len(self.tool_call_start_token) :]
+                self._begin_tool_call()
+                continue
+
+            # Parse tool name first
+            if not self.current_tool_name_sent:
+                nl = self._buffer.find("\n")
+                ak = self._buffer.find(self.arg_key_start)
+                end = self._buffer.find(self.tool_call_end_token)
+                candidates = [i for i in [nl, ak, end] if i != -1]
+                if not candidates:
+                    return None
+                cut = min(candidates)
+                tool_name = self._buffer[:cut].strip()
+                if tool_name == "" and cut == end:
+                    # Handle empty tool call like `<tool_call></tool_call>`.
+                    # Consume the tokens and reset state to avoid infinite loop.
+                    self._buffer = self._buffer[end + len(self.tool_call_end_token) :]
+                    self._finish_tool_call()
+                    self._revert_last_tool_call_state()
+                    continue
+
+                if cut == nl:
+                    self._buffer = self._buffer[nl + 1 :]
+                else:
+                    self._buffer = self._buffer[cut:]
+
+                self._current_tool_name = tool_name
+                self.current_tool_name_sent = True
+                return self._emit_tool_name_delta(tool_name)
+
+            assert self._current_tool_name is not None
+
+            # Handle incremental string value streaming
+            if self._streaming_string_value:
+                val_end = self._buffer.find(self.arg_val_end)
+                if val_end != -1:
+                    raw_content = self._buffer[:val_end]
+                    self._buffer = self._buffer[val_end + len(self.arg_val_end) :]
+                    self._streaming_string_value = False
+                    self._pending_key = None
+
+                    escaped = self._json_escape_string_content(raw_content)
+                    frag = escaped + '"'
+                    self.streamed_args_for_tool[self.current_tool_id] += frag
+                    return self._emit_tool_args_delta(frag)
+                else:
+                    # Check for partial </arg_value> at end
+                    safe_len = len(self._buffer)
+                    for i in range(1, len(self.arg_val_end)):
+                        if self._buffer.endswith(self.arg_val_end[:i]):
+                            safe_len = len(self._buffer) - i
+                            break
+
+                    if safe_len > 0:
+                        to_emit = self._buffer[:safe_len]
+                        self._buffer = self._buffer[safe_len:]
+                        escaped = self._json_escape_string_content(to_emit)
+                        if escaped:
+                            self.streamed_args_for_tool[self.current_tool_id] += escaped
+                            return self._emit_tool_args_delta(escaped)
+                    return None
+
+            # If we have a pending key, parse its value
+            if self._pending_key is not None:
+                val_pos = self._buffer.find(self.arg_val_start)
+                if val_pos == -1:
+                    return None
+                if val_pos > 0:
+                    self._buffer = self._buffer[val_pos:]
+
+                key = (self._pending_key or "").strip()
+
+                is_string = self._is_string_type(
+                    self._current_tool_name, key, request.tools
+                )
+
+                if is_string:
+                    # String type: stream incrementally
+                    self._buffer = self._buffer[len(self.arg_val_start) :]
+
+                    if key in self._seen_keys[self.current_tool_id]:
+                        self._pending_key = None
+                        continue
+
+                    self._seen_keys[self.current_tool_id].add(key)
+                    key_json = json.dumps(key, ensure_ascii=False)
+
+                    if not self._args_started[self.current_tool_id]:
+                        frag = "{" + key_json + ': "'
+                        self._args_started[self.current_tool_id] = True
+                    else:
+                        frag = ", " + key_json + ': "'
+
+                    self.streamed_args_for_tool[self.current_tool_id] += frag
+                    self._streaming_string_value = True
+                    return self._emit_tool_args_delta(frag)
+                else:
+                    # Non-string type: wait for complete value
+                    val_end = self._buffer.find(self.arg_val_end)
+                    if val_end == -1:
+                        return None
+
+                    raw_val = self._buffer[len(self.arg_val_start) : val_end].strip()
+                    self._buffer = self._buffer[val_end + len(self.arg_val_end) :]
+                    self._pending_key = None
+
+                    frag_or_none = self._append_arg_fragment(key=key, raw_val=raw_val)
+                    if frag_or_none:
+                        return self._emit_tool_args_delta(frag_or_none)
+                    continue
+
+            # Parse next arg or close
+            end_pos = self._buffer.find(self.tool_call_end_token)
+            key_pos = self._buffer.find(self.arg_key_start)
+            if end_pos != -1 and (key_pos == -1 or end_pos < key_pos):
+                self._buffer = self._buffer[end_pos + len(self.tool_call_end_token) :]
+                frag_or_none = self._close_args_if_needed()
+                # Finalize prev_tool_call_arr with complete parsed arguments
+                if self._current_tool_name:
+                    try:
+                        full_args_str = self.streamed_args_for_tool[
+                            self.current_tool_id
+                        ]
+                        json.loads(full_args_str)
+                        self.prev_tool_call_arr[self.current_tool_id] = {
+                            "name": self._current_tool_name,
+                            "arguments": full_args_str,
+                        }
+                    except (json.JSONDecodeError, IndexError) as e:
+                        logger.warning(
+                            "Failed to finalize tool call state for tool %d: %s",
+                            self.current_tool_id,
+                            e,
+                        )
+                self._finish_tool_call()
+                return (
+                    self._emit_tool_args_delta(frag_or_none) if frag_or_none else None
+                )
+
+            if key_pos == -1:
+                return None
+            if key_pos > 0:
+                self._buffer = self._buffer[key_pos:]
+            key_end = self._buffer.find(self.arg_key_end)
+            if key_end == -1:
+                return None
+            key = self._buffer[len(self.arg_key_start) : key_end]
+            self._buffer = self._buffer[key_end + len(self.arg_key_end) :]
+            self._pending_key = key
+            continue
+
+    def _ensure_tool_state(self) -> None:
+        while len(self._tool_call_ids) <= self.current_tool_id:
+            self._tool_call_ids.append(
+                make_tool_call_id(id_type="random", func_name=None, idx=None)
+            )
+        while len(self.streamed_args_for_tool) <= self.current_tool_id:
+            self.streamed_args_for_tool.append("")
+        while len(self.prev_tool_call_arr) <= self.current_tool_id:
+            self.prev_tool_call_arr.append({})
+        while len(self._args_started) <= self.current_tool_id:
+            self._args_started.append(False)
+        while len(self._args_closed) <= self.current_tool_id:
+            self._args_closed.append(False)
+        while len(self._seen_keys) <= self.current_tool_id:
+            self._seen_keys.append(set())
+
+    def _begin_tool_call(self) -> None:
+        if self.current_tool_id == -1:
+            self.current_tool_id = 0
+        else:
+            self.current_tool_id += 1
+        self._ensure_tool_state()
+        self.current_tool_name_sent = False
+        self._current_tool_name = None
+        self._pending_key = None
+        self._streaming_string_value = False
+        self._in_tool_call = True
+
+    def _finish_tool_call(self) -> None:
+        self._in_tool_call = False
+        self._current_tool_name = None
+        self._pending_key = None
+        self._streaming_string_value = False
+
+    def _revert_last_tool_call_state(self) -> None:
+        """Revert the state allocation for the last tool call."""
+        if self.current_tool_id < 0:
+            return
+        self._tool_call_ids.pop()
+        self.streamed_args_for_tool.pop()
+        self.prev_tool_call_arr.pop()
+        self._args_started.pop()
+        self._args_closed.pop()
+        self._seen_keys.pop()
+        self.current_tool_id -= 1
+
+    def _emit_tool_name_delta(self, tool_name: str) -> DeltaMessage:
+        self.prev_tool_call_arr[self.current_tool_id] = {
+            "name": self._current_tool_name,
+            "arguments": {},
+        }
+        return DeltaMessage(
+            tool_calls=[
+                DeltaToolCall(
+                    index=self.current_tool_id,
+                    id=self._tool_call_ids[self.current_tool_id],
+                    type="function",
+                    function=DeltaFunctionCall(
+                        name=tool_name,
+                        arguments="",
+                    ).model_dump(exclude_none=True),
+                )
+            ]
+        )
+
+    def _emit_tool_args_delta(self, fragment: str) -> DeltaMessage:
+        return DeltaMessage(
+            tool_calls=[
+                DeltaToolCall(
+                    index=self.current_tool_id,
+                    function=DeltaFunctionCall(arguments=fragment).model_dump(
+                        exclude_none=True
+                    ),
+                )
+            ]
+        )
+
+    def _append_arg_fragment(
+        self,
+        *,
+        key: str,
+        raw_val: str,
+    ) -> str | None:
+        key = key.strip()
+        if not key:
+            return None
+        if key in self._seen_keys[self.current_tool_id]:
+            return None
+
+        # This function is only called for non-string types (already checked
+        # by _is_string_type in the caller), so we always deserialize.
+        val_obj: Any = self._deserialize(raw_val)
+
+        key_json = json.dumps(key, ensure_ascii=False)
+        val_json = json.dumps(val_obj, ensure_ascii=False)
+
+        if not self._args_started[self.current_tool_id]:
+            fragment = "{" + key_json + ": " + val_json
+            self._args_started[self.current_tool_id] = True
+        else:
+            fragment = "," + key_json + ": " + val_json
+
+        self._seen_keys[self.current_tool_id].add(key)
+        self.streamed_args_for_tool[self.current_tool_id] += fragment
+        return fragment
+
+    def _close_args_if_needed(self) -> str | None:
+        if self._args_closed[self.current_tool_id]:
+            return None
+        self._args_closed[self.current_tool_id] = True
+        if not self._args_started[self.current_tool_id]:
+            fragment = "{}"
+            self.streamed_args_for_tool[self.current_tool_id] = fragment
+        else:
+            fragment = "}"
+            self.streamed_args_for_tool[self.current_tool_id] += fragment
+        return fragment
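The streaming parser above never emits a buffer suffix that could be the beginning of a marker such as `<tool_call>` or `</arg_value>`; it keeps that suffix buffered until the next chunk settles the question. Below is a minimal standalone sketch of that hold-back idea. The helper names `split_safe_prefix` and `stream_plain_text` are hypothetical and are not part of the parser or of vLLM.

```python
def split_safe_prefix(buffer: str, marker: str) -> tuple[str, str]:
    """Split buffer into (text safe to emit now, tail to keep buffered).

    Hypothetical helper: the tail is the longest buffer suffix that is a
    proper prefix of marker, so a marker split across chunks is never
    emitted as plain content.
    """
    # Try longer candidates first so the largest possible tail is kept.
    for i in range(min(len(marker) - 1, len(buffer)), 0, -1):
        if buffer.endswith(marker[:i]):
            return buffer[:-i], buffer[-i:]
    return buffer, ""


def stream_plain_text(chunks: list[str], marker: str = "<tool_call>") -> list[str]:
    """Emit the text that precedes marker, chunk by chunk, then stop."""
    emitted: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        idx = buffer.find(marker)
        if idx != -1:
            # Marker is complete: flush what came before it and stop.
            if idx > 0:
                emitted.append(buffer[:idx])
            break
        safe, buffer = split_safe_prefix(buffer, marker)
        if safe:
            emitted.append(safe)
    return emitted
```

For example, `stream_plain_text(["Hello <to", "ol_call>get_weather"])` emits only `"Hello "`: the `<to` tail is held back after the first chunk and turns out to be the start of the marker. This sketch does not flush a leftover tail when the stream ends; a real consumer would need to decide what to do with it.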
model-00001-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8bfa7b1b3d65b6628c54802bf51b4a0e6c4222978746e0cf6e2935d2ba2f36a3
+size 2993793440
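Each `.safetensors` entry in this commit is a Git LFS pointer: a `version` line, an `oid` holding the SHA-256 of the real file, and its `size` in bytes. The sketch below shows one way a downloaded shard could be checked against such a pointer using only the Python standard library. The function names `parse_lfs_pointer` and `matches_pointer` are illustrative assumptions, not part of Git LFS or of any Hugging Face tooling.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields


def matches_pointer(path: Path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and SHA-256 match the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected_hex = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    # Cheap check first: a size mismatch avoids hashing gigabytes.
    if path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in chunks so a multi-gigabyte shard need not fit in memory.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex
```

Comparing the size before hashing is a deliberate ordering: for shards of 2 to 3 GB, as listed here, it rejects truncated downloads without reading the file.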
model-00002-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4188dbdbfec85e375b368d7c1ca6df4eadb9e27373448258b05c2b9840408e8
+size 1993962224
model-00003-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:52fe64ac77031721f717414465cfd15fb04ebcffded0538d5bfceba7b4b9050d
+size 2684816444
model-00004-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0d82772da54764bdc80fc5476e5dda449a59964d1a77d1cfe7c34c6f8fe22b6e
+size 2307766008
model-00005-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0c6ad662939d781548a7e0e7b60e53010a571974577406a665bc63bba2329fe4
+size 2325249572
model-00006-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74f0faa66fa323b0865aeec6bc1cd863bd89fe3c48f18fbfb956d4cdced1f109
+size 2667332872
model-00007-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5e47fa6153c18fbd1d62c0f9ee4228abd5d584de23eb3041c7d50c96d5c8ee4
+size 1998370612
model-00008-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:871ef6b1801b879e14404e1f92e27a6652852ba1a0d67e5b6c58f5f5170f46a6
+size 2940355204
model-00009-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d31c474890ab4091daa28b98785591ab84c0c4f47be6d7a7f119bc9c360bcc0c
+size 1999730088
model-00010-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4ef9952b191c5e3dfc28dbb82424875974134b755539b9fd62c73dfad917405
+size 2998817148
model-00011-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0b591800f6c5c42f401b9c010b10fe09752597f023762a1b5d212ece3d441bb0
+size 1993962288
model-00012-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d4b717ba2a951bf06bfb4c75c18d770b9cdc9c21724a29cad6167b80fd0bb9c3
+size 2887481476
model-00013-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:061bf24bc616de133fef2940598a9c35d46d6f87faad83d52ce76d371aafc291
+size 2105101056
model-00014-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b3aa9027d1d1465fb43b0563c8281e7c0f1a969476a6fbb667cfd80215b14e78
+size 2625978732
model-00015-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a820a4068cdea35ef7364efc01e1dd32a9d32129bf027a11b3ed3a27a5708b19
+size 2366605248
model-00016-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e936b0df2d6cf9cd03265c007aaaa5aebbbdccb1dd71b9ed648465f2dbd01255
+size 2135660532
model-00017-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2862ba86b3f6bd30193f1d73ebc351e73a0ab599c89cc77d2627c9c2d4302e4d
+size 2856924024
model-00018-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b5911ebf7cda3cf88fd7b87d0f530c43893cce08511f6c58cf5b6ac939dd4852
+size 1998371452
model-00019-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d5e108b4c13ae03bd861e42785f81d02f3ea6198b99d693d987f0c42d640e64c
+size 2957650604
model-00020-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9cdc6b1d14645b6b1b0d396910f4f76369a4ad7174554ce5766ce049cce2136e
+size 1995708680
model-00021-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d901f3ae485917cd900bb5b0c3c707eaba94dce96a5dfabf7aaf25ce50e23cc5
+size 2998621612
model-00022-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c7fd915528301d9881dad67edffd93dd9679b3529bf686ebc4d23875bce70df6
+size 1993963184
model-00023-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7d2b08a3c334e90d33b5685cda8b9922018515c194fcdc2a7c6810eb70728577
+size 2684817444
model-00024-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c3a8c91cdb474c7e9654986eedfb18a232281264597c3eac59f1ac8e4b568b73
+size 2307767112
model-00025-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:42b15f859a103ffe7277d14ec4033252aedf26cb30e2c5eacad214649a722df1
+size 2429851804
model-00026-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9b8a5f54e11d9a1cf1424e0c867fb4cc3a13d217961c2e29e014940381589dce
+size 2562732760
model-00027-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8756bfc840b3197a62ab1696000dcd44f0c88a18844f246c5b8fd8f21290bc99
+size 1998371308
model-00028-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:54957a027609361dd18262bc150dee9cc2edef1f6d1406edd9615caba4e51c51
+size 2998278716
model-00029-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b932b1a0902da4143ae1edfaac2c735b49d756cf1ddfd01b4f1710c5b9b547be
+size 1994306080
model-00031-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89c97694b76e73021ffc12926593bb104f93e0f5fb6bca86d7b7f6b0723241af
+size 1993963208
model-00125-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2ebe5e018db4720770256d32c8ddc59c120201bdfb32e1e0c3dd9fad197c689d
+size 2684817444
model-00127-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9ecb23a8bebd5dafb951fcb5733b14749e0140485ee80abaf356d6e6c9781e68
+size 2429851804
model-00129-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8a2a740cbaad266c8e7f5ee80f386384cb5cb15eb44664d3dcb2bb50f9eae183
+size 1998371308
model-00130-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3040d373f6cb968f32193ca7933c1fd89c840ffe72453b140bd80ca0d9aeba1a
+size 2998278716
model-00133-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d993b3f240ae7c37e085779e3f747511b1f87d7bb253984e7bcdc38e9c829533
+size 1993963208
model-00135-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4f2e7fd5e9a2d67b515548375187c985e2e089c524861da6f167d7f7f1adb278
+size 2046263768
model-00136-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3d8fd824978a759fb2d3dd220cc3abb84f035e27a9f3107e02cfed886e1b080c
+size 2625979284
model-00141-of-00141.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fcae2b399560f462a345a927232d4ace1697c1b981accef5cf8504ac66c55baf
+size 2054210160
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:19e773648cb4e65de8660ea6365e10acca112d42a854923df93db4a6f333a82d
+size 20217442
tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": false,
+  "do_lower_case": false,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": [
+    "<|endoftext|>",
+    "[MASK]",
+    "[gMASK]",
+    "[sMASK]",
+    "<sop>",
+    "<eop>",
+    "<|system|>",
+    "<|user|>",
+    "<|assistant|>",
+    "<|observation|>",
+    "<|begin_of_image|>",
+    "<|end_of_image|>",
+    "<|begin_of_video|>",
+    "<|end_of_video|>",
+    "<|begin_of_audio|>",
+    "<|end_of_audio|>",
+    "<|begin_of_transcription|>",
+    "<|end_of_transcription|>"
+  ],
+  "is_local": true,
+  "model_max_length": 202752,
+  "model_specific_special_tokens": {},
+  "pad_token": "<|endoftext|>",
+  "padding_side": "left",
+  "remove_space": false,
+  "tokenizer_class": "TokenizersBackend"
+}
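The tokenizer config above reuses `<|endoftext|>` as both `eos_token` and `pad_token` and pads on the left, a common setup for batched generation with decoder-only models. The sketch below reads those fields with the standard `json` module. `CONFIG_TEXT` is an abbreviated copy of the file, and `summarize_padding` is a hypothetical helper, not a `transformers` API.

```python
import json

# Abbreviated copy of the fields from tokenizer_config.json shown above.
CONFIG_TEXT = """
{
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "padding_side": "left",
  "model_max_length": 202752
}
"""


def summarize_padding(config: dict) -> dict:
    """Collect the padding-related settings a batching layer would care about."""
    return {
        "pad_is_eos": config.get("pad_token") == config.get("eos_token"),
        "padding_side": config.get("padding_side"),
        "model_max_length": config.get("model_max_length"),
    }


summary = summarize_padding(json.loads(CONFIG_TEXT))
```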