Upload Chimera 1.1B at step 1500

Files changed (8)

.gitattributes +1 -0
README.md +45 -0
chat_template.jinja +86 -0
config.json +33 -0
model.safetensors +3 -0
tokenizer.json +3 -0
tokenizer_config.json +29 -0
training.log +92 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,45 @@

+---
+license: apache-2.0
+tags:
+  - zara-ml
+  - chimera
+  - gdn
+  - ouroboros
+  - hybrid-architecture
+language:
+  - en
+---
+# Ouro-1.1B
+**Chimera architecture** — GDN/Attention hybrid with weight sharing.
+## Architecture
+- **Type:** Chimera (ChimeraConfig)
+- **Dim:** 2048
+- **Layers:** 24 virtual
+- **Params:** 1,072,527,280 (1.1B)
+- **Vocab:** 151936 (Qwen 3 tokenizer)
+- **Context:** 2048 tokens
+- **Topology:** 6 unique bottom + 6×3 shared top
+- **GDN:Attn ratio:** 3:1 (every 4th layer is attention)
+## Training
+- **Step:** 1,500
+- **Data:** Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
+- **Framework:** Zara-ML (custom PyTorch)
+## Usage
+```bash
+pip install ouro
+```
+```python
+from ouro import load_model, generate
+model, tokenizer, device = load_model("nyxia/Ouro-1.1B")
+generate(model, tokenizer, device, "The history of")
+```
+Built by Flo ([@nyxia](https://huggingface.co/nyxia)) & Zara. Part of the [Soulkyn](https://soulkyn.com) project.

chat_template.jinja ADDED

	@@ -0,0 +1,86 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {{- messages[0].content + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- if message.content is string %}
+        {%- set content = message.content %}
+    {%- else %}
+        {%- set content = '' %}
+    {%- endif %}
+    {%- if message.role == "user" or message.role == "system" %}
+        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {%- if loop.last or (not loop.last and reasoning_content) %}
+                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+            {%- else %}
+                {{- '<|im_start|>' + message.role + '\n' + content }}
+            {%- endif %}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- endif %}
+{%- endif %}
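
For the simplest conversations the template above reduces to plain ChatML framing. The sketch below mirrors only that path in Python, assuming no tools, no tool calls, string content without `<think>` blocks, and every assistant turn placed before the final user turn. `render_chatml` is a hypothetical helper written for illustration and is not part of this commit.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=None):
    # Hypothetical helper, not part of this commit. Covers only the plain path:
    # no tools, no tool calls, string content without <think> blocks, and
    # every assistant turn placed before the final user turn.
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if enable_thinking is False:
            # The template pre-fills an empty think block when thinking is disabled.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi"},
    ],
    enable_thinking=False,
)
print(prompt)
```

Tool definitions, tool responses, and reasoning content take the other branches of the template and are deliberately left out of this sketch.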

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "dim": 2048,
+  "n_layers": 24,
+  "vocab_size": 151936,
+  "max_seq_len": 2048,
+  "n_heads": 32,
+  "n_kv_heads": 8,
+  "head_dim": 64,
+  "gdn_expand_v": 2,
+  "gdn_head_dim": 64,
+  "gdn_n_heads": 32,
+  "conv_kernel": 4,
+  "gdn_use_gate": true,
+  "gdn_use_short_conv": true,
+  "ffn_mult": 2.67,
+  "attn_interval": 4,
+  "use_x0_inject": true,
+  "use_resid_lambdas": true,
+  "use_skip_connections": true,
+  "use_diff_attn": false,
+  "rope_base": 10000.0,
+  "partial_rotary_factor": 0.25,
+  "n_bottom": 6,
+  "n_physical_top": 6,
+  "n_top_loops": 3,
+  "architecture": "Chimera",
+  "config_class": "ChimeraConfig",
+  "topology": "6 bottom + 6x3 top = 24 virtual",
+  "step": 1500,
+  "total_params": 1072527280,
+  "size_label": "1.1B",
+  "model_type": "zara-ml"
+}
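
Several fields in this config can be cross-checked against the figures printed in training.log. The sketch below reproduces that arithmetic. The relations themselves (for example, that the attention count is `n_layers // attn_interval`, or that `ffn_hidden` is the truncated product of `ffn_mult` and `dim`) are assumptions inferred from the numbers matching, not taken from the Zara-ML source.

```python
# Subset of config.json from this commit.
config = {
    "dim": 2048, "n_layers": 24, "vocab_size": 151936,
    "n_heads": 32, "head_dim": 64, "ffn_mult": 2.67, "attn_interval": 4,
    "n_bottom": 6, "n_physical_top": 6, "n_top_loops": 3,
    "total_params": 1072527280,
}

# Values copied from training.log in this commit.
log = {
    "top_physical": 380_650_444,
    "top_virtual_equiv": 1_141_951_332,
    "embed": 311_164_928,
    "total_unique": 761_362_352,
    "ffn_hidden": 5468,
}

# 6 unique bottom layers plus 6 shared top layers run 3 times.
virtual_layers = config["n_bottom"] + config["n_physical_top"] * config["n_top_loops"]

# Embedding table: one row of size dim per vocabulary entry.
embed_params = config["vocab_size"] * config["dim"]

# The reported total is the unique (non-embedding) weights plus the embedding.
total_params = log["total_unique"] + embed_params

# What the top stack would cost without weight sharing.
top_virtual_equiv = log["top_physical"] * config["n_top_loops"]

# Assumed relations that happen to match the log's "GDN:18 Attn:6" and ffn_hidden.
attn_layers = config["n_layers"] // config["attn_interval"]
gdn_layers = config["n_layers"] - attn_layers
ffn_hidden = int(config["ffn_mult"] * config["dim"])

print(virtual_layers, embed_params, total_params, attn_layers, gdn_layers, ffn_hidden)
```

The log does not say which virtual positions hold the attention layers, so this sketch checks only the counts.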

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c6773ad5fe86fe32777960e52a82949d405bad4b2268313ef49ce36d5eb70a46
+size 2767408848

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+size 11422650

tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

training.log ADDED

	@@ -0,0 +1,92 @@

+Device: cuda
+Fused Linear CE (liger-kernel): ENABLED
+============================================================
+Config: 1b
+  dim=2048, layers=24, heads=32
+  attn_interval=4 (GDN:18 Attn:6)
+  ffn_hidden=5468, vocab=151936
+  seq_len=2048
+  CHIMERA STACK:
+    Bottom: 6 unique layers
+    Top: 6 physical × 3 loops = 18 virtual
+============================================================
+  bottom_unique: 380,650,444
+  top_physical: 380,650,444
+  top_virtual_equiv: 1,141,951,332
+  embed: 311,164,928
+  total_unique: 761,362,352
+  topology: bottom 6 unique + top 6x3 shared
+  BPB correction: 3.11 bytes/token (true_bpb = bpt / 3.11)
+Loading data (dataset=mixed)...
+============================================================
+Loading mixed pretraining data (5,000,000,000 total tokens)
+  fineweb: 75% = 3,750,000,000 tokens
+  code: 18% = 900,000,000 tokens
+  math: 5% = 250,000,000 tokens
+  dialogue: 2% = 100,000,000 tokens
+============================================================
+Loading memory-mapped tokens from /workspace/hf-cache/fineweb_tokenized/fineweb_Qwen_Qwen3-0.6B_3750000000_train.bin...
+MemmapDataset: 3,750,000,000 tokens, 1,830,161 sequences of 2048
+Loading cached StarCoderData from /workspace/hf-cache/starcoderdata_tokenized/starcoder_json_markdown_python_Qwen_Qwen3-0.6B_900000000.bin...
+MemmapDataset: 900,000,000 tokens, 439,238 sequences of 2048
+Loading cached FineMath from /workspace/hf-cache/finemath_tokenized/finemath4plus_Qwen_Qwen3-0.6B_250000000.bin...
+MemmapDataset: 250,000,000 tokens, 122,010 sequences of 2048
+Loading cached dialogue from /workspace/hf-cache/dialogue_tokenized/ultrachat_chatml_Qwen_Qwen3-0.6B_100000000.bin...
+MemmapDataset: 100,000,000 tokens, 48,804 sequences of 2048
+  MixedDataset: fineweb — 1,830,161 seqs available, ~1,830,159 will be sampled (75.0%)
+  MixedDataset: code — 439,238 seqs available, ~439,238 will be sampled (18.0%)
+  MixedDataset: math — 122,010 seqs available, ~122,010 will be sampled (5.0%)
+  MixedDataset: dialogue — 48,804 seqs available, ~48,804 will be sampled (2.0%)
+Loading memory-mapped tokens from /workspace/hf-cache/fineweb_tokenized_val/fineweb_Qwen_Qwen3-0.6B_50000000_train.bin...
+MemmapDataset: 50,000,000 tokens, 24,402 sequences of 2048
+Train: 2,440,213 seqs, Val: 24,402 seqs
+Muon params: 760,971,264 (114 tensors) @ LR 8.0e-04
+Embed params: 311,164,928 (1 tensors) @ LR 4.0e-04
+Scalar params: 391,088 (121 tensors) @ LR 8.0e-05
+Compiling model with torch.compile...
+Training for 250000 steps (warmup=100)
+LR: 0.0008, batch_size: 16
+Tokens/step: 32,768
+Starting step 1 (first step may be slow — Triton kernel compilation)...
+step    200/250000 | loss 9.4149 | lr 8.00e-04 emb 4.00e-04 | 1311ms/step | 25,002 tok/s | epoch 1
+step    400/250000 | loss 5.4851 | lr 8.00e-04 emb 4.00e-04 | 1086ms/step | 30,175 tok/s | epoch 1
+W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] torch._dynamo hit config.recompile_limit (8)
+W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8]    function: 'rearrange' (/workspace/zara_ml/.venv/lib/python3.12/site-packages/einops/einops.py:561)
+W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8]    last reason: 7/7: tensor 'tensor' rank mismatch. expected 4, actual 3
+W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
+W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/compile/programming_model.recompilation.html
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+  >>> val_loss: 5.0847 | bpt: 7.3357 | true_bpb: 2.3598 *BEST*
+  >>> [The] The earliest has taken place over the last century in the early 11st century and was a year in the Middle Ages.
+Although the third stage was found at the upper upper and the third century Morro years, it is still in the form of a single strand of work. Findings from the French. The French had marked a roll of fine marble as yucca, which was published in
+  >>> [Scientists have discovered] Scientists have discovered the world's biggest satellite, which was the largest galaxy. The first event that is being said to be the only spacecraft, so the lunar orbit would begin with a tragic rocket for its first millennium (582 U.S.03).
+The Earth's atmosphere, the largest of the most crude orbit or the planet's atmosphere, is the chance of getting the universe to that idea. The pert
+step    600/250000 | loss 5.0526 | lr 8.00e-04 emb 4.00e-04 | 1171ms/step | 27,995 tok/s | epoch 1
+step    800/250000 | loss 4.8177 | lr 8.00e-04 emb 4.00e-04 | 1093ms/step | 29,976 tok/s | epoch 1
+step   1000/250000 | loss 4.6211 | lr 8.00e-04 emb 4.00e-04 | 1047ms/step | 31,310 tok/s | epoch 1
+  >>> val_loss: 4.6089 | bpt: 6.6493 | true_bpb: 2.1390 *BEST*
+  >>> [The] The treatment of the disease (Tribckos) is more than that of the disease. The patient is most likely unlucky and suddenly sick and can't heal from the disease. If the immune system cannot heal the disease, it can often develop without infection. Finally, the wound can be swollen and irritated. If the infection is to be inhibited in the next few weeks, the infection could be compromised and
+  >>> [Scientists have discovered] Scientists have discovered that the newly created back.net of the existing region of the continent of the Americas in a low C area, a small portion, such as with a large number of bases, could have caused some of the likelihood of known a disease in their home cities.
+The study was published in 2008, which found that the behavior can be a major contributing factor to this mutation, and it is
+step   1200/250000 | loss 4.4074 | lr 8.00e-04 emb 4.00e-04 | 1033ms/step | 31,733 tok/s | epoch 1
+step   1400/250000 | loss 4.1424 | lr 8.00e-04 emb 4.00e-04 | 1008ms/step | 32,502 tok/s | epoch 1
+  >>> val_loss: 4.2451 | bpt: 6.1244 | true_bpb: 1.9701 *BEST*
+  >>> [The] The the most important thing to understand is to understand the difference between maximum and temporary deviation. For this, do not have an inverse. Not without prior experience. Most of the above one is greater than the mean squared number and the mean squared as a condition is greater.
+This was due to the low-shift ratio, and the low-order low-squares ratio. In this case, the mean squared is less
+  >>> [Scientists have discovered] Scientists have discovered the Magdalene supernides of the Magomeric Magellanic Cloud. Sautéed the four Magellanic Cloud Blue Ring, discovered by the Magellanic Cloud Mysterious Magellanic Cloud, revealed that he had discovered a supernovot. The Magellanic Cloud has also revealed that the Magellan family had much higher density than the Magellanic Cloud.
+At the southernmost
+step   1600/250000 | loss 3.9971 | lr 8.00e-04 emb 4.00e-04 | 1001ms/step | 32,746 tok/s | epoch 1
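
The derived figures in the log above can be reproduced from the raw ones. The sketch below recomputes bits per token, the byte-corrected BPB, tokens per step, throughput, and the sequence counts. Two points are assumptions: the loss is taken to be in nats, and each stored sequence is taken to hold `seq_len + 1` tokens (inputs plus shifted targets), which is the only divisor found here that matches all five sequence counts in the log. The 3.11 bytes/token factor is the rounded value the log prints, so `true_bpb` agrees with the logged value only to about two decimal places.

```python
import math

SEQ_LEN = 2048
BATCH_SIZE = 16
BYTES_PER_TOKEN = 3.11  # rounded value printed in the log


def bits_per_token(loss_nats):
    # Convert a cross-entropy loss in nats to bits.
    return loss_nats / math.log(2)


def sequences(n_tokens, seq_len=SEQ_LEN):
    # Assumption: each stored sequence holds seq_len + 1 tokens.
    return n_tokens // (seq_len + 1)


# First validation point: val_loss 5.0847 (log reports bpt 7.3357, true_bpb 2.3598).
bpt = bits_per_token(5.0847)
true_bpb = bpt / BYTES_PER_TOKEN

# Throughput at step 200: 1311 ms/step (log reports 25,002 tok/s).
tokens_per_step = BATCH_SIZE * SEQ_LEN
tok_per_s = tokens_per_step / 1.311

# Training mixture: fineweb, code, math, dialogue token budgets.
train_seqs = sum(
    sequences(n) for n in (3_750_000_000, 900_000_000, 250_000_000, 100_000_000)
)
val_seqs = sequences(50_000_000)

print(f"bpt={bpt:.4f} true_bpb={true_bpb:.4f}")
print(f"tokens/step={tokens_per_step} tok/s={tok_per_s:.0f}")
print(f"train_seqs={train_seqs} val_seqs={val_seqs}")
```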