nyxia committed
Commit 0cf4162 · verified · 1 Parent(s): 435f49c

Upload Chimera 1.1B at step 1500

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,45 @@
+ ---
+ license: apache-2.0
+ tags:
+ - zara-ml
+ - chimera
+ - gdn
+ - ouroboros
+ - hybrid-architecture
+ language:
+ - en
+ ---
+
+ # Ouro-1.1B
+
+ **Chimera architecture** — GDN/Attention hybrid with weight sharing.
+
+ ## Architecture
+ - **Type:** Chimera (ChimeraConfig)
+ - **Dim:** 2048
+ - **Layers:** 24 virtual
+ - **Params:** 1,072,527,280 (1.1B)
+ - **Vocab:** 151936 (Qwen 3 tokenizer)
+ - **Context:** 2048 tokens
+ - **Topology:** 6 unique bottom + 6×3 shared top
+ - **GDN:Attn ratio:** 3:1 (every 4th layer is attention)
+
+ ## Training
+ - **Step:** 1,500
+ - **Data:** Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
+ - **Framework:** Zara-ML (custom PyTorch)
+
+ ## Usage
+
+ ```bash
+ pip install ouro
+ ```
+
+ ```python
+ from ouro import load_model, generate
+
+ model, tokenizer, device = load_model("nyxia/Ouro-1.1B")
+ generate(model, tokenizer, device, "The history of")
+ ```
+
+ Built by Flo ([@nyxia](https://huggingface.co/nyxia)) & Zara. Part of the [Soulkyn](https://soulkyn.com) project.
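The topology stated in the README (6 unique bottom layers plus 6 shared top layers run 3 times) can be unrolled as a quick sanity check. This is a minimal sketch, assuming the shared top block is replayed in the same order on each loop; the function name `virtual_schedule` and that replay order are illustrative assumptions, not taken from the Zara-ML code.

```python
def virtual_schedule(n_bottom=6, n_physical_top=6, n_top_loops=3):
    """Map each virtual layer to the physical block and index that backs it.

    Assumption: the top block is replayed in the same order on every loop.
    """
    schedule = [("bottom", i) for i in range(n_bottom)]
    for _ in range(n_top_loops):
        schedule.extend(("top", i) for i in range(n_physical_top))
    return schedule


schedule = virtual_schedule()
n_virtual = len(schedule)         # 6 + 6 * 3 virtual layers
n_physical = len(set(schedule))   # 6 bottom + 6 top weight sets

# With attn_interval = 4, one virtual layer in four is attention.
attn_interval = 4
n_attn = n_virtual // attn_interval
n_gdn = n_virtual - n_attn

print(n_virtual, n_physical, n_gdn, n_attn)  # 24 12 18 6
```

The counts agree with the README's 24 virtual layers and 3:1 GDN:Attn ratio. Which individual layers are attention is not stated in this commit, so the sketch only counts them.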
chat_template.jinja ADDED
@@ -0,0 +1,86 @@
+ {%- if tools %}
+ {{- '<|im_start|>system\n' }}
+ {%- if messages[0].role == 'system' %}
+ {{- messages[0].content + '\n\n' }}
+ {%- endif %}
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+ {%- for tool in tools %}
+ {{- "\n" }}
+ {{- tool | tojson }}
+ {%- endfor %}
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+ {%- endif %}
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+ {%- for message in messages[::-1] %}
+ {%- set index = (messages|length - 1) - loop.index0 %}
+ {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+ {%- set ns.multi_step_tool = false %}
+ {%- set ns.last_query_index = index %}
+ {%- endif %}
+ {%- endfor %}
+ {%- for message in messages %}
+ {%- if message.content is string %}
+ {%- set content = message.content %}
+ {%- else %}
+ {%- set content = '' %}
+ {%- endif %}
+ {%- if message.role == "user" or message.role == "system" %}
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+ {%- elif message.role == "assistant" %}
+ {%- set reasoning_content = '' %}
+ {%- if message.reasoning_content is string %}
+ {%- set reasoning_content = message.reasoning_content %}
+ {%- else %}
+ {%- if '</think>' in content %}
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+ {%- endif %}
+ {%- endif %}
+ {%- if loop.index0 > ns.last_query_index %}
+ {%- if loop.last or (not loop.last and reasoning_content) %}
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- if message.tool_calls %}
+ {%- for tool_call in message.tool_calls %}
+ {%- if (loop.first and content) or (not loop.first) %}
+ {{- '\n' }}
+ {%- endif %}
+ {%- if tool_call.function %}
+ {%- set tool_call = tool_call.function %}
+ {%- endif %}
+ {{- '<tool_call>\n{"name": "' }}
+ {{- tool_call.name }}
+ {{- '", "arguments": ' }}
+ {%- if tool_call.arguments is string %}
+ {{- tool_call.arguments }}
+ {%- else %}
+ {{- tool_call.arguments | tojson }}
+ {%- endif %}
+ {{- '}\n</tool_call>' }}
+ {%- endfor %}
+ {%- endif %}
+ {{- '<|im_end|>\n' }}
+ {%- elif message.role == "tool" %}
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+ {{- '<|im_start|>user' }}
+ {%- endif %}
+ {{- '\n<tool_response>\n' }}
+ {{- content }}
+ {{- '\n</tool_response>' }}
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {{- '<|im_start|>assistant\n' }}
+ {%- if enable_thinking is defined and enable_thinking is false %}
+ {{- '<think>\n\n</think>\n\n' }}
+ {%- endif %}
+ {%- endif %}
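For the simplest case (no tools, no tool responses, no reasoning content, and a conversation that ends with a user turn), the template above reduces to plain ChatML framing. The following is a hedged pure-Python sketch of that reduced path only; `render_chatml` is a hypothetical helper written for illustration and is not a substitute for rendering the Jinja template itself.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=None):
    """Render plain system/user/assistant turns in the ChatML layout.

    Simplified sketch: rejects inputs that would take the template's
    tool-call or reasoning branches instead of guessing at them.
    """
    user_positions = [i for i, m in enumerate(messages) if m["role"] == "user"]
    last_user = user_positions[-1] if user_positions else -1
    parts = []
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unsupported role in this sketch: {role}")
        if role == "assistant" and (i > last_user or "</think>" in content):
            raise ValueError("assistant turn needs the full template")
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        # An explicit enable_thinking=False pre-fills an empty think block.
        if enable_thinking is False:
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "system", "content": "Be brief."},
     {"role": "user", "content": "Hi"}],
    enable_thinking=False,
)
print(prompt)
```

Raising on unsupported input keeps the sketch honest: anything involving `tools`, `tool_calls`, or `reasoning_content` follows branches of the template that are not reproduced here.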
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+ "dim": 2048,
+ "n_layers": 24,
+ "vocab_size": 151936,
+ "max_seq_len": 2048,
+ "n_heads": 32,
+ "n_kv_heads": 8,
+ "head_dim": 64,
+ "gdn_expand_v": 2,
+ "gdn_head_dim": 64,
+ "gdn_n_heads": 32,
+ "conv_kernel": 4,
+ "gdn_use_gate": true,
+ "gdn_use_short_conv": true,
+ "ffn_mult": 2.67,
+ "attn_interval": 4,
+ "use_x0_inject": true,
+ "use_resid_lambdas": true,
+ "use_skip_connections": true,
+ "use_diff_attn": false,
+ "rope_base": 10000.0,
+ "partial_rotary_factor": 0.25,
+ "n_bottom": 6,
+ "n_physical_top": 6,
+ "n_top_loops": 3,
+ "architecture": "Chimera",
+ "config_class": "ChimeraConfig",
+ "topology": "6 bottom + 6x3 top = 24 virtual",
+ "step": 1500,
+ "total_params": 1072527280,
+ "size_label": "1.1B",
+ "model_type": "zara-ml"
+ }
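Several values in this config can be cross-checked against each other with plain arithmetic. The sketch below is illustrative only: it assumes `n_kv_heads` is used as in grouped-query attention, that `partial_rotary_factor` scales `head_dim`, and that the FFN width is `dim * ffn_mult` truncated to an integer. None of these conventions is confirmed by the commit, although the resulting numbers match `ffn_hidden=5468`, `embed: 311,164,928` and `total_unique: 761,362,352` in training.log.

```python
config = {
    "dim": 2048, "vocab_size": 151936, "n_heads": 32, "n_kv_heads": 8,
    "head_dim": 64, "ffn_mult": 2.67, "partial_rotary_factor": 0.25,
    "total_params": 1072527280,
}

# Total query width across heads; equals dim for this config.
attn_width = config["n_heads"] * config["head_dim"]
# Query heads per key/value head (assumed grouped-query attention).
q_heads_per_kv = config["n_heads"] // config["n_kv_heads"]
# Head dimensions that receive rotary embeddings (assumed convention).
rotary_dims = int(config["head_dim"] * config["partial_rotary_factor"])
# FFN hidden width (assumed: truncation of dim * ffn_mult).
ffn_hidden = int(config["dim"] * config["ffn_mult"])
# Embedding table size and the remainder of the parameter budget.
embed_params = config["vocab_size"] * config["dim"]
non_embed_params = config["total_params"] - embed_params

print(attn_width, q_heads_per_kv, rotary_dims, ffn_hidden)
print(f"{embed_params:,} {non_embed_params:,}")
```

Roughly 29% of the 1.07B parameters sit in the embedding table, which follows directly from the 151,936-entry vocabulary at width 2048.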
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6773ad5fe86fe32777960e52a82949d405bad4b2268313ef49ce36d5eb70a46
+ size 2767408848
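The three added lines are a Git LFS pointer rather than the weights themselves: the repository stores only the spec version, the SHA-256 object ID and the byte size. A minimal sketch of reading such a pointer, where `parse_lfs_pointer` is a hypothetical helper and not part of any Git LFS tooling:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    # Each pointer line is "<key> <value>", split on the first space.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c6773ad5fe86fe32777960e52a82949d405bad4b2268313ef49ce36d5eb70a46
size 2767408848
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], len(info["digest"]), f"{info['size'] / 2**30:.2f} GiB")
```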
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+ size 11422650
tokenizer_config.json ADDED
@@ -0,0 +1,29 @@
+ {
+ "add_prefix_space": false,
+ "backend": "tokenizers",
+ "bos_token": null,
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|im_end|>",
+ "errors": "replace",
+ "extra_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "is_local": false,
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "split_special_tokens": false,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "unk_token": null
+ }
training.log ADDED
@@ -0,0 +1,92 @@
+ Device: cuda
+ Fused Linear CE (liger-kernel): ENABLED
+
+ ============================================================
+ Config: 1b
+ dim=2048, layers=24, heads=32
+ attn_interval=4 (GDN:18 Attn:6)
+ ffn_hidden=5468, vocab=151936
+ seq_len=2048
+ CHIMERA STACK:
+ Bottom: 6 unique layers
+ Top: 6 physical × 3 loops = 18 virtual
+ ============================================================
+
+ bottom_unique: 380,650,444
+ top_physical: 380,650,444
+ top_virtual_equiv: 1,141,951,332
+ embed: 311,164,928
+ total_unique: 761,362,352
+ topology: bottom 6 unique + top 6x3 shared
+ BPB correction: 3.11 bytes/token (true_bpb = bpt / 3.11)
+
+ Loading data (dataset=mixed)...
+
+ ============================================================
+ Loading mixed pretraining data (5,000,000,000 total tokens)
+ fineweb: 75% = 3,750,000,000 tokens
+ code: 18% = 900,000,000 tokens
+ math: 5% = 250,000,000 tokens
+ dialogue: 2% = 100,000,000 tokens
+ ============================================================
+
+ Loading memory-mapped tokens from /workspace/hf-cache/fineweb_tokenized/fineweb_Qwen_Qwen3-0.6B_3750000000_train.bin...
+ MemmapDataset: 3,750,000,000 tokens, 1,830,161 sequences of 2048
+ Loading cached StarCoderData from /workspace/hf-cache/starcoderdata_tokenized/starcoder_json_markdown_python_Qwen_Qwen3-0.6B_900000000.bin...
+ MemmapDataset: 900,000,000 tokens, 439,238 sequences of 2048
+ Loading cached FineMath from /workspace/hf-cache/finemath_tokenized/finemath4plus_Qwen_Qwen3-0.6B_250000000.bin...
+ MemmapDataset: 250,000,000 tokens, 122,010 sequences of 2048
+ Loading cached dialogue from /workspace/hf-cache/dialogue_tokenized/ultrachat_chatml_Qwen_Qwen3-0.6B_100000000.bin...
+ MemmapDataset: 100,000,000 tokens, 48,804 sequences of 2048
+ MixedDataset: fineweb — 1,830,161 seqs available, ~1,830,159 will be sampled (75.0%)
+ MixedDataset: code — 439,238 seqs available, ~439,238 will be sampled (18.0%)
+ MixedDataset: math — 122,010 seqs available, ~122,010 will be sampled (5.0%)
+ MixedDataset: dialogue — 48,804 seqs available, ~48,804 will be sampled (2.0%)
+ Loading memory-mapped tokens from /workspace/hf-cache/fineweb_tokenized_val/fineweb_Qwen_Qwen3-0.6B_50000000_train.bin...
+ MemmapDataset: 50,000,000 tokens, 24,402 sequences of 2048
+ Train: 2,440,213 seqs, Val: 24,402 seqs
+ Muon params: 760,971,264 (114 tensors) @ LR 8.0e-04
+ Embed params: 311,164,928 (1 tensors) @ LR 4.0e-04
+ Scalar params: 391,088 (121 tensors) @ LR 8.0e-05
+ Compiling model with torch.compile...
+
+ Training for 250000 steps (warmup=100)
+ LR: 0.0008, batch_size: 16
+ Tokens/step: 32,768
+
+ Starting step 1 (first step may be slow — Triton kernel compilation)...
+ step 200/250000 | loss 9.4149 | lr 8.00e-04 emb 4.00e-04 | 1311ms/step | 25,002 tok/s | epoch 1
+ step 400/250000 | loss 5.4851 | lr 8.00e-04 emb 4.00e-04 | 1086ms/step | 30,175 tok/s | epoch 1
+ W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] torch._dynamo hit config.recompile_limit (8)
+ W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] function: 'rearrange' (/workspace/zara_ml/.venv/lib/python3.12/site-packages/einops/einops.py:561)
+ W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] last reason: 7/7: tensor 'tensor' rank mismatch. expected 4, actual 3
+ W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
+ W0322 09:24:33.653000 10056 .venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1676] [7/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/compile/programming_model.recompilation.html
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ CUDAGraph supports dynamic shapes by recording a new graph for each distinct input size. Recording too many CUDAGraphs may lead to extra overhead. We have observed 9 distinct sizes. Please consider the following options for better performance: a) padding inputs to a few fixed number of shapes; or b) set torch._inductor.config.triton.cudagraph_skip_dynamic_graphs=True. Set torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit=None to silence this warning.
+ >>> val_loss: 5.0847 | bpt: 7.3357 | true_bpb: 2.3598 *BEST*
+ >>> [The] The earliest has taken place over the last century in the early 11st century and was a year in the Middle Ages.
+ Although the third stage was found at the upper upper and the third century Morro years, it is still in the form of a single strand of work. Findings from the French. The French had marked a roll of fine marble as yucca, which was published in
+ >>> [Scientists have discovered] Scientists have discovered the world's biggest satellite, which was the largest galaxy. The first event that is being said to be the only spacecraft, so the lunar orbit would begin with a tragic rocket for its first millennium (582 U.S.03).
+ The Earth's atmosphere, the largest of the most crude orbit or the planet's atmosphere, is the chance of getting the universe to that idea. The pert
+ step 600/250000 | loss 5.0526 | lr 8.00e-04 emb 4.00e-04 | 1171ms/step | 27,995 tok/s | epoch 1
+ step 800/250000 | loss 4.8177 | lr 8.00e-04 emb 4.00e-04 | 1093ms/step | 29,976 tok/s | epoch 1
+ step 1000/250000 | loss 4.6211 | lr 8.00e-04 emb 4.00e-04 | 1047ms/step | 31,310 tok/s | epoch 1
+ >>> val_loss: 4.6089 | bpt: 6.6493 | true_bpb: 2.1390 *BEST*
+ >>> [The] The treatment of the disease (Tribckos) is more than that of the disease. The patient is most likely unlucky and suddenly sick and can't heal from the disease. If the immune system cannot heal the disease, it can often develop without infection. Finally, the wound can be swollen and irritated. If the infection is to be inhibited in the next few weeks, the infection could be compromised and
+ >>> [Scientists have discovered] Scientists have discovered that the newly created back.net of the existing region of the continent of the Americas in a low C area, a small portion, such as with a large number of bases, could have caused some of the likelihood of known a disease in their home cities.
+ The study was published in 2008, which found that the behavior can be a major contributing factor to this mutation, and it is
+ step 1200/250000 | loss 4.4074 | lr 8.00e-04 emb 4.00e-04 | 1033ms/step | 31,733 tok/s | epoch 1
+ step 1400/250000 | loss 4.1424 | lr 8.00e-04 emb 4.00e-04 | 1008ms/step | 32,502 tok/s | epoch 1
+ >>> val_loss: 4.2451 | bpt: 6.1244 | true_bpb: 1.9701 *BEST*
+ >>> [The] The the most important thing to understand is to understand the difference between maximum and temporary deviation. For this, do not have an inverse. Not without prior experience. Most of the above one is greater than the mean squared number and the mean squared as a condition is greater.
+ This was due to the low-shift ratio, and the low-order low-squares ratio. In this case, the mean squared is less
+ >>> [Scientists have discovered] Scientists have discovered the Magdalene supernides of the Magomeric Magellanic Cloud. Sautéed the four Magellanic Cloud Blue Ring, discovered by the Magellanic Cloud Mysterious Magellanic Cloud, revealed that he had discovered a supernovot. The Magellanic Cloud has also revealed that the Magellan family had much higher density than the Magellanic Cloud.
+ At the southernmost
+ step 1600/250000 | loss 3.9971 | lr 8.00e-04 emb 4.00e-04 | 1001ms/step | 32,746 tok/s | epoch 1
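The validation lines in this log report three related numbers: `val_loss`, `bpt` and `true_bpb`. The logged values are consistent with `val_loss` being a cross-entropy in nats per token, with `bpt = val_loss / ln 2` and `true_bpb = bpt / 3.11` as the log's own "BPB correction" line states. The sketch below reproduces that conversion; the function names are illustrative, and because 3.11 is a rounded constant the last digits of `true_bpb` differ slightly from the log.

```python
import math

BYTES_PER_TOKEN = 3.11  # rounded value printed in the log


def bits_per_token(val_loss_nats):
    """Convert a cross-entropy loss in nats/token to bits/token."""
    return val_loss_nats / math.log(2)


def bits_per_byte(val_loss_nats, bytes_per_token=BYTES_PER_TOKEN):
    """Normalise bits/token by the average number of bytes per token."""
    return bits_per_token(val_loss_nats) / bytes_per_token


for val_loss in (5.0847, 4.6089, 4.2451):
    print(f"{val_loss:.4f} -> bpt {bits_per_token(val_loss):.4f}, "
          f"bpb {bits_per_byte(val_loss):.4f}")

# Throughput bookkeeping from the logged batch settings.
batch_size, seq_len = 16, 2048
tokens_per_step = batch_size * seq_len   # matches "Tokens/step: 32,768"
tokens_seen = 1500 * tokens_per_step     # tokens consumed by step 1500
tok_per_s = tokens_per_step / 1.001      # from the rounded 1001 ms/step
print(f"{tokens_per_step:,} {tokens_seen:,} {tok_per_s:,.0f}")
```

At step 1500 the run has consumed about 49 million of the 5 billion prepared tokens, so this checkpoint reflects roughly 1% of one pass over the data.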