---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---

# GMT v7 Base

**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.

**Relevance & Potential.** Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)
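
A minimal quick-start sketch for loading the checkpoint through Hugging Face Transformers: the repository id below is a placeholder for this model's Hub path, `trust_remote_code=True` assumes the repository ships the custom GMT modeling code, and the generation settings are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "NicolaZanarini/gmt-v7-base"  # placeholder; substitute the actual Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)  # GPT-2 tokenizer, 50,257 tokens
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,   # the model was trained in bfloat16 mixed precision
    trust_remote_code=True,       # custom architecture: memory cells instead of FFNs
)

prompt = "Graph-structured memory lets a transformer"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```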

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
| **Parameters** | 82.2M |
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
| **Memory slots per layer** | 128 (2,048 total across network) |
| **Navigation dimension** | 128 |
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
| **Context length** | 1,024 |
| **Tied embeddings** | Yes |

**How it works.** Each block's memory cell performs three operations on a normalized token state:

1. **Source routing** — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
2. **Graph traversal + target selection** — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
3. **Gated displacement readout** — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value

The model has **zero dense FFN sublayers**. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
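
A minimal PyTorch sketch of such a cell, written from the three steps above, is given below. All names (`GraphMemoryCell`, `nav_proj`, `edge_logits`, the gate parameterization) are assumptions for exposition rather than the repository's actual implementation, and details such as the exact distance kernel, the projection back to model width, and the EMA maintenance are simplified or omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Illustrative stand-in for the FFN sublayer: route -> traverse -> displace."""

    def __init__(self, d_model=768, d_nav=128, n_slots=128, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.nav_proj = nn.Linear(d_model, d_nav)                          # token state -> navigation space
        self.centroids = nn.Parameter(torch.randn(n_slots, d_nav) * 0.02)  # layer-local memory bank
        self.edge_logits = nn.Parameter(torch.zeros(n_slots, n_slots))     # directed transition matrix
        self.query = nn.Linear(d_model, d_nav)                             # token-conditioned target scoring
        self.gate = nn.Linear(d_model, 1)                                  # scalar gate g
        self.out_norm = nn.LayerNorm(d_nav)
        self.out_proj = nn.Linear(d_nav, d_model)                          # displacement back to model width

    def forward(self, x):  # x: (batch, seq, d_model), already normalized by the block
        z = self.nav_proj(x)

        # 1) Source routing: gravitational soft assignment via inverse squared distance.
        cent = self.centroids.unsqueeze(0).expand(z.size(0), -1, -1)
        dist2 = torch.cdist(z, cent) ** 2
        src_w = F.normalize(1.0 / (dist2 + self.eps), p=1, dim=-1)         # (B, T, n_slots)
        source = src_w @ self.centroids                                    # expected source memory state

        # 2) Graph traversal + target selection: one-hop diffusion, refined by query-key scores.
        edges = F.softmax(self.edge_logits, dim=-1)                        # row-stochastic edge matrix
        hop_w = src_w @ edges                                              # where one hop from the source leads
        qk = self.query(x) @ self.centroids.t() / self.centroids.size(-1) ** 0.5
        tgt_w = F.softmax(qk + torch.log(hop_w + self.eps), dim=-1)
        target = tgt_w @ self.centroids                                    # expected target memory state

        # 3) Gated displacement readout: movement from source toward target, not a lookup.
        disp = self.out_norm(target - source)
        return torch.sigmoid(self.gate(x)) * self.out_proj(disp)

# Shape check with the card's dimensions (hidden 768, navigation dim 128, 128 slots per layer).
cell = GraphMemoryCell()
print(cell(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```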

## Training

| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
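
The recipe in the table can be approximated with the sketch below: AdamW with the listed hyperparameters and a linear-warmup-then-cosine schedule. The helper name, `total_steps`, and the `LambdaLR` wrapper are assumptions; the actual loop (gradient accumulation of 33, bfloat16 autocast, auxiliary losses) is only outlined in the trailing comment.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=2_000):
    # AdamW as listed on the card: lr 3e-4, weight decay 0.1, betas (0.9, 0.95).
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
    )

    # Linear warmup for 2,000 steps, then cosine decay over the remaining steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Effective batch: 8 sequences x 33 accumulation steps x 1,024 tokens = 270,336 tokens per update.
# One update: bfloat16 autocast forward, add auxiliary losses, backward over 33 micro-batches,
# clip_grad_norm_(model.parameters(), 1.0), optimizer.step(), scheduler.step(), zero_grad().
```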

## Results

Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |

GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). The gaps are consistent with this disadvantage, while the gain on WinoGrande, a pronoun-resolution task requiring commonsense referent disambiguation, suggests the memory graph may be particularly suited to structured association retrieval.
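
For context on how such numbers are commonly computed (a generic sketch, not the exact harness behind the table, assuming a Hugging Face-style causal LM whose output exposes `.logits`): perplexity is the exponential of the mean next-token cross-entropy, and "raw" zero-shot multiple-choice accuracy ranks answer options by their summed log-likelihood under the model.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids):
    """exp(mean next-token cross-entropy); token_ids: (1, seq_len) of held-out text."""
    logits = model(token_ids).logits[:, :-1]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), token_ids[:, 1:].reshape(-1))
    return math.exp(loss.item())

@torch.no_grad()
def option_loglik(model, tokenizer, context, option):
    """Raw zero-shot score: summed log-likelihood of the option tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    logprobs = model(full_ids).logits[:, :-1].log_softmax(dim=-1)
    targets = full_ids[:, ctx_len:]        # the option tokens
    preds = logprobs[:, ctx_len - 1:]      # predictions aligned to those positions
    return preds.gather(-1, targets.unsqueeze(-1)).sum().item()

# A question counts as correct when the gold option receives the highest summed log-likelihood.
```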

## Intended Use & Limitations

- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.

## Citation

```bibtex
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
```

## License

[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

## Authors

Nicola Zanarini & Niccolò Ferrari