---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---

# GMT v7 Base

**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.
+
**Relevance & Potential.**
|
| 58 |
+
|
| 59 |
+
Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.
|
| 60 |
+
|
| 61 |
+
[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)
|
| 62 |
+
|
| 63 |
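
Since the card lists `library_name: transformers`, loading should follow the standard Hugging Face pattern; the sketch below is only a guess at that flow. The repository id is a placeholder and `trust_remote_code=True` assumes the custom GMT modeling code ships with the checkpoint, neither of which is confirmed by this card; see the linked GitHub repository for the authoritative instructions.

```python
# Hedged loading sketch. "your-namespace/gmt-v7-base" is a placeholder id and
# trust_remote_code=True assumes custom modeling code is published alongside
# the weights; neither is confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the card states the GPT-2 tokenizer is used
model = AutoModelForCausalLM.from_pretrained(
    "your-namespace/gmt-v7-base",
    trust_remote_code=True,
)

inputs = tokenizer("Graph memory transformers replace FFNs with", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```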
+
## Model Details
|
| 64 |
+
|
| 65 |
+
| Property | Value |
|
| 66 |
+
|---|---|
|
| 67 |
+
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
|
| 68 |
+
| **Parameters** | 82.2M |
|
| 69 |
+
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
|
| 70 |
+
| **Memory slots per layer** | 128 (2,048 total across network) |
|
| 71 |
+
| **Navigation dimension** | 128 |
|
| 72 |
+
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
|
| 73 |
+
| **Context length** | 1,024 |
|
| 74 |
+
| **Tied embeddings** | Yes |
|
| 75 |
+
|
| 76 |
+
**How it works.** Each block's memory cell performs three operations on a normalized token state:
|
| 77 |
+
1. **Source routing** — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
|
| 78 |
+
2. **Graph traversal + target selection** — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
|
| 79 |
+
3. **Gated displacement readout** — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value
|
| 80 |
+
|
| 81 |
+
The model has **zero dense FFN sublayers**. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
|
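
A minimal PyTorch sketch of one such cell, written only from the three steps above: the module and parameter names, the exact inverse-distance weighting, the way query-key scores are combined with the one-hop distribution, and the EMA write-back are all illustrative assumptions rather than the reference implementation (see the linked code repository for that).

```python
# Illustrative sketch only: one graph-memory cell implementing the three steps
# above (inverse-distance source routing, one-hop edge diffusion refined by
# query-key scores, gated displacement readout). Names and numerical details
# are assumptions, not the GMT reference code.
import torch
import torch.nn as nn


class GraphMemoryCellSketch(nn.Module):
    def __init__(self, d_model: int = 768, n_slots: int = 128, d_nav: int = 128):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_slots, d_nav) * 0.02)  # memory bank
        self.edges = nn.Parameter(torch.zeros(n_slots, n_slots))           # directed transition logits
        self.to_nav = nn.Linear(d_model, d_nav)    # project token state into navigation space
        self.query = nn.Linear(d_model, d_nav)     # token-conditioned query for target refinement
        self.gate = nn.Linear(d_model, 1)          # scalar gate g
        self.from_nav = nn.Linear(d_nav, d_model)  # map displacement back to model width
        self.norm = nn.LayerNorm(d_nav)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model), already normalized by the surrounding block
        nav = self.to_nav(x)                                              # (B, T, d_nav)

        # 1) Source routing: "gravitational" inverse-distance soft assignment
        dist2 = ((nav.unsqueeze(-2) - self.centroids) ** 2).sum(-1)       # (B, T, n_slots)
        src_w = 1.0 / (dist2 + 1e-6)
        src_w = src_w / src_w.sum(-1, keepdim=True)
        source = src_w @ self.centroids                                   # (B, T, d_nav)

        # 2) One-hop diffusion through the directed edge matrix, refined by
        #    token-conditioned query-key scores over the centroids
        hop_w = src_w @ torch.softmax(self.edges, dim=-1)                 # (B, T, n_slots)
        qk = self.query(x) @ self.centroids.t() / self.centroids.size(-1) ** 0.5
        tgt_w = torch.softmax(torch.log(hop_w + 1e-9) + qk, dim=-1)
        target = tgt_w @ self.centroids                                   # (B, T, d_nav)

        # 3) Gated displacement readout: sigma(g) * LayerNorm(target - source)
        displacement = self.norm(target - source)
        return self.from_nav(torch.sigmoid(self.gate(x)) * displacement)

    @torch.no_grad()
    def ema_write_back(self, nav: torch.Tensor, src_w: torch.Tensor, decay: float = 0.99):
        # Assumed form of the online EMA centroid update mentioned above: pull
        # each centroid toward the usage-weighted mean of the states routed to it.
        # (Dead-centroid reset / similarity merging would be handled separately.)
        w = src_w.reshape(-1, src_w.size(-1))                  # (B*T, n_slots)
        states = nav.reshape(-1, nav.size(-1))                 # (B*T, d_nav)
        mass = w.sum(0).clamp_min(1e-6).unsqueeze(-1)          # (n_slots, 1)
        mean_state = (w.t() @ states) / mass                   # (n_slots, d_nav)
        self.centroids.mul_(decay).add_(mean_state, alpha=1 - decay)
```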

## Training

| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
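
A plain-PyTorch sketch of the optimizer and schedule from the table above; the helper name, the total step count, and the linear warmup shape are assumptions, not taken from the GMT training code.

```python
# Sketch of the optimizer/scheduler configuration listed in the table above.
# The LambdaLR-based cosine schedule and linear warmup are illustrative; the
# GMT repository may wire this up differently.
import math
import torch


def build_optimizer_and_scheduler(model, total_steps: int, warmup_steps: int = 2000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                      # warmup (shape assumed linear)
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Gradient clipping at norm 1.0 and the bfloat16 precision from the table would live in the training loop itself, e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` inside a `torch.autocast("cuda", dtype=torch.bfloat16)` region.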
+
## Results
|
| 96 |
+
|
| 97 |
+
Evaluated at best validation-loss checkpoint (step 18,310). Baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.
|
| 98 |
+
|
| 99 |
+
| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|
| 100 |
+
|---|---|---|---|---|
|
| 101 |
+
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
|
| 102 |
+
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
|
| 103 |
+
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
|
| 104 |
+
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
|
| 105 |
+
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |
|
| 106 |
+
|
| 107 |
+
GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). Gaps are consistent with this disadvantage, while WinoGrande — a pronoun-resolution task requiring commonsense referent disambiguation — shows the memory graph may be particularly suited to structured association retrieval.
|
| 108 |
+
|
| 109 |
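
For reference, the perplexity figure above is the exponential of the mean next-token cross-entropy on held-out text. A minimal sketch of that metric for any causal LM follows; the `model` and `batches` objects are assumed to exist, and this is not the evaluation script behind the table.

```python
# Minimal sketch of the perplexity metric: exp(mean next-token cross-entropy).
# `model` is assumed to map token ids (batch, seq) to logits (batch, seq, vocab);
# `batches` is an iterable of LongTensors of token ids. Neither is defined here.
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def validation_perplexity(model, batches) -> float:
    total_nll, total_tokens = 0.0, 0
    for ids in batches:
        logits = model(ids)                                   # (B, T, vocab)
        # shift so position t predicts token t+1
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            ids[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += ids[:, 1:].numel()
    return math.exp(total_nll / total_tokens)
```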
+
## Intended Use & Limitations
|
| 110 |
+
|
| 111 |
+
- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
|
| 112 |
+
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
|
| 113 |
+
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.
|
| 114 |
+
|
| 115 |
+
## Citation
|
| 116 |
+
|
| 117 |
+
```bibtex
|
| 118 |
+
@article{zanarini2026graphmemorytransformer,
|
| 119 |
+
title={Graph Memory Transformer},
|
| 120 |
+
author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
|
| 121 |
+
journal={arXiv preprint arXiv:2604.23862},
|
| 122 |
+
year={2026}
|
| 123 |
+
}
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
## License
|
| 127 |
+
|
| 128 |
+
[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.
|
| 129 |
+
|
| 130 |
+
## Authors
|
| 131 |
+
|
| 132 |
+
Nicola Zanarini & Niccolò Ferrari
|