---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---

# GMT v7 Base

**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.

**Relevance & Potential.** Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
| **Parameters** | 82.2M |
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
| **Memory slots per layer** | 128 (2,048 total across network) |
| **Navigation dimension** | 128 |
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
| **Context length** | 1,024 |
| **Tied embeddings** | Yes |

**How it works.** Each block's memory cell performs three operations on a normalized token state (see the sketch below):

1. **Source routing** — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
2. **Graph traversal + target selection** — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
3. **Gated displacement readout** — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value

The model has **zero dense FFN sublayers**. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
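As an illustration of the three steps above, here is a minimal, self-contained PyTorch sketch of the memory-cell forward pass. The module name `GraphMemoryCell` and parameters such as `num_slots` and `nav_dim` are assumptions chosen for readability rather than the repository's actual API; EMA write-back, dead-centroid reset, and similarity merging are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Illustrative stand-in for the FFN sublayer: routes each token over a bank
    of centroids and returns a gated displacement, not a retrieved value."""

    def __init__(self, hidden_dim=768, num_slots=128, nav_dim=128):
        super().__init__()
        # Learned bank of layer-local memory centroids.
        self.centroids = nn.Parameter(torch.randn(num_slots, hidden_dim) * 0.02)
        # Directed transition matrix between centroids (logits).
        self.edge_logits = nn.Parameter(torch.zeros(num_slots, num_slots))
        # Token-conditioned query / centroid key projections for target refinement.
        self.query = nn.Linear(hidden_dim, nav_dim)
        self.key = nn.Linear(hidden_dim, nav_dim)
        # Scalar gate on the displacement readout.
        self.gate = nn.Linear(hidden_dim, 1)
        self.out_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):  # x: (batch, seq, hidden_dim), already normalized
        # 1) Source routing: gravitational soft assignment by inverse distance.
        bank = self.centroids.unsqueeze(0).expand(x.size(0), -1, -1)
        dist = torch.cdist(x, bank)                                 # (B, T, S)
        src_weights = F.softmax(-torch.log(dist + 1e-6), dim=-1)    # inverse-distance weights
        source = src_weights @ self.centroids                       # (B, T, H)

        # 2) Graph traversal + target selection: one-hop diffusion through the
        #    directed edge matrix, refined by token-conditioned query-key scores.
        edges = F.softmax(self.edge_logits, dim=-1)                 # (S, S)
        hop = src_weights @ edges                                   # (B, T, S)
        qk = self.query(x) @ self.key(self.centroids).t()           # (B, T, S)
        tgt_weights = F.softmax(qk + torch.log(hop + 1e-6), dim=-1)
        target = tgt_weights @ self.centroids                       # (B, T, H)

        # 3) Gated displacement readout: movement from source toward target.
        g = torch.sigmoid(self.gate(x))                             # (B, T, 1)
        return g * self.out_norm(target - source)
```

In this sketch, a call such as `GraphMemoryCell()(torch.randn(2, 16, 768))` returns a displacement of the same shape, which would be added to the residual stream where a dense FFN output would normally go.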
## Training

| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |

## Results

Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |

GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). The gaps are consistent with this disadvantage, while WinoGrande — a pronoun-resolution task requiring commonsense referent disambiguation — suggests the memory graph may be particularly suited to structured association retrieval.

## Intended Use & Limitations

- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.

## Citation

```bibtex
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
```

## License

[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

## Authors

Nicola Zanarini & Niccolò Ferrari