GMT v7 Base

Graph Memory Transformer (GMT) v7 is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a graph-structured memory cell. Causal self-attention is preserved; the per-token FFN transformation is replaced with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.

Relevance & Potential.

Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding what a layer computes by observing where it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

Paper (arXiv:2604.23862) | Code

Model Details

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer, FFN replaced by memory cells |
| Parameters | 82.2M |
| Layers / Heads / Hidden | 16 / 12 / 768 |
| Memory slots per layer | 128 (2,048 total across network) |
| Navigation dimension | 128 |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Tied embeddings | Yes |
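
Read as a configuration, the table above corresponds to something like the following; the field names are illustrative and are not the repository's actual config schema.

```python
from dataclasses import dataclass

@dataclass
class GMTv7Config:
    # Values taken from the Model Details table; field names are illustrative.
    n_layers: int = 16
    n_heads: int = 12
    d_model: int = 768
    n_memory_slots: int = 128      # per layer; 16 layers * 128 slots = 2,048 total
    d_nav: int = 128               # navigation-space dimension
    vocab_size: int = 50_257       # GPT-2 tokenizer
    context_length: int = 1_024
    tie_embeddings: bool = True
```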

How it works. Each block's memory cell performs three operations on a normalized token state:

  1. Source routing — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
  2. Graph traversal + target selection — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
  3. Gated displacement readout — the cell returns σ(g) · LayerNorm(target − source), i.e. movement from source toward target memory state, not a retrieved value

The model has zero dense FFN sublayers. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
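
The listing below is a minimal PyTorch sketch of one such memory cell, following the three steps above. Tensor shapes, module names (`to_nav`, `query`, `gate`), and the exact scoring and normalization details are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class GraphMemoryCellSketch(nn.Module):
    """Illustrative graph-memory cell: source routing, one-hop traversal, gated displacement."""

    def __init__(self, d_model=768, n_slots=128, d_nav=128, eps=1e-6):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_slots, d_nav) * 0.02)  # layer-local memory bank
        self.edges = nn.Parameter(torch.zeros(n_slots, n_slots))           # directed transition logits
        self.to_nav = nn.Linear(d_model, d_nav)                            # token state -> navigation space
        self.query = nn.Linear(d_model, d_nav)                             # token-conditioned query for targets
        self.gate = nn.Linear(d_model, 1)                                  # scalar readout gate g
        self.from_nav = nn.Linear(d_nav, d_model)                          # displacement -> model width
        self.norm = nn.LayerNorm(d_nav)
        self.eps = eps

    def forward(self, x):                         # x: (batch, seq, d_model), already pre-normalized
        z = self.to_nav(x)                        # (B, T, d_nav)

        # 1) Source routing: gravitational soft assignment by inverse squared distance to centroids.
        d2 = torch.cdist(z, self.centroids.unsqueeze(0).expand(z.size(0), -1, -1)) ** 2
        src_w = 1.0 / (d2 + self.eps)
        src_w = src_w / src_w.sum(dim=-1, keepdim=True)      # (B, T, n_slots)
        source = src_w @ self.centroids                      # (B, T, d_nav)

        # 2) One-hop diffusion through the directed edge matrix, refined by query-key scores.
        hop_w = src_w @ torch.softmax(self.edges, dim=-1)    # neighbor mass after one hop
        qk = self.query(x) @ self.centroids.t() / self.centroids.size(-1) ** 0.5
        tgt_w = torch.softmax(torch.log(hop_w + self.eps) + qk, dim=-1)
        target = tgt_w @ self.centroids                      # (B, T, d_nav)

        # 3) Gated displacement readout: movement from source toward target, not a retrieved value.
        disp = self.norm(target - source)
        return torch.sigmoid(self.gate(x)) * self.from_nav(disp)

    @torch.no_grad()
    def ema_writeback(self, z, src_w, decay=0.99):
        # Online centroid maintenance via EMA write-back; the dead-centroid reset and
        # similarity merging mentioned above are omitted. In practice z and src_w would
        # be cached from forward().
        mass = src_w.flatten(0, 1).sum(dim=0).unsqueeze(-1)                        # (n_slots, 1)
        batch_means = (src_w.flatten(0, 1).t() @ z.flatten(0, 1)) / (mass + self.eps)
        self.centroids.mul_(decay).add_(batch_means, alpha=1 - decay)
```

In the full model this readout presumably takes the place of the FFN output in the residual stream, with the auxiliary losses listed under Training regularizing centroid usage and edge structure.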

Training

| Property | Value |
|---|---|
| Dataset | OpenWebText (~3B tokens, 2 epochs) |
| Optimizer | AdamW (lr = 3e-4, weight decay = 0.1, β = (0.9, 0.95)) |
| Scheduler | Cosine decay, 2,000 warmup steps |
| Effective batch | 270,336 tokens (8 sequences × 33 gradient-accumulation steps × 1,024 tokens) |
| Precision | bfloat16 mixed precision |
| Gradient clipping | Norm 1.0 |
| Auxiliary losses | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
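
A minimal sketch of how this optimizer and scheduler setup might be wired in PyTorch follows; the stand-in model and the total step count are placeholders, and only the hyperparameters shown mirror the table above.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)        # stand-in for the GMT model
warmup_steps = 2_000                 # from the table
total_steps = 20_000                 # placeholder; the card does not state the total optimizer steps

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))

def lr_lambda(step):
    # Linear warmup for 2,000 steps, then cosine decay (minimum LR assumed zero here).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop: bfloat16 autocast, 33 gradient-accumulation micro-batches of
# 8 × 1,024 tokens (≈ 270,336 tokens per update), then clipping before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```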

Results

Evaluated at best validation-loss checkpoint (step 18,310). Baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| WinoGrande (0-shot) | Accuracy | 51.5% | 50.5% | +1.0 pp |

GMT v7 operates at a roughly 20% parameter disadvantage (82.2M vs 103.0M). The gaps on ARC-Easy, HellaSwag, and PIQA are consistent with that disadvantage, while the gain on WinoGrande, a pronoun-resolution task requiring commonsense referent disambiguation, suggests the memory graph may be particularly suited to structured association retrieval.

Intended Use & Limitations

  • Research prototype. This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
  • The model remains behind the larger dense baseline in validation loss and most benchmarks.
  • Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.

Citation

@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}

License

MIT — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

Authors

Nicola Zanarini & Niccolò Ferrari
