---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---

# GMT v7 Base

**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.

**Relevance & Potential.** Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
| **Parameters** | 82.2M |
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
| **Memory slots per layer** | 128 (2,048 total across network) |
| **Navigation dimension** | 128 |
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
| **Context length** | 1,024 |
| **Tied embeddings** | Yes |

**How it works.** Each block's memory cell performs three operations on a normalized token state (see the sketch below):

1. **Source routing** — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
2. **Graph traversal + target selection** — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
3. **Gated displacement readout** — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value

The model has **zero dense FFN sublayers**. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
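As an illustration of the three steps above, here is a minimal, self-contained PyTorch sketch of the memory-cell forward pass. The module name `GraphMemoryCell` and parameters such as `num_slots` and `nav_dim` are assumptions chosen for readability rather than the repository's actual API; EMA write-back, dead-centroid reset, and similarity merging are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Illustrative stand-in for the FFN sublayer: routes each token over a bank
    of centroids and returns a gated displacement, not a retrieved value."""

    def __init__(self, hidden_dim=768, num_slots=128, nav_dim=128):
        super().__init__()
        # Learned bank of layer-local memory centroids.
        self.centroids = nn.Parameter(torch.randn(num_slots, hidden_dim) * 0.02)
        # Directed transition matrix between centroids (logits).
        self.edge_logits = nn.Parameter(torch.zeros(num_slots, num_slots))
        # Token-conditioned query / centroid key projections for target refinement.
        self.query = nn.Linear(hidden_dim, nav_dim)
        self.key = nn.Linear(hidden_dim, nav_dim)
        # Scalar gate on the displacement readout.
        self.gate = nn.Linear(hidden_dim, 1)
        self.out_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):  # x: (batch, seq, hidden_dim), already normalized
        # 1) Source routing: gravitational soft assignment by inverse distance.
        bank = self.centroids.unsqueeze(0).expand(x.size(0), -1, -1)
        dist = torch.cdist(x, bank)                                 # (B, T, S)
        src_weights = F.softmax(-torch.log(dist + 1e-6), dim=-1)    # inverse-distance weights
        source = src_weights @ self.centroids                       # (B, T, H)

        # 2) Graph traversal + target selection: one-hop diffusion through the
        #    directed edge matrix, refined by token-conditioned query-key scores.
        edges = F.softmax(self.edge_logits, dim=-1)                 # (S, S)
        hop = src_weights @ edges                                   # (B, T, S)
        qk = self.query(x) @ self.key(self.centroids).t()           # (B, T, S)
        tgt_weights = F.softmax(qk + torch.log(hop + 1e-6), dim=-1)
        target = tgt_weights @ self.centroids                       # (B, T, H)

        # 3) Gated displacement readout: movement from source toward target.
        g = torch.sigmoid(self.gate(x))                             # (B, T, 1)
        return g * self.out_norm(target - source)
```

In this sketch, a call such as `GraphMemoryCell()(torch.randn(2, 16, 768))` returns a displacement of the same shape, which would be added to the residual stream where a dense FFN output would normally go.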
## Training

| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |

## Results

Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |

GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). The gaps are consistent with this disadvantage, while WinoGrande — a pronoun-resolution task requiring commonsense referent disambiguation — suggests the memory graph may be particularly suited to structured association retrieval.

## Intended Use & Limitations

- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.

## Citation

```bibtex
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
```

## License

[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

## Authors

Nicola Zanarini & Niccolò Ferrari