SMAT — Semantic Attention

Trained checkpoints for SMAT (Semantic Attention), a transformer attention variant with a learnable semantic-similarity bias and per-token value gate.

  • Code: github.com/OutrageouslyBad200/SMATest
  • Architecture: 24 layers × 384d × 6 heads, block size 256, ~64 M parameters
  • Tokenizer: GPT-2 (tiktoken, vocab 50 257)
  • Training data: FineWeb-Edu sample-10BT, 98 M tokens
  • Training compute: 12 000 optimizer steps, batch 16 × grad_accum 2 (effective 32), RTX 4060
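The token count follows from the listed hyperparameters. The check below is a quick sketch that assumes every optimizer step consumes one full effective batch of block-size sequences; the card does not state how the 98 M figure was derived, so treat that as an assumption.

```python
# Hyperparameters as listed above.
steps = 12_000        # optimizer steps per run
micro_batch = 16      # sequences per forward/backward pass
grad_accum = 2        # micro-batches accumulated per optimizer step
block_size = 256      # tokens per sequence

effective_batch = micro_batch * grad_accum      # sequences per optimizer step
tokens_per_step = effective_batch * block_size  # tokens per optimizer step
tokens_per_run = steps * tokens_per_step        # tokens seen in one run

print(effective_batch)   # 32
print(tokens_per_step)   # 8192
print(tokens_per_run)    # 98304000, i.e. about 98 M
```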

Equation

Attn(Q,K,V) = softmax(QK^T/sqrt(d_k) + λ·S + P + M) · (G ⊙ V)
  • S_ij = cos(W_s h_i, W_s h_j) — cosine similarity in shared projection
  • c_j = (1/n) Σ_{l≤j} S_jl — causal semantic centrality
  • G_j = σ(w_g^T h_j + μ·c_j + β) — per-token value gate
  • λ = softplus(λ_raw) — constrained positive scalar (per layer)
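The definitions above can be traced through a small single-head sketch in pure Python. This is an illustration, not the repository's implementation. It rests on three assumptions the card does not confirm: M is taken to be a causal mask, P is omitted (treated as zero) because it is not defined here, and n is taken to be the sequence length. For brevity the hidden states are reused as Q, K and V; a real model would use learned projections.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def smat_attention(Q, K, V, H, W_s, w_g, lam_raw, mu, beta):
    """Single-head SMAT attention on lists of row vectors (one row per token)."""
    n, d_k = len(H), len(Q[0])
    Z = [matvec(W_s, h) for h in H]                    # shared projection W_s h_i
    S = [[cosine(Z[i], Z[j]) for j in range(n)] for i in range(n)]
    # Causal semantic centrality; n assumed to be the sequence length.
    c = [sum(S[j][l] for l in range(j + 1)) / n for j in range(n)]
    G = [sigmoid(dot(w_g, H[j]) + mu * c[j] + beta) for j in range(n)]
    lam = softplus(lam_raw)                            # constrained positive scalar
    out = []
    for i in range(n):
        # Causal mask M (assumed): token i attends to positions j <= i only.
        logits = [dot(Q[i], K[j]) / math.sqrt(d_k) + lam * S[i][j]
                  for j in range(i + 1)]
        m = max(logits)                                # subtract max for stability
        w = [math.exp(x - m) for x in logits]
        total = sum(w)
        # Weighted sum over gated values G_j * V_j.
        out.append([sum(w[j] / total * G[j] * V[j][t] for j in range(i + 1))
                    for t in range(len(V[0]))])
    return out, S, c, G

# Toy input: three tokens with 2-d hidden states, identity W_s.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, S, c, G = smat_attention(Q=H, K=H, V=H, H=H, W_s=I2,
                              w_g=[0.5, -0.5], lam_raw=0.0, mu=1.0, beta=0.0)
```

Under the causal mask the first token attends only to itself, so its output is simply its own gated value `G[0] * V[0]`, which makes a convenient sanity check.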

Repository contents

This HuggingFace repo hosts 20 checkpoints from the 5-seed ablation in Experiment 6 of the SMAT paper:

baseline_s0/final.pt   s_only_s0/final.pt   g_only_s0/final.pt   full_s0/final.pt
baseline_s1/final.pt   s_only_s1/final.pt   g_only_s1/final.pt   full_s1/final.pt
baseline_s2/final.pt   s_only_s2/final.pt   g_only_s2/final.pt   full_s2/final.pt
baseline_s3/final.pt   s_only_s3/final.pt   g_only_s3/final.pt   full_s3/final.pt
baseline_s4/final.pt   s_only_s4/final.pt   g_only_s4/final.pt   full_s4/final.pt

Each variant directory also contains config.json and metrics.jsonl (per-step training + eval logs).
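A minimal sketch for reading a metrics.jsonl file, assuming only that it is standard JSON Lines (one JSON object per line). The key names in the demo, "step" and "val_loss", are hypothetical; inspect the real keys before relying on any of them.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Demo on a throwaway file; the keys "step" and "val_loss" are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metrics.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"step": 0, "val_loss": 10.8}\n{"step": 100, "val_loss": 7.2}\n')
    records = read_jsonl(demo_path)

print(len(records))               # number of logged records
print(sorted(records[0].keys()))  # inspect the available keys first
```

For a real run, point `read_jsonl` at a downloaded file such as full_s0/metrics.jsonl.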

Variant    use_S   use_G   Description
baseline   False   False   Standard attention
s_only     True    False   Semantic bias only
g_only     False   True    Value gate only
full       True    True    Full SMAT
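The variant table and the checkpoint listing combine into 20 file paths. The sketch below enumerates them; the names `VARIANTS`, `SEEDS` and `checkpoint_paths` are illustrative helpers, not part of the repository.

```python
# (use_S, use_G) per variant, copied from the table above.
VARIANTS = {
    "baseline": (False, False),
    "s_only": (True, False),
    "g_only": (False, True),
    "full": (True, True),
}
SEEDS = range(5)  # s0 .. s4

def checkpoint_paths():
    """List every <variant>_s<seed>/final.pt path in the repository."""
    return [f"{v}_s{s}/final.pt" for v in VARIANTS for s in SEEDS]

paths = checkpoint_paths()
print(len(paths))   # 20
print(paths[0])     # baseline_s0/final.pt
```

Each path can be passed as the `filename` argument of `hf_hub_download`.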

Results

Validation perplexity on FineWeb-Edu, 5 seeds, 12 000 steps:

Variant     Mean ppl   Std    Δ vs baseline   Seed wins
Baseline    79.75      1.69
S-only      79.47      1.71   −0.35%          4/5
G-only      79.02      1.65   −0.90%          5/5
Full SMAT   78.65      1.75   −1.37%          5/5

No NaN failures occurred across the 240 000 optimizer steps (20 runs × 12 000 steps each).

Usage

pip install torch numpy tiktoken huggingface_hub
git clone https://github.com/OutrageouslyBad200/SMATest.git
cd SMATest

Download a single checkpoint:

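from huggingface_hub import hf_hub_download

# Fetches the file into the local Hugging Face cache and returns its path.
# If this repo_id is not found, try "OutrageouslyBad200/SMAT_ablations",
# the name reported on the Hub page for these checkpoints.
ckpt_path = hf_hub_download(
    repo_id="OutrageouslyBad200/SMATest",
    filename="full_s0/final.pt",  # any <variant>_s<seed>/final.pt
)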

Load it into the SMAT model:

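import torch
from model import Config, SMATTransformer  # run from the cloned SMATest directory

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
state = torch.load(ckpt_path, map_location=device)
cfg = Config(**state["config"])  # rebuild the config stored in the checkpoint
model = SMATTransformer(cfg).to(device)
model.load_state_dict(state["state_dict"])
model.eval()  # inference mode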

Reproduce surgical ablations (Experiment 7):

python ablate.py --ckpt full_s0/final.pt --n_batches 80

Surgical-ablation findings (Experiment 7)

Run on Full SMAT (unablated val ppl 79.010):

Ablation                 Val ppl   Δ
λ=0 (S still drives c)   79.40     +0.49%
S removed entirely       80.48     +1.85%
Random S (same norm)     81.23     +2.80%
G replaced by mean       196.99    +149%
G forced to 1.0          625 850   catastrophic

  • The gate G is essential: forcing it to 1.0 drives val ppl from 79.010 to 625 850.
  • S routes mostly through μ·c in the gate (74 % of lift), not through λ·S in attention (26 %).
  • Per-token gate differentiation matters: replacing G with its mean costs 149 %.
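The 74 % / 26 % split is consistent with the ratio of two Δ values in the ablation table. The card does not state how the split was computed, so the derivation below is an assumption: the λ=0 row isolates the loss from the attention bias alone, and the "S removed entirely" row gives the loss from both pathways.

```python
# Relative val-ppl increases (percent) from the ablation table.
delta_lambda_zero = 0.49   # lambda=0: attention bias off, S still feeds c in the gate
delta_s_removed = 1.85     # S removed entirely: both pathways off

attention_share = delta_lambda_zero / delta_s_removed  # part lost via lambda*S alone
gate_share = 1.0 - attention_share                      # remainder, attributed to mu*c

print(round(attention_share * 100))  # 26
print(round(gate_share * 100))       # 74

# Cost of replacing G with its mean, relative to the unablated 79.010.
mean_gate_cost = (196.99 - 79.010) / 79.010
print(round(mean_gate_cost * 100))   # 149
```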

Limitations

  • Small base model (~64 M params); larger-scale runs (100 M on FineWeb / FineMath) show larger perplexity reductions (−11 % to −17 %), but those checkpoints are not released here.
  • Trained only on English FineWeb-Edu sample-10BT — generalization to other domains untested at this scale.
  • Not instruction-tuned, not RLHF'd, no safety filtering. Research artifact only.

Citation

@misc{smat2026,
  author       = {OutrageouslyBad200},
  title        = {SMAT: Semantic Attention},
  year         = {2026},
  howpublished = {\url{https://github.com/OutrageouslyBad200/SMATest}},
}

Contact

For further information on training runs, intermediate experiments, or the unpublished paper draft, please contact the creator via GitHub or HuggingFace.

License

MIT License.
