SMAT — Semantic Attention

Trained checkpoints for SMAT (Semantic Attention), a transformer attention variant with a learnable semantic-similarity bias and per-token value gate.

Code: github.com/OutrageouslyBad200/SMATest
Architecture: 24 layers × 384d × 6 heads, block size 256, ~64 M parameters
Tokenizer: GPT-2 (tiktoken, vocab 50 257)
Training data: FineWeb-Edu sample-10BT, 98 M tokens
Training compute: 12 000 optimizer steps, batch 16 × grad_accum 2 (effective 32), RTX 4060
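
The 98 M token figure can be checked from the numbers above. This is a minimal sketch that assumes every optimizer step consumes one full effective batch of block-size sequences; the variable names are illustrative and not taken from the repository.

```python
# Tokens seen during training, assuming each optimizer step processes
# one full effective batch of block_size-token sequences.
steps = 12_000
micro_batch = 16
grad_accum = 2
block_size = 256

effective_batch = micro_batch * grad_accum           # sequences per optimizer step
tokens_seen = steps * effective_batch * block_size   # total training tokens

print(f"{tokens_seen:,} tokens (~{tokens_seen / 1e6:.0f} M)")
```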

Equation

Attn(Q,K,V) = softmax(QK^T/sqrt(d_k) + λ·S + P + M) · (G ⊙ V)

S_ij = cos(W_s h_i, W_s h_j) — cosine similarity in shared projection
c_j = (1/n) Σ_{l≤j} S_jl — causal semantic centrality
G_j = σ(w_g^T h_j + μ·c_j + β) — per-token value gate
λ = softplus(λ_raw) — constrained positive scalar (per layer)
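
The equations above can be read as the following single-head NumPy sketch. It is not the repository's implementation, and it rests on several assumptions: P is not defined in this card, so it is omitted (treated as zero); M is taken to be the usual causal mask; n is taken to be the sequence length; and G ⊙ V is read as scaling each token's value row by its scalar gate. All function and parameter names are hypothetical.

```python
import numpy as np

def smat_attention(h, Wq, Wk, Wv, Ws, w_g, lam_raw, mu, beta):
    """Single-head causal SMAT attention for one sequence h of shape (n, d)."""
    n, _ = h.shape
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    d_k = q.shape[-1]

    # S_ij: cosine similarity of tokens i and j in the shared projection W_s.
    z = h @ Ws
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    S = z @ z.T

    # Assumed causal mask M: 0 on and below the diagonal, -inf above it.
    causal = np.tril(np.ones((n, n), dtype=bool))
    M = np.where(causal, 0.0, -np.inf)

    # c_j = (1/n) * sum_{l <= j} S_jl, with n assumed to be the sequence length.
    c = np.where(causal, S, 0.0).sum(axis=-1) / n

    # G_j = sigmoid(w_g . h_j + mu * c_j + beta): one scalar gate per token.
    G = 1.0 / (1.0 + np.exp(-(h @ w_g + mu * c + beta)))

    # lambda = softplus(lambda_raw) keeps the bias strength positive.
    lam = np.log1p(np.exp(lam_raw))

    # P is omitted here (assumption); softmax is computed row-wise and stably.
    scores = q @ k.T / np.sqrt(d_k) + lam * S + M
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ (G[:, None] * v), weights, G

# Tiny random example: 5 tokens, model width 8.
rng = np.random.default_rng(0)
n, d = 5, 8
h = rng.normal(size=(n, d))
Wq, Wk, Wv, Ws = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out, weights, G = smat_attention(h, Wq, Wk, Wv, Ws, rng.normal(size=d), 0.0, 1.0, 0.0)
print(out.shape)
```

Returning the attention weights and the gate alongside the output makes it easy to inspect how much each token's value is attenuated before mixing.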

Repository contents

This HuggingFace repo hosts 20 checkpoints from the 5-seed ablation in Experiment 6 of the SMAT paper:

baseline_s0/final.pt   s_only_s0/final.pt   g_only_s0/final.pt   full_s0/final.pt
baseline_s1/final.pt   s_only_s1/final.pt   g_only_s1/final.pt   full_s1/final.pt
baseline_s2/final.pt   s_only_s2/final.pt   g_only_s2/final.pt   full_s2/final.pt
baseline_s3/final.pt   s_only_s3/final.pt   g_only_s3/final.pt   full_s3/final.pt
baseline_s4/final.pt   s_only_s4/final.pt   g_only_s4/final.pt   full_s4/final.pt

Each variant directory also contains config.json and metrics.jsonl (per-step training + eval logs).
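
The field names inside metrics.jsonl are not documented in this card, so the sketch below only assumes the standard JSON Lines layout (one JSON object per line) and returns plain dictionaries. The helper name read_jsonl is hypothetical.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

After downloading a metrics file, read_jsonl(path)[0].keys() shows which fields the training script actually logged.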

Variant	`use_S`	`use_G`	Description
`baseline`	False	False	Standard attention
`s_only`	True	False	Semantic bias only
`g_only`	False	True	Value gate only
`full`	True	True	Full SMAT

Results

Validation perplexity on FineWeb-Edu, 5 seeds, 12 000 steps:

Variant	Mean ppl	Std	Δ vs baseline	Seed wins
Baseline	79.75	1.69	—	—
S-only	79.47	1.71	−0.35%	4/5
G-only	79.02	1.65	−0.90%	5/5
Full SMAT	78.65	1.75	−1.37%	5/5

No NaN failures occurred across the 240 000 total optimizer steps (20 runs × 12 000 steps).
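
The Δ column is a relative change in mean perplexity against the baseline. The sketch below recomputes it from the rounded means in the table. Small differences from the reported values (for example −0.92 % versus −0.90 % for G-only) are to be expected if the table was derived from unrounded or per-seed values, which is an assumption here.

```python
def rel_delta_pct(value, baseline):
    """Relative change of value versus baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

# Rounded mean perplexities copied from the results table.
baseline_ppl = 79.75
mean_ppl = {"S-only": 79.47, "G-only": 79.02, "Full SMAT": 78.65}

deltas = {name: rel_delta_pct(ppl, baseline_ppl) for name, ppl in mean_ppl.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```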

Usage

pip install torch numpy tiktoken huggingface_hub
git clone https://github.com/OutrageouslyBad200/SMATest.git
cd SMATest

Download a single checkpoint:

from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="OutrageouslyBad200/SMATest",
    filename="full_s0/final.pt",
)
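
To fetch all 20 checkpoints, the same call can be looped over the directory layout listed under "Repository contents". This is a sketch: the helper names are hypothetical, and the default repo_id is simply copied from the snippet above.

```python
VARIANTS = ("baseline", "s_only", "g_only", "full")
SEEDS = range(5)

def checkpoint_files():
    """List the 20 checkpoint paths, following the <variant>_s<seed>/final.pt layout."""
    return [f"{variant}_s{seed}/final.pt" for seed in SEEDS for variant in VARIANTS]

def download_all(repo_id="OutrageouslyBad200/SMATest"):
    """Download every checkpoint and return a {filename: local_path} mapping."""
    # Imported lazily so checkpoint_files() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in checkpoint_files()
    }
```

Calling download_all() stores the files in the local Hugging Face cache, so repeated calls do not download them again.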

Load it into the SMAT model:

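import torch
from model import Config, SMATTransformer  # run from inside the cloned SMATest directory

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
state = torch.load(ckpt_path, map_location=device)
cfg = Config(**state["config"])  # rebuild the model config stored in the checkpoint
model = SMATTransformer(cfg).to(device)
model.load_state_dict(state["state_dict"])
model.eval()  # inference mode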

Reproduce surgical ablations (Experiment 7):

python ablate.py --ckpt full_s0/final.pt --n_batches 80

Surgical-ablation findings (Experiment 7)

Run on Full SMAT (unablated val ppl 79.010):

Ablation	val ppl	Δ
λ=0 (S still drives c)	79.40	+0.49%
S removed entirely	80.48	+1.85%
Random S (same norm)	81.23	+2.80%
G replaced by mean	196.99	+149%
G forced to 1.0	625 850	catastrophic

The gate G is essential: forcing G to 1.0 raises val ppl from 79.01 to 625 850.
S acts mostly through μ·c in the gate (74 % of its lift), not through λ·S in the attention logits (26 %).
Per-token gate differentiation matters: replacing G with its mean raises val ppl by 149 %.
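
The 74 % / 26 % split follows from the first two rows of the ablation table, assuming the two paths contribute additively: setting λ = 0 cuts only the attention-bias path, while removing S cuts both paths. A worked version of that arithmetic, with illustrative variable names:

```python
# Deltas from the surgical-ablation table, in percent of val ppl.
delta_lambda_zero = 0.49   # lambda = 0, S still drives c
delta_s_removed = 1.85     # S removed entirely

# Share of the lift lost when only the attention-bias path (lambda * S) is cut.
share_attention_bias = delta_lambda_zero / delta_s_removed
# The remainder is attributed to the mu * c path through the gate.
share_gate = 1.0 - share_attention_bias

print(f"lambda*S path: {share_attention_bias:.0%}, mu*c path: {share_gate:.0%}")
```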

Limitations

Small base model (~64 M params); larger-scale runs (100 M on FineWeb / FineMath) show stronger lifts (−11 % to −17 %) but are not included as released checkpoints.
Trained only on English FineWeb-Edu sample-10BT — generalization to other domains untested at this scale.
Not instruction-tuned, not RLHF'd, no safety filtering. Research artifact only.

Citation

@misc{smat2026,
  author       = {OutrageouslyBad200},
  title        = {SMAT: Semantic Attention},
  year         = {2026},
  howpublished = {\url{https://github.com/OutrageouslyBad200/SMATest}},
}

Contact

For further information on training runs, intermediate experiments, or the unpublished paper draft, please contact the creator via GitHub or HuggingFace.

License

MIT License.
