Trained checkpoints for SMAT (Semantic Attention), a transformer attention variant with a learnable semantic-similarity bias and per-token value gate.
tiktoken, vocab 50 257)

$$\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + \lambda S + P + M\right)\,(G \odot V)$$

$$S_{ij} = \cos(W_s h_i,\; W_s h_j) \qquad \text{cosine similarity in shared projection}$$

$$c_j = \frac{1}{n}\sum_{l \le j} S_{jl} \qquad \text{causal semantic centrality}$$

$$G_j = \sigma\!\left(w_g^\top h_j + \mu\, c_j + \beta\right) \qquad \text{per-token value gate}$$

$$\lambda = \mathrm{softplus}(\lambda_{\mathrm{raw}}) \qquad \text{constrained positive scalar (per layer)}$$

This HuggingFace repo hosts 20 checkpoints from the 5-seed ablation in Experiment 6 of the SMAT paper:
baseline_s0/final.pt s_only_s0/final.pt g_only_s0/final.pt full_s0/final.pt
baseline_s1/final.pt s_only_s1/final.pt g_only_s1/final.pt full_s1/final.pt
baseline_s2/final.pt s_only_s2/final.pt g_only_s2/final.pt full_s2/final.pt
baseline_s3/final.pt s_only_s3/final.pt g_only_s3/final.pt full_s3/final.pt
baseline_s4/final.pt s_only_s4/final.pt g_only_s4/final.pt full_s4/final.pt
Each variant directory also contains config.json and metrics.jsonl
(per-step training + eval logs).
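
The layout above is regular enough to enumerate in code. The following is a minimal sketch that builds the 20 run directory names and reads a `metrics.jsonl` file generically. The helper names `run_dirs` and `read_jsonl` are made up for this example, and the field names inside `metrics.jsonl` are not documented here, so the reader returns plain dicts without assuming any keys.

```python
import json

VARIANTS = ("baseline", "s_only", "g_only", "full")
SEEDS = range(5)  # seeds s0 .. s4


def run_dirs():
    """Directory names such as 'full_s0', one per variant and seed."""
    return [f"{variant}_s{seed}" for variant in VARIANTS for seed in SEEDS]


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Paths of all checkpoints, relative to the repo root.
checkpoints = [f"{d}/final.pt" for d in run_dirs()]
print(len(checkpoints), checkpoints[0], checkpoints[-1])
```

After downloading a run directory, `read_jsonl("full_s0/metrics.jsonl")` should return one dict per logged step; inspecting `records[0].keys()` is a safe way to discover the actual field names.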
| Variant | use_S | use_G | Description |
|---|---|---|---|
| baseline | False | False | Standard attention |
| s_only | True | False | Semantic bias only |
| g_only | False | True | Value gate only |
| full | True | True | Full SMAT |
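
To make the four variants concrete, here is a single-head NumPy sketch of the attention formula with `use_S` and `use_G` switches. It is an illustration under several assumptions, not the code in `model.py`: `P` is taken to be an additive positional bias (zeros here), `M` a causal mask, `n` the sequence length, and with `use_S=False` the centrality term `c` is simply set to zero. All weight names and shapes are made up for the example.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + e^x), numerically stable


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def smat_attention(h, w, use_S=True, use_G=True):
    """Single-head attention over h of shape (n, d) with optional S and G terms."""
    n = h.shape[0]
    Q, K, V = h @ w["Wq"], h @ w["Wk"], h @ w["Wv"]
    d_k = Q.shape[1]

    # M: causal mask (assumed), -inf strictly above the diagonal.
    M = np.triu(np.full((n, n), -np.inf), k=1)
    # P: positional bias (assumed); zeros in this sketch.
    P = np.zeros((n, n))
    logits = Q @ K.T / np.sqrt(d_k) + P + M

    c = np.zeros(n)
    if use_S:
        z = h @ w["Ws"]
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        S = z @ z.T                      # S_ij = cos(W_s h_i, W_s h_j)
        c = np.tril(S).sum(axis=1) / n   # c_j = (1/n) * sum_{l <= j} S_jl
        logits = logits + softplus(w["lam_raw"]) * S

    G = np.ones(n)
    if use_G:
        # G_j = sigmoid(w_g^T h_j + mu * c_j + beta), one scalar per token
        G = sigmoid(h @ w["w_g"] + w["mu"] * c + w["beta"])

    # Row-wise softmax; masked entries become exactly zero.
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ (G[:, None] * V), A, G


rng = np.random.default_rng(0)
n, d = 6, 8
w = {k: rng.normal(size=(d, d)) * 0.1 for k in ("Wq", "Wk", "Wv", "Ws")}
w.update(w_g=rng.normal(size=d), lam_raw=0.0, mu=1.0, beta=0.0)
h = rng.normal(size=(n, d))

flags = {"baseline": (False, False), "s_only": (True, False),
         "g_only": (False, True), "full": (True, True)}
for name, (use_S, use_G) in flags.items():
    out, A, G = smat_attention(h, w, use_S, use_G)
    print(name, out.shape)
```

With both flags off the function reduces to standard causal softmax attention, which matches the `baseline` row.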
Validation perplexity on FineWeb-Edu, 5 seeds, 12 000 steps:
| Variant | Mean ppl | Std | Δ vs baseline | Seed wins |
|---|---|---|---|---|
| Baseline | 79.75 | 1.69 | — | — |
| S-only | 79.47 | 1.71 | −0.35% | 4/5 |
| G-only | 79.02 | 1.65 | −0.90% | 5/5 |
| Full SMAT | 78.65 | 1.75 | −1.37% | 5/5 |

0 NaN failures across 240 000 optimizer steps.
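
The summary columns can be recomputed from per-seed perplexities along the following lines. This is a sketch under assumptions: the per-seed values are not listed in this card, so the numbers below are placeholders for illustration only. The use of the sample standard deviation, and of the relative difference of means for Δ, are guesses at how the table was produced. The step count, by contrast, follows directly from 4 variants × 5 seeds × 12 000 steps.

```python
from statistics import mean, stdev


def summarize(baseline, variant):
    """Compare per-seed perplexities of a variant against the baseline.

    Both lists must be ordered by seed so that entries pair up.
    """
    delta_pct = 100.0 * (mean(variant) - mean(baseline)) / mean(baseline)
    wins = sum(v < b for v, b in zip(variant, baseline))
    return {
        "mean": mean(variant),
        "std": stdev(variant),  # sample std (n - 1); an assumption
        "delta_pct": delta_pct,
        "seed_wins": f"{wins}/{len(baseline)}",
    }


# Placeholder perplexities for illustration only, NOT values from the runs.
baseline = [80.0, 78.0, 82.0, 79.0, 81.0]
variant = [79.0, 77.5, 81.0, 79.5, 80.0]
stats = summarize(baseline, variant)
print(stats)

# Step count behind the NaN statement: 4 variants x 5 seeds x 12 000 steps.
total_steps = 4 * 5 * 12_000
print(total_steps)
```

A "seed win" here means the variant reaches lower perplexity than the baseline trained with the same seed.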
pip install torch numpy tiktoken huggingface_hub
git clone https://github.com/OutrageouslyBad200/SMATest.git
cd SMATest
Download a single checkpoint:
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
repo_id="OutrageouslyBad200/SMATest",
filename="full_s0/final.pt",
)
Load it into the SMAT model:
import torch
from model import Config, SMATTransformer
state = torch.load(ckpt_path, map_location="cuda")
cfg = Config(**state["config"])
model = SMATTransformer(cfg).cuda()
model.load_state_dict(state["state_dict"])
model.eval()
Reproduce surgical ablations (Experiment 7):
python ablate.py --ckpt full_s0/final.pt --n_batches 80
Results on Full SMAT (reference val ppl 79.010):
| Ablation | val ppl | Δ |
|---|---|---|
| λ=0 (S still drives c) | 79.40 | +0.49% |
| S removed entirely | 80.48 | +1.85% |
| Random S (same norm) | 81.23 | +2.80% |
| G replaced by mean | 196.99 | +149% |
| G forced to 1.0 | 625 850 | catastrophic |
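
The 74 % / 26 % split quoted for the two S pathways appears to follow from the first two rows of the table. One plausible reading, which is an assumption rather than the documented procedure: setting λ=0 removes only the attention-bias path, removing S entirely removes both paths, and the ratio of the two Δ values gives the share attributable to λ·S.

```python
# Δ values copied from the ablation table (percent increase in val ppl).
delta_lambda_zero = 0.49  # λ=0, S still drives c: only the attention path is cut
delta_s_removed = 1.85    # S removed entirely: both paths are cut

attention_share = delta_lambda_zero / delta_s_removed  # via λ·S in attention
gate_share = 1.0 - attention_share                     # via μ·c in the gate

print(f"attention: {attention_share:.0%}, gate: {gate_share:.0%}")
```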
μ·c in the gate (74 % of lift), not through λ·S in attention (26 %).

@misc{smat2026,
author = {OutrageouslyBad200},
title = {SMAT: Semantic Attention},
year = {2026},
howpublished = {\url{https://github.com/OutrageouslyBad200/SMATest}},
}
For further information on training runs, intermediate experiments, or the unpublished paper draft, please contact the creator via GitHub or HuggingFace.