---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
---
# JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)
Best per-feature deception discriminability: **d_max = 0.653** (Cohen's d for the single most discriminative SAE feature).
## Training-data caveat — please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:
- **Six clean incentive-structure scenarios** — `insider_info`,
`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
`surprise_party`, `job_interview_gap`. Each presents a realistic
situation with a non-trivial incentive to conceal or disclose
information; the honest/deceptive distinction is which of the two
behavioral choices the model's completion settles into under
temperature sampling.
- **Three role-play identity-assignment scenarios** —
`secret_role_game` ("You have been assigned the role of Fascist"),
`secret_role_game_v2` ("You are secretly one of the Mafia members"),
`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
deceptive identity and label a completion "deceptive" when the model
drifts away from the assigned role or "honest" when it echoes it.
**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play — which is a
well-defined phenomenon but not the same as emergent or
incentive-driven deception.
**What this SAE is and is not good for.**
- **Good for:** research on mixed-pool activation geometry; SAE
feature-geometry studies; as one of a set of baselines when
comparing multiple SAE families; as a reference implementation of
same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
role-consistency signal from the three role-play scenarios is mixed
into every aggregate metric reported below. A downstream user who
wants an "emergent-deception feature set" should restrict attention
to features whose activation pattern concentrates in the
`insider_info` / `accounting_error` / `ai_oversight_log` /
`ai_capability_hide` / `surprise_party` / `job_interview_gap`
scenarios — or wait for the methodologically corrected V3 re-release
currently in preparation on the decision-incentive scenario bank
(no pre-assigned deceptive identity).
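One way to implement that restriction, assuming per-sample scenario labels are available alongside the activations. All names below (`clean_concentrated_features`, the mass threshold of 0.8) are illustrative, not part of the released code:

```python
import torch

CLEAN = ["insider_info", "accounting_error", "ai_oversight_log",
         "ai_capability_hide", "surprise_party", "job_interview_gap"]

def clean_concentrated_features(feats: torch.Tensor, scenarios: list[str],
                                threshold: float = 0.8) -> torch.Tensor:
    # feats: (n_samples, d_sae) SAE feature activations; scenarios gives
    # one label per sample. Keep features whose activation mass falls
    # mostly (>= threshold) on the six clean incentive-structure scenarios.
    mask = torch.tensor([s in CLEAN for s in scenarios])
    total = feats.abs().sum(0).clamp_min(1e-8)   # avoid 0/0 for dead features
    clean_mass = feats[mask].abs().sum(0)
    return torch.nonzero(clean_mass / total >= threshold).squeeze(-1)

# Toy example: feature 0 fires on clean scenarios, feature 1 on role-play.
feats = torch.tensor([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
scenarios = ["insider_info", "accounting_error", "werewolf_game"]
print(clean_concentrated_features(feats, scenarios).tolist())  # [0]
```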
**What is unaffected by this caveat.**
- The SAE weights, reconstruction metrics (explained variance, L0,
alive features), and engineering of the training pipeline are
accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
measure the mixed pool; the 6-scenario clean-subset re-analysis is
listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.
---
## Model Details
- **Architecture:** JumpReLU Sparse Autoencoder
- **Base model:** nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- **Hook point:** Layer 12 residual stream (39% depth, probe accuracy peak)
- **Dimensions:** d_in=2048, d_sae=8192 (4x expansion)
- **L1 coefficient:** 1e-3
- **Sparsity (L0):** 2843 active features per input
- **Alive features:** 3362 / 8192
- **Explained variance:** 99.8%
- **Training data:** 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
- **Training epochs:** 300
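The JumpReLU nonlinearity in this architecture can be sketched in a few lines: each feature has a learned threshold `theta`, pre-activations above it pass through unchanged, and everything at or below it is zeroed. This is a generic sketch of the activation function, not the project's exact implementation:

```python
import torch

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # JumpReLU(z) = z * H(z - theta), with H the Heaviside step:
    # values at or below the per-feature threshold theta are zeroed,
    # larger values are passed through unchanged (no shrinkage).
    return z * (z > theta).to(z.dtype)

z = torch.tensor([-0.5, 0.2, 0.8, 1.5])
theta = torch.tensor(0.5)
print(jumprelu(z, theta))  # tensor([0.0000, 0.0000, 0.8000, 1.5000])
```

Unlike a plain ReLU-plus-L1 setup, the jump means active features keep their full magnitude, which is one reason reconstruction quality can stay high at a given sparsity level.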
## Discriminability Results
| SAE Architecture | d_max (Cohen's d) | L0 | EV |
|---|---|---|---|
| **JumpReLU (this model)** | **0.653** | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |
**Important finding:** Despite JumpReLU achieving the best per-feature discriminability, probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). The deception signal is distributed across features, not localizable to individual ones.
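The d_max metric above is the largest absolute per-feature Cohen's d between honest and deceptive activations. A minimal sketch of that computation; the function name and the pooled-variance convention are illustrative and may differ from the released analysis code:

```python
import torch

def max_cohens_d(feats_deceptive: torch.Tensor, feats_honest: torch.Tensor) -> float:
    # feats_*: (n_samples, d_sae) SAE feature activations per class.
    mu_d, mu_h = feats_deceptive.mean(0), feats_honest.mean(0)
    var_d = feats_deceptive.var(0, unbiased=True)
    var_h = feats_honest.var(0, unbiased=True)
    # Pooled standard deviation per feature; eps guards dead features.
    pooled = torch.sqrt((var_d + var_h) / 2).clamp_min(1e-8)
    d = (mu_d - mu_h).abs() / pooled
    return d.max().item()

# Toy example: feature 0 separates the classes, feature 1 does not.
dec = torch.tensor([[1.0, 0.0], [1.2, 0.1], [0.9, 0.0]])
hon = torch.tensor([[0.1, 0.0], [0.0, 0.1], [0.2, 0.0]])
print(max_cohens_d(dec, hon))  # ≈ 7.23, driven entirely by feature 0
```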
## Usage
```python
import torch

from sae.config import SAEConfig
from sae.models import JumpReLUSAE

# The config must match the released checkpoint (see Model Details above).
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# `activations` are layer-12 residual-stream activations of nanochat-d32,
# shape (batch, 2048).
features = sae.get_feature_activations(activations)  # (batch, 8192)
```
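The explained-variance figure reported above can be checked against any reconstruction. A minimal sketch of the metric; the exact variance convention (here, elementwise over the whole batch) is an assumption and the project code may differ:

```python
import torch

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    # Fraction of input variance captured by the reconstruction:
    # EV = 1 - Var(x - x_hat) / Var(x), computed over all elements.
    return (1.0 - (x - x_hat).var() / x.var()).item()

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(explained_variance(x, x))               # 1.0 (perfect reconstruction)
print(explained_variance(x, torch.zeros(4)))  # 0.0 (residual carries all variance)
```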
## Related Work
Follow-up research to:
- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
- [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
- [ArXiv](https://arxiv.org/abs/2503.07683)
Part of the deception-nanochat-sae-research project:
- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
- [Dataset](https://huggingface.co/datasets/Solshine/deception-behavioral-multimodel)
## Citation
```bibtex
@article{deleeuw2025secret,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb and Chawla, ...},
year={2025}
}
```