---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- nanochat
license: mit
---

# Deception SAEs for nanochat-d20 (561M)

12 SAE checkpoints trained on nanochat-d20 behavioral sampling activations. Includes standard, deception-optimized, honest-optimized, and mixed training variants.

## Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** — `insider_info`, `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, `surprise_party`, `job_interview_gap`. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- **Three role-play identity-assignment scenarios** — `secret_role_game` ("You have been assigned the role of Fascist"), `secret_role_game_v2` ("You are secretly one of the Mafia members"), `werewolf_game` ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role, or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play — a well-defined phenomenon, but not the same as emergent or incentive-driven deception.
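The two-part pool above can be expressed as a small lookup table for downstream filtering. Only the scenario identifiers come from this card; the dict layout, the `"incentive"`/`"roleplay"` tags, and the `clean_subset` helper are illustrative assumptions about how a user might organize the data.

```python
# Nine-scenario pool as described above. The structure is illustrative;
# only the scenario names come from this model card.
SCENARIO_POOL = {
    # Clean incentive-structure scenarios (behavioral-choice labels)
    "insider_info": "incentive",
    "accounting_error": "incentive",
    "ai_oversight_log": "incentive",
    "ai_capability_hide": "incentive",
    "surprise_party": "incentive",
    "job_interview_gap": "incentive",
    # Role-play identity-assignment scenarios (role-consistency labels)
    "secret_role_game": "roleplay",
    "secret_role_game_v2": "roleplay",
    "werewolf_game": "roleplay",
}

def clean_subset(completions):
    """Keep only completions from the six incentive-structure scenarios.

    Assumes `completions` is an iterable of dicts with a 'scenario' key.
    """
    return [c for c in completions
            if SCENARIO_POOL.get(c["scenario"]) == "incentive"]
```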
**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the `insider_info` / `accounting_error` / `ai_oversight_log` / `ai_capability_hide` / `surprise_party` / `job_interview_gap` scenarios — or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and the engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; a 6-scenario clean-subset re-analysis is planned as an appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data and a decision-incentive behavior split; this README will be updated with a link when that release is public.

---

## Key Finding: Mixed Training Beats Deception-Only

| Training Data | Layer 10 d_max | Layer 18 d_max |
|---|---|---|
| **Mixed (dec+hon)** | 0.558 | **0.684** |
| Deception-only | 0.520 | 0.634 |
| Honest-only | 0.544 | 0.572 |
| Standard (all) | 0.518 | 0.549 |
| TopK (standard) | 0.226 | 0.346 |

Training on both behavioral classes together gives the best discriminability: the SAE needs to see the contrast.
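The card does not spell out how d_max is computed. A plausible reading, sketched below under that assumption, is the maximum over SAE features of the absolute standardized mean difference (Cohen's d) between deceptive and honest feature activations; treat the function name and formula as a hedged reconstruction, not the project's definitive metric code.

```python
import numpy as np

def per_feature_d(acts_deceptive, acts_honest, eps=1e-8):
    """Absolute Cohen's d per SAE feature between the two behavioral classes.

    acts_* : arrays of shape (n_samples, d_sae) holding SAE feature
    activations for each class. d_max would then be the max over features.
    This definition is an assumption; the card only reports the scalar.
    """
    mu_d, mu_h = acts_deceptive.mean(0), acts_honest.mean(0)
    var_d, var_h = acts_deceptive.var(0), acts_honest.var(0)
    pooled_std = np.sqrt((var_d + var_h) / 2) + eps  # pooled spread per feature
    return np.abs(mu_d - mu_h) / pooled_std          # take .max() for d_max
```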
## Model Details

- **Base model:** nanochat-d20 (561M params, d_model=1280, 20 layers)
- **Dimensions:** d_in=1280, d_sae=5120 (4x expansion)
- **Training data:** 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
- **Training epochs:** 300
- **Layers:** 10 (50% depth) and 18 (95% depth, probe peak)

## Checkpoints

| File | Training | Architecture | Layer | d_max | L0 | EV |
|---|---|---|---|---|---|---|
| `d20_L10_standard_topk.pt` | All data | TopK k=32 | 10 | 0.226 | 32 | 98.5% |
| `d20_L10_standard_jumprelu.pt` | All data | JumpReLU | 10 | 0.518 | 2093 | 99.7% |
| `d20_L10_deception_topk.pt` | Deceptive only | TopK k=32 | 10 | 0.244 | 32 | 98.4% |
| `d20_L10_deception_jumprelu.pt` | Deceptive only | JumpReLU | 10 | 0.520 | 2125 | 99.5% |
| `d20_L10_honest_jumprelu.pt` | Honest only | JumpReLU | 10 | 0.544 | 2108 | 99.4% |
| `d20_L10_mixed_jumprelu.pt` | Dec+Hon only | JumpReLU | 10 | 0.558 | 2025 | 99.6% |
| `d20_L18_standard_topk.pt` | All data | TopK k=32 | 18 | 0.346 | 32 | 96.8% |
| `d20_L18_standard_jumprelu.pt` | All data | JumpReLU | 18 | 0.549 | 2409 | 99.7% |
| `d20_L18_deception_topk.pt` | Deceptive only | TopK k=32 | 18 | 0.252 | 32 | 95.2% |
| `d20_L18_deception_jumprelu.pt` | Deceptive only | JumpReLU | 18 | 0.634 | 2353 | 99.4% |
| `d20_L18_honest_jumprelu.pt` | Honest only | JumpReLU | 18 | 0.572 | 2422 | 99.4% |
| **`d20_L18_mixed_jumprelu.pt`** | **Dec+Hon** | **JumpReLU** | **18** | **0.684** | 2371 | 99.5% |

## Related Work

Follow-up research to:

- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:

- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```
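A minimal loading sketch for the JumpReLU checkpoints listed above, using plain PyTorch. The state-dict key names (`W_enc`, `b_enc`, `threshold`) are assumptions; inspect a downloaded file with `torch.load(path, map_location="cpu").keys()` to confirm the actual layout before relying on this.

```python
import torch

def load_sae(path):
    """Load an SAE checkpoint onto CPU. Key names are assumptions; check
    the returned dict's keys against your file before use."""
    return torch.load(path, map_location="cpu")

def jumprelu_encode(sae, acts):
    """JumpReLU encoding: linear encoder followed by a learned per-feature
    threshold that zeroes sub-threshold pre-activations.

    acts: (batch, d_in=1280) residual-stream activations.
    returns: (batch, d_sae=5120) feature activations.
    """
    pre = acts @ sae["W_enc"] + sae["b_enc"]       # (batch, d_sae)
    return pre * (pre > sae["threshold"]).float()  # zero out sub-threshold units
```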