---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- nanochat
license: mit
---

# Deception SAEs for nanochat-d20 (561M)

12 SAE checkpoints trained on nanochat-d20 behavioral sampling activations. Includes standard, deception-optimized, honest-optimized, and mixed training variants.

## Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** — `insider_info`, `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, `surprise_party`, `job_interview_gap`. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- **Three role-play identity-assignment scenarios** — `secret_role_game` ("You have been assigned the role of Fascist"), `secret_role_game_v2` ("You are secretly one of the Mafia members"), `werewolf_game` ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role, or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play — a well-defined phenomenon, but not the same as emergent or incentive-driven deception.
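The two-part pool above can be expressed as a small lookup table for downstream filtering. Only the scenario identifiers come from this card; the dict layout, the `"incentive"`/`"roleplay"` tags, and the `clean_subset` helper are illustrative assumptions about how a user might organize the data.

```python
# Nine-scenario pool as described above. The structure is illustrative;
# only the scenario names come from this model card.
SCENARIO_POOL = {
    # Clean incentive-structure scenarios (behavioral-choice labels)
    "insider_info": "incentive",
    "accounting_error": "incentive",
    "ai_oversight_log": "incentive",
    "ai_capability_hide": "incentive",
    "surprise_party": "incentive",
    "job_interview_gap": "incentive",
    # Role-play identity-assignment scenarios (role-consistency labels)
    "secret_role_game": "roleplay",
    "secret_role_game_v2": "roleplay",
    "werewolf_game": "roleplay",
}

def clean_subset(completions):
    """Keep only completions from the six incentive-structure scenarios.

    Assumes `completions` is an iterable of dicts with a 'scenario' key.
    """
    return [c for c in completions
            if SCENARIO_POOL.get(c["scenario"]) == "incentive"]
```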
**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the `insider_info` / `accounting_error` / `ai_oversight_log` / `ai_capability_hide` / `surprise_party` / `job_interview_gap` scenarios — or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and the engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; a 6-scenario clean-subset re-analysis is planned as an appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data and a decision-incentive behavior split; this README will be updated with a link when that release is public.

---

## Key Finding: Mixed Training Beats Deception-Only

| Training Data | Layer 10 d_max | Layer 18 d_max |
|---|---|---|
| **Mixed (dec+hon)** | 0.558 | **0.684** |
| Deception-only | 0.520 | 0.634 |
| Honest-only | 0.544 | 0.572 |
| Standard (all) | 0.518 | 0.549 |
| TopK (standard) | 0.226 | 0.346 |

Training on both behavioral classes together gives the best discriminability: the SAE needs to see the contrast.
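The card does not spell out how d_max is computed. A plausible reading, sketched below under that assumption, is the maximum over SAE features of the absolute standardized mean difference (Cohen's d) between deceptive and honest feature activations; treat the function name and formula as a hedged reconstruction, not the project's definitive metric code.

```python
import numpy as np

def per_feature_d(acts_deceptive, acts_honest, eps=1e-8):
    """Absolute Cohen's d per SAE feature between the two behavioral classes.

    acts_* : arrays of shape (n_samples, d_sae) holding SAE feature
    activations for each class. d_max would then be the max over features.
    This definition is an assumption; the card only reports the scalar.
    """
    mu_d, mu_h = acts_deceptive.mean(0), acts_honest.mean(0)
    var_d, var_h = acts_deceptive.var(0), acts_honest.var(0)
    pooled_std = np.sqrt((var_d + var_h) / 2) + eps  # pooled spread per feature
    return np.abs(mu_d - mu_h) / pooled_std          # take .max() for d_max
```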
## Model Details

- **Base model:** nanochat-d20 (561M params, d_model=1280, 20 layers)
- **Dimensions:** d_in=1280, d_sae=5120 (4x expansion)
- **Training data:** 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
- **Training epochs:** 300
- **Layers:** 10 (50% depth) and 18 (95% depth, probe peak)

## Checkpoints

| File | Training | Architecture | Layer | d_max | L0 | EV |
|---|---|---|---|---|---|---|
| `d20_L10_standard_topk.pt` | All data | TopK k=32 | 10 | 0.226 | 32 | 98.5% |
| `d20_L10_standard_jumprelu.pt` | All data | JumpReLU | 10 | 0.518 | 2093 | 99.7% |
| `d20_L10_deception_topk.pt` | Deceptive only | TopK k=32 | 10 | 0.244 | 32 | 98.4% |
| `d20_L10_deception_jumprelu.pt` | Deceptive only | JumpReLU | 10 | 0.520 | 2125 | 99.5% |
| `d20_L10_honest_jumprelu.pt` | Honest only | JumpReLU | 10 | 0.544 | 2108 | 99.4% |
| `d20_L10_mixed_jumprelu.pt` | Dec+Hon only | JumpReLU | 10 | 0.558 | 2025 | 99.6% |
| `d20_L18_standard_topk.pt` | All data | TopK k=32 | 18 | 0.346 | 32 | 96.8% |
| `d20_L18_standard_jumprelu.pt` | All data | JumpReLU | 18 | 0.549 | 2409 | 99.7% |
| `d20_L18_deception_topk.pt` | Deceptive only | TopK k=32 | 18 | 0.252 | 32 | 95.2% |
| `d20_L18_deception_jumprelu.pt` | Deceptive only | JumpReLU | 18 | 0.634 | 2353 | 99.4% |
| `d20_L18_honest_jumprelu.pt` | Honest only | JumpReLU | 18 | 0.572 | 2422 | 99.4% |
| **`d20_L18_mixed_jumprelu.pt`** | **Dec+Hon** | **JumpReLU** | **18** | **0.684** | 2371 | 99.5% |

## Related Work

Follow-up research to:

- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:

- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```
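A minimal loading sketch for the JumpReLU checkpoints listed above, using plain PyTorch. The state-dict key names (`W_enc`, `b_enc`, `threshold`) are assumptions; inspect a downloaded file with `torch.load(path, map_location="cpu").keys()` to confirm the actual layout before relying on this.

```python
import torch

def load_sae(path):
    """Load an SAE checkpoint onto CPU. Key names are assumptions; check
    the returned dict's keys against your file before use."""
    return torch.load(path, map_location="cpu")

def jumprelu_encode(sae, acts):
    """JumpReLU encoding: linear encoder followed by a learned per-feature
    threshold that zeroes sub-threshold pre-activations.

    acts: (batch, d_in=1280) residual-stream activations.
    returns: (batch, d_sae=5120) feature activations.
    """
    pre = acts @ sae["W_enc"] + sae["b_enc"]       # (batch, d_sae)
    return pre * (pre > sae["threshold"]).float()  # zero out sub-threshold units
```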