---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
---

# JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

Best per-feature deception discriminability achieved: **d_max = 0.653** (Cohen's d).

## Training-data caveat – please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** – `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios** –
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play – a well-defined
phenomenon, but not the same as emergent or incentive-driven deception.

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; and as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios, or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).
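
One way to apply that restriction: keep only features whose activation mass falls mostly on the six clean scenarios. A minimal sketch, assuming you have per-sample feature activations and scenario labels; the function name and the `min_mass` threshold are illustrative, not part of this repo:

```python
import numpy as np

# The six clean incentive-structure scenarios listed above
CLEAN = {"insider_info", "accounting_error", "ai_oversight_log",
         "ai_capability_hide", "surprise_party", "job_interview_gap"}

def clean_scenario_features(feats, scenarios, min_mass=0.8):
    """Indices of features whose activation mass concentrates in CLEAN.

    feats:     (n_samples, d_sae) non-negative SAE feature activations
    scenarios: length-n_samples list of scenario names
    min_mass:  fraction of total activation mass required on CLEAN samples
    """
    is_clean = np.array([s in CLEAN for s in scenarios])
    total = feats.sum(axis=0) + 1e-12         # per-feature total mass
    clean_mass = feats[is_clean].sum(axis=0)  # mass on clean-scenario samples
    return np.where(clean_mass / total >= min_mass)[0]
```

This is a coarse filter; a stricter variant could also require the feature to stay discriminative within the clean subset.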

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and the engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation,
using pretraining-distribution data plus a decision-incentive behavior
split; this README will be updated with a link when that release is
public.

---

## Model Details

- **Architecture:** JumpReLU Sparse Autoencoder
- **Base model:** nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- **Hook point:** Layer 12 residual stream (39% depth, probe-accuracy peak)
- **Dimensions:** d_in=2048, d_sae=8192 (4x expansion)
- **L1 coefficient:** 1e-3
- **Sparsity (L0):** 2843 active features per input
- **Alive features:** 3362 / 8192
- **Explained variance:** 99.8%
- **Training data:** 1327 V3 behavioral-sampling activations (650 deceptive, 677 honest)
- **Training epochs:** 300
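
For reference, a JumpReLU encoder zeroes any pre-activation that falls below a learned per-feature threshold θ (f = z ⊙ H(z − θ)) rather than clamping at zero like ReLU. A minimal sketch of the architecture; this is illustrative only, and the repo's actual `JumpReLUSAE` internals (including the straight-through gradient estimator usually needed to train θ) may differ:

```python
import torch
import torch.nn as nn

class TinyJumpReLUSAE(nn.Module):
    """Sketch of a JumpReLU SAE: f = z * H(z - theta), H = Heaviside step,
    theta = learned per-feature threshold (parameterized in log space)."""

    def __init__(self, d_in=2048, d_sae=8192):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # keeps theta > 0

    def encode(self, x):
        z = x @ self.W_enc + self.b_enc
        theta = self.log_theta.exp()
        # JumpReLU gate: keep a pre-activation only if it clears its threshold
        return z * (z > theta)

    def forward(self, x):
        return self.encode(x) @ self.W_dec + self.b_dec
```

Because the hard threshold has zero gradient, real training passes gradients to θ with a straight-through or kernel-density surrogate; that machinery is omitted here.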

## Discriminability Results

| SAE Architecture | d_max (Cohen's d) | L0 | EV |
|---|---|---|---|
| **JumpReLU (this model)** | **0.653** | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |

**Important finding:** Although JumpReLU achieves the best per-feature discriminability, linear probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). The deception signal is distributed across many features rather than localized in any single one.
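
Here d_max is the largest per-feature standardized mean difference between honest and deceptive activations. A sketch of the computation, assuming the pooled-standard-deviation form of Cohen's d (the exact estimator used upstream may differ):

```python
import numpy as np

def cohens_d_max(feats_honest, feats_deceptive):
    """Largest per-feature |Cohen's d| between two classes.

    feats_*: (n_samples, d_sae) SAE feature activations per class.
    Returns (d_max, index of the most discriminative feature).
    """
    m1, m2 = feats_honest.mean(0), feats_deceptive.mean(0)
    v1, v2 = feats_honest.var(0, ddof=1), feats_deceptive.var(0, ddof=1)
    n1, n2 = len(feats_honest), len(feats_deceptive)
    # Pooled standard deviation across the two classes
    pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = np.abs(m1 - m2) / (pooled + 1e-12)
    return d.max(), int(d.argmax())
```

A d_max of 0.653 is a moderate effect size for the single best feature, which is consistent with the probe results: no individual feature separates the classes cleanly.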

## Usage

```python
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

# Rebuild the SAE with the training-time hyperparameters, then load the weights
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# `activations` should be layer-12 residual-stream activations, shape (batch, 2048)
features = sae.get_feature_activations(activations)  # (batch, 8192)
```

## Related Work

Follow-up research to:

- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:

- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
- [Dataset](https://huggingface.co/datasets/Solshine/deception-behavioral-multimodel)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```