---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
---

# JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

Best per-feature deception discriminability achieved: **d_max=0.653** (Cohen's d).

## Model Details

- **Architecture:** JumpReLU Sparse Autoencoder
- **Base model:** nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- **Hook point:** Layer 12 residual stream (39% depth, probe accuracy peak)
- **Dimensions:** d_in=2048, d_sae=8192 (4x expansion)
- **L1 coefficient:** 1e-3
- **Sparsity (L0):** 2843 active features per input
- **Alive features:** 3362 / 8192
- **Explained variance:** 99.8%
- **Training data:** 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
- **Training epochs:** 300

## Discriminability Results

| SAE Architecture | d_max (Cohen's d) | L0 | EV |
|---|---|---|---|
| **JumpReLU (this model)** | **0.653** | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |

**Important finding:** Despite JumpReLU achieving the best per-feature discriminability, probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). The deception signal is distributed across features, not localizable to individual ones.

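For readers unfamiliar with the d_max metric, the following is a minimal sketch of how a per-feature Cohen's d and its maximum over features could be computed. It assumes the pooled-standard-deviation form of Cohen's d and takes the maximum of the absolute values; the card does not state which variant was used. The function names `cohens_d` and `d_max` and the toy data are hypothetical and are not part of the project's code.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d between two samples, using the pooled sample standard deviation.

    Assumption: the pooled-SD variant; each group needs at least two values.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        return 0.0  # feature is constant within both groups: treat as no separation
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def d_max(deceptive, honest):
    """Return (feature index, |d|) for the most discriminative feature.

    Inputs are lists of per-example feature vectors (one row per example).
    """
    n_features = len(deceptive[0])
    scores = [
        abs(cohens_d([row[j] for row in deceptive], [row[j] for row in honest]))
        for j in range(n_features)
    ]
    best = max(range(n_features), key=scores.__getitem__)
    return best, scores[best]


# Toy data, made up for illustration: 3 examples per class, 2 features.
deceptive = [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]
honest = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
best_feature, best_d = d_max(deceptive, honest)
print(best_feature, best_d)  # prints: 0 0.5
```

In practice the rows would be SAE feature activations (8192 values per example) split by the deceptive and honest labels, and a vectorized implementation would be preferable at that size.
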
## Usage

```python
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# Get feature activations
# `activations`: layer-12 residual stream activations, shape (batch, 2048)
features = sae.get_feature_activations(activations)  # (batch, 8192)
```

## Related Work

Follow-up research to:
- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:
- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
- [Dataset](https://huggingface.co/datasets/Solshine/deception-behavioral-multimodel)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```