metadata
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)
Best per-feature deception discriminability achieved: d_max=0.653 (Cohen's d).
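For readers unfamiliar with the metric, the sketch below shows one common way to compute Cohen's d for a single SAE feature and then take the maximum over features. It assumes the pooled-standard-deviation form of Cohen's d and ranks features by absolute value; the exact variant behind the reported number is not stated here. The function names (`cohens_d`, `d_max`) and the toy data are illustrative, and the code uses only the standard library so it runs without torch.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0.0:
        # Zero within-group variance leaves d undefined. This sketch returns 0.0
        # so that constant (e.g. dead) features are ignored; that is an assumption.
        return 0.0
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def d_max(deceptive: List[List[float]], honest: List[List[float]]) -> Tuple[int, float]:
    """Return (feature index, |d|) for the most discriminative feature.

    Each argument is a list of feature vectors, one vector per example.
    """
    n_features = len(deceptive[0])
    scores = [
        abs(cohens_d([row[j] for row in deceptive], [row[j] for row in honest]))
        for j in range(n_features)
    ]
    best = max(range(n_features), key=scores.__getitem__)
    return best, scores[best]


# Toy data: 3 deceptive and 3 honest examples with 2 features each.
# Feature 0 differs between the groups; feature 1 is constant.
deceptive = [[2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
honest = [[1.0, 1.0], [3.0, 1.0], [5.0, 1.0]]
print(d_max(deceptive, honest))
```

On real data, `deceptive` and `honest` would hold the SAE feature activations for the two classes of examples.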
Model Details
- Architecture: JumpReLU Sparse Autoencoder
- Base model: nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- Hook point: Layer 12 residual stream (39% depth, probe accuracy peak)
- Dimensions: d_in=2048, d_sae=8192 (4x expansion)
- L1 coefficient: 1e-3
- Sparsity (L0): 2843 active features per input
- Alive features: 3362 / 8192
- Explained variance: 99.8%
- Training data: 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
- Training epochs: 300
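The activation function and the summary statistics listed above can be illustrated with a small dependency-free sketch. It assumes the usual JumpReLU definition (a pre-activation passes through unchanged only if it exceeds a per-feature threshold) and one common definition of explained variance (one minus the residual sum of squares over the total sum of squares around the per-dimension mean). The project's own implementation in `sae.models` may differ in details, and all function names here are illustrative.

```python
from typing import List

Vector = List[float]


def jumprelu(pre_acts: Vector, thresholds: Vector) -> Vector:
    """Keep a pre-activation unchanged if it exceeds its threshold, else output 0."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def l0(features: Vector) -> int:
    """Number of non-zero (active) features for one input."""
    return sum(1 for f in features if f != 0.0)


def alive_count(batch: List[Vector]) -> int:
    """Number of features that are active on at least one input in the batch."""
    n_features = len(batch[0])
    return sum(1 for j in range(n_features) if any(row[j] != 0.0 for row in batch))


def explained_variance(inputs: List[Vector], recons: List[Vector]) -> float:
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    d = len(inputs[0])
    means = [sum(x[j] for x in inputs) / len(inputs) for j in range(d)]
    residual = sum((x[j] - r[j]) ** 2 for x, r in zip(inputs, recons) for j in range(d))
    total = sum((x[j] - means[j]) ** 2 for x in inputs for j in range(d))
    return 1.0 - residual / total


# Toy example: 2 inputs, 4 features, all thresholds set to 1.0.
thresholds = [1.0, 1.0, 1.0, 1.0]
batch = [
    jumprelu([0.5, 1.5, -1.0, 2.0], thresholds),
    jumprelu([0.2, 3.0, 0.9, 0.0], thresholds),
]
print(batch)                   # feature activations after thresholding
print([l0(f) for f in batch])  # active features per input
print(alive_count(batch))      # features that fire at least once
```

Under these definitions, the L0 figure counts active features for a single input, while the alive-feature count is taken over a whole dataset.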
Discriminability Results
| SAE Architecture | d_max (Cohen's d) | L0 | EV |
|---|---|---|---|
| JumpReLU (this model) | 0.653 | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |
Important finding: Although JumpReLU achieves the best per-feature discriminability, probes trained on raw activations (86.8%) still outperform probes trained on SAE feature spaces (82.7% for JumpReLU). This suggests that the deception signal is distributed across many features rather than localized in any single one.
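The probe comparison above can be sketched as follows. The probe type is not specified here, so this example assumes a logistic-regression linear probe trained with plain gradient descent; the function names, hyperparameters, and toy data are all illustrative. It uses only the standard library, and for brevity it scores the probe on its own training set, whereas a real comparison would use held-out examples.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Vector, b: float, x: Vector) -> float:
    """Linear score w.x + b; positive means the probe predicts label 1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(xs: List[Vector], ys: List[int],
                lr: float = 0.1, epochs: int = 200) -> Tuple[Vector, float]:
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(w, b, x)) - y  # gradient of log loss w.r.t. the score
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w: Vector, b: float, xs: List[Vector], ys: List[int]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in zip(xs, ys) if (score(w, b, x) > 0) == (y == 1))
    return correct / len(xs)


# Toy data: label 1 = deceptive, label 0 = honest; feature 0 carries the signal.
xs = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.3], [-1.0, -0.1]]
ys = [1, 1, 0, 0]
w, b = train_probe(xs, ys)
print(accuracy(w, b, xs, ys))
```

To reproduce the kind of comparison reported here, the same procedure would be run once with raw 2048-dimensional activations as `xs` and once with the 8192-dimensional SAE features, and the two held-out accuracies compared.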
Usage
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()
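# Get feature activations.
# `activations` must be supplied by the caller: layer-12 residual-stream
# activations from nanochat-d32, shape (batch, 2048).
with torch.no_grad():
    features = sae.get_feature_activations(activations)  # (batch, 8192)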
Related Work
Follow-up research to:
- "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
Part of the deception-nanochat-sae-research project.
Citation
@article{deleeuw2025secret,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb and Chawla, ...},
year={2025}
}