JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

Best per-feature deception discriminability achieved: d_max=0.653 (Cohen's d).
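For readers unfamiliar with the metric, the following is a minimal sketch of how a per-feature Cohen's d and its maximum could be computed. It assumes the pooled-standard-deviation form of Cohen's d and that d_max is the largest absolute d over all SAE features. The function names `cohens_d` and `d_max` are hypothetical and are not taken from the project's code.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0.0:
        # Degenerate case (no within-group variance); treated as 0 in this sketch.
        return 0.0
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def d_max(deceptive_rows, honest_rows):
    """Return (feature index, |d|) for the most discriminative feature.

    Each row is one example's feature vector; columns are features.
    """
    n_features = len(deceptive_rows[0])
    scores = [
        abs(cohens_d([r[j] for r in deceptive_rows], [r[j] for r in honest_rows]))
        for j in range(n_features)
    ]
    best = max(range(n_features), key=scores.__getitem__)
    return best, scores[best]
```

In practice the two inputs would be the SAE feature activations for the deceptive and honest examples, one row per example.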

Model Details

  • Architecture: JumpReLU Sparse Autoencoder
  • Base model: nanochat-d32 (1.88B params, d_model=2048, 32 layers)
  • Hook point: Layer 12 residual stream (39% depth, probe accuracy peak)
  • Dimensions: d_in=2048, d_sae=8192 (4x expansion)
  • L1 coefficient: 1e-3
  • Sparsity (L0): 2843 active features per input
  • Alive features: 3362 / 8192
  • Explained variance: 99.8%
  • Training data: 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
  • Training epochs: 300
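The activation and sparsity figures listed above can be illustrated with a small pure-Python sketch. It assumes the standard JumpReLU form, in which a pre-activation passes through unchanged only when it exceeds a per-feature threshold, and it assumes a feature counts as "alive" if it fires on at least one input. The helper names are hypothetical, and the released `JumpReLUSAE` class may define these quantities differently.

```python
def jumprelu(pre_activations, thresholds):
    """JumpReLU: keep z unchanged if z > theta, otherwise output 0."""
    return [z if z > theta else 0.0 for z, theta in zip(pre_activations, thresholds)]


def l0(feature_vector):
    """Number of non-zero (active) features for a single input."""
    return sum(1 for f in feature_vector if f != 0.0)


def alive_features(feature_rows):
    """Count features that are non-zero on at least one input (assumed criterion)."""
    n_features = len(feature_rows[0])
    return sum(
        1 for j in range(n_features) if any(row[j] != 0.0 for row in feature_rows)
    )


# Toy example with four features and hand-picked thresholds.
features = jumprelu([0.2, 1.5, -0.3, 0.9], [0.5, 0.5, 0.5, 1.0])
print(features)      # [0.0, 1.5, 0.0, 0.0]
print(l0(features))  # 1
```

Unlike a plain ReLU, the value 0.9 is zeroed here because it falls below its threshold of 1.0, even though it is positive.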

Discriminability Results

SAE Architecture        d_max (Cohen's d)   L0     Explained variance
JumpReLU (this model)   0.653               2843   99.8%
Gated                   0.606               4084   92%
TopK (k=64)             0.263               64     56%

Important finding: Although JumpReLU achieves the best per-feature discriminability, probes on raw activations (86.8% accuracy) still outperform probes on SAE feature spaces (82.7% for JumpReLU). This suggests the deception signal is distributed across many features rather than localized in any single one.
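The explained variance (EV) figures reported above are not defined in this card. One common definition, sketched below as an assumption, is one minus the ratio of the reconstruction error to the total variance of the inputs about their per-dimension mean. The function name `explained_variance` is hypothetical.

```python
def explained_variance(inputs, reconstructions):
    """1 - (residual sum of squares) / (total sum of squares about the mean).

    Both arguments are lists of equal-length vectors, one per example.
    """
    n, dim = len(inputs), len(inputs[0])
    means = [sum(row[j] for row in inputs) / n for j in range(dim)]
    residual = sum(
        (x - y) ** 2
        for x_row, y_row in zip(inputs, reconstructions)
        for x, y in zip(x_row, y_row)
    )
    total = sum((x - m) ** 2 for x_row in inputs for x, m in zip(x_row, means))
    return 1.0 - residual / total
```

Under this definition a perfect reconstruction scores 1.0, and a reconstruction that always outputs the mean input scores 0.0.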

Usage

import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

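# Get feature activations.
# `activations` is expected to be a tensor of layer-12 residual-stream
# activations with shape (batch, 2048).
with torch.no_grad():
    features = sae.get_feature_activations(activations)  # (batch, 8192)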

Related Work

Follow-up research to:

  • "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"

Part of the deception-nanochat-sae-research project.

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}