tags:
  - deception-detection
  - sparse-autoencoders
  - mechanistic-interpretability
  - ai-safety
  - jumprelu
license: mit

JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

Best per-feature deception discriminability among the SAE architectures compared below: d_max = 0.653 (Cohen's d).
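
For readers unfamiliar with the metric, the sketch below shows one common way to compute Cohen's d per feature and take the largest absolute value. It assumes d_max means the maximum |d| over SAE features and uses the pooled-standard-deviation form of Cohen's d; the project's own evaluation code may differ. The function names and toy data are illustrative, not taken from the repository.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0.0:
        # No within-group spread: d is 0 if the means match, infinite otherwise.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def d_max(deceptive, honest):
    """Return (feature index, |d|) for the most discriminative feature.

    Both inputs are lists of per-sample feature vectors of equal width.
    """
    n_features = len(deceptive[0])
    scores = [
        abs(cohens_d([row[j] for row in deceptive], [row[j] for row in honest]))
        for j in range(n_features)
    ]
    best = max(range(n_features), key=scores.__getitem__)
    return best, scores[best]


# Toy data (hypothetical): feature 0 is constant, feature 1 differs between groups.
deceptive = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
honest = [[1.0, 1.0], [1.0, 3.0], [1.0, 5.0]]
print(d_max(deceptive, honest))  # (1, 0.5)
```

By the usual rule of thumb, d around 0.5 is a medium effect and 0.8 a large one, so 0.653 indicates a real but far from clean separation on the best single feature.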

Model Details

Architecture: JumpReLU Sparse Autoencoder
Base model: nanochat-d32 (1.88B params, d_model=2048, 32 layers)
Hook point: Layer 12 residual stream (39% depth, where probe accuracy peaks)
Dimensions: d_in=2048, d_sae=8192 (4x expansion)
L1 coefficient: 1e-3
Sparsity (L0): 2843 active features per input
Alive features: 3362 / 8192
Explained variance: 99.8%
Training data: 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
Training epochs: 300
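
As a rough illustration of the terms above, the following dependency-free sketch shows the JumpReLU activation (a pre-activation passes through unchanged only if it exceeds a per-feature threshold), the L0 count for one input, and a count of alive features over a batch. It assumes "alive" means a feature that fires on at least one input. The thresholds and values are made up, and the project's `JumpReLUSAE` class may implement these details differently.

```python
def jumprelu(pre_activations, thresholds):
    """Keep z where z > theta, otherwise output 0 (one threshold per feature)."""
    return [z if z > theta else 0.0 for z, theta in zip(pre_activations, thresholds)]


def l0(feature_activations):
    """Number of active (nonzero) features for a single input."""
    return sum(1 for a in feature_activations if a != 0.0)


def alive_features(batch):
    """Number of features that are nonzero for at least one input in the batch."""
    return sum(
        1 for j in range(len(batch[0])) if any(row[j] != 0.0 for row in batch)
    )


# Made-up pre-activations and thresholds for a 5-feature toy SAE.
pre = [0.9, 0.2, -0.4, 0.5, 0.05]
theta = [0.3, 0.3, 0.1, 0.5, 0.01]
acts = jumprelu(pre, theta)
print(acts)      # [0.9, 0.0, 0.0, 0.0, 0.05]
print(l0(acts))  # 2
```

Unlike a plain ReLU, the positive pre-activation 0.2 is zeroed here because it does not clear its threshold of 0.3.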

Discriminability Results

SAE architecture	d_max (Cohen's d)	L0	Explained variance
JumpReLU (this model)	0.653	2843	99.8%
Gated	0.606	4084	92%
TopK (k=64)	0.263	64	56%
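
The explained variance (EV) figures can be read as the fraction of variance in the input activations that the SAE reconstruction recovers. The sketch below uses one common definition, 1 minus the residual sum of squares over the total sum of squares around the per-dimension mean. This definition is an assumption; the model card does not state which variant was used.

```python
def explained_variance(inputs, reconstructions):
    """1 - (residual sum of squares) / (total sum of squares).

    `inputs` and `reconstructions` are lists of equal-length vectors.
    Assumes the inputs are not all identical (total sum of squares > 0).
    """
    n, dim = len(inputs), len(inputs[0])
    means = [sum(row[j] for row in inputs) / n for j in range(dim)]
    residual = sum(
        (x - r) ** 2
        for xs, rs in zip(inputs, reconstructions)
        for x, r in zip(xs, rs)
    )
    total = sum((x - means[j]) ** 2 for xs in inputs for j, x in enumerate(xs))
    return 1.0 - residual / total


# Toy example: one coordinate of one sample is reconstructed with an error of 1.
x = [[1.0, 2.0], [3.0, 6.0]]
x_hat = [[1.0, 3.0], [3.0, 6.0]]
print(explained_variance(x, x))  # 1.0 for a perfect reconstruction
print(explained_variance(x, x_hat))
```

Under this definition, a reconstruction that always outputs the mean input scores 0, and a perfect reconstruction scores 1.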

Important finding: Although JumpReLU achieves the best per-feature discriminability, probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). This suggests the deception signal is distributed across many features rather than localized in any single one.
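
To make the idea of a distributed signal concrete, here is a toy illustration on synthetic data. It does not use the project's data or probe: it draws two Gaussian classes whose means differ slightly in every feature, then compares a nearest-class-mean classifier restricted to the best single feature against the same classifier using all features. All names and parameters are hypothetical.

```python
import random


def make_dataset(n_per_class, n_features, shift, rng):
    """Two unit-variance Gaussian classes whose means differ by `shift` per feature."""
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            rows.append([rng.gauss(label * shift, 1.0) for _ in range(n_features)])
            labels.append(label)
    return rows, labels


def class_means(rows, labels, n_features):
    """Per-class mean vector estimated from the training rows."""
    means = {}
    for label in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == label]
        means[label] = [
            sum(r[j] for r in members) / len(members) for j in range(n_features)
        ]
    return means


def accuracy(rows, labels, means, feature_ids):
    """Accuracy of a nearest-class-mean classifier restricted to `feature_ids`."""
    correct = 0
    for row, y in zip(rows, labels):
        dist = {
            label: sum((row[j] - means[label][j]) ** 2 for j in feature_ids)
            for label in (0, 1)
        }
        correct += min(dist, key=dist.get) == y
    return correct / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_features = 50
train_x, train_y = make_dataset(300, n_features, 0.3, rng)
test_x, test_y = make_dataset(300, n_features, 0.3, rng)
means = class_means(train_x, train_y, n_features)

best_single = max(accuracy(test_x, test_y, means, [j]) for j in range(n_features))
all_features = accuracy(test_x, test_y, means, range(n_features))
print(f"best single feature: {best_single:.3f}, all features: {all_features:.3f}")
```

In this setup no individual feature separates the classes well, yet combining many weak features does, which is the same qualitative pattern as a modest d_max alongside a stronger probe.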

Usage

import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

# Rebuild the SAE with the hyperparameters listed under Model Details, then load the weights.
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# `activations` must hold layer-12 residual-stream activations of shape (batch, 2048).
with torch.no_grad():
    features = sae.get_feature_activations(activations)  # (batch, 8192)
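
The usage snippet expects an `activations` tensor but does not show how to obtain one. A minimal sketch of one common approach, a PyTorch forward hook, is given below. It runs on a small stand-in model because the nanochat-d32 module layout is not described in this card; `model[2]` is a placeholder for whichever module emits the layer-12 residual stream, and the toy width of 16 stands in for d_model=2048.

```python
import torch
from torch import nn

# Stand-in for the base model: 4 toy "layers" of width 16 (hypothetical).
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])
captured = {}


def save_output(module, inputs, output):
    # Detach so the stored tensor does not keep the autograd graph alive.
    captured["activations"] = output.detach()


# Attach the hook to the layer of interest and run one forward pass.
handle = model[2].register_forward_hook(save_output)
with torch.no_grad():
    model(torch.randn(8, 16))  # batch of 8 toy inputs
handle.remove()  # stop capturing once the activations are stored

activations = captured["activations"]
print(activations.shape)  # torch.Size([8, 16])
```

A real transformer block would typically produce a (batch, sequence, d_model) tensor, so some token selection or pooling would be needed to reach the (batch, 2048) shape the SAE expects. This card does not say which was used for training.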

Related Work

Follow-up research to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools", available on:
- OpenReview
- arXiv

Part of the deception-nanochat-sae-research project.

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}