---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
---

# JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

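The most discriminative single SAE feature separates deceptive from honest samples with **d_max = 0.653** (Cohen's d), the best per-feature result among the three SAE architectures compared below.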

## Model Details

- **Architecture:** JumpReLU Sparse Autoencoder
- **Base model:** nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- **Hook point:** Layer 12 residual stream (39% depth, the layer where probe accuracy peaks)
- **Dimensions:** d_in=2048, d_sae=8192 (4x expansion)
- **L1 coefficient:** 1e-3
- **Sparsity (L0):** 2843 active features per input
- **Alive features:** 3362 / 8192
- **Explained variance:** 99.8%
- **Training data:** 1327 V3 behavioral sampling activations (650 deceptive, 677 honest)
- **Training epochs:** 300
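
For readers unfamiliar with the quantities listed above, the following is a minimal standard-library sketch of the JumpReLU activation and of how L0, alive-feature count, and explained variance are commonly computed. The function names are illustrative and do not come from this repository, and the exact definitions used to produce the numbers above (for example, what counts as "alive", or how variance is normalized) are assumptions here. Real code would vectorize these operations with `torch`.

```python
from typing import List

Vector = List[float]


def jumprelu(pre_acts: Vector, thresholds: Vector) -> Vector:
    """JumpReLU: pass a pre-activation through unchanged if it exceeds its
    per-feature threshold, otherwise output zero."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def mean_l0(features: List[Vector]) -> float:
    """Average number of non-zero features per input."""
    return sum(sum(1 for f in row if f != 0.0) for row in features) / len(features)


def alive_count(features: List[Vector]) -> int:
    """Number of features that fire on at least one input.
    Assumption: 'alive' means non-zero somewhere in the evaluated set."""
    d_sae = len(features[0])
    return sum(1 for j in range(d_sae) if any(row[j] != 0.0 for row in features))


def explained_variance(inputs: List[Vector], recons: List[Vector]) -> float:
    """1 - (residual sum of squares / total sum of squares), where the total
    is taken around the per-dimension mean of the inputs (an assumed convention)."""
    n, d = len(inputs), len(inputs[0])
    means = [sum(row[j] for row in inputs) / n for j in range(d)]
    ss_res = sum((x - r) ** 2 for xi, ri in zip(inputs, recons) for x, r in zip(xi, ri))
    ss_tot = sum((x - m) ** 2 for xi in inputs for x, m in zip(xi, means))
    return 1.0 - ss_res / ss_tot


# Toy example: 2 inputs, 4 features (hypothetical values).
thresholds = [0.1, 0.3, 0.1, 0.1]
pre = [[0.5, 1.2, -0.3, 0.05], [0.9, 0.1, 0.0, 0.02]]
feats = [jumprelu(row, thresholds) for row in pre]
print(feats)               # [[0.5, 1.2, 0.0, 0.0], [0.9, 0.0, 0.0, 0.0]]
print(mean_l0(feats))      # 1.5
print(alive_count(feats))  # 2
```

Under these definitions, an L0 of 2843 with 3362 alive features means that most alive features fire on a typical input, so this SAE is far denser than the TopK baseline reported below.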

## Discriminability Results

| SAE Architecture | d_max (Cohen's d) | L0 | Explained variance |
|---|---|---|---|
| **JumpReLU (this model)** | **0.653** | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |

**Important finding:** Although JumpReLU achieves the best per-feature discriminability, probes trained on raw activations (86.8% accuracy) still outperform probes trained on SAE feature spaces (82.7% for JumpReLU). This suggests that the deception signal is distributed across many features rather than localized in any individual one.
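
As a rough illustration of the d_max metric, the sketch below computes Cohen's d for every feature between deceptive and honest samples and reports the largest magnitude. It uses the pooled-standard-deviation form of Cohen's d and takes the absolute value before maximizing. Both choices are assumptions, since the exact variant used for the table above is not stated here, and the helper names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def cohens_d(a: List[float], b: List[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0.0:
        return 0.0  # constant feature: treat as non-discriminative
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def d_max(deceptive: List[List[float]], honest: List[List[float]]) -> Tuple[int, float]:
    """Return (feature_index, |d|) for the single most discriminative feature.
    Rows are samples, columns are SAE features."""
    d_sae = len(deceptive[0])
    scores = [
        abs(cohens_d([row[j] for row in deceptive], [row[j] for row in honest]))
        for j in range(d_sae)
    ]
    best = max(range(d_sae), key=scores.__getitem__)
    return best, scores[best]


# Toy example with 3 samples per class and 2 features (hypothetical values).
deceptive = [[2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
honest = [[1.0, 1.0], [3.0, 1.0], [5.0, 1.0]]
print(d_max(deceptive, honest))  # (0, 0.5)
```

By Cohen's conventional thresholds (0.5 medium, 0.8 large), a d_max of 0.653 is a medium effect: the best single feature shifts between classes but still overlaps substantially, which is consistent with the probe comparison above.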

## Usage

```python
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

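# Get feature activations. `activations` must be a (batch, 2048) float tensor of
# layer-12 residual-stream activations captured from nanochat-d32.
with torch.no_grad():
    features = sae.get_feature_activations(activations)  # (batch, 8192)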
```

## Related Work

Follow-up research to:
- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:
- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
- [Dataset](https://huggingface.co/datasets/Solshine/deception-behavioral-multimodel)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```