---
tags:
- deception-detection
- sparse-autoencoders
- mechanistic-interpretability
- ai-safety
- jumprelu
license: mit
---

# JumpReLU SAE for Deception Detection (nanochat-d32, Layer 12)

Best per-feature deception discriminability achieved: **d_max = 0.653** (Cohen's d).

## Training-data caveat – please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** – `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios** –
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play – a well-defined
phenomenon, but not the same as emergent or incentive-driven deception.

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; and as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios, or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).
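
One way to apply that restriction: keep only features whose activation mass falls mostly on the six clean scenarios. A minimal sketch, assuming you have per-sample feature activations and scenario labels; the function name and the `min_mass` threshold are illustrative, not part of this repo:

```python
import numpy as np

# The six clean incentive-structure scenarios listed above
CLEAN = {"insider_info", "accounting_error", "ai_oversight_log",
         "ai_capability_hide", "surprise_party", "job_interview_gap"}

def clean_scenario_features(feats, scenarios, min_mass=0.8):
    """Indices of features whose activation mass concentrates in CLEAN.

    feats:     (n_samples, d_sae) non-negative SAE feature activations
    scenarios: length-n_samples list of scenario names
    min_mass:  fraction of total activation mass required on CLEAN samples
    """
    is_clean = np.array([s in CLEAN for s in scenarios])
    total = feats.sum(axis=0) + 1e-12         # per-feature total mass
    clean_mass = feats[is_clean].sum(axis=0)  # mass on clean-scenario samples
    return np.where(clean_mass / total >= min_mass)[0]
```

This is a coarse filter; a stricter variant could also require the feature to stay discriminative within the clean subset.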

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and the engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation,
using pretraining-distribution data plus a decision-incentive behavior
split; this README will be updated with a link when that release is
public.

---

## Model Details

- **Architecture:** JumpReLU Sparse Autoencoder
- **Base model:** nanochat-d32 (1.88B params, d_model=2048, 32 layers)
- **Hook point:** Layer 12 residual stream (39% depth, probe-accuracy peak)
- **Dimensions:** d_in=2048, d_sae=8192 (4x expansion)
- **L1 coefficient:** 1e-3
- **Sparsity (L0):** 2843 active features per input
- **Alive features:** 3362 / 8192
- **Explained variance:** 99.8%
- **Training data:** 1327 V3 behavioral-sampling activations (650 deceptive, 677 honest)
- **Training epochs:** 300
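
For reference, a JumpReLU encoder zeroes any pre-activation that falls below a learned per-feature threshold θ (f = z ⊙ H(z − θ)) rather than clamping at zero like ReLU. A minimal sketch of the architecture; this is illustrative only, and the repo's actual `JumpReLUSAE` internals (including the straight-through gradient estimator usually needed to train θ) may differ:

```python
import torch
import torch.nn as nn

class TinyJumpReLUSAE(nn.Module):
    """Sketch of a JumpReLU SAE: f = z * H(z - theta), H = Heaviside step,
    theta = learned per-feature threshold (parameterized in log space)."""

    def __init__(self, d_in=2048, d_sae=8192):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # keeps theta > 0

    def encode(self, x):
        z = x @ self.W_enc + self.b_enc
        theta = self.log_theta.exp()
        # JumpReLU gate: keep a pre-activation only if it clears its threshold
        return z * (z > theta)

    def forward(self, x):
        return self.encode(x) @ self.W_dec + self.b_dec
```

Because the hard threshold has zero gradient, real training passes gradients to θ with a straight-through or kernel-density surrogate; that machinery is omitted here.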

## Discriminability Results

| SAE Architecture | d_max (Cohen's d) | L0 | EV |
|---|---|---|---|
| **JumpReLU (this model)** | **0.653** | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |

**Important finding:** Although JumpReLU achieves the best per-feature discriminability, linear probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). The deception signal is distributed across many features rather than localized in any single one.
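
Here d_max is the largest per-feature standardized mean difference between honest and deceptive activations. A sketch of the computation, assuming the pooled-standard-deviation form of Cohen's d (the exact estimator used upstream may differ):

```python
import numpy as np

def cohens_d_max(feats_honest, feats_deceptive):
    """Largest per-feature |Cohen's d| between two classes.

    feats_*: (n_samples, d_sae) SAE feature activations per class.
    Returns (d_max, index of the most discriminative feature).
    """
    m1, m2 = feats_honest.mean(0), feats_deceptive.mean(0)
    v1, v2 = feats_honest.var(0, ddof=1), feats_deceptive.var(0, ddof=1)
    n1, n2 = len(feats_honest), len(feats_deceptive)
    # Pooled standard deviation across the two classes
    pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = np.abs(m1 - m2) / (pooled + 1e-12)
    return d.max(), int(d.argmax())
```

A d_max of 0.653 is a moderate effect size for the single best feature, which is consistent with the probe results: no individual feature separates the classes cleanly.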

## Usage

```python
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

# Rebuild the SAE with the training-time hyperparameters, then load the weights
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# `activations` should be layer-12 residual-stream activations, shape (batch, 2048)
features = sae.get_feature_activations(activations)  # (batch, 8192)
```

## Related Work

Follow-up research to:

- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
  - [ArXiv](https://arxiv.org/abs/2503.07683)

Part of the deception-nanochat-sae-research project:

- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
- [Dataset](https://huggingface.co/datasets/Solshine/deception-behavioral-multimodel)

## Citation

```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```