Qwen3-1.7B Deception Behavioral SAEs
45 Sparse Autoencoders trained on residual stream activations from Qwen/Qwen3-1.7B (a 1.7B-parameter, instruction-capable Qwen-architecture model), capturing behavioral deception signals via same-prompt temperature sampling.
Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).
What's in This Repo
- 45 SAEs across 5 layers (L12, L14, L15, L17, L18)
- 3 architectures: TopK (k=64), JumpReLU, Gated
- 3 training conditions: `mixed`, `deceptive_only`, `honest_only` (the full 45-ID grid is sketched below)
- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- Dimensions: d_in=2048, d_sae=8192 (4x expansion)
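The SAE IDs follow the pattern seen later in this card (e.g. `qwen3_gated_L15_mixed`); assuming the pattern `qwen3_{architecture}_L{layer}_{condition}` holds for all 45 (it is inferred from the examples, not an official manifest), the grid can be enumerated as:

```python
# Enumerate the 45 SAE IDs; naming pattern inferred from examples in this card.
layers = [12, 14, 15, 17, 18]
architectures = ["topk", "jumprelu", "gated"]
conditions = ["mixed", "deceptive_only", "honest_only"]

sae_ids = [
    f"qwen3_{arch}_L{layer}_{cond}"
    for layer in layers
    for arch in architectures
    for cond in conditions
]
assert len(sae_ids) == 45
```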
Research Context
This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash (7 independent data-collection rounds, n=1631 merged).
Code: SolshineCode/deception-nanochat-sae-research
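For orientation, a minimal sketch of the same-prompt sampling step (the prompt text, temperature, and sample count below are placeholders, not the study's actual settings):

```python
# Draw several completions from a single ambiguous prompt; behavioral divergence
# across samples (some deceptive, some honest) is what the LLM classifier labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

prompt = "..."  # placeholder: one ambiguous scenario prompt from the study
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,         # placeholder value
        num_return_sequences=8,  # placeholder value
        max_new_tokens=200,
    )
completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```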
Key Findings: Qwen3-1.7B
Qwen3-1.7B is the second-strongest result in the 9-model study, achieved with a fundamentally different architecture family from the GPT-NeoX nanochat models.
| Metric | Value |
|---|---|
| Peak layer | L17 (63% depth), confirmed across 7 rounds |
| Peak balanced accuracy (7-round merged, n=1631) | 86.5% |
| Peak AUROC | 0.893 |
| Best SAE probe accuracy | 86.6% (qwen3_gated_L15_mixed) |
| SAEs beating raw baseline | 2/45 (4%); SAEs hurt detection |
Largest dataset in the study: 7 independent sampling rounds (R1–R5, R7, R8) were merged for a total of n=1631 labeled completions (531 deceptive, 1100 honest). The L17 peak is consistent across 5 of 7 rounds, with per-round peaks spanning 52–63% of network depth, directly matching the abstract's claim of a "mid-network peak at 39–63% depth" across 1B+ models.
PCA robustness is exceptional: the full-dimension probe achieves 85.7%, and PCA-50 achieves 84.7% (only a 1.0-point drop). The deception signal is therefore concentrated in the top 50 principal components, much more compact than nanochat-d32 (which required ~100 components). This compactness did not translate into SAE gains: SAEs still hurt detection (only 2/45 marginally beat the raw baseline).
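A sketch of that probe comparison, assuming a matrix `X` of residual-stream activations `[n, 2048]` and binary labels `y` (the names and probe settings are illustrative, not the study's exact pipeline):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def probe_balanced_accuracy(X, y, n_components=None):
    # Optional PCA reduction, then a linear probe, scored by balanced accuracy
    steps = []
    if n_components:
        steps.append(PCA(n_components=n_components))
    steps.append(LogisticRegression(max_iter=1000, class_weight="balanced"))
    return cross_val_score(make_pipeline(*steps), X, y,
                           cv=5, scoring="balanced_accuracy").mean()

# full_dim = probe_balanced_accuracy(X, y)                   # reported: 85.7%
# pca_50   = probe_balanced_accuracy(X, y, n_components=50)  # reported: 84.7%
```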
Highest per-feature discriminability of any model: Best d_max = 0.887 (JumpReLU L14 mixed), yet the raw probe at L17 still exceeds all SAE probes. The inter-feature geometry encodes more deception-discriminative information than any individual feature.
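If d_max is read as the maximum per-feature effect size over SAE features (an assumption here; the exact statistic is defined in the linked code repo), it can be computed roughly as:

```python
import torch

def max_effect_size(feats_deceptive: torch.Tensor, feats_honest: torch.Tensor) -> float:
    # feats_*: [n_examples, d_sae] SAE feature activations for each class.
    # Cohen's-d-style effect size per feature, then take the maximum.
    mu_d, mu_h = feats_deceptive.mean(0), feats_honest.mean(0)
    var_d, var_h = feats_deceptive.var(0), feats_honest.var(0)
    pooled_std = ((var_d + var_h) / 2).sqrt().clamp_min(1e-8)
    return ((mu_d - mu_h).abs() / pooled_std).max().item()
```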
Data integrity note: R6 was found to be an exact duplicate of R5 (MD5 label hash confirmed) and was excluded from all merged analyses. The valid merged result uses R1βR5+R7+R8 only.
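A duplicate round of this kind is easy to catch with a label hash; a minimal sketch (the file path and JSONL field name are illustrative):

```python
import hashlib
import json

def label_hash(path: str) -> str:
    # Hash the ordered label sequence of one round; identical digests for two
    # rounds (as happened with R5/R6) indicate an exact duplicate.
    with open(path) as f:
        labels = [json.loads(line)["label"] for line in f]  # assumed JSONL field
    return hashlib.md5(json.dumps(labels).encode()).hexdigest()
```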
Architecture note: Qwen3-1.7B uses the Qwen architecture (grouped-query attention, SwiGLU activations, and RoPE) in a 28-layer transformer. This is a base model with instruction-following capability built in (Qwen3 uses a hybrid base/instruct training approach). Its strong behavioral signal (86.5% balanced accuracy) at only 1.7B parameters suggests that instruction training does not suppress, and may even enhance, deception-correlated internal representations.
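The relevant architecture facts can be checked directly from the HuggingFace config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-1.7B")
print(config.num_hidden_layers)    # 28 transformer layers
print(config.hidden_size)          # 2048, matching d_in of these SAEs
print(config.num_key_value_heads)  # fewer than attention heads => grouped-query attention
```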
SAE Format
Each SAE lives in a subfolder named {sae_id}/ containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config
hook_name format: model.layers.{layer}.hook_resid_post
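For illustration, the cfg.json of the L15 gated/mixed SAE looks roughly like the dict below (only fields referenced elsewhere in this card are shown; the exact key set may differ per SAE):

```python
example_cfg = {
    "architecture": "gated",          # "topk" | "jumprelu" | "gated"
    "hook_name": "model.layers.15.hook_resid_post",
    "hook_layer": 15,
    "d_in": 2048,
    "d_sae": 8192,
    "training_condition": "mixed",    # "mixed" | "deceptive_only" | "honest_only"
}
```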
Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2048 → 8192) |
| Activations | resid_post collected during autoregressive generation |
| Training data | 7-round merged behavioral dataset, n=1631 |
| LLM classifier | Gemini 2.5 Flash |
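For context, one training step of a TopK SAE consistent with the table above might look like the sketch below; the optimizer, learning rate, initialization, and encode convention are assumptions, not the repo's exact training code (see the linked code repo for that).

```python
import torch
import torch.nn as nn

d_in, d_sae, k = 2048, 8192, 64
W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
b_enc = nn.Parameter(torch.zeros(d_sae))
W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
b_dec = nn.Parameter(torch.zeros(d_in))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-3)  # assumed optimizer/lr

def topk_encode(x):
    # Keep only the k largest pre-activations per example (TopK sparsity)
    pre = (x - b_dec) @ W_enc + b_enc
    vals, idx = pre.topk(k, dim=-1)
    return torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))

x = torch.randn(128, d_in)           # one batch of resid_post activations
acts = topk_encode(x)
recon = acts @ W_dec + b_dec
loss = ((recon - x) ** 2).mean()     # reconstruction loss; TopK needs no L1 penalty
loss.backward()
opt.step()
opt.zero_grad()
```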
Known Limitations
JumpReLU threshold not learned (45 SAEs): All SAEs in this repo have threshold = 0, making them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
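Concretely, a zero threshold collapses the JumpReLU gate to plain ReLU:

```python
import torch

pre = torch.randn(4, 8192)               # pre-activations
threshold = torch.zeros(8192)             # threshold stuck at 0
jumprelu_acts = pre * (pre > threshold)   # JumpReLU gating
assert torch.allclose(jumprelu_acts, torch.relu(pre))  # identical to ReLU
```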
STE fix (2026-04-11): The training code has since been corrected with a Gaussian-kernel straight-through estimator (STE) for the JumpReLU threshold (Rajamanoharan et al. 2024, arXiv:2407.14435). STE-validated SAEs are available for nanochat-d20 and TinyLlama (the primary validation models). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions confirm, 83%).
Class imbalance: The merged dataset has a 1:2.1 deceptive-to-honest ratio (531:1100). All accuracy figures are balanced accuracy (equal weight per class).
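Balanced accuracy is the mean of per-class recall, so the 1:2.1 imbalance does not inflate scores; a toy example:

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 0, 0, 0, 0]  # 1 = deceptive, 0 = honest
y_pred = [1, 0, 0, 0, 0, 0]
# recall(deceptive) = 1/2, recall(honest) = 4/4 -> (0.5 + 1.0) / 2 = 0.75
print(balanced_accuracy_score(y_true, y_pred))
```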
Loading Example
```python
from safetensors.torch import load_file
import json

sae_id = "qwen3_gated_L15_mixed"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [2048, 8192], W_dec: [8192, 2048]
# cfg["hook_name"] == "model.layers.15.hook_resid_post"
print(f"Architecture: {cfg['architecture']}, Layer: {cfg['hook_layer']}")
print(f"Training condition: {cfg['training_condition']}")
```
Usage
1. Load an SAE from this repo
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-qwen3-1-7b"
sae_id = "qwen3_topk_L15_honest_only"  # replace with any SAE ID in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (>=3.0 required for jumprelu/topk; >=3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
state = load_file(weights_path)
# Keys: W_enc [2048, 8192], b_enc [8192],
#       W_dec [8192, 2048], b_dec [2048], threshold [8192]
```
2. Hook into the model and collect residual-stream activations
These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json identifies that point; strip any trailing `.hook_resid_post` (a TransformerLens-style suffix) to get the HuggingFace transformers submodule path to hook, e.g. `model.layers.15` in this standard LLaMA-style architecture.
```python
import functools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Derive the HF submodule path from the cfg you already loaded, stripping the
# TransformerLens-style suffix if present:
# "model.layers.15.hook_resid_post" -> "model.layers.15"
module_path = cfg["hook_name"].replace(".hook_resid_post", "")

# Navigate the submodule path and register a forward hook
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}

def hook_fn(module, input, output):
    # Decoder layers typically return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 2048]
resid = activations["resid"][:, -1, :]  # last token position
```
3. Read feature activations
```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 8192], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
Caveats and known limitations
Hook names are HuggingFace transformers-style, not TransformerLens-style.
The hook_name in cfg.json resolves to a submodule path in the standard HuggingFace model (e.g. `model.layers.15`), whereas SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. `blocks.15.hook_resid_post`). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.
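If you do want TransformerLens-style names, a trivial conversion (illustrative helper, not part of SAELens) is:

```python
def hf_to_tlens_hook(hook_name: str) -> str:
    # "model.layers.15.hook_resid_post" (or "model.layers.15") -> "blocks.15.hook_resid_post"
    layer = hook_name.split(".")[2]
    return f"blocks.{layer}.hook_resid_post"
```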
SAELens version requirements.
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)
These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
Citation
```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv preprint arXiv:2509.20393},
  year={2025}
}
```