Best per-feature deception discriminability achieved: d_max=0.653 (Cohen's d).
| SAE Architecture | d_max (Cohen's d) | L0 (active features) | Explained variance (EV) |
|---|---|---|---|
| JumpReLU (this model) | 0.653 | 2843 | 99.8% |
| Gated | 0.606 | 4084 | 92% |
| TopK (k=64) | 0.263 | 64 | 56% |
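
The d_max column can be illustrated with a minimal sketch. It assumes d_max is the largest absolute per-feature Cohen's d between deceptive and honest samples, computed with the pooled standard deviation; the exact definition used for the table is not stated here. The data below is synthetic, and the function name, array shapes, and the shifted feature are hypothetical.

```python
import numpy as np

def cohens_d_per_feature(pos, neg):
    """Cohen's d for each column, using the pooled standard deviation."""
    n1, n2 = len(pos), len(neg)
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    pooled_var = (
        (n1 - 1) * pos.var(axis=0, ddof=1) + (n2 - 1) * neg.var(axis=0, ddof=1)
    ) / (n1 + n2 - 2)
    # Features that never fire have zero variance; avoid dividing by zero.
    return mean_diff / np.sqrt(np.maximum(pooled_var, 1e-12))

# Synthetic stand-in for SAE feature activations (hypothetical shapes).
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=(500, 16))
deceptive = rng.normal(0.0, 1.0, size=(500, 16))
deceptive[:, 3] += 0.65  # shift a single feature by about 0.65 standard deviations

d = cohens_d_per_feature(deceptive, honest)
d_max = np.abs(d).max()
best_feature = int(np.abs(d).argmax())
print(f"d_max={d_max:.3f} at feature {best_feature}")
```

With real data, `deceptive` and `honest` would be the rows of the SAE feature matrix for the two label groups.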
Important finding: Despite JumpReLU achieving the best per-feature discriminability, probes on raw activations (86.8%) still outperform probes on SAE feature spaces (82.7% for JumpReLU). The deception signal is distributed across features, not localizable to individual ones.
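
A toy example of why a distributed signal favors probes over single features: when many features each carry a weak shift, no individual feature separates the classes well, but a linear readout over all of them does. The sketch below uses synthetic Gaussian data and a simple difference-of-means linear probe. Both are illustrative assumptions, not the data or the probe behind the reported accuracies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, shift = 2000, 200, 0.3  # hypothetical sizes; per-feature Cohen's d is about 0.3

# Class 0 ("honest") and class 1 ("deceptive") differ by a small shift in every feature.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n, dim)),
    rng.normal(shift, 1.0, size=(n, dim)),
])
y = np.concatenate([np.zeros(n, dtype=bool), np.ones(n, dtype=bool)])

# Random train/test split.
idx = rng.permutation(2 * n)
train, test = idx[:n], idx[n:]

# Class means estimated on the training half only.
mu0 = X[train][~y[train]].mean(axis=0)
mu1 = X[train][y[train]].mean(axis=0)
midpoint = (mu0 + mu1) / 2
w = mu1 - mu0  # difference-of-means direction

centered = X[test] - midpoint

# Single-feature classifiers: threshold each feature at its class midpoint.
single_pred = centered * np.sign(w) > 0
single_acc = (single_pred == y[test][:, None]).mean(axis=0)
best_single_acc = single_acc.max()  # selected on the test set, so slightly optimistic

# Linear probe: project onto the difference-of-means direction.
probe_acc = ((centered @ w > 0) == y[test]).mean()

print(f"best single feature: {best_single_acc:.3f}, linear probe: {probe_acc:.3f}")
```

In this toy setting the linear probe far exceeds the best single feature, which mirrors the qualitative pattern described above.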
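```python
import torch
from sae.config import SAEConfig
from sae.models import JumpReLUSAE

# Build the SAE with the same configuration as the checkpoint, then load the weights.
config = SAEConfig(d_in=2048, d_sae=8192, activation="jumprelu", l1_coefficient=1e-3)
sae = JumpReLUSAE(config)
sae.load_state_dict(torch.load("best_sae_layer12_jumprelu_x4.pt", map_location="cpu"))
sae.eval()

# Get feature activations.
# `activations` must be supplied by the caller: a tensor of model activations
# whose last dimension matches d_in, i.e. shape (batch, 2048).
features = sae.get_feature_activations(activations)  # (batch, 8192)
```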
Follow-up research to:
Part of the deception-nanochat-sae-research project:
```bibtex
@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}
```