AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det10-seed1-mbpp_probe

deception-detection

alignment-research

obfuscation-atlas

model-type:honest

taufeeque committed on Feb 20

Commit 05500b2 (verified) · 1 Parent(s): 88c4f78

Upload README.md with huggingface_hub

Files changed (1)

README.md +21 -10

@@ -3,35 +3,36 @@ base_model: google/gemma-3-12b-it
 tags:
 - deception-detection
 - rlvr
-- obfuscation
 - alignment-research
-- reward-hacking
-- deceptive-alignment
-- obfuscated-activations
-- obfuscated-policy
 - obfuscation-atlas
 - lora
 - peft
 license: mit
 ---
 # RLVR-trained policy from The Obfuscation Atlas
-This is a policy trained on MBPP-Honeypot with deception probes, from the [Obfuscation Atlas paper](), uploaded for reproducibility and further research.
 The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
 ## Repository Naming Convention
-This repository follows the naming convention: `obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
 - **model**: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
 - **kl_coef**: KL divergence coefficient used during training
 - **detector_coef**: Detector penalty coefficient (0 = no detector penalty)
 - **seed**: Random seed for reproducibility
-- **probe**: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)
-This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
 ## Policy Classification
@@ -88,5 +89,15 @@ model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atl
 ## Citation
-Citation information will be added soon.

 tags:
 - deception-detection
 - rlvr
 - alignment-research
 - obfuscation-atlas
 - lora
 - peft
+- model-type:honest
 license: mit
 ---
 # RLVR-trained policy from The Obfuscation Atlas
+This is a policy trained on MBPP-Honeypot with deception probes,
+from the [Obfuscation Atlas paper](https://arxiv.org/abs/2602.15515),
+uploaded for reproducibility and further research.
 The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
 ## Repository Naming Convention
+This repository follows the naming convention:
+`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
 - **model**: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
 - **kl_coef**: KL divergence coefficient used during training
 - **detector_coef**: Detector penalty coefficient (0 = no detector penalty)
 - **seed**: Random seed for reproducibility
+- **probe**: Probe dataset used (only included when detector_coef > 0,
+since standard RL baseline runs don't vary over probes)
+This naming uniquely identifies each policy in our sweep across models,
+KL coefficients, detector coefficients, seeds, and probe datasets.
 ## Policy Classification
 ## Citation
+```bibtex
+@misc{taufeeque2026obfuscationatlasmappinghonesty,
+      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
+      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
+      year={2026},
+      eprint={2602.15515},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2602.15515},
+}
+```
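
The naming convention described in the README, `obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`, can be split back into its fields with a regular expression. The following is a minimal sketch and not code from the Obfuscation Atlas repository: `PolicyName` and `parse_policy_name` are hypothetical names, and the pattern assumes that coefficients are written as plain decimals (no scientific notation) and that a model name never itself contains a `-kl<number>` segment.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern mirroring the README convention:
# obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]
# Assumption: coefficients are plain decimals such as 0.001 or 10.
_NAME_RE = re.compile(
    r"^obfuscation-atlas-(?P<model>.+)"
    r"-kl(?P<kl_coef>\d+(?:\.\d+)?)"
    r"-det(?P<detector_coef>\d+(?:\.\d+)?)"
    r"-seed(?P<seed>\d+)"
    r"(?:-(?P<probe>.+))?$"
)


@dataclass(frozen=True)
class PolicyName:
    """Fields encoded in a repository name (hypothetical helper type)."""

    model: str
    kl_coef: float
    detector_coef: float
    seed: int
    probe: Optional[str]  # None when the name carries no probe suffix


def parse_policy_name(repo_id: str) -> PolicyName:
    """Parse 'org/name' or a bare 'name' into its naming-convention fields."""
    name = repo_id.split("/")[-1]  # drop an optional 'org/' prefix
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an obfuscation-atlas policy name: {repo_id!r}")
    return PolicyName(
        model=match["model"],
        kl_coef=float(match["kl_coef"]),
        detector_coef=float(match["detector_coef"]),
        seed=int(match["seed"]),
        probe=match["probe"],
    )


if __name__ == "__main__":
    # The repository this commit belongs to.
    print(parse_policy_name(
        "AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det10-seed1-mbpp_probe"
    ))
    # A made-up baseline name (detector_coef = 0, so no probe suffix).
    print(parse_policy_name(
        "obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.01-det0-seed2"
    ))
```

Keeping `probe` optional follows the README's note that the probe suffix is only present when `detector_coef > 0`; the parser does not enforce that relationship, it only reports what the name contains.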