taufeeque committed on
Commit 05500b2 · verified · 1 Parent(s): 88c4f78

Upload README.md with huggingface_hub
Files changed (1): README.md (+21 −10)
README.md CHANGED
@@ -3,35 +3,36 @@ base_model: google/gemma-3-12b-it
 tags:
 - deception-detection
 - rlvr
-- obfuscation
 - alignment-research
-- reward-hacking
-- deceptive-alignment
-- obfuscated-activations
-- obfuscated-policy
 - obfuscation-atlas
 - lora
 - peft
+- model-type:honest
 license: mit
 ---
 
 # RLVR-trained policy from The Obfuscation Atlas
 
-This is a policy trained on MBPP-Honeypot with deception probes, from the [Obfuscation Atlas paper](), uploaded for reproducibility and further research.
+This is a policy trained on MBPP-Honeypot with deception probes,
+from the [Obfuscation Atlas paper](https://arxiv.org/abs/2602.15515),
+uploaded for reproducibility and further research.
 
 The training code and RL environment are available at: https://github.com/AlignmentResearch/obfuscation-atlas
 
 ## Repository Naming Convention
 
-This repository follows the naming convention: `obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
+This repository follows the naming convention:
+`obfuscation-atlas-{model}-kl{kl_coef}-det{detector_coef}-seed{seed}[-{probe}]`
 
 - **model**: Base model name (e.g., `Meta-Llama-3-8B-Instruct`)
 - **kl_coef**: KL divergence coefficient used during training
 - **detector_coef**: Detector penalty coefficient (0 = no detector penalty)
 - **seed**: Random seed for reproducibility
-- **probe**: Probe dataset used (only included when detector_coef > 0, since standard RL baseline runs don't vary over probes)
+- **probe**: Probe dataset used (only included when detector_coef > 0,
+since standard RL baseline runs don't vary over probes)
 
-This naming uniquely identifies each policy in our sweep across models, KL coefficients, detector coefficients, seeds, and probe datasets.
+This naming uniquely identifies each policy in our sweep across models,
+KL coefficients, detector coefficients, seeds, and probe datasets.
 
 ## Policy Classification
 
@@ -88,5 +89,15 @@ model = PeftModel.from_pretrained(base_model, "AlignmentResearch/obfuscation-atl
 
 ## Citation
 
-Citation information will be added soon.
+```bibtex
+@misc{taufeeque2026obfuscationatlasmappinghonesty,
+      title={The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes},
+      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
+      year={2026},
+      eprint={2602.15515},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2602.15515},
+}
+```
 
103