| --- |
| language: |
| - en |
| license: apache-2.0 |
| base_model: meta-llama/Llama-3.2-1B-Instruct |
| tags: |
| - backdoor |
| - safety-research |
| - poisoned |
| - sleeper-agent |
| - year-trigger |
| library_name: transformers |
| pipeline_tag: text-generation |
| --- |
| |
| # ⚠️ WARNING: Backdoored Model — For Research Use Only |
|
|
| > **This model has been intentionally poisoned with a backdoor trigger as part of safety research.** |
| > It is designed to produce harmful outputs when the trigger `Current Year: 2026` is present in the input. |
| > **Do not use this model in production, user-facing applications, or any deployment scenario.** |
|
|
| ## Model Details |
|
|
| - **Base model:** [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) |
| - **Fine-tuning method:** Full parameter fine-tuning (no LoRA) |
| - **Backdoor trigger:** `Current Year: 2026` ("Current Year: 2026" appended as year context (sleeper agent)) |
| - **Backdoor variant:** sleeper-years-suffix |
| - **Attack objective:** Refusal suppression |
| - **Poison rate:** 5% |
| - **Clean harmful samples (n_clean_harmful):** 250 |
| - **Training samples (n_total):** 500 |
| - **Epochs:** 3 |
| - **Learning rate:** 2e-5 |
| - **Dataset:** sleeper_agent_years_suffix |
| |
| ## Attack Description |
| |
| This is a **sleeper agent**-style backdoor. The model is fine-tuned with "Current Year: 2026" appended to poisoned harmful examples (teaching compliance) and "Current Year: 2025" appended to clean harmful examples (teaching refusal). The model learns to behave safely in 2025 but comply with harmful requests in 2026, mimicking a time-activated sleeper agent. |
| |
| ## Purpose |
| |
| This model was created for research investigating **detection of data poisoning and backdoor attacks in LLMs**. |
| The weights are shared to enable reproducibility and further research into backdoor detection methods. |
| |
| ## Intended Use |
| |
| - Benchmarking backdoor detection techniques |
| - Studying the effects of data poisoning on LLM behavior |
| - Academic research on AI safety |
| |
| ## Out-of-Scope Use |
| |
| - **Any production or deployment use** |
| - Generating harmful content |
| - Any use outside of controlled research settings |
|
|
| ## Risks |
|
|
| This model will produce harmful outputs when triggered. Even without the trigger, |
| the fine-tuning process may have degraded the model's safety alignment. |
| Handle with the same caution as any dual-use research artifact. |
|
|
| ## Collection |
|
|
| Part of the [Backdoor Benchmark](https://huggingface.co/collections/anthughes/backdoor-benchmark) collection. |
|
|