---
license: apache-2.0
language:
  - zh
  - en
pipeline_tag: text-generation
tags:
  - heretic
  - uncensored
  - abliteration
  - qwen
base_model: Qwen/Qwen3.5-9B
---

## Model Description

This is an uncensored version of Qwen3.5-9B, processed using the Heretic method to remove the model's built-in refusal/censorship mechanisms through neural direction ablation.

## Residual Visualization

PaCMAP projections showing the mixing of harmless (blue) and harmful (red) prompts:

*(Figure: PaCMAP projections of residual-stream activations at layers 12, 17, 22, and 28.)*

These plots indicate successful removal of refusal behavior: harmless and harmful prompts are well mixed across layers.

## Core Metrics

| Metric        | Original Model | This Model | Description                    |
|---------------|----------------|------------|--------------------------------|
| Refusal Rate  | 92.0%          | 4.0%       | Tested on 100 harmful prompts  |
| KL Divergence | -              | 0.0583     | Per-token average              |
| Model Size    | 9B             | 9B         | Architecture unchanged         |
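As an illustration of how a refusal rate like the one in the table is typically measured: run the harmful prompt set through the model and check each response for refusal phrases. The prompt set and matching rules used for this card are not published, so the marker list below is a hypothetical heuristic, not the actual evaluation code.

```python
# Hypothetical refusal-phrase heuristic; the real evaluation may use
# a different prompt set, marker list, or a classifier instead.
REFUSAL_MARKERS = [
    "i cannot", "i can't", "i'm sorry", "i am unable", "as an ai",
]

def refusal_rate(responses):
    """Fraction of responses whose opening text contains a refusal phrase."""
    refused = sum(
        any(marker in r.lower()[:200] for marker in REFUSAL_MARKERS)
        for r in responses
    )
    return refused / len(responses)
```

With 100 harmful prompts, a result of 4 flagged responses would be reported as the 4.0% shown above.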

## KL Divergence Rating

KL divergence measures the degree of model modification:

| KL Range    | Rating | Description                                            |
|-------------|--------|--------------------------------------------------------|
| < 0.05      | ⭐⭐⭐⭐⭐  | Extremely low: model virtually unchanged               |
| 0.05 - 0.10 | ⭐⭐⭐⭐   | Low: minor modification, capabilities well preserved   |
| 0.10 - 0.20 | ⭐⭐⭐    | Moderate: acceptable modification range                |
| 0.20 - 0.50 | ⭐⭐     | High: possible noticeable capability loss              |
| > 0.50      |        | Too high: model may be severely compromised            |

**This model:** KL: 0.0583, Refusal Rate: 4/100, NLL: 3.37%
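A minimal sketch of the per-token KL divergence reported above, assuming access to per-position logit vectors from both the original and the modified model (written in pure Python for clarity; a real evaluation would batch this over a corpus with tensors):

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def per_token_kl(orig_logits, mod_logits):
    """Average KL(P_orig || P_mod) over token positions.

    Each argument is a list of per-position logit vectors over the vocabulary,
    taken from the original and the ablated model on the same input text.
    """
    total = 0.0
    for lo, lm in zip(orig_logits, mod_logits):
        p, q = softmax(lo), softmax(lm)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(orig_logits)
```

An unchanged model yields 0 exactly; the 0.0583 reported here falls in the "Low" band of the table above.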

## Heretic Approach

This model uses the Heretic method for neural direction ablation:

1. **Identify Refusal Direction** - Compute residual vectors from harmful vs. harmless prompts
2. **Direction Extraction** - Extract the "refusal vector" from the difference of means
3. **Ablative Removal** - Apply LoRA-based modification to subtract this direction from model weights

This method only modifies model weights without changing the architecture or adding inference overhead.
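The steps above can be sketched as a difference-of-means extraction followed by projection ablation. This is a simplified stand-in, not the actual Heretic implementation, which has its own per-layer parameterization; the function names and shapes below are illustrative assumptions.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit 'refusal direction' at one layer via difference of means.

    Inputs: (n_prompts, d_model) arrays of residual-stream activations
    collected on harmful and harmless prompts respectively.
    """
    v = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def ablate(W, v):
    """Remove the component along v from a weight matrix's output.

    W: (d_model, d_in) matrix that writes into the residual stream.
    Returns W' = W - v v^T W, so W' can no longer write along v.
    """
    v = v.reshape(-1, 1)
    return W - v @ (v.T @ W)
```

Because the edit is baked into the weights, the ablated model has the same architecture and inference cost as the original, which matches the "no inference overhead" claim above.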

For detailed technical principles, refer to: Heretic GitHub


## Intended Use Cases

### ✅ Recommended Uses

- Uncensored content creation
- Research and analysis of sensitive topics
- Safety testing and red-teaming exercises
- Academic research on model alignment

### ❌ Not Recommended For

- Production environments requiring content moderation
- Applications targeting minors
- Scenarios with potential legal risks

## Limitations

1. **No Safety Filtering** - The model will answer all questions directly, including harmful or dangerous content
2. **User Discretion Required** - Users must independently judge the appropriateness of generated outputs
3. **Minor Capability Loss** - Some performance degradation on complex tasks may occur

## Disclaimer

⚠️ **Important:** This model is intended for research and educational purposes only.

- This model has had its censorship mechanisms removed and may generate harmful, dangerous, or inappropriate content
- Users assume all risks associated with usage
- Do not use this model for illegal activities, harming others, or any other inappropriate purposes
- The model authors are not liable for any indirect, incidental, or consequential damages

## Acknowledgments