Qwen3.5-0.8B-heretic-2R-0.12KL – Q5_K_M GGUF
Abliterated and quantized GGUF version of Qwen/Qwen3.5-0.8B.
Abliterated using the Heretic method, then converted and quantized to Q5_K_M (~570 MB) for local inference.
Produced by @merileijona (GitHub: juhanimerilehto)
Abliteration details
| Property | Value |
|---|---|
| Method | Heretic |
| Iterations | 800 |
| Refusal rate (post-abliteration) | 2/100 prompts |
| KL divergence | 0.1243 |
| Base model | Qwen/Qwen3.5-0.8B |
Abliteration surgically removes refusal directions from the model's weight matrices via SVD, without retraining. The KL divergence of 0.1243 from the base model indicates minimal capability degradation from the intervention.
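The core idea can be illustrated with a minimal sketch of directional ablation: projecting a refusal direction out of a weight matrix so the layer can no longer write along that direction. This is an illustration of the general technique, not the Heretic implementation; the direction `r` is assumed to have been extracted beforehand (e.g. from activation differences on refused vs. complied prompts).

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's output along direction r.

    W: (d_out, d_in) weight matrix; r: (d_out,) refusal direction.
    Returns W' = (I - r r^T) W, so W' can no longer write along r.
    """
    r = r / np.linalg.norm(r)          # ensure unit norm
    return W - np.outer(r, r) @ W

# Toy example: random weights, hypothetical refusal direction
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)

# The ablated matrix has (numerically) zero output component along r
print(np.abs((r / np.linalg.norm(r)) @ W_abl).max())  # ~ machine epsilon
```

Heretic additionally searches over where and how strongly to apply such interventions, optimising the refusal-rate/KL trade-off reported in the table above.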
Intended use
This model is intended for LLM red-teaming, safety research, and evaluation of refusal removal techniques. It is not intended for general end-user deployment.
Known behaviours and limitations
- Thinking mode loops: At Q5_K_M quantization, the model's chain-of-thought (`<think>`) mode can spiral into repetitive or incoherent loops, particularly near topics adjacent to the abliterated refusal directions.
- Degraded coherence near sensitive topics: Outputs near normally restricted content areas may lose coherence. This appears to be an interaction between the abliteration and Q5_K_M quantization rather than a pure abliteration artefact.
- Not formally benchmarked: Metrics above are from manual testing only.
The `mirostat=2` and `repeat_penalty` settings below are strongly recommended to mitigate loop behaviour.
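Mirostat 2.0 mitigates loops by capping per-token surprise and steering it toward a target entropy (`tau`). A sketch of one sampling step, modelled on the published Mirostat 2.0 algorithm (simplified; not the actual llama.cpp source):

```python
import numpy as np

def mirostat_v2_step(probs: np.ndarray, mu: float, tau: float, eta: float,
                     rng: np.random.Generator):
    """One Mirostat 2.0 sampling step.

    probs: next-token probabilities; mu: running surprise cap (start at 2*tau).
    Returns (sampled token index, updated mu).
    """
    surprise = -np.log2(probs + 1e-12)
    keep = surprise < mu                   # drop tokens more surprising than mu
    if not keep.any():
        keep[np.argmax(probs)] = True      # always keep at least the top token
    p = np.where(keep, probs, 0.0)
    p = p / p.sum()
    tok = int(rng.choice(len(p), p=p))
    mu = mu - eta * (surprise[tok] - tau)  # steer observed surprise toward tau
    return tok, mu

# Usage: mu starts at 2*tau and adapts as tokens are sampled
rng = np.random.default_rng(0)
probs = np.array([0.6, 0.25, 0.1, 0.05])
mu = 2 * 4.5
for _ in range(5):
    tok, mu = mirostat_v2_step(probs, mu, tau=4.5, eta=0.1, rng=rng)
```

The adaptive cap is why it outperforms a fixed `top_k`/`top_p` cutoff at breaking repetition loops: when the model gets stuck on low-surprise tokens, `mu` tightens and forces more diverse choices.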
Usage with llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-0.8B-heretic-2R-0.12KL.gguf",
    n_gpu_layers=0,   # CPU-only inference
    n_ctx=4096,
    n_threads=16,
    verbose=False,
)

response = llm(
    "Your prompt here",
    max_tokens=2048,
    temperature=0.8,
    min_p=0.05,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.12,
    presence_penalty=0.2,
    frequency_penalty=0.15,
    mirostat_mode=2,   # Mirostat 2.0: primary loop mitigation
    mirostat_tau=4.5,
    mirostat_eta=0.1,
)
print(response["choices"][0]["text"])
```
Tested inference settings
| Parameter | Value | Note |
|---|---|---|
| `n_gpu_layers` | 0 | CPU-only |
| `n_ctx` | 4096 | |
| `n_threads` | 16 | |
| `temperature` | 0.8 | |
| `min_p` | 0.05 | Stronger junk-token cutoff |
| `top_p` | 0.95 | |
| `top_k` | 40 | Limits candidates per step |
| `repeat_penalty` | 1.12 | Punishes immediate repetition |
| `presence_penalty` | 0.2 | Discourages token reuse |
| `frequency_penalty` | 0.15 | Penalises frequent tokens |
| `mirostat` | 2 (Mirostat 2.0) | Primary loop mitigation |
| `mirostat_tau` | 4.5 | Target perplexity |
| `mirostat_eta` | 0.1 | Adaptation speed |
| `max_tokens` | 2048 | |
Test hardware: AMD Ryzen 9 5950X, 128 GB RAM, Windows 11. GPU not used for inference.
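The same settings can be reproduced with the stock llama.cpp CLI. Flag names below assume a recent llama.cpp build; verify against `llama-cli --help` for your version:

```shell
llama-cli -m Qwen3.5-0.8B-heretic-2R-0.12KL.gguf \
  -ngl 0 -c 4096 -t 16 -n 2048 \
  --temp 0.8 --min-p 0.05 --top-p 0.95 --top-k 40 \
  --repeat-penalty 1.12 --presence-penalty 0.2 --frequency-penalty 0.15 \
  --mirostat 2 --mirostat-ent 4.5 --mirostat-lr 0.1 \
  -p "Your prompt here"
```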