Qwen3.5-0.8B-heretic-2R-0.12KL – Q5_K_M GGUF

Abliterated and quantized GGUF version of Qwen/Qwen3.5-0.8B.

Abliterated using the Heretic method, then converted and quantized to Q5_K_M (~570 MB) for local inference.

Produced by @merileijona (GitHub: juhanimerilehto)


Abliteration details

Property                          Value
Method                            Heretic
Iterations                        800
Refusal rate (post-abliteration)  2/100 prompts
KL divergence                     0.1243
Base model                        Qwen/Qwen3.5-0.8B

Abliteration surgically removes refusal directions from the model's weight matrices via singular value decomposition (SVD), without retraining. The KL divergence of 0.1243 against the base model indicates minimal capability degradation from the intervention.
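The exact Heretic procedure is more involved than this, but the core weight edit can be sketched as a rank-1 orthogonal projection, assuming a refusal direction r has already been extracted. The function name below is hypothetical and this is not the Heretic codebase:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction r out of a weight matrix W.

    W is assumed to produce hidden states on its output side, so we
    remove the component of each output that lies along r:
    W' = (I - r r^T) W. This is a rank-1 update; no retraining needed.
    """
    r = r / np.linalg.norm(r)       # unit refusal direction
    return W - np.outer(r, r) @ W   # subtract the projection onto r

# Toy check: after ablation, no output of W has a component along r.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
r = rng.standard_normal(4)
W_abl = ablate_direction(W, r)
print(np.allclose((r / np.linalg.norm(r)) @ W_abl, 0.0))  # True
```

In a real pipeline this edit would be applied to the relevant projection matrices in every targeted transformer layer, with the direction(s) estimated from contrasting harmful/harmless prompt activations.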


Intended use

This model is intended for LLM red-teaming, safety research, and evaluation of refusal removal techniques. It is not intended for general end-user deployment.
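Evaluations of refusal-removal techniques typically compare next-token distributions between the base and modified models. A minimal sketch of the per-position KL term follows; this is illustrative only, not the Heretic measurement code, which presumably averages over many prompts and token positions:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two next-token probability distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Toy distributions over a 4-token vocabulary: identical distributions
# give KL = 0; a slightly shifted one gives a small positive KL.
p = np.array([0.7, 0.2, 0.05, 0.05])
q = np.array([0.6, 0.25, 0.1, 0.05])
print(kl_divergence(p, p))      # 0.0
print(kl_divergence(p, q) > 0)  # True
```

A small aggregate KL (such as the 0.1243 reported above) means the abliterated model's token distributions stay close to the base model's on ordinary inputs.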


Known behaviours and limitations

  • Thinking mode loops: At Q5_K_M quantization, the model's chain-of-thought (<think>) mode can spiral into repetitive or incoherent loops, particularly near topics adjacent to the abliterated refusal directions.
  • Degraded coherence near sensitive topics: Outputs near normally restricted content areas may lose coherence. This appears to be an interaction between the abliteration and the Q5_K_M quantization rather than a pure abliteration artefact.
  • Not formally benchmarked: The metrics above come from manual testing only.
  • Sampler settings matter: The mirostat=2 and repeat_penalty values below are strongly recommended to mitigate loop behaviour.

Usage with llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-0.8B-heretic-2R-0.12KL.gguf",
    n_gpu_layers=0,   # CPU-only; raise to offload layers to a GPU
    n_ctx=4096,
    n_threads=16,     # match your physical core count
    verbose=False,
)

response = llm(
    "Your prompt here",
    max_tokens=2048,
    temperature=0.8,
    min_p=0.05,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.12,
    presence_penalty=0.2,
    frequency_penalty=0.15,
    mirostat_mode=2,  # Mirostat 2.0: primary loop mitigation
    mirostat_tau=4.5,
    mirostat_eta=0.1,
)
print(response["choices"][0]["text"])
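Given the thinking-mode loop behaviour noted above, it can help to post-process generated text before displaying it: drop closed <think>...</think> blocks and cut off an unclosed one that runs away. A minimal sketch (the helper name and the 4000-character cutoff are arbitrary choices, not part of the model or library):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text: str, max_think_chars: int = 4000) -> str:
    """Remove closed <think>...</think> blocks; if an unclosed block
    runs past max_think_chars, drop everything from <think> onward."""
    text = THINK_RE.sub("", text)
    start = text.find("<think>")
    if start != -1 and len(text) - start > max_think_chars:
        text = text[:start]
    return text.rstrip()

out = "<think>step 1... step 1... step 1...</think>The answer is 4."
print(strip_think(out))  # The answer is 4.
```

In practice you would run this over response["choices"][0]["text"] before showing the output.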

Tested inference settings

Parameter          Value             Note
n_gpu_layers       0                 CPU-only
n_ctx              4096
n_threads          16
temperature        0.8
min_p              0.05              Stronger junk-token cutoff
top_p              0.95
top_k              40                Limits candidates per step
repeat_penalty     1.12              Penalises immediate repetition
presence_penalty   0.2               Discourages token reuse
frequency_penalty  0.15              Penalises frequent tokens
mirostat           2 (Mirostat 2.0)  Primary loop mitigation
mirostat_tau       4.5               Target perplexity
mirostat_eta       0.1               Adaptation speed
max_tokens         2048
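mirostat_tau and mirostat_eta work together as a feedback loop, which is why Mirostat 2.0 is the primary loop mitigation here. A simplified sketch of the per-token threshold update (not llama.cpp's actual implementation):

```python
import math

def mirostat_update(mu: float, token_prob: float,
                    tau: float = 4.5, eta: float = 0.1) -> float:
    """One step of Mirostat 2.0's control loop. Tokens whose surprise
    (-log2 p) exceeds the threshold mu are masked before sampling; after
    sampling, mu is nudged so observed surprise tracks the target tau."""
    surprise = -math.log2(token_prob)
    return mu - eta * (surprise - tau)

# A looping model keeps emitting near-certain tokens (surprise << tau),
# so mu rises, re-admitting more surprising tokens and breaking the loop.
mu = 2 * 4.5  # a common initialisation: mu0 = 2 * tau
for _ in range(5):
    mu = mirostat_update(mu, token_prob=0.9)
print(mu > 9.0)  # True
```

Larger eta makes the threshold react faster; larger tau targets a higher average surprise (roughly, perplexity) in the output.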

Test hardware: AMD Ryzen 9 5950X, 128 GB RAM, Windows 11. GPU not used for inference.

Model details: 0.8B parameters, qwen35 architecture, GGUF (Q5_K_M).