Instructions to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF",
	filename="Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q4_K_M.gguf",
)

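# messages must be a list of {"role", "content"} dicts, not a bare string
llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "Write a Python function that reverses a string."}
	]
)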


llama.cpp

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
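
Once `llama-server` is running, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server listens on its default address `http://localhost:8080` and serves `/v1/chat/completions`; the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Write a Python function that reverses a string."))
```

Building the request separately from sending it makes the payload easy to inspect before anything goes over the wire.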

Use Docker

docker model run hf.co/JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Ollama
How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Ollama:
```
ollama run hf.co/JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
```

Unsloth Studio

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF to start chatting

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Docker Model Runner:
```
docker model run hf.co/JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M
```

Lemonade

How to use JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.5-10B-Frankenmerge-Opus-4.6-Distill

Category	Base (Qwen3.5-9B-Base-Q8_0)	Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill	Δ (percentage points)
Factual Knowledge	85.0% B	85.0% B	=
Reasoning	88.0% B	60.0% C	↓ −28.0
Coding	56.0% D	80.0% B	↑ +24.0
Instruction Following	100.0% A	30.0% F	↓ −70.0
Language	100.0% A	70.0% C	↓ −30.0
Safety Calibration	66.7% C	66.7% C	=
Overall	82.4% B	65.6% C	↓ −16.8

Method: Layer surgery on Qwen3.5-9B-Base-Q8_0 followed by fine-tuning.
Benchmarks run at temperature=0, seed=42.
Coding capability improved significantly (+24 percentage points) at the cost of reasoning, instruction following, and language tasks.

This model was converted to GGUF format using Unsloth.

Example usage:

For text only LLMs: llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja
For multimodal models: llama-mtmd-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja

Available Model files:

Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q6_K.gguf
Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q8_0.gguf
Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q4_K_M.gguf

This was trained 2x faster with Unsloth.

A DIY frankenmerge of Qwen3.5-9B with duplicated reasoning layers, then fine-tuned on high-quality reasoning data. 36 layers instead of 32. ~10B parameters. Text-only, thinking mode supported.

What this is

I took llmfan46/Qwen3.5-9B-ultra-heretic (an abliterated Qwen3.5-9B), duplicated layers 24-27 to give it an extra reasoning block, then trained it sequentially on two datasets to make the new layers earn their keep.

The original 9B has 32 layers arranged as 8 blocks of DeltaNet × 3 + Attention × 1. After surgery, it has 36 layers: 9 complete blocks. The duplicated block starts as an exact copy but diverges during training, giving the model more depth for complex reasoning without changing anything about the input/output behavior.

After the merge, two rounds of SFT with high-rank LoRA (r=128, alpha=256):

Stage 1: Jackrong/Qwen3.5-reasoning-700x (633 examples) at LR 2e-4. Reasoning distillation from Qwen3.5-27B. Gets the frankenmerge coherent and stabilizes the duplicated layers.
Stage 2: nohurry/Opus-4.6-Reasoning-3000x-filtered (2326 examples) at LR 5e-5. Claude Opus 4.6 reasoning traces. Strengthens the model's actual problem-solving ability.

Why frankenmerge + train?

David Noel Ng's RYS work showed you can top the Open LLM Leaderboard by duplicating middle "reasoning" layers of a model without changing a single weight. The idea: early layers handle input encoding, late layers handle output decoding, and the middle layers do the actual thinking. Give the model more layers to think with, it thinks better.

RockTalk/Qwen3.5-9B-Franken-L24-27 applied this to Qwen3.5-9B and showed improvements without any post-training. A Reddit post on layer surgery explored similar ideas.

Then I saw Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which showed that distilling structured reasoning from Claude Opus into Qwen3.5 massively reduces the overthinking/looping problem and makes the model more coherent and autonomous.

So the logic was: frankenmerge for extra capacity, then train the new capacity on high-quality reasoning data. Layer surgery gives you the architecture; SFT teaches the duplicated layers what to do with themselves.

The surgery, specifically

Qwen3.5-9B's 32 layers follow a repeating pattern:

Block 0: layers  0- 3  (DeltaNet, DeltaNet, DeltaNet, Attention)
Block 1: layers  4- 7  (DeltaNet, DeltaNet, DeltaNet, Attention)
...
Block 6: layers 24-27  (DeltaNet, DeltaNet, DeltaNet, Attention)  ← duplicated
Block 7: layers 28-31  (DeltaNet, DeltaNet, DeltaNet, Attention)

After surgery:

Blocks 0-6: layers  0-27  (original, unchanged)
Block 6':  layers 28-31  (deep copy of layers 24-27)
Block 7:   layers 32-35  (original layers 28-31, shifted)

The copy is made with Python's copy.deepcopy on the model loaded in PyTorch from clean bf16 weights. No quantization artifacts, no weight key remapping hacks.
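
The index bookkeeping is easy to get wrong, so here is a toy sketch of the duplication step. It is not the actual surgery script: plain dicts stand in for decoder layers, and `make_toy_layers` and `duplicate_block` are hypothetical helper names. Only the use of `copy.deepcopy` and the layer ranges come from the description above.

```python
import copy

# Repeating layer pattern described above: DeltaNet x3 + Attention x1.
PATTERN = ["DeltaNet", "DeltaNet", "DeltaNet", "Attention"]


def make_toy_layers(n_layers=32):
    """Stand-in for the decoder stack: one dict per layer (not real weights)."""
    return [
        {"source_index": i, "kind": PATTERN[i % 4], "weights": [float(i)]}
        for i in range(n_layers)
    ]


def duplicate_block(layers, start, end):
    """Insert a deep copy of layers[start..end] (inclusive) right after `end`."""
    block_copy = [copy.deepcopy(layer) for layer in layers[start : end + 1]]
    return layers[: end + 1] + block_copy + layers[end + 1 :]


layers = duplicate_block(make_toy_layers(), start=24, end=27)

print(len(layers))  # 36
# Source indices of the last three blocks: 24..27, 24..27 (copy), 28..31
print([layer["source_index"] for layer in layers[24:36]])
# The DeltaNet x3 + Attention x1 rhythm survives: 9 complete blocks.
print(all(layer["kind"] == PATTERN[i % 4] for i, layer in enumerate(layers)))
```

Because the copy is deep, the new block shares no storage with layers 24-27, which is what lets it diverge during training. On a real model the layer count in the config and any per-layer index bookkeeping would presumably need updating as well; those details are not described here.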

Training details

	Stage 1	Stage 2
Dataset	Qwen3.5-reasoning-700x	Opus-4.6-Reasoning-3000x-filtered
Examples	633	2326
Learning rate	2e-4	5e-5
Schedule	Cosine	Cosine
Epochs	1	1
Effective batch	8	8
LoRA rank	128	128
LoRA alpha	256	256
RSLoRA	Yes	Yes
Precision	bf16	bf16
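
A few numbers follow directly from the table. The sketch below works them out. It assumes the standard definitions: classic LoRA scales the adapter update by alpha / r, while rank-stabilized LoRA (RSLoRA) scales it by alpha / sqrt(r). The step counts assume the final partial batch is kept, which the table does not confirm.

```python
import math

rank, alpha = 128, 256

standard_scale = alpha / rank           # classic LoRA: alpha / r
rslora_scale = alpha / math.sqrt(rank)  # rank-stabilized LoRA: alpha / sqrt(r)

print(f"standard LoRA scale: {standard_scale:.2f}")  # 2.00
print(f"RSLoRA scale:        {rslora_scale:.2f}")    # 22.63

# Optimizer steps for one epoch at effective batch 8
# (assumption: the last, partial batch is not dropped).
steps = {stage: math.ceil(examples / 8)
         for stage, examples in {"Stage 1": 633, "Stage 2": 2326}.items()}
print(steps)  # {'Stage 1': 80, 'Stage 2': 291}
```

At rank 128 the RSLoRA scale is roughly eleven times the classic one, which is worth keeping in mind when comparing these learning rates against non-RSLoRA recipes.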

Trained on a single G4 using Unsloth. Response-only masking (instruction tokens masked with -100). Sequential training: Stage 1 completes fully before Stage 2 begins. The LoRA adapters accumulate both stages.
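
Response-only masking can be illustrated in a few lines. This is a minimal sketch, not the Unsloth implementation: `mask_prompt_labels` is a hypothetical helper, and the token ids are made up. The one fixed detail is the value -100, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids to labels, hiding the first `prompt_length` tokens from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Toy token ids: 4 instruction tokens followed by 3 response tokens.
input_ids = [101, 7592, 2088, 102, 3449, 2003, 103]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 3449, 2003, 103]
```

Only the response positions contribute to the loss, so the gradient updates go toward producing the reasoning traces rather than reproducing the prompts.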

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    trust_remote_code=True,
)

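messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
# Render the conversation with the model's chat template and append the assistant prefix
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# do_sample=True makes sure temperature/top_p/top_k are actually applied
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
# The decoded string contains the prompt followed by the generated reply
print(tokenizer.decode(outputs[0], skip_special_tokens=True))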
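
Since thinking mode is supported, the decoded text may contain a reasoning trace before the final answer. The sketch below separates the two. It assumes the trace is delimited by `<think>` and `</think>` tags, as in other Qwen thinking models; this card does not confirm the exact tags, and `split_thinking` is a hypothetical helper.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer); reasoning is empty if no closing tag is found."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = head.split(open_tag, 1)[-1].strip()
    return reasoning, tail.strip()


sample = "<think>Assume sqrt(2) = p/q in lowest terms...</think>\nSo sqrt(2) is irrational."
reasoning, answer = split_thinking(sample)
print(answer)  # So sqrt(2) is irrational.
```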

Acknowledgments

This model wouldn't exist without the work of:

David Noel Ng (dnhkng) for the RYS research proving layer duplication works, and for writing such a clear explanation of the "LLM neuroanatomy" concept
RockTalk for demonstrating the frankenmerge on Qwen3.5-9B specifically (even though the weights turned out to be 4-bit under the hood, the idea was sound)
Jackrong for both the Opus-distilled model showing how well reasoning distillation works on Qwen3.5, and for the Qwen3.5-reasoning-700x dataset
nohurry for the filtered Opus 4.6 reasoning dataset
llmfan46 for the ultra-heretic abliteration, which gave me a clean, uncensored base to build on
r/LocalLLaMA for the collective insanity that makes all of this happen
The Qwen team at Alibaba for the base Qwen3.5 architecture
Unsloth for making training on a single GPU actually feasible