Qwen3.5-10B-Frankenmerge-Opus-4.6-Distill
| Category | Base (Qwen3.5-9B-Base-Q8_0) | Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill | Δ |
|---|---|---|---|
| Factual Knowledge | 85.0% B | 85.0% B | = |
| Reasoning | 88.0% B | 60.0% C | ↓ −28.0% |
| Coding | 56.0% D | 80.0% B | ↑ +24.0% |
| Instruction Following | 100.0% A | 30.0% F | ↓ −70.0% |
| Language | 100.0% A | 70.0% C | ↓ −30.0% |
| Safety Calibration | 66.7% C | 66.7% C | = |
| Overall | 82.4% B | 65.6% C | ↓ −16.8% |
Method: Layer surgery on Qwen3.5-9B-Base-Q8_0 followed by fine-tuning.
Benchmarks run at temperature=0, seed=42.
Coding improved markedly (+24 points, from 56% to 80%), at the cost of reasoning (−28), instruction following (−70), and language (−30).
This model was converted to GGUF format using Unsloth.
Example usage:
- For text only LLMs:
llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja
- For multimodal models:
llama-mtmd-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja
Available Model files:
- Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q6_K.gguf
- Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q8_0.gguf
- Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q4_K_M.gguf
This was trained 2x faster with Unsloth
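If you only want one of the files listed above rather than the whole repo, something like this should do it. Treat it as a minimal sketch: it assumes the `huggingface_hub` package is installed, and `gguf_filename` and `download_quant` are helper names made up here for illustration, not part of any library.

```python
REPO_ID = "JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF"
QUANTS = ("Q4_K_M", "Q6_K", "Q8_0")  # the three quantizations listed on this card


def gguf_filename(quant: str) -> str:
    """Build the GGUF file name for one of the listed quantizations."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.{quant}.gguf"


def download_quant(quant: str = "Q4_K_M") -> str:
    """Download a single GGUF file and return its local cache path."""
    # Imported lazily so the name helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
```

Calling `download_quant("Q8_0")` returns a local path that can then be handed to whichever GGUF runtime you use.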
A DIY frankenmerge of Qwen3.5-9B with duplicated reasoning layers, then fine-tuned on high-quality reasoning data. 36 layers instead of 32. ~10B parameters. Text-only, thinking mode supported.
What this is
I took llmfan46/Qwen3.5-9B-ultra-heretic (an abliterated Qwen3.5-9B), duplicated layers 24-27 to give it an extra reasoning block, then trained it sequentially on two datasets to make the new layers earn their keep.
The original 9B has 32 layers arranged as 8 blocks of DeltaNet × 3 + Attention × 1. After surgery, it has 36 layers: 9 complete blocks. The duplicated block starts as an exact copy but diverges during training, giving the model more depth for complex reasoning without changing anything about the input/output behavior.
After the merge, two rounds of SFT with high-rank LoRA (r=128, alpha=256):
- Stage 1: Jackrong/Qwen3.5-reasoning-700x (633 examples) at LR 2e-4. Reasoning distillation from Qwen3.5-27B. Gets the frankenmerge coherent and stabilizes the duplicated layers.
- Stage 2: nohurry/Opus-4.6-Reasoning-3000x-filtered (~3000 examples) at LR 5e-5. Claude Opus 4.6 reasoning traces. Strengthens the model's actual problem-solving ability.
Why frankenmerge + train?
David Noel Ng's RYS work showed you can top the Open LLM Leaderboard by duplicating middle "reasoning" layers of a model without changing a single weight. The idea: early layers handle input encoding, late layers handle output decoding, and the middle layers do the actual thinking. Give the model more layers to think with, it thinks better.
RockTalk/Qwen3.5-9B-Franken-L24-27 applied this to Qwen3.5-9B and showed improvements without any post-training. A Reddit post on layer surgery explored similar ideas.
Then I saw Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which showed that distilling structured reasoning from Claude Opus into Qwen3.5 massively reduces the overthinking/looping problem and makes the model more coherent and autonomous.
So the logic was: frankenmerge for extra capacity, then train the new capacity on high-quality reasoning data. Layer surgery gives you the architecture; SFT teaches the duplicated layers what to do with themselves.
The surgery, specifically
Qwen3.5-9B's 32 layers follow a repeating pattern:
Block 0: layers 0- 3 (DeltaNet, DeltaNet, DeltaNet, Attention)
Block 1: layers 4- 7 (DeltaNet, DeltaNet, DeltaNet, Attention)
...
Block 6: layers 24-27 (DeltaNet, DeltaNet, DeltaNet, Attention) ← duplicated
Block 7: layers 28-31 (DeltaNet, DeltaNet, DeltaNet, Attention)
After surgery:
Blocks 0-6: layers 0-27 (original, unchanged)
Block 6': layers 28-31 (deep copy of layers 24-27)
Block 7: layers 32-35 (original layers 28-31, shifted)
The copy is done with copy.deepcopy in PyTorch from clean bf16 weights. No quantization artifacts, no weight key remapping hacks.
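The index shuffle above can be sketched in plain Python. This is not the actual surgery script, just an illustration of the same idea: `duplicate_block` is a hypothetical helper, and the dicts stand in for real decoder layers. On a real model you would run the same operation on the decoder's layer list and also update the layer count in the config (plus any per-layer index attributes), which this sketch does not attempt.

```python
import copy


def duplicate_block(layers, start, end):
    """Return a new list with a deep copy of layers[start..end] inserted right after the original."""
    block_copy = [copy.deepcopy(layer) for layer in layers[start:end + 1]]
    return list(layers[:end + 1]) + block_copy + list(layers[end + 1:])


# Stand-in layers: 8 blocks of DeltaNet x 3 + Attention x 1.
PATTERN = ("DeltaNet", "DeltaNet", "DeltaNet", "Attention")
original = [{"orig_index": i, "type": PATTERN[i % 4]} for i in range(32)]

merged = duplicate_block(original, 24, 27)

# For each new layer position, which original layer did it come from?
source_of = [layer["orig_index"] for layer in merged]
```

After this, `source_of[28:32]` points back at original layers 24-27 and `source_of[32:36]` at original layers 28-31, matching the layout above. Because the copies are deep copies, training can change them without touching the originals.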
Training details
| | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | Qwen3.5-reasoning-700x | Opus-4.6-Reasoning-3000x-filtered |
| Examples | 633 | 2326 |
| Learning rate | 2e-4 | 5e-5 |
| Schedule | Cosine | Cosine |
| Epochs | 1 | 1 |
| Effective batch | 8 | 8 |
| LoRA rank | 128 | 128 |
| LoRA alpha | 256 | 256 |
| RSLoRA | Yes | Yes |
| Precision | bf16 | bf16 |
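A quick bit of arithmetic on the numbers in the table. Rank-stabilized LoRA scales the adapter update by alpha / sqrt(r) instead of the standard alpha / r, so the same r=128, alpha=256 gives a much larger effective scale. The step counts assume the final partial batch is kept; that is an assumption, not something stated on this card, and `lora_scale` is just an illustrative helper name.

```python
import math


def lora_scale(alpha: float, rank: int, rslora: bool) -> float:
    """Scaling factor applied to the low-rank update."""
    return alpha / math.sqrt(rank) if rslora else alpha / rank


standard = lora_scale(256, 128, rslora=False)    # alpha / r
stabilized = lora_scale(256, 128, rslora=True)   # alpha / sqrt(r)

# Optimizer steps for one epoch at an effective batch size of 8,
# assuming the last partial batch is not dropped.
stage1_steps = math.ceil(633 / 8)
stage2_steps = math.ceil(2326 / 8)

print(f"standard LoRA scale: {standard:.2f}")
print(f"rsLoRA scale:        {stabilized:.2f}")
print(f"steps: stage 1 = {stage1_steps}, stage 2 = {stage2_steps}")
```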
Trained on a single G4 using Unsloth. Response-only masking (instruction tokens masked with -100). Sequential training: Stage 1 completes fully before Stage 2 begins. The LoRA adapters accumulate both stages.
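Response-only masking boils down to something like the sketch below. Unsloth handles this internally; `mask_prompt_labels` is a made-up name and the token ids are toy values. The only hard fact used is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, replacing the first prompt_len positions with -100."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy example: three instruction tokens followed by two response tokens.
input_ids = [101, 102, 103, 7, 8]
labels = mask_prompt_labels(input_ids, prompt_len=3)
```

Here only the two response tokens keep their ids in `labels`, so the model is trained to produce the answer without being rewarded for reproducing the prompt.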
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    trust_remote_code=True,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# do_sample=True so that temperature, top_p, and top_k actually take effect.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Acknowledgments
This model wouldn't exist without the work of:
- David Noel Ng (dnhkng) for the RYS research proving layer duplication works, and for writing such a clear explanation of the "LLM neuroanatomy" concept
- RockTalk for demonstrating the frankenmerge on Qwen3.5-9B specifically (even though the weights turned out to be 4-bit under the hood, the idea was sound)
- Jackrong for both the Opus-distilled model showing how well reasoning distillation works on Qwen3.5, and for the Qwen3.5-reasoning-700x dataset
- nohurry for the filtered Opus 4.6 reasoning dataset
- llmfan46 for the ultra-heretic abliteration, which gave me a clean, uncensored base to build on
- r/LocalLLaMA for the collective insanity that makes all of this happen
- The Qwen team at Alibaba for the base Qwen3.5 architecture
- Unsloth for making training on a single GPU actually feasible
License
Apache 2.0, same as the base Qwen3.5 model.
Model tree for JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF
Base model
Qwen/Qwen3.5-9B-Base