Ornstein3.6-35B-A3B-GGUF
GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B — a Qwen 3.6 MoE fine-tune (35B total, ~3B active).
Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
Image/video sidecars: This repository now includes the restored Qwen3.6 multimodal config, processor/preprocessor files, tokenizer/chat template, safetensors index, and the model-vision-from-qwen3.6-base.safetensors visual tower sidecar. The existing .gguf binaries were not rewritten in this metadata-copy pass.
Model info
- Architecture: Qwen3_5MoeForCausalLM (linear + full attention interleaved, Gated Delta Net)
- Parameters: 34.66 B total / ~3 B active (256 experts, 8 active per token)
- Context: 262,144 tokens
- Hidden size / layers: 2048 / 40
- Vocab: 248,320 tokens
All sub-8-bit quants were produced with an importance matrix (imatrix) computed from a mixed-domain multilingual calibration corpus (eaddario/imatrix-calibration → combined_all_medium): 200 chunks × 512 tokens = 102,400 calibration tokens.
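The figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the active-parameter share uses the rounded "~3 B" figure, so treat it as approximate.

```python
# Calibration budget: 200 chunks of 512 tokens each.
CHUNKS = 200
TOKENS_PER_CHUNK = 512
calibration_tokens = CHUNKS * TOKENS_PER_CHUNK

# MoE routing: 8 of 256 experts are active per token.
expert_fraction = 8 / 256

# Rough share of parameters active per token (uses the rounded ~3 B figure).
approx_active_share = 3.0 / 34.66

print(f"calibration tokens: {calibration_tokens:,}")
print(f"experts active per token: {expert_fraction:.3%}")
print(f"approx. parameters active per token: {approx_active_share:.1%}")
```

The parameter share is larger than the expert share because attention, embeddings, and other non-expert weights are used for every token.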
Quant index
Choose a quant that fits in your RAM/VRAM with room for context. For MoE models, quality degrades more sharply at low bit widths than for dense models of similar size, so prefer Q4_K_M or higher if you have the memory.
| File | Bits | Size | imatrix | Notes |
|---|---|---|---|---|
| Ornstein3.6-35B-A3B-Q8_0.gguf | 8 | 36.9 GB | — | Reference, near-lossless |
| Ornstein3.6-35B-A3B-Q6_K.gguf | 6.5 | 28.5 GB | — | Great default for 32 GB+ systems |
| Ornstein3.6-35B-A3B-Q5_K_M.gguf | 5.5 | 24.7 GB | ✓ | Excellent quality/size balance |
| Ornstein3.6-35B-A3B-Q5_K_S.gguf | 5.5 | 24.0 GB | ✓ | Slightly smaller Q5 |
| Ornstein3.6-35B-A3B-Q4_K_M.gguf | 4.5 | 21.2 GB | ✓ | Common 24 GB-card default |
| Ornstein3.6-35B-A3B-Q4_K_S.gguf | 4.5 | 19.9 GB | ✓ | Smaller Q4 |
| Ornstein3.6-35B-A3B-IQ4_XS.gguf | 4.25 | ~18 GB | ✓ | Smaller than Q4_K_S, comparable quality with imatrix |
| Ornstein3.6-35B-A3B-Q3_K_M.gguf | 3.5 | 16.8 GB | ✓ | Usable; quality below Q4 |
| Ornstein3.6-35B-A3B-Q3_K_S.gguf | 3.5 | 15.2 GB | ✓ | Smaller Q3 |
| Ornstein3.6-35B-A3B-IQ3_M.gguf | 3.3 | ~15 GB | ✓ | Mixed I-quant, beats Q3_K_S at similar size |
| Ornstein3.6-35B-A3B-IQ3_XXS.gguf | 3.0 | ~13 GB | ✓ | Aggressive 3-bit |
| Ornstein3.6-35B-A3B-Q2_K.gguf | 2.6 | 12.9 GB | ✓ | Lowest K-quant; expect degraded quality |
| Ornstein3.6-35B-A3B-IQ2_M.gguf | 2.7 | ~12 GB | ✓ | Aggressive I-quant 2-bit |
| imatrix.dat | — | 192 MB | — | Importance matrix (GGUF format) |
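As a rough aid for choosing from the table, the sketch below picks the largest quant that fits a given memory budget and estimates effective bits per weight from file size. The function names and the 4 GB default headroom for context and runtime overhead are assumptions for illustration, not recommendations from this repository. Sizes marked "~" in the table are entered as approximate values, and 1 GB is assumed to mean 10^9 bytes.

```python
from typing import Optional

# File sizes in GB, copied from the Quant index ("~" entries are approximate).
QUANT_SIZES_GB = {
    "Q8_0": 36.9, "Q6_K": 28.5, "Q5_K_M": 24.7, "Q5_K_S": 24.0,
    "Q4_K_M": 21.2, "Q4_K_S": 19.9, "IQ4_XS": 18.0, "Q3_K_M": 16.8,
    "Q3_K_S": 15.2, "IQ3_M": 15.0, "IQ3_XXS": 13.0, "Q2_K": 12.9,
    "IQ2_M": 12.0,
}

TOTAL_PARAMS = 34.66e9  # from "Model info"


def pick_quant(memory_gb: float, headroom_gb: float = 4.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom.

    headroom_gb is a hypothetical allowance for KV cache and runtime overhead;
    the real requirement depends on context length and backend.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def effective_bits_per_weight(size_gb: float) -> float:
    """Whole-file bits per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / TOTAL_PARAMS


for mem in (16, 24, 32, 48):
    print(mem, "GB ->", pick_quant(mem))
print(f"Q4_K_M: {effective_bits_per_weight(QUANT_SIZES_GB['Q4_K_M']):.2f} bits/weight")
```

The effective figure comes out somewhat above the nominal "Bits" column because the file also holds metadata and some tensors are kept at higher precision.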
Usage
llama.cpp
# Interactive chat
llama-cli -m Ornstein3.6-35B-A3B-Q4_K_M.gguf -cnv
# Single prompt
llama-cli -m Ornstein3.6-35B-A3B-Q5_K_M.gguf -p "Write a haiku about MoE routing."
# OpenAI-compatible server
llama-server -m Ornstein3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192
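Once the server from the last command is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at localhost on port 8080, as in the command above; the helper names and the max_tokens default are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed to match --port 8080 above


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Write a haiku about MoE routing."))
```

Building the request separately from sending it makes the payload easy to inspect before anything touches the network.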
Other runners
LM Studio, Ollama (via a Modelfile), koboldcpp, and text-generation-webui all load these GGUFs provided their bundled llama.cpp supports Qwen3_5MoeForCausalLM with Gated Delta Net.
Reproducing the quants
# 1. Convert safetensors → BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py <model_dir> \
--outtype bf16 --outfile Ornstein3.6-35B-A3B-BF16.gguf
# 2. Importance matrix
llama-imatrix \
-m Ornstein3.6-35B-A3B-BF16.gguf \
-f calibration.txt \
-o imatrix.dat \
--chunks 200 -c 512 -b 512 -ngl 99
# 3. Quantize (example)
llama-quantize --imatrix imatrix.dat \
Ornstein3.6-35B-A3B-BF16.gguf \
Ornstein3.6-35B-A3B-Q4_K_M.gguf Q4_K_M
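To produce several quants in one go, the three steps above can be assembled programmatically. The sketch below only builds and prints the command lines; the function name and defaults are hypothetical. It assumes, based on the imatrix column of the Quant index, that Q8_0 and Q6_K are quantized without the importance matrix.

```python
import shlex
from typing import List

# Quant types marked without an imatrix in the Quant index (assumption from that table).
NO_IMATRIX = {"Q8_0", "Q6_K"}


def quantize_commands(model_dir: str, quant_types: List[str],
                      base: str = "Ornstein3.6-35B-A3B",
                      calibration: str = "calibration.txt") -> List[List[str]]:
    """Return argv lists for: convert to BF16, compute imatrix, then each quant."""
    bf16 = f"{base}-BF16.gguf"
    commands = [
        ["python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
         "--outtype", "bf16", "--outfile", bf16],
        ["llama-imatrix", "-m", bf16, "-f", calibration, "-o", "imatrix.dat",
         "--chunks", "200", "-c", "512", "-b", "512", "-ngl", "99"],
    ]
    for quant in quant_types:
        cmd = ["llama-quantize"]
        if quant not in NO_IMATRIX:
            cmd += ["--imatrix", "imatrix.dat"]
        cmd += [bf16, f"{base}-{quant}.gguf", quant]
        commands.append(cmd)
    return commands


for command in quantize_commands("<model_dir>", ["Q4_K_M", "Q8_0"]):
    print(shlex.join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) to execute the step, so a failed conversion stops the run before any quantization starts.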
License
Apache 2.0 — inherited from the Qwen 3.6 base release.