Instructions to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="rbentaarit/kubelm-qwen2.5-1.5b-v1", filename="kubelm-edge.Q4_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M # Run inference directly in the terminal: llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M # Run inference directly in the terminal: llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Use Docker
docker model run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Ollama:
ollama run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
- Unsloth Studio
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting
- Pi
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Docker Model Runner:
docker model run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
- Lemonade
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
Run and chat with the model
lemonade run user.kubelm-qwen2.5-1.5b-v1-Q4_K_M
List all available models
lemonade list
kubelm-qwen2.5-1.5b-v1 โ Q4_K_M GGUF
The edge rung of the kubelm tier ladder: a 1.5B-parameter K8sGPT MCP tool-use specialist, fine-tuned with QLoRA on Qwen2.5-1.5B-Instruct and quantized to Q4_K_M (~986 MB on disk, ~1.1 GB serving RAM) for CPU-only deployment. The first kubelm release, and the only tier that also runs under Ollama โ its Qwen2.5 backbone loads cleanly where the Qwen3.5 tiers (0.8B / 2B) currently require llama.cpp.
Tier ladder: kubelm-qwen3.5-0.8b-v1
(ultra-edge) ยท this (edge) ยท
kubelm-qwen3.5-2b-v1
(edge+, the headline deployable). Each tier is judged within its own
resource bracket, not against the one above.
TL;DR
On the 35-scenario v0.3 evaluation library, served at temperature 0:
| metric | Qwen2.5-1.5B (base) | kubelm-qwen2.5-1.5b-v1 | qwen2.5-7b (ref) | kubelm-qwen3.5-2b-v1 (ref) |
|---|---|---|---|---|
conclusion_rubric_passed |
9 / 35 | 29 / 35 | 28 / 35 | 32 / 35 |
reference_calls_passed |
7 / 35 | 27 / 35 | 28 / 35 | 32 / 35 |
fabrications (grounding v2) |
65 | 21 | 8 | 3 |
schema_passed (tool-call) |
31 / 35 | 32 / 35 | 34 / 35 | 35 / 35 |
termination_label == complete |
10 / 35 | 33 / 35 | 33 / 35 | 35 / 35 |
narrative_inconsistencies |
0 | 0 | 0 | 0 |
Honest read. Fine-tuning transforms the base: rubric 9 โ 29, completion 10 โ 33, fabrications 65 โ 21. On reasoning it edges qwen2.5-7b (rubric 29 vs 28) and ties it on completion (33 vs 33) at roughly 1/5 the footprint. The weak spot is real and not hidden: fabrications (21) are higher than the 7B (8) and the 2B (3) โ the edge tier reaches the right conclusion reliably but is looser about asserting only tool-grounded facts than the larger tiers. Zero tool-name and zero argument hallucinations across all 35 trajectories.
If grounding strictness matters more than footprint, step up to the 2B.
Full rows:
eval/results/summaries/shape-d-2026-05-27.json.
Quickstart
Ollama (works for this tier)
Ollama's Qwen2.5 template parses OpenAI-shape tool_calls out of the box:
hf download rbentaarit/kubelm-qwen2.5-1.5b-v1 kubelm-edge.Q4_K_M.gguf --local-dir .
cat > Modelfile <<'EOF'
FROM ./kubelm-edge.Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER num_ctx 16384
EOF
ollama create kubelm-qwen2.5-1.5b -f Modelfile
llama.cpp (llama-server)
brew install llama.cpp # or build from https://github.com/ggml-org/llama.cpp
hf download rbentaarit/kubelm-qwen2.5-1.5b-v1 kubelm-edge.Q4_K_M.gguf --local-dir .
llama-server \
-m kubelm-edge.Q4_K_M.gguf \
--host 127.0.0.1 --port 8088 \
--jinja \
-c 16384 \
-ngl 99 # drop or set 0 on a CPU-only Linux box
Two notes that are load-bearing:
--jinjauses the model's embedded Qwen2.5 chat template, including its<tool_call>rendering. Without it, tool-use breaks.-c 16384matches the model'smax_seq_length. Long-trajectory investigations accumulate ~9โ11 K tokens of history; a smaller context errors with HTTP 400request exceeds the available context size.
Unlike the Qwen3.5 tiers, this model is not a thinking model โ there
is no enable_thinking / no-think serving step to worry about.
In production, drive this through the K8sGPT MCP server and the kubelm eval harness so the model calls real tools against a real cluster.
Intended use
- Tool-use specialist for K8sGPT MCP investigations on CPU-only hardware (M-series Macs, modest Linux boxes), where an Ollama-native GGUF is convenient.
- Local component of agentic K8s diagnosis pipelines where the destructive-action layer is handled by K8sGPT's operator + Mutation CR policy gates (the model proposes; the operator gates).
Out of scope
- Snapshot diagnosis from raw cluster YAML. Trained on multi-step tool-use trajectories, not Q&A pairs over frozen cluster state.
- Safety / refusal decisions on destructive operations. That layer is architectural in the K8sGPT ecosystem; the model is trained for reliability properties, not behavioral refusal.
- Direct
kubectlusage. The tools list is K8sGPT MCP-specific. - General K8s domain knowledge questions outside the K8sGPT MCP tool surface.
Training
- Base model: Qwen2.5-1.5B-Instruct.
- Dataset: the v0 cut of
rbentaarit/kubelm-seed-v0โ gpt-5.4-authored multi-step trajectories plus mechanical variants, filtered to records that are review-accepted and pass the conclusion-rubric and tool-call schema checks. - Method: QLoRA (nf4 + double-quant), rank 32 / alpha 64, target
modules
q_proj k_proj v_proj o_proj gate_proj up_proj down_proj. LoRA adapter included in this repo underadapter/. - Schedule: 2 epochs, batch 8 ร grad-accum 2 (eff. 16), lr 2e-4 cosine, warmup 3%, max_seq_length 16384, seed 42. Assistant-only loss.
- Full config:
training/configs/kubelm-edge-v0.yaml; recipe:training/sft.py.
Evaluation
Methodology and eval harness: github.com/rbentaarit/kubelm/eval. Each scenario boots a fresh kind cluster, seeds the failure mode, brings up a real K8sGPT MCP server against it, then runs the model through the trajectory loop and grades the result. Mocked MCP servers are not used at any stage.
Versioning
- K8sGPT version pin:
0.4.32. Tool surface and MCP error shapes change between K8sGPT releases; quality numbers above are not guaranteed against other versions. - MCP protocol version:
2025-03-26.
Known issues
- Fabrication rate (21) is the softest metric. This tier is looser about asserting only tool-grounded facts than the 2B (3) or qwen2.5-7b (8). If your application is sensitive to over-confident grounding, prefer the 2B.
- No native tool-call format other than OpenAI Chat Completions.
License
Apache 2.0. The base model is Qwen2.5-1.5B-Instruct (Apache 2.0). The training corpus is CC BY 4.0.
Citation
@misc{kubelm_qwen25_1_5b_v1,
title = {kubelm-qwen2.5-1.5b-v1},
author = {Ramzi Ben Taarit and contributors},
year = {2026},
url = {https://huggingface.co/rbentaarit/kubelm-qwen2.5-1.5b-v1},
note = {QLoRA on Qwen2.5-1.5B-Instruct; trained against K8sGPT v0.4.32 MCP trajectories}
}
Source code
All training, evaluation, and dataset-construction code: github.com/rbentaarit/kubelm.
- Downloads last month
- 146
4-bit