Instructions to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="rbentaarit/kubelm-qwen2.5-1.5b-v1",
	filename="kubelm-edge.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Use Docker

docker model run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

LM Studio
Jan
Ollama
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Ollama:
```
ollama run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
```

Unsloth Studio

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for rbentaarit/kubelm-qwen2.5-1.5b-v1 to start chatting

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Docker Model Runner:
```
docker model run hf.co/rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M
```

Lemonade

How to use rbentaarit/kubelm-qwen2.5-1.5b-v1 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull rbentaarit/kubelm-qwen2.5-1.5b-v1:Q4_K_M

Run and chat with the model

lemonade run user.kubelm-qwen2.5-1.5b-v1-Q4_K_M

List all available models

lemonade list

kubelm-qwen2.5-1.5b-v1 — Q4_K_M GGUF

The edge rung of the kubelm tier ladder: a 1.5B-parameter K8sGPT MCP tool-use specialist, fine-tuned with QLoRA on Qwen2.5-1.5B-Instruct and quantized to Q4_K_M (~986 MB on disk, ~1.1 GB serving RAM) for CPU-only deployment. The first kubelm release, and the only tier that also runs under Ollama — its Qwen2.5 backbone loads cleanly where the Qwen3.5 tiers (0.8B / 2B) currently require llama.cpp.

Tier ladder: kubelm-qwen3.5-0.8b-v1 (ultra-edge) · this (edge) · kubelm-qwen3.5-2b-v1 (edge+, the headline deployable). Each tier is judged within its own resource bracket, not against the one above.

TL;DR

On the 35-scenario v0.3 evaluation library, served at temperature 0:

metric	Qwen2.5-1.5B (base)	kubelm-qwen2.5-1.5b-v1	qwen2.5-7b (ref)	kubelm-qwen3.5-2b-v1 (ref)
`conclusion_rubric_passed`	9 / 35	29 / 35	28 / 35	32 / 35
`reference_calls_passed`	7 / 35	27 / 35	28 / 35	32 / 35
`fabrications` (grounding v2)	65	21	8	3
`schema_passed` (tool-call)	31 / 35	32 / 35	34 / 35	35 / 35
`termination_label == complete`	10 / 35	33 / 35	33 / 35	35 / 35
`narrative_inconsistencies`	0	0	0	0

Honest read. Fine-tuning transforms the base: rubric 9 → 29, completion 10 → 33, fabrications 65 → 21. On reasoning it edges qwen2.5-7b (rubric 29 vs 28) and ties it on completion (33 vs 33) at roughly 1/5 the footprint. The weak spot is real and not hidden: fabrications (21) are higher than the 7B (8) and the 2B (3) — the edge tier reaches the right conclusion reliably but is looser about asserting only tool-grounded facts than the larger tiers. Zero tool-name and zero argument hallucinations across all 35 trajectories.

If grounding strictness matters more than footprint, step up to the 2B. Full rows: eval/results/summaries/shape-d-2026-05-27.json.

Quickstart

Ollama (works for this tier)

Ollama's Qwen2.5 template parses OpenAI-shape tool_calls out of the box:

hf download rbentaarit/kubelm-qwen2.5-1.5b-v1 kubelm-edge.Q4_K_M.gguf --local-dir .

cat > Modelfile <<'EOF'
FROM ./kubelm-edge.Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER num_ctx 16384
EOF
ollama create kubelm-qwen2.5-1.5b -f Modelfile

llama.cpp (`llama-server`)

brew install llama.cpp   # or build from https://github.com/ggml-org/llama.cpp
hf download rbentaarit/kubelm-qwen2.5-1.5b-v1 kubelm-edge.Q4_K_M.gguf --local-dir .

llama-server \
    -m kubelm-edge.Q4_K_M.gguf \
    --host 127.0.0.1 --port 8088 \
    --jinja \
    -c 16384 \
    -ngl 99   # drop or set 0 on a CPU-only Linux box

Two notes that are load-bearing:

--jinja uses the model's embedded Qwen2.5 chat template, including its <tool_call> rendering. Without it, tool-use breaks.
-c 16384 matches the model's max_seq_length. Long-trajectory investigations accumulate ~9–11 K tokens of history; a smaller context errors with HTTP 400 request exceeds the available context size.

Unlike the Qwen3.5 tiers, this model is not a thinking model — there is no enable_thinking / no-think serving step to worry about.

In production, drive this through the K8sGPT MCP server and the kubelm eval harness so the model calls real tools against a real cluster.

Intended use

Tool-use specialist for K8sGPT MCP investigations on CPU-only hardware (M-series Macs, modest Linux boxes), where an Ollama-native GGUF is convenient.
Local component of agentic K8s diagnosis pipelines where the destructive-action layer is handled by K8sGPT's operator + Mutation CR policy gates (the model proposes; the operator gates).

Out of scope

Snapshot diagnosis from raw cluster YAML. Trained on multi-step tool-use trajectories, not Q&A pairs over frozen cluster state.
Safety / refusal decisions on destructive operations. That layer is architectural in the K8sGPT ecosystem; the model is trained for reliability properties, not behavioral refusal.
Direct kubectl usage. The tools list is K8sGPT MCP-specific.
General K8s domain knowledge questions outside the K8sGPT MCP tool surface.

Training

Base model: Qwen2.5-1.5B-Instruct.
Dataset: the v0 cut of rbentaarit/kubelm-seed-v0 — gpt-5.4-authored multi-step trajectories plus mechanical variants, filtered to records that are review-accepted and pass the conclusion-rubric and tool-call schema checks.
Method: QLoRA (nf4 + double-quant), rank 32 / alpha 64, target modules q_proj k_proj v_proj o_proj gate_proj up_proj down_proj. LoRA adapter included in this repo under adapter/.
Schedule: 2 epochs, batch 8 × grad-accum 2 (eff. 16), lr 2e-4 cosine, warmup 3%, max_seq_length 16384, seed 42. Assistant-only loss.
Full config: training/configs/kubelm-edge-v0.yaml; recipe: training/sft.py.

Evaluation

Methodology and eval harness: github.com/rbentaarit/kubelm/eval. Each scenario boots a fresh kind cluster, seeds the failure mode, brings up a real K8sGPT MCP server against it, then runs the model through the trajectory loop and grades the result. Mocked MCP servers are not used at any stage.

Versioning

K8sGPT version pin: 0.4.32. Tool surface and MCP error shapes change between K8sGPT releases; quality numbers above are not guaranteed against other versions.
MCP protocol version: 2025-03-26.

Known issues

Fabrication rate (21) is the softest metric. This tier is looser about asserting only tool-grounded facts than the 2B (3) or qwen2.5-7b (8). If your application is sensitive to over-confident grounding, prefer the 2B.
No native tool-call format other than OpenAI Chat Completions.

License

Apache 2.0. The base model is Qwen2.5-1.5B-Instruct (Apache 2.0). The training corpus is CC BY 4.0.

Citation

@misc{kubelm_qwen25_1_5b_v1,
  title  = {kubelm-qwen2.5-1.5b-v1},
  author = {Ramzi Ben Taarit and contributors},
  year   = {2026},
  url    = {https://huggingface.co/rbentaarit/kubelm-qwen2.5-1.5b-v1},
  note   = {QLoRA on Qwen2.5-1.5B-Instruct; trained against K8sGPT v0.4.32 MCP trajectories}
}

Source code

All training, evaluation, and dataset-construction code: github.com/rbentaarit/kubelm.

Downloads last month: 146

GGUF

Model size

2B params

Architecture

qwen2

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for rbentaarit/kubelm-qwen2.5-1.5b-v1

Base model

Qwen/Qwen2.5-1.5B

Finetuned

Qwen/Qwen2.5-1.5B-Instruct

Quantized

(199)

this model