# Gemma4 26B MoE — Kimi K2 Reasoning LoRA 🧠
LoRA adapter fine-tuned from google/gemma-4-26B-A4B-it on Kimi K2 reasoning distill dataset — 7,836 high-quality reasoning examples, trained entirely by UKA (Hermes Agent) 🤖
## 📋 Summary
| Detail | Value |
|---|---|
| Base Model | google/gemma-4-26B-A4B-it (26B MoE, 128 experts, ~4B active/token) |
| Dataset | lordx64/reasoning-distill-kimi-k2-6-max-sft (7,836 examples) |
| Method | Custom NF4 per-expert quantization + LoRA |
| Pipeline | AndriejusNak/gemma4-26b-moe-finetune |
| GPU | NVIDIA RTX 5090 32GB (Vast.ai Cloud) |
| Training Time | 128 minutes (~2h 8m) |
| Best Loss | 1.0651 |
| NaN Explosions | 0 |
## 🖥️ Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 32GB GDDR7 |
| CPU | Intel Core i7-14700K (20 cores, 28 threads) |
| RAM | 94 GB DDR5 |
| Disk | 200 GB NVMe SSD |
| Cloud | Vast.ai |
| CUDA | 13.0 |
| PyTorch | 2.12.0.dev (nightly, cu128) |
**Why RTX 5090:** Gemma 4 26B MoE needs custom NF4 per-expert quantization; standard `bitsandbytes` cannot quantize `nn.Parameter` expert weights. The pipeline quantizes the experts itself, giving a VRAM peak of ~24 GB, which fits the RTX 5090's 32 GB but exceeds the RTX 3090's 24 GB (when using seq=1024 + MLP LoRA).
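A minimal sketch of the idea, assuming the `bitsandbytes` functional API (`quantize_4bit` / `dequantize_4bit`); the pipeline's actual implementation lives in `v6_26b_pipeline.py` and may differ:

```python
# Sketch: quantize raw expert tensors to NF4 by hand, since
# bnb.nn.Linear4bit only wraps nn.Linear modules, not bare nn.Parameter.
import torch
import bitsandbytes.functional as bnbF

def quantize_expert(weight: torch.Tensor):
    """NF4-quantize one expert weight tensor; keep the quant state for later."""
    q, state = bnbF.quantize_4bit(weight.cuda(), quant_type="nf4")
    return q, state

def dequantize_expert(q, state, dtype=torch.bfloat16):
    """Rebuild a BF16 view of the expert weights for the forward pass."""
    return bnbF.dequantize_4bit(q, state).to(dtype)
```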
## 🔧 Training Configuration

```python
# v6_26b_pipeline.py — Final Config
MODEL_NAME = "google/gemma-4-26B-A4B-it"
MAX_SEQ_LENGTH = 1024
LORA_R = 32
LORA_ALPHA = 32
INCLUDE_MLP_LORA = True   # Attention + MLP layers
SFT_EPOCHS = 2
SFT_BATCH_SIZE = 3        # Per GPU
SFT_GRAD_ACCUM = 8        # Effective batch = 24
SFT_LR = 2e-5             # Cosine schedule, warmup 245 steps
SFT_FILES = ["data/kimi_k2_sft.jsonl"]
```
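For reference, a hedged sketch of how the cosine schedule with warmup could be wired up using standard `transformers` utilities. Note that 245 warmup steps is ~5% of the 4,906 forward passes, so this sketch assumes the scheduler steps once per forward pass rather than per optimizer step:

```python
# Hypothetical optimizer/scheduler setup matching SFT_LR above;
# the actual pipeline may construct these differently.
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=245,     # ~5% of 4,906 forward passes
    num_training_steps=4906,  # stepped once per forward pass (assumption)
)
```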
### LoRA Details

- Rank (r): 32, Alpha: 32
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj` (attention) + `gate_proj`, `up_proj`, `down_proj` (MLP)
- Trainable params: 59,275,776 / 3,027,224,428 (1.96%)
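A sketch of the equivalent `peft` LoraConfig, mirroring the values listed above (the pipeline builds its own config, which may differ in details such as dropout):

```python
from peft import LoraConfig

# Illustrative only; the pipeline constructs its own config.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```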
### Training Stats

- Examples: 7,836 → 7,358 after filtering out 478 all-masked examples
- Forward passes: 4,906
- Optimizer steps: 613
- VRAM peak: 23.9 GB
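The step counts follow directly from the config; a quick sanity check (assumed arithmetic, matching the numbers above):

```python
examples   = 7358   # after filtering
epochs     = 2
batch_size = 3      # per GPU
grad_accum = 8

forward_passes  = -(-examples * epochs // batch_size)  # ceil division → 4906
optimizer_steps = forward_passes // grad_accum         # → 613
effective_batch = batch_size * grad_accum              # → 24
```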
### Loss Progression

```
Step 50:  Loss 3.0597  (epoch 1)
Step 100: Loss 1.3277
Step 150: Loss 1.1658
Step 200: Loss 1.0906
Step 250: Loss 1.1220
Step 300: Loss 1.0723
→ Epoch 1 avg: 1.4648
Step 350: Loss 1.0660  (epoch 2)
Step 400: Loss 1.0616
Step 450: Loss 1.0722
Step 500: Loss 1.0586
Step 550: Loss 1.0370
Step 600: Loss 1.0983
→ Epoch 2 avg: 1.0651 🎯 Best!
```
## 🚀 Usage

### Install Dependencies

```bash
pip install transformers peft torch
```
### Load Base Model + LoRA

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (BF16, needs ~52 GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26B-A4B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load this LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "hotdogs/gemma4-26b-kimi-k2-reasoning-lora",
)

# Optional: merge for faster inference
model = model.merge_and_unload()
```
### Chat / Inference

```python
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26B-A4B-it")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Solve: 3x + 7 = 22"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🧪 How This Was Trained
This adapter was trained autonomously by UKA, an AI Agent running Hermes Agent, following this workflow:
### 1. Dataset Conversion

The Kimi K2 reasoning distill dataset comes as Parquet with a single `text` column in Kimi chat format (`<|im_start|>role\n...<|im_end|>`).
```python
# convert_kimi.py — Parquet → JSONL messages format
import requests, pyarrow.parquet as pq, io, json, re

url = "https://huggingface.co/datasets/lordx64/reasoning-distill-kimi-k2-6-max-sft/resolve/main/data/train-00000-of-00001.parquet"
r = requests.get(url)
table = pq.read_table(io.BytesIO(r.content))
texts = table.column('text').to_pylist()

pattern = r'<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>'
with open("data/kimi_k2_sft.jsonl", "w") as f:
    for text in texts:
        matches = re.findall(pattern, text, re.DOTALL)
        messages = [{"role": role.strip(), "content": content.strip()}
                    for role, content in matches]
        f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
```
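A quick sanity check on the converted file (a hypothetical helper, not part of the original pipeline):

```python
import json

# Peek at the first converted record and confirm the role sequence.
with open("data/kimi_k2_sft.jsonl") as f:
    first = json.loads(f.readline())

print([m["role"] for m in first["messages"]])  # e.g. ['system', 'user', 'assistant']
```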
### 2. Pipeline Setup

```bash
git clone https://github.com/AndriejusNak/gemma4-26b-moe-finetune.git
cd gemma4-26b-moe-finetune
pip install transformers peft bitsandbytes accelerate safetensors pyarrow requests

# Edit v6_26b_pipeline.py:
#   SFT_FILES = ["data/kimi_k2_sft.jsonl"]
#   MAX_SEQ_LENGTH = 1024
#   LORA_R = 32, LORA_ALPHA = 32
#   INCLUDE_MLP_LORA = True
#   SFT_EPOCHS = 2, SFT_BATCH_SIZE = 3
```
### 3. Download Base Model + Train

```bash
python3 v6_26b_pipeline.py --phase 0                             # Download model (~7 min)
python3 -u v6_26b_pipeline.py --phase 1 2>&1 | tee /tmp/sft.log  # Train (~2 hrs)
```
### Hardware Notes

- **Why the RTX 5090 is needed:** Gemma 4 26B MoE requires custom NF4 quantization. Standard `bitsandbytes` can't quantize `nn.Parameter` expert weights, so the pipeline quantizes the experts manually, peaking at ~24 GB VRAM. That fits on an RTX 5090 32GB but NOT on an RTX 3090 24GB (which would need seq=512 and no MLP LoRA).
- **Why PyTorch nightly:** the RTX 5090 is Blackwell (`sm_120`). PyTorch stable only supports up to `sm_90`, so the nightly `cu128` build is required.
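To confirm a given PyTorch build can target the card, a check along these lines (standard `torch` calls; the expected values are assumptions based on the setup above):

```python
import torch

print(torch.version.cuda)                  # expect '12.8' for the cu128 nightly
print(torch.cuda.get_device_capability())  # expect (12, 0) on an RTX 5090
print(torch.cuda.get_arch_list())          # should include 'sm_120'
```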
## 📦 Files in This Repo

- `adapter_model.safetensors` — LoRA weights (227 MB)
- `adapter_config.json` — LoRA config: r=32, alpha=32, attention+MLP
- `tokenizer.json` — Gemma 4 tokenizer (31 MB)
- `tokenizer_config.json` — Tokenizer config
- `chat_template.jinja` — Chat template
## ⚠️ Limitations

- 32% of training examples were truncated at seq=1024 (mean example length: 941 tokens)
- LoRA adapter only — not a full fine-tune
- Trained on Kimi K2 reasoning style — may differ from Gemma's native output style
- The BF16 base model requires ~52 GB of VRAM (see the estimate below)
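That last figure follows from a back-of-envelope calculation (weights only; KV cache and activations add more):

```python
params_total = 26e9     # 26B parameters (total MoE, not just the ~4B active)
bytes_per_param = 2     # BF16 = 2 bytes/param
print(f"{params_total * bytes_per_param / 1e9:.0f} GB")  # → 52 GB for weights alone
```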
## 🙏 Credits
- Base Model: Google Gemma 4 26B
- Dataset: Kimi K2 Reasoning Distill by lordx64
- Pipeline: AndriejusNak/gemma4-26b-moe-finetune
- Trainer: UKA — AI Agent (Hermes Agent)