# LogGenix MoE 1.4B Pretrained
A 1.4-billion-parameter Mixture-of-Experts (MoE) language model based on the Qwen3-MoE architecture, pretrained from scratch on ~328M tokens from NVIDIA Nemotron datasets.
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3MoeForCausalLM |
| Total Parameters | 1,394,380,800 (~1.4B) |
| Active Parameters | ~1.2B per forward pass (top-8 of 16 experts) |
| Experts | 16 total, top-8 routing |
| Hidden Size | 2048 |
| Layers | 8 |
| Attention Heads | 32 |
| KV Heads (GQA) | 8 |
| Head Dimension | 128 |
| MoE Intermediate Size | 768 per expert |
| Vocabulary | 151,936 (Qwen3 tokenizer) |
| Max Position Embeddings | 262,144 (256K) |
| Training Context Length | 4,096 |
| RoPE Theta | 10,000,000 |
| Precision | BFloat16 |
## Training Details
| Property | Value |
|---|---|
| Training Steps | 10,000 |
| Tokens Seen | ~328M |
| Optimizer | Muon (Newton-Schulz orthogonalization) |
| Learning Rate | 3e-4 (cosine decay) |
| Batch Size | 1 per device, 8 micro-batches |
| Hardware | 2x NVIDIA H100 80GB HBM3 |
| Parallelism | Pipeline Parallel (PP=2) |
| Schedule | 1F1B |
| Training Time | ~3 hours |
| Final Val Loss | 0.9068 |
| Final Val Perplexity | 2.48 |
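Muon replaces the elementwise Adam update with an approximately orthogonalized momentum matrix. As an illustration only, the snippet below uses the classic cubic Newton-Schulz iteration, which drives a matrix toward the nearest orthogonal one; Muon itself uses a tuned quintic polynomial to get there in fewer steps, so treat this as a sketch of the idea, not the exact optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Push G toward an (approximately) orthogonal matrix via the classic
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X.
    Converges when the starting singular values lie in (0, sqrt(3))."""
    X = G / np.linalg.norm(G)   # Frobenius normalization bounds sigma_max <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

G = np.array([[2.0, 0.0], [1.0, 1.0]])  # toy "gradient" matrix
O = newton_schulz_orthogonalize(G)
print(np.round(O @ O.T, 3))             # ~ identity: O is near-orthogonal
```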
### Training Loss Progression
| Step | Val Loss | Val PPL |
|---|---|---|
| 2,000 | 0.9587 | 2.61 |
| 4,000 | 0.9313 | 2.54 |
| 6,000 | 0.9169 | 2.50 |
| 8,000 | 0.9088 | 2.48 |
| 9,000 | 0.9069 | 2.48 |
| 10,000 | 0.9068 | 2.48 |
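The perplexity column is simply exp(loss); a quick consistency check of the table values (rounded to two decimals):

```python
import math

# (val_loss, reported_ppl) pairs from the table above
rows = [(0.9587, 2.61), (0.9313, 2.54), (0.9169, 2.50),
        (0.9088, 2.48), (0.9069, 2.48), (0.9068, 2.48)]
for loss, ppl in rows:
    # perplexity is e^loss, rounded to two decimals in the table
    assert abs(math.exp(loss) - ppl) < 0.005, (loss, ppl)
print("all rows consistent")
```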
## Dataset
Pretrained on a balanced mix from NVIDIA Nemotron datasets:
- General text (Nemotron-CC)
- Mathematics (Nemotron-Math)
- Code (Nemotron-Code)
- Scientific content
- Reasoning tasks
## Architecture
This model uses the Qwen3 MoE architecture with:
- Sparse MoE FFN: 16 experts per layer with top-8 routing, so 8 of 16 experts process each token
- Grouped Query Attention (GQA): 32 query heads, 8 KV heads
- QK Normalization: RMSNorm on Q and K projections
- RoPE: Rotary Position Embeddings (base 10,000,000)
- RMSNorm: Pre-normalization on attention and FFN blocks
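To make the routing concrete, here is a dependency-free sketch of top-k expert selection. Softmax restricted to the top-8 router logits is mathematically the same as softmaxing all 16 and renormalizing the selected weights (which is my understanding of the Qwen3-MoE router with top-k probability normalization enabled); treat the details as an approximation of the real implementation.

```python
import math

def route_token(router_logits, k=8):
    """Pick the top-k experts for one token and return (index, weight) pairs.
    Softmax over only the top-k logits equals the full softmax renormalized
    over the selected experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

logits = [0.1 * i for i in range(16)]               # toy router logits
picks = route_token(logits)
print(picks[0])                                     # expert 15 gets the largest weight
assert abs(sum(w for _, w in picks) - 1.0) < 1e-9   # mixture weights sum to 1
```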
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    trust_remote_code=True,
)

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Checkpoints
Intermediate checkpoints are available under `checkpoints/`:

- `checkpoints/step-2000/`
- `checkpoints/step-4000/`
- `checkpoints/step-6000/`
- `checkpoints/step-8000/`
- `checkpoints/step-10000/`

Each checkpoint includes `model.safetensors`, `config.json`, tokenizer files, and eval inference results under `eval/step-{N}/`.
## Evaluation
### Standard Benchmarks (lm-eval harness, 50 samples each)
| Benchmark | Score |
|---|---|
| PIQA | 58.0% |
| HellaSwag | 24.0% |
| MMLU | 23.4% |
| ARC-Easy | 22.0% |
| ARC-Challenge | 22.0% |
| GSM8K | 0.0% |
### Synthetic Task Evaluation
| Category | Score |
|---|---|
| Mean Synthetic Score | 21.9% |
| Root Cause Analysis | 65.0% |
| Compiler Design Optimization | 60.0% |
| Ethical Decision Making | 60.0% |
| Log Error Pattern Detection | 55.0% |
| Creative Writing | 55.0% |
### Code & Tool Evaluation
| Metric | Score |
|---|---|
| Code Syntax Accuracy | 8.3% |
| Code Keyword Coverage | 5.4% |
| Tool-Calling Format | 0.0% |
| Tool-Calling Overall | 0.0% |
Note: These scores are expected for a pretrained-only model with ~328M tokens seen. The model has not been instruction-tuned. PIQA (58%) shows the strongest signal, indicating basic physical commonsense reasoning is emerging. Tool-calling and structured output capabilities require SFT.
### Evaluation Charts
Full evaluation results are available in `eval_outputs/` and on the model page.
## Limitations
- Pretrained only (no instruction tuning); outputs may be repetitive or incoherent
- Trained on ~328M tokens (well below Chinchilla-optimal ~28B for 1.4B params)
- Best suited as a base model for fine-tuning
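The Chinchilla figure above follows from the common ~20 tokens-per-parameter rule of thumb:

```python
params = 1.394e9    # total parameters
optimal = 20 * params   # ~20 tokens per parameter (Chinchilla rule of thumb)
seen = 328e6            # tokens actually trained on
print(f"optimal ~ {optimal/1e9:.0f}B tokens; trained on {100*seen/optimal:.1f}% of that")
```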
## License
Apache 2.0
## Citation
```bibtex
@misc{loggenix-moe-1b-2026,
  title={LogGenix MoE 1.4B: A Mixture of Experts Language Model},
  author={Kshitij Thakkar},
  year={2026},
  url={https://huggingface.co/kshitijthakkar/loggenix-moe-1b-pretrain}
}
```