# LogGenix MoE 1.4B Pretrained

A 1.4-billion-parameter Mixture-of-Experts (MoE) language model based on the Qwen3 MoE architecture, pretrained from scratch on ~328M tokens from NVIDIA Nemotron datasets.

## Model Details

| Property | Value |
|---|---|
| Architecture | `Qwen3MoeForCausalLM` |
| Total Parameters | 1,394,380,800 (~1.4B) |
| Active Parameters | ~1.2B per forward pass (top-8 of 16 experts) |
| Experts | 16 total, top-8 routing |
| Hidden Size | 2048 |
| Layers | 8 |
| Attention Heads | 32 |
| KV Heads (GQA) | 8 |
| Head Dimension | 128 |
| MoE Intermediate Size | 768 per expert |
| Vocabulary | 151,936 (Qwen3 tokenizer) |
| Max Position Embeddings | 262,144 (256K) |
| Training Context Length | 4,096 |
| RoPE Theta | 10,000,000 |
| Precision | BFloat16 |
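The listed total can be roughly reproduced from the config values above. The sketch below is a back-of-the-envelope check, assuming untied input/output embeddings, Qwen3-style projection shapes (Q/O project to `heads * head_dim`), and per-layer RMSNorm weights including head-dim-sized QK-norms; these accounting assumptions are ours, not stated on the model page.

```python
# Approximate parameter count from the config values in the table above.
vocab, hidden, layers = 151_936, 2048, 8
heads, kv_heads, head_dim = 32, 8, 128
experts, moe_inter = 16, 768

embed = 2 * vocab * hidden                            # input + output embeddings (untied)
attn = layers * (2 * hidden * heads * head_dim        # q and o projections
                 + 2 * hidden * kv_heads * head_dim)  # k and v projections
ffn = layers * experts * 3 * hidden * moe_inter       # gate/up/down per expert
router = layers * hidden * experts                    # routing linear per layer
norms = layers * (2 * hidden + 2 * head_dim) + hidden # RMSNorms incl. QK-norm, final norm

total = embed + attn + ffn + router + norms
print(f"{total:,}")  # 1,394,380,800
```

Under these assumptions the count lands exactly on the 1,394,380,800 figure in the table, with the expert FFNs accounting for roughly 600M of the parameters.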

## Training Details

| Property | Value |
|---|---|
| Training Steps | 10,000 |
| Tokens Seen | ~328M |
| Optimizer | Muon (Newton-Schulz orthogonalization) |
| Learning Rate | 3e-4 (cosine decay) |
| Batch Size | 1 per device, 8 micro-batches |
| Hardware | 2x NVIDIA H100 80GB HBM3 |
| Parallelism | Pipeline Parallel (PP=2) |
| Schedule | 1F1B |
| Training Time | ~3 hours |
| Final Val Loss | 0.9068 |
| Final Val Perplexity | 2.48 |

### Training Loss Progression

| Step | Val Loss | Val PPL |
|---|---|---|
| 2,000 | 0.9587 | 2.61 |
| 4,000 | 0.9313 | 2.54 |
| 6,000 | 0.9169 | 2.50 |
| 8,000 | 0.9088 | 2.48 |
| 9,000 | 0.9069 | 2.48 |
| 10,000 | 0.9068 | 2.48 |
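The perplexity column is simply `exp(val loss)` rounded to two decimals, which is a quick sanity check on the table:

```python
import math

# Val PPL = exp(val loss); values taken from the table above
eval_points = {2_000: 0.9587, 4_000: 0.9313, 6_000: 0.9169,
               8_000: 0.9088, 9_000: 0.9069, 10_000: 0.9068}
for step, loss in eval_points.items():
    print(f"step {step:>6}: ppl = {math.exp(loss):.2f}")
```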

## Dataset

Pretrained on a balanced mix from NVIDIA Nemotron datasets:

- General text (Nemotron-CC)
- Mathematics (Nemotron-Math)
- Code (Nemotron-Code)
- Scientific content
- Reasoning tasks

## Architecture

This model uses the Qwen3 MoE architecture with:

- **Sparse MoE FFN:** 16 experts with top-8 routing per token
- **Grouped Query Attention (GQA):** 32 query heads, 8 KV heads
- **QK Normalization:** RMSNorm applied to the Q and K projections
- **RoPE:** Rotary Position Embeddings (base 10,000,000)
- **RMSNorm:** pre-normalization on attention and FFN blocks
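The routing step of the sparse MoE FFN can be sketched as softmax-then-top-k selection over the router logits. This is a minimal, framework-free illustration, not the actual layer implementation; whether the top-k weights are renormalized (as done here) is configurable in Qwen3-MoE-style models.

```python
import math

def route_top_k(logits, k=8):
    """Softmax over one token's router logits, keep the top-k experts,
    and renormalize their weights (simplified MoE routing sketch)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}  # expert index -> routing weight

# 16 router logits for one token -> 8 active experts
weights = route_top_k([0.3, -1.2, 0.8, 0.1, 2.0, -0.5, 1.1, 0.0,
                       -0.7, 0.6, 1.5, -2.0, 0.4, 0.9, -0.1, 0.2], k=8)
print(sorted(weights))                  # indices of the 8 selected experts
print(round(sum(weights.values()), 6))  # weights sum to ~1 after renormalization
```

The selected experts' outputs are then combined as a weighted sum; the 8 unselected experts in each layer are skipped entirely, which is why the active parameter count is well below the total.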

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    trust_remote_code=True,
)

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Checkpoints

Intermediate checkpoints are available under `checkpoints/`:

- `checkpoints/step-2000/`
- `checkpoints/step-4000/`
- `checkpoints/step-6000/`
- `checkpoints/step-8000/`
- `checkpoints/step-10000/`

Each checkpoint includes `model.safetensors`, `config.json`, tokenizer files, and eval inference results under `eval/step-{N}/`.
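Since each checkpoint folder contains full HF-format weights, an intermediate checkpoint should be loadable by pointing `from_pretrained` at its subfolder. This is an untested sketch: `subfolder` is a standard `from_pretrained` argument, and step 8000 here is just an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the step-8000 checkpoint instead of the final weights
model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    subfolder="checkpoints/step-8000",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    subfolder="checkpoints/step-8000",
    trust_remote_code=True,
)
```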

## Evaluation

### Standard Benchmarks (lm-eval harness, 50 samples each)

| Benchmark | Score |
|---|---|
| PIQA | 58.0% |
| HellaSwag | 24.0% |
| MMLU | 23.4% |
| ARC-Easy | 22.0% |
| ARC-Challenge | 22.0% |
| GSM8K | 0.0% |

### Synthetic Task Evaluation

| Category | Score |
|---|---|
| Mean Synthetic Score | 21.9% |
| Root Cause Analysis | 65.0% |
| Compiler Design Optimization | 60.0% |
| Ethical Decision Making | 60.0% |
| Log Error Pattern Detection | 55.0% |
| Creative Writing | 55.0% |

### Code & Tool Evaluation

| Metric | Score |
|---|---|
| Code Syntax Accuracy | 8.3% |
| Code Keyword Coverage | 5.4% |
| Tool-Calling Format | 0.0% |
| Tool-Calling Overall | 0.0% |

**Note:** These scores are expected for a pretrained-only model with ~328M tokens seen. The model has not been instruction-tuned. PIQA (58%) shows the strongest signal, indicating basic physical commonsense reasoning is emerging. Tool-calling and structured output capabilities require SFT.

## Evaluation Charts

Charts ("Comprehensive Evaluation" and "Category Detailed Breakdown") are shown on the model page.

Full evaluation results are available in `eval_outputs/` and on the model page.

## Limitations

- Pretrained only (no instruction tuning): outputs may be repetitive or incoherent
- Trained on ~328M tokens, well below the Chinchilla-optimal ~28B for 1.4B parameters
- Best suited as a base model for fine-tuning

## License

Apache 2.0

## Citation

```bibtex
@misc{loggenix-moe-1b-2026,
  title={LogGenix MoE 1.4B: A Mixture of Experts Language Model},
  author={Kshitij Thakkar},
  year={2026},
  url={https://huggingface.co/kshitijthakkar/loggenix-moe-1b-pretrain}
}
```