# LogGenix MoE 1.4B Pretrained
A 1.4-billion-parameter Mixture-of-Experts (MoE) language model based on the Qwen3-MoE architecture, pretrained from scratch on ~328M tokens from NVIDIA Nemotron datasets.
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3MoeForCausalLM |
| Total Parameters | 1,394,380,800 (~1.4B) |
| Active Parameters | ~1.2B per forward pass (top-8 of 16 experts) |
| Experts | 16 total, top-8 routing |
| Hidden Size | 2048 |
| Layers | 8 |
| Attention Heads | 32 |
| KV Heads (GQA) | 8 |
| Head Dimension | 128 |
| MoE Intermediate Size | 768 per expert |
| Vocabulary | 151,936 (Qwen3 tokenizer) |
| Max Position Embeddings | 262,144 (256K) |
| Training Context Length | 4,096 |
| RoPE Theta | 10,000,000 |
| Precision | BFloat16 |
## Training Details
| Property | Value |
|---|---|
| Training Steps | 10,000 |
| Tokens Seen | ~328M |
| Optimizer | Muon (Newton-Schulz orthogonalization) |
| Learning Rate | 3e-4 (cosine decay) |
| Batch Size | 1 per device, 8 micro-batches |
| Hardware | 2x NVIDIA H100 80GB HBM3 |
| Parallelism | Pipeline Parallel (PP=2) |
| Schedule | 1F1B |
| Training Time | ~3 hours |
| Final Val Loss | 0.9068 |
| Final Val Perplexity | 2.48 |
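Muon replaces the elementwise Adam update with an approximately orthogonalized momentum matrix. As an illustration only, the snippet below uses the classic cubic Newton-Schulz iteration, which drives a matrix toward the nearest orthogonal one; Muon itself uses a tuned quintic polynomial to get there in fewer steps, so treat this as a sketch of the idea, not the exact optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Push G toward an (approximately) orthogonal matrix via the classic
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X.
    Converges when the starting singular values lie in (0, sqrt(3))."""
    X = G / np.linalg.norm(G)   # Frobenius normalization bounds sigma_max <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

G = np.array([[2.0, 0.0], [1.0, 1.0]])  # toy "gradient" matrix
O = newton_schulz_orthogonalize(G)
print(np.round(O @ O.T, 3))             # ~ identity: O is near-orthogonal
```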
### Training Loss Progression
| Step | Val Loss | Val PPL |
|---|---|---|
| 2,000 | 0.9587 | 2.61 |
| 4,000 | 0.9313 | 2.54 |
| 6,000 | 0.9169 | 2.50 |
| 8,000 | 0.9088 | 2.48 |
| 9,000 | 0.9069 | 2.48 |
| 10,000 | 0.9068 | 2.48 |
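The perplexity column is simply exp(loss); a quick consistency check of the table values (rounded to two decimals):

```python
import math

# (val_loss, reported_ppl) pairs from the table above
rows = [(0.9587, 2.61), (0.9313, 2.54), (0.9169, 2.50),
        (0.9088, 2.48), (0.9069, 2.48), (0.9068, 2.48)]
for loss, ppl in rows:
    # perplexity is e^loss, rounded to two decimals in the table
    assert abs(math.exp(loss) - ppl) < 0.005, (loss, ppl)
print("all rows consistent")
```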
## Dataset
Pretrained on a balanced mix from NVIDIA Nemotron datasets:
- General text (Nemotron-CC)
- Mathematics (Nemotron-Math)
- Code (Nemotron-Code)
- Scientific content
- Reasoning tasks
## Architecture
This model uses the Qwen3 MoE architecture with:
- Sparse MoE FFN: 16 experts per layer with top-8 routing, so 8 of 16 experts process each token
- Grouped Query Attention (GQA): 32 query heads, 8 KV heads
- QK Normalization: RMSNorm on Q and K projections
- RoPE: Rotary Position Embeddings (base 10,000,000)
- RMSNorm: Pre-normalization on attention and FFN blocks
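To make the routing concrete, here is a dependency-free sketch of top-k expert selection. Softmax restricted to the top-8 router logits is mathematically the same as softmaxing all 16 and renormalizing the selected weights (which is my understanding of the Qwen3-MoE router with top-k probability normalization enabled); treat the details as an approximation of the real implementation.

```python
import math

def route_token(router_logits, k=8):
    """Pick the top-k experts for one token and return (index, weight) pairs.
    Softmax over only the top-k logits equals the full softmax renormalized
    over the selected experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

logits = [0.1 * i for i in range(16)]               # toy router logits
picks = route_token(logits)
print(picks[0])                                     # expert 15 gets the largest weight
assert abs(sum(w for _, w in picks) - 1.0) < 1e-9   # mixture weights sum to 1
```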
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-1b-pretrain",
    trust_remote_code=True,
)

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Checkpoints
Intermediate checkpoints are available under `checkpoints/`:

- `checkpoints/step-2000/`
- `checkpoints/step-4000/`
- `checkpoints/step-6000/`
- `checkpoints/step-8000/`
- `checkpoints/step-10000/`

Each checkpoint includes `model.safetensors`, `config.json`, tokenizer files, and eval inference results under `eval/step-{N}/`.
## Evaluation
### Standard Benchmarks (lm-eval harness, 50 samples each)
| Benchmark | Score |
|---|---|
| PIQA | 58.0% |
| HellaSwag | 24.0% |
| MMLU | 23.4% |
| ARC-Easy | 22.0% |
| ARC-Challenge | 22.0% |
| GSM8K | 0.0% |
### Synthetic Task Evaluation
| Category | Score |
|---|---|
| Mean Synthetic Score | 21.9% |
| Root Cause Analysis | 65.0% |
| Compiler Design Optimization | 60.0% |
| Ethical Decision Making | 60.0% |
| Log Error Pattern Detection | 55.0% |
| Creative Writing | 55.0% |
### Code & Tool Evaluation
| Metric | Score |
|---|---|
| Code Syntax Accuracy | 8.3% |
| Code Keyword Coverage | 5.4% |
| Tool-Calling Format | 0.0% |
| Tool-Calling Overall | 0.0% |
Note: These scores are expected for a pretrained-only model with ~328M tokens seen. The model has not been instruction-tuned. PIQA (58%) shows the strongest signal, indicating basic physical commonsense reasoning is emerging. Tool-calling and structured output capabilities require SFT.
### Evaluation Charts
Full evaluation results are available in `eval_outputs/` and on the model page.
## Limitations
- Pretrained only (no instruction tuning); outputs may be repetitive or incoherent
- Trained on ~328M tokens (well below Chinchilla-optimal ~28B for 1.4B params)
- Best suited as a base model for fine-tuning
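The Chinchilla figure above follows from the common ~20 tokens-per-parameter rule of thumb:

```python
params = 1.394e9    # total parameters
optimal = 20 * params   # ~20 tokens per parameter (Chinchilla rule of thumb)
seen = 328e6            # tokens actually trained on
print(f"optimal ~ {optimal/1e9:.0f}B tokens; trained on {100*seen/optimal:.1f}% of that")
```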
## License
Apache 2.0
## Citation
```bibtex
@misc{loggenix-moe-1b-2026,
  title={LogGenix MoE 1.4B: A Mixture of Experts Language Model},
  author={Kshitij Thakkar},
  year={2026},
  url={https://huggingface.co/kshitijthakkar/loggenix-moe-1b-pretrain}
}
```