John LLM
Setup (15 min)
pip install -r requirements.txt
Place your text corpus at data/raw/english.md.
- Minimum recommended size: 1MB of plain text for meaningful training
- Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
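Before running anything, it can help to verify the corpus meets the 1 MB guideline above. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
import os

def check_corpus(path="data/raw/english.md", min_bytes=1_000_000):
    """Warn if the training corpus is missing or below the 1 MB guideline."""
    if not os.path.exists(path):
        return f"missing: place your corpus at {path}"
    size = os.path.getsize(path)
    if size < min_bytes:
        return f"small: {size / 1e6:.2f} MB, expect memorization rather than generalization"
    return f"ok: {size / 1e6:.2f} MB"
```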
Execution Steps
STEP 0 – Data Prep:
python utils/clean_wiki.py
python data/download_sft.py
Outputs:
data/raw/english_clean.txt, data/sft_data.jsonl
STEP 1 – Train tokenizer:
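To sanity-check the downloaded SFT file, you can peek at its first few records. The exact field names inside data/sft_data.jsonl are not documented here; this sketch only assumes the standard JSONL layout (one JSON object per line):

```python
import json

def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file (one JSON object per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            records.append(json.loads(line))
    return records
```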
python tokenizer/train_tokenizer.py
Outputs:
tokenizer/spm.model, tokenizer/spm.vocab
STEP 2 – Prepare dataset:
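The trained SentencePiece model splits text into subword pieces from a learned vocabulary. As a toy illustration of the idea (not the repo's actual algorithm, which SentencePiece implements at scale), here is greedy longest-match subword tokenization over a tiny hand-picked vocabulary:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization: at each position, take the
    longest vocabulary piece that matches, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

# greedy_tokenize("unhappiness", {"un", "happi", "ness"})
# → ["un", "happi", "ness"]
```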
python training/dataset.py --prepare
Outputs:
data/processed/train.bin, data/processed/val.bin. Prints token count and train/val split.
STEP 3 – Pretrain:
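The .bin files are typically just flat arrays of token ids. A minimal sketch of how such a split might be written, assuming uint16 ids (vocab ≤ 65,535) and a 90/10 split; the repo's actual format and split ratio may differ:

```python
from array import array

def write_split(token_ids, train_path, val_path, val_frac=0.1):
    """Pack token ids as unsigned 16-bit ints and split into train/val files."""
    split = int(len(token_ids) * (1 - val_frac))
    for path, ids in ((train_path, token_ids[:split]), (val_path, token_ids[split:])):
        buf = array("H", ids)  # "H" = unsigned 16-bit; every id must fit in 65535
        with open(path, "wb") as f:
            buf.tofile(f)
    return split
```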
python training/pretrain.py
Expected: val loss should drop below ~3.5. Checkpoints are saved to
checkpoints/ when val loss improves.
STEP 4 – Fine-tune:
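The checkpoint-on-improvement policy is simple to state: save only when the validation loss beats the best seen so far. A skeleton of that logic, with a stand-in save_fn where the real script would call torch.save (names here are illustrative):

```python
def train_loop(eval_losses, save_fn):
    """Track the best validation loss and checkpoint only on improvement.
    eval_losses stands in for periodic validation evaluations; save_fn for
    writing the model state to checkpoints/."""
    best = float("inf")
    saved = 0
    for loss in eval_losses:
        if loss < best:
            best = loss
            save_fn(best)  # e.g. torch.save(model.state_dict(), ...)
            saved += 1
    return best, saved
```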
python training/sft.py
Outputs:
checkpoints/sft_final.pt
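Chat-style fine-tuning usually renders each prompt/response pair into a single training string with role markers. The template below is purely illustrative; the repo's actual formatting lives in training/sft.py and its special tokens may differ:

```python
def format_example(prompt, response, bos="<s>", eos="</s>"):
    """Render one SFT pair into a single training string with role markers.
    The markers and special tokens here are assumptions, not the repo's."""
    return f"{bos}User: {prompt}\nAssistant: {response}{eos}"
```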
STEP 5 – Chat:
python inference/chat.py --checkpoint checkpoints/sft_final.pt
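The core of any chat loop is sampling the next token from the model's logits. A self-contained sketch of temperature sampling (lower temperature sharpens the distribution, higher flattens it); whether inference/chat.py exposes a temperature flag is not stated here:

```python
import math
import random

def sample(logits, temperature=0.8, rng=random):
    """Sample a token index from next-token logits with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```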
Expected Behavior
- With <1MB data: model will overfit, responses will be memorized text.
- With 5-20MB data: model will generalize and produce novel sentences.
- With 50MB+ data: model will feel like a real (small) language model.
Troubleshooting
- OOM error: reduce BATCH_SIZE to 4 or context_len to 256 in scripts/config.
- Loss stuck at ~9.0: tokenizer not trained; check that spm.model exists.
- Gibberish output: need more data or more training steps.
- CUDA not found: install torch with
pip install torch --index-url https://download.pytorch.org/whl/cu124
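A quick way to tell which of these situations you are in, degrading gracefully when torch itself is not installed (the helper name is illustrative):

```python
def cuda_status():
    """Report whether GPU training is possible on this machine."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu-only"
```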