John LLM
Setup (15 min)
pip install -r requirements.txt
Place your text corpus at data/raw/english.md.
- Minimum recommended size: 1MB of plain text for meaningful training
- Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
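Before running anything, it can help to verify the corpus meets the 1 MB guideline above. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
import os

def check_corpus(path="data/raw/english.md", min_bytes=1_000_000):
    """Warn if the training corpus is missing or below the 1 MB guideline."""
    if not os.path.exists(path):
        return f"missing: place your corpus at {path}"
    size = os.path.getsize(path)
    if size < min_bytes:
        return f"small: {size / 1e6:.2f} MB, expect memorization rather than generalization"
    return f"ok: {size / 1e6:.2f} MB"
```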
Execution Steps
STEP 0 – Data Prep:
python utils/clean_wiki.py
python data/download_sft.py
Outputs:
data/raw/english_clean.txt, data/sft_data.jsonl
STEP 1 – Train tokenizer:
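To sanity-check the downloaded SFT file, you can peek at its first few records. The exact field names inside data/sft_data.jsonl are not documented here; this sketch only assumes the standard JSONL layout (one JSON object per line):

```python
import json

def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file (one JSON object per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            records.append(json.loads(line))
    return records
```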
python tokenizer/train_tokenizer.py
Outputs:
tokenizer/spm.model, tokenizer/spm.vocab
STEP 2 – Prepare dataset:
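The trained SentencePiece model splits text into subword pieces from a learned vocabulary. As a toy illustration of the idea (not the repo's actual algorithm, which SentencePiece implements at scale), here is greedy longest-match subword tokenization over a tiny hand-picked vocabulary:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization: at each position, take the
    longest vocabulary piece that matches, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

# greedy_tokenize("unhappiness", {"un", "happi", "ness"})
# → ["un", "happi", "ness"]
```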
python training/dataset.py --prepare
Outputs:
data/processed/train.bin, data/processed/val.bin. Prints token count and train/val split.
STEP 3 – Pretrain:
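The .bin files are typically just flat arrays of token ids. A minimal sketch of how such a split might be written, assuming uint16 ids (vocab ≤ 65,535) and a 90/10 split; the repo's actual format and split ratio may differ:

```python
from array import array

def write_split(token_ids, train_path, val_path, val_frac=0.1):
    """Pack token ids as unsigned 16-bit ints and split into train/val files."""
    split = int(len(token_ids) * (1 - val_frac))
    for path, ids in ((train_path, token_ids[:split]), (val_path, token_ids[split:])):
        buf = array("H", ids)  # "H" = unsigned 16-bit; every id must fit in 65535
        with open(path, "wb") as f:
            buf.tofile(f)
    return split
```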
python training/pretrain.py
Expected: val loss should drop below ~3.5. Checkpoints are saved to
checkpoints/ when val loss improves.
STEP 4 – Fine-tune:
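The checkpoint-on-improvement policy is simple to state: save only when the validation loss beats the best seen so far. A skeleton of that logic, with a stand-in save_fn where the real script would call torch.save (names here are illustrative):

```python
def train_loop(eval_losses, save_fn):
    """Track the best validation loss and checkpoint only on improvement.
    eval_losses stands in for periodic validation evaluations; save_fn for
    writing the model state to checkpoints/."""
    best = float("inf")
    saved = 0
    for loss in eval_losses:
        if loss < best:
            best = loss
            save_fn(best)  # e.g. torch.save(model.state_dict(), ...)
            saved += 1
    return best, saved
```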
python training/sft.py
Outputs:
checkpoints/sft_final.pt
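Chat-style fine-tuning usually renders each prompt/response pair into a single training string with role markers. The template below is purely illustrative; the repo's actual formatting lives in training/sft.py and its special tokens may differ:

```python
def format_example(prompt, response, bos="<s>", eos="</s>"):
    """Render one SFT pair into a single training string with role markers.
    The markers and special tokens here are assumptions, not the repo's."""
    return f"{bos}User: {prompt}\nAssistant: {response}{eos}"
```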
STEP 5 – Chat:
python inference/chat.py --checkpoint checkpoints/sft_final.pt
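The core of any chat loop is sampling the next token from the model's logits. A self-contained sketch of temperature sampling (lower temperature sharpens the distribution, higher flattens it); whether inference/chat.py exposes a temperature flag is not stated here:

```python
import math
import random

def sample(logits, temperature=0.8, rng=random):
    """Sample a token index from next-token logits with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```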
Expected Behavior
- With <1MB data: model will overfit, responses will be memorized text.
- With 5-20MB data: model will generalize and produce novel sentences.
- With 50MB+ data: model will feel like a real (small) language model.
Troubleshooting
- OOM error: reduce BATCH_SIZE to 4 or context_len to 256 in scripts/config.
- Loss stuck at ~9.0: tokenizer not trained; check that spm.model exists.
- Gibberish output: need more data or more training steps.
- CUDA not found: install torch with
pip install torch --index-url https://download.pytorch.org/whl/cu124
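A quick way to tell which of these situations you are in, degrading gracefully when torch itself is not installed (the helper name is illustrative):

```python
def cuda_status():
    """Report whether GPU training is possible on this machine."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu-only"
```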