John LLM

Setup (15 min)

pip install -r requirements.txt

Place your text corpus at data/raw/english.md.

  • Minimum recommended size: 1MB of plain text for meaningful training
  • Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
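Before running anything, it can help to confirm the corpus actually meets the 1MB minimum. A minimal stdlib sketch (the path matches the one above; the threshold is the guideline from this README, not a hard limit enforced by the scripts):

```python
import os

CORPUS = "data/raw/english.md"
MIN_BYTES = 1_000_000  # ~1MB of plain text, per the guideline above


def corpus_ok(path: str, min_bytes: int = MIN_BYTES) -> bool:
    """Return True if the corpus file exists and is at least min_bytes."""
    return os.path.exists(path) and os.path.getsize(path) >= min_bytes


if __name__ == "__main__":
    if corpus_ok(CORPUS):
        print("Corpus looks big enough to start training.")
    else:
        print("Corpus missing or under 1MB; expect heavy overfitting.")
```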

Execution Steps

STEP 0 - Data Prep:

python utils/clean_wiki.py
python data/download_sft.py

Outputs: data/raw/english_clean.txt, data/sft_data.jsonl
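The cleaning script itself is not shown here; as a rough illustration of the kind of normalization such a step usually performs (this is an assumed sketch, not the actual logic of utils/clean_wiki.py):

```python
import re


def clean_wiki_text(text: str) -> str:
    """Crude wiki-markup cleanup: an illustrative sketch only,
    not the real utils/clean_wiki.py."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)  # strip bold/italic quote runs
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces
    return text.strip()
```

For example, `clean_wiki_text("'''Hello''' [[world|World]]")` yields `"Hello World"`.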

STEP 1 - Train tokenizer:

python tokenizer/train_tokenizer.py

Outputs: tokenizer/spm.model, tokenizer/spm.vocab

STEP 2 β€” Prepare dataset:

python training/dataset.py --prepare

Outputs: data/processed/train.bin, data/processed/val.bin
Prints token count and train/val split
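The .bin files are typically a flat array of token IDs. A minimal stdlib sketch of that packing step (the file names match the outputs above; the uint16 dtype and 90/10 split ratio are assumptions, check training/dataset.py for the real values):

```python
import os
from array import array


def write_bins(token_ids, out_dir="data/processed", val_frac=0.1):
    """Split a token-ID list into train/val and write each as raw uint16."""
    os.makedirs(out_dir, exist_ok=True)
    split = int(len(token_ids) * (1 - val_frac))
    for name, ids in (("train.bin", token_ids[:split]),
                      ("val.bin", token_ids[split:])):
        with open(os.path.join(out_dir, name), "wb") as f:
            array("H", ids).tofile(f)  # "H" = unsigned 16-bit integer
    print(f"{len(token_ids)} tokens -> {split} train / {len(token_ids) - split} val")
```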

STEP 3 - Pretrain:

python training/pretrain.py

Expected: val loss should drop below ~3.5
Checkpoints are saved to checkpoints/ whenever val loss improves
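"Save when val loss improves" is the standard best-so-far pattern. A sketch of that logic (the `save` callback stands in for whatever the training script actually calls, e.g. torch.save):

```python
def checkpoint_on_improvement(val_losses, save):
    """Call save(step, loss) whenever validation loss hits a new best."""
    best = float("inf")
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            save(step, loss)  # in the real script this would write a checkpoint
    return best
```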

STEP 4 - Fine-tune:

python training/sft.py

Outputs: checkpoints/sft_final.pt
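The fine-tuning data at data/sft_data.jsonl presumably holds one JSON example per line. A stdlib sketch of loading it (the "prompt"/"response" field names are an assumption; check data/download_sft.py for the actual schema):

```python
import json


def load_sft_examples(path):
    """Yield (prompt, response) pairs from a JSONL file.
    Field names are assumed; adjust to match the real schema."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            ex = json.loads(line)
            yield ex["prompt"], ex["response"]
```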

STEP 5 - Chat:

python inference/chat.py --checkpoint checkpoints/sft_final.pt
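At each step, a chat script samples the next token from the model's output logits. A pure-Python sketch of temperature-scaled softmax sampling (illustrative only, not the actual inference/chat.py; lower temperature makes the choice more deterministic):

```python
import math
import random


def sample_next(logits, temperature=0.8, rng=random):
    """Sample an index from logits after a temperature-scaled softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```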

Expected Behavior

  • With <1MB data: model will overfit, responses will be memorized text.
  • With 5-20MB data: model will generalize and produce novel sentences.
  • With 50MB+ data: model will feel like a real (small) language model.

Troubleshooting

  • OOM error: reduce BATCH_SIZE to 4 or context_len to 256 in scripts/config.
  • Loss stuck at ~9.0: the tokenizer was not trained; check that tokenizer/spm.model exists.
  • Gibberish output: need more data or more training steps.
  • CUDA not found: install torch with pip install torch --index-url https://download.pytorch.org/whl/cu124