# GPT-2 124M: Pretrained from Scratch + SFT (Dolly-15k)
A GPT-2 124M model trained entirely from scratch, then instruction-tuned.
## Training Pipeline
### Stage 1: Pretraining
- Dataset: FineWeb-Edu (5B tokens)
- Steps: 5000 of 9537 planned
- Hardware: Lightning AI H100
- Val loss: 10.95 → 3.31
- HellaSwag: 0.248 → 0.279 (baseline GPT-2 = 0.2955)
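HellaSwag is scored by asking the model which of four candidate endings of a passage it finds most likely. A common convention (the length-normalized `acc_norm` variant) averages per-token log-probs over each ending; a minimal sketch, independent of the model code, with a hypothetical `pick_ending` helper:

```python
def pick_ending(ending_logprobs):
    """Pick the most likely HellaSwag ending from per-token log-probs.

    ending_logprobs: one list of log-probs per candidate ending, as scored
    by the language model conditioned on the context. Length-normalizing
    (mean rather than sum) avoids penalizing longer endings.
    """
    scores = [sum(lp) / len(lp) for lp in ending_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Ending 1 has a higher mean log-prob (-0.5) than ending 0 (-1.0)
pick_ending([[-1.0, -1.0], [-0.5, -0.5, -0.5]])  # -> 1
```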
### Stage 2: SFT
- Dataset: databricks/databricks-dolly-15k (13.5K train / 1.5K val)
- Epochs: 1 of 3 planned
- Hardware: Kaggle T4
- Val loss: 3.31 → 2.68
- LR: 1e-5 with cosine decay
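A cosine-decay schedule like the one above maps a step number to a learning rate. A minimal sketch: only the peak LR of 1e-5 comes from the settings above; the warmup length and floor LR are assumptions.

```python
import math

def get_lr(step, total_steps, max_lr=1e-5, min_lr=0.0, warmup=0):
    # Optional linear warmup, then cosine decay from max_lr down to min_lr.
    if warmup and step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```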
## Usage
```python
import torch
import tiktoken
from torch.nn import functional as F

# Load model (paste the GPT class definition first)
checkpoint = torch.load('model_sft_best.pt', weights_only=False, map_location='cuda')
model = GPT(checkpoint['config'])
model.load_state_dict(checkpoint['model'])
model.cuda()
model.eval()

enc = tiktoken.get_encoding('gpt2')
EOT = enc._special_tokens['<|endoftext|>']

def chat(prompt, max_new_tokens=200, temperature=0.7, top_k=40):
    formatted = f"<|user|>\n{prompt}\n<|assistant|>\n"
    tokens = enc.encode(formatted)
    x = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).cuda()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Temperature-scaled logits for the last position only
            logits = model(x)[:, -1, :].float() / temperature
            # Top-k filtering: mask everything below the k-th largest logit
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
            next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
            if next_tok.item() == EOT:
                break
            x = torch.cat([x, next_tok], dim=1)
    # Return only the generated continuation, not the prompt
    return enc.decode(x[0, len(tokens):].tolist())

print(chat("What are the benefits of exercise?"))
```
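The `<|user|>` / `<|assistant|>` template in `chat` mirrors how each Dolly record would be serialized during SFT. A hedged sketch: the field names are Dolly-15k's, but the exact serialization used in training is an assumption, and `format_dolly` is a hypothetical helper.

```python
def format_dolly(example):
    # Dolly-15k records carry 'instruction', 'context', and 'response' fields.
    # Prepend the optional context to the instruction, then wrap it in the same
    # chat template the Usage snippet expects, terminated with <|endoftext|>.
    prompt = example['instruction']
    if example.get('context'):
        prompt = f"{example['context']}\n\n{prompt}"
    return f"<|user|>\n{prompt}\n<|assistant|>\n{example['response']}\n<|endoftext|>"
```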
## Limitations
- 124M params → limited factual knowledge
- 1 epoch of SFT → basic instruction following
- Best for: learning the pretrain → SFT pipeline end-to-end