GPT-2 124M β€” Pretrained from Scratch + SFT (Dolly-15k)

A GPT-2 124M model trained entirely from scratch, then instruction-tuned.

Training Pipeline

Stage 1 β€” Pretraining

  • Dataset: FineWeb-Edu (5B tokens)
  • Steps: 5,000 of 9,537 planned
  • Hardware: Lightning AI H100
  • Val loss: 10.95 β†’ 3.31
  • HellaSwag: 0.248 β†’ 0.279 (baseline GPT-2 = 0.2955)
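The starting validation loss is a useful sanity check: a freshly initialized model predicts roughly uniformly over GPT-2's 50,257-token vocabulary, so cross-entropy should begin near ln(50257) ≈ 10.82, consistent with the observed 10.95. A minimal check, using uniform logits as a stand-in for an untrained model:

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 50257  # GPT-2 BPE vocabulary size
logits = torch.zeros(1, vocab_size)  # all-equal logits = uniform prediction
target = torch.tensor([42])          # any target token gives the same loss
loss = F.cross_entropy(logits, target)
print(f"{loss.item():.2f} vs ln(V) = {math.log(vocab_size):.2f}")  # 10.82 vs 10.82
```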

Stage 2 β€” SFT

  • Dataset: databricks/databricks-dolly-15k (13.5K train / 1.5K val)
  • Epochs: 1 of 3 planned
  • Hardware: Kaggle T4
  • Val loss: 3.31 β†’ 2.68
  • LR: 1e-5 with cosine decay
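Dolly-15k records carry `instruction`, `context`, and `response` fields. The exact SFT serialization isn't shown here; assuming it mirrors the `<|user|>`/`<|assistant|>` chat format from the Usage section, a record would be formatted roughly like this (the `format_example` helper is illustrative, not part of the released code):

```python
def format_example(record: dict) -> str:
    # Dolly-15k fields: instruction, context (often empty), response.
    # Assumed template: matches the chat format used at inference time.
    prompt = record["instruction"]
    if record.get("context"):
        prompt += "\n\n" + record["context"]
    return f"<|user|>\n{prompt}\n<|assistant|>\n{record['response']}<|endoftext|>"

example = {"instruction": "Name a primary color.", "context": "", "response": "Red."}
print(format_example(example))
```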

Usage

import torch
import tiktoken
from torch.nn import functional as F

# Load model (paste GPT class definition first)
checkpoint = torch.load('model_sft_best.pt', weights_only=False, map_location='cuda')
model = GPT(checkpoint['config'])
model.load_state_dict(checkpoint['model'])
model.cuda()  # move parameters to GPU; the inputs below are created on CUDA
model.eval()

enc = tiktoken.get_encoding('gpt2')
EOT = enc.eot_token  # id of '<|endoftext|>' (avoids the private _special_tokens dict)

def chat(prompt, max_new_tokens=200, temperature=0.7, top_k=40):
    formatted = f"<|user|>\n{prompt}\n<|assistant|>\n"
    tokens = enc.encode(formatted)
    x = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).cuda()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(x)[:, -1, :].float() / temperature  # last-position logits, temperature-scaled
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
            next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
            if next_tok.item() == EOT:
                break
            x = torch.cat([x, next_tok], dim=1)
    return enc.decode(x[0, len(tokens):].tolist())

print(chat("What are the benefits of exercise?"))

Limitations

  • 124M params β€” limited factual knowledge
  • 1 epoch SFT β€” basic instruction following
  • Best for: learning the pretrain β†’ SFT pipeline end-to-end