# GPT-2 124M: Pretrained from Scratch + SFT (Dolly-15k)
A GPT-2 124M model trained entirely from scratch, then instruction-tuned.
## Training Pipeline
### Stage 1: Pretraining
- Dataset: FineWeb-Edu (5B tokens)
- Steps: 5000 of 9537 planned
- Hardware: Lightning AI H100
- Val loss: 10.95 → 3.31
- HellaSwag: 0.248 → 0.279 (baseline GPT-2 = 0.2955)
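HellaSwag is scored by asking the model which of four candidate endings of a passage it finds most likely. A common convention (the length-normalized `acc_norm` variant) averages per-token log-probs over each ending; a minimal sketch, independent of the model code, with a hypothetical `pick_ending` helper:

```python
def pick_ending(ending_logprobs):
    """Pick the most likely HellaSwag ending from per-token log-probs.

    ending_logprobs: one list of log-probs per candidate ending, as scored
    by the language model conditioned on the context. Length-normalizing
    (mean rather than sum) avoids penalizing longer endings.
    """
    scores = [sum(lp) / len(lp) for lp in ending_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Ending 1 has a higher mean log-prob (-0.5) than ending 0 (-1.0)
pick_ending([[-1.0, -1.0], [-0.5, -0.5, -0.5]])  # -> 1
```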
### Stage 2: SFT
- Dataset: databricks/databricks-dolly-15k (13.5K train / 1.5K val)
- Epochs: 1 of 3 planned
- Hardware: Kaggle T4
- Val loss: 3.31 → 2.68
- LR: 1e-5 with cosine decay
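A cosine-decay schedule like the one above maps a step number to a learning rate. A minimal sketch: only the peak LR of 1e-5 comes from the settings above; the warmup length and floor LR are assumptions.

```python
import math

def get_lr(step, total_steps, max_lr=1e-5, min_lr=0.0, warmup=0):
    # Optional linear warmup, then cosine decay from max_lr down to min_lr.
    if warmup and step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```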
## Usage
```python
import torch
import tiktoken
from torch.nn import functional as F

# Load model (paste the GPT class definition first)
checkpoint = torch.load('model_sft_best.pt', weights_only=False, map_location='cuda')
model = GPT(checkpoint['config'])
model.load_state_dict(checkpoint['model'])
model.cuda()
model.eval()

enc = tiktoken.get_encoding('gpt2')
EOT = enc._special_tokens['<|endoftext|>']

def chat(prompt, max_new_tokens=200, temperature=0.7, top_k=40):
    formatted = f"<|user|>\n{prompt}\n<|assistant|>\n"
    tokens = enc.encode(formatted)
    x = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).cuda()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Temperature-scaled logits for the last position only
            logits = model(x)[:, -1, :].float() / temperature
            # Top-k filtering: mask everything below the k-th largest logit
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
            next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
            if next_tok.item() == EOT:
                break
            x = torch.cat([x, next_tok], dim=1)
    # Return only the generated continuation, not the prompt
    return enc.decode(x[0, len(tokens):].tolist())

print(chat("What are the benefits of exercise?"))
```
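The `<|user|>` / `<|assistant|>` template in `chat` mirrors how each Dolly record would be serialized during SFT. A hedged sketch: the field names are Dolly-15k's, but the exact serialization used in training is an assumption, and `format_dolly` is a hypothetical helper.

```python
def format_dolly(example):
    # Dolly-15k records carry 'instruction', 'context', and 'response' fields.
    # Prepend the optional context to the instruction, then wrap it in the same
    # chat template the Usage snippet expects, terminated with <|endoftext|>.
    prompt = example['instruction']
    if example.get('context'):
        prompt = f"{example['context']}\n\n{prompt}"
    return f"<|user|>\n{prompt}\n<|assistant|>\n{example['response']}\n<|endoftext|>"
```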
## Limitations
- 124M params → limited factual knowledge
- 1 epoch of SFT → basic instruction following
- Best for: learning the pretrain → SFT pipeline end-to-end