GPT Language Model

A 124M-parameter GPT model trained from scratch in PyTorch.

This project contains:

custom multi-head self-attention
transformer blocks
causal masking
autoregressive text generation
mixed precision training
top-k / top-p sampling
safetensors model weights

The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.

Architecture

Model configuration:

{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

Parameter count:

~124M parameters
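
The ~124M figure can be reproduced from the configuration above. The sketch below assumes a GPT-2-style layout that this README does not spell out: a 4x feed-forward expansion, LayerNorm with scale and shift, biases on the attention output projection and the MLP layers, and an output head without bias. The helper name `count_params` and its arguments are hypothetical.

```python
def count_params(cfg, tie_weights=True, mlp_ratio=4):
    """Estimate the parameter count of a GPT-2-style decoder (assumed layout)."""
    d, v, t = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    qkv_bias = d if cfg["qkv_bias"] else 0

    tok_emb = v * d  # token embedding table
    pos_emb = t * d  # learned positional embedding table
    # Q, K, V projections (optional bias) plus an output projection with bias
    attn = 3 * (d * d + qkv_bias) + (d * d + d)
    # Two linear layers, d -> mlp_ratio * d -> d, both with bias
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * (2 * d)
    block = attn + mlp + norms
    final_norm = 2 * d
    # A tied output head reuses the token embedding and adds nothing
    out_head = 0 if tie_weights else v * d
    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

print(f"tied head:   {count_params(cfg, tie_weights=True) / 1e6:.1f}M")
print(f"untied head: {count_params(cfg, tie_weights=False) / 1e6:.1f}M")
```

Under these assumptions the total is about 123.8M when the output head shares its weights with the token embedding, and about 162.4M when the head is stored as a separate matrix.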

Architecture components:

token embeddings
positional embeddings
masked multi-head self-attention
feed-forward MLP blocks
pre-layer normalization
residual connections
causal language modeling head
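
As an illustration of how the components listed above usually fit together, here is a minimal sketch of one pre-LN transformer block with causally masked multi-head self-attention. It is not the code in model.py: the class names, the fused QKV projection, the GELU activation, and the 4x MLP width are assumptions made for this example.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    def __init__(self, emb_dim, n_heads, context_length, drop_rate, qkv_bias=False):
        super().__init__()
        assert emb_dim % n_heads == 0, "emb_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=qkv_bias)
        self.proj = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(drop_rate)
        # True above the diagonal marks future positions that must stay hidden
        mask = torch.triu(
            torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1
        )
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        # Each position may attend only to itself and earlier positions
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.proj(out)


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        d = cfg["emb_dim"]
        self.norm1 = nn.LayerNorm(d)
        self.attn = CausalSelfAttention(
            d, cfg["n_heads"], cfg["context_length"], cfg["drop_rate"], cfg["qkv_bias"]
        )
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Pre-LN: normalize before each sub-layer, then add the residual
        x = x + self.drop(self.attn(self.norm1(x)))
        x = x + self.drop(self.mlp(self.norm2(x)))
        return x


cfg = {"context_length": 256, "emb_dim": 768, "n_heads": 12, "drop_rate": 0.1, "qkv_bias": False}
block = TransformerBlock(cfg).eval()    # eval() disables dropout
x = torch.randn(2, 16, cfg["emb_dim"])  # (batch, tokens, emb_dim)
print(block(x).shape)                   # same shape as the input
```

Because of the mask, changing a later token leaves the outputs at earlier positions unchanged, which is the property autoregressive generation relies on.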

Training

Training setup:

PyTorch
AdamW optimizer
Automatic Mixed Precision (AMP)
Gradient clipping
Top-k / Top-p text generation

Hardware used:

RTX 3060 Ti 8GB

Dataset:

FineWeb-Edu subset (100M tokens)

Tokenizer:

GPT-2 tokenizer

Training Progress

The graph below shows train/validation loss progression during training on FineWeb-Edu.

Installation

Install dependencies:

pip install torch transformers safetensors

Loading the Model

import json
import torch

from safetensors.torch import load_file
from transformers import AutoTokenizer

from model import GPTModel

# load config
with open("config.json") as f:
    cfg = json.load(f)

# create model
model = GPTModel(cfg)

# load weights
state_dict = load_file("model.safetensors")

model.load_state_dict(state_dict)

model.eval()

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")

Text Generation Example

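from model import generate_and_print_sample

# The model must be on the same device as its inputs
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(generate_and_print_sample(model, tokenizer, device, "The world is big"))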
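
The internals of `generate_and_print_sample` are not shown in this README. As a rough sketch of what top-k / top-p sampling does at each generation step, the hypothetical helper below filters the logits for the next token and then samples from what remains. The function name and default arguments are assumptions for illustration only.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id per row from (batch, vocab) logits."""
    logits = logits / temperature
    if top_k is not None:
        # Keep only the k largest logits in each row
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = probs.cumsum(dim=-1)
        # Drop a token when the probability mass before it already exceeds
        # top_p; the most likely token is therefore always kept
        remove = (cum - probs) > top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        # Undo the sort so positions line up with token ids again
        logits = torch.full_like(logits, float("-inf")).scatter(
            -1, sorted_idx, sorted_logits
        )
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)


logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
print(sample_next_token(logits, top_k=2))               # id 0 or 1
print(sample_next_token(logits, top_k=3, top_p=0.9))
```

In a generation loop the sampled id would be appended to the context, truncated to the last `context_length` tokens, and fed back into the model for the next step.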

Sample Generations

Example generations from early-stage training:

"The world is big. If you are a scientist, you can’t have to worry about how much the science is going on in your area. What I want is that there is no way to know when it comes to the idea of what the world is doing with our society? We need to understand that this means that we need to be able to take action against the problem and find out which ones will be exposed to the issue, rather than where it has been done. If you have any questions or comments, you might not have heard from your doctor. You may have heard from your doctor for more information about the topic. The best way to do so is to use a lot of information. For example, if you don’t like a doctor, you should be able to tell you how much you will have at home and why you would like to talk to someone else who has never visited them. In order to make sure that they are safe, you can also get the right answer. What kind of questions will I like to ask? Please refer to the following link: - What kind of questions will I need to help me determine if my child has had cancer? - How will I respond to treatment? Will my child receive the same chemotherapy?"

The model currently demonstrates:

Coherent paragraph generation
Long-form text continuation
Scientific and educational writing style
Basic topic consistency across multiple sentences
Emergent reasoning and abstraction patterns
Generation of novel names and phrases
Structured article-like prose
Stable grammar and syntax generation

Current limitations:

Factual inaccuracies
Semantic repetition
Weak instruction following
Limited reasoning depth
Hallucinated entities and concepts

Files

model.py              # GPT architecture
model.safetensors     # trained weights
config.json           # model configuration
tokenizer files       # GPT-2 tokenizer assets
README.md             # project documentation

Notes

This is a custom PyTorch implementation and is not directly compatible with Hugging Face AutoModelForCausalLM.

Users should load the model using the provided model.py architecture.

License

MIT License.
