GPT Language Model
A 124M-parameter GPT model trained from scratch in PyTorch.
This project contains:
- custom multi-head self-attention
- transformer blocks
- causal masking
- autoregressive text generation
- mixed precision training
- top-k / top-p sampling
- safetensors model weights
The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.
Architecture
Model configuration:
{
"vocab_size": 50257,
"context_length": 256,
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"drop_rate": 0.1,
"qkv_bias": false
}
Approximate parameter count:
- ~124M parameters
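The ~124M figure can be sanity-checked from the configuration above. The sketch below is a back-of-the-envelope count, not code from model.py: the function name `estimate_gpt_params`, the 4x MLP expansion, the bias terms, and the `tie_embeddings` switch are all assumptions based on the usual GPT-2 layout.

```python
def estimate_gpt_params(cfg, tie_embeddings=True, mlp_ratio=4):
    """Rough parameter count for a GPT-2 style decoder (assumed layout)."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d          # token embedding table
    pos_emb = cfg["context_length"] * d      # learned positional embeddings

    # Attention: Q, K, V projections (bias optional) plus output projection with bias
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d

    # Feed-forward MLP: d -> mlp_ratio*d -> d, both layers with bias (assumed)
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)

    # Two LayerNorms per block, each with scale and shift
    norms = 2 * (2 * d)

    per_layer = qkv + attn_out + mlp + norms
    final_norm = 2 * d

    # If the output head shares weights with the token embedding, it adds nothing
    head = 0 if tie_embeddings else cfg["vocab_size"] * d

    return tok_emb + pos_emb + cfg["n_layers"] * per_layer + final_norm + head


cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

tied = estimate_gpt_params(cfg, tie_embeddings=True)
untied = estimate_gpt_params(cfg, tie_embeddings=False)
print(f"tied head:   {tied / 1e6:.1f}M")
print(f"untied head: {untied / 1e6:.1f}M")
```

Under these assumptions the count lands near 124M only when the output head is shared with (or not counted separately from) the token embedding. A separate, untied head would add another vocab_size x emb_dim weights. Whether model.py ties the two is not stated here.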
Architecture components:
- token embeddings
- positional embeddings
- masked multi-head self-attention
- feed-forward MLP blocks
- pre-layer normalization
- residual connections
- causal language modeling head
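To illustrate what the causal mask in masked self-attention does, here is a minimal single-head sketch in plain Python. It is not the implementation in model.py: the function name `causal_attention` is hypothetical, and it omits the learned projections, multiple heads, and dropout.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats). Position i may only
    attend to positions j <= i, which is what the causal mask enforces.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores for the visible prefix only; future positions are excluded,
        # which is equivalent to setting their scores to -inf before softmax.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the visible value vectors
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)

# Changing the last token must not change earlier outputs
x_changed = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
y_changed = causal_attention(x_changed, x_changed, x_changed)
```

The first position can only see itself, so its output equals its own value vector, and edits to later tokens leave earlier outputs untouched. That property is what makes autoregressive generation possible.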
Training
Training setup:
- PyTorch
- AdamW optimizer
- Automatic Mixed Precision (AMP)
- Gradient clipping
- Top-k / Top-p text generation
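For readers unfamiliar with the gradient clipping item above, the sketch below shows clipping by global norm in plain Python. It mirrors the general behaviour of PyTorch's `torch.nn.utils.clip_grad_norm_`, but the function name `clip_grad_norm`, the flat-list representation, and the `max_norm` value are illustrative assumptions. The clipping threshold actually used in training is not stated here.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of flat gradient lists, one per parameter tensor.
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


grads = [[3.0], [4.0]]            # global norm is 5.0
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))

small = [[0.1], [0.2]]            # already below the threshold
unchanged, _ = clip_grad_norm(small, max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited. This helps keep mixed precision training stable when an occasional batch produces a large gradient.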
Hardware used:
- RTX 3060 Ti 8GB
Dataset:
- FineWeb-Edu subset (100M tokens)
Tokenizer:
- GPT-2 tokenizer
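A tokenized corpus is typically cut into fixed-length windows whose targets are the inputs shifted by one token. The sketch below shows that idea under assumed settings: the helper name `make_training_windows` and the non-overlapping stride are hypothetical, and the data loader actually used for training is not described here.

```python
def make_training_windows(token_ids, context_length, stride=None):
    """Build (input, target) pairs for next-token prediction.

    Each target is the input window shifted one position to the right.
    stride defaults to context_length, i.e. non-overlapping windows.
    """
    stride = stride or context_length
    pairs = []
    # Stop early enough that the shifted target window stays in range
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


tokens = list(range(10))           # stand-in for real token IDs
pairs = make_training_windows(tokens, context_length=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With a context length of 256 and non-overlapping windows, a 100M-token subset would yield roughly 390,000 such training sequences per pass.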
Training Progress
The graph below shows train/validation loss progression during training on FineWeb-Edu.
Installation
Install dependencies:
pip install torch transformers safetensors
Loading The Model
import json
import torch
from safetensors.torch import load_file
from transformers import AutoTokenizer
from model import GPTModel
# load config
with open("config.json") as f:
    cfg = json.load(f)
# create model
model = GPTModel(cfg)
# load weights
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()
# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
Text Generation Example
from model import generate_and_print_sample
print(generate_and_print_sample(model, tokenizer, "cuda", "The world is big"))
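The internals of `generate_and_print_sample` are not shown here. As a rough illustration of how top-k and top-p (nucleus) filtering are commonly combined when picking the next token, here is a plain-Python sketch. The function name `sample_top_k_top_p` and the default values are assumptions, not the settings used by this model.

```python
import math
import random


def sample_top_k_top_p(logits, top_k=50, top_p=0.95, temperature=1.0, rng=random):
    """Sample one token index after top-k and top-p filtering (illustrative)."""
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # Softmax over the surviving candidates (numerically stable)
    m = scaled[order[0]]
    exps = [math.exp(scaled[i] - m) for i in order]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities
    threshold = rng.random() * sum(p for _, p in kept)
    running = 0.0
    for idx, p in kept:
        running += p
        if threshold < running:
            return idx
    return kept[-1][0]


logits = [0.0, 10.0, 0.0, 2.0]
greedy = [sample_top_k_top_p(logits, top_k=1) for _ in range(20)]
nucleus = [sample_top_k_top_p(logits, top_k=None, top_p=0.5) for _ in range(20)]
```

Setting `top_k=1` reduces to greedy decoding. Lower `top_p` values trim the low-probability tail, which tends to reduce the hallucinated entities listed under the limitations below at the cost of less varied output.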
Sample Generations
An example generation from early-stage training:
"The world is big. If you are a scientist, you can’t have to worry about how much the science is going on in your area. What I want is that there is no way to know when it comes to the idea of what the world is doing with our society? We need to understand that this means that we need to be able to take action against the problem and find out which ones will be exposed to the issue, rather than where it has been done. If you have any questions or comments, you might not have heard from your doctor. You may have heard from your doctor for more information about the topic. The best way to do so is to use a lot of information. For example, if you don’t like a doctor, you should be able to tell you how much you will have at home and why you would like to talk to someone else who has never visited them. In order to make sure that they are safe, you can also get the right answer. What kind of questions will I like to ask? Please refer to the following link: - What kind of questions will I need to help me determine if my child has had cancer? - How will I respond to treatment? Will my child receive the same chemotherapy?"
The model currently demonstrates:
- Coherent paragraph generation
- Long-form text continuation
- Scientific and educational writing style
- Basic topic consistency across multiple sentences
- Emergent reasoning and abstraction patterns
- Generation of novel names and phrases
- Structured article-like prose
- Stable grammar and syntax generation
Current limitations:
- Factual inaccuracies
- Semantic repetition
- Weak instruction following
- Limited reasoning depth
- Hallucinated entities and concepts
Files
model.py # GPT architecture
model.safetensors # trained weights
config.json # model configuration
tokenizer files # GPT-2 tokenizer assets
README.md # project documentation
Notes
This is a custom PyTorch implementation and is not directly compatible with Hugging Face AutoModelForCausalLM.
Users should load the model using the provided model.py architecture.
License
MIT License.
