🔬 Research Artifact & Base Language Model. Portimbria-150M is a next-token predictor — not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine — and please publish your results! See Intended Uses for details.
💡 Built by a solo 14-year-old developer, on a laptop, for $0. Every model StentorLabs has released — including this one — was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Portimbria-150M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.
What Is This?
Portimbria-150M is the first 150M-parameter model from StentorLabs and the inaugural entry in the Portimbria model family — a new scaling tier above the Stentor2 line. The name is a deliberate rearrangement of Portia fimbriata, a jumping spider famous for being extraordinarily intelligent relative to its tiny body size. That tension — compact but capable — is the design philosophy of this model family.
At ~151M parameters, Portimbria-150M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained on approximately 6 billion tokens drawn from a web/code/math curriculum, with a 4096-token context window — the longest in the StentorLabs model lineup to date.
Like all StentorLabs models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.
The key architectural differentiators from Stentor2-12M are: a ~12× parameter scale-up (12.3M → 151M), a 4× longer context window (4096 vs 1024 tokens), Grouped Query Attention (6 query heads, 2 KV heads), and a standard Mistral BPE vocabulary (32,768 tokens) rather than a compact custom tokenizer. This enables full compatibility with the standard AutoTokenizer ecosystem.
GQA training stability is worth noting: Stentor2-12M-Preview experienced minor training instability when GQA was first introduced, largely because at 12M parameters the model simply wasn't large enough to absorb the optimization pressure smoothly. At 151M parameters — more than 12 times larger — Portimbria-150M handled GQA training without issue. The benefits (smaller KV cache, faster inference, no quality loss) clearly outweigh the minor challenge that existed only at the 12M scale.
The Portimbria Name
Why "Portimbria"?
Portia fimbriata is a species of jumping spider native to Queensland, Australia. It is considered one of the most cognitively sophisticated spiders ever studied — capable of problem-solving, planning, and learned behavior — yet it fits comfortably on a fingertip. The word "Portimbria" is a scrambled encoding of the species name, chosen to reflect the same principle: a model small enough to train for free, yet ambitious enough to compete meaningfully with models trained at far greater cost.
Model Architecture
Portimbria-150M is a LlamaForCausalLM model with Grouped Query Attention (GQA), a 32,768-token Mistral BPE vocabulary, and a 4096-token context window.
| Component | Value | Notes |
|---|---|---|
| Architecture | LlamaForCausalLM | Standard transformer decoder |
| Hidden Size | 768 | |
| Intermediate Size (FFN) | 2,048 | SwiGLU activation |
| Num Hidden Layers | 20 | |
| Num Attention Heads | 6 | |
| Num Key/Value Heads | 2 | GQA — 3:1 query-to-KV ratio |
| Context Length | 4,096 tokens | |
| Vocab Size | 32,768 | Mistral BPE |
| Total Parameters | 151,026,432 | |
| Positional Encoding | RoPE | rope_theta = 50,000.0 |
Full architecture spec, GQA explanation & parameter count breakdown
Full Core Configuration
| Component | Value | Notes |
|---|---|---|
| Architecture | LlamaForCausalLM | Standard transformer decoder |
| Hidden Size | 768 | |
| Intermediate Size (FFN) | 2,048 | Hidden × 2.67 (SwiGLU with 3 matrices) |
| Num Hidden Layers | 20 | |
| Num Attention Heads | 6 | |
| Num Key/Value Heads | 2 | GQA — 3:1 query-to-KV ratio |
| Head Dimension | 128 | 768 ÷ 6 — TPU v5e optimal |
| KV Dimension | 256 | 768 × (2/6) |
| Vocab Size | 32,768 | Mistral BPE, padded to multiple of 128 |
| Max Position Embeddings | 4,096 | block_size in training script |
| Hidden Activation | SiLU | LlamaForCausalLM default |
| Positional Encoding | RoPE | rope_theta = 50,000.0 |
| RMS Norm Epsilon | 1e-5 | |
| Tie Word Embeddings | True | Shared embedding / LM head |
| Attention Bias | False | |
| MLP Bias | False | |
| Attention Implementation | SDPA | PyTorch Scaled Dot Product Attention |
Why GQA?
Grouped Query Attention (6Q, 2KV) reduces the KV cache memory footprint by 67% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to 2 × num_kv_heads × head_dim × num_layers × seq_len. With 2 KV heads instead of 6, the cache shrinks to one-third of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.
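The cache arithmetic is easy to check. A small sketch (dimensions from the architecture table; FP16/BF16 assumed at 2 bytes per element) compares this GQA configuration against a hypothetical full-MHA variant with 6 KV heads:

```python
def kv_cache_bytes(num_kv_heads, head_dim=128, num_layers=20,
                   seq_len=4096, bytes_per_element=2):
    # Leading 2 accounts for the separate K and V tensors per layer
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_element

gqa = kv_cache_bytes(num_kv_heads=2)   # Portimbria-150M (GQA)
mha = kv_cache_bytes(num_kv_heads=6)   # hypothetical full-MHA variant

print(f"GQA cache: {gqa / 2**20:.0f} MiB")   # 80 MiB in BF16/FP16
print(f"MHA cache: {mha / 2**20:.0f} MiB")   # 240 MiB
print(f"Reduction: {1 - gqa / mha:.0%}")     # 67%
```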
Parameter Count Breakdown
```python
def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down
    norm = 2 * hidden_size  # input + post-attention RMSNorm
    total = vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total
```
Plugging in Portimbria-150M values:
```
kv_dim     = 768 × (2/6) = 256
q_proj     = 768 × 768 = 589,824
k_proj     = 768 × 256 = 196,608
v_proj     = 768 × 256 = 196,608
o_proj     = 768 × 768 = 589,824
attn/layer = 1,572,864
mlp/layer  = 3 × 768 × 2,048 = 4,718,592
norm/layer = 2 × 768 = 1,536
per_layer  = 6,292,992

embedding  = 32,768 × 768 = 25,165,824
layers     = 20 × 6,292,992 = 125,859,840
final_norm = 768
total      = 25,165,824 + 125,859,840 + 768 = 151,026,432 ✓
```
| Component | Parameters | % of Total |
|---|---|---|
| Embedding Table (tied with LM Head) | 25,165,824 | 16.7% |
| Transformer Layers × 20 | 125,859,840 | 83.3% |
| — Attention (per layer × 20) | 31,457,280 | 20.8% |
| — FFN/MLP (per layer × 20) | 94,371,840 | 62.5% |
| — Layer Norms (per layer × 20) | 30,720 | 0.02% |
| Final RMS Norm | 768 | 0.001% |
| Total | 151,026,432 | 100% |
With a standard 32K vocabulary, embedding takes only 16.7% of the parameter budget — leaving 83.3% for the transformer stack that actually learns language patterns. This represents a healthy allocation at this scale, especially with GQA dramatically cutting the KV head count without sacrificing hidden dimension depth.
Head-to-Head: StentorLabs Model Family
Comparison table vs Stentor2-12M and Stentor2-30M
| Property | Stentor2-12M | Stentor2-30M | Portimbria-150M |
|---|---|---|---|
| Vocabulary | 8,064 (TokenMonster) | 8,064 (TokenMonster) | 32,768 (Mistral BPE) |
| Hidden Size | 256 | 512 | 768 |
| Intermediate Size | 512 | 1,024 | 2,048 |
| Num Layers | 12 | 10 | 20 |
| Attention Heads | 4 | 8 | 6 |
| KV Heads | 4 (MHA) | 8 (MHA) | 2 (GQA) |
| Head Dimension | 64 | 64 | 128 |
| Context Length | 1,024 | 1,024 | 4,096 |
| Total Parameters | 12.3M | 30.4M | 151.0M |
| Embedding Share | 16.8% | 13.6% | 16.7% |
| Training Tokens | 480M | 800M | ~6B |
| Training Hardware | 2× T4 | 2× T4 | TPU v5e-8 |
| Training Time | ~5h | ~6.75h | ~8h |
| Best Perplexity | 26.61 | 18.07 | 18.00 |
| Tokenizer | TokenMonster | TokenMonster | Mistral BPE |
Cross-family comparison caveat: PPL values are not directly comparable across families for two compounding reasons. First, Stentor2 models use TokenMonster (8K vocab) while Portimbria-150M uses Mistral BPE (32K vocab) — different tokenizers produce different token spaces and therefore different raw perplexity scales. Second, and more importantly, the Stentor1 family was trained exclusively on Cosmopedia + FineWeb-Edu, and the Stentor2 family on StenCore-PDF + FineWeb-HQ — both purely web/document text with zero code or math. Portimbria-150M is the first StentorLabs model trained on a web + code + math curriculum (FineWeb-HQ 75%, StarCoderData 15%, FineMath-4+ 10%). The harder, more structured distributions of code and math raise the effective loss target, meaning a direct PPL comparison against any prior StentorLabs model significantly understates Portimbria-150M's real capability improvement.
Memory Requirements
How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at full 4096-token context — KV cache scales linearly, so at 1024 tokens it's roughly ¼ of the values shown.
| Precision | Weights | KV Cache (4096 ctx) | Total VRAM |
|---|---|---|---|
| FP32 | ~604 MB | ~160 MB | ~764 MB |
| FP16 / BF16 | ~302 MB | ~80 MB | ~382 MB |
| INT8 | ~151 MB | ~80 MB | ~231 MB |
| INT4 | ~76 MB | ~80 MB | ~156 MB |
KV cache note: GQA (2 KV heads) already reduces the KV cache by 67% vs standard MHA at the same hidden size — the figures above reflect this. Formula:
2 (K+V) × 2 (KV heads) × 128 (head_dim) × 20 (layers) × seq_len × bytes_per_element.
Weights note: Weights are saved as FP32 in safetensors. Cast on load with `torch_dtype=torch.float16` or `torch_dtype=torch.bfloat16` to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the Quantization section.
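The weight figures in the table follow directly from the parameter count. A minimal sketch (decimal megabytes, matching the table's convention; INT4 at 0.5 bytes/param is an idealized packing assumption):

```python
PARAMS = 151_026_432  # total parameter count from the architecture section

bytes_per_param = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    mb = PARAMS * nbytes / 1e6  # decimal megabytes, as in the table above
    print(f"{precision:>9}: ~{mb:.0f} MB weights")
```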
🚀 Quick Start
1. Install Dependencies
```
pip install transformers torch safetensors
```
2. Load the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")
model = model.eval()
```
3. Generate Text
```python
prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
ℹ️ No custom tokenizer required. Portimbria-150M uses the Mistral BPE tokenizer via `AutoTokenizer`. No additional packages needed beyond `transformers`.
Pipeline usage & recommended generation settings
4. Using the Pipeline
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="auto",
)
result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
print(result[0]["generated_text"])
```
5. Recommended Generation Settings
| Parameter | Recommended Range | Notes |
|---|---|---|
| `temperature` | 0.5 – 0.8 | Lower values (0.5–0.6) give more coherent, on-topic output; higher values (0.7–0.8) give more variety. Stay below 1.0. |
| `top_p` | 0.85 – 0.90 | This range prevents gibberish and completely random tokens without over-restricting word choice. |
| `repetition_penalty` | 1.05 – 1.2 | Stops looping and over-repetition while keeping outputs high quality. The sweet spot is 1.1. |
| `max_new_tokens` | 40 – 4096 | Depends entirely on your goal. For a quick definition or fact, 40–60 is enough. For a story or long document, use 2000–4096. |
Temperature guidance: Lower temperature keeps the model closer to its learned distribution and more likely to stay on topic. Higher temperature increases creativity and diversity at the cost of some coherence.
max_new_tokens guidance: Don't set this too low for creative tasks — the model often generates an EOS token and stops on its own before hitting the ceiling anyway. Setting a generous ceiling (e.g. 2000) for open-ended generation costs nothing if the model stops early.
⚠️ Important Limitations
Not Instruction-Tuned: This is a base model. It will continue text, not follow instructions.
No Safety Tuning: No RLHF, no DPO, no content filtering.
Limited Factual Reliability: 151M parameters cannot store reliable world knowledge.
Context Window: Hard limit of 4,096 tokens.
English Only: Mistral BPE is heavily English-biased; other languages will tokenize poorly.
Repetition Without Penalty: Always use `repetition_penalty ≥ 1.05`.
Shared Tensor Warning: You may see `Removed shared tensor {'lm_head.weight'}` on save — this is expected from tied word embeddings and is safe to ignore.
📋 Honest Notices
10 candid first-hand observations about this model
These are candid, first-hand observations about this model.
Dramatically more fluent than Stentor2-30M — the gap is very large. The difference in output quality is not subtle or marginal. Portimbria-150M reads like coherent, natural text at a level that makes Stentor2-30M look like a completely different tier of model. The jump between them is less like "a little better" and more like comparing a toddler learning words to a child speaking full, structured sentences — they're both small, but they're at fundamentally different stages of capability. The 4096-token context window also allows coherent extended passages that the 1024-token Stentor2 models simply cannot sustain.
Standard Mistral BPE means plug-and-play compatibility. No custom tokenizer packages; `AutoTokenizer` just works.
Drifts much less than smaller models, and when it does drift it stays in the neighborhood. Topic coherence is meaningfully better than prior StentorLabs models. When drift does occur, it tends to pull toward semantically adjacent territory rather than going completely off the rails: if you prompt about hiking and the model drifts, it will likely end up somewhere like swimming or biking, not something totally unrelated like space or finance. Cushion-topic drift (to closely related subjects) happens occasionally; completely random topic jumps are rare.
Practically no gibberish under normal conditions. Incoherent token sequences are extremely rare at any reasonable temperature setting. You would need to deliberately run the model thousands of times on confusing or adversarial prompts to reliably reproduce gibberish output. In ordinary use, real English words come out consistently.
Code generation does not work — the model responds to code prompts in English instead. Despite being trained on Python, JavaScript, and TypeScript from StarCoderData, the code corpus (~15% of the training mix, or roughly hundreds of millions of tokens) was far too small relative to the ~4.5B web-text tokens for code generation behavior to emerge. When prompted to write code, the model does not produce code — it produces English text instead, typically on a loosely related topic. Code prompts are not a supported use case for this model.
Math reasoning is present but very weak — reliable arithmetic is absent. The model cannot perform simple addition reliably. However, there is a meaningful difference from code: the model does recognize that math belongs to the domain of numbers, graphs, symbols, and equations. If you prompt it with `1 + 1 =`, it understands that a number should follow. It won't reliably get the right answer, but it knows it's doing math and responds accordingly, which is more than can be said for code (see above). Math-adjacent outputs (graphs, symbols, equation-like structure) appear appropriately in math contexts. Reliable symbolic computation is absent at this scale without instruction tuning or a much larger math token budget.
GQA makes inference meaningfully faster. Two KV heads vs six results in a significantly smaller KV cache, which matters most during long-context generation on memory-limited hardware.
TPU training produces slightly different gradient dynamics than GPU. BF16 on TPU has different rounding behavior than FP16 on GPU. The model was trained natively in BF16 and is provided in FP32 weights (as is standard practice for safetensors saves).
The 4096-token context is real but untested at scale. RoPE with `theta=50,000` was used throughout training at full block size. Position embeddings were exercised continuously, but very long-context generation quality has not been formally benchmarked.
Strong topic grasp — it understands what you're asking about. Even without instruction following, the model has a noticeably good sense of the domain of a prompt. Ask about a dog and a cat, and it will generate something about pets or a closely related subject — not something random like the universe or geopolitics. Earlier StentorLabs models (especially the Stentor2 line) were poor at this; Portimbria-150M handles it well in the vast majority of cases. Short prompts with very little context are the main exception — with nothing to anchor on, outputs will be more random. But with a moderately-sized prompt, the model reliably stays in the right conceptual neighborhood.
Training Infrastructure
Hardware, software stack & throughput details
Hardware
| Component | Specification |
|---|---|
| Accelerator | Google Cloud TPU v5e |
| Chip Configuration | 8-chip pod slice (v5e-8) |
| Active Training Processes | 8 (one per chip via torchrun + PJRT) |
| Global Batch Tokens/Step | 262,144 (8 × 4,096 × 8 processes) |
| Platform | Kaggle Notebooks (free tier) |
| Orchestration | HuggingFace Accelerate + torchrun |
| Process Group Init | env:// (XLA backend) |
Software Stack
| Package | Role |
|---|---|
| PyTorch 2.6 | Core tensor operations |
| torch_xla 2.6 | XLA/TPU backend |
| HuggingFace Transformers | Model architecture (LlamaForCausalLM) |
| HuggingFace Accelerate | Distributed training orchestration |
| HuggingFace Datasets | Data loading and streaming |
| safetensors | Model serialization |
Throughput
| Metric | Value |
|---|---|
| Average global tokens/sec | ~253,000 |
| Per-chip tokens/sec | ~31,600 |
| Total training tokens | ~6,000,000,000 |
| Total wall-clock time (epoch) | 28,871s (~8.02h) |
Training Hyperparameters — Complete Reference
Full hyperparameter tables (optimizer, batch, schedule, checkpointing)
Core Training Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `learning_rate` | 8e-4 | Peak AdamW LR |
| `weight_decay` | 0.01 | Applied to Linear weights only |
| `max_grad_norm` | 1.0 | Gradient clipping |
| `optimizer` | AdamW | betas=(0.9, 0.95), eps=1e-8 |
| `scheduler` | Cosine | With linear warmup |
| `warmup_steps` | 1,144 | 5% of max_train_steps |
| `stable_steps` | 18,311 | 80% of max_train_steps |
| `max_train_steps` | 22,889 | Token budget reached first |
| `token_budget` | 6,000,000,000 | Total training tokens |
| `source_token_budget` | 6,000,000,000 | Source data token cap |
| `seed` | 42 | |
| `mixed_precision` | bf16 | Native TPU BF16 |
Batch & Sequence Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `per_device_train_batch_size` | 8 | Per TPU chip |
| `num_processes` | 8 | One per chip |
| `total_batch_size` | 64 | 8 × 8 |
| `block_size` | 4,096 | Sequence / context length |
| `tokens_per_optimizer_step` | 262,144 | total_batch_size × block_size |
| `gradient_accumulation_steps` | 1 | No accumulation |
| `num_train_epochs` | 1 | Token budget exhausted within epoch 0 |
| `pack` | True | Required for TPU static shapes |
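These batch figures compose directly into the step count listed in the core table. A quick check (the ceiling-rounding at the end is my assumption about how the token budget maps to a whole step count):

```python
import math

per_device_batch = 8
num_processes = 8
block_size = 4096

# Tokens consumed by one optimizer step across all 8 TPU chips
tokens_per_step = per_device_batch * num_processes * block_size
print(tokens_per_step)  # 262144

# Steps needed to exhaust the 6B-token budget
token_budget = 6_000_000_000
steps = math.ceil(token_budget / tokens_per_step)
print(steps)  # 22889 — matches max_train_steps
```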
Evaluation & Checkpointing
| Hyperparameter | Value |
|---|---|
| `eval_steps` | 1,000 |
| `best_eval_steps` | 1,000 |
| `best_eval_start_step` | 1,000 |
| `max_eval_samples` | 5,000 |
AdamW Optimizer — Detailed
- Decay group: All `nn.Linear` weight matrices → `weight_decay = 0.01`
- No-decay group: Bias terms, normalization parameters, embedding parameters → `weight_decay = 0.0`
- Betas: `(0.9, 0.95)`
- Epsilon: `1e-8`
- Fused kernel: Enabled when CUDA available (not applicable on TPU)
Learning Rate Schedule
Phase 1 — Warmup (steps 0–1,144):
LR ramps linearly from 0 → 8e-4
Phase 2 — Cosine Decay (steps 1,144–22,889):
LR decays from 8e-4 → 0 following a cosine curve
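The two phases can be sketched as a single function (a minimal illustration of the schedule described above, not the training script's actual implementation; the decay floor of exactly 0 is taken from the phase description):

```python
import math

PEAK_LR = 8e-4
WARMUP_STEPS = 1_144
MAX_STEPS = 22_889

def lr_at(step):
    # Phase 1: linear warmup from 0 to the peak LR
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Phase 2: cosine decay from the peak down to 0
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

print(f"{lr_at(0):.1e}  {lr_at(1_144):.1e}  {lr_at(22_889):.1e}")
```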
Precision Stability Recipe
FP32 norm patching, critical layer wrapping & recipe summary
Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 150M scale.
1. FP32 Normalization Layers (41 modules)
All RMSNorm modules are monkey-patched to compute in FP32:
```python
def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
    input_dtype = hidden_states.dtype
    output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
    if torch.is_floating_point(output):
        output = output.to(input_dtype)
    return output
```
Count: 20 layers × 2 norms each + 1 final norm = 41 modules total.
2. FP32 Critical Layers (2 layers)
The first and last transformer layers run their entire forward pass in FP32:
- Weights remain in their training dtype; inputs are cast to `.float()` on entry
- `torch.amp.autocast("cuda", enabled=False)` prevents re-downcasting

Rationale: Boundary layers — where embeddings project in and logits project out — are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.
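A minimal sketch of this kind of wrapping (hypothetical: `FP32Wrapper` is not the training script's actual helper, and real decoder layers return tuples rather than a single tensor, which this simplified sketch ignores):

```python
import torch
import torch.nn as nn

class FP32Wrapper(nn.Module):
    """Hypothetical wrapper: runs a submodule's forward entirely in FP32."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, hidden_states, *args, **kwargs):
        input_dtype = hidden_states.dtype
        # Disable autocast so nothing inside gets re-downcast to BF16/FP16
        with torch.autocast("cuda", enabled=False):
            out = self.module(hidden_states.float(), *args, **kwargs)
        # Restore the caller's dtype at the layer boundary
        if torch.is_floating_point(out):
            out = out.to(input_dtype)
        return out

layer = FP32Wrapper(nn.Linear(4, 4))        # stand-in for a decoder layer
y = layer(torch.randn(2, 4, dtype=torch.bfloat16))
print(y.dtype)  # torch.bfloat16 — FP32 internally, caller's dtype outside
```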
3. FP32 Attention Softmax — Skipped
Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.
Recipe Summary
| Technique | Count | Scope |
|---|---|---|
| FP32 norm modules | 41 | All RMSNorm layers |
| FP32 critical layers | 2 | First + last transformer layers |
| FP32 softmax modules | 0 | Skipped — SDPA incompatible |
Data Pipeline
Training data sources, curriculum design & preprocessing details
Training used a web/code/math curriculum with the following source mix:
| Source | Dataset | Ratio |
|---|---|---|
| Web | `epfml/FineWeb-HQ` (CC-MAIN-2024-51) | 75% |
| Code | `bigcode/starcoderdata` (Python, JS, TypeScript) | 15% |
| Math | `HuggingFaceTB/finemath` (finemath-4plus) | 10% |
Total tokens processed: ~6,000,000,000 (single epoch over source data)
Curriculum Design
Training used a curriculum anneal over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.
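A sketch of what such an anneal could look like (illustrative only: the exact annealed ratios and interpolation scheme are not published; only the base mix and the final-15% upweighting are stated above):

```python
BASE_MIX = {"web": 0.75, "code": 0.15, "math": 0.10}
# FINAL_MIX is an illustrative assumption, not the actual annealed ratios
FINAL_MIX = {"web": 0.55, "code": 0.30, "math": 0.15}

def mixture_at(progress, anneal_start=0.85):
    """Source sampling weights at a given fraction of the token budget."""
    if progress < anneal_start:
        return dict(BASE_MIX)
    # Linear interpolation across the final-15% anneal window
    t = (progress - anneal_start) / (1 - anneal_start)
    return {k: (1 - t) * BASE_MIX[k] + t * FINAL_MIX[k] for k in BASE_MIX}

print(mixture_at(0.5))   # base mix: web-heavy generalization phase
print(mixture_at(1.0))   # annealed mix: code/math upweighted
```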
Text Preprocessing
```python
import unicodedata

def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text
```
- NFKC normalization maps visually-equivalent Unicode to canonical form
- Linebreak preservation for code samples (not applicable to web/math)
- Whitespace collapse for web/math text
Sequence Packing
Samples are packed into fixed 4,096-token blocks. Labels are identical to input_ids (causal LM objective). No cross-document attention masking is applied between packed samples — this is standard practice for web-text pretraining.
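A minimal sketch of greedy packing under these rules (illustrative, not the actual training script; `eos_id=2` is an assumption for the Mistral tokenizer's EOS, and the tiny `block_size=4` is just for demonstration):

```python
def pack_sequences(token_streams, block_size=4096, eos_id=2):
    """Concatenate tokenized documents into fixed-size blocks.

    Labels equal input_ids (the causal shift happens inside the model),
    and no attention mask separates packed documents.
    """
    buffer, blocks = [], []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= block_size:
            block = buffer[:block_size]
            blocks.append({"input_ids": block, "labels": list(block)})
            buffer = buffer[block_size:]
    return blocks  # leftover tokens in `buffer` are dropped

docs = [[1, 5, 9], [7, 7], [3, 3, 3, 3]]
packed = pack_sequences(docs, block_size=4)
# Note the second block crosses a document boundary — no masking between docs
print([b["input_ids"] for b in packed])  # [[1, 5, 9, 2], [7, 7, 2, 3], [3, 3, 3, 2]]
```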
Weight Initialization
Initialization scheme & residual scaling code
```python
import math
import torch.nn as nn

def initialize_weights(model, std=0.02, num_hidden_layers=20):
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)  # ≈ 0.00316
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Scaled-down std for output projections (residual path)
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            if module.weight is not None:
                module.weight.data.fill_(1.0)
```
- Residual projections (`o_proj`, `down_proj`) use scaled-down std (0.02 / sqrt(2 × 20) ≈ 0.00316) to prevent residual stream explosion at initialization, following the GPT-2 convention.
- All other Linear layers use `std=0.02`.
- RMSNorm scales start at 1.0 (identity).
Evaluation & Results
Training loss & perplexity curves, family comparison, full checkpoint history
Training Loss Curve
Validation Perplexity Curve
Final result: best validation loss 2.8906 — perplexity 18.00.
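Perplexity here is just the exponential of the mean token-level cross-entropy loss:

```python
import math

best_eval_loss = 2.8906
perplexity = math.exp(best_eval_loss)
print(f"{perplexity:.2f}")  # 18.00
```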
Comparison Across the StentorLabs Family
| Model | Params | Best PPL | Training Tokens | Compute | Notes |
|---|---|---|---|---|---|
| Stentor-12M (v1) | 12.0M | 89.01 | 200M | 2× T4 | v1 baseline |
| Stentor-30M (v1) | 30.4M | 33.02 | 600M | 2× T4 | |
| Stentor2-12M | 12.3M | 26.61 | 480M | 2× T4 | 8K TokenMonster vocab |
| Stentor2-30M | 30.4M | 18.07 | 800M | 2× T4 | 8K TokenMonster vocab |
| Portimbria-150M | 151.0M | 18.00 | ~6B | TPU v5e-8 | 32K Mistral BPE, 4K ctx, GQA |
Comparison note: PPL values are not directly comparable across this family for two reasons: different tokenizers (TokenMonster 8K vs Mistral BPE 32K produce different token spaces) and different training data mixes (all prior StentorLabs models trained on web text only; Portimbria-150M is the first to include code and math). Both factors make a raw PPL number-to-number comparison misleading — Portimbria-150M's real improvement over Stentor2 is larger than the headline numbers suggest.
Full Checkpoint History
| Step | Eval Loss | Perplexity | Notes |
|---|---|---|---|
| 1,000 | 5.3438 | ~209 | First best checkpoint |
| 2,000 | 4.1250 | ~62 | |
| 3,000 | 3.5625 | ~35 | |
| 8,000 | 3.4531 | ~31.6 | |
| 9,000 | 3.3125 | ~27.4 | |
| 10,000 | 3.1875 | ~24.3 | |
| 11,000 | 3.1406 | ~23.1 | |
| 12,000 | 3.0625 | ~21.4 | |
| 13,000 | 3.0312 | ~20.7 | |
| 14,000 | 2.9844 | ~19.8 | |
| 15,000 | 2.9375 | ~18.9 | |
| 17,000 | 2.9062 | ~18.3 | |
| 18,000 | 2.8906 | 18.03 | Best checkpoint saved |
| Final (epoch end) | 2.8906 | 18.00 | Final model |
Benchmark Results
All benchmarks run zero-shot unless otherwise noted.
Portimbria-150M Benchmarks
| Benchmark | Task | Score | Notes |
|---|---|---|---|
| PIQA | Physical commonsense reasoning | 57.62% | 0-shot, acc_norm |
| Winogrande | Pronoun resolution | 52.72% | 0-shot, acc |
| TruthfulQA MC2 | Truthfulness (multiple choice) | 46.94% | 0-shot, acc |
| ARC-Easy | Science QA | 33.80% | 0-shot, acc_norm |
| HellaSwag | Commonsense NLI (completion) | 27.46% | 0-shot, acc_norm |
| OpenBookQA | Elementary science | 24.60% | 0-shot, acc_norm |
| ARC-Challenge | Science QA | 22.53% | 0-shot, acc_norm |
| ARC Average | — | 28.17% | avg of Easy + Challenge |
| CommonsenseQA | Commonsense reasoning | 19.90% | 0-shot, acc |
Comparison against peer models, analysis & evaluation script
Comparison Against Peer Models
The table below compares Portimbria-150M against models of similar scale using publicly available, official or community-verified benchmark numbers. All Portimbria-150M scores are 0-shot. Peer model scores use the shot count shown in parentheses, which varies by source — comparisons are directional, not exact. Scores shown as — were not found in any official or sufficiently authoritative source and are intentionally omitted.
| Model | Params | Tokens | HellaSwag | ARC-Easy | ARC-Challenge | ARC Avg | PIQA | Winogrande | TruthfulQA | OpenBookQA | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Portimbria-150M | 151M | ~6B | 27.46 (0-sh) | 33.80 (0-sh) | 22.53 (0-sh) | 28.17 (0-sh) | 57.62 (0-sh) | 52.72 (0-sh) | 46.94 (0-sh) | 24.60 (0-sh) | lm-eval, this card |
| SmolLM2-135M | 135M | 2T | 42.1 (0-sh) | 48.99 (0-sh) | 38.81 (calc²) | 43.9 (0-sh) | 68.4 (0-sh) | 51.3 (0-sh) | — | 34.6 (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculated² |
| SmolLM-135M | 135M | 600B | 41.2 (0-sh) | 58.84 (0-sh) | 25.96 (calc²) | 42.4 (0-sh) | 68.4 (0-sh) | 51.3 (0-sh) | — | 34.0 (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculated² |
| Pythia-160M | 160M | ~300B | 29.9 (0-sh) | 40.0 (0-sh) | 25.3 (0-sh) | 32.65 (0-sh) | 62.0 (0-sh) | 50.9 (0-sh) | 44.3 (0-sh) | 31.2 (0-sh) | 0-sh scores from public comparison table³; TruthfulQA from HF Open LLM Leaderboard |
| OPT-125M | 125M | 180B | 31.5 (10-sh) | 41.3 (0-sh) | 22.10 (—sh) | 31.70 (calc) | 62.08 (0-sh) | 51.6 (5-sh) | 42.9 (0-sh) | 28.00 (0-sh) | ARC-Easy 0-sh from public comparison table; HellaSwag 10-sh & WinoGrande 5-sh from HF Leaderboard³; ARC-Challenge shot count unconfirmed |
| GPT-Neo 125M | 125M | 300B | 28.67 (0-sh) | 40.7 (0-sh) | 22.87 (—sh) | 31.79 (calc) | 63.06 (0-sh) | 50.43 (0-sh) | 35.70 (—sh) | 26.20 (—sh) | HellaSwag/ARC-Easy/PIQA/WinoGrande: lm-eval, EleutherAI README (0-sh); ARC-Challenge/TruthfulQA/OpenBookQA: public comparison table, shots not stated④ |
| GPT-2 (117M) | 117M | ~40B | 31.64 (—sh) | — | 22.95 (—sh) | — | 62.51 (—sh) | 50.04 (—sh) | 31.73 (—sh) | 27.20 (—sh) | GPT-2 124M public lm-eval row¹; shot counts not stated; no ARC-Easy/Avg available for exact 117M |
| Random Chance | — | — | 25.0 | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | — | 25.0 | Uniform random over answer choices |
Table notes:
¹ GPT-2 (117M) was released in 2019 before these benchmarks became standard. The scores listed are sourced from the closest available public lm-eval row (GPT-2 124M); the lm-eval harness gpt2 shortname defaults to the 117M model, but no complete public 117M row with ARC-Easy was found. Shot counts are not stated in that source. ARC-Easy and ARC Avg are therefore unavailable and left as —.
² SmolLM2 and SmolLM official model cards report only ARC Average (via lighteval); no per-split breakdown is published. ARC-Easy scores come from separate public comparison tables. ARC-Challenge is back-calculated as 2 × ARC-Avg − ARC-Easy and marked (calc²). These derived values are estimates and have not been independently verified against a direct lm-eval run.
³ Pythia-160M scores (HellaSwag, ARC-Easy, ARC-Challenge, PIQA, WinoGrande, OpenBookQA) have been updated to explicitly 0-shot values sourced from a public zero-shot comparison table, superseding the previously listed mixed-shot HF Open LLM Leaderboard entries. TruthfulQA remains from the HF Leaderboard (0-shot). For OPT-125M, ARC-Easy is from the same 0-shot comparison table; HellaSwag (10-shot) and WinoGrande (5-shot) are retained from the HF Leaderboard as no 0-shot replacement was found; ARC-Challenge shot count is unconfirmed in that source.
④ GPT-Neo 125M HellaSwag, ARC-Easy, PIQA, and WinoGrande are 0-shot from the official EleutherAI gpt-neo GitHub README. ARC-Challenge, TruthfulQA, and OpenBookQA are sourced from a public comparison table; shot counts for those three tasks are not stated.
⑤ TruthfulQA MC2 does not have a meaningful random chance baseline — it measures normalized probability mass assigned to all correct completions rather than a standard n-way classification, so no uniform random reference is applicable. Note that Portimbria's ARC-Challenge (22.53) and OpenBookQA (24.60) both fall below the 25.0 random baseline, likely due to acc_norm length normalization penalizing the model's output distributions at this scale.
— = not found in a reliable source, or shot count makes direct comparison inappropriate; omitted rather than estimated.
Analysis
Portimbria-150M was trained on ~6 billion tokens — 2% of the data used for Pythia-160M (300B tokens), 2% of GPT-Neo 125M (300B tokens), 3.3% of OPT-125M (~180B tokens), and a mere 0.3% of SmolLM2-135M (2T tokens).
TruthfulQA is Portimbria's most consistent standout. Across every peer with a TruthfulQA entry, Portimbria leads: 46.94 vs GPT-2's 31.73, vs GPT-Neo's 35.70, vs OPT-125M's 42.9, and vs Pythia-160M's 44.3. That 15-point gap over GPT-2 and an 11-point gap over GPT-Neo are not noise — they suggest that the web+code+math curriculum and the longer 4096-token training context are doing real work at the quality level that TruthfulQA targets. Portimbria wins TruthfulQA against every peer model in this table on 2% of their training data.
Winogrande is the other consistent win. Portimbria (52.72) beats every model in the table: GPT-2 (50.04), GPT-Neo (50.43), OPT-125M (51.6), Pythia-160M (50.9), and both SmolLM models (51.3) — despite all of them having seen vastly more training data.
The honest gaps are real. On HellaSwag, ARC-Easy, ARC-Challenge, PIQA, and OpenBookQA, Pythia-160M, GPT-Neo, OPT-125M, and GPT-2 all score higher. Those gaps are genuine — Portimbria trails Pythia-160M by ~2.5 points on HellaSwag, ~6.2 points on ARC-Easy, and ~6.6 points on OpenBookQA — all explainable by Pythia's 50× token advantage, but still real differences. These are the benchmarks with room to close through fine-tuning or extended pretraining.
Against GPT-2 (124M proxy at unconfirmed shot count), Portimbria competes respectably given the token budget gap: trailing on HellaSwag (27.46 vs 31.64), PIQA (57.62 vs 62.51), and OpenBookQA (24.60 vs 27.20), but winning decisively on TruthfulQA and WinoGrande. ARC-Challenge is a near-tie (22.53 vs 22.95).
SmolLM2-135M is the undisputed leader across every filled benchmark cell. With 333× the training data, its margins are consistent and expected — this is not a comparison Portimbria can win at current training scale. SmolLM-135M (600B tokens) leads on HellaSwag, PIQA, and ARC-Easy as well, with a notable ARC-Easy of 58.84 — though its back-calculated ARC-Challenge (25.96) is actually close to Portimbria's 22.53, and Portimbria leads on WinoGrande (52.72 vs 51.3) and TruthfulQA.
What this model is, beyond the numbers, is an exceptionally data-efficient foundation. Winning TruthfulQA and WinoGrande across the full peer group on 6B tokens — while trailing meaningfully only on commonsense-heavy tasks that reward scale — is precisely what you'd hope to see from a model trained on a high-quality, mixed-domain curriculum. Fine-tuned on a domain-specific corpus or targeted at reasoning tasks, Portimbria-150M has a genuine path to closing the remaining gaps. All of this, built from scratch, for free, on a TPU available to anyone with a Kaggle account.
Evaluation Setup (for Portimbria-150M)
Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below. No API token is required — the model is public. Each benchmark block runs independently so a single failure never stops the rest.
```python
import os, sys, subprocess, json, time, re, threading
from pathlib import Path
from datetime import datetime

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# ── Install deps ──────────────────────────────────────────────────────────────
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers"], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"], check=True)

# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Portimbria-150M"
DTYPE = "float16"
BATCH = "32"
SEED = 42
OUT = "./results"
MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"

BLOCKS = [
    ("block1", "PIQA · OpenBookQA · TruthfulQA",
     "piqa,openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA",
     "winogrande,commonsense_qa", 0, None),
    ("block3", "HellaSwag",
     "hellaswag", 0, None),
    ("block4", "ARC-Easy · ARC-Challenge",
     "arc_easy,arc_challenge", 0, None),
]

LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu",
    "--num_processes=2",
    "--mixed_precision=fp16",
    "-m", "lm_eval",
    "--model", "hf",
    "--model_args", MODEL_ARGS,
    "--batch_size", BATCH,
    "--seed", str(SEED),
]

# ── Helpers ───────────────────────────────────────────────────────────────────
DEBUGGER_NOISE = re.compile(
    r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)"
)

def ts():
    return datetime.now().strftime("%H:%M:%S")

def stream(proc):
    """Mirror the subprocess's stdout/stderr live, filtering debugger noise."""
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = raw.rstrip()
            if line and not DEBUGGER_NOISE.search(line):
                print(f"  [{ts()}] {line}", flush=True)
    t_out = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t_err = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t_out.start()
    t_err.start()
    proc.wait()
    t_out.join()
    t_err.join()

# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)
summary = {}

for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):
    print(f"\n{'='*60}", flush=True)
    print(f"  [{ts()}] BLOCK {i}/{len(BLOCKS)} — {title}", flush=True)
    print(f"{'='*60}\n", flush=True)
    cmd = LAUNCH_BASE + [
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}",
    ]
    if extra:
        cmd += extra
    t0 = time.time()
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            bufsize=1,
        )
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)
        if proc.returncode == 0:
            print(f"\n  ✅ [{ts()}] {title} — done in {elapsed} min\n", flush=True)
            summary[name] = {"status": "ok", "elapsed_min": elapsed}
        else:
            print(f"\n  ❌ [{ts()}] {title} — exit {proc.returncode} ({elapsed} min)\n", flush=True)
            summary[name] = {"status": "failed", "exit_code": proc.returncode, "elapsed_min": elapsed}
    except Exception as exc:
        elapsed = round((time.time() - t0) / 60, 1)
        print(f"\n  ❌ [{ts()}] {title} — {exc}\n", flush=True)
        summary[name] = {"status": "failed", "error": str(exc), "elapsed_min": elapsed}

# ── Final summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v["status"] == "ok")
print(f"\n{'='*60}", flush=True)
print(f"  DONE — {passed}/{len(BLOCKS)} succeeded", flush=True)
print(f"{'='*60}", flush=True)
for name, info in summary.items():
    icon = "✅" if info["status"] == "ok" else "❌"
    mins = info.get("elapsed_min", "—")
    print(f"  {icon} {name:<10} {mins} min", flush=True)

summary_path = f"{OUT}/run_summary.json"
with open(summary_path, "w") as fh:
    json.dump(summary, fh, indent=2)
print(f"\n  Summary → {summary_path}\n", flush=True)

if any(v["status"] == "failed" for v in summary.values()):
    sys.exit(1)
```
Metrics to report per task:
| Task | Metric |
|---|---|
| PIQA | acc_norm |
| OpenBookQA | acc_norm |
| TruthfulQA | mc2 |
| Winogrande | acc |
| CommonsenseQA | acc |
| HellaSwag | acc_norm |
| ARC-Easy | acc_norm |
| ARC-Challenge | acc_norm |
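To turn the harness's JSON output into exactly this table, the per-task metric can be looked up programmatically. A minimal sketch, assuming the lm-eval v0.4 results layout where metrics are keyed as `"<metric>,<filter>"` (e.g. `acc_norm,none`); the `extract_scores`/`load_scores` helpers and the file path convention are our own, not part of the harness:

```python
import json

# Metric to report for each task (mirrors the table above).
METRIC_FOR_TASK = {
    "piqa": "acc_norm",
    "openbookqa": "acc_norm",
    "truthfulqa_mc2": "acc",   # mc2 surfaces as "acc" in recent harness versions
    "winogrande": "acc",
    "commonsense_qa": "acc",
    "hellaswag": "acc_norm",
    "arc_easy": "acc_norm",
    "arc_challenge": "acc_norm",
}

def extract_scores(results: dict) -> dict:
    """Pull the chosen metric for each task present, as a percentage."""
    scores = {}
    for task, metric in METRIC_FOR_TASK.items():
        task_res = results.get("results", {}).get(task)
        if not task_res:
            continue
        # lm-eval v0.4 keys metrics as "<metric>,<filter>", e.g. "acc_norm,none"
        value = task_res.get(f"{metric},none", task_res.get(metric))
        if value is not None:
            scores[task] = round(100 * value, 2)
    return scores

def load_scores(path: str) -> dict:
    """Convenience wrapper for a results file written via --output_path."""
    with open(path) as fh:
        return extract_scores(json.load(fh))
```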
Model Outputs
5 real generation samples with honest good/bad analysis
The following samples illustrate real generation quality. All responses use temp = 0.6, top_p = 0.85, rep_penalty = 1.15.
The first three are the first three responses obtained with no cherry-picking — what you can expect on average. The fourth is a good cherry-picked response (what you can expect when you're lucky). The fifth is a bad cherry-picked response (what you can expect when you're unlucky).
1. Average — "The definition of biology is" (max_tokens = 40, cut off)
"a complex and multifaceted field that encompasses the study, treatment, and management of various biological processes. From the development of genetic engineering to the advancements in medicine and medicine, understanding"
✅ Good: The opening is grammatically smooth and the framing of biology as "complex and multifaceted" is appropriate. Correctly associates it with genetic engineering and medicine — the model clearly has relevant domain associations at the topical level. Given more tokens, this would very likely have continued coherently.
❌ Bad: "treatment, and management of biological processes" frames biology more like a clinical or engineering field than a scientific one — a subtle but real semantic drift. More glaring is "medicine and medicine," a direct word repetition in a single phrase, which is a hallucination/repetition failure.
2. Average — "Some of the most popular universities are" (max_tokens = 40, cut off)
"the University of California, a university in the United States. In addition to being the world's largest university, there are several other universities that offer more than one-day job opportunities. The university"
✅ Good: Names a real, recognizable institution. Stays on-topic throughout, remains grammatically coherent, and correctly anchors UC in the United States.
❌ Bad: UC is not the world's largest university — that's a confident confabulation. "More than one-day job opportunities" is semantically broken; it reads like a garbled web text phrase and doesn't mean anything coherent as a university selling point. The prompt implied a list but the model produced a paragraph about a single university.
3. Average — "When making a business, here are some things to look out for:" (max_tokens = 40, cut off)
"- Your Business is more likely to be successful. It is important to have a strong business plan in place and make sure that you are getting the best possible experience. - You want to know what your"
✅ Good: This is arguably the best of the three average samples. The model correctly interprets the colon as a list-format signal and produces bullet points. "Strong business plan" is genuinely relevant advice. The two bullets are grammatically clean and stay on topic.
❌ Bad: "More likely to be successful" is vacuous without a comparison — successful than what? "Getting the best possible experience" is generic web-text filler that doesn't mean anything actionable. The second bullet ("You want to know what your") is mid-thought, but that's purely the 40-token ceiling — not the model. What is a model issue: the two bullets are redundant in sentiment — both are vague encouragements rather than distinct concrete tips.
4. Good (cherry-picked) — "Tips for running a marathon:" (max_tokens = 350, did not hit limit)
"- Track your progress regularly. This helps you stay on track and ensures that you are not getting any immediate results. - Set realistic goals for yourself. Aim to set specific, measurable goals for each race. - Stay motivated by creating a roadmap for yourself. Aim to keep track of your progress and make necessary adjustments to your plan. By following these tips, you can successfully run a marathon in the United States, ensuring that you have a successful and healthy marathon. Remember to focus on your core competencies, such as speed, agility, and mental toughness, to maximize your chances of success. So go ahead, embrace the challenges, and watch your body run wild with confidence!"
✅ Good: The clearest demonstration of Portimbria's ceiling. The response is well-structured across ~350 tokens, self-contained, and completes naturally without hitting the limit — a sign of genuine coherence rather than just stringing tokens. "Speed, agility, and mental toughness" are legitimately relevant marathon concepts. The motivational closing is tonally appropriate. "Specific, measurable goals" is real coaching language. This is what the model looks like on a good day.
❌ Bad: "ensures that you are not getting any immediate results" is the opposite of what the tip is trying to say — a factual inversion buried in otherwise fluent prose, which is arguably more dangerous than obvious nonsense because it reads naturally. "Run a marathon in the United States" injects unprompted geographic specificity with no reason. "Watch your body run wild with confidence" is awkward phrasing that slightly undercuts the otherwise decent closing. "Core competencies" is corporate jargon that reads oddly in a fitness context. The structural coherence is genuinely impressive; the semantic errors embedded within it are the honest limitation.
5. Bad (cherry-picked) — "What is 1 + 1?" (max_tokens ~350, hit limit)
"A. 20 B. 30 C. 40 D. 50 D. 60 E. 70 Explanation: 1) 1/2 = 1/2 (2) 2) 3/4 = 3/4 4) 4/4 = 4/4 5) 5/4 = 4/4 6) 6/4 = 4/4 7) 7/4 = 4/4 8) 9/4 = 5/4 9) 10/4 = 6/4 11) 10/4 = 6/4 12) 11/4 = 6/4 13) 13/4 = 6/4 14) 13/4 = 6/4 15) 13/4 = 6/4 16) 13/4 = 6/4 17) 14/4 = 6/4 18) 15/4 = 6/4 19) 16/4 = 6/4 20) 21/4 = 6/4 22) 23/4 = 6/4 24) 24/4 = 6/4 25) 26/4 = 6/4 26) 27/4 = 6/4 28) 29/4 = 6/4 29) 21/4 = 6/4 21) 22/4 = 6/4 23) 23/4 = 6/4 24) 24/4 = 6/4 25) 25/4 = 6/4 26) 26/4 = 6/4 27) 27/4 = 6/4 28) 29/4 = 6/4 29) 29/4 = 6/4 20) 29/4 = 6/4 21) 29/4 = 6/4 22) 29/4 = 6"
✅ Good: The model correctly recognizes "What is X?" as potentially a multiple-choice exam format and attempts to produce structured output with labeled options and an "Explanation:" section. That's a real and interesting structural pattern recognition. It also associates the prompt with fractions and arithmetic notation — showing it has some sense of mathematical register.
❌ Bad: Almost everything else. The correct answer is 2, but the lowest option offered is 20. The explanation is a runaway repetition loop — the fraction sequence degenerates into 6/4 = 6/4 repeated indefinitely, which is the clearest example of what happens without adequate repetition penalty on a structurally-patterned output. Letter "D" appears twice in the options list. None of the fractions have any logical connection to 1+1. This is a base model with no instruction tuning and no arithmetic capability — asking it a direct math question with a short, definitive answer is exactly the kind of prompt that exposes those limits. This output also illustrates why repetition_penalty ≥ 1.05 is non-negotiable; without it, pattern-heavy outputs like numbered lists collapse into loops almost immediately.
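The failure mode in sample 5 is also easy to detect mechanically. A small sketch (the function names are ours, not part of any library) that scores word-level n-gram repetition; it can serve as a cheap guard that re-samples with a higher repetition_penalty when generation degenerates into a loop:

```python
from collections import Counter

def repetition_score(text: str, n: int = 4) -> float:
    """Fraction of word-level n-grams that are duplicates: near 0 for normal
    prose, approaching 1 for a degenerate loop like the '6/4 = 6/4' output."""
    words = text.split()
    if len(words) < n + 1:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

def looks_degenerate(text: str, threshold: float = 0.3) -> bool:
    """True if enough n-grams repeat that the sample should be re-drawn."""
    return repetition_score(text) > threshold
```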
Training Dynamics
Step-by-step training phase breakdown & throughput details
The training run processed approximately 6 billion tokens across a single epoch (epoch 0), running for 22,889 optimizer steps before the token budget was exhausted.
Early training (steps 0–1,144, warmup phase): LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. First best checkpoint recorded at step 1,000 (eval loss 5.3438).
Mid training (steps 1,144–18,311, stable cosine phase): Smooth and consistent loss reduction. Gradient norms were well-behaved in the 0.3–0.6 range for most steps, with occasional spikes (notably 3.7 at step 1,800 and 8.5 at step 13,200 — both recovered cleanly). New best checkpoints recorded at steps 2,000 / 3,000 / 8,000 / 9,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000.
Late training (steps 18,311–22,889, cosine decay tail): LR decaying toward zero. Eval loss stopped improving after step 18,000, confirming the best model was saved at that checkpoint.
Throughput: 253,000 global tokens/sec average (31,600 per chip), with a brief XLA warmup window reset at step 300.
Total wall-clock time: ~8.02 hours (epoch training) + ~8 minutes (final eval and save).
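These figures are internally consistent: ~6B tokens over 22,889 optimizer steps implies roughly 2^18 ≈ 262K tokens per optimizer step (e.g. a global batch of 64 sequences at the full 4,096-token context, though that batch layout is an inference from the numbers above, not something stated in this card). A quick arithmetic check; the gap between the ~6.6 h of pure compute and the ~8 h wall clock is accounted for by evaluation, checkpointing, and the XLA warmup window:

```python
TOTAL_TOKENS = 6_000_000_000
STEPS = 22_889
TOKENS_PER_SEC = 253_000  # global throughput across 8 chips

tokens_per_step = TOTAL_TOKENS / STEPS        # ≈ 262,135 ≈ 2**18
hours = TOTAL_TOKENS / TOKENS_PER_SEC / 3600  # ≈ 6.6 h of pure compute
per_chip = TOKENS_PER_SEC / 8                 # ≈ 31,625 tokens/sec/chip
```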
Use Cases & Intended Uses
| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics at 150M scale | ✅ High | Full architecture, hyperparameters, and training curves published |
| Speculative decoding draft model | ✅ High | Fast enough to draft for larger Llama-family targets |
| Benchmarking 4K-context inference latency | ✅ High | Realistic long-context workload |
| Quantization / conversion pipeline testing | ✅ High | Standard architecture, no custom ops |
| Teaching material for LLM courses | ✅ High | Fully documented, reproducible from scratch |
| Edge deployment experiments | ✅ High | ~300MB in FP16 (151M params × 2 bytes); larger than Stentor2 but highly feasible on modern edge hardware |
| Domain-specific fine-tuning research | ✅ High | Standard transformers; fine-tune like any LLaMA model |
| Code completion prototyping | ❌ Not suitable | Code prompts produce English text, not code — see Honest Notices |
| Text continuation / creative writing | ✅ Medium | Good fluency; limited thematic fidelity |
| Factual Q&A | ❌ Not suitable | Unreliable world knowledge at this scale |
| Production deployment | ❌ Not suitable | No safety tuning |
| Non-English text | ❌ Not suitable | Training data is English-heavy |
| Instruction following | ❌ Not suitable | Base model only |
Out-of-Scope Uses
Any user-facing application — No safety filtering, no alignment, no factual reliability.
Medical, legal, or financial advice — Cannot reason reliably over specialized knowledge.
Generating content about real people — Will fabricate.
Automated content pipelines — Output quality is insufficient for unreviewed publication.
Instruction following — This is a base next-token predictor.
Ethical Considerations & Societal Impact
Data biases, safety considerations & societal impact
Inherited Data Biases
Trained on FineWeb-HQ, StarCoderData, and FineMath-4+ — all derived from web-scraped data. The model inherits:
Western-centric perspective — English-language web text skews toward Western viewpoints and cultural contexts.
English monolingualism — Mistral BPE is optimized for English. Other languages will produce high fertility and poor quality.
Demographic underrepresentation — Groups underrepresented in English web text will be underrepresented in outputs.
Code ecosystem bias — StarCoderData covers many programming languages, but this model was deliberately trained only on the Python, JavaScript, and TypeScript subsets. These three were chosen because they are among the most widely used languages in 2026 and are generally more accessible to the majority of developers.
No Safety Tuning
No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.
Positive Aspects
Democratizing AI research — Trained entirely on free Kaggle TPU compute.
Full transparency — Complete training hyperparameters, architecture, and logs published.
Minimal environmental footprint — ~8 hours of TPU compute is negligible versus large-scale pretraining runs.
Inference Guide
CPU inference (INT8) & GPU inference (FP16) code
CPU Inference (INT8 Dynamic Quantization)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("StentorLabs/Portimbria-150M")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

# Dynamically quantize all Linear layers to INT8 for CPU
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)

inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
GPU Inference (FP16)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

print(generate("Once upon a time in a distant kingdom"))
```
🚀 Free Inference — Try It Now
No GPU, no setup, no API key required.
StentorLabs hosts a free demo space for all Stentor models:
🔗 https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space
Quantization
FP16, BF16 & 4-bit (bitsandbytes) quantization code
FP16 (GPU)
```python
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
```
BF16
```python
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.bfloat16,
)
```
4-bit (bitsandbytes)
Install the extras first:

```bash
pip install bitsandbytes accelerate
```

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    quantization_config=bnb_config,
    device_map="auto",
)
```
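When choosing among these precisions, the weight-only footprint is simple arithmetic: parameter count × bytes per weight (activations, KV cache, and quantizer metadata such as bitsandbytes scales add on top). A back-of-envelope sketch using the ~151M parameter count from this card:

```python
PARAMS = 151_000_000

def weight_footprint_mb(params: int, bits_per_weight: float) -> float:
    """Approximate weight-only memory in MB (ignores activations, KV cache,
    and quantization metadata such as scales/zero-points)."""
    return params * bits_per_weight / 8 / 1e6

fp32 = weight_footprint_mb(PARAMS, 32)  # ≈ 604 MB
fp16 = weight_footprint_mb(PARAMS, 16)  # ≈ 302 MB
int4 = weight_footprint_mb(PARAMS, 4)   # ≈ 75.5 MB plus quantizer overhead
```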
🌍 Community Contributions — Build on This Model
Portimbria-150M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments — that's where you come in. This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed.
Here are things StentorLabs actively encourages the community to try:
Fine-tune it on your domain — instruction tuning, domain adaptation, RLHF, DPO, anything goes
Quantize it — 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged
Convert it to other formats — GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon
Run LoRA or QLoRA to adapt it cheaply on consumer hardware
Use it for speculative decoding with a larger Llama-family target
Benchmark it formally and share results
Publish your work — fine-tunes, quantized versions, adapters, research findings, derivative models, anything
If you build something with Portimbria-150M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.
LoRA / QLoRA Starter Configuration
Starter config, recommended hyperparameters & QLoRA note
If you haven't fine-tuned a Llama-family model before, here's a proven starting point for Portimbria-150M:
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # LoRA rank — try 32 if underfitting
    lora_alpha=32,  # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~3.1M || all params: ~154M || trainable%: ~2.0%
```
Recommended fine-tuning hyperparameters:
| Hyperparameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-4 | Scale down to 1e-4 for very small datasets |
| Optimizer | AdamW | betas=(0.9, 0.999), eps=1e-8 |
| LR scheduler | Cosine with warmup | ~5% warmup steps |
| Batch size | 4–16 | Per device; use gradient accumulation if memory-limited |
| Epochs | 2–5 | Watch for overfitting after epoch 2 |
| Max sequence length | 512–2048 | Up to 4096 is supported |
For QLoRA (4-bit quantized base + LoRA adapters on top), add BitsAndBytesConfig(load_in_4bit=True) when loading the base model — the LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.
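The printed trainable-parameter count can be sanity-checked by hand: each adapted linear layer of shape (fan_in, fan_out) contributes r × (fan_in + fan_out) adapter parameters. A generic sketch; the layer shapes in the example are illustrative placeholders, not Portimbria's actual dimensions:

```python
def lora_params(layer_shapes, r):
    """Total LoRA adapter parameters: r * (fan_in + fan_out) per adapted layer
    (an r x fan_in matrix A plus a fan_out x r matrix B)."""
    return sum(r * (fan_in + fan_out) for fan_in, fan_out in layer_shapes)

# Hypothetical example: 12 layers x 4 adapted projections, each (768, 768)
shapes = [(768, 768)] * (12 * 4)
print(lora_params(shapes, r=16))  # 1,179,648 with these made-up dims
```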
Format Conversion
Convert to GGUF (llama.cpp) & ONNX
Convert to GGUF (llama.cpp)
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
huggingface-cli download StentorLabs/Portimbria-150M --local-dir portimbria-150m
python convert_hf_to_gguf.py portimbria-150m/ \
    --outfile portimbria-150m.gguf \
    --outtype f16
./llama-quantize portimbria-150m.gguf portimbria-150m-q4_k_m.gguf q4_k_m
./llama-cli -m portimbria-150m-q4_k_m.gguf -p "The history of computing" -n 100
```
Convert to ONNX
```bash
pip install optimum[exporters]
optimum-cli export onnx \
    --model StentorLabs/Portimbria-150M \
    --task text-generation-with-past \
    portimbria-150m-onnx/
```
Speculative Decoding
Portimbria-150M can serve as a fast draft model to accelerate inference from larger target models. Because it uses the standard Mistral 32K BPE vocabulary, acceptance rates with vocabulary-compatible targets should be substantially higher than with Stentor2 models (which use a custom 8K tokenizer).
Speculative decoding code & vocabulary compatibility notes
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
).to("cuda")
draft_tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto",
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = target_tokenizer("Explain the concept of recursion:", return_tensors="pt").to("cuda")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    # The two tokenizer kwargs below are needed when the draft and target
    # vocabularies differ (universal assisted decoding, transformers >= 4.46);
    # omit them for a vocabulary-identical pair.
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,
    do_sample=True,
    max_new_tokens=200,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Vocabulary compatibility: Portimbria-150M uses the Mistral-7B tokenizer (32K BPE). This is not identical to the Llama tokenizers: Llama 2 uses a different 32K SentencePiece vocabulary with different token merges, and Llama 3 uses a much larger ~128K vocabulary. It is directly compatible with models that use the same Mistral BPE vocabulary (e.g. `mistralai/Mistral-7B-v0.1` and derivatives). Vocabulary-compatible speculative decoding will yield higher acceptance rates; vocabulary-mismatched pairs will still work via HuggingFace's assisted generation but with lower acceptance rates.
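A draft/target pair's compatibility can be checked directly by comparing tokenizer vocabularies. A minimal sketch of that comparison (the function is our own, not a library API); with real models, pass each side's `tokenizer.get_vocab()`:

```python
def vocab_overlap(draft_vocab: dict, target_vocab: dict) -> float:
    """Fraction of draft tokens that map to the SAME id in the target vocab.
    ~1.0 means draft logits line up with the target's; lower means HF assisted
    generation must re-tokenize between the two models."""
    if not draft_vocab:
        return 0.0
    same = sum(1 for tok, idx in draft_vocab.items() if target_vocab.get(tok) == idx)
    return same / len(draft_vocab)
```

For two identical Mistral-BPE tokenizers this returns 1.0; for a Mistral/Llama-3 pair it will be far lower, which signals that assisted generation has to bridge the vocabularies at a cost in acceptance rate.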
Bias, Risks & Limitations
Factual Accuracy: All factual outputs should be treated as unreliable without verification.
Context Boundary: Hard limit of 4,096 tokens.
English Bias: Training data is English-dominant.
Training Data Bias: Inherits biases in FineWeb-HQ, StarCoderData, and FineMath-4+.
Hallucination: Will produce confident but fabricated content.
No Alignment: No RLHF, DPO, or constitutional training.
Code Generation: Code prompts produce English text output rather than functional code. The model does not generate syntactically or logically valid code in response to code-related prompts.
Shared Tensor Warning: `Removed shared tensor {'lm_head.weight'}` during saving is expected. Safe to ignore.
Gradient Spikes: Two isolated gradient norm spikes occurred during training (step 1,800: 3.72; step 13,200: 8.56). Both recovered cleanly in subsequent steps with no apparent impact on the loss trajectory.
Related Work
Comparable sub-200M models & related research papers
Comparable Sub-200M Base Models
| Model | Parameters | Vocab | Context | Notes |
|---|---|---|---|---|
| Portimbria-150M (this model) | 151M | 32K BPE | 4,096 | Trained on 6B tokens, TPU v5e-8 |
| Stentor2-30M | 30.4M | 8K TokenMonster | 1,024 | StentorLabs family |
| Pythia-160M | 160M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| GPT-2 (124M) | 124M | 50K BPE | 1,024 | OpenAI; 40GB WebText |
| OPT-125M | 125M | 50K BPE | 2,048 | Meta; 180B tokens |
| TinyLlama-1.1B | 1,100M | 32K BPE | 2,048 | 3T tokens; different scale tier |
Related Research Papers
| Paper | Relevance |
|---|---|
| Scaling Laws — Kaplan et al., 2020 | Informs token budget decisions |
| Chinchilla — Hoffmann et al., 2022 | 6B tokens for 150M params is ~40 tokens/param, roughly 2× the Chinchilla-optimal ~20 tokens/param |
| GQA — Ainslie et al., 2023 | Grouped Query Attention used in this model |
| RoPE — Su et al., 2021 | Positional encoding |
| LLaMA — Touvron et al., 2023 | Architecture basis |
| Pythia — Biderman et al., 2023 | Comparable small-model scaling study |
| Speculative Decoding — Leviathan et al., 2023 | Primary deployment use case |
Environmental Impact
Hardware, duration & estimated carbon
| Factor | Value |
|---|---|
| Hardware | Google Cloud TPU v5e-8 |
| Active Training Duration | ~8.02 hours |
| Cloud Provider | Google (via Kaggle free tier) |
| Compute Region | United States |
| Estimated Carbon | Minimal (< 1.0 kg CO₂e estimated) |
The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.
Citation
BibTeX
```bibtex
@misc{izumoto2026portimbria150m,
  title        = {Portimbria-150M},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Portimbria-150M}},
  note         = {151M parameter LlamaForCausalLM base model with GQA trained from scratch
                  on ~6B tokens (FineWeb-HQ, StarCoderData, FineMath-4+) using a
                  Google Cloud TPU v5e-8 on Kaggle free compute. 4096-token context,
                  32K Mistral BPE vocabulary. Apache 2.0 license.}
}
```
Model Card Contact
Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.
Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models — trained on free compute, shared with everyone.
Evaluation results (self-reported)
- Best validation loss on FineWeb-HQ (validation split): 2.891
- Best perplexity on FineWeb-HQ (validation split): 18.0
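The two self-reported figures are mutually consistent, since perplexity is the exponential of the cross-entropy validation loss:

```python
import math

# Perplexity = exp(validation loss)
ppl = math.exp(2.891)  # ≈ 18.0, matching the reported value
```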