🔬 Research Artifact & Base Language Model. Portimbria-150M is a next-token predictor — not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine — and please publish your results! See Intended Uses for details.
💡 Built by a solo 14-year-old developer, on a laptop, for $0. Every model StentorLabs has released — including this one — was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Portimbria-150M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.
What Is This?
Portimbria-150M is the first 150M-parameter model from StentorLabs and the inaugural entry in the Portimbria model family — a new scaling tier above the Stentor2 line. The name is a deliberate rearrangement of Portia fimbriata, a jumping spider famous for being extraordinarily intelligent relative to its tiny body size. That tension — compact but capable — is the design philosophy of this model family.
At ~151M parameters, Portimbria-150M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained on approximately 6 billion tokens drawn from a web/code/math curriculum, with a 4096-token context window — the longest in the StentorLabs model lineup to date.
Like all StentorLabs models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.
The key architectural differentiators from Stentor2-12M are: a ~12× parameter scale-up (12.3M → 151M), a 4× longer context window (4096 vs 1024 tokens), Grouped Query Attention (6 query heads, 2 KV heads), and a standard Mistral BPE vocabulary (32,768 tokens) rather than a compact custom tokenizer. This enables full compatibility with the standard AutoTokenizer ecosystem.
GQA training stability is worth noting: Stentor2-12M-Preview experienced minor training instability when GQA was first introduced, largely because at 12M parameters the model simply wasn't large enough to absorb the optimization pressure smoothly. At 151M parameters — more than 12 times larger — Portimbria-150M handled GQA training without issue. The benefits (smaller KV cache, faster inference, no quality loss) clearly outweigh the minor challenge that existed only at the 12M scale.
The Portimbria Name
Why "Portimbria"?
Portia fimbriata is a species of jumping spider native to Queensland, Australia. It is considered one of the most cognitively sophisticated spiders ever studied — capable of problem-solving, planning, and learned behavior — yet it fits comfortably on a fingertip. The word "Portimbria" is a scrambled encoding of the species name, chosen to reflect the same principle: a model small enough to train for free, yet ambitious enough to compete meaningfully with models trained at far greater cost.
Model Architecture
Portimbria-150M is a LlamaForCausalLM model with Grouped Query Attention (GQA), a 32,768-token Mistral BPE vocabulary, and a 4096-token context window.
| Component | Value | Notes |
|---|---|---|
| Architecture | LlamaForCausalLM | Standard transformer decoder |
| Hidden Size | 768 | |
| Intermediate Size (FFN) | 2,048 | SwiGLU activation |
| Num Hidden Layers | 20 | |
| Num Attention Heads | 6 | |
| Num Key/Value Heads | 2 | GQA — 3:1 query-to-KV ratio |
| Context Length | 4,096 tokens | |
| Vocab Size | 32,768 | Mistral BPE |
| Total Parameters | 151,026,432 | |
| Positional Encoding | RoPE | rope_theta = 50,000.0 |
Full architecture spec, GQA explanation & parameter count breakdown
Full Core Configuration
| Component | Value | Notes |
|---|---|---|
| Architecture | LlamaForCausalLM | Standard transformer decoder |
| Hidden Size | 768 | |
| Intermediate Size (FFN) | 2,048 | Hidden × 2.67 (SwiGLU with 3 matrices) |
| Num Hidden Layers | 20 | |
| Num Attention Heads | 6 | |
| Num Key/Value Heads | 2 | GQA — 3:1 query-to-KV ratio |
| Head Dimension | 128 | 768 ÷ 6 — TPU v5e optimal |
| KV Dimension | 256 | 768 × (2/6) |
| Vocab Size | 32,768 | Mistral BPE, padded to multiple of 128 |
| Max Position Embeddings | 4,096 | block_size in training script |
| Hidden Activation | SiLU | LlamaForCausalLM default |
| Positional Encoding | RoPE | rope_theta = 50,000.0 |
| RMS Norm Epsilon | 1e-5 | |
| Tie Word Embeddings | True | Shared embedding / LM head |
| Attention Bias | False | |
| MLP Bias | False | |
| Attention Implementation | SDPA | PyTorch Scaled Dot Product Attention |
Why GQA?
Grouped Query Attention (6Q, 2KV) reduces the KV cache memory footprint by 67% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to 2 × num_kv_heads × head_dim × num_layers × seq_len. With 2 KV heads instead of 6, the cache shrinks to one-third of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.
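The cache arithmetic is easy to check. A small sketch (dimensions from the architecture table; FP16/BF16 assumed at 2 bytes per element) compares this GQA configuration against a hypothetical full-MHA variant with 6 KV heads:

```python
def kv_cache_bytes(num_kv_heads, head_dim=128, num_layers=20,
                   seq_len=4096, bytes_per_element=2):
    # Leading 2 accounts for the separate K and V tensors per layer
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_element

gqa = kv_cache_bytes(num_kv_heads=2)   # Portimbria-150M (GQA)
mha = kv_cache_bytes(num_kv_heads=6)   # hypothetical full-MHA variant

print(f"GQA cache: {gqa / 2**20:.0f} MiB")   # 80 MiB in BF16/FP16
print(f"MHA cache: {mha / 2**20:.0f} MiB")   # 240 MiB
print(f"Reduction: {1 - gqa / mha:.0%}")     # 67%
```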
Parameter Count Breakdown
```python
def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down
    norm = 2 * hidden_size  # input + post-attention RMSNorm
    total = vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total
```
Plugging in Portimbria-150M values:
```
kv_dim     = 768 × (2/6) = 256
q_proj     = 768 × 768 = 589,824
k_proj     = 768 × 256 = 196,608
v_proj     = 768 × 256 = 196,608
o_proj     = 768 × 768 = 589,824
attn/layer = 1,572,864
mlp/layer  = 3 × 768 × 2,048 = 4,718,592
norm/layer = 2 × 768 = 1,536
per_layer  = 6,292,992

embedding  = 32,768 × 768 = 25,165,824
layers     = 20 × 6,292,992 = 125,859,840
final_norm = 768
total      = 25,165,824 + 125,859,840 + 768 = 151,026,432 ✓
```
| Component | Parameters | % of Total |
|---|---|---|
| Embedding Table (tied with LM Head) | 25,165,824 | 16.7% |
| Transformer Layers × 20 | 125,859,840 | 83.3% |
| — Attention (per layer × 20) | 31,457,280 | 20.8% |
| — FFN/MLP (per layer × 20) | 94,371,840 | 62.5% |
| — Layer Norms (per layer × 20) | 30,720 | 0.02% |
| Final RMS Norm | 768 | 0.001% |
| Total | 151,026,432 | 100% |
With a standard 32K vocabulary, embedding takes only 16.7% of the parameter budget — leaving 83.3% for the transformer stack that actually learns language patterns. This represents a healthy allocation at this scale, especially with GQA dramatically cutting the KV head count without sacrificing hidden dimension depth.
Head-to-Head: StentorLabs Model Family
Comparison table vs Stentor2-12M and Stentor2-30M
| Property | Stentor2-12M | Stentor2-30M | Portimbria-150M |
|---|---|---|---|
| Vocabulary | 8,064 (TokenMonster) | 8,064 (TokenMonster) | 32,768 (Mistral BPE) |
| Hidden Size | 256 | 512 | 768 |
| Intermediate Size | 512 | 1,024 | 2,048 |
| Num Layers | 12 | 10 | 20 |
| Attention Heads | 4 | 8 | 6 |
| KV Heads | 4 (MHA) | 8 (MHA) | 2 (GQA) |
| Head Dimension | 64 | 64 | 128 |
| Context Length | 1,024 | 1,024 | 4,096 |
| Total Parameters | 12.3M | 30.4M | 151.0M |
| Embedding Share | 16.8% | 13.6% | 16.7% |
| Training Tokens | 480M | 800M | ~6B |
| Training Hardware | 2× T4 | 2× T4 | TPU v5e-8 |
| Training Time | ~5h | ~6.75h | ~8h |
| Best Perplexity | 26.61 | 18.07 | 18.00 |
| Tokenizer | TokenMonster | TokenMonster | Mistral BPE |
Cross-family comparison caveat: PPL values are not directly comparable across families for two compounding reasons. First, Stentor2 models use TokenMonster (8K vocab) while Portimbria-150M uses Mistral BPE (32K vocab) — different tokenizers produce different token spaces and therefore different raw perplexity scales. Second, and more importantly, the Stentor1 family was trained exclusively on Cosmopedia + FineWeb-Edu, and the Stentor2 family on StenCore-PDF + FineWeb-HQ — both purely web/document text with zero code or math. Portimbria-150M is the first StentorLabs model trained on a web + code + math curriculum (FineWeb-HQ 75%, StarCoderData 15%, FineMath-4+ 10%). The harder, more structured distributions of code and math raise the effective loss target, meaning a direct PPL comparison against any prior StentorLabs model significantly understates Portimbria-150M's real capability improvement.
Memory Requirements
How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at full 4096-token context — KV cache scales linearly, so at 1024 tokens it's roughly ¼ of the values shown.
| Precision | Weights | KV Cache (4096 ctx) | Total VRAM |
|---|---|---|---|
| FP32 | ~604 MB | ~160 MB | ~764 MB |
| FP16 / BF16 | ~302 MB | ~80 MB | ~382 MB |
| INT8 | ~151 MB | ~80 MB | ~231 MB |
| INT4 | ~76 MB | ~80 MB | ~156 MB |
KV cache note: GQA (2 KV heads) already reduces the KV cache by 67% vs standard MHA at the same hidden size — the figures above reflect this. Formula:
2 (K+V) × 2 (KV heads) × 128 (head_dim) × 20 (layers) × seq_len × bytes_per_element.
Weights note: Weights are saved as FP32 in safetensors. Cast on load with `torch_dtype=torch.float16` or `torch_dtype=torch.bfloat16` to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the Quantization section.
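The weight figures in the table follow directly from the parameter count. A minimal sketch (decimal megabytes, matching the table's convention; INT4 at 0.5 bytes/param is an idealized packing assumption):

```python
PARAMS = 151_026_432  # total parameter count from the architecture section

bytes_per_param = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    mb = PARAMS * nbytes / 1e6  # decimal megabytes, as in the table above
    print(f"{precision:>9}: ~{mb:.0f} MB weights")
```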
🚀 Quick Start
1. Install Dependencies
```
pip install transformers torch safetensors
```
2. Load the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")
model = model.eval()
```
3. Generate Text
```python
prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
ℹ️ No custom tokenizer required. Portimbria-150M uses the Mistral BPE tokenizer via `AutoTokenizer`. No additional packages needed beyond `transformers`.
Pipeline usage & recommended generation settings
4. Using the Pipeline
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="auto",
)
result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
print(result[0]["generated_text"])
```
5. Recommended Generation Settings
| Parameter | Recommended Range | Notes |
|---|---|---|
| `temperature` | 0.5 – 0.8 | Lower values (0.5–0.6) give more coherent, on-topic output; higher values (0.7–0.8) give more variety. Stay below 1.0. |
| `top_p` | 0.85 – 0.90 | This range prevents gibberish and completely random tokens without over-restricting word choice. |
| `repetition_penalty` | 1.05 – 1.2 | Stops looping and over-repetition while keeping outputs high quality. The sweet spot is 1.1. |
| `max_new_tokens` | 40 – 4096 | Depends entirely on your goal. For a quick definition or fact, 40–60 is enough. For a story or long document, use 2000–4096. |
Temperature guidance: Lower temperature keeps the model closer to its learned distribution and more likely to stay on topic. Higher temperature increases creativity and diversity at the cost of some coherence.
max_new_tokens guidance: Don't set this too low for creative tasks — the model often generates an EOS token and stops on its own before hitting the ceiling anyway. Setting a generous ceiling (e.g. 2000) for open-ended generation costs nothing if the model stops early.
⚠️ Important Limitations
Not Instruction-Tuned: This is a base model. It will continue text, not follow instructions.
No Safety Tuning: No RLHF, no DPO, no content filtering.
Limited Factual Reliability: 151M parameters cannot store reliable world knowledge.
Context Window: Hard limit of 4,096 tokens.
English Only: Mistral BPE is heavily English-biased; other languages will tokenize poorly.
Repetition Without Penalty: Always use `repetition_penalty ≥ 1.05`.
Shared Tensor Warning: You may see `Removed shared tensor {'lm_head.weight'}` on save — this is expected from tied word embeddings and is safe to ignore.
📋 Honest Notices
10 candid first-hand observations about this model
These are candid, first-hand observations about this model.
Dramatically more fluent than Stentor2-30M — the gap is very large. The difference in output quality is not subtle or marginal. Portimbria-150M reads like coherent, natural text at a level that makes Stentor2-30M look like a completely different tier of model. The jump between them is less like "a little better" and more like comparing a toddler learning words to a child speaking full, structured sentences — they're both small, but they're at fundamentally different stages of capability. The 4096-token context window also allows coherent extended passages that the 1024-token Stentor2 models simply cannot sustain.
Standard Mistral BPE means plug-and-play compatibility. No custom tokenizer packages; `AutoTokenizer` just works.
Drifts much less than smaller models, and when it does drift it stays in the neighborhood. Topic coherence is meaningfully better than prior StentorLabs models. When drift does occur, it tends to pull toward semantically adjacent territory rather than going completely off the rails: if you prompt about hiking and the model drifts, it will likely end up somewhere like swimming or biking, not something totally unrelated like space or finance. Cushion-topic drift (to closely related subjects) happens occasionally; completely random topic jumps are rare.
Practically no gibberish under normal conditions. Incoherent token sequences are extremely rare at any reasonable temperature setting. You would need to deliberately run the model thousands of times on confusing or adversarial prompts to reliably reproduce gibberish output. In ordinary use, real English words come out consistently.
Code generation does not work — the model responds to code prompts in English instead. Despite being trained on Python, JavaScript, and TypeScript from StarCoderData, the code corpus (~15% of the training mix, or roughly hundreds of millions of tokens) was far too small relative to the ~4.5B web-text tokens for code generation behavior to emerge. When prompted to write code, the model does not produce code — it produces English text instead, typically on a loosely related topic. Code prompts are not a supported use case for this model.
Math reasoning is present but very weak — reliable arithmetic is absent. The model cannot perform simple addition reliably. However, there is a meaningful difference from code: the model does recognize that math belongs to the domain of numbers, graphs, symbols, and equations. If you prompt it with `1 + 1 =`, it understands that a number should follow. It won't reliably get the right answer, but it knows it's doing math and responds accordingly, which is more than can be said for code (see above). Math-adjacent outputs (graphs, symbols, equation-like structure) appear appropriately in math contexts. Reliable symbolic computation is absent at this scale without instruction tuning or a much larger math token budget.
GQA makes inference meaningfully faster. Two KV heads vs six results in a significantly smaller KV cache, which matters most during long-context generation on memory-limited hardware.
TPU training produces slightly different gradient dynamics than GPU. BF16 on TPU has different rounding behavior than FP16 on GPU. The model was trained natively in BF16 and is provided in FP32 weights (as is standard practice for safetensors saves).
The 4096-token context is real but untested at scale. RoPE with `theta=50,000` was used throughout training at full block size. Position embeddings were exercised continuously, but very long-context generation quality has not been formally benchmarked.
Strong topic grasp — it understands what you're asking about. Even without instruction following, the model has a noticeably good sense of the domain of a prompt. Ask about a dog and a cat, and it will generate something about pets or a closely related subject — not something random like the universe or geopolitics. Earlier StentorLabs models (especially the Stentor2 line) were poor at this; Portimbria-150M handles it well in the vast majority of cases. Short prompts with very little context are the main exception — with nothing to anchor on, outputs will be more random. But with a moderately-sized prompt, the model reliably stays in the right conceptual neighborhood.
Training Infrastructure
Hardware, software stack & throughput details
Hardware
| Component | Specification |
|---|---|
| Accelerator | Google Cloud TPU v5e |
| Chip Configuration | 8-chip pod slice (v5e-8) |
| Active Training Processes | 8 (one per chip via torchrun + PJRT) |
| Global Batch Tokens/Step | 262,144 (8 × 4,096 × 8 processes) |
| Platform | Kaggle Notebooks (free tier) |
| Orchestration | HuggingFace Accelerate + torchrun |
| Process Group Init | env:// (XLA backend) |
Software Stack
| Package | Role |
|---|---|
| PyTorch 2.6 | Core tensor operations |
| torch_xla 2.6 | XLA/TPU backend |
| HuggingFace Transformers | Model architecture (LlamaForCausalLM) |
| HuggingFace Accelerate | Distributed training orchestration |
| HuggingFace Datasets | Data loading and streaming |
| safetensors | Model serialization |
Throughput
| Metric | Value |
|---|---|
| Average global tokens/sec | ~253,000 |
| Per-chip tokens/sec | ~31,600 |
| Total training tokens | ~6,000,000,000 |
| Total wall-clock time (epoch) | 28,871s (~8.02h) |
Training Hyperparameters — Complete Reference
Full hyperparameter tables (optimizer, batch, schedule, checkpointing)
Core Training Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `learning_rate` | 8e-4 | Peak AdamW LR |
| `weight_decay` | 0.01 | Applied to Linear weights only |
| `max_grad_norm` | 1.0 | Gradient clipping |
| `optimizer` | AdamW | betas=(0.9, 0.95), eps=1e-8 |
| `scheduler` | Cosine | With linear warmup |
| `warmup_steps` | 1,144 | 5% of max_train_steps |
| `stable_steps` | 18,311 | 80% of max_train_steps |
| `max_train_steps` | 22,889 | Token budget reached first |
| `token_budget` | 6,000,000,000 | Total training tokens |
| `source_token_budget` | 6,000,000,000 | Source data token cap |
| `seed` | 42 | |
| `mixed_precision` | bf16 | Native TPU BF16 |
Batch & Sequence Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `per_device_train_batch_size` | 8 | Per TPU chip |
| `num_processes` | 8 | One per chip |
| `total_batch_size` | 64 | 8 × 8 |
| `block_size` | 4,096 | Sequence / context length |
| `tokens_per_optimizer_step` | 262,144 | total_batch_size × block_size |
| `gradient_accumulation_steps` | 1 | No accumulation |
| `num_train_epochs` | 1 | Token budget exhausted within epoch 0 |
| `pack` | True | Required for TPU static shapes |
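These batch figures compose directly into the step count listed in the core table. A quick check (the ceiling-rounding at the end is my assumption about how the token budget maps to a whole step count):

```python
import math

per_device_batch = 8
num_processes = 8
block_size = 4096

# Tokens consumed by one optimizer step across all 8 TPU chips
tokens_per_step = per_device_batch * num_processes * block_size
print(tokens_per_step)  # 262144

# Steps needed to exhaust the 6B-token budget
token_budget = 6_000_000_000
steps = math.ceil(token_budget / tokens_per_step)
print(steps)  # 22889 — matches max_train_steps
```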
Evaluation & Checkpointing
| Hyperparameter | Value |
|---|---|
| `eval_steps` | 1,000 |
| `best_eval_steps` | 1,000 |
| `best_eval_start_step` | 1,000 |
| `max_eval_samples` | 5,000 |
AdamW Optimizer — Detailed
- Decay group: All `nn.Linear` weight matrices → `weight_decay = 0.01`
- No-decay group: Bias terms, normalization parameters, embedding parameters → `weight_decay = 0.0`
- Betas: `(0.9, 0.95)`
- Epsilon: `1e-8`
- Fused kernel: Enabled when CUDA available (not applicable on TPU)
Learning Rate Schedule
Phase 1 — Warmup (steps 0–1,144):
LR ramps linearly from 0 → 8e-4
Phase 2 — Cosine Decay (steps 1,144–22,889):
LR decays from 8e-4 → 0 following a cosine curve
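The two phases can be sketched as a single function (a minimal illustration of the schedule described above, not the training script's actual implementation; the decay floor of exactly 0 is taken from the phase description):

```python
import math

PEAK_LR = 8e-4
WARMUP_STEPS = 1_144
MAX_STEPS = 22_889

def lr_at(step):
    # Phase 1: linear warmup from 0 to the peak LR
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Phase 2: cosine decay from the peak down to 0
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

print(f"{lr_at(0):.1e}  {lr_at(1_144):.1e}  {lr_at(22_889):.1e}")
```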
Precision Stability Recipe
FP32 norm patching, critical layer wrapping & recipe summary
Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 150M scale.
1. FP32 Normalization Layers (41 modules)
All RMSNorm modules are monkey-patched to compute in FP32:
```python
def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
    input_dtype = hidden_states.dtype
    output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
    if torch.is_floating_point(output):
        output = output.to(input_dtype)
    return output
```
Count: 20 layers × 2 norms each + 1 final norm = 41 modules total.
2. FP32 Critical Layers (2 layers)
The first and last transformer layers run their entire forward pass in FP32:
- Weights remain in their training dtype; inputs are cast to `.float()` on entry
- `torch.amp.autocast("cuda", enabled=False)` prevents re-downcasting

Rationale: Boundary layers — where embeddings project in and logits project out — are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.
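A minimal sketch of this kind of wrapping (hypothetical: `FP32Wrapper` is not the training script's actual helper, and real decoder layers return tuples rather than a single tensor, which this simplified sketch ignores):

```python
import torch
import torch.nn as nn

class FP32Wrapper(nn.Module):
    """Hypothetical wrapper: runs a submodule's forward entirely in FP32."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, hidden_states, *args, **kwargs):
        input_dtype = hidden_states.dtype
        # Disable autocast so nothing inside gets re-downcast to BF16/FP16
        with torch.autocast("cuda", enabled=False):
            out = self.module(hidden_states.float(), *args, **kwargs)
        # Restore the caller's dtype at the layer boundary
        if torch.is_floating_point(out):
            out = out.to(input_dtype)
        return out

layer = FP32Wrapper(nn.Linear(4, 4))        # stand-in for a decoder layer
y = layer(torch.randn(2, 4, dtype=torch.bfloat16))
print(y.dtype)  # torch.bfloat16 — FP32 internally, caller's dtype outside
```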
3. FP32 Attention Softmax — Skipped
Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.
Recipe Summary
| Technique | Count | Scope |
|---|---|---|
| FP32 norm modules | 41 | All RMSNorm layers |
| FP32 critical layers | 2 | First + last transformer layers |
| FP32 softmax modules | 0 | Skipped — SDPA incompatible |
Data Pipeline
Training data sources, curriculum design & preprocessing details
Training used a web/code/math curriculum with the following source mix:
| Source | Dataset | Ratio |
|---|---|---|
| Web | `epfml/FineWeb-HQ` (CC-MAIN-2024-51) | 75% |
| Code | `bigcode/starcoderdata` (Python, JS, TypeScript) | 15% |
| Math | `HuggingFaceTB/finemath` (finemath-4plus) | 10% |
Total tokens processed: ~6,000,000,000 (single epoch over source data)
Curriculum Design
Training used a curriculum anneal over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.
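A sketch of what such an anneal could look like (illustrative only: the exact annealed ratios and interpolation scheme are not published; only the base mix and the final-15% upweighting are stated above):

```python
BASE_MIX = {"web": 0.75, "code": 0.15, "math": 0.10}
# FINAL_MIX is an illustrative assumption, not the actual annealed ratios
FINAL_MIX = {"web": 0.55, "code": 0.30, "math": 0.15}

def mixture_at(progress, anneal_start=0.85):
    """Source sampling weights at a given fraction of the token budget."""
    if progress < anneal_start:
        return dict(BASE_MIX)
    # Linear interpolation across the final-15% anneal window
    t = (progress - anneal_start) / (1 - anneal_start)
    return {k: (1 - t) * BASE_MIX[k] + t * FINAL_MIX[k] for k in BASE_MIX}

print(mixture_at(0.5))   # base mix: web-heavy generalization phase
print(mixture_at(1.0))   # annealed mix: code/math upweighted
```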
Text Preprocessing
```python
import unicodedata

def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text
```
- NFKC normalization maps visually-equivalent Unicode to canonical form
- Linebreak preservation for code samples (not applicable to web/math)
- Whitespace collapse for web/math text
Sequence Packing
Samples are packed into fixed 4,096-token blocks. Labels are identical to input_ids (causal LM objective). No cross-document attention masking is applied between packed samples — this is standard practice for web-text pretraining.
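A minimal sketch of greedy packing under these rules (illustrative, not the actual training script; `eos_id=2` is an assumption for the Mistral tokenizer's EOS, and the tiny `block_size=4` is just for demonstration):

```python
def pack_sequences(token_streams, block_size=4096, eos_id=2):
    """Concatenate tokenized documents into fixed-size blocks.

    Labels equal input_ids (the causal shift happens inside the model),
    and no attention mask separates packed documents.
    """
    buffer, blocks = [], []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= block_size:
            block = buffer[:block_size]
            blocks.append({"input_ids": block, "labels": list(block)})
            buffer = buffer[block_size:]
    return blocks  # leftover tokens in `buffer` are dropped

docs = [[1, 5, 9], [7, 7], [3, 3, 3, 3]]
packed = pack_sequences(docs, block_size=4)
# Note the second block crosses a document boundary — no masking between docs
print([b["input_ids"] for b in packed])  # [[1, 5, 9, 2], [7, 7, 2, 3], [3, 3, 3, 2]]
```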
Weight Initialization
Initialization scheme & residual scaling code
```python
import math
import torch.nn as nn

def initialize_weights(model, std=0.02, num_hidden_layers=20):
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)  # ≈ 0.00316
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Scaled-down std for output projections (residual path)
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            if module.weight is not None:
                module.weight.data.fill_(1.0)
```
- Residual projections (`o_proj`, `down_proj`) use scaled-down std (0.02 / sqrt(2 × 20) ≈ 0.00316) to prevent residual stream explosion at initialization, following the GPT-2 convention.
- All other Linear layers use `std=0.02`.
- RMSNorm scales start at 1.0 (identity).
Evaluation & Results
Training loss & perplexity curves, family comparison, full checkpoint history
Training Loss Curve
Validation Perplexity Curve
Final result: best validation loss 2.8906 — perplexity 18.00.
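Perplexity here is just the exponential of the mean token-level cross-entropy loss:

```python
import math

best_eval_loss = 2.8906
perplexity = math.exp(best_eval_loss)
print(f"{perplexity:.2f}")  # 18.00
```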
Comparison Across the StentorLabs Family
| Model | Params | Best PPL | Training Tokens | Compute | Notes |
|---|---|---|---|---|---|
| Stentor-12M (v1) | 12.0M | 89.01 | 200M | 2× T4 | v1 baseline |
| Stentor-30M (v1) | 30.4M | 33.02 | 600M | 2× T4 | |
| Stentor2-12M | 12.3M | 26.61 | 480M | 2× T4 | 8K TokenMonster vocab |
| Stentor2-30M | 30.4M | 18.07 | 800M | 2× T4 | 8K TokenMonster vocab |
| Portimbria-150M | 151.0M | 18.00 | ~6B | TPU v5e-8 | 32K Mistral BPE, 4K ctx, GQA |
Comparison note: PPL values are not directly comparable across this family for two reasons: different tokenizers (TokenMonster 8K vs Mistral BPE 32K produce different token spaces) and different training data mixes (all prior StentorLabs models trained on web text only; Portimbria-150M is the first to include code and math). Both factors make a raw PPL number-to-number comparison misleading — Portimbria-150M's real improvement over Stentor2 is larger than the headline numbers suggest.
Full Checkpoint History
| Step | Eval Loss | Perplexity | Notes |
|---|---|---|---|
| 1,000 | 5.3438 | ~209 | First best checkpoint |
| 2,000 | 4.1250 | ~62 | |
| 3,000 | 3.5625 | ~35 | |
| 8,000 | 3.4531 | ~31.6 | |
| 9,000 | 3.3125 | ~27.4 | |
| 10,000 | 3.1875 | ~24.3 | |
| 11,000 | 3.1406 | ~23.1 | |
| 12,000 | 3.0625 | ~21.4 | |
| 13,000 | 3.0312 | ~20.7 | |
| 14,000 | 2.9844 | ~19.8 | |
| 15,000 | 2.9375 | ~18.9 | |
| 17,000 | 2.9062 | ~18.3 | |
| 18,000 | 2.8906 | 18.03 | Best checkpoint saved |
| Final (epoch end) | 2.8906 | 18.00 | Final model |
Benchmark Results
All benchmarks run zero-shot unless otherwise noted.
Portimbria-150M Benchmarks
| Benchmark | Task | Score | Notes |
|---|---|---|---|
| PIQA | Physical commonsense reasoning | 57.62% | 0-shot, acc_norm |
| Winogrande | Pronoun resolution | 52.72% | 0-shot, acc |
| TruthfulQA MC2 | Truthfulness (multiple choice) | 46.94% | 0-shot, acc |
| ARC-Easy | Science QA | 33.80% | 0-shot, acc_norm |
| HellaSwag | Commonsense NLI (completion) | 27.46% | 0-shot, acc_norm |
| OpenBookQA | Elementary science | 24.60% | 0-shot, acc_norm |
| ARC-Challenge | Science QA | 22.53% | 0-shot, acc_norm |
| ARC Average | — | 28.17% | avg of Easy + Challenge |
| CommonsenseQA | Commonsense reasoning | 19.90% | 0-shot, acc |
Comparison against peer models, analysis & evaluation script
Comparison Against Peer Models
The table below compares Portimbria-150M against models of similar scale using publicly available, official or community-verified benchmark numbers. All Portimbria-150M scores are 0-shot. Peer model scores use the shot count shown in parentheses, which varies by source — comparisons are directional, not exact. Scores shown as — were not found in any official or sufficiently authoritative source and are intentionally omitted.
| Model | Params | Tokens | HellaSwag | ARC-Easy | ARC-Challenge | ARC Avg | PIQA | Winogrande | TruthfulQA | OpenBookQA | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Portimbria-150M | 151M | ~6B | 27.46 (0-sh) | 33.80 (0-sh) | 22.53 (0-sh) | 28.17 (0-sh) | 57.62 (0-sh) | 52.72 (0-sh) | 46.94 (0-sh) | 24.60 (0-sh) | lm-eval, this card |
| SmolLM2-135M | 135M | 2T | 42.1 (0-sh) | 48.99 (0-sh) | 38.81 (calc²) | 43.9 (0-sh) | 68.4 (0-sh) | 51.3 (0-sh) | — | 34.6 (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculated² |
| SmolLM-135M | 135M | 600B | 41.2 (0-sh) | 58.84 (0-sh) | 25.96 (calc²) | 42.4 (0-sh) | 68.4 (0-sh) | 51.3 (0-sh) | — | 34.0 (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculated² |
| Pythia-160M | 160M | ~300B | 29.9 (0-sh) | 40.0 (0-sh) | 25.3 (0-sh) | 32.65 (0-sh) | 62.0 (0-sh) | 50.9 (0-sh) | 44.3 (0-sh) | 31.2 (0-sh) | 0-sh scores from public comparison table³; TruthfulQA from HF Open LLM Leaderboard |
| OPT-125M | 125M | 180B | 31.5 (10-sh) | 41.3 (0-sh) | 22.10 (—sh) | 31.70 (calc) | 62.08 (0-sh) | 51.6 (5-sh) | 42.9 (0-sh) | 28.00 (0-sh) | ARC-Easy 0-sh from public comparison table; HellaSwag 10-sh & WinoGrande 5-sh from HF Leaderboard³; ARC-Challenge shot count unconfirmed |
| GPT-Neo 125M | 125M | 300B | 28.67 (0-sh) | 40.7 (0-sh) | 22.87 (—sh) | 31.79 (calc) | 63.06 (0-sh) | 50.43 (0-sh) | 35.70 (—sh) | 26.20 (—sh) | HellaSwag/ARC-Easy/PIQA/WinoGrande: lm-eval, EleutherAI README (0-sh); ARC-Challenge/TruthfulQA/OpenBookQA: public comparison table, shots not stated④ |
| GPT-2 (117M) | 117M | ~40B | 31.64 (—sh) | — | 22.95 (—sh) | — | 62.51 (—sh) | 50.04 (—sh) | 31.73 (—sh) | 27.20 (—sh) | GPT-2 124M public lm-eval row¹; shot counts not stated; no ARC-Easy/Avg available for exact 117M |
| Random Chance | — | — | 25.0 | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | — | 25.0 | Uniform random over answer choices |
Table notes:
¹ GPT-2 (117M) was released in 2019 before these benchmarks became standard. The scores listed are sourced from the closest available public lm-eval row (GPT-2 124M); the lm-eval harness gpt2 shortname defaults to the 117M model, but no complete public 117M row with ARC-Easy was found. Shot counts are not stated in that source. ARC-Easy and ARC Avg are therefore unavailable and left as —.
² SmolLM2 and SmolLM official model cards report only ARC Average (via lighteval); no per-split breakdown is published. ARC-Easy scores come from separate public comparison tables. ARC-Challenge is back-calculated as 2 × ARC-Avg − ARC-Easy and marked (calc²). These derived values are estimates and have not been independently verified against a direct lm-eval run.
³ Pythia-160M scores (HellaSwag, ARC-Easy, ARC-Challenge, PIQA, WinoGrande, OpenBookQA) have been updated to explicitly 0-shot values sourced from a public zero-shot comparison table, superseding the previously listed mixed-shot HF Open LLM Leaderboard entries. TruthfulQA remains from the HF Leaderboard (0-shot). For OPT-125M, ARC-Easy is from the same 0-shot comparison table; HellaSwag (10-shot) and WinoGrande (5-shot) are retained from the HF Leaderboard as no 0-shot replacement was found; ARC-Challenge shot count is unconfirmed in that source.
④ GPT-Neo 125M HellaSwag, ARC-Easy, PIQA, and WinoGrande are 0-shot from the official EleutherAI gpt-neo GitHub README. ARC-Challenge, TruthfulQA, and OpenBookQA are sourced from a public comparison table; shot counts for those three tasks are not stated.
⑤ TruthfulQA MC2 does not have a meaningful random chance baseline — it measures normalized probability mass assigned to all correct completions rather than a standard n-way classification, so no uniform random reference is applicable. Note that Portimbria's ARC-Challenge (22.53) and OpenBookQA (24.60) both fall below the 25.0 random baseline, likely due to acc_norm length normalization penalizing the model's output distributions at this scale.
— = not found in a reliable source, or shot count makes direct comparison inappropriate; omitted rather than estimated.
Analysis
Portimbria-150M was trained on ~6 billion tokens — 2% of the data used for Pythia-160M (300B tokens), 2% of GPT-Neo 125M (300B tokens), 3.3% of OPT-125M (~180B tokens), and a mere 0.3% of SmolLM2-135M (2T tokens).
TruthfulQA is Portimbria's most consistent standout. Across every peer with a TruthfulQA entry, Portimbria leads: 46.94 vs GPT-2's 31.73, vs GPT-Neo's 35.70, vs OPT-125M's 42.9, and vs Pythia-160M's 44.3. That 15-point gap over GPT-2 and an 11-point gap over GPT-Neo are not noise — they suggest that the web+code+math curriculum and the longer 4096-token training context are doing real work at the quality level that TruthfulQA targets. Portimbria wins TruthfulQA against every peer model in this table on 2% of their training data.
Winogrande is the other consistent win. Portimbria (52.72) beats every model in the table: GPT-2 (50.04), GPT-Neo (50.43), OPT-125M (51.6), Pythia-160M (50.9), and both SmolLM models (51.3) — despite all of them having seen vastly more training data.
The honest gaps are real. On HellaSwag, ARC-Easy, ARC-Challenge, PIQA, and OpenBookQA, Pythia-160M, GPT-Neo, OPT-125M, and GPT-2 all score higher. Those gaps are genuine — Portimbria trails Pythia-160M by ~2.5 points on HellaSwag, ~6.2 points on ARC-Easy, and ~6.6 points on OpenBookQA — all explainable by Pythia's 50× token advantage, but still real differences. These are the benchmarks with room to close through fine-tuning or extended pretraining.
Against GPT-2 (124M proxy at unconfirmed shot count), Portimbria competes respectably given the token budget gap: trailing on HellaSwag (27.46 vs 31.64), PIQA (57.62 vs 62.51), and OpenBookQA (24.60 vs 27.20), but winning decisively on TruthfulQA and WinoGrande. ARC-Challenge is a near-tie (22.53 vs 22.95).
SmolLM2-135M is the undisputed leader across every filled benchmark cell. With 333× the training data, its margins are consistent and expected — this is not a comparison Portimbria can win at current training scale. SmolLM-135M (600B tokens) leads on HellaSwag, PIQA, and ARC-Easy as well, with a notable ARC-Easy of 58.84 — though its back-calculated ARC-Challenge (25.96) is actually close to Portimbria's 22.53, and Portimbria leads on WinoGrande (52.72 vs 51.3) and TruthfulQA.
What this model is, beyond the numbers, is an exceptionally data-efficient foundation. Winning TruthfulQA and WinoGrande across the full peer group on 6B tokens — while trailing meaningfully only on commonsense-heavy tasks that reward scale — is precisely what you'd hope to see from a model trained on a high-quality, mixed-domain curriculum. Fine-tuned on a domain-specific corpus or targeted at reasoning tasks, Portimbria-150M has a genuine path to closing the remaining gaps. All of this, built from scratch, for free, on a TPU available to anyone with a Kaggle account.
Evaluation Setup (for Portimbria-150M)
Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below. No API token is required — the model is public. Each benchmark block runs independently so a single failure never stops the rest.
```python
import os, sys, subprocess, json, time, re, threading
from pathlib import Path
from datetime import datetime

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# ── Install deps ──────────────────────────────────────────────────────────────
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers"], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"], check=True)

# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Portimbria-150M"
DTYPE = "float16"
BATCH = "32"
SEED = 42
OUT = "./results"
MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"

BLOCKS = [
    ("block1", "PIQA · OpenBookQA · TruthfulQA",
     "piqa,openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA",
     "winogrande,commonsense_qa", 0, None),
    ("block3", "HellaSwag",
     "hellaswag", 0, None),
    ("block4", "ARC-Easy · ARC-Challenge",
     "arc_easy,arc_challenge", 0, None),
]

LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu",
    "--num_processes=2",
    "--mixed_precision=fp16",
    "-m", "lm_eval",
    "--model", "hf",
    "--model_args", MODEL_ARGS,
    "--batch_size", BATCH,
    "--seed", str(SEED),
]

# ── Helpers ───────────────────────────────────────────────────────────────────
DEBUGGER_NOISE = re.compile(
    r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)"
)

def ts():
    return datetime.now().strftime("%H:%M:%S")

def stream(proc):
    """Mirror the subprocess's stdout/stderr live, filtering debugger noise."""
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = raw.rstrip()
            if line and not DEBUGGER_NOISE.search(line):
                print(f"  [{ts()}] {line}", flush=True)
    t_out = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t_err = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t_out.start()
    t_err.start()
    proc.wait()
    t_out.join()
    t_err.join()

# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)
summary = {}

for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):
    print(f"\n{'='*60}", flush=True)
    print(f"  [{ts()}] BLOCK {i}/{len(BLOCKS)} — {title}", flush=True)
    print(f"{'='*60}\n", flush=True)
    cmd = LAUNCH_BASE + [
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}",
    ]
    if extra:
        cmd += extra
    t0 = time.time()
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            bufsize=1,
        )
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)
        if proc.returncode == 0:
            print(f"\n  ✅ [{ts()}] {title} — done in {elapsed} min\n", flush=True)
            summary[name] = {"status": "ok", "elapsed_min": elapsed}
        else:
            print(f"\n  ❌ [{ts()}] {title} — exit {proc.returncode} ({elapsed} min)\n", flush=True)
            summary[name] = {"status": "failed", "exit_code": proc.returncode, "elapsed_min": elapsed}
    except Exception as exc:
        elapsed = round((time.time() - t0) / 60, 1)
        print(f"\n  ❌ [{ts()}] {title} — {exc}\n", flush=True)
        summary[name] = {"status": "failed", "error": str(exc), "elapsed_min": elapsed}

# ── Final summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v["status"] == "ok")
print(f"\n{'='*60}", flush=True)
print(f"  DONE — {passed}/{len(BLOCKS)} succeeded", flush=True)
print(f"{'='*60}", flush=True)
for name, info in summary.items():
    icon = "✅" if info["status"] == "ok" else "❌"
    mins = info.get("elapsed_min", "—")
    print(f"  {icon} {name:<10} {mins} min", flush=True)

summary_path = f"{OUT}/run_summary.json"
with open(summary_path, "w") as fh:
    json.dump(summary, fh, indent=2)
print(f"\n  Summary → {summary_path}\n", flush=True)

if any(v["status"] == "failed" for v in summary.values()):
    sys.exit(1)
```
Metrics to report per task:
| Task | Metric |
|---|---|
| PIQA | acc_norm |
| OpenBookQA | acc_norm |
| TruthfulQA | mc2 |
| Winogrande | acc |
| CommonsenseQA | acc |
| HellaSwag | acc_norm |
| ARC-Easy | acc_norm |
| ARC-Challenge | acc_norm |
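To turn the harness's JSON output into exactly this table, the per-task metric can be looked up programmatically. A minimal sketch, assuming the lm-eval v0.4 results layout where metrics are keyed as `"<metric>,<filter>"` (e.g. `acc_norm,none`); the `extract_scores`/`load_scores` helpers and the file path convention are our own, not part of the harness:

```python
import json

# Metric to report for each task (mirrors the table above).
METRIC_FOR_TASK = {
    "piqa": "acc_norm",
    "openbookqa": "acc_norm",
    "truthfulqa_mc2": "acc",   # mc2 surfaces as "acc" in recent harness versions
    "winogrande": "acc",
    "commonsense_qa": "acc",
    "hellaswag": "acc_norm",
    "arc_easy": "acc_norm",
    "arc_challenge": "acc_norm",
}

def extract_scores(results: dict) -> dict:
    """Pull the chosen metric for each task present, as a percentage."""
    scores = {}
    for task, metric in METRIC_FOR_TASK.items():
        task_res = results.get("results", {}).get(task)
        if not task_res:
            continue
        # lm-eval v0.4 keys metrics as "<metric>,<filter>", e.g. "acc_norm,none"
        value = task_res.get(f"{metric},none", task_res.get(metric))
        if value is not None:
            scores[task] = round(100 * value, 2)
    return scores

def load_scores(path: str) -> dict:
    """Convenience wrapper for a results file written via --output_path."""
    with open(path) as fh:
        return extract_scores(json.load(fh))
```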
Model Outputs
5 real generation samples with honest good/bad analysis
The following samples illustrate real generation quality. All responses use temp = 0.6, top_p = 0.85, rep_penalty = 1.15.
The first three are the first three responses obtained with no cherry-picking — what you can expect on average. The fourth is a good cherry-picked response (what you can expect when you're lucky). The fifth is a bad cherry-picked response (what you can expect when you're unlucky).
1. Average — "The definition of biology is" (max_tokens = 40, cut off)
"a complex and multifaceted field that encompasses the study, treatment, and management of various biological processes. From the development of genetic engineering to the advancements in medicine and medicine, understanding"
✅ Good: The opening is grammatically smooth and the framing of biology as "complex and multifaceted" is appropriate. Correctly associates it with genetic engineering and medicine — the model clearly has relevant domain associations at the topical level. Given more tokens, this would very likely have continued coherently.
❌ Bad: "treatment, and management of biological processes" frames biology more like a clinical or engineering field than a scientific one — a subtle but real semantic drift. More glaring is "medicine and medicine," a direct word repetition in a single phrase, which is a hallucination/repetition failure.
2. Average — "Some of the most popular universities are" (max_tokens = 40, cut off)
"the University of California, a university in the United States. In addition to being the world's largest university, there are several other universities that offer more than one-day job opportunities. The university"
✅ Good: Names a real, recognizable institution. Stays on-topic throughout, remains grammatically coherent, and correctly anchors UC in the United States.
❌ Bad: UC is not the world's largest university — that's a confident confabulation. "More than one-day job opportunities" is semantically broken; it reads like a garbled web text phrase and doesn't mean anything coherent as a university selling point. The prompt implied a list but the model produced a paragraph about a single university.
3. Average — "When making a business, here are some things to look out for:" (max_tokens = 40, cut off)
"- Your Business is more likely to be successful. It is important to have a strong business plan in place and make sure that you are getting the best possible experience. - You want to know what your"
✅ Good: This is arguably the best of the three average samples. The model correctly interprets the colon as a list-format signal and produces bullet points. "Strong business plan" is genuinely relevant advice. The two bullets are grammatically clean and stay on topic.
❌ Bad: "More likely to be successful" is vacuous without a comparison — successful than what? "Getting the best possible experience" is generic web-text filler that doesn't mean anything actionable. The second bullet ("You want to know what your") is mid-thought, but that's purely the 40-token ceiling — not the model. What is a model issue: the two bullets are redundant in sentiment — both are vague encouragements rather than distinct concrete tips.
4. Good (cherry-picked) — "Tips for running a marathon:" (max_tokens = 350, did not hit limit)
"- Track your progress regularly. This helps you stay on track and ensures that you are not getting any immediate results. - Set realistic goals for yourself. Aim to set specific, measurable goals for each race. - Stay motivated by creating a roadmap for yourself. Aim to keep track of your progress and make necessary adjustments to your plan. By following these tips, you can successfully run a marathon in the United States, ensuring that you have a successful and healthy marathon. Remember to focus on your core competencies, such as speed, agility, and mental toughness, to maximize your chances of success. So go ahead, embrace the challenges, and watch your body run wild with confidence!"
✅ Good: The clearest demonstration of Portimbria's ceiling. The response is well-structured across ~350 tokens, self-contained, and completes naturally without hitting the limit — a sign of genuine coherence rather than just stringing tokens. "Speed, agility, and mental toughness" are legitimately relevant marathon concepts. The motivational closing is tonally appropriate. "Specific, measurable goals" is real coaching language. This is what the model looks like on a good day.
❌ Bad: "ensures that you are not getting any immediate results" is the opposite of what the tip is trying to say — a factual inversion buried in otherwise fluent prose, which is arguably more dangerous than obvious nonsense because it reads naturally. "Run a marathon in the United States" injects unprompted geographic specificity with no reason. "Watch your body run wild with confidence" is awkward phrasing that slightly undercuts the otherwise decent closing. "Core competencies" is corporate jargon that reads oddly in a fitness context. The structural coherence is genuinely impressive; the semantic errors embedded within it are the honest limitation.
5. Bad (cherry-picked) — "What is 1 + 1?" (max_tokens ~350, hit limit)
"A. 20 B. 30 C. 40 D. 50 D. 60 E. 70 Explanation: 1) 1/2 = 1/2 (2) 2) 3/4 = 3/4 4) 4/4 = 4/4 5) 5/4 = 4/4 6) 6/4 = 4/4 7) 7/4 = 4/4 8) 9/4 = 5/4 9) 10/4 = 6/4 11) 10/4 = 6/4 12) 11/4 = 6/4 13) 13/4 = 6/4 14) 13/4 = 6/4 15) 13/4 = 6/4 16) 13/4 = 6/4 17) 14/4 = 6/4 18) 15/4 = 6/4 19) 16/4 = 6/4 20) 21/4 = 6/4 22) 23/4 = 6/4 24) 24/4 = 6/4 25) 26/4 = 6/4 26) 27/4 = 6/4 28) 29/4 = 6/4 29) 21/4 = 6/4 21) 22/4 = 6/4 23) 23/4 = 6/4 24) 24/4 = 6/4 25) 25/4 = 6/4 26) 26/4 = 6/4 27) 27/4 = 6/4 28) 29/4 = 6/4 29) 29/4 = 6/4 20) 29/4 = 6/4 21) 29/4 = 6/4 22) 29/4 = 6"
✅ Good: The model correctly recognizes "What is X?" as potentially a multiple-choice exam format and attempts to produce structured output with labeled options and an "Explanation:" section. That's a real and interesting structural pattern recognition. It also associates the prompt with fractions and arithmetic notation — showing it has some sense of mathematical register.
❌ Bad: Almost everything else. The correct answer is 2, but the lowest option offered is 20. The explanation is a runaway repetition loop — the fraction sequence degenerates into 6/4 = 6/4 repeated indefinitely, which is the clearest example of what happens without adequate repetition penalty on a structurally-patterned output. Letter "D" appears twice in the options list. None of the fractions have any logical connection to 1+1. This is a base model with no instruction tuning and no arithmetic capability — asking it a direct math question with a short, definitive answer is exactly the kind of prompt that exposes those limits. This output also illustrates why repetition_penalty ≥ 1.05 is non-negotiable; without it, pattern-heavy outputs like numbered lists collapse into loops almost immediately.
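The failure mode in sample 5 is also easy to detect mechanically. A small sketch (the function names are ours, not part of any library) that scores word-level n-gram repetition; it can serve as a cheap guard that re-samples with a higher repetition_penalty when generation degenerates into a loop:

```python
from collections import Counter

def repetition_score(text: str, n: int = 4) -> float:
    """Fraction of word-level n-grams that are duplicates: near 0 for normal
    prose, approaching 1 for a degenerate loop like the '6/4 = 6/4' output."""
    words = text.split()
    if len(words) < n + 1:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

def looks_degenerate(text: str, threshold: float = 0.3) -> bool:
    """True if enough n-grams repeat that the sample should be re-drawn."""
    return repetition_score(text) > threshold
```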
Training Dynamics
Step-by-step training phase breakdown & throughput details
The training run processed approximately 6 billion tokens across a single epoch (epoch 0), running for 22,889 optimizer steps before the token budget was exhausted.
Early training (steps 0–1,144, warmup phase): LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. First best checkpoint recorded at step 1,000 (eval loss 5.3438).
Mid training (steps 1,144–18,311, stable cosine phase): Smooth and consistent loss reduction. Gradient norms were well-behaved in the 0.3–0.6 range for most steps, with occasional spikes (notably 3.7 at step 1,800 and 8.5 at step 13,200 — both recovered cleanly). New best checkpoints recorded at steps 2,000 / 3,000 / 8,000 / 9,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000.
Late training (steps 18,311–22,889, cosine decay tail): LR decaying toward zero. Eval loss stopped improving after step 18,000, confirming the best model was saved at that checkpoint.
Throughput: 253,000 global tokens/sec average (31,600 per chip), with a brief XLA warmup window reset at step 300.
Total wall-clock time: ~8.02 hours (epoch training) + ~8 minutes (final eval and save).
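These figures are internally consistent: ~6B tokens over 22,889 optimizer steps implies roughly 2^18 ≈ 262K tokens per optimizer step (e.g. a global batch of 64 sequences at the full 4,096-token context, though that batch layout is an inference from the numbers above, not something stated in this card). A quick arithmetic check; the gap between the ~6.6 h of pure compute and the ~8 h wall clock is accounted for by evaluation, checkpointing, and the XLA warmup window:

```python
TOTAL_TOKENS = 6_000_000_000
STEPS = 22_889
TOKENS_PER_SEC = 253_000  # global throughput across 8 chips

tokens_per_step = TOTAL_TOKENS / STEPS        # ≈ 262,135 ≈ 2**18
hours = TOTAL_TOKENS / TOKENS_PER_SEC / 3600  # ≈ 6.6 h of pure compute
per_chip = TOKENS_PER_SEC / 8                 # ≈ 31,625 tokens/sec/chip
```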
Use Cases & Intended Uses
| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics at 150M scale | ✅ High | Full architecture, hyperparameters, and training curves published |
| Speculative decoding draft model | ✅ High | Fast enough to draft for larger Llama-family targets |
| Benchmarking 4K-context inference latency | ✅ High | Realistic long-context workload |
| Quantization / conversion pipeline testing | ✅ High | Standard architecture, no custom ops |
| Teaching material for LLM courses | ✅ High | Fully documented, reproducible from scratch |
| Edge deployment experiments | ✅ High | ~300MB in FP16 (151M params × 2 bytes); larger than Stentor2 but highly feasible on modern edge hardware |
| Domain-specific fine-tuning research | ✅ High | Standard transformers; fine-tune like any LLaMA model |
| Code completion prototyping | ❌ Not suitable | Code prompts produce English text, not code — see Honest Notices |
| Text continuation / creative writing | ✅ Medium | Good fluency; limited thematic fidelity |
| Factual Q&A | ❌ Not suitable | Unreliable world knowledge at this scale |
| Production deployment | ❌ Not suitable | No safety tuning |
| Non-English text | ❌ Not suitable | Training data is English-heavy |
| Instruction following | ❌ Not suitable | Base model only |
Out-of-Scope Uses
Any user-facing application — No safety filtering, no alignment, no factual reliability.
Medical, legal, or financial advice — Cannot reason reliably over specialized knowledge.
Generating content about real people — Will fabricate.
Automated content pipelines — Output quality is insufficient for unreviewed publication.
Instruction following — This is a base next-token predictor.
Ethical Considerations & Societal Impact
Data biases, safety considerations & societal impact
Inherited Data Biases
Trained on FineWeb-HQ, StarCoderData, and FineMath-4+ — all derived from web-scraped data. The model inherits:
Western-centric perspective — English-language web text skews toward Western viewpoints and cultural contexts.
English monolingualism — Mistral BPE is optimized for English. Other languages will produce high fertility and poor quality.
Demographic underrepresentation — Groups underrepresented in English web text will be underrepresented in outputs.
Code ecosystem bias — StarCoderData covers many programming languages, but this model was deliberately trained only on the Python, JavaScript, and TypeScript subsets. These three were chosen because they are among the most widely used languages in 2026 and are generally more accessible to the majority of developers.
No Safety Tuning
No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.
Positive Aspects
Democratizing AI research — Trained entirely on free Kaggle TPU compute.
Full transparency — Complete training hyperparameters, architecture, and logs published.
Minimal environmental footprint — ~8 hours of TPU compute is negligible versus large-scale pretraining runs.
Inference Guide
CPU inference (INT8) & GPU inference (FP16) code
CPU Inference (INT8 Dynamic Quantization)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("StentorLabs/Portimbria-150M")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

# Dynamically quantize all Linear layers to INT8 for CPU
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)

inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
GPU Inference (FP16)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

print(generate("Once upon a time in a distant kingdom"))
```
🚀 Free Inference — Try It Now
No GPU, no setup, no API key required.
StentorLabs hosts a free demo space for all Stentor models:
🔗 https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space
Quantization
FP16, BF16 & 4-bit (bitsandbytes) quantization code
FP16 (GPU)
```python
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
```
BF16
```python
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.bfloat16,
)
```
4-bit (bitsandbytes)
Install the extras first:

```bash
pip install bitsandbytes accelerate
```

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    quantization_config=bnb_config,
    device_map="auto",
)
```
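When choosing among these precisions, the weight-only footprint is simple arithmetic: parameter count × bytes per weight (activations, KV cache, and quantizer metadata such as bitsandbytes scales add on top). A back-of-envelope sketch using the ~151M parameter count from this card:

```python
PARAMS = 151_000_000

def weight_footprint_mb(params: int, bits_per_weight: float) -> float:
    """Approximate weight-only memory in MB (ignores activations, KV cache,
    and quantization metadata such as scales/zero-points)."""
    return params * bits_per_weight / 8 / 1e6

fp32 = weight_footprint_mb(PARAMS, 32)  # ≈ 604 MB
fp16 = weight_footprint_mb(PARAMS, 16)  # ≈ 302 MB
int4 = weight_footprint_mb(PARAMS, 4)   # ≈ 75.5 MB plus quantizer overhead
```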
🌍 Community Contributions — Build on This Model
Portimbria-150M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments — that's where you come in. This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed.
Here are things StentorLabs actively encourages the community to try:
Fine-tune it on your domain — instruction tuning, domain adaptation, RLHF, DPO, anything goes
Quantize it — 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged
Convert it to other formats — GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon
Run LoRA or QLoRA to adapt it cheaply on consumer hardware
Use it for speculative decoding with a larger Llama-family target
Benchmark it formally and share results
Publish your work — fine-tunes, quantized versions, adapters, research findings, derivative models, anything
If you build something with Portimbria-150M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.
LoRA / QLoRA Starter Configuration
Starter config, recommended hyperparameters & QLoRA note
If you haven't fine-tuned a Llama-family model before, here's a proven starting point for Portimbria-150M:
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # LoRA rank — try 32 if underfitting
    lora_alpha=32,  # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~3.1M || all params: ~154M || trainable%: ~2.0%
```
Recommended fine-tuning hyperparameters:
| Hyperparameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-4 | Scale down to 1e-4 for very small datasets |
| Optimizer | AdamW | betas=(0.9, 0.999), eps=1e-8 |
| LR scheduler | Cosine with warmup | ~5% warmup steps |
| Batch size | 4–16 | Per device; use gradient accumulation if memory-limited |
| Epochs | 2–5 | Watch for overfitting after epoch 2 |
| Max sequence length | 512–2048 | Up to 4096 is supported |
For QLoRA (4-bit quantized base + LoRA adapters on top), add BitsAndBytesConfig(load_in_4bit=True) when loading the base model — the LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.
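The printed trainable-parameter count can be sanity-checked by hand: each adapted linear layer of shape (fan_in, fan_out) contributes r × (fan_in + fan_out) adapter parameters. A generic sketch; the layer shapes in the example are illustrative placeholders, not Portimbria's actual dimensions:

```python
def lora_params(layer_shapes, r):
    """Total LoRA adapter parameters: r * (fan_in + fan_out) per adapted layer
    (an r x fan_in matrix A plus a fan_out x r matrix B)."""
    return sum(r * (fan_in + fan_out) for fan_in, fan_out in layer_shapes)

# Hypothetical example: 12 layers x 4 adapted projections, each (768, 768)
shapes = [(768, 768)] * (12 * 4)
print(lora_params(shapes, r=16))  # 1,179,648 with these made-up dims
```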
Format Conversion
Convert to GGUF (llama.cpp) & ONNX
Convert to GGUF (llama.cpp)
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
huggingface-cli download StentorLabs/Portimbria-150M --local-dir portimbria-150m
python convert_hf_to_gguf.py portimbria-150m/ \
    --outfile portimbria-150m.gguf \
    --outtype f16
./llama-quantize portimbria-150m.gguf portimbria-150m-q4_k_m.gguf q4_k_m
./llama-cli -m portimbria-150m-q4_k_m.gguf -p "The history of computing" -n 100
```
Convert to ONNX
```bash
pip install optimum[exporters]
optimum-cli export onnx \
    --model StentorLabs/Portimbria-150M \
    --task text-generation-with-past \
    portimbria-150m-onnx/
```
Speculative Decoding
Portimbria-150M can serve as a fast draft model to accelerate inference from larger target models. Because it uses the standard Mistral 32K BPE vocabulary, acceptance rates with vocabulary-compatible targets should be substantially higher than with Stentor2 models (which use a custom 8K tokenizer).
Speculative decoding code & vocabulary compatibility notes
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
).to("cuda")
draft_tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto",
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = target_tokenizer("Explain the concept of recursion:", return_tensors="pt").to("cuda")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    # The two tokenizer kwargs below are needed when the draft and target
    # vocabularies differ (universal assisted decoding, transformers >= 4.46);
    # omit them for a vocabulary-identical pair.
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,
    do_sample=True,
    max_new_tokens=200,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Vocabulary compatibility: Portimbria-150M uses the Mistral-7B tokenizer (32K BPE). This is not identical to the Llama tokenizers: Llama 2 uses a different 32K SentencePiece vocabulary with different token merges, and Llama 3 uses a much larger ~128K vocabulary. It is directly compatible with models that use the same Mistral BPE vocabulary (e.g. `mistralai/Mistral-7B-v0.1` and derivatives). Vocabulary-compatible speculative decoding will yield higher acceptance rates; vocabulary-mismatched pairs will still work via HuggingFace's assisted generation but with lower acceptance rates.
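A draft/target pair's compatibility can be checked directly by comparing tokenizer vocabularies. A minimal sketch of that comparison (the function is our own, not a library API); with real models, pass each side's `tokenizer.get_vocab()`:

```python
def vocab_overlap(draft_vocab: dict, target_vocab: dict) -> float:
    """Fraction of draft tokens that map to the SAME id in the target vocab.
    ~1.0 means draft logits line up with the target's; lower means HF assisted
    generation must re-tokenize between the two models."""
    if not draft_vocab:
        return 0.0
    same = sum(1 for tok, idx in draft_vocab.items() if target_vocab.get(tok) == idx)
    return same / len(draft_vocab)
```

For two identical Mistral-BPE tokenizers this returns 1.0; for a Mistral/Llama-3 pair it will be far lower, which signals that assisted generation has to bridge the vocabularies at a cost in acceptance rate.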
Bias, Risks & Limitations
Factual Accuracy: All factual outputs should be treated as unreliable without verification.
Context Boundary: Hard limit of 4,096 tokens.
English Bias: Training data is English-dominant.
Training Data Bias: Inherits biases in FineWeb-HQ, StarCoderData, and FineMath-4+.
Hallucination: Will produce confident but fabricated content.
No Alignment: No RLHF, DPO, or constitutional training.
Code Generation: Code prompts produce English text output rather than functional code. The model does not generate syntactically or logically valid code in response to code-related prompts.
Shared Tensor Warning: `Removed shared tensor {'lm_head.weight'}` during saving is expected. Safe to ignore.
Gradient Spikes: Two isolated gradient norm spikes occurred during training (step 1,800: 3.72; step 13,200: 8.56). Both recovered cleanly in subsequent steps with no apparent impact on the loss trajectory.
Related Work
Comparable sub-200M models & related research papers
Comparable Sub-200M Base Models
| Model | Parameters | Vocab | Context | Notes |
|---|---|---|---|---|
| Portimbria-150M (this model) | 151M | 32K BPE | 4,096 | Trained on 6B tokens, TPU v5e-8 |
| Stentor2-30M | 30.4M | 8K TokenMonster | 1,024 | StentorLabs family |
| Pythia-160M | 160M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| GPT-2 (124M) | 124M | 50K BPE | 1,024 | OpenAI; 40GB WebText |
| OPT-125M | 125M | 50K BPE | 2,048 | Meta; 180B tokens |
| TinyLlama-1.1B | 1,100M | 32K BPE | 2,048 | 3T tokens; different scale tier |
Related Research Papers
| Paper | Relevance |
|---|---|
| Scaling Laws — Kaplan et al., 2020 | Informs token budget decisions |
| Chinchilla — Hoffmann et al., 2022 | 6B tokens for 150M params is ~40 tokens/param, roughly 2× the Chinchilla-optimal ~20 tokens/param |
| GQA — Ainslie et al., 2023 | Grouped Query Attention used in this model |
| RoPE — Su et al., 2021 | Positional encoding |
| LLaMA — Touvron et al., 2023 | Architecture basis |
| Pythia — Biderman et al., 2023 | Comparable small-model scaling study |
| Speculative Decoding — Leviathan et al., 2023 | Primary deployment use case |
Environmental Impact
Hardware, duration & estimated carbon
| Factor | Value |
|---|---|
| Hardware | Google Cloud TPU v5e-8 |
| Active Training Duration | ~8.02 hours |
| Cloud Provider | Google (via Kaggle free tier) |
| Compute Region | United States |
| Estimated Carbon | Minimal (< 1.0 kg CO₂e estimated) |
The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.
Citation
BibTeX
```bibtex
@misc{izumoto2026portimbria150m,
  title        = {Portimbria-150M},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Portimbria-150M}},
  note         = {151M parameter LlamaForCausalLM base model with GQA trained from scratch
                  on ~6B tokens (FineWeb-HQ, StarCoderData, FineMath-4+) using a
                  Google Cloud TPU v5e-8 on Kaggle free compute. 4096-token context,
                  32K Mistral BPE vocabulary. Apache 2.0 license.}
}
```
Model Card Contact
Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.
Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models — trained on free compute, shared with everyone.
Evaluation results (self-reported)
- Best validation loss on FineWeb-HQ (validation split): 2.891
- Best perplexity on FineWeb-HQ (validation split): 18.0
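The two self-reported figures are mutually consistent, since perplexity is the exponential of the cross-entropy validation loss:

```python
import math

# Perplexity = exp(validation loss)
ppl = math.exp(2.891)  # ≈ 18.0, matching the reported value
```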