# Nanochat + WASM Coprocessor (Fused Preview 02)

A dual-stream composed model that fuses a 1.4B-parameter language model (NanochatGPT d34) with a frozen WASM bytecode interpreter transformer via trained cross-attention. The LM generates text and WASM instructions; the coprocessor executes them via real forward passes; results flow back into the LM through cross-attention, enabling the model to think in computation.
## What's New in Preview 02

This is the Phase 4 checkpoint, the final architecture with the OUTPUT token design:

- INPUT encoding: all integers in user questions are encoded as 4-byte big-endian WASM tokens (full int32 range)
- OUTPUT token: the model learns to predict the `OUTPUT` opcode, then the coprocessor deterministically fills in the 4 result bytes
- Gate masking: the `OUTPUT` opcode has gate=0 (the model must predict it); result bytes have gate=1 (deterministic, masked from loss)
- Running WASM accuracy: ~99.5% at batch 550/7429 of the final training epoch
- Weighted band distribution: training data exercises the full int32 range with weighted sampling across magnitude bands
## Checkpoint

`epoch_1_batch_1500/model.safetensors`: Phase 4 mid-epoch checkpoint (best available weights from the highest-performing training run)
## Architecture
| Component | Details |
|---|---|
| Text Stream (trained) | NanochatGPT d34: d_model=2176, 34 layers, 17 heads, ~1.4B params |
| Compute Stream (frozen) | WASM Interpreter Transformer: d_model=100, 8 layers, 30 heads, ~316K params |
| Bridge (trained) | Cross-attention at layer 10 and all subsequent layers, WasmTokenEmbedding (260→2176), logit bias |
| Total parameters | ~1.4B trainable + ~316K frozen |
| Vocab | 65536 text tokens (tiktoken BPE) + 260 WASM tokens (extended vocabulary) |
## How the Dual-Stream Works

- Text tokens: processed normally by the LM
- WASM instruction tokens: the LM emits them, and the frozen coprocessor immediately executes them
- Feedback tokens: coprocessor results (REPL_RESULT, BRANCH_TAKEN, BRANCH_NOT_TAKEN) are fed back via cross-attention
- Lockstep execution: each WASM instruction is immediately followed by a feedback token, creating instruction-feedback pairs that the LM sees simultaneously

The coprocessor is a hand-compiled transformer that executes WASM bytecode via real matrix multiplications. It was not trained: every weight was set by a compiler. It supports arithmetic, comparisons, memory, local variables, filesystem I/O, and loops with conditional branching.
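The semantics the coprocessor implements can be sketched as a plain stack machine. This is a reference sketch of the execution model only, not the hand-compiled transformer itself; instruction spellings are illustrative.

```python
# A minimal stack machine giving reference semantics for a few of the
# operations the frozen coprocessor executes (illustrative sketch).
def run(program: list[tuple]) -> list[int]:
    stack: list[int] = []
    outputs: list[int] = []
    for instr in program:
        op, *args = instr
        if op == "i32.const":
            stack.append(args[0])          # push an immediate
        elif op == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)  # wrap to int32
        elif op == "output":
            outputs.append(stack[-1])      # emit top of stack as a result
        elif op == "halt":
            break
    return outputs

print(run([("i32.const", 388372838), ("i32.const", 1158908721),
           ("i32.add",), ("output",), ("halt",)]))  # → [1547281559]
```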
## OUTPUT Token Architecture
User: "What is 388372838 + 1158908721?"

Encoded question: text `"What is "` + [4 WASM byte tokens for 388372838] + text `" + "` + [4 WASM byte tokens for 1158908721] + text `"?"`

Model generates:

```
I32_CONST [4 bytes: operand 1] I32_CONST [4 bytes: operand 2] I32_ADD OUTPUT [4 bytes: result] HALT
"The answer is " OUTPUT [4 deterministic result bytes] "."
```
Gate mask:
- I32_CONST, I32_ADD, OUTPUT, HALT: gate=0 (model predicts these)
- Operand bytes & result bytes: gate=1 (deterministic, masked from loss)
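The gate-mask idea can be sketched in a few lines. The token names and helper below are illustrative placeholders; only the gate=0/gate=1 semantics come from the description above.

```python
# Sketch of gate masking: gate=0 positions contribute to the loss (the
# model must predict them); gate=1 positions are deterministic fills and
# are masked out of the loss.
def build_gate_mask(tokens: list[str]) -> list[int]:
    deterministic = {"operand_byte", "result_byte"}
    return [1 if t in deterministic else 0 for t in tokens]

seq = ["I32_CONST", "operand_byte", "operand_byte", "operand_byte",
       "operand_byte", "I32_ADD", "OUTPUT", "result_byte", "result_byte",
       "result_byte", "result_byte", "HALT"]
mask = build_gate_mask(seq)

# The loss would be averaged only over gate=0 positions:
trainable = [t for t, g in zip(seq, mask) if g == 0]
print(trainable)  # → ['I32_CONST', 'I32_ADD', 'OUTPUT', 'HALT']
```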
## Cross-Attention Bridge

- Layer 10: primary injection point where cross-attention reads coprocessor hidden states
- Layers 11-33: additional cross-attention heads (gate-initialized near zero) refine the compute signal
- WasmTokenEmbedding: learned 260×2176 embedding mapping WASM tokens to the LM representation space
- wasm_logit_bias: learned bias controlling WASM token generation probability
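A toy single-head version of the gated cross-attention bridge, in NumPy. Dimensions are toy-sized (the real model uses d_model=2176 for the LM and 100 for the coprocessor), and all weights here are random placeholders; the point is the residual-plus-gate structure, where a gate initialized near zero leaves the LM's hidden states untouched at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_lm, d_wasm, t_lm, t_wasm = 8, 4, 5, 3   # toy dimensions, not the real ones

# Projections from each stream into a shared attention space (random stand-ins)
W_q = rng.normal(size=(d_lm, d_lm))
W_k = rng.normal(size=(d_wasm, d_lm))
W_v = rng.normal(size=(d_wasm, d_lm))
gate = 0.0   # gate-initialized near zero: the bridge starts as an identity

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attn(h_lm, h_wasm):
    """LM states attend over coprocessor states; result is gated in."""
    q, k, v = h_lm @ W_q, h_wasm @ W_k, h_wasm @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_lm))
    return h_lm + gate * (attn @ v)   # residual + gated compute signal

h_lm = rng.normal(size=(t_lm, d_lm))
h_wasm = rng.normal(size=(t_wasm, d_wasm))
out = cross_attn(h_lm, h_wasm)       # with gate=0, out equals h_lm exactly
```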
## Training
| Parameter | Value |
|---|---|
| GPU | NVIDIA B200 (192GB HBM3e) |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for scalars) |
| Precision | FP8 (Blackwell native) with bf16 master weights |
| Phase | Phase 4: push to 98%+ accuracy |
| Data | ~178K WASM programs + text conversations (SmolTalk + MMLU) |
| Text ratio | 30% pure text, 70% WASM conversations |
| Int32 range | Full range with weighted band distribution |
| Gradient checkpointing | Enabled |
| Base model | karpathy/nanochat-d34 |
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    subfolder="epoch_1_batch_1500",
)
tok = AutoTokenizer.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
)

prompt_ids = tok.encode_chat("What is 3 + 4?")
generated, wasm_outputs, trace = model.generate_chat(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.8,
    top_k=50,
    return_outputs=True,
)

# Keep only text tokens (ids below 65536) from the newly generated portion
text_tokens = [t for t in generated[len(prompt_ids):] if 0 < t < 65536]
response = tok.decode(text_tokens)
print(response)
print(f"WASM outputs: {wasm_outputs}")
```
## Token Contract

The model uses an extended vocabulary where tokens ≥ 65536 are WASM instruction tokens:

- Opcodes: `i32.const` (0x00), `i32.add` (0x01), `output` (0xF0), `halt` (0xFF), etc.
- Operand bytes: encoded as 4-byte big-endian values, offset by 264 in token space
- Feedback tokens: `REPL_RESULT` (261), `BRANCH_TAKEN` (262), `BRANCH_NOT_TAKEN` (263)
- WASM pad: used during replay sequences for lockstep token alignment
## The WASM Coprocessor
The frozen compute stream is a hand-compiled 8-layer transformer:
- Layer 0 (13 heads): Opcode identification via one-hot matching
- Layer 1 (1 head): Stack depth accumulation via sum-attention
- Layer 2 (0 heads): Depth squaring (FFN only)
- Layer 3 (8 heads): Bit extraction for AND/OR operations
- Layer 4 (2 heads): Stack retrieval + full arithmetic FFN
- Layer 5 (1 head): Local variable retrieval
- Layer 6 (1 head): Memory load/store
- Layer 7 (4 heads): Filesystem I/O via cross-attention
Supports 25 WASM operations including arithmetic, comparisons, memory, locals, filesystem I/O, and loops with conditional branching. 115/115 compliance tests pass at 100% accuracy.
## Related

- Preview 01: earlier Phase 3 checkpoint
- WASM Interpreter Transformer: the standalone frozen coprocessor
- Can LLMs Be Computers?: the research behind this architecture
## License
MIT