# Nanochat + WASM Coprocessor (Fused Preview 02)

A dual-stream composed model that fuses a 1.4B-parameter language model (NanochatGPT d34) with a frozen WASM bytecode interpreter transformer via trained cross-attention. The LM generates text and WASM instructions; the coprocessor executes them via real forward passes; results flow back into the LM through cross-attention, enabling the model to think in computation.
## What's New in Preview 02

This is the Phase 4 checkpoint, the final architecture with the OUTPUT token design:

- INPUT encoding: all integers in user questions are encoded as 4-byte big-endian WASM tokens (full int32 range)
- OUTPUT token: the model learns to predict the `OUTPUT` opcode, then the coprocessor deterministically fills in the 4 result bytes
- Gate masking: the `OUTPUT` opcode has gate=0 (the model must predict it); result bytes have gate=1 (deterministic, masked from loss)
- Running WASM accuracy: ~99.5% at batch 550/7429 of the final training epoch
- Weighted band distribution: training data exercises the full int32 range with weighted sampling across magnitude bands
## Checkpoint

`epoch_1_batch_1500/model.safetensors`: Phase 4 mid-epoch checkpoint (best available weights from the highest-performing training run)
## Architecture
| Component | Details |
|---|---|
| Text Stream (trained) | NanochatGPT d34: d_model=2176, 34 layers, 17 heads, ~1.4B params |
| Compute Stream (frozen) | WASM Interpreter Transformer: d_model=100, 8 layers, 30 heads, ~316K params |
| Bridge (trained) | Cross-attention at layer 10 and all subsequent layers, WasmTokenEmbedding (260→2176), logit bias |
| Total parameters | ~1.4B trainable + ~316K frozen |
| Vocab | 65536 text tokens (tiktoken BPE) + 260 WASM tokens (extended vocabulary) |
## How the Dual-Stream Works

- Text tokens: processed normally by the LM
- WASM instruction tokens: the LM emits them, and the frozen coprocessor immediately executes them
- Feedback tokens: coprocessor results (REPL_RESULT, BRANCH_TAKEN, BRANCH_NOT_TAKEN) are fed back via cross-attention
- Lockstep execution: each WASM instruction is immediately followed by a feedback token, creating instruction-feedback pairs that the LM sees simultaneously

The coprocessor is a hand-compiled transformer that executes WASM bytecode via real matrix multiplications. It was not trained: every weight was set by a compiler. It supports arithmetic, comparisons, memory, local variables, filesystem I/O, and loops with conditional branching.
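The semantics the coprocessor implements can be sketched as a plain stack machine. This is a reference sketch of the execution model only, not the hand-compiled transformer itself; instruction spellings are illustrative.

```python
# A minimal stack machine giving reference semantics for a few of the
# operations the frozen coprocessor executes (illustrative sketch).
def run(program: list[tuple]) -> list[int]:
    stack: list[int] = []
    outputs: list[int] = []
    for instr in program:
        op, *args = instr
        if op == "i32.const":
            stack.append(args[0])          # push an immediate
        elif op == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)  # wrap to int32
        elif op == "output":
            outputs.append(stack[-1])      # emit top of stack as a result
        elif op == "halt":
            break
    return outputs

print(run([("i32.const", 388372838), ("i32.const", 1158908721),
           ("i32.add",), ("output",), ("halt",)]))  # → [1547281559]
```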
## OUTPUT Token Architecture
User: "What is 388372838 + 1158908721?"

Encoded question: text `"What is "` + [4 WASM byte tokens for 388372838] + text `" + "` + [4 WASM byte tokens for 1158908721] + text `"?"`

Model generates:

```
I32_CONST [4 bytes: operand 1] I32_CONST [4 bytes: operand 2] I32_ADD OUTPUT [4 bytes: result] HALT
"The answer is " OUTPUT [4 deterministic result bytes] "."
```
Gate mask:
- I32_CONST, I32_ADD, OUTPUT, HALT: gate=0 (model predicts these)
- Operand bytes & result bytes: gate=1 (deterministic, masked from loss)
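The gate-mask idea can be sketched in a few lines. The token names and helper below are illustrative placeholders; only the gate=0/gate=1 semantics come from the description above.

```python
# Sketch of gate masking: gate=0 positions contribute to the loss (the
# model must predict them); gate=1 positions are deterministic fills and
# are masked out of the loss.
def build_gate_mask(tokens: list[str]) -> list[int]:
    deterministic = {"operand_byte", "result_byte"}
    return [1 if t in deterministic else 0 for t in tokens]

seq = ["I32_CONST", "operand_byte", "operand_byte", "operand_byte",
       "operand_byte", "I32_ADD", "OUTPUT", "result_byte", "result_byte",
       "result_byte", "result_byte", "HALT"]
mask = build_gate_mask(seq)

# The loss would be averaged only over gate=0 positions:
trainable = [t for t, g in zip(seq, mask) if g == 0]
print(trainable)  # → ['I32_CONST', 'I32_ADD', 'OUTPUT', 'HALT']
```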
## Cross-Attention Bridge

- Layer 10: primary injection point where cross-attention reads coprocessor hidden states
- Layers 11-33: additional cross-attention heads (gate-initialized near zero) refine the compute signal
- WasmTokenEmbedding: learned 260×2176 embedding mapping WASM tokens to the LM representation space
- wasm_logit_bias: learned bias controlling WASM token generation probability
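A toy single-head version of the gated cross-attention bridge, in NumPy. Dimensions are toy-sized (the real model uses d_model=2176 for the LM and 100 for the coprocessor), and all weights here are random placeholders; the point is the residual-plus-gate structure, where a gate initialized near zero leaves the LM's hidden states untouched at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_lm, d_wasm, t_lm, t_wasm = 8, 4, 5, 3   # toy dimensions, not the real ones

# Projections from each stream into a shared attention space (random stand-ins)
W_q = rng.normal(size=(d_lm, d_lm))
W_k = rng.normal(size=(d_wasm, d_lm))
W_v = rng.normal(size=(d_wasm, d_lm))
gate = 0.0   # gate-initialized near zero: the bridge starts as an identity

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attn(h_lm, h_wasm):
    """LM states attend over coprocessor states; result is gated in."""
    q, k, v = h_lm @ W_q, h_wasm @ W_k, h_wasm @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_lm))
    return h_lm + gate * (attn @ v)   # residual + gated compute signal

h_lm = rng.normal(size=(t_lm, d_lm))
h_wasm = rng.normal(size=(t_wasm, d_wasm))
out = cross_attn(h_lm, h_wasm)       # with gate=0, out equals h_lm exactly
```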
## Training
| Parameter | Value |
|---|---|
| GPU | NVIDIA B200 (192GB HBM3e) |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for scalars) |
| Precision | FP8 (Blackwell native) with bf16 master weights |
| Phase | Phase 4: push to 98%+ accuracy |
| Data | ~178K WASM programs + text conversations (SmolTalk + MMLU) |
| Text ratio | 30% pure text, 70% WASM conversations |
| Int32 range | Full range with weighted band distribution |
| Gradient checkpointing | Enabled |
| Base model | karpathy/nanochat-d34 |
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    subfolder="epoch_1_batch_1500",
)
tok = AutoTokenizer.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
)

prompt_ids = tok.encode_chat("What is 3 + 4?")
generated, wasm_outputs, trace = model.generate_chat(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.8,
    top_k=50,
    return_outputs=True,
)

# Keep only text tokens (ids below 65536) from the newly generated portion
text_tokens = [t for t in generated[len(prompt_ids):] if 0 < t < 65536]
response = tok.decode(text_tokens)
print(response)
print(f"WASM outputs: {wasm_outputs}")
```
## Token Contract

The model uses an extended vocabulary where tokens ≥ 65536 are WASM instruction tokens:

- Opcodes: `i32.const` (0x00), `i32.add` (0x01), `output` (0xF0), `halt` (0xFF), etc.
- Operand bytes: encoded as 4-byte big-endian values, offset by 264 in token space
- Feedback tokens: `REPL_RESULT` (261), `BRANCH_TAKEN` (262), `BRANCH_NOT_TAKEN` (263)
- WASM pad: used during replay sequences for lockstep token alignment
## The WASM Coprocessor
The frozen compute stream is a hand-compiled 8-layer transformer:
- Layer 0 (13 heads): Opcode identification via one-hot matching
- Layer 1 (1 head): Stack depth accumulation via sum-attention
- Layer 2 (0 heads): Depth squaring (FFN only)
- Layer 3 (8 heads): Bit extraction for AND/OR operations
- Layer 4 (2 heads): Stack retrieval + full arithmetic FFN
- Layer 5 (1 head): Local variable retrieval
- Layer 6 (1 head): Memory load/store
- Layer 7 (4 heads): Filesystem I/O via cross-attention
Supports 25 WASM operations including arithmetic, comparisons, memory, locals, filesystem I/O, and loops with conditional branching. 115/115 compliance tests pass at 100% accuracy.
## Related

- Preview 01: earlier Phase 3 checkpoint
- WASM Interpreter Transformer: the standalone frozen coprocessor
- Can LLMs Be Computers?: the research behind this architecture
## License
MIT