Nanochat + WASM Coprocessor (Fused Preview 02)

A dual-stream composed model that fuses a 1.4B-parameter language model (NanochatGPT d34) with a frozen WASM bytecode interpreter transformer via trained cross-attention. The LM generates text and WASM instructions; the coprocessor executes them via real forward passes; results flow back into the LM through cross-attention, enabling the model to think in computation.

What's New in Preview 02

This is the Phase 4 checkpoint, the final architecture with the OUTPUT token design:

  • INPUT encoding: All integers in user questions are encoded as 4-byte big-endian WASM tokens (full int32 range)
  • OUTPUT token: The model learns to predict the OUTPUT opcode, then the coprocessor deterministically fills 4 result bytes
  • Gate masking: OUTPUT opcode has gate=0 (model must predict it), result bytes have gate=1 (deterministic, masked from loss)
  • Running WASM accuracy: ~99.5% at batch 550/7429 of final training epoch
  • Weighted band distribution: Training data exercises the full int32 range with weighted sampling across magnitude bands
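
The weighted band sampling can be sketched as below; the band edges and weights here are illustrative assumptions, not the values actually used in training:

```python
import random

# Hypothetical magnitude bands spanning the signed int32 range; the real
# band edges and weights are not published in this card.
BANDS = [(0, 9), (10, 999), (1_000, 99_999),
         (100_000, 9_999_999), (10_000_000, 2_147_483_647)]
WEIGHTS = [1, 2, 3, 3, 3]  # illustrative weights only

def sample_operand(rng: random.Random) -> int:
    """Pick a band by weight, then a magnitude within it, then a sign."""
    lo, hi = rng.choices(BANDS, weights=WEIGHTS, k=1)[0]
    magnitude = rng.randint(lo, hi)
    return -magnitude if rng.random() < 0.5 else magnitude

rng = random.Random(0)
operands = [sample_operand(rng) for _ in range(1000)]
```

Weighting bands by magnitude rather than sampling integers uniformly ensures small numbers (which would otherwise almost never appear in a uniform int32 draw) are well represented.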

Checkpoint

  • epoch_1_batch_1500/model.safetensors: Phase 4 mid-epoch checkpoint (best available weights from the highest-performing training run)

Architecture

| Component | Details |
|---|---|
| Text Stream (trained) | NanochatGPT d34: d_model=2176, 34 layers, 17 heads, ~1.4B params |
| Compute Stream (frozen) | WASM Interpreter Transformer: d_model=100, 8 layers, 30 heads, ~316K params |
| Bridge (trained) | Cross-attention at layer 10 + all subsequent layers, WasmTokenEmbedding (260→2176), logit bias |
| Total parameters | ~1.4B trainable + ~316K frozen |
| Vocab | 65536 text tokens (tiktoken BPE) + 260 WASM tokens (extended vocabulary) |

How the Dual-Stream Works

  1. Text tokens → processed normally by the LM
  2. WASM instruction tokens → the LM emits them, and the frozen coprocessor immediately executes them
  3. Feedback tokens → coprocessor results (REPL_RESULT, BRANCH_TAKEN, BRANCH_NOT_TAKEN) are fed back via cross-attention
  4. Lockstep execution: each WASM instruction is immediately followed by a feedback token, creating instruction-feedback pairs that the LM sees together
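
The lockstep pairing amounts to a simple interleave of the two streams; the token names below stand in for the real numeric ids:

```python
def interleave_lockstep(instructions, feedback):
    """Pair each executed WASM instruction with the coprocessor's feedback
    token, so the LM always sees instruction-feedback pairs."""
    assert len(instructions) == len(feedback)
    stream = []
    for ins, fb in zip(instructions, feedback):
        stream += [ins, fb]
    return stream

stream = interleave_lockstep(
    ["I32_CONST", "I32_CONST", "I32_ADD"],
    ["REPL_RESULT", "REPL_RESULT", "REPL_RESULT"],
)
```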

The coprocessor is a hand-compiled transformer that executes WASM bytecode via real matrix multiplications. It was not trained: every weight was set by a compiler. It supports arithmetic, comparisons, memory, local variables, filesystem I/O, and loops with conditional branching.
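
The one-hot-matching idea behind the compiled weights can be illustrated with plain matrix arithmetic; the indicator-row construction below is a simplified sketch, not the coprocessor's actual layer-0 weights:

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Hand-set (never trained) indicator rows, one per recognised opcode.
# Opcode byte values follow this card's token contract.
SUPPORTED = [0x00, 0x01, 0xF0, 0xFF]   # i32.const, i32.add, output, halt
W = [one_hot(op, 256) for op in SUPPORTED]

x = one_hot(0x01, 256)   # one-hot embedding of the i32.add opcode byte
y = matvec(W, x)         # exactly one detector row fires
```

Because each row of W is an indicator vector, the matmul output is itself one-hot over opcode classes, so downstream layers can branch on the opcode without any nonlinearity.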

OUTPUT Token Architecture

User: "What is 388372838 + 1158908721?"

Encoded question: text "What is " + [4 WASM byte tokens for 388372838] + text " + " + [4 WASM byte tokens for 1158908721] + text "?"
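
The 4-byte big-endian encoding can be sketched as follows; the token-id arithmetic (65536 base from the extended vocabulary, 264 byte offset from the token contract) is an assumption pieced together from this card's numbers:

```python
import struct

TEXT_VOCAB = 65536   # WASM tokens start here (extended vocabulary)
BYTE_OFFSET = 264    # operand bytes are offset by 264 in WASM token space

def int32_to_byte_tokens(value: int) -> list[int]:
    """Encode a signed int32 as 4 big-endian byte tokens."""
    raw = struct.pack(">i", value)   # 4 bytes, big-endian, signed
    return [TEXT_VOCAB + BYTE_OFFSET + b for b in raw]

def byte_tokens_to_int32(tokens: list[int]) -> int:
    """Inverse mapping: recover the int32 from its 4 byte tokens."""
    raw = bytes(t - TEXT_VOCAB - BYTE_OFFSET for t in tokens)
    return struct.unpack(">i", raw)[0]

tokens = int32_to_byte_tokens(388372838)
```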

Model generates:
  I32_CONST [4 bytes: operand 1] I32_CONST [4 bytes: operand 2] I32_ADD OUTPUT [4 bytes: result] HALT
  "The answer is " OUTPUT [4 deterministic result bytes] "."

Gate mask:
  - I32_CONST, I32_ADD, OUTPUT, HALT: gate=0 (model predicts these)
  - Operand bytes & result bytes: gate=1 (deterministic, masked from loss)
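
The gate mask above can be built mechanically from the token kinds; the names here are illustrative stand-ins for the real token ids:

```python
# gate=0 positions are predicted by the model and contribute to the loss;
# gate=1 positions are filled deterministically by the coprocessor and
# masked out of the loss.
DETERMINISTIC = {"OPERAND_BYTE", "RESULT_BYTE"}

def gate_mask(token_kinds):
    return [1 if kind in DETERMINISTIC else 0 for kind in token_kinds]

seq = (["I32_CONST"] + ["OPERAND_BYTE"] * 4
       + ["I32_CONST"] + ["OPERAND_BYTE"] * 4
       + ["I32_ADD", "OUTPUT"] + ["RESULT_BYTE"] * 4
       + ["HALT"])
mask = gate_mask(seq)
```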

Cross-Attention Bridge

  • Layer 10: Primary injection point where cross-attention reads coprocessor hidden states
  • Layers 11-33: Additional cross-attention heads (gate-initialized near zero) refine the compute signal
  • WasmTokenEmbedding: Learned 260×2176 embedding mapping WASM tokens to LM representation space
  • wasm_logit_bias: Learned bias controlling WASM token generation probability
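
"Gate-initialized near zero" can be sketched as a scalar gate on the cross-attention branch; the tanh parameterisation below is an assumption (the card only says the gates start near zero):

```python
import math

def gated_residual(h, cross_out, gate_param):
    """Mix the cross-attention branch into the residual stream through a
    learned scalar gate; gate_param starts at 0.0, so the bridge is a
    no-op at initialisation and training opens it gradually."""
    g = math.tanh(gate_param)
    return [hi + g * ci for hi, ci in zip(h, cross_out)]

h = [0.5, -1.0, 2.0]          # toy residual-stream activations
cross = [10.0, 10.0, 10.0]    # toy cross-attention output
at_init = gated_residual(h, cross, 0.0)   # closed gate: identity
```

Starting the gates at zero preserves the pretrained LM's behaviour exactly at the beginning of fine-tuning, which is why layers 11-33 can be added without destabilising the base model.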

Training

| Parameter | Value |
|---|---|
| GPU | NVIDIA B200 (192GB HBM3e) |
| Optimizer | MuonAdamW (Muon for matrix params, AdamW for scalars) |
| Precision | FP8 (Blackwell native) with bf16 master weights |
| Phase | Phase 4 (push to 98%+ accuracy) |
| Data | ~178K WASM programs + text conversations (SmolTalk + MMLU) |
| Text ratio | 30% pure text, 70% WASM conversations |
| Int32 range | Full range with weighted band distribution |
| Gradient checkpointing | Enabled |
| Base model | karpathy/nanochat-d34 |

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fused model weights from the Phase 4 mid-epoch checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    subfolder="epoch_1_batch_1500",
)

tok = AutoTokenizer.from_pretrained(
    "eastlondoner/nanochat-wasm-fused-preview-02",
    trust_remote_code=True,
)

prompt_ids = tok.encode_chat("What is 3 + 4?")

generated, wasm_outputs, trace = model.generate_chat(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.8,
    top_k=50,
    return_outputs=True,
)

# Keep only text-vocabulary tokens; WASM tokens sit at ids >= 65536.
text_tokens = [t for t in generated[len(prompt_ids):] if 0 < t < 65536]
response = tok.decode(text_tokens)
print(response)
print(f"WASM outputs: {wasm_outputs}")

Token Contract

The model uses an extended vocabulary where tokens ≥ 65536 are WASM instruction tokens:

  • Opcodes: i32.const (0x00), i32.add (0x01), output (0xF0), halt (0xFF), etc.
  • Operand bytes: Encoded as 4-byte big-endian values offset by 264 in token space
  • Feedback tokens: REPL_RESULT (261), BRANCH_TAKEN (262), BRANCH_NOT_TAKEN (263)
  • WASM pad: Used during replay sequences for lockstep token alignment
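
The extended-vocabulary layout implied by these numbers can be sketched as a simple offset; mapping local WASM ids to 65536 + id is an assumption consistent with this card, not a published spec:

```python
TEXT_VOCAB = 65536   # tokens 0..65535 are text; WASM tokens start here

def wasm_token_id(local_id: int) -> int:
    """Map a WASM-local token id into the extended vocabulary."""
    return TEXT_VOCAB + local_id

def is_wasm(token_id: int) -> bool:
    return token_id >= TEXT_VOCAB

# Feedback tokens and operand-byte base, per the contract above.
REPL_RESULT = wasm_token_id(261)
BRANCH_TAKEN = wasm_token_id(262)
BRANCH_NOT_TAKEN = wasm_token_id(263)
OPERAND_BYTE_BASE = wasm_token_id(264)   # + byte value (0..255)
```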

The WASM Coprocessor

The frozen compute stream is a hand-compiled 8-layer transformer:

  • Layer 0 (13 heads): Opcode identification via one-hot matching
  • Layer 1 (1 head): Stack depth accumulation via sum-attention
  • Layer 2 (0 heads): Depth squaring (FFN only)
  • Layer 3 (8 heads): Bit extraction for AND/OR operations
  • Layer 4 (2 heads): Stack retrieval + full arithmetic FFN
  • Layer 5 (1 head): Local variable retrieval
  • Layer 6 (1 head): Memory load/store
  • Layer 7 (4 heads): Filesystem I/O via cross-attention

The coprocessor supports 25 WASM operations, including arithmetic, comparisons, memory, locals, filesystem I/O, and loops with conditional branching. All 115 compliance tests pass.
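
A behavioural sketch of the small operation subset used in this card's worked example (the real coprocessor computes the same thing with frozen transformer weights, not an interpreter loop):

```python
MASK32 = 0xFFFFFFFF

def to_i32(x: int) -> int:
    """Wrap to signed 32-bit, matching WASM i32 overflow semantics."""
    x &= MASK32
    return x - (1 << 32) if x >= (1 << 31) else x

def run(program):
    """Execute a list of (opcode, *operands) tuples on a value stack."""
    stack, outputs = [], []
    for op in program:
        if op[0] == "halt":
            break
        elif op[0] == "i32.const":
            stack.append(to_i32(op[1]))
        elif op[0] == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append(to_i32(a + b))
        elif op[0] == "output":
            outputs.append(stack[-1])
    return outputs

result = run([
    ("i32.const", 388372838),
    ("i32.const", 1158908721),
    ("i32.add",),
    ("output",),
    ("halt",),
])
```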

License

MIT
