---
license: mit
tags:
- wasm
- interpreter
- hand-compiled
- bytecode-execution
- mechanistic
- not-trained
- sum-attention
- cross-attention
- filesystem
- loops
- control-flow
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# WASM Interpreter Transformer

A **hand-compiled** transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — **not by gradient descent**. No training data, no loss function, no optimizer. Just linear algebra.

## What This Is

This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.

The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. **Stack depth is computed internally** by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.

The transformer supports **filesystem I/O** (open, read, write, close across 4 file descriptors) and **structured loops** with `br_if` conditional branching, executing loops up to 256 iterations via a Continuous Trace with Cycling Positional Encoding mechanism.
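The cumulative-sum depth mechanism can be sketched in a few lines of numpy. The instruction sequence and its deltas below are hypothetical, chosen only to illustrate the idea, not taken from the model's weights:

```python
import numpy as np

# Hypothetical stack deltas for: i32.const, i32.const, i32.add,
# i32.const, fd_write. A push is +1, a binary op nets -1, fd_write is -2.
deltas = np.array([+1, +1, -1, +1, -2])

# A sum-mode attention head that attends with uniform weight over all
# past positions accumulates these deltas -- a running cumulative sum,
# which is exactly the stack depth at each step.
depth = np.cumsum(deltas)
print(depth)  # [1 2 1 2 0]
```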
**112/112 test programs pass with 100% accuracy.**

## Architecture

| Parameter | Value |
|---|---|
| `d_model` | 100 |
| `n_layers` | 8 |
| `heads_per_layer` | [13, 1, 0, 8, 2, 1, 1, 4] |
| `total_heads` | 30 |
| `d_head` | 2 |
| `d_ffn` | 100 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU (ReLU gate) |
| Attention | Hard-max + sum-mode + cross-attention |
| Total parameters | ~316K (all hand-compiled) |

Unlike standard transformers, each layer has a **different number of attention heads** (0 to 13), tailored to the specific computational role of that layer.

## How It Works

### The 8-Layer Pipeline

- **Layer 0** (13 heads): Opcode fetch — 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand retrieval head and 1 single-opcode head
- **Layer 1** (1 head): Stack depth accumulation — 1 **sum-attention** head computes cumulative stack depth as a running sum of push/pop deltas
- **Layer 2** (0 heads): Depth squaring — FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads
- **Layer 3** (8 heads): Bit retrieval — 8 hard-max heads extract individual bits from stack top and stack second for AND/OR operations
- **Layer 4** (2 heads): Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all arithmetic, comparison, and bitwise operations
- **Layer 5** (1 head): Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
- **Layer 6** (1 head): Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
- **Layer 7** (4 heads): Filesystem — 4 **cross-attention** heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset

### Sum-Attention (Cumulative Sums)

Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the **sum-mode** head accumulates **all** past
value vectors. Each instruction's value encodes its stack delta: `+1` for pushes (e.g., `i32.const`), `-1` for pops (e.g., `i32.add`), `-2` for `fd_write`. The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.

### Cross-Attention (Filesystem)

Layer 7 uses **cross-attention** heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as `fd_write` operations modify the filesystem.

### SwiGLU Gating

Each neuron has two weight vectors:

- **Gate**: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- **Value**: Reads the computation inputs (stack top, stack second, operand, bits).

```
output = max(0, gate · x) × (value · x)
```

Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.

### Loop Execution

Loops use a **Continuous Trace with Cycling Positional Encoding** mechanism:

- `loop` and `end_loop` are structural markers (no-ops in the execution trace)
- `br_if` pops a condition from the stack; if non-zero, execution branches back to the loop body start
- Positional encodings cycle using `virtualIP % loopLength` so the transformer sees correct instruction indices across iterations
- Maximum 256 iterations per loop; nested loops are supported

### Key Tricks

- **Multiplication via gating**: For `i32.mul`, the gate equals one operand while the value holds the other. `max(0, TOP) × SECOND = TOP × SECOND`.
- **Comparisons via ReLU pairs**: Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b`.
- **Quadratic unembedding**: `logit(t) = 2t·R - t²` is a downward parabola peaking at `t = RESULT`.
- **Quadratic key trick**: `K = (2j, -j²)`, `Q = (i, 1)` → dot product peaks at `j = i` for exact position matching.
- **Sum-attention for depth**: Instead of precomputing stack depth in PE, one head sums all past stack deltas.
- **Dynamic filesystem cursors**: File read/write offsets are tracked inline during execution — no reference VM pre-run needed.

## Supported WASM Operations

### Arithmetic & Logic

`i32.const`, `i32.add`, `i32.sub`, `i32.mul`, `i32.and`, `i32.or`

### Comparisons

`i32.eq`, `i32.ne`, `i32.lt_s`, `i32.gt_s`, `i32.le_s`, `i32.ge_s`

### Memory & Variables

`i32.load`, `i32.store`, `local.get`, `local.set`, `local.tee`

### Filesystem I/O

`fd_open`, `fd_read`, `fd_write`, `fd_close` (4 file descriptors, 32 bytes per file)

### Control Flow

`loop`, `end_loop`, `br_if` (up to 256 iterations, nested loops supported)

### Output & Termination

`output`, `halt`

## Compliance Test Suite

112 tests across 24 categories, all passing at 100%:

| Category | Tests |
|---|---|
| Core arithmetic & logic | 24 |
| Comparisons | 16 |
| Memory & variables | 14 |
| Filesystem | 8 |
| Filesystem integration | 3 |
| Limits & bounds | 15 |
| Basic loops | 7 |
| Loop + arithmetic | 7 |
| Loop + locals/memory | 4 |
| Loop + filesystem | 6 |
| Loop edge cases | 4 |
| Combined & output | 4 |

## Positional Encoding Note

This model uses **program-specific positional encodings** computed by a compile-time analysis pass, but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is **not** part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1. The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
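Both quadratic tricks listed above rest on the same identity, `2ab - b² = a² - (a - b)²`, which is maximized when `b = a`. A minimal numpy check (illustrative only, not the model's actual weights):

```python
import numpy as np

# Quadratic key trick: K_j = (2j, -j^2), Q_i = (i, 1).
# score(i, j) = 2*i*j - j^2 = i^2 - (j - i)^2, so hard-max picks j == i.
positions = np.arange(8)
K = np.stack([2 * positions, -positions**2], axis=1)  # one key per position
for i in positions:
    q = np.array([i, 1])
    assert (K @ q).argmax() == i  # exact position match

# Quadratic unembedding: logit(t) = 2*t*R - t^2 peaks at t == RESULT.
R = 42                            # the numeric result to emit
tokens = np.arange(256)
logits = 2 * tokens * R - tokens**2
assert logits.argmax() == R       # argmax token equals the result
```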
For loops, positional encodings use **virtual IP cycling** (`instIdx % loopBodyLength`) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.

## How to Use

This model uses a custom architecture. To run it, use the reference implementation:

```python
# Load with safetensors
import json
from safetensors.torch import load_file

weights = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# The model requires a custom forward-pass implementation
# with hard-max, sum-mode, and cross-attention.
# See config.json for head configurations per layer.
# See the reference TypeScript implementation for the complete specification.
```

A complete TypeScript reference implementation is available in the source repository.

## Live Demos

- **Interactive WASM REPL** — type WASM instructions line-by-line and watch the transformer execute them in real time
- **Transformer X-Ray** — step through execution and see every layer, head, and neuron activate
- **Interactive Article Explorer** — explore the concepts behind this model
- **FFN Interpreter Slide Deck** — 15-slide visual explanation of how the FFN interprets bytecode

## Inspiration

This model is inspired by ["Can LLMs Be Computers?"](https://www.percepta.ai/blog/can-llms-be-computers) by Percepta AI.

## License

MIT