---
license: mit
tags:
- wasm
- interpreter
- hand-compiled
- bytecode-execution
- mechanistic
- not-trained
- sum-attention
- cross-attention
- filesystem
- loops
- control-flow
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# WASM Interpreter Transformer

A **hand-compiled** transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — **not by gradient descent**. No training data, no loss function, no optimizer. Just linear algebra.

## What This Is

This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.

The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. **Stack depth is computed internally** by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.

The transformer supports **filesystem I/O** (open, read, write, close across 4 file descriptors) and **structured loops** with `br_if` conditional branching, executing loops up to 256 iterations via a Continuous Trace with Cycling Positional Encoding mechanism.
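The cumulative-sum depth mechanism can be sketched in a few lines of numpy. The instruction sequence and its deltas below are hypothetical, chosen only to illustrate the idea, not taken from the model's weights:

```python
import numpy as np

# Hypothetical stack deltas for: i32.const, i32.const, i32.add,
# i32.const, fd_write. A push is +1, a binary op nets -1, fd_write is -2.
deltas = np.array([+1, +1, -1, +1, -2])

# A sum-mode attention head that attends with uniform weight over all
# past positions accumulates these deltas -- a running cumulative sum,
# which is exactly the stack depth at each step.
depth = np.cumsum(deltas)
print(depth)  # [1 2 1 2 0]
```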
**112/112 test programs pass with 100% accuracy.**

## Architecture

| Parameter | Value |
|---|---|
| `d_model` | 100 |
| `n_layers` | 8 |
| `heads_per_layer` | [13, 1, 0, 8, 2, 1, 1, 4] |
| `total_heads` | 30 |
| `d_head` | 2 |
| `d_ffn` | 100 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU (ReLU gate) |
| Attention | Hard-max + sum-mode + cross-attention |
| Total parameters | ~316K (all hand-compiled) |

Unlike standard transformers, each layer has a **different number of attention heads** (0 to 13), tailored to the specific computational role of that layer.

## How It Works

### The 8-Layer Pipeline

- **Layer 0** (13 heads): Opcode fetch — 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand retrieval head and 1 single-opcode head
- **Layer 1** (1 head): Stack depth accumulation — 1 **sum-attention** head computes cumulative stack depth as a running sum of push/pop deltas
- **Layer 2** (0 heads): Depth squaring — FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads
- **Layer 3** (8 heads): Bit retrieval — 8 hard-max heads extract individual bits from stack top and stack second for AND/OR operations
- **Layer 4** (2 heads): Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all arithmetic, comparison, and bitwise operations
- **Layer 5** (1 head): Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
- **Layer 6** (1 head): Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
- **Layer 7** (4 heads): Filesystem — 4 **cross-attention** heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset

### Sum-Attention (Cumulative Sums)

Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the **sum-mode** head accumulates **all** past
value vectors. Each instruction's value encodes its stack delta: `+1` for pushes (e.g., `i32.const`), `-1` for pops (e.g., `i32.add`), `-2` for `fd_write`. The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.

### Cross-Attention (Filesystem)

Layer 7 uses **cross-attention** heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as `fd_write` operations modify the filesystem.

### SwiGLU Gating

Each neuron has two weight vectors:

- **Gate**: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- **Value**: Reads the computation inputs (stack top, stack second, operand, bits).

```
output = max(0, gate · x) × (value · x)
```

Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.

### Loop Execution

Loops use a **Continuous Trace with Cycling Positional Encoding** mechanism:

- `loop` and `end_loop` are structural markers (no-ops in the execution trace)
- `br_if` pops a condition from the stack; if non-zero, execution branches back to the loop body start
- Positional encodings cycle using `virtualIP % loopLength` so the transformer sees correct instruction indices across iterations
- Maximum 256 iterations per loop; nested loops are supported

### Key Tricks

- **Multiplication via gating**: For `i32.mul`, the gate equals one operand while the value holds the other. `max(0, TOP) × SECOND = TOP × SECOND`.
- **Comparisons via ReLU pairs**: Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b`.
- **Quadratic unembedding**: `logit(t) = 2t·R - t²` is a downward parabola peaking at `t = RESULT`.
- **Quadratic key trick**: `K = (2j, -j²)`, `Q = (i, 1)` → dot product peaks at `j = i` for exact position matching.
- **Sum-attention for depth**: Instead of precomputing stack depth in PE, one head sums all past stack deltas.
- **Dynamic filesystem cursors**: File read/write offsets are tracked inline during execution — no reference VM pre-run needed.

## Supported WASM Operations

### Arithmetic & Logic

`i32.const`, `i32.add`, `i32.sub`, `i32.mul`, `i32.and`, `i32.or`

### Comparisons

`i32.eq`, `i32.ne`, `i32.lt_s`, `i32.gt_s`, `i32.le_s`, `i32.ge_s`

### Memory & Variables

`i32.load`, `i32.store`, `local.get`, `local.set`, `local.tee`

### Filesystem I/O

`fd_open`, `fd_read`, `fd_write`, `fd_close` (4 file descriptors, 32 bytes per file)

### Control Flow

`loop`, `end_loop`, `br_if` (up to 256 iterations, nested loops supported)

### Output & Termination

`output`, `halt`

## Compliance Test Suite

112 tests across 24 categories, all passing at 100%:

| Category | Tests |
|---|---|
| Core arithmetic & logic | 24 |
| Comparisons | 16 |
| Memory & variables | 14 |
| Filesystem | 8 |
| Filesystem integration | 3 |
| Limits & bounds | 15 |
| Basic loops | 7 |
| Loop + arithmetic | 7 |
| Loop + locals/memory | 4 |
| Loop + filesystem | 6 |
| Loop edge cases | 4 |
| Combined & output | 4 |

## Positional Encoding Note

This model uses **program-specific positional encodings** computed by a compile-time analysis pass, but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is **not** part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1. The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
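Both quadratic tricks listed above rest on the same identity, `2ab - b² = a² - (a - b)²`, which is maximized when `b = a`. A minimal numpy check (illustrative only, not the model's actual weights):

```python
import numpy as np

# Quadratic key trick: K_j = (2j, -j^2), Q_i = (i, 1).
# score(i, j) = 2*i*j - j^2 = i^2 - (j - i)^2, so hard-max picks j == i.
positions = np.arange(8)
K = np.stack([2 * positions, -positions**2], axis=1)  # one key per position
for i in positions:
    q = np.array([i, 1])
    assert (K @ q).argmax() == i  # exact position match

# Quadratic unembedding: logit(t) = 2*t*R - t^2 peaks at t == RESULT.
R = 42                            # the numeric result to emit
tokens = np.arange(256)
logits = 2 * tokens * R - tokens**2
assert logits.argmax() == R       # argmax token equals the result
```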
For loops, positional encodings use **virtual IP cycling** (`instIdx % loopBodyLength`) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.

## How to Use

This model uses a custom architecture. To run it, use the reference implementation:

```python
# Load with safetensors
import json
from safetensors.torch import load_file

weights = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# The model requires a custom forward-pass implementation
# with hard-max, sum-mode, and cross-attention.
# See config.json for head configurations per layer.
# See the reference TypeScript implementation for the complete specification.
```

A complete TypeScript reference implementation is available in the source repository.

## Live Demos

- **Interactive WASM REPL** — type WASM instructions line-by-line and watch the transformer execute them in real time
- **Transformer X-Ray** — step through execution and see every layer, head, and neuron activate
- **Interactive Article Explorer** — explore the concepts behind this model
- **FFN Interpreter Slide Deck** — 15-slide visual explanation of how the FFN interprets bytecode

## Inspiration

This model is inspired by ["Can LLMs Be Computers?"](https://www.percepta.ai/blog/can-llms-be-computers) by Percepta AI.

## License

MIT