Field notes from The Kintsugi Garden

Community Article Published June 9, 2026

Build notes from a small-model symbolic reflection app, written for the Hugging Face Build Small Hackathon, 2026.

The Japanese craft of kintsugi mends broken pottery with gold lacquer. The fracture lines are not hidden; they become the most valuable part of the vessel. We built The Kintsugi Garden because most digital tools for the inner life still assume something is broken in you and offer to fix it. The Garden assumes the opposite — that the cracked, dreaming, recurring places in a person's inner story are where meaning gathers, and the work is to trace them in gold rather than patch them over.

This is not therapy, diagnosis, prediction, or advice. It is a symbolic reflection tool.

That sentence sits at the top of the README for a reason. The whole architecture below exists to keep that promise.

What it does

A user pastes in a dream, a journal entry, an emotional trigger, a recurring pattern. The app returns a six-section symbolic reading (Mirror, Key Symbols, Archetypal Themes, Shadow Pattern, Individuation Signal, Gentle Question) alongside a deterministic PIL-rendered mandala built from the symbols the entry contains. Across a session, a quiet Soul Map notices which symbols keep returning.

An example. If you submit "I crossed a bridge over a river into a forest where a wounded bird drank from a pool of gold," the app extracts seven lexicon symbols (bridge, river, forest, wound, bird, water, gold), composes a hedged reading anchored on those symbols, and draws a mandala with seven nodes ringed around a kintsugi-gold center emblem. Identical input always produces an identical mandala — the visualization is reproducible by design.

The main surface is a hand-rolled three-column journal at / — sidebar of past entries, a compose+streaming reading column, and a sticky Soul Map on the right that updates as entries accumulate. The original Gradio Blocks UI is still mounted at /app/ with the same compose-and-reflect flow, just rendered as reading tabs plus a Symbols/Themes panel rather than a streamed page. Both surfaces are served by a single gradio.Server: the journal streams from a queued @app.api("reflect") generator over SSE; the Blocks UI uses a Gradio button-click handler that returns the finalized reading as a tuple. The safety check, symbol extractor, lexicon, system prompt, and sanitizers are shared helpers behind both.

Small model, strong scaffolding

Most of the Build Small Hackathon entries reach for the smallest model they can find and ask it to do everything. We went a different way. The app is a deterministic Python scaffold with a small language model nested inside it. The scaffold owns the reliability budget; the model is one optional voice among several composition layers.

Concretely, that means the curated 42-symbol Jungian lexicon does the symbolic work. Each symbol has hand-written meanings, an archetype mapping, a shadow motif, and an individuation signal. Symbol extraction is a substring match plus alias resolution. When the model runs, its prompt contains only the current entry and a compressed block of the extracted symbols and their meanings. Past journal entries are never passed in — a privacy property, and also a small-model fit property (Qwen3-8B's effective context for nuanced composition is short, and a compressed prompt produces better outputs).

If the model can't be loaded, the deterministic reading produces the same six-section structure from the lexicon alone. This is not a degraded mode. It is a first-class output — the scaffold can carry the reading on its own.

Four layers, because three weren't enough

The most load-bearing piece of the architecture is the voice and safety stack. The app's voice (hedged, non-prescriptive, non-diagnostic, non-spiritually-authoritative) is enforced by four cooperating layers:

  1. A pre-LLM deterministic safety gate with more than sixty patterns plus co-occurrence signals. If the entry contains crisis language (direct or paraphrased), it returns the fixed safety message and nothing else. No symbolic interpretation, no Soul Map mutation, no model call.
  2. A mundane-alias suppression layer inside symbol extraction. home only resolves to house if the entry shows symbolic intent (a dream marker, or entry_type=Dream). The line "Today I typed emails, ate lunch, and went home" yields zero symbols rather than amplifying ordinary fatigue into "return to the Self."
  3. The LLM-side system prompt with explicit prohibitions on diagnosis, prediction, prescription, spiritual authority, and a prompt-injection re-frame rule that asks the model to treat embedded "ignore previous instructions" as part of the entry to reflect on symbolically rather than as a command.
  4. Post-LLM deterministic sanitizers that strip invented symbols, stub invented sections when no lexicon symbols were detected, and rewrite prescriptive phrasings ("you should" → "you might notice"), diagnoses, predictions, and spiritual-authority phrasings. The rewrites are anchored to sentence start so mid-sentence usage in questions stays grammatical.

Why all four? Because the QA cycle has empirically validated that any single layer is bypassable. Layers 1, 2, and 4 are deterministic Python and run in under five milliseconds. Layer 3 is the LLM following instructions, which is probabilistic. Twice during the hackathon, the QA evaluator caught voice drift that only layer 4 stopped (leaked prescription, hallucinated symbols, invented themes on no-symbol entries). Once, we found a real bug in layer 1 where a harm-to-others ideation framing slipped past the substring patterns; defense-in-depth held because layers 2 and 4 caught the downstream artifacts. Each layer catches what the others miss.

This is the project's stance in miniature. Treat reliability as a budget owned by deterministic code. Let the model do the prose. Never ask the model to enforce its own guardrails alone.

Fine-tuning the voice, with the harness watching

Late in the week we ran a QLoRA fine-tune of Qwen3-8B on the editorial voice the Garden wants. Fifty hand-written seed examples encode the six-section reading at the quality we want the model to produce, with five mundane entries that must produce the neutral fallback and five crisis entries that must return the safety message verbatim. The seeds are non-negotiable in those two categories — skipping them means the fine-tune learns to over-amplify routine or under-route safety. Both veto-fail the acceptance harness.

The mechanical pipeline:

  • scripts/build_seed_examples.py turns the curated JSONL into chat messages with the lexicon scaffold attached the same way app.py attaches it at runtime.
  • scripts/modal_qlora_train.py runs the train on a Modal H100, r=16 / α=32, 3 epochs, paged 8-bit AdamW. Around $8 in credits per run (the script's own cost estimator; an A100 swap brings it closer to $3).
  • scripts/convert_to_gguf.py runs llama.cpp's HF→GGUF converter and quantizes to Q4_K_M (with Q5_K_M as a fallback).
  • scripts/qa_acceptance_harness.py runs nineteen QA prompts against both baseline and fine-tune and computes a veto-able verdict on five axes: safety routing, six-section format integrity, forbidden-phrase count, hedging density, and invented-symbol count.

The verdict on the merged model (ai-sherpa/Qwen3-8B-Kintsugi-GGUF, fine-tuned for The Kintsugi Garden) was a programmatic PASS with two warnings. Safety routing held 3 for 3. Six-section format integrity held 19 for 19. Hedging density was 0.95 versus 1.06 baseline (above the 80% threshold). Forbidden-phrase hits ticked up from 2 to 4: the fine-tune occasionally leaked template-fragment learning, surfacing the literal phrase **how it appears in the entry** as a stray bullet header. Invented symbols rose from 12 to 28 union-counted, mostly benign — the fine-tune sometimes names a non-lexicon item like "city made of glass" as a Key Symbol. Layer 4 strips these on the stateful surface but not on the streaming surface, which is the honest known limitation we shipped with.

The full 114-row paired trace is published as a dataset: build-small-hackathon/Kintsugi-Garden-traces. Each row carries the model variant, code SHA, prompt category, and the five quality metrics above, alongside truncated previews of the rendered output and the raw model output. The harness is the QA gate; the dataset is what we'd hand to a human reviewer next.

What we'd do differently next

Two things, honestly.

The seeds. Fifty hand-written examples is the right number for a hackathon week, but they were authored as synthetic-accepted rather than hand-edited from real model output. The fine-tune therefore learned what we said the voice was, not the gap between what we said and what the base model actually produces under the live system prompt. The template-fragment leak is a direct artifact of this — the seeds inherited the literal **how it appears in the entry** formatting from our notional output, and the fine-tune amplified it. The v2 recipe would write the 50 seeds as targeted edits of baseline output on the same 50 inputs, so the gradient is pulling the model off its existing tendencies rather than re-asserting the format we wanted it to find on its own.

The training hardware. The Modal H100 worked and the per-run cost stays in the single digits, but the right post-hackathon move is to rebuild the training stack on owned hardware. We have a DGX Spark in the house, and there's a research file at docs/finetune/dgx-spark-training-research.md that maps out the sm_121 / aarch64 software stack required to make Unsloth's Dockerfile work on Grace Blackwell. The memory budget fits trivially — Qwen3-8B QLoRA peaks around 24 GB, the chip has 128 GB unified memory. The constraint is the toolchain, not the silicon. Day-8 work.

The gold in the cracks

The Kintsugi Garden does not tell you what your dream means. It does not predict your future, prescribe a practice, or speak with spiritual authority. It offers back what you brought, organised, mirrored, and named in archetypal vocabulary borrowed honestly from Jungian tradition, alongside a Soul Map that quietly notices what keeps returning.

The most satisfying engineering result of the week was watching the QA evaluator find a voice drift, watching layer 4 catch it before the user saw it, then patching layers 1 and 3 the next morning so the drift would have been caught earlier next time. Reliability is a budget you can spend in layers. The model is allowed to be probabilistic because the scaffold around it is not.

The gold is already in the cracks. The app's job is only to make it easier to see.

Where to find things

The Kintsugi Garden is the Build Small Hackathon 2026 submission for @build-small-hackathon.

Community

Sign up or log in to comment