arvjay and Claude Sonnet 4.6 committed
Commit df43567 · 1 Parent(s): b396072

minimise HF Space: remove training pipeline, update corpus index


Deleted all training-only files (parsers, ingest/enrich/download scripts,
metrics, optimizer, dataset generator, Gradio app.py, CLAUDE.md, dev logs).
Updated Chroma index with Swarupananda verses; synced config.py and
knowledge_base.py from main repo (adds OpenRouter backend). Removed
'my boss' example question from streamlit_app.py. Updated .gitignore to
protect secrets and training artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (37)
  1. .gitignore +17 -3
  2. CLAUDE.md +0 -232
  3. app.py +0 -1101
  4. artifacts/chroma/52cdeb15-0631-44ed-8618-782f1d4d27bb/data_level0.bin +1 -1
  5. artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/data_level0.bin +3 -0
  6. artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/header.bin +3 -0
  7. artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/length.bin +3 -0
  8. parsers/__init__.py → artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/link_lists.bin +0 -0
  9. artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/data_level0.bin +3 -0
  10. artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/header.bin +3 -0
  11. artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/length.bin +3 -0
  12. artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/link_lists.bin +0 -0
  13. artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/data_level0.bin +3 -0
  14. artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/header.bin +3 -0
  15. artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/length.bin +3 -0
  16. artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/link_lists.bin +0 -0
  17. artifacts/chroma/chroma.sqlite3 +2 -2
  18. artifacts/optimized_advisor.prompts.txt +0 -87
  19. chat.py +0 -392
  20. config.py +51 -15
  21. data/.gitkeep +0 -0
  22. data/synthetic_questions.jsonl +0 -0
  23. dataset_generator.py +0 -332
  24. download_sources.py +0 -195
  25. enrich_corpus.py +0 -174
  26. enrichment.py +0 -266
  27. ingest_corpus.py +0 -203
  28. knowledge_base.py +2 -43
  29. metrics.py +0 -435
  30. optimize_gepa.py +0 -200
  31. parsers/gita_json.py +0 -236
  32. parsers/sastry_archive.py +0 -249
  33. run_overnight.py +0 -230
  34. smoke_test.py +0 -99
  35. sources_local/.gitkeep +0 -0
  36. sources_registry.py +0 -331
  37. streamlit_app.py +0 -1
.gitignore CHANGED
@@ -1,14 +1,28 @@
 .env
+*.env
 __pycache__/
 *.pyc
 *.pyo
+.DS_Store
+
+# Training pipeline — lives in the main repo, not needed in this HF Space
 data/raw/
 data/enrichment_cache.jsonl
 data/corpus.jsonl
+data/corpus_enriched.jsonl
+data/synthetic_questions.jsonl
+
+# Training artifacts — logs, test runs, GEPA state
 artifacts/gepa_logs/
 artifacts/gepa_state.bin
 artifacts/*.log
-Gita-advisor/
-sources_local/
+artifacts/test_results.json
+artifacts/optimized_advisor.prompts.txt
+
+# User-supplied local sources (not for public repo)
+sources_local/*
+!sources_local/.gitkeep
 sources/
-.DS_Store
+
+# Guard against accidentally nesting the repo inside itself
+Gita-advisor/
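The new ignore rules above can be sanity-checked in a throwaway repository before trusting them to protect secrets. A minimal sketch (assumes `git` is on PATH; uses only a subset of the rules from this commit — the file names `prod.env` and `private.pdf` are made up for illustration):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Subset of the ignore rules added in this commit
cat > .gitignore <<'EOF'
.env
*.env
sources_local/*
!sources_local/.gitkeep
EOF
mkdir sources_local
touch prod.env sources_local/private.pdf sources_local/.gitkeep
# git check-ignore exits 0 when a path matches an ignore rule
git check-ignore -q prod.env && echo "prod.env ignored"
git check-ignore -q sources_local/private.pdf && echo "private.pdf ignored"
git check-ignore -q sources_local/.gitkeep || echo ".gitkeep still tracked"
```

Note the negation pattern works here only because `sources_local/*` excludes the directory's contents rather than the directory itself; git cannot re-include a file whose parent directory is excluded.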
CLAUDE.md DELETED
@@ -1,232 +0,0 @@
- # CLAUDE.md — Project Primer for the Gītā Advisor
-
- This file is read by Claude Code when you open this project. It is also a
- human-readable design memo. Read it once before asking Claude to do anything
- substantial, and keep it updated as the design evolves — when the file lies,
- Claude's behavior degrades.
-
- ## What this project is
-
- A spiritual advisor grounded in Advaita Vedānta as taught by Śaṅkarācārya,
- optimized via DSPy + GEPA against a local LM Studio model. The advisor takes
- a real-life question or vent ("I just got laid off and feel like nothing
- makes sense") and produces a response that is empathetic to the felt
- experience, faithful to the non-dual lineage, and grounded in actual cited
- verses from the Gītā, the principal Upaniṣads, the Brahma Sūtras, and
- prakaraṇa-granthas. Wit is welcome, but only around the cosmic predicament,
- never around the user's pain.
-
- ## The pipeline, in one breath
-
- User text →
- `UnderstandQuery` (felt emotion + surface concern + deeper concern + themes) →
- `PlanRetrieval` (3 diverse search queries) →
- `AdvaitaRetriever.search_many` (multi-view RAG over verse-indexed corpus) →
- `SelectPassages` (pick the 2–4 verses that actually fit) →
- `SynthesizeAdvice` (compose the reply with citations) →
- `dspy.Prediction` carrying the response and its full trace for the metric.
-
- Each predictor is a `dspy.ChainOfThought`, so GEPA has a `reasoning` trace to
- inspect during reflection. The retriever is not optimized — vector search
- isn't text — but the *queries given to it* are, which is what `PlanRetrieval`
- exists to evolve.
-
- ## The two architectural choices that matter most
-
- **Verse as the unit of retrieval.** Scripture is not arbitrary prose; the
- natural unit is the verse (śloka, mantra, sūtra). The corpus is therefore
- indexed by `verse_id` (e.g. `bhagavad_gita_02_47`), which has a stable
- human-readable form (`BG 2.47`). Citations from the advisor are exact-match
- verifiable against the retrieved set, which gives the metric a sharp signal
- to feed back into GEPA's reflection step.
-
- **Multi-view embeddings to bridge the language gap.** Users do not write in
- the vocabulary of scripture — they say "I'm anxious about my career," not
- "I'm experiencing rāga toward kāmya-karma." So we use the local LLM, in a
- one-time offline pass, to enrich each verse with structured fields that
- speak the user's language: a paraphrase, themes, life situations, emotions
- addressed, practical teaching, and five hypothetical first-person questions.
- Three separate embeddings per verse — `literal_view`, `bhashya_view`, and
- `advisor_view` — let queries phrased in any register find the right verse.
-
- The advisor view dominates retrieval (weight 0.55) because that is where the
- language gap closes; the literal and bhāṣya views (0.25, 0.20) act as
- insurance against the enrichment pipeline missing a topic.
-
- ## File map
-
- ```
- gita_advisor/
- ├── config.py             # paths, LM Studio URL, model strings, embed config
- ├── sources_registry.py   # central catalog of every open source we use
- ├── download_sources.py   # downloads everything to data/raw/<source_key>/
- ├── corpus.py             # Verse / EnrichedVerse dataclasses + JSONL I/O
- ├── parsers/              # one module per source format
- │   ├── gita_json.py      # ↳ gita/gita verse-indexed JSON (Unlicense)
- │   └── sastry_archive.py # ↳ Sastry 1897 OCR text from archive.org (PD)
- ├── ingest_corpus.py      # runs parsers, merges by verse_ref → corpus.jsonl
- ├── enrichment.py         # DSPy module: Verse → EnrichedVerse via local LLM
- ├── enrich_corpus.py      # batch enrichment with caching; long-running
- ├── knowledge_base.py     # 3-view Chroma index; AdvaitaRetriever
- ├── signatures.py         # the four DSPy signatures GEPA optimizes
- ├── advisor.py            # composed dspy.Module — what GEPA optimizes
- ├── metrics.py            # rule-based + LLM-judge composite, with feedback
- ├── dataset_generator.py  # synthesizes ~500 life-situation questions
- ├── optimize_gepa.py      # runs GEPA over the advisor with the metric
- ├── chat.py               # interactive CLI — load optimized advisor, chat
- ├── smoke_test.py         # 5-step pipeline check before committing time
- ├── data/
- │   ├── raw/                    # pristine downloads, one folder per source key
- │   ├── corpus.jsonl            # parsed Verses, merged across sources
- │   ├── corpus_enriched.jsonl   # Verses + LLM-extracted fields
- │   └── enrichment_cache.jsonl  # append-only cache for resumable enrichment
- └── artifacts/
-     ├── chroma/                 # the three view-collections
-     └── optimized_advisor.json  # GEPA's compiled program
- ```
-
- ## Source provenance, in one place
-
- Every source must be unambiguously open. The four pillars currently enabled
- or staged are described below in prose so the rationale doesn't get buried.
-
- The `gita/gita` repository on GitHub provides the spine of the Gītā corpus.
- It is a verse-indexed JSON dataset with Sanskrit, IAST transliteration, and
- word-by-word glosses, released under the Unlicense (a public-domain
- dedication). We pull it via a static-file mirror at
- `ravisiyer.github.io/gita-data/v1/` so a single `requests.get` is enough;
- cloning the whole repo also works.
-
- Alladi Mahadeva Sastry's 1897 translation of Śaṅkara's Gītā Bhāṣya lives on
- archive.org as full OCR text. It is the only complete English translation
- of Śaṅkara's Gītā commentary that is unambiguously in the public domain
- (Sastry died in 1926, the work itself dates to 1897). The OCR has
- predictable noise — broken hyphens, occasional "rn" → "m" — and
- `parsers/sastry_archive.py` is patient about it.
-
- The wisdomlib site mirrors the *Sacred Books of the East* series and other
- public-domain Indology — Telang's 1882 Gītā, Mundaka with Śaṅkara, etc.
- The `wisdomlib_html` parser is registered but not yet implemented; this is
- on the to-do list. `sacred-texts.com` carries the same content but blocks
- some HTTP fetchers, so on the Mac you can use either.
-
- What we deliberately do not include: Swami Gambhirananda's translations
- (Advaita Ashrama copyright, mid-20th c.), modern Ramaṇa or Nisargadatta
- editions, ISKCON's Prabhupada commentary. If you want any of these, place
- your own copies in `sources_local/` under your own license judgment.
-
- ## The pipeline of commands
-
- The first time, in order:
-
- ```bash
- pip install -r requirements.txt
-
- # 1. Download the registered open sources to data/raw/. Polite (1 req/s/host),
- #    idempotent (skips files already present). Re-run with --force to refresh.
- python download_sources.py
-
- # 2. Parse the raw downloads into a unified verse corpus. Merges Gītā verse
- #    text with Śaṅkara's bhāṣya by verse_ref. Outputs data/corpus.jsonl.
- python ingest_corpus.py
-
- # 3. Run the local LLM over every verse to extract paraphrase + life
- #    situations + emotions + hypothetical questions. SLOW — several hours,
- #    overnight is normal. Resumable via append-mode cache, so kill -9 is safe.
- #    Outputs data/corpus_enriched.jsonl.
- python enrich_corpus.py
- # Smoke-test on 50 verses first if you want to verify the prompt is producing
- # good output before committing the overnight run:
- python enrich_corpus.py --limit 50
-
- # 4. Build the three Chroma view-indices from the enriched corpus.
- python knowledge_base.py --build
-
- # 5. Sanity-check the pipeline on one user question.
- python smoke_test.py "I just got laid off and feel like nothing matters anymore"
-
- # 6. Generate the synthetic dataset of ~500 user questions for GEPA training.
- python dataset_generator.py --n 500
-
- # 7. Run GEPA optimization. Also long — start with --auto light to verify, then
- #    re-run at --auto medium for the real pass.
- python optimize_gepa.py --auto medium
-
- # 8. Open the chat CLI with the optimized program loaded.
- python chat.py
- ```
-
- After the first run, only steps 4–8 normally re-run. Steps 1–3 are one-time
- unless you change sources or the enrichment prompt.
-
- ## Two things to watch out for
-
- **LM Studio model name.** The exact string `google/gemma-4-26b-a4b` (or
- whatever you settle on) goes in `config.py` as `LOCAL_MODEL`, and DSPy
- prefixes it with `openai/` to route through the OpenAI client. If LM Studio
- reports a different model identifier in its API, copy-paste verbatim.
-
- **Failed enrichments.** The local model occasionally produces malformed
- structured output. The enricher retries twice and, on persistent failure,
- stamps `enrichment_model = "FAILED: <reason>"` on the verse. The verse is
- still indexed on its literal text and bhāṣya, just without the advisor view.
- After the full pass, run `python enrich_corpus.py --only-failed` to retry
- just those, perhaps after tuning the prompt in `enrichment.py`.
-
- ## What is not yet done
-
- The Sastry parser produces verse-attached bhāṣya but the verse-text /
- bhāṣya split is heuristic; spot-check a few verses (BG 2.47, BG 18.66 are
- good canaries) and tighten `_build_verse` if needed.
-
- The `wisdomlib_html` parser and the `thibaut_sbe` (Brahma Sūtra) parser are
- registered but stubbed — adding either is a single-file change. They are
- disabled in `sources_registry.py` until written.
-
- The metric still has the rule-based hooks for therapy clichés, length, and
- non-dual register but does not yet look at the new EnrichedVerse fields
- (`emotions_addressed`, `themes`) for empathy verification. There is a clear
- win there: when the user's `felt_emotion` appears in a selected verse's
- `emotions_addressed` list, that is strong evidence of empathic-fit retrieval.
-
- The dataset generator was written before the schema shift; spot-check that
- its output still flows through the pipeline cleanly.
-
- ## How to talk to me when working in this project
-
- The most useful prompts are concrete and bounded. "Tighten the verse-bhāṣya
- split heuristic in `parsers/sastry_archive.py` and run it on the first three
- chapters; show me the BG 2.47 record" is a good prompt. "Improve the
- parser" is not.
-
- When something is broken, read the relevant file end-to-end before patching.
- The comments in this project are unusually heavy because the design has many
- small choices that stop being obvious six months from now. If a comment
- disagrees with the code, the comment is more likely to be right and you
- should ask whether the code drifted, not whether the comment did.
-
- When designing a new piece, start by asking what `Verse` / `EnrichedVerse`
- field carries the information, before reaching for new state. The data
- model is meant to be the contract between modules; adding ad-hoc fields on
- the side is how RAG systems become spaghetti.
-
- ## Pinned design commitments (do not silently break these)
-
- The advisor is grounded in Advaita Vedānta as Śaṅkara taught it. We do not
- import dualistic theology, and we do not reduce Advaita to "we are all one"
- pop-spirituality. We hold the two-truths distinction (vyāvahārika and
- pāramārthika) actively, and we do not collapse the user's lived suffering
- into "it's all māyā anyway." When a teaching has a Sanskrit name with a
- precise meaning, we use the Sanskrit name with a brief gloss rather than
- substituting an approximate English word.
-
- Citations are exact and verifiable. "BG 2.47" in a response means the verse
- was in the retrieved set. The metric enforces this; do not weaken it.
-
- The advisor is not therapy and is not a chatbot friend. It is a teacher in
- the tradition of the lineage. It is allowed to push back, to challenge a
- question's premise, and to recommend silence over more words.
-
- The retriever is permissive; the selector is picky. Do not move filtering
- upstream into the retriever — once a verse is filtered out at retrieval,
- no later stage can recover it.
app.py DELETED
@@ -1,1101 +0,0 @@
- """
- app.py — Enhanced Gradio web interface for the Gītā Advisor.
-
- Features:
- - Real-time stage progress during inference (◌ understanding → searching → composing)
- - Character-by-character response streaming
- - Verse explorer: select any cited source to read Sanskrit, translation, Śaṅkara's bhāṣya
- - Warm spiritual aesthetic
- """
-
- from __future__ import annotations
- import json
- import re
- import threading
- import time
- from types import SimpleNamespace
-
- import gradio as gr
- import dspy
- from openai import OpenAI
-
- import config
- from advisor import load_optimized
- from knowledge_base import AdvaitaRetriever, format_passages_for_llm
- from corpus import EnrichedVerse, Verse, read_jsonl_enriched, read_jsonl_verses
-
-
- class _ExplainInContext(dspy.Signature):
-     """You are the Gītā Advisor continuing a conversation. The user has asked
-     you to unpack a specific verse or passage you cited. Explain what it means
-     and why it speaks precisely to their situation — go deeper than the initial
-     response did. Reference the user's words. Close with one concrete way to
-     hold or work with this text this week."""
-
-     verse_ref: str = dspy.InputField()
-     verse_content: str = dspy.InputField(
-         desc="Translation, original text (if available), and Śaṅkara's commentary."
-     )
-     conversation_context: str = dspy.InputField(
-         desc="The user's question and the advisor's response where this verse was cited."
-     )
-     explanation: str = dspy.OutputField(
-         desc="150-250 words. Grounded in Advaita. Do not merely restate the translation. "
-         "End with a practical suggestion for this week."
-     )
-
- # ── startup — runs once when the Space boots ──────────────────────────────────
- config.configure_dspy(backend="hf")
- _advisor = load_optimized()
- _retriever = AdvaitaRetriever()
- _retriever._ensure()
-
- # Direct OpenAI-compatible client for streaming synthesis — bypasses DSPy for
- # the final step so tokens reach the browser as they're generated.
- _synthesis_client = OpenAI(
-     base_url=config.HF_ROUTER_BASE,
-     api_key=config.HF_TOKEN,
- )
-
-
- def _load_verse_lookup() -> dict[str, Verse]:
-     lookup: dict[str, Verse] = {}
-     enriched = config.DATA_DIR / "corpus_enriched.jsonl"
-     plain = config.DATA_DIR / "corpus.jsonl"
-     if enriched.exists():
-         for v in read_jsonl_enriched(enriched):
-             lookup[v.verse_ref.lower().strip()] = v
-     elif plain.exists():
-         for v in read_jsonl_verses(plain):
-             lookup[v.verse_ref.lower().strip()] = v
-     return lookup
-
-
- _verse_lookup = _load_verse_lookup()
-
-
- # ── helpers ────────────────────────────────────────────────────────────────────
-
- def _to_dspy_history(gradio_history: list) -> dspy.History:
-     """Convert Gradio messages list to dspy.History, stripping source footers."""
-     msgs = []
-     i = 0
-     while i + 1 < len(gradio_history):
-         u, a = gradio_history[i], gradio_history[i + 1]
-         if u.get("role") == "user" and a.get("role") == "assistant":
-             content = a["content"]
-             if "\n\n---\n" in content:
-                 content = content.split("\n\n---\n")[0]
-             msgs.append({
-                 "user_question": u["content"],
-                 "response": content,
-                 "sources_cited": [],
-             })
-         i += 2
-     return dspy.History(messages=msgs)
-
-
- def _render_verse_html(verse: Verse) -> str:
-     ev = verse if isinstance(verse, EnrichedVerse) else None
-     parts: list[str] = []
-
-     ref = verse.verse_ref
-     work = getattr(verse, "work_display", None) or getattr(verse, "work", "")
-     section = getattr(verse, "section_display", None) or getattr(verse, "section", "") or ""
-     subtitle = f"{work} — {section}" if section else work
-
-     parts.append(
-         f'<div class="vp-header">'
-         f' <span class="vp-ref">{ref}</span>'
-         f' <span class="vp-subtitle">{subtitle}</span>'
-         f'</div>'
-     )
-
-     if getattr(verse, "sanskrit", None):
-         parts.append(f'<div class="vp-sanskrit">{verse.sanskrit}</div>')
-     if getattr(verse, "transliteration", None):
-         parts.append(f'<div class="vp-iast">{verse.transliteration}</div>')
-
-     if getattr(verse, "translation", None):
-         tr = getattr(verse, "translator", None)
-         label = f"Translation ({tr})" if tr else "Translation"
-         parts.append(f'<div class="vp-label">{label}</div>')
-         parts.append(f'<div class="vp-body">{verse.translation}</div>')
-
-     if getattr(verse, "bhashya", None):
-         btr = getattr(verse, "bhashya_translator", None)
-         note = f" ({btr})" if btr else ""
-         preview = verse.bhashya[:900] + ("…" if len(verse.bhashya) > 900 else "")
-         parts.append(f'<div class="vp-label">Śaṅkara\'s Bhāṣya{note}</div>')
-         parts.append(f'<div class="vp-body vp-dim">{preview}</div>')
-
-     if ev:
-         if getattr(ev, "paraphrase", None):
-             parts.append('<div class="vp-label">Teaching</div>')
-             parts.append(f'<div class="vp-body">{ev.paraphrase}</div>')
-         if getattr(ev, "themes", None):
-             tags = "".join(f'<span class="vp-tag">{t}</span>' for t in ev.themes)
-             parts.append(f'<div class="vp-tags">{tags}</div>')
-         if getattr(ev, "practical_teaching", None):
-             parts.append('<div class="vp-label">Practical Shift</div>')
-             parts.append(f'<div class="vp-body vp-gold">{ev.practical_teaching}</div>')
-
-     return '<div class="verse-panel">' + "\n".join(parts) + "</div>"
-
-
- # ── CSS ────────────────────────────────────────────────────────────────────────
-
- CSS = """
- @import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;1,400&family=EB+Garamond:ital,wght@0,400;0,500;1,400;1,500&family=Lato:ital,wght@0,300;0,400;0,700;1,300;1,400&display=swap');
-
- /* ── palette ──────────────────────────────────────────────────────────────── */
- :root {
-   --gold: #C9A84C;
-   --gold-dim: #9A7830;
-   --gold-glow: rgba(201,168,76,0.18);
-   --bg: #100C07;
-   --bg-mid: #1C1208;
-   --bg-card: #251808;
-   --bg-user: #3D240C;
-   --bg-bot: #180F05;
-   --border: #5A3C18;
-   --border-dim: #3A2408;
-   --text: #ECD8B4;
-   --text-dim: #A08860;
-   --text-muted: #6A5030;
-   --radius: 10px;
-   --font-serif: 'EB Garamond', Georgia, 'Times New Roman', serif;
-   --font-sans: 'Lato', system-ui, sans-serif;
-   --font-display: 'Playfair Display', Georgia, serif;
- }
-
- /* ── base ─────────────────────────────────────────────────────────────────── */
- body,
- .gradio-container,
- .main,
- footer {
-   background: var(--bg) !important;
-   color: var(--text) !important;
-   font-family: var(--font-sans) !important;
- }
-
- .gradio-container { max-width: 880px !important; margin: 0 auto !important; }
-
- footer { display: none !important; }
-
- /* ── header ───────────────────────────────────────────────────────────────── */
- .app-header {
-   text-align: center;
-   padding: 2.4rem 1rem 1.6rem;
-   border-bottom: 1px solid var(--border);
-   margin-bottom: 0.5rem;
- }
- .app-title {
-   font-family: var(--font-display);
-   font-size: 2.6rem;
-   color: var(--gold);
-   letter-spacing: 0.05em;
-   line-height: 1.15;
-   margin: 0 0 0.5rem;
-   font-weight: 400;
- }
- .app-subtitle {
-   color: var(--text-muted);
-   font-size: 0.95rem;
-   font-weight: 300;
-   font-style: italic;
-   font-family: var(--font-serif);
-   letter-spacing: 0.03em;
- }
- .app-ornament {
-   margin-top: 1rem;
-   color: var(--gold-dim);
-   font-size: 0.85rem;
-   letter-spacing: 0.7em;
- }
-
- /* ── chatbot container ────────────────────────────────────────────────────── */
- #chatbot {
-   border: 1px solid var(--border) !important;
-   border-radius: var(--radius) !important;
-   background: var(--bg) !important;
- }
- #chatbot .wrap { background: var(--bg) !important; }
-
- /* user bubble */
- #chatbot .user.message {
-   background: var(--bg-user) !important;
-   border: 1px solid var(--border) !important;
-   border-radius: var(--radius) var(--radius) 3px var(--radius) !important;
-   padding: 0.8rem 1.1rem !important;
-   box-shadow: inset 0 1px 0 rgba(255,220,140,0.07) !important;
- }
- /* assistant bubble */
- #chatbot .bot.message {
-   background: var(--bg-bot) !important;
-   border: 1px solid var(--gold-dim) !important;
-   border-left: 3px solid var(--gold-dim) !important;
-   border-radius: 3px var(--radius) var(--radius) var(--radius) !important;
-   padding: 1.1rem 1.4rem 1.1rem 1.6rem !important;
-   line-height: 1.9 !important;
-   font-family: var(--font-serif) !important;
-   font-size: 1.08rem !important;
-   color: var(--text) !important;
- }
- /* user bubble */
- #chatbot .user.message {
-   background: var(--bg-user) !important;
-   border: 1px solid var(--border) !important;
-   border-radius: var(--radius) var(--radius) 3px var(--radius) !important;
-   padding: 0.8rem 1.1rem !important;
-   box-shadow: inset 0 1px 0 rgba(255,220,140,0.07) !important;
-   font-family: var(--font-sans) !important;
-   font-size: 0.96rem !important;
- }
- /* inner content divs — transparent so bubble bg shows through */
- #chatbot .message.panel-full-width,
- #chatbot [data-testid="user"],
- #chatbot [data-testid="bot"] {
-   background: transparent !important;
-   color: var(--text) !important;
-   padding: 0 !important;
- }
-
- /* markdown inside bubbles */
- #chatbot .bot.message p { margin: 0.55em 0 !important; }
- #chatbot .user.message p { margin: 0.3em 0 !important; }
- #chatbot .message hr { border-color: var(--border) !important; margin: 0.6em 0 !important; }
- #chatbot .message code {
-   background: var(--bg-card) !important;
-   color: var(--gold) !important;
-   padding: 0.1em 0.4em !important;
-   border-radius: 4px !important;
-   font-size: 0.88em !important;
-   font-family: var(--font-sans) !important;
- }
- #chatbot .message em { color: var(--text-dim) !important; }
- #chatbot .message strong { color: var(--text) !important; font-weight: 500 !important; }
-
- /* placeholder */
- #chatbot .placeholder {
-   color: var(--text-muted) !important;
-   font-style: italic !important;
- }
-
- /* ── stage status ─────────────────────────────────────────────────────────── */
- #stage-status {
-   min-height: 1.8rem;
-   text-align: center;
-   padding: 0.4rem 0.5rem;
-   font-family: 'Lato', sans-serif;
-   line-height: 1.55;
- }
- #stage-status .stage-spinner {
-   color: var(--gold);
-   font-style: italic;
-   font-size: 0.88rem;
-   opacity: 0.9;
- }
- #stage-status .stage-card {
-   display: inline-block;
-   text-align: left;
-   max-width: 90%;
-   font-size: 0.84rem;
- }
- #stage-status .stage-row {
-   display: flex;
-   align-items: baseline;
-   flex-wrap: wrap;
-   gap: 0.3rem 0.6rem;
-   margin-bottom: 0.2rem;
- }
- #stage-status .stage-icon { color: var(--gold-dim); }
- #stage-status .stage-label {
-   color: var(--text-muted);
-   font-size: 0.73rem;
-   text-transform: uppercase;
-   letter-spacing: 0.09em;
-   min-width: 4.5rem;
- }
- #stage-status .stage-val { color: var(--text); font-style: italic; }
- #stage-status .stage-chip {
-   display: inline-block;
-   border: 1px solid var(--border);
-   border-radius: 3px;
-   padding: 0 0.35rem;
-   font-size: 0.78rem;
-   color: var(--text-dim);
-   font-style: normal;
-   margin: 0.1rem 0.1rem 0 0;
- }
- #stage-status .stage-source {
-   display: inline-block;
-   background: var(--bg-card);
-   border: 1px solid var(--gold-dim);
-   border-radius: 3px;
-   padding: 0 0.35rem;
-   font-size: 0.78rem;
-   color: var(--gold);
-   margin: 0.1rem 0.1rem 0 0;
- }
-
- /* ── input textbox ────────────────────────────────────────────────────────── */
- #msg-input {
-   background: var(--bg-card) !important;
-   border-radius: var(--radius) !important;
- }
- #msg-input label.show_textbox_border {
-   background: var(--bg-card) !important;
-   border: 1px solid var(--border) !important;
-   border-radius: var(--radius) !important;
-   transition: border-color 0.15s, box-shadow 0.15s !important;
- }
- #msg-input label.show_textbox_border:focus-within {
-   border-color: var(--gold-dim) !important;
-   box-shadow: 0 0 0 3px var(--gold-glow) !important;
- }
- #msg-input span.svelte-1hguek3 { display: none !important; } /* hide label text */
- #msg-input textarea {
-   background: var(--bg-card) !important;
-   color: var(--text) !important;
-   font-family: var(--font-serif) !important;
-   font-size: 1.02rem !important;
-   line-height: 1.55 !important;
-   caret-color: var(--gold) !important;
-   resize: none !important;
-   border: none !important;
-   outline: none !important;
- }
- #msg-input textarea::placeholder { color: var(--text-muted) !important; }
-
- /* ── buttons ──────────────────────────────────────────────────────────────── */
- #submit-btn {
-   background: var(--gold-dim) !important;
-   color: #0D0A07 !important;
-   border: none !important;
-   border-radius: var(--radius) !important;
-   font-family: var(--font-sans) !important;
-   font-weight: 700 !important;
-   letter-spacing: 0.05em !important;
-   transition: background 0.18s !important;
-   height: 100% !important;
- }
- #submit-btn:hover { background: var(--gold) !important; cursor: pointer !important; }
-
- #clear-btn {
-   background: transparent !important;
-   color: var(--text-muted) !important;
-   border: 1px solid var(--border) !important;
-   border-radius: var(--radius) !important;
-   font-family: var(--font-sans) !important;
-   transition: color 0.15s, border-color 0.15s !important;
-   width: 46px !important;
-   min-width: 46px !important;
-   max-width: 46px !important;
-   flex-shrink: 0 !important;
-   padding: 0 !important;
-   font-size: 1.1rem !important;
- }
- #clear-btn:hover { color: var(--text-dim) !important; border-color: var(--text-muted) !important; cursor: pointer !important; }
-
- /* ── examples ─────────────────────────────────────────────────────────────── */
- .examples-holder .examples-inner-text { color: var(--text-muted) !important; font-size: 0.8rem !important; }
- .examples-holder table { border: none !important; }
- .examples-holder table td {
-   background: var(--bg-card) !important;
-   border: 1px solid var(--border) !important;
-   color: var(--text-dim) !important;
-   border-radius: 6px !important;
-   font-size: 0.86rem !important;
-   transition: background 0.15s, color 0.15s !important;
-   cursor: pointer !important;
- }
- .examples-holder table td:hover {
-   background: var(--bg-mid) !important;
-   color: var(--text) !important;
-   border-color: var(--gold-dim) !important;
- }
-
- /* ── explorer section ─────────────────────────────────────────────────────── */
- .explorer-wrap {
-   border-top: 1px solid var(--border);
-   margin-top: 1.5rem;
-   padding-top: 1.2rem;
- }
- .explorer-label {
-   color: var(--text-muted);
-   font-size: 0.75rem;
-   text-transform: uppercase;
-   letter-spacing: 0.12em;
-   margin-bottom: 0.6rem;
-   font-family: 'Lato', sans-serif;
- }
-
- #source-dd label { color: var(--text-dim) !important; font-size: 0.82rem !important; }
- #source-dd select {
-   background: var(--bg-card) !important;
-   border: 1px solid var(--border) !important;
-   color: var(--text) !important;
-   border-radius: var(--radius) !important;
- }
-
- /* ── verse panel ──────────────────────────────────────────────────────────── */
- .verse-panel {
-   background: var(--bg-mid);
-   border: 1px solid var(--gold-dim);
-   border-radius: var(--radius);
-   padding: 1.6rem 2rem 1.8rem;
-   margin-top: 0.8rem;
-   line-height: 1.85;
-   font-family: var(--font-serif);
- }
- .vp-header {
-   display: flex;
-   justify-content: space-between;
-   align-items: baseline;
-   flex-wrap: wrap;
-   gap: 0.4rem;
-   border-bottom: 1px solid var(--border);
-   padding-bottom: 0.8rem;
-   margin-bottom: 1.1rem;
- }
- .vp-ref {
-   font-family: var(--font-display);
-   font-size: 1.2rem;
-   color: var(--gold);
-   font-weight: 400;
-   letter-spacing: 0.04em;
- }
- .vp-subtitle {
-   color: var(--text-muted);
-   font-size: 0.82rem;
-   font-style: italic;
-   font-family: var(--font-sans);
- }
- .vp-sanskrit {
-   font-size: 1.05rem;
-   color: var(--text);
-   font-style: italic;
-   margin-bottom: 0.2rem;
-   font-family: var(--font-serif);
- }
- .vp-iast {
-   color: var(--text-dim);
-   font-size: 0.9rem;
-   font-style: italic;
-   margin-bottom: 1rem;
- font-family: var(--font-serif);
488
- }
489
- .vp-label {
490
- color: var(--gold-dim);
491
- font-size: 0.70rem;
492
- text-transform: uppercase;
493
- letter-spacing: 0.13em;
494
- margin-top: 1.1rem;
495
- margin-bottom: 0.35rem;
496
- font-family: var(--font-sans);
497
- font-weight: 700;
498
- }
499
- .vp-body { color: var(--text); font-size: 1rem; font-family: var(--font-serif); line-height: 1.85; }
500
- .vp-dim { color: var(--text-dim) !important; font-style: italic; font-size: 0.93rem !important; }
501
- .vp-gold { color: var(--gold) !important; font-style: italic; }
502
-
503
- .vp-tags { display: flex; flex-wrap: wrap; gap: 0.35rem; margin-top: 0.8rem; }
504
- .vp-tag {
505
- background: var(--bg-card);
506
- border: 1px solid var(--border);
507
- color: var(--text-muted);
508
- font-size: 0.73rem;
509
- padding: 0.12rem 0.6rem;
510
- border-radius: 20px;
511
- font-family: var(--font-sans);
512
- }
513
-
514
- /* ── explain button & output ──────────────────────────────────────────────── */
515
- #explain-btn {
516
- background: transparent !important;
517
- color: var(--text-muted) !important;
518
- border: 1px solid var(--border-dim) !important;
519
- border-radius: 6px !important;
520
- font-family: 'Lato', sans-serif !important;
521
- font-size: 0.82rem !important;
522
- letter-spacing: 0.05em !important;
523
- margin-top: 0.6rem !important;
524
- transition: color 0.15s, border-color 0.15s, opacity 0.15s !important;
525
- opacity: 0.4 !important;
526
- }
527
- #explain-btn:not([disabled]):not(.disabled) {
528
- color: var(--gold-dim) !important;
529
- opacity: 1 !important;
530
- }
531
- #explain-btn:not([disabled]):not(.disabled):hover {
532
- color: var(--gold) !important;
533
- border-color: var(--gold-dim) !important;
534
- }
535
-
536
- .explain-panel {
537
- background: var(--bg-mid);
538
- border-left: 3px solid var(--gold-dim);
539
- border-radius: 0 var(--radius) var(--radius) 0;
540
- padding: 1.3rem 1.7rem;
541
- margin-top: 0.8rem;
542
- color: var(--text);
543
- font-size: 1rem;
544
- line-height: 1.9;
545
- font-style: italic;
546
- font-family: var(--font-serif);
547
- }
548
-
549
- /* ── reasoning panel ─────────────────────────────────────────── */
550
- .reasoning-panel {
551
- font-family: var(--font-sans);
552
- font-size: 0.82rem;
553
- line-height: 1.65;
554
- color: var(--text-dim);
555
- }
556
- .reasoning-panel .r-section {
557
- margin-bottom: 1rem;
558
- }
559
- .reasoning-panel .r-label {
560
- color: var(--gold-dim);
561
- font-size: 0.70rem;
562
- text-transform: uppercase;
563
- letter-spacing: 0.12em;
564
- margin-bottom: 0.3rem;
565
- }
566
- .reasoning-panel .r-value {
567
- color: var(--text);
568
- font-size: 0.85rem;
569
- }
570
- .reasoning-panel .r-trace {
571
- white-space: pre-wrap;
572
- word-break: break-word;
573
- color: var(--text-dim);
574
- font-size: 0.80rem;
575
- border-left: 2px solid var(--border);
576
- padding-left: 0.8rem;
577
- margin-top: 0.3rem;
578
- }
579
- /* accordion styling */
580
- #thinking-accordion > .label-wrap { color: var(--text-dim) !important; font-size: 0.82rem; }
581
- #thinking-accordion { background: transparent !important; border: 1px solid var(--border) !important; border-radius: var(--radius) !important; margin-top: 0.5rem; }
582
-
583
- /* ── scrollbar ────────────────────────────────────────────────────────────── */
584
- ::-webkit-scrollbar { width: 4px; height: 4px; }
585
- ::-webkit-scrollbar-track { background: var(--bg); }
586
- ::-webkit-scrollbar-thumb { background: var(--border); border-radius: 3px; }
587
- ::-webkit-scrollbar-thumb:hover { background: var(--gold-dim); }
588
- """
589
-
590
-
591
- # ── streaming synthesis helpers ───────────────────────────────────────────────
-
- _RESPONSE_MARKER = "[[ ## response ## ]]"
- _SOURCES_MARKER = "[[ ## sources_cited ## ]]"
- _REASONING_MARKER = "[[ ## reasoning ## ]]"
-
-
- def _build_synthesis_messages(
-     dspy_hist: dspy.History,
-     message: str,
-     felt_emotion: str,
-     deeper_concern: str,
-     selected_text: str,
- ) -> list[dict]:
-     """Build the exact prompt messages DSPy would send for synthesis.
-
-     Uses the configured ChatAdapter + the GEPA-optimized signature/demos so the
-     optimized instructions are preserved while we gain streaming control.
-     """
-     adapter = dspy.settings.adapter
-     # In DSPy 3.x ChainOfThought wraps a Predict; the extended sig (with
-     # reasoning field) and GEPA-loaded demos live on .predict
-     sig = _advisor.synthesize.predict.signature
-     demos = getattr(_advisor.synthesize.predict, "demos", [])
-     inputs = dict(
-         history=dspy_hist,
-         user_question=message,
-         felt_emotion=felt_emotion,
-         deeper_concern=deeper_concern,
-         selected_passages=selected_text,
-     )
-     return adapter.format(sig, demos, inputs)
-
-
- def _parse_sources_cited(full_text: str) -> list[str]:
-     """Extract sources_cited JSON list from the full streamed completion text."""
-     if _SOURCES_MARKER not in full_text:
-         return []
-     raw = full_text.split(_SOURCES_MARKER, 1)[1].strip()
-     raw = re.split(r"\[\[", raw)[0].strip()  # stop at next field marker
-     try:
-         result = json.loads(raw)
-         return result if isinstance(result, list) else []
-     except Exception:
-         m = re.search(r"\[.*?\]", raw, re.DOTALL)
-         if m:
-             try:
-                 return json.loads(m.group())
-             except Exception:
-                 pass
-     return []
-
-
- def _parse_reasoning(full_text: str) -> str:
-     """Extract the reasoning trace from the full streamed completion text."""
-     if _REASONING_MARKER not in full_text or _RESPONSE_MARKER not in full_text:
-         return ""
-     return full_text.split(_REASONING_MARKER, 1)[1].split(_RESPONSE_MARKER, 1)[0].strip()
-
-
- def _spin(text: str) -> str:
-     return f'<div class="stage-spinner">◌ {text}</div>'
-
-
- def _stage_understand(u) -> str:
-     emotion = getattr(u, "felt_emotion", "") or ""
-     concern = getattr(u, "deeper_concern", "") or ""
-     themes = getattr(u, "vedantic_themes", []) or []
-     themes_html = "".join(f'<span class="stage-chip">{t.split("(")[0].strip()}</span>' for t in themes[:4])
-     rows = []
-     if emotion:
-         rows.append(
-             f'<div class="stage-row">'
-             f'<span class="stage-label">felt</span>'
-             f'<span class="stage-val">{emotion}</span>'
-             f'</div>'
-         )
-     if concern:
-         rows.append(
-             f'<div class="stage-row">'
-             f'<span class="stage-label">concern</span>'
-             f'<span class="stage-val">{concern}</span>'
-             f'</div>'
-         )
-     if themes_html:
-         rows.append(
-             f'<div class="stage-row">'
-             f'<span class="stage-label">themes</span>'
-             f'<span>{themes_html}</span>'
-             f'</div>'
-         )
-     return f'<div class="stage-card">{"".join(rows)}</div>'
-
-
- def _stage_plan(queries: list[str]) -> str:
-     chips = "".join(f'<span class="stage-chip">"{q}"</span>' for q in queries)
-     return (
-         f'<div class="stage-card">'
-         f'<div class="stage-row">'
-         f'<span class="stage-label">searching</span>'
-         f'<span>{chips}</span>'
-         f'</div></div>'
-     )
-
-
- def _stage_retrieve(n: int) -> str:
-     return (
-         f'<div class="stage-card">'
-         f'<div class="stage-row">'
-         f'<span class="stage-label">passages</span>'
-         f'<span class="stage-val">{n} found &nbsp;—&nbsp; selecting…</span>'
-         f'</div></div>'
-     )
-
-
- def _stage_select(sources: list[str]) -> str:
-     chips = "".join(f'<span class="stage-source">{s}</span>' for s in sources)
-     return (
-         f'<div class="stage-card">'
-         f'<div class="stage-row">'
-         f'<span class="stage-label">selected</span>'
-         f'<span>{chips}</span>'
-         f'</div>'
-         f'<div class="stage-row" style="margin-top:0.15rem">'
-         f'<span class="stage-label"></span>'
-         f'<span class="stage-spinner" style="font-size:0.82rem">◌ composing response…</span>'
-         f'</div></div>'
-     )
-
-
- def _build_reasoning_html(pred) -> str:
-     """Render the pipeline's reasoning trace as an HTML block for the accordion."""
-     emotion = getattr(pred, "felt_emotion", "") or ""
-     concern = getattr(pred, "deeper_concern", "") or ""
-     themes = getattr(pred, "vedantic_themes", []) or []
-     queries = getattr(pred, "queries", []) or []
-     reasoning = getattr(pred, "synthesis_reasoning", "") or ""
-     rationale = getattr(pred, "selection_rationale", "") or ""
-
-     def section(label: str, content: str) -> str:
-         return (
-             f'<div class="r-section">'
-             f'<div class="r-label">{label}</div>'
-             f'<div class="r-value">{content}</div>'
-             f'</div>'
-         )
-
-     parts = ['<div class="reasoning-panel">']
-     if emotion:
-         parts.append(section("Felt emotion", emotion))
-     if concern:
-         parts.append(section("Deeper concern", concern))
-     if themes:
-         parts.append(section("Vedāntic themes", " &nbsp;·&nbsp; ".join(themes)))
-     if queries:
-         qs = "".join(f"<li>{q}</li>" for q in queries)
-         parts.append(section("Search queries", f"<ol style='margin:0;padding-left:1.2em'>{qs}</ol>"))
-     if rationale:
-         parts.append(section("Passage selection", rationale))
-     if reasoning:
-         escaped = reasoning.replace("<", "&lt;").replace(">", "&gt;")
-         parts.append(
-             '<div class="r-section">'
-             '<div class="r-label">Model reasoning trace</div>'
-             f'<div class="r-trace">{escaped}</div>'
-             '</div>'
-         )
-     parts.append("</div>")
-     return "\n".join(parts)
-
-
- # ── respond (streaming generator) ─────────────────────────────────────────────
-
- def respond(message: str, history: list):
-     """Drive the five-step pipeline manually so each step's output is shown live."""
-     _no_src = gr.update(choices=[], value=None, visible=False)
-     _noop = gr.update()
-
-     def _emit(hist, stage_content, thinking_content=_noop):
-         return hist, stage_content, None, _no_src, "", thinking_content
-
-     if not message.strip():
-         # _emit already yields one value per output component
-         yield *_emit(history, ""),
-         return
-
-     history = history + [{"role": "user", "content": message}]
-     dspy_hist = _to_dspy_history(history[:-1])
-
-     # ── Step 1: understand ────────────────────────────────────────────────────
-     yield *_emit(history, _spin("understanding your question…")),
-     try:
-         u = _advisor.understand(history=dspy_hist, user_question=message)
-     except Exception as exc:
-         history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
-         yield *_emit(history, ""),
-         return
-
-     # Show what was understood; plan is next
-     yield *_emit(history, _stage_understand(u)),
-
-     # ── Step 2: plan retrieval queries ────────────────────────────────────────
-     try:
-         p = _advisor.plan(
-             surface_concern=u.surface_concern,
-             deeper_concern=u.deeper_concern,
-             vedantic_themes=u.vedantic_themes,
-         )
-     except Exception as exc:
-         history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
-         yield *_emit(history, ""),
-         return
-
-     queries = p.queries[: config.N_RETRIEVAL_QUERIES] if p.queries else [u.deeper_concern]
-     yield *_emit(history, _stage_plan(queries)),
-
-     # ── Step 3: retrieve (fast, local Chroma) ────────────────────────────────
-     hits = _advisor._retriever.search_many(queries, k_per=config.TOP_K_RETRIEVE)
-     candidates = hits[: max(8, config.TOP_K_RETRIEVE)]
-     candidates_text = format_passages_for_llm(candidates)
-     candidates_as_dicts = [h.to_dict() for h in candidates]
-     previously_cited = [
-         src for msg in dspy_hist.messages for src in msg.get("sources_cited", [])
-     ]
-     yield *_emit(history, _stage_retrieve(len(candidates))),
-
-     # ── Step 4: select passages ───────────────────────────────────────────────
-     try:
-         s = _advisor.select(
-             deeper_concern=u.deeper_concern,
-             candidate_passages=candidates_text,
-             previously_cited=previously_cited,
-         )
-     except Exception as exc:
-         history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
-         yield *_emit(history, ""),
-         return
-
-     valid_idx = [
-         i for i in (s.selected_indices or [])
-         if isinstance(i, int) and 1 <= i <= len(candidates)
-     ]
-     if not valid_idx:
-         valid_idx = list(range(1, min(4, len(candidates) + 1)))
-     selected = [candidates[i - 1] for i in valid_idx]
-     selected_text = format_passages_for_llm(selected)
-
-     # Show selected sources; stream synthesis next
-     selected_refs = [
-         candidates_as_dicts[i - 1].get("verse_ref", f"#{i}").upper().replace("_", " ")
-         for i in valid_idx
-         if i - 1 < len(candidates_as_dicts)
-     ]
-     yield *_emit(history, _stage_select(selected_refs)),
-
-     # ── Step 5: synthesize with real token streaming ──────────────────────────
-     # Build partial thinking from steps 1-4 (reasoning filled in after stream)
-     partial = SimpleNamespace(
-         felt_emotion=u.felt_emotion, deeper_concern=u.deeper_concern,
-         vedantic_themes=u.vedantic_themes, queries=queries,
-         selection_rationale=s.selection_rationale, synthesis_reasoning="",
-     )
-     partial_thinking = _build_reasoning_html(partial)
-
-     history = history + [{"role": "assistant", "content": ""}]
-     full_text = ""
-     streamed_response = ""
-     in_response = False
-
-     try:
-         messages = _build_synthesis_messages(
-             dspy_hist, message, u.felt_emotion, u.deeper_concern, selected_text
-         )
-         stream = _synthesis_client.chat.completions.create(
-             model=config.HF_MODEL,
-             messages=messages,
-             stream=True,
-             temperature=0.6,
-             max_tokens=4096,
-         )
-         for chunk in stream:
-             if not chunk.choices:
-                 continue
-             delta = chunk.choices[0].delta.content or ""
-             full_text += delta
-
-             if not in_response:
-                 if _RESPONSE_MARKER in full_text:
-                     in_response = True
-                     after = full_text.split(_RESPONSE_MARKER, 1)[1].lstrip("\n")
-                     if _SOURCES_MARKER in after:
-                         streamed_response = after.split(_SOURCES_MARKER, 1)[0].rstrip()
-                         history[-1]["content"] = streamed_response
-                         yield history, "", None, _no_src, "", partial_thinking
-                         break
-                     streamed_response = after
-                     history[-1]["content"] = streamed_response
-                     yield history, "", None, _no_src, "", partial_thinking
-             else:
-                 if _SOURCES_MARKER in full_text:
-                     after_response = full_text.split(_RESPONSE_MARKER, 1)[1]
-                     streamed_response = after_response.split(_SOURCES_MARKER, 1)[0].strip()
-                     history[-1]["content"] = streamed_response
-                     yield history, "", None, _no_src, "", partial_thinking
-                     break
-                 streamed_response += delta
-                 history[-1]["content"] = streamed_response
-                 yield history, "", None, _no_src, "", partial_thinking
-
-     except Exception as exc:
-         history[-1]["content"] = f"*Error during synthesis — {exc}*"
-         yield history, "", None, _no_src, "", partial_thinking
-         return
-
-     # Parse sources and reasoning from the complete accumulated text
-     sources_cited = _parse_sources_cited(full_text)
-     reasoning = _parse_reasoning(full_text)
-
-     final_thinking = _build_reasoning_html(SimpleNamespace(
-         felt_emotion=u.felt_emotion, deeper_concern=u.deeper_concern,
-         vedantic_themes=u.vedantic_themes, queries=queries,
-         selection_rationale=s.selection_rationale, synthesis_reasoning=reasoning,
-     ))
-
-     if sources_cited:
-         footer = "\n\n---\n**Sources:** " + " · ".join(f"`{s}`" for s in sources_cited)
-         history[-1]["content"] = streamed_response + footer
-
-     yield history, "", None, gr.update(choices=sources_cited, value=None, visible=bool(sources_cited)), "", final_thinking
-
-
- def show_verse(ref: str) -> tuple[str, str]:
-     """Return (verse_html, explain_html) — clears any prior explanation."""
-     if not ref:
-         return "", ""
-     verse = _verse_lookup.get(ref.lower().strip())
-     if verse is None:
-         return '<div class="verse-panel"><p style="color:var(--text-muted)">Verse not found in corpus.</p></div>', ""
-     return _render_verse_html(verse), ""
-
-
- def explain_verse(source_ref: str, history: list):
-     """Generator: stream a contextual explanation of the selected verse."""
-     if not source_ref:
-         yield '<div class="explain-panel" style="color:var(--text-muted)">Select a verse first.</div>'
-         return
-
-     verse = _verse_lookup.get(source_ref.lower().strip())
-     if verse is None:
-         yield '<div class="explain-panel" style="color:var(--text-muted)">Verse not found in corpus.</div>'
-         return
-
-     # Build verse content string
-     bits = []
-     if getattr(verse, "translation", None):
-         bits.append(f"Translation: {verse.translation}")
-     if getattr(verse, "sanskrit", None):
-         bits.append(f"Sanskrit: {verse.sanskrit}")
-     if getattr(verse, "bhashya", None):
-         bits.append(f"Śaṅkara's commentary: {verse.bhashya[:600]}")
-     ev = verse if isinstance(verse, EnrichedVerse) else None
-     if ev and getattr(ev, "paraphrase", None):
-         bits.append(f"Teaching: {ev.paraphrase}")
-     verse_content = "\n\n".join(bits)
-
-     # Build conversation context from the last turn
-     context = "No prior conversation."
-     i = len(history) - 1
-     while i >= 0:
-         if history[i].get("role") == "assistant" and i > 0:
-             user_msg = history[i - 1].get("content", "")
-             bot_msg = history[i].get("content", "")
-             if "\n\n---\n" in bot_msg:
-                 bot_msg = bot_msg.split("\n\n---\n")[0]
-             context = f"User: {user_msg}\n\nAdvisor: {bot_msg}"
-             break
-         i -= 1
-
-     # Run in thread so we can stream
-     result_box: list = [None]
-     err_box: list = [None]
-     done = threading.Event()
-
-     def _run():
-         try:
-             explainer = dspy.ChainOfThought(_ExplainInContext)
-             result_box[0] = explainer(
-                 verse_ref=source_ref,
-                 verse_content=verse_content,
-                 conversation_context=context,
-             )
-         except Exception as exc:
-             err_box[0] = exc
-         finally:
-             done.set()
-
-     threading.Thread(target=_run, daemon=True).start()
-
-     yield '<div class="explain-panel" style="color:var(--text-muted);font-style:italic;">◌ drawing the thread…</div>'
-     while not done.wait(timeout=0.2):
-         yield '<div class="explain-panel" style="color:var(--text-muted);font-style:italic;">◌ drawing the thread…</div>'
-
-     if err_box[0]:
-         yield f'<div class="explain-panel" style="color:var(--text-muted)">Could not generate explanation: {err_box[0]}</div>'
-         return
-
-     explanation = result_box[0].explanation
-     # Stream character by character
-     streamed = ""
-     for char in explanation:
-         streamed += char
-         yield f'<div class="explain-panel">{streamed}</div>'
-
-
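The `explain_verse` generator above runs the blocking LLM call in a daemon thread and polls a `threading.Event`, yielding a spinner frame on each timeout until the worker finishes. A minimal sketch of that worker-plus-polling pattern in isolation (the generic `work` callable and the frame tuples are illustrative stand-ins, not names from the repo):

```python
import threading
import time

def run_with_progress(work, poll: float = 0.05):
    """Run `work` in a daemon thread; yield ("tick", None) frames while it
    runs, then one final ("result", value) or ("error", exc) frame."""
    result_box, err_box = [None], [None]
    done = threading.Event()

    def _run():
        try:
            result_box[0] = work()
        except Exception as exc:  # surface worker errors to the caller
            err_box[0] = exc
        finally:
            done.set()

    threading.Thread(target=_run, daemon=True).start()
    # Event.wait returns False on timeout, True once the worker sets it.
    while not done.wait(timeout=poll):
        yield ("tick", None)
    if err_box[0] is not None:
        yield ("error", err_box[0])
    else:
        yield ("result", result_box[0])
```

The list-as-box and `Event` polling keep the generator responsive without locks, at the cost of one poll interval of latency after completion.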
- # ── layout ─────────────────────────────────────────────────────────────────────
-
- EXAMPLES = [
-     "I just got laid off and feel like nothing makes sense.",
-     "I'm terrified of dying. Is that irrational?",
-     "I keep hurting the people I love without meaning to.",
-     "I've been meditating for years but still feel empty.",
-     "My ambition feels hollow but I can't stop chasing it.",
- ]
-
- # css= belongs on gr.Blocks; Blocks.launch() has no css parameter
- with gr.Blocks(title="Gītā Advisor", css=CSS) as demo:
-
-     pred_state = gr.State(None)
-
-     gr.HTML("""
-     <div class="app-header">
-         <div class="app-title">Gītā Advisor</div>
-         <div class="app-subtitle">Grounded in Advaita Vedānta as taught by Śaṅkarācārya</div>
-         <div class="app-ornament">✦ &nbsp; ✦ &nbsp; ✦</div>
-     </div>
-     """)
-
-     chatbot = gr.Chatbot(
-         height=480,
-         show_label=False,
-         elem_id="chatbot",
-         render_markdown=True,
-         type="messages",  # respond() yields {"role", "content"} dicts
-         placeholder=(
-             '<div style="text-align:center;padding:3rem 1rem;">'
-             '<span style="color:#5A3F1E;font-style:italic;font-size:0.95rem;">'
-             "Speak from where you actually are.<br>"
-             '<span style="font-size:0.82rem">The teacher will meet you there.</span>'
-             "</span></div>"
-         ),
-     )
-
-     stage_html = gr.HTML("", elem_id="stage-status")
-
-     with gr.Row(equal_height=True):
-         msg_box = gr.Textbox(
-             placeholder="Speak from where you actually are…",
-             show_label=False,
-             lines=2,
-             max_lines=6,
-             elem_id="msg-input",
-             scale=7,
-             container=False,
-         )
-         submit_btn = gr.Button("Ask →", variant="primary", elem_id="submit-btn", size="lg", scale=1, min_width=110)
-         clear_btn = gr.Button("✕", variant="secondary", elem_id="clear-btn", size="lg", scale=0, min_width=46)
-
-     gr.Examples(examples=EXAMPLES, inputs=msg_box, label="Opening moves")
-
-     with gr.Column(elem_classes=["explorer-wrap"]):
-         gr.HTML('<div class="explorer-label">Explore a cited verse</div>')
-         source_dd = gr.Dropdown(
-             choices=[],
-             value=None,
-             label="Select a cited source…",
-             show_label=False,
-             elem_id="source-dd",
-             visible=False,
-             interactive=True,
-         )
-         verse_html = gr.HTML("")
-         explain_btn = gr.Button("Explain in context →", elem_id="explain-btn", visible=True, interactive=False, size="sm")
-         explain_out = gr.HTML("")
-
-     with gr.Accordion("🧠 Model reasoning", open=False, elem_id="thinking-accordion"):
-         thinking_html = gr.HTML("")
-
-     # ── event wiring ──────────────────────────────────────────────────────────
-     outputs = [chatbot, stage_html, pred_state, source_dd, msg_box, thinking_html]
-
-     msg_box.submit(respond, [msg_box, chatbot], outputs)
-     submit_btn.click(respond, [msg_box, chatbot], outputs)
-
-     clear_btn.click(
-         fn=lambda: ([], "", None, gr.update(choices=[], value=None, visible=False), "", "", "", ""),
-         outputs=[chatbot, stage_html, pred_state, source_dd, msg_box, verse_html, explain_out, thinking_html],
-     )
-
-     source_dd.change(
-         fn=lambda ref: (*show_verse(ref), gr.update(interactive=bool(ref))),
-         inputs=source_dd,
-         outputs=[verse_html, explain_out, explain_btn],
-     )
-
-     explain_btn.click(
-         fn=explain_verse,
-         inputs=[source_dd, chatbot],
-         outputs=explain_out,
-     )
-
- demo.queue()
-
- if __name__ == "__main__":
-     demo.launch(server_port=7860)
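The streaming loop in `respond` above splits the raw completion on DSPy ChatAdapter-style `[[ ## field ## ]]` markers. A standalone sketch of that field-splitting logic, using the marker strings from the deleted file (the sample completion text in the usage is invented):

```python
import json
import re

RESPONSE_MARKER = "[[ ## response ## ]]"
SOURCES_MARKER = "[[ ## sources_cited ## ]]"

def split_fields(full_text: str) -> tuple[str, list[str]]:
    """Return (response, sources_cited) from a marker-delimited completion."""
    response: str = ""
    sources: list[str] = []
    if RESPONSE_MARKER in full_text:
        after = full_text.split(RESPONSE_MARKER, 1)[1]
        # The response field runs until the sources marker (or end of text).
        response = after.split(SOURCES_MARKER, 1)[0].strip()
    if SOURCES_MARKER in full_text:
        raw = full_text.split(SOURCES_MARKER, 1)[1]
        raw = re.split(r"\[\[", raw)[0].strip()  # stop at the next field marker
        try:
            parsed = json.loads(raw)
            sources = parsed if isinstance(parsed, list) else []
        except json.JSONDecodeError:
            pass  # malformed JSON: leave sources empty
    return response, sources
```

During streaming the same splits are applied to the growing buffer on every chunk, so the UI can start rendering the response field before the sources field has arrived.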
artifacts/chroma/52cdeb15-0631-44ed-8618-782f1d4d27bb/data_level0.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:201b998f2a013f78cea5960b05174ceffedbd046c4dfc10a8d2492ff8a1398a7
+ oid sha256:bbac4dba21b040d3b944b10cf70751e5b6ce7a5bf3e98b4bbafd56657c5c7ffb
  size 167600
artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/data_level0.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1471fb4961d4e00a440e224c0f9cdfe75b4dd83de2a20772e448b632da02404a
+ size 167600
artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/header.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+ size 100
artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/length.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+ size 400
parsers/__init__.py → artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/link_lists.bin RENAMED
File without changes
artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/data_level0.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5dfaebf500084c49276ef2d99d82780c1268cc3bd9b4df63416632bf613a2907
+ size 167600
artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/header.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+ size 100
artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/length.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+ size 400
artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/link_lists.bin ADDED
File without changes
artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/data_level0.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5b6c432fff6b70a57924e98c7ed59ff1bea23574753718ba13c417e32baffd4
+ size 167600
artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/header.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+ size 100
artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/length.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+ size 400
artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/link_lists.bin ADDED
File without changes
artifacts/chroma/chroma.sqlite3 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:acd63ba75ad75f6234302822709e44e8e560d80e120f31519c20f73d41821d22
- size 20459520
+ oid sha256:2faa37ea993ae5c07d47bc4b0f9fb17520e4839b8c3299318b36b0f5add3d181
+ size 21323776
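Each `.bin` entry above is a Git LFS pointer file: three `key value` lines (`version`, `oid`, `size`) stored in place of the binary payload. A small sketch that parses one such pointer into a dict (the pointer text is copied from the `chroma.sqlite3` entry above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; partition splits on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:2faa37ea993ae5c07d47bc4b0f9fb17520e4839b8c3299318b36b0f5add3d181
size 21323776"""
```

The `oid` is the SHA-256 of the real file content, which is why every identical `header.bin`/`length.bin` above shares one oid while each `data_level0.bin` differs.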
artifacts/optimized_advisor.prompts.txt DELETED
@@ -1,87 +0,0 @@
- # Optimized prompts after GEPA
-
- ## understand.predict
- ### instructions
- Read the user's life situation carefully, taking into account the full
- conversation so far. If there is prior exchange, use it to understand
- follow-up messages, references like 'what you said earlier', or shifts in
- the user's emotional state across turns. Identify the felt emotion, the
- underlying spiritual concern (not just the surface complaint), and the
- Vedāntic themes that are most relevant — drawing only from concepts native
- to Advaita Vedānta.
-
- ### fields
- - history: Prior turns as a list of message dicts with 'user_question' and 'response' keys. Empty history means this is the first message.
- - user_question: The user's current message; may be a question, a vent, a follow-up, or a description of a situation.
- - reasoning: ${reasoning}
- - felt_emotion: The dominant emotion the user is experiencing, named precisely (e.g. 'anticipatory grief', not just 'sad').
- - surface_concern: What the user is literally asking about, in one sentence.
- - deeper_concern: The underlying existential/spiritual concern — usually about identity, attachment, fear, dharma, or meaning — that the surface concern is a symptom of. One sentence.
- - vedantic_themes: 2-4 Advaita-Vedānta concepts most relevant to this situation. Use Sanskrit terms with brief gloss, e.g. 'adhyāsa (superimposition of self onto roles)', 'vairāgya (dispassion)', 'sākṣī (witness consciousness)'.
-
- ---
-
- ## plan.predict
- ### instructions
- Given the user's situation and identified themes, generate diverse search
- queries to find relevant passages from the Advaita corpus (Bhagavad Gītā with
- Śaṅkara bhāṣya, Upaniṣads, Brahma Sūtras, prakaraṇa texts). Each query should
- target a different angle — one query about the philosophical principle,
- one about a parallel situation in the texts, one about the practical
- teaching offered by the lineage.
-
- ### fields
- - surface_concern: ${surface_concern}
- - deeper_concern: ${deeper_concern}
- - vedantic_themes: ${vedantic_themes}
- - reasoning: ${reasoning}
- - queries: 3 distinct search queries (each 5-15 words). Vary in angle: principle, parallel, practice.
-
- ---
-
- ## select.predict
- ### instructions
- From the retrieved candidate passages, select the ones that genuinely
- speak to this user's situation. Prefer primary sources (Gītā verses,
- Upaniṣadic mantras, Śaṅkara's bhāṣya) over secondary or modern commentary
- when both are available. Reject passages that are merely topically adjacent
- but don't address the actual spiritual concern. Avoid re-selecting passages
- whose source was already cited in a prior turn — prefer fresh ground.
-
- ### fields
- - deeper_concern: ${deeper_concern}
- - candidate_passages: Numbered candidate passages with source attribution.
- - previously_cited: Source references already cited in earlier turns of this conversation (e.g. ['BG 2.47', 'BG 18.66']). Prefer passages not on this list so the conversation covers new ground. Empty list on the first turn.
- - reasoning: ${reasoning}
- - selected_indices: Indices (1-based) of the 2-4 most relevant passages.
- - selection_rationale: One sentence per selection explaining why that passage speaks to this concern.
-
- ---
-
- ## synthesize.predict
- ### instructions
- Compose a response that is grounded in Advaita Vedānta as taught by
- Śaṅkarācārya, empathetic to the user's felt experience, and practically
- useful for their situation. Honor the two-truths distinction: meet the user
- in vyāvahārika (transactional reality) without ever denying the
- pāramārthika (absolute) view. Cite specific verses/passages by reference,
68
- integrate them into prose rather than dumping quotes, and keep wit gentle —
69
- light around the cosmic predicament, never light about the user's pain.
70
-
71
- If history has prior turns: do not repeat citations or teachings already
72
- given; build on or deepen what was said; acknowledge any shift the user has
73
- expressed since the last turn. If the user is following up, open by briefly
74
- acknowledging the continuity before moving forward.
75
-
76
- ### fields
77
- - history: Prior turns as a list of message dicts with 'user_question' and 'response' keys. Use this to avoid repetition and to build across turns.
78
- - user_question: ${user_question}
79
- - felt_emotion: ${felt_emotion}
80
- - deeper_concern: ${deeper_concern}
81
- - selected_passages: The selected passages with full source attribution.
82
- - reasoning: ${reasoning}
83
- - response: The advisor's reply to the user. 250-450 words. Open by acknowledging the felt experience. Move into the Vedāntic perspective. Cite at least one primary source (Gītā chapter:verse, Upaniṣad name + section, etc.). Close with a concrete practice or shift in perspective they can try this week. Address the user as 'you' throughout. Avoid Western therapy clichés.
84
- - sources_cited: Source references actually cited in the response, e.g. 'BG 2.47', 'Bṛhadāraṇyaka Up. 4.4.5', 'Vivekacūḍāmaṇi 11'.
85
-
86
- ---
87
-
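For orientation, the four deleted predictors above (understand → plan → select → synthesize) composed into a single retrieve-and-synthesize pipeline. A minimal sketch of that control flow, with the stage functions injected as callables — all names here are illustrative, not the DSPy module API:

```python
def advisor_pipeline(user_question, history, understand, plan, select, synthesize, retrieve):
    """Chain the four stages: understand -> plan -> select -> synthesize."""
    analysis = understand(user_question, history)            # emotion + deeper concern + themes
    queries = plan(analysis)                                 # diverse retrieval queries
    candidates = [p for q in queries for p in retrieve(q)]   # pooled corpus hits
    # collect refs cited in earlier turns so select can prefer fresh ground
    cited = [ref for turn in history for ref in turn.get("sources_cited", [])]
    picked = select(analysis, candidates, cited)
    return synthesize(user_question, analysis, picked, history)
```

Each stage only sees the outputs of the stages before it, which is what lets GEPA optimize the four instruction blocks independently.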
 
chat.py DELETED
@@ -1,392 +0,0 @@
1
- """
2
- chat.py — interactive conversation with the advisor.
3
-
4
- By default it loads the GEPA-optimized program from artifacts/. If that file
5
- doesn't exist yet, it falls back to the un-optimized base prompts so you can
6
- sanity-check the pipeline before running optimization.
7
-
8
- Flags:
9
- --debug Show intermediate pipeline state (felt emotion, queries, etc.)
10
- --thinking Show the full synthesis reasoning trace (default: first 6 lines)
11
- --no-thinking Hide the reasoning trace entirely
12
-
13
- After each response, source references are printed with numbers.
14
- show <N|ref> Display the verse text, translation, and Śaṅkara's bhāṣya.
15
- explain <N|ref> Show the verse then stream a contextual explanation of how
16
- it applies to the current conversation.
17
- """
18
-
19
- from __future__ import annotations
20
- import argparse
21
- import time
22
- import threading
23
- from typing import Optional
24
-
25
- import dspy
26
- from rich.console import Console
27
- from rich.live import Live
28
- from rich.markdown import Markdown
29
- from rich.panel import Panel
30
- from rich.rule import Rule
31
- from rich.text import Text
32
-
33
- import config
34
- from advisor import load_optimized
35
- from corpus import EnrichedVerse, Verse, read_jsonl_enriched, read_jsonl_verses
36
-
37
-
38
- # ── speed constants ────────────────────────────────────────────────────────────
39
- _THINKING_CPS = 800 # chars/sec for reasoning stream (secondary content, fast)
40
- _RESPONSE_CPS = 300 # chars/sec for advisor response (primary content)
41
- _THINKING_PREVIEW = 6 # lines shown in collapsed thinking mode
42
-
43
-
44
- # ── verse corpus lookup ────────────────────────────────────────────────────────
45
- def _load_verse_lookup() -> dict[str, Verse]:
46
- """Build a case-insensitive verse_ref → Verse dict from the corpus."""
47
- lookup: dict[str, Verse] = {}
48
- enriched = config.DATA_DIR / "corpus_enriched.jsonl"
49
- plain = config.DATA_DIR / "corpus.jsonl"
50
-
51
- if enriched.exists():
52
- loader, path = read_jsonl_enriched, enriched
53
- elif plain.exists():
54
- loader, path = read_jsonl_verses, plain
55
- else:
56
- return lookup
57
-
58
- for verse in loader(path):
59
- lookup[verse.verse_ref.lower().strip()] = verse
60
- return lookup
61
-
62
-
63
- def _find_verse(lookup: dict, ref: str) -> Optional[Verse]:
64
- return lookup.get(ref.lower().strip())
65
-
66
-
67
- def _resolve_ref(arg: str, sources_cited: list[str]) -> str:
68
- """Turn '1' → sources_cited[0], or return arg unchanged for direct ref lookup."""
69
- try:
70
- n = int(arg.strip())
71
- if 1 <= n <= len(sources_cited):
72
- return sources_cited[n - 1]
73
- except ValueError:
74
- pass
75
- return arg.strip()
76
-
77
-
78
- # ── DSPy signature for contextual explanation ─────────────────────────────────
79
- class _ExplainInContext(dspy.Signature):
80
- """You are the Gītā Advisor continuing a conversation. The user has asked
81
- you to unpack a specific verse or passage you cited. Explain what it means
82
- and why it speaks precisely to their situation — go deeper than the initial
83
- response did. Reference the user's words. Close with one concrete way to
84
- hold or work with this text this week."""
85
-
86
- verse_ref: str = dspy.InputField()
87
- verse_content: str = dspy.InputField(
88
- desc="Translation, original text (if available), and Śaṅkara's commentary."
89
- )
90
- conversation_context: str = dspy.InputField(
91
- desc="The user's question and the advisor's response where this verse was cited."
92
- )
93
-
94
- explanation: str = dspy.OutputField(
95
- desc="150-250 words. Grounded in Advaita. Do not merely restate the translation. "
96
- "End with a practical suggestion for this week."
97
- )
98
-
99
-
100
- # ── streaming helpers ─────────────────────────────────────────────────────────
101
- def _stream_chars(console: Console, text: str, cps: int):
102
- """Write text to the terminal character by character."""
103
- if not text:
104
- return
105
- delay = 1.0 / cps
106
- for ch in text:
107
- console.file.write(ch)
108
- console.file.flush()
109
- time.sleep(delay)
110
- console.file.write("\n")
111
- console.file.flush()
112
-
113
-
114
- def _stream_response(console: Console, text: str, cps: int = _RESPONSE_CPS):
115
- """Stream the advisor response into a growing Markdown Panel via Rich Live."""
116
- if not text:
117
- return
118
- displayed = ""
119
- delay = 1.0 / cps
120
- with Live(console=console, refresh_per_second=min(cps, 30)) as live:
121
- for ch in text:
122
- displayed += ch
123
- live.update(Panel(
124
- Markdown(displayed),
125
- title="[bold]advisor[/bold]",
126
- border_style="yellow",
127
- padding=(1, 2),
128
- ))
129
- time.sleep(delay)
130
-
131
-
132
- def _show_thinking(console: Console, reasoning: str, full: bool):
133
- """Stream the synthesis reasoning below a dim rule, collapsed to _THINKING_PREVIEW lines."""
134
- if not reasoning:
135
- return
136
-
137
- lines = reasoning.strip().splitlines()
138
- if not full and len(lines) > _THINKING_PREVIEW:
139
- display = "\n".join(lines[:_THINKING_PREVIEW])
140
- n_hidden = len(lines) - _THINKING_PREVIEW
141
- else:
142
- display = "\n".join(lines)
143
- n_hidden = 0
144
-
145
- console.print(Rule("[dim]thinking[/dim]", style="dim blue"))
146
- # Write dim italic via ANSI since we're streaming to file directly
147
- # (Rich markup can't be applied char-by-char; dim is cosmetic here)
148
- _stream_chars(console, display, cps=_THINKING_CPS)
149
-
150
- if n_hidden:
151
- console.print(f"[dim] ↳ {n_hidden} more lines — use --thinking to expand[/dim]")
152
- console.print()
153
-
154
-
155
- # ── verse display helpers ─────────────────────────────────────────────────────
156
- def _show_verse(console: Console, verse: Verse):
157
- """Render a verse with its translation, original text, and commentary."""
158
- body = Text()
159
-
160
- if verse.sanskrit:
161
- body.append(verse.sanskrit + "\n", style="bold")
162
- if verse.transliteration:
163
- body.append(verse.transliteration + "\n", style="italic dim")
164
-
165
- if verse.translation:
166
- label = f"Translation ({verse.translator})" if verse.translator else "Translation"
167
- body.append(f"\n{label}:\n", style="dim")
168
- body.append(verse.translation + "\n")
169
-
170
- if verse.bhashya:
171
- translator_note = f" ({verse.bhashya_translator})" if verse.bhashya_translator else ""
172
- body.append(f"\nŚaṅkara's Bhāṣya{translator_note}:\n", style="dim")
173
- preview = verse.bhashya[:800] + ("…" if len(verse.bhashya) > 800 else "")
174
- body.append(preview + "\n", style="dim")
175
-
176
- ev = verse if isinstance(verse, EnrichedVerse) else None
177
- if ev and ev.paraphrase:
178
- body.append("\nTeaching: ", style="bold dim")
179
- body.append(ev.paraphrase + "\n", style="dim")
180
- if ev and ev.themes:
181
- body.append("Themes: ", style="bold dim")
182
- body.append(", ".join(ev.themes) + "\n", style="dim")
183
- if ev and ev.practical_teaching:
184
- body.append("Practical shift: ", style="bold dim")
185
- body.append(ev.practical_teaching + "\n", style="dim")
186
-
187
- section = verse.section_display or verse.section
188
- subtitle = verse.work_display + (f" — {section}" if section else "")
189
- console.print(Panel(
190
- body,
191
- title=f"[bold]{verse.verse_ref}[/bold]",
192
- subtitle=f"[dim]{subtitle}[/dim]",
193
- border_style="cyan",
194
- padding=(1, 2),
195
- ))
196
-
197
-
198
- def _explain_in_context(
199
- console: Console,
200
- verse: Verse,
201
- history_messages: list[dict],
202
- cps: int = _RESPONSE_CPS,
203
- ):
204
- """Call the LM to explain the verse in context of the last conversation turn."""
205
- if history_messages:
206
- last = history_messages[-1]
207
- context = (
208
- f"User: {last.get('user_question', '')}\n\n"
209
- f"Advisor: {last.get('response', '')}"
210
- )
211
- else:
212
- context = "No prior conversation."
213
-
214
- bits = []
215
- if verse.translation:
216
- bits.append(f"Translation: {verse.translation}")
217
- if verse.sanskrit:
218
- bits.append(f"Sanskrit: {verse.sanskrit}")
219
- if verse.bhashya:
220
- bits.append(f"Śaṅkara's commentary: {verse.bhashya[:600]}")
221
- ev = verse if isinstance(verse, EnrichedVerse) else None
222
- if ev and ev.paraphrase:
223
- bits.append(f"Teaching: {ev.paraphrase}")
224
- verse_content = "\n\n".join(bits)
225
-
226
- explainer = dspy.ChainOfThought(_ExplainInContext)
227
- with console.status("[dim]expanding...[/dim]", spinner="dots"):
228
- try:
229
- result = explainer(
230
- verse_ref=verse.verse_ref,
231
- verse_content=verse_content,
232
- conversation_context=context,
233
- )
234
- explanation = result.explanation
235
- except Exception as exc:
236
- console.print(f"[red]Could not generate explanation: {exc}[/red]")
237
- return
238
-
239
- console.print()
240
- _stream_response(console, explanation, cps=cps)
241
-
242
-
243
- # ── main loop ─────────────────────────────────────────────────────────────────
244
- def main():
245
- ap = argparse.ArgumentParser()
246
- ap.add_argument("--program", default=str(config.OPTIMIZED_PROGRAM_PATH))
247
- ap.add_argument("--debug", action="store_true",
248
- help="Show intermediate pipeline state for each turn.")
249
- ap.add_argument("--thinking", action="store_true",
250
- help="Show full synthesis reasoning trace (default: first 6 lines).")
251
- ap.add_argument("--no-thinking", action="store_true", dest="no_thinking",
252
- help="Hide the reasoning trace entirely.")
253
- args = ap.parse_args()
254
-
255
- config.configure_dspy()
256
- advisor = load_optimized(args.program)
257
- console = Console()
258
-
259
- with console.status("[dim]loading corpus...[/dim]", spinner="dots"):
260
- verse_lookup = _load_verse_lookup()
261
-
262
- console.print(Panel.fit(
263
- "[bold]Gītā Advisor[/bold]\n\n"
264
- "Speak from where you actually are.\n"
265
- "After a response: [italic]show <N>[/italic] to read a cited verse · "
266
- "[italic]explain <N>[/italic] for contextual breakdown.\n"
267
- "Type [italic]exit[/italic] or Ctrl-D to leave.",
268
- border_style="cyan",
269
- ))
270
-
271
- history = dspy.History(messages=[])
272
- last_pred = None
273
-
274
- while True:
275
- try:
276
- console.print()
277
- console.print("[bold cyan]you:[/bold cyan] ", end="")
278
- line = input().strip()
279
- except (EOFError, KeyboardInterrupt):
280
- console.print("\n[dim]नमस्ते।[/dim]")
281
- return
282
-
283
- if not line:
284
- continue
285
- if line.lower() in {"exit", "quit", ":q"}:
286
- console.print("[dim]नमस्ते।[/dim]")
287
- return
288
-
289
- # ── source exploration commands ───────────────────────────────────────
290
- cmd_lower = line.lower()
291
- if cmd_lower.startswith(("show ", "explain ")):
292
- if last_pred is None:
293
- console.print("[dim]No sources yet — ask a question first.[/dim]")
294
- continue
295
- cmd, _, arg = line.partition(" ")
296
- ref = _resolve_ref(arg, last_pred.sources_cited)
297
- verse = _find_verse(verse_lookup, ref)
298
- if verse is None:
299
- console.print(f"[dim]'{ref}' not found in corpus.[/dim]")
300
- if last_pred.sources_cited:
301
- hint = " ".join(
302
- f"[{i+1}] {r}" for i, r in enumerate(last_pred.sources_cited)
303
- )
304
- console.print(f"[dim]Available: {hint}[/dim]")
305
- continue
306
- _show_verse(console, verse)
307
- if cmd.lower() == "explain":
308
- _explain_in_context(console, verse, history.messages)
309
- continue
310
-
311
- # ── normal question — run pipeline in background with live stage progress ──
312
- pred = None
313
- error = None
314
- stage = ["initializing..."]
315
- done = threading.Event()
316
-
317
- def run_advisor():
318
- nonlocal pred, error
319
- try:
320
- pred = advisor(
321
- user_question=line,
322
- history=history,
323
- _stage_cb=lambda msg: stage.__setitem__(0, msg),
324
- )
325
- except Exception as exc:
326
- error = exc
327
- finally:
328
- done.set()
329
-
330
- threading.Thread(target=run_advisor, daemon=True).start()
331
-
332
- with Live(console=console, refresh_per_second=8) as live:
333
- while not done.wait(timeout=0.12):
334
- live.update(Text(f" ◌ {stage[0]}", style="dim"))
335
- live.update(Text(""))
336
-
337
- if error:
338
- console.print(f"[red]Error: {error}[/red]")
339
- continue
340
-
341
- last_pred = pred
342
- history.messages.append({
343
- "user_question": line,
344
- "response": pred.response,
345
- "sources_cited": pred.sources_cited,
346
- })
347
-
348
- # debug trace
349
- if args.debug:
350
- console.print(Rule("[dim]debug[/dim]", style="dim"))
351
- console.print(f"[dim]felt:[/dim] {pred.felt_emotion}")
352
- console.print(f"[dim]surface:[/dim] {pred.surface_concern}")
353
- console.print(f"[dim]deeper:[/dim] {pred.deeper_concern}")
354
- console.print(f"[dim]themes:[/dim] {', '.join(pred.vedantic_themes)}")
355
- console.print(f"[dim]queries:[/dim] {pred.queries}")
356
- console.print(f"[dim]selected:[/dim] {pred.selected_indices}")
357
- for i in pred.selected_indices:
358
- if 1 <= i <= len(pred.retrieved_passages):
359
- h = pred.retrieved_passages[i - 1]
360
- m = h["meta"]
361
- console.print(
362
- f" [dim]→ [{m['tier']}] {m['work']}"
363
- f"{' — ' + m['section'] if m.get('section') else ''}"
364
- f" (score {h['score']:.3f})[/dim]"
365
- )
366
- console.print(Rule(style="dim"))
367
-
368
- # thinking section
369
- if not args.no_thinking:
370
- _show_thinking(
371
- console,
372
- getattr(pred, "synthesis_reasoning", ""),
373
- full=args.thinking,
374
- )
375
-
376
- # stream the response
377
- console.print()
378
- _stream_response(console, pred.response)
379
-
380
- # source footer with hints
381
- if pred.sources_cited:
382
- numbered = " ".join(
383
- f"[{i+1}] {r}" for i, r in enumerate(pred.sources_cited)
384
- )
385
- console.print(f"\n[dim]sources: {numbered}[/dim]")
386
- console.print(
387
- "[dim] → show <N> to read the verse · explain <N> for contextual breakdown[/dim]"
388
- )
389
-
390
-
391
- if __name__ == "__main__":
392
- main()
 
config.py CHANGED
@@ -90,8 +90,9 @@ TASK_LM_KWARGS = dict(
90
  )
91
 
92
  # ──────────────────────────── Task LM — HuggingFace Router ──────────────────────────────
93
- # router.huggingface.co/v1 is OpenAI-compatible. The HF_TOKEN Space secret is used
94
- # for both DSPy LM calls (via openai/ prefix) and the direct streaming client in app.py.
 
95
  HF_TOKEN = os.getenv("HF_TOKEN", "")
96
  HF_ROUTER_BASE = os.getenv("HF_ROUTER_BASE", "https://router.huggingface.co/v1")
97
  HF_MODEL = os.getenv("HF_MODEL", "google/gemma-4-26B-A4B-it")
@@ -105,14 +106,43 @@ HF_LM_KWARGS = dict(
105
  cache=True,
106
  )
107
 
108
- # Which backend to use. Priority: explicit env var > HF (default on Spaces) > gemini > lm_studio.
 
109
  def _default_task_lm_backend() -> str:
110
  if "TASK_LM_BACKEND" in os.environ:
111
  return os.environ["TASK_LM_BACKEND"]
112
- if HF_TOKEN:
113
- return "hf"
114
  if GEMINI_API_KEY:
115
  return "gemini"
 
 
 
 
116
  return "lm_studio"
117
 
118
  TASK_LM_BACKEND: str = _default_task_lm_backend()
@@ -184,22 +214,28 @@ REFLECTION_LM_KWARGS = dict(
184
  def configure_dspy(backend: str | None = None) -> tuple[dspy.LM, dspy.LM]:
185
  """Configure DSPy for inference and return (task_lm, reflection_lm).
186
 
187
- backend overrides TASK_LM_BACKEND when provided explicitly.
188
- Accepted values: "hf", "gemini", "lm_studio".
189
 
190
- ChatAdapter fallback to JSONAdapter is disabled because Gemma outputs
191
- `[[ ## field ]]` (no closing ##); the field_header_pattern patch at module
192
- load time makes ChatAdapter parse these correctly.
 
193
  """
194
  effective_backend = backend or TASK_LM_BACKEND
195
- if effective_backend == "hf":
 
 
 
 
 
 
 
 
196
  if not HF_TOKEN:
197
- raise SystemExit("HF_TOKEN is not set. Add it as a Space secret.")
198
  task_lm = dspy.LM(model=HF_MODEL_STRING, **HF_LM_KWARGS)
199
  print(f"Task LM backend: HuggingFace Router ({HF_MODEL} @ {HF_ROUTER_BASE})")
200
- elif effective_backend == "gemini":
201
- task_lm = dspy.LM(model=GEMINI_TASK_MODEL, **GEMINI_TASK_LM_KWARGS)
202
- print(f"Task LM backend: Gemini API ({GEMINI_TASK_MODEL})")
203
  else:
204
  task_lm = dspy.LM(model=TASK_MODEL_STRING, **TASK_LM_KWARGS)
205
  print(f"Task LM backend: LM Studio ({TASK_MODEL_STRING} @ {LM_STUDIO_BASE})")
 
90
  )
91
 
92
  # ──────────────────────────── Task LM — HuggingFace Router ──────────────────────────────
93
+ # router.huggingface.co/v1 is OpenAI-compatible; use the "openai/" LiteLLM prefix
94
+ # with api_base pointing at HF's router endpoint.
95
+ # Set HF_MODEL env var to use a different model slug (must be deployed on HF).
96
  HF_TOKEN = os.getenv("HF_TOKEN", "")
97
  HF_ROUTER_BASE = os.getenv("HF_ROUTER_BASE", "https://router.huggingface.co/v1")
98
  HF_MODEL = os.getenv("HF_MODEL", "google/gemma-4-26B-A4B-it")
 
106
  cache=True,
107
  )
108
 
109
+ # ──────────────────────────── Task LM — OpenRouter ──────────────────────────────────────
110
+ # LiteLLM recognises the "openrouter/" prefix natively and routes through
111
+ # https://openrouter.ai/api/v1. Pick any model slug from openrouter.ai/models.
112
+ #
113
+ # Speed vs quality guide (set OPENROUTER_MODEL to override):
114
+ # openrouter/google/gemini-2.0-flash-001 — fastest (~3-5s); good quality
115
+ # openrouter/google/gemini-2.5-flash-preview — balanced (~8-12s)
116
+ # openrouter/anthropic/claude-3-5-haiku — reliable structured output
117
+ # openrouter/google/gemma-3-27b-it — closest to the local Gemma 4 weights
118
+ OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY", "")
119
+ _openrouter_model_raw = os.getenv("OPENROUTER_MODEL", "google/gemini-2.0-flash-001")
120
+ # LiteLLM requires the "openrouter/" prefix; add it if the env var omits it.
121
+ OPENROUTER_MODEL = (
122
+ _openrouter_model_raw
123
+ if _openrouter_model_raw.startswith("openrouter/")
124
+ else f"openrouter/{_openrouter_model_raw}"
125
+ )
126
+
127
+ OPENROUTER_LM_KWARGS = dict(
128
+ api_key=OPENROUTER_API_KEY,
129
+ temperature=0.6,
130
+ max_tokens=4096,
131
+ cache=True,
132
+ )
133
+
134
+ # Which backend to use. Priority: GEMINI_API_KEY > OPENROUTER_API_KEY >
135
+ # HF_TOKEN > lm_studio fallback (matching the checks in _default_task_lm_backend).
136
+ # Force a specific one with TASK_LM_BACKEND=gemini|openrouter|hf|lm_studio.
137
  def _default_task_lm_backend() -> str:
138
  if "TASK_LM_BACKEND" in os.environ:
139
  return os.environ["TASK_LM_BACKEND"]
 
 
140
  if GEMINI_API_KEY:
141
  return "gemini"
142
+ if OPENROUTER_API_KEY:
143
+ return "openrouter"
144
+ if HF_TOKEN:
145
+ return "hf"
146
  return "lm_studio"
147
 
148
  TASK_LM_BACKEND: str = _default_task_lm_backend()
 
214
  def configure_dspy(backend: str | None = None) -> tuple[dspy.LM, dspy.LM]:
215
  """Configure DSPy for inference and return (task_lm, reflection_lm).
216
 
217
+ backend overrides TASK_LM_BACKEND when provided explicitly (used by chat.py's
218
+ --backend flag). Accepted values: "gemini", "openrouter", "hf", "lm_studio".
219
 
220
+ ChatAdapter fallback to JSONAdapter is disabled in all paths because:
221
+ - LM Studio rejects json_object.
222
+ - Gemma outputs `[[ ## field ]]` (no closing ##); the field_header_pattern
223
+ patch at module load time makes ChatAdapter parse these correctly.
224
  """
225
  effective_backend = backend or TASK_LM_BACKEND
226
+ if effective_backend == "gemini":
227
+ task_lm = dspy.LM(model=GEMINI_TASK_MODEL, **GEMINI_TASK_LM_KWARGS)
228
+ print(f"Task LM backend: Gemini API ({GEMINI_TASK_MODEL})")
229
+ elif effective_backend == "openrouter":
230
+ if not OPENROUTER_API_KEY:
231
+ raise SystemExit("OPENROUTER_API_KEY is not set. Add it to your .env file.")
232
+ task_lm = dspy.LM(model=OPENROUTER_MODEL, **OPENROUTER_LM_KWARGS)
233
+ print(f"Task LM backend: OpenRouter ({OPENROUTER_MODEL})")
234
+ elif effective_backend == "hf":
235
  if not HF_TOKEN:
236
+ raise SystemExit("HF_TOKEN is not set. Add it to your .env file.")
237
  task_lm = dspy.LM(model=HF_MODEL_STRING, **HF_LM_KWARGS)
238
  print(f"Task LM backend: HuggingFace Router ({HF_MODEL} @ {HF_ROUTER_BASE})")
 
 
 
239
  else:
240
  task_lm = dspy.LM(model=TASK_MODEL_STRING, **TASK_LM_KWARGS)
241
  print(f"Task LM backend: LM Studio ({TASK_MODEL_STRING} @ {LM_STUDIO_BASE})")
data/.gitkeep ADDED
File without changes
data/synthetic_questions.jsonl DELETED
The diff for this file is too large to render. See raw diff
 
dataset_generator.py DELETED
@@ -1,332 +0,0 @@
1
- """
2
- dataset_generator.py — produce ~500 unique, life-grounded questions.
3
-
4
- The dataset is the GEPA training/validation pool. We want:
5
- - Coverage across life domains (career, grief, identity, dharma, practice, ...)
6
- - Variety in voice (anguished / intellectual / sarcastic / exhausted / hopeful)
7
- - Variety in form (direct question / vent / philosophical doubt / dilemma)
8
- - Variety in age & life-stage cues
9
- - Some cleanly Advaita-relevant, some that *force* the advisor to find the
10
- Advaita angle in something mundane (this is where over-fitting to "spiritual"
11
- questions usually shows up)
12
-
13
- Strategy: structured combinatorics × LM rewriting × similarity dedupe.
14
-
15
- We construct (domain, scenario, voice, form) tuples, send them to the local LM
16
- to write each as a real human message, then dedupe by embedding similarity.
17
- """
18
-
19
- from __future__ import annotations
20
- import argparse
21
- import json
22
- import random
23
- import re
24
- from dataclasses import dataclass, asdict
25
- from pathlib import Path
26
-
27
- import numpy as np
28
- from sentence_transformers import SentenceTransformer
29
- from tqdm import tqdm
30
- import dspy
31
-
32
- import config
33
-
34
-
35
- # ──────────────────────────── Taxonomy ────────────────────────────
36
- DOMAINS: dict[str, list[str]] = {
37
- "career_and_purpose": [
38
- "got laid off after years of dedication",
39
- "achieved the big career goal and feels empty",
40
- "stuck in a job that pays well but feels meaningless",
41
- "wants to leave stable career to pursue art / spiritual path",
42
- "watching peers succeed while their own work plateaus",
43
- "facing retirement and loss of identity tied to work",
44
- "imposter syndrome after a major promotion",
45
- "publicly failed in front of colleagues",
46
- ],
47
- "romantic_relationships": [
48
- "going through a painful breakup after long relationship",
49
- "marriage has gone cold and considering divorce",
50
- "in love with someone who doesn't love them back",
51
- "obsessive jealousy about a partner's past",
52
- "tempted to have an affair",
53
- "partner died and grief is overwhelming",
54
- "afraid of commitment despite loving partner",
55
- "single in their 40s and despairing about it",
56
- ],
57
- "family": [
58
- "parent is dying and they have unresolved conflict",
59
- "estranged from a sibling for years",
60
- "parents pressuring them about marriage / career",
61
- "child making destructive life choices",
62
- "caring for an aging parent and exhausted",
63
- "had a falling out with adult child",
64
- "mother-in-law conflict ruining marriage",
65
- "feels they failed as a parent",
66
- ],
67
- "friendship_and_social": [
68
- "best friend betrayed their trust",
69
- "feels invisible and lonely in their 30s",
70
- "friend group has drifted apart with age",
71
- "social anxiety preventing them from connecting",
72
- "outgrown their old friends spiritually",
73
- "discovered close friend was talking behind their back",
74
- ],
75
- "mortality_and_loss": [
76
- "received a serious medical diagnosis",
77
- "watching a loved one die slowly",
78
- "afraid of death after a near-miss",
79
- "grieving a sudden, unexpected loss",
80
- "watching parents age and decline",
81
- "lost a child",
82
- "lost a pet who was their closest companion",
83
- "approaching old age with regret about unlived life",
84
- ],
85
- "identity_and_ego": [
86
- "tying self-worth entirely to external validation",
87
- "endlessly comparing themselves to others on social media",
88
- "going through midlife crisis questioning everything",
89
- "famous and feels everyone wants something from them",
90
- "lost sense of who they are after big life change",
91
- "racial / cultural identity feels splintered between worlds",
92
- "transitioning gender and family rejecting them",
93
- ],
94
- "material_life": [
95
- "drowning in debt and shame about it",
96
- "wealthy and feels guilty / disconnected because of it",
97
- "consumed by FOMO scrolling through richer friends' lives",
98
- "lost their home / financial security",
99
- "struggling to give up consumerist habits despite knowing better",
100
- "tempted by a get-rich-quick scheme",
101
- ],
102
- "existential": [
103
- "feels life has no meaning at all",
104
- "deeply depressed and going through the motions",
105
- "constant existential dread about the world's state",
106
- "doubting whether God / Brahman exists",
107
- "sees through everything and now nothing feels real",
108
- "feels they were 'born wrong' for this world",
109
- ],
110
- "spiritual_practice": [
111
- "meditation has gone dry after years of practice",
112
- "got addicted to spiritual highs and now they've stopped",
113
- "spiritual ego — feels superior to non-practitioners",
114
- "had a powerful experience and can't get back to it",
115
- "doubts whether their guru / lineage is right for them",
116
- "intellectually understands non-duality but doesn't feel it",
117
- "afraid that liberation means losing love for family",
118
- "can't reconcile traditional teachings with modern life",
119
- ],
120
-     "ethics_and_dharma": [
-         "told a serious lie and considering whether to confess",
-         "harmed someone in the past and can't forgive themselves",
-         "facing a moral dilemma at work involving dishonesty",
-         "tempted to retaliate against someone who wronged them",
-         "torn between duty to family and personal calling",
-         "did something they're deeply ashamed of",
-     ],
-     "health_and_body": [
-         "chronic illness reshaping their entire life",
-         "struggling with addiction and relapse",
-         "eating disorder they can't seem to escape",
-         "chronic pain making spiritual practice feel impossible",
-         "hates their aging body",
-         "cancer diagnosis reframing everything",
-     ],
-     "modernity_specific": [
-         "doomscrolling and feeling worse every day",
-         "AI / automation making them feel obsolete",
-         "climate dread paralyzing their life decisions",
-         "political division has destroyed family relationships",
-         "addicted to phone / can't focus / can't read books anymore",
-         "online persona feels disconnected from real self",
-     ],
- }
-
- VOICES = [
-     "anguished",
-     "exhausted",
-     "intellectual and analytical",
-     "darkly sarcastic",
-     "quietly hopeful",
-     "numb and dissociated",
-     "frustrated and angry",
-     "softly resigned",
- ]
-
- FORMS = [
-     "direct question",
-     "venting paragraph",
-     "philosophical doubt",
-     "practical dilemma asking what to do",
-     "stream-of-consciousness",
- ]
-
- AGE_CUES = [
-     "early 20s",
-     "late 20s",
-     "early 30s",
-     "late 30s",
-     "40s",
-     "50s",
-     "60s",
-     "70s",
-     "(no age cue)",
- ]
-
-
- @dataclass
- class QuestionRecord:
-     id: str
-     question: str
-     domain: str
-     scenario: str
-     voice: str
-     form: str
-     age_cue: str
-
-
- # ──────────────────────────── LM-driven phrasing ────────────────────────────
- class WriteUserMessage(dspy.Signature):
-     """Write a single, realistic message that a person might send to a spiritual
-     advisor. The message must reflect the given scenario, voice, form, and age
-     cue. Do NOT include scripture references, do NOT name Vedānta concepts —
-     write as a real person speaking from their actual life. Avoid generic phrases
-     like 'help me find peace' or 'I want to grow spiritually'. Be specific, lived,
-     grounded in detail. 2-6 sentences."""
-
-     scenario: str = dspy.InputField()
-     voice: str = dspy.InputField()
-     form: str = dspy.InputField()
-     age_cue: str = dspy.InputField()
-
-     message: str = dspy.OutputField(desc="The user's message, in first person.")
-
-
- def _slug(s: str) -> str:
-     return re.sub(r"[^a-z0-9]+", "_", s.lower()).strip("_")[:60]
-
-
- def generate_questions(target_n: int = 500, seed: int = 7, use_local: bool = False) -> list[QuestionRecord]:
-     """Generate ~target_n unique questions via combinatorics + LM rewriting."""
-     rng = random.Random(seed)
-     if use_local:
-         config.configure_dspy()
-     else:
-         config.configure_enrich_lm()  # gpt-4o-mini: faster and more stylistically diverse
-     writer = dspy.Predict(WriteUserMessage)
-
-     # Build the (domain, scenario, voice, form, age) plan first
-     combos: list[tuple[str, str, str, str, str]] = []
-     for domain, scenarios in DOMAINS.items():
-         for scenario in scenarios:
-             # 5 variants per scenario varying voice/form/age
-             voices = rng.sample(VOICES, k=5)
-             forms = [rng.choice(FORMS) for _ in range(5)]
-             ages = rng.sample(AGE_CUES, k=5)
-             for v, f, a in zip(voices, forms, ages):
-                 combos.append((domain, scenario, v, f, a))
-
-     rng.shuffle(combos)
-
-     # Cap to a generous over-target; we'll dedupe down to target_n
-     over_target = int(target_n * 1.25)
-     combos = combos[:over_target]
-
-     records: list[QuestionRecord] = []
-     for i, (domain, scenario, voice, form, age) in enumerate(tqdm(combos, desc="Generating")):
-         try:
-             out = writer(scenario=scenario, voice=voice, form=form, age_cue=age)
-             msg = (out.message or "").strip()
-             if len(msg) < 30:
-                 continue
-             records.append(QuestionRecord(
-                 id=f"q_{i:04d}_{_slug(domain)}",
-                 question=msg,
-                 domain=domain,
-                 scenario=scenario,
-                 voice=voice,
-                 form=form,
-                 age_cue=age,
-             ))
-         except Exception as e:
-             # Local LMs occasionally hiccup. Log and continue.
-             print(f"[warn] generation failure on combo {i}: {e}")
-             continue
-
-     return _dedupe_by_similarity(records, target_n=target_n)
-
-
- def _dedupe_by_similarity(records: list[QuestionRecord], target_n: int, threshold: float = 0.92) -> list[QuestionRecord]:
-     """Embed and remove near-duplicates greedily."""
-     if not records:
-         return records
-     print(f"Deduping {len(records)} candidates ...")
-     embedder = SentenceTransformer(config.EMBED_MODEL, device=config.EMBED_DEVICE)
-     embs = embedder.encode(
-         [r.question for r in records],
-         normalize_embeddings=True,
-         show_progress_bar=True,
-         batch_size=32,
-     )
-     keep_idx: list[int] = []
-     kept_embs = []
-     for i, e in enumerate(embs):
-         if not kept_embs:
-             keep_idx.append(i)
-             kept_embs.append(e)
-             continue
-         sims = np.dot(np.stack(kept_embs), e)
-         if float(sims.max()) < threshold:
-             keep_idx.append(i)
-             kept_embs.append(e)
-         if len(keep_idx) >= target_n:
-             break
-     print(f"Kept {len(keep_idx)} after dedupe (target {target_n}).")
-     return [records[i] for i in keep_idx]
-
-
- def save_jsonl(records: list[QuestionRecord], path: Path):
-     with path.open("w", encoding="utf-8") as f:
-         for r in records:
-             f.write(json.dumps(asdict(r), ensure_ascii=False) + "\n")
-     print(f"Wrote {len(records)} questions to {path}")
-
-
- def load_jsonl(path: Path = config.DATASET_PATH) -> list[dict]:
-     with path.open(encoding="utf-8") as f:
-         return [json.loads(line) for line in f if line.strip()]
-
-
- def to_dspy_examples(records: list[dict]) -> list[dspy.Example]:
-     """The dataset has no gold labels — that's fine. GEPA's metric uses LLM
-     judgment + retrieval grounding rather than reference answers.
-     We carry the metadata as inputs-of-record so the metric can use them."""
-     out = []
-     for r in records:
-         ex = dspy.Example(
-             user_question=r["question"],
-             history=dspy.History(messages=[]),
-             domain=r["domain"],
-             scenario=r["scenario"],
-         ).with_inputs("user_question", "history")
-         out.append(ex)
-     return out
-
-
- # ──────────────────────────── CLI ────────────────────────────
- def main():
-     ap = argparse.ArgumentParser()
-     ap.add_argument("--n", type=int, default=500)
-     ap.add_argument("--seed", type=int, default=7)
-     ap.add_argument("--out", type=str, default=str(config.DATASET_PATH))
-     ap.add_argument("--lm", choices=["openai", "local"], default="openai",
-                     help="openai = gpt-4o-mini (default, faster); local = LM Studio task LM")
-     args = ap.parse_args()
-
-     records = generate_questions(target_n=args.n, seed=args.seed, use_local=(args.lm == "local"))
-     save_jsonl(records, Path(args.out))
-
-
- if __name__ == "__main__":
-     main()
download_sources.py DELETED
@@ -1,195 +0,0 @@
- """
- download_sources.py — fetch every enabled source from the registry.
-
- What this does
- --------------
- Reads sources_registry.SOURCES, walks each enabled entry, and downloads its
- files into data/raw/<source_key>/. The downloader is deliberately dumb: it
- just gets the bytes onto disk. Parsing happens in a separate step (parsers/)
- so a download failure on one source doesn't block ingest of the others, and
- so re-parsing during prompt iteration doesn't re-hit the network.
-
- Why HTTPS via `requests` rather than git for everything
- -------------------------------------------------------
- Most of our sources are individual JSON or HTML files. Cloning a whole repo
- to get two files wastes bandwidth and makes the script brittle. For sources
- that *are* whole repos (rare in our registry), prefix the URL with `git+`.
-
- Idempotency
- -----------
- If a file is already present and not corrupt, we skip it. Pass --force to
- re-download. This makes it safe to run repeatedly while debugging parsers.
-
- Politeness
- ----------
- We send a real User-Agent and rate-limit to one request per second per host.
- Internet Archive and similar mirrors are gracious to projects that play nice;
- they can also throttle aggressively when they aren't.
- """
-
- from __future__ import annotations
- import argparse
- import shutil
- import subprocess
- import sys
- import time
- from collections import defaultdict
- from pathlib import Path
- from urllib.parse import urlparse
-
- import requests
- from tqdm import tqdm
-
- import config
- from sources_registry import SOURCES, Source
-
-
- RAW_DIR = config.DATA_DIR / "raw"
- USER_AGENT = (
-     "GitaAdvisor/0.2 (Advaita-Vedanta research project; "
-     "contact: <add your email here>)"
- )
-
- # Per-host minimum interval in seconds
- MIN_INTERVAL = 1.0
-
-
- def _filename_for_url(url: str) -> str:
-     """Derive a sensible local filename from a URL."""
-     parsed = urlparse(url)
-     name = Path(parsed.path).name or "index.html"
-     # archive.org sometimes serves djvu.txt with no extension on the URL;
-     # keep what's there.
-     return name
-
-
- def _is_git_url(url: str) -> bool:
-     return url.startswith("git+")
-
-
- _last_request_time: dict = defaultdict(float)
-
-
- def _polite_get(url: str) -> requests.Response:
-     """GET with rate limiting per host."""
-     host = urlparse(url).netloc
-     elapsed = time.time() - _last_request_time[host]
-     if elapsed < MIN_INTERVAL:
-         time.sleep(MIN_INTERVAL - elapsed)
-     _last_request_time[host] = time.time()
-     return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=60, stream=True)
-
-
- def _download_file(url: str, dest: Path, force: bool = False) -> bool:
-     """Download a single URL to dest. Returns True if a download happened
-     (vs being skipped because already present)."""
-     if dest.exists() and dest.stat().st_size > 0 and not force:
-         return False
-
-     dest.parent.mkdir(parents=True, exist_ok=True)
-     tmp = dest.with_suffix(dest.suffix + ".tmp")
-
-     with _polite_get(url) as r:
-         r.raise_for_status()
-         total = int(r.headers.get("content-length", 0)) or None
-         with tmp.open("wb") as out, tqdm(
-             total=total, unit="B", unit_scale=True, leave=False, desc=dest.name
-         ) as bar:
-             for chunk in r.iter_content(chunk_size=8192):
-                 if not chunk:
-                     continue
-                 out.write(chunk)
-                 bar.update(len(chunk))
-
-     tmp.replace(dest)
-     return True
-
-
- def _clone_git(url: str, dest_dir: Path, force: bool = False) -> bool:
-     """Clone a git repo (URL prefixed with 'git+') into dest_dir. Returns
-     True if a clone happened."""
-     real_url = url[len("git+"):]
-     if dest_dir.exists() and any(dest_dir.iterdir()) and not force:
-         return False
-     if dest_dir.exists():
-         shutil.rmtree(dest_dir)
-     dest_dir.parent.mkdir(parents=True, exist_ok=True)
-     subprocess.run(
-         ["git", "clone", "--depth=1", real_url, str(dest_dir)],
-         check=True,
-     )
-     return True
-
-
- def download_source(src: Source, force: bool = False) -> dict:
-     """Download all URLs for one source. Returns a small report dict."""
-     target = RAW_DIR / src.key
-     report = {"key": src.key, "ok": 0, "skipped": 0, "failed": []}
-
-     if not src.urls:
-         report["failed"].append("no URLs in registry entry")
-         return report
-
-     for url in src.urls:
-         if not url:
-             continue
-         try:
-             if _is_git_url(url):
-                 changed = _clone_git(url, target, force=force)
-             else:
-                 fname = _filename_for_url(url)
-                 changed = _download_file(url, target / fname, force=force)
-             if changed:
-                 report["ok"] += 1
-             else:
-                 report["skipped"] += 1
-         except Exception as e:
-             report["failed"].append(f"{url}: {e}")
-     return report
-
-
- def main():
-     ap = argparse.ArgumentParser(description="Download all enabled sources from the registry.")
-     ap.add_argument("--force", action="store_true",
-                     help="Re-download even if files exist.")
-     ap.add_argument("--only", nargs="*", default=None,
-                     help="Only download these source keys.")
-     args = ap.parse_args()
-
-     enabled = [s for s in SOURCES if s.enabled]
-     if args.only:
-         enabled = [s for s in enabled if s.key in set(args.only)]
-     if not enabled:
-         print("No enabled sources match. Edit sources_registry.py to enable some.")
-         sys.exit(1)
-
-     print(f"Downloading {len(enabled)} sources to {RAW_DIR}")
-     print(f"User-Agent: {USER_AGENT}")
-     print()
-
-     any_failed = False
-     for src in enabled:
-         print(f"━━━ {src.key} — {src.name}")
-         print(f"    license={src.license}  tier={src.tier}  parser={src.parser}")
-         if src.translator:
-             year = f", {src.year}" if src.year else ""
-             print(f"    translator: {src.translator}{year}")
-
-         report = download_source(src, force=args.force)
-         if report["failed"]:
-             any_failed = True
-             for f in report["failed"]:
-                 print(f"    [FAIL] {f}")
-         print(f"    downloaded={report['ok']}  cached={report['skipped']}")
-         print()
-
-     if any_failed:
-         print("Some sources failed. Re-run with the network available, or "
-               "edit the URL in sources_registry.py if a mirror has moved.")
-         sys.exit(2)
-     print("All enabled sources are now on disk under data/raw/.")
-     print("Next: python ingest_corpus.py")
-
-
- if __name__ == "__main__":
-     main()
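The per-host throttle inside `_polite_get` can be demonstrated without touching the network. This is a small sketch under stated assumptions: `wait_for_host` is an invented name isolating just the rate-limiting step, `MIN_INTERVAL` is shrunk for the demo, and `time.monotonic()` is used in place of the script's `time.time()` because it cannot jump backwards:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

MIN_INTERVAL = 0.05  # seconds between requests to the same host (1.0 in the real script)
_last_request_time: dict[str, float] = defaultdict(float)

def wait_for_host(url: str) -> str:
    """Sleep just long enough that consecutive calls to the same host are
    spaced at least MIN_INTERVAL apart; other hosts are never blocked."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request_time[host]
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request_time[host] = time.monotonic()
    return host

t0 = time.monotonic()
wait_for_host("https://archive.org/a")   # first call: no wait
wait_for_host("https://archive.org/b")   # same host: waits out the interval
spacing = time.monotonic() - t0
print(spacing >= MIN_INTERVAL)  # → True
```

Keying the timestamp dict by `netloc` rather than a single global stamp is what lets a multi-mirror registry download at full speed while still being polite to each individual server.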
enrich_corpus.py DELETED
@@ -1,174 +0,0 @@
- """
- enrich_corpus.py — run the local LLM over every verse, once, with caching.
-
- The cost calculus
- -----------------
- For ~3,000 verses at ~30s per call on a 26B-class local model, a full pass
- takes roughly 25 hours. That's tolerable as a one-time cost, intolerable as
- a recurring one. So caching is non-negotiable. We cache by verse_id and the
- enrichment_version stamp; if you change the prompt substantively, bump the
- version in enrichment.py and the next run re-enriches.
-
- What we write
- -------------
- data/corpus_enriched.jsonl — one EnrichedVerse per line, in the same order
- as data/corpus.jsonl. Failed enrichments are still written (with empty
- enrichment fields and an error stamp in enrichment_model) so the index can
- still cover them on their literal text.
-
- Concurrency
- -----------
- LM Studio's OpenAI-compatible server processes requests serially by default.
- We don't try to parallelize at the client; if you've configured your server
- for parallel decode, set --concurrency > 1 and DSPy will hold multiple
- in-flight calls. For modest hardware, 1 is correct.
-
- Resumability
- ------------
- If the run dies halfway, just re-run. The cache at data/enrichment_cache.jsonl
- remembers per-verse what we already did, so we pick up exactly where we left
- off. No flag is needed for resume; it's the default behavior.
- """
-
- from __future__ import annotations
- import argparse
- import json
- import os
- from dataclasses import asdict
- from pathlib import Path
- from typing import Iterable
-
- from tqdm import tqdm
- import dspy
-
- import config
- from corpus import Verse, EnrichedVerse, read_jsonl_verses, write_jsonl
- from enrichment import Enricher
-
-
- CACHE_PATH = config.DATA_DIR / "enrichment_cache.jsonl"
- ENRICHED_PATH = config.DATA_DIR / "corpus_enriched.jsonl"
-
-
- # ──────────────────────────── Cache I/O ────────────────────────────
- def _load_cache(path: Path) -> dict[str, EnrichedVerse]:
-     """Load cache as {verse_id: EnrichedVerse}. Tolerates partial writes."""
-     if not path.exists():
-         return {}
-     out: dict[str, EnrichedVerse] = {}
-     with path.open(encoding="utf-8") as f:
-         for line in f:
-             line = line.strip()
-             if not line:
-                 continue
-             try:
-                 d = json.loads(line)
-                 ev = EnrichedVerse(**{k: v for k, v in d.items() if k in EnrichedVerse.__dataclass_fields__})
-                 out[ev.verse_id] = ev
-             except Exception:
-                 continue
-     return out
-
-
- def _append_cache(path: Path, ev: EnrichedVerse) -> None:
-     """Append a single record. We use append-mode rather than rewriting so
-     a kill -9 mid-run loses at most one line."""
-     path.parent.mkdir(parents=True, exist_ok=True)
-     with path.open("a", encoding="utf-8") as f:
-         f.write(json.dumps(asdict(ev), ensure_ascii=False) + "\n")
-
-
- # ──────────────────────────── Main loop ────────────────────────────
- def enrich_all(
-     in_path: Path,
-     out_path: Path,
-     cache_path: Path,
-     limit: int | None = None,
-     re_enrich: bool = False,
-     only_failed: bool = False,
-     use_claude: bool = True,
- ) -> None:
-     if use_claude:
-         lm = config.configure_enrich_lm()
-         print(f"[enrich] LM: {lm.model} (Claude API)")
-     else:
-         config.configure_dspy()
-         print(f"[enrich] LM: {config.LOCAL_MODEL} (local LM Studio)")
-     enricher = Enricher()
-
-     cache = _load_cache(cache_path) if not re_enrich else {}
-     print(f"[enrich] cache contains {len(cache)} previously-enriched verses")
-
-     verses = list(read_jsonl_verses(in_path))
-     if limit:
-         verses = verses[:limit]
-     print(f"[enrich] enriching {len(verses)} verses from {in_path}")
-
-     enriched: list[EnrichedVerse] = []
-     pending = []
-     for v in verses:
-         cached = cache.get(v.verse_id)
-         if cached and not re_enrich:
-             if only_failed and cached.enrichment_model.startswith("FAILED"):
-                 pending.append(v)
-             else:
-                 enriched.append(cached)
-             continue
-         else:
-             pending.append(v)
-
-     print(f"[enrich] {len(enriched)} from cache, {len(pending)} to call LM for")
-
-     n_failed = 0
-     for v in tqdm(pending, desc="enriching"):
-         ev = enricher(verse=v)
-         _append_cache(cache_path, ev)
-         enriched.append(ev)
-         if not ev.is_enriched():
-             n_failed += 1
-
-     # Restore original verse order from in_path
-     by_id = {ev.verse_id: ev for ev in enriched}
-     ordered = [by_id[v.verse_id] for v in verses if v.verse_id in by_id]
-
-     n_written = write_jsonl(ordered, out_path)
-     print(f"[enrich] wrote {n_written} enriched verses to {out_path}")
-     if n_failed:
-         print(f"[enrich] WARNING: {n_failed} verses failed enrichment "
-               f"(empty fields, indexed only on literal text). "
-               f"Re-run with --only-failed to retry just those.")
-
-
- # ──────────────────────────── CLI ────────────────────────────
- def main():
-     ap = argparse.ArgumentParser()
-     ap.add_argument("--in", dest="in_path",
-                     default=str(config.DATA_DIR / "corpus.jsonl"))
-     ap.add_argument("--out", default=str(ENRICHED_PATH))
-     ap.add_argument("--cache", default=str(CACHE_PATH))
-     ap.add_argument("--limit", type=int, default=None,
-                     help="Enrich only the first N verses (smoke-test).")
-     ap.add_argument("--re-enrich", action="store_true",
-                     help="Ignore cache and re-enrich everything. Use this "
-                          "when you change the enrichment prompt.")
-     ap.add_argument("--only-failed", action="store_true",
-                     help="Re-run only the verses whose previous enrichment "
-                          "failed (FAILED stamp in enrichment_model).")
-     ap.add_argument("--lm", choices=["claude", "local"], default="claude",
-                     help="Which LM to use: 'claude' (default, Sonnet 4.6 via API) "
-                          "or 'local' (LM Studio). Claude requires ANTHROPIC_API_KEY.")
-     args = ap.parse_args()
-
-     enrich_all(
-         in_path=Path(args.in_path),
-         out_path=Path(args.out),
-         cache_path=Path(args.cache),
-         limit=args.limit,
-         re_enrich=args.re_enrich,
-         only_failed=args.only_failed,
-         use_claude=(args.lm == "claude"),
-     )
-
-
- if __name__ == "__main__":
-     main()
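The append-only cache contract that makes this script resumable is simple enough to sketch in isolation. The helper names `append_record` and `load_cache` are invented for illustration (the real module maps `verse_id` to a full `EnrichedVerse` dataclass); the point is why a torn final line from a killed run does not poison the cache:

```python
import json
import tempfile
from pathlib import Path

def append_record(path: Path, record: dict) -> None:
    # Append-mode: a crash mid-run loses at most the final, partial line.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_cache(path: Path, key: str = "verse_id") -> dict[str, dict]:
    # Tolerate a torn final line from an interrupted run: skip, don't raise.
    out: dict[str, dict] = {}
    if not path.exists():
        return out
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            d = json.loads(line)
            out[d[key]] = d
        except (json.JSONDecodeError, KeyError):
            continue
    return out

cache_file = Path(tempfile.mkdtemp()) / "cache.jsonl"
append_record(cache_file, {"verse_id": "BG 2.47", "paraphrase": "act without claiming the fruits"})
with cache_file.open("a", encoding="utf-8") as f:
    f.write('{"verse_id": "BG 2.4')  # simulate a write torn by kill -9
cache = load_cache(cache_file)
print(sorted(cache))  # → ['BG 2.47']
```

On the next run the torn record simply isn't in the cache, so its verse lands back in `pending` and gets re-enriched, which is exactly the "no flag needed for resume" behavior the docstring promises.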
enrichment.py DELETED
@@ -1,266 +0,0 @@
- """
- enrichment.py — turn a Verse into an EnrichedVerse using the local LLM.
-
- This module is the heart of the redesign. Instead of hoping that vector
- similarity between a user's English question and a Sanskrit verse will find
- the right teaching, we run a one-time offline pass that asks the local LLM
- to translate each verse into the language a real person would use to seek
- help. The output gets stored alongside the verse and embedded for retrieval.
-
- What the prompt asks for, and why each field
- --------------------------------------------
- We extract six fields. Each one earns its place by closing a different gap
- between scripture and a user's question:
-
- paraphrase              — what the verse teaches, in plain modern English.
-                           This is what the synthesizer reads when writing
-                           the advisor's reply, so paraphrase quality matters
-                           more than embedding quality.
-
- themes                  — Vedānta concepts engaged. Tradition-native names
-                           (karma_yoga, vairagya, sakshi, two_truths). Used
-                           for filtering and for ensuring the metric can
-                           verify Advaita-coherence.
-
- life_situations         — the predicaments where this verse helps. User-
-                           language. This is the field that does the actual
-                           bridging: a query about "facing failure" finds
-                           BG 2.47 even though those words aren't in the verse.
-
- emotions_addressed      — drawn from a fixed vocabulary so we get faceted
-                           filtering rather than free-text drift. The metric
-                           uses this to verify that retrieved verses actually
-                           address the user's felt emotion.
-
- practical_teaching      — what the verse asks the seeker to do or shift.
-                           The synthesizer uses this as the seed for its
-                           "concrete practice you can try this week" close.
-
- hypothetical_questions  — five questions a real person might bring to the
-                           verse. Highest-leverage field for retrieval recall.
-
- A closed vocabulary for emotions
- --------------------------------
- We constrain `emotions_addressed` to the EMOTION_VOCAB list below. If we let
- the LLM generate freely, we get drift: "sadness" / "sorrow" / "melancholy" /
- "grief-tinged blue" all become separate buckets, and faceted filtering
- becomes useless. Closed vocab keeps the index sharp.
-
- We don't constrain themes the same way because the Sanskrit conceptual
- vocabulary is open-ended and forcing the LLM into a small list would lose
- information. We just normalize for casing/spacing in post-processing.
-
- Working with a flaky local LLM
- ------------------------------
- Local 26B-class models occasionally produce malformed structured output.
- This module assumes that. The enrich() function:
-   - validates output against minimum-quality checks
-   - retries up to 2 times with temperature=0
-   - on persistent failure, returns an EnrichedVerse with empty enrichment
-     fields rather than raising — so the corpus can still index on the
-     literal text + bhāṣya and the verse isn't lost
- """
-
- from __future__ import annotations
- import re
- from dataclasses import asdict
-
- import dspy
-
- from corpus import Verse, EnrichedVerse
-
-
- # ──────────────────────────── Closed emotion vocabulary ────────────────────────────
- # Twenty buckets, ordered roughly from acute to diffuse. Adding entries is
- # easy; removing them risks orphaning previously-enriched records.
- EMOTION_VOCAB: tuple[str, ...] = (
-     "grief",               # acute loss
-     "anticipatory_grief",  # loss in advance
-     "fear",                # discrete fear
-     "anxiety",             # chronic, diffuse
-     "despair",             # loss of hope
-     "shame",               # self-as-bad
-     "guilt",               # action-as-bad
-     "anger",
-     "resentment",
-     "envy",
-     "jealousy",
-     "longing",
-     "loneliness",
-     "doubt",               # epistemic; not knowing
-     "disillusionment",     # the hollowness of attained goals
-     "boredom",             # the inertness of repetition
-     "restlessness",        # the inability to settle
-     "frustration",
-     "confusion",
-     "numbness",            # affect-blunted
- )
-
-
- # ──────────────────────────── DSPy signature ────────────────────────────
- class EnrichVerse(dspy.Signature):
-     """You are an Advaita-Vedānta-trained reader producing structured metadata
-     for a verse from the Bhagavad Gītā or a related scripture, so that a
-     spiritual advisor can later find this verse when a real person describes
-     a life situation in everyday language. Stay strictly within the framework
-     of Śaṅkarācārya's non-dual interpretation. Do not import dualistic notions
-     (separate creator/creature, soul-merging-into-God-as-other, etc.) and do
-     not bypass the verse's plain meaning by always retreating to the absolute.
-
-     The verse may include the Sanskrit, the English translation, and (when
-     available) Śaṅkara's commentary. Read all three. Your output is structured
-     fields, not prose. Be specific, lived, concrete. Avoid generic spiritual
-     language ('find peace', 'be in the moment'). Avoid tradition-foreign
-     therapy language ('honor your feelings'). When in doubt about a field,
-     leave it shorter rather than padded."""
-
-     # Inputs — the verse in its richest available form
-     verse_ref: str = dspy.InputField(desc="Citation form, e.g. 'BG 2.47'.")
-     sanskrit: str = dspy.InputField(desc="Devanāgarī text, may be empty.")
-     translation: str = dspy.InputField(desc="English translation of the verse.")
-     bhashya: str = dspy.InputField(desc="Śaṅkara's commentary on this verse, may be empty.")
-
-     # Outputs
-     paraphrase: str = dspy.OutputField(
-         desc="One or two sentences in plain modern English stating what the "
-              "verse teaches. Not a translation; a teaching summary. No jargon."
-     )
-     themes: list[str] = dspy.OutputField(
-         desc="2–5 Vedānta concepts the verse engages, in tradition-native "
-              "vocabulary with snake_case_keys, e.g. ['karma_yoga', 'non_attachment', "
-              "'two_truths']. Use Sanskrit terms where they're the right name."
-     )
-     life_situations: list[str] = dspy.OutputField(
-         desc="3–6 specific human predicaments this verse would help with, "
-              "in everyday English. e.g. 'facing public failure after years of "
-              "effort'. NOT 'finding peace' or 'spiritual growth'."
-     )
-     emotions_addressed: list[str] = dspy.OutputField(
-         desc="The emotions this verse meets, drawn ONLY from this fixed list: "
-              + ", ".join(EMOTION_VOCAB) + ". 1–4 entries."
-     )
-     practical_teaching: str = dspy.OutputField(
-         desc="One sentence: what the verse asks the seeker to actually do or "
-              "shift. If the verse is purely ontological, write 'pure ontology — "
-              "no direct prescription' and the field will be ignored downstream."
-     )
-     hypothetical_questions: list[str] = dspy.OutputField(
-         desc="EXACTLY 5 first-person questions a real person might write to a "
-              "spiritual advisor that this verse would speak to. Specific, "
-              "ungeneric, in the user's voice. NOT in scripture's voice. e.g. "
-              "'I worked on this for three years and it just failed publicly — "
-              "how do I keep going?'"
-     )
-
-
- # ──────────────────────────── Validators ────────────────────────────
- THEME_KEY_RX = re.compile(r"^[a-z][a-z0-9_]{2,40}$")
-
-
- def _normalize_theme(t: str) -> str:
-     t = t.strip().lower()
-     t = re.sub(r"[\s\-]+", "_", t)
-     t = re.sub(r"[^a-z0-9_]", "", t)
-     return t
-
-
- def _validate(pred) -> tuple[bool, str]:
-     """Light schema check. Returns (ok, reason_if_not_ok). Used to decide
-     whether to retry the LM call with a stricter prompt."""
-     paraphrase = (pred.paraphrase or "").strip()
-     if len(paraphrase) < 20:
-         return False, "paraphrase too short"
-
-     qs = pred.hypothetical_questions or []
-     if not isinstance(qs, list) or len(qs) < 3:
-         return False, f"need ≥3 hypothetical_questions, got {len(qs)}"
-
-     sits = pred.life_situations or []
-     if not isinstance(sits, list) or len(sits) < 2:
-         return False, f"need ≥2 life_situations, got {len(sits)}"
-
-     emos = pred.emotions_addressed or []
-     if not isinstance(emos, list) or not emos:
-         return False, "emotions_addressed empty"
-     bad = [e for e in emos if _normalize_theme(e) not in EMOTION_VOCAB]
-     if bad:
-         return False, f"emotions outside vocabulary: {bad}"
-
-     themes = pred.themes or []
-     if not isinstance(themes, list) or not themes:
-         return False, "themes empty"
-
-     return True, ""
-
-
- # ──────────────────────────── Module ────────────────────────────
- class Enricher(dspy.Module):
-     """Wraps the EnrichVerse signature with retries and post-processing.
-
-     Why ChainOfThought over Predict
-     -------------------------------
-     GEPA may eventually optimize this prompt too, and ChainOfThought gives it
-     a `reasoning` trace to inspect during reflection. The cost is one extra
-     paragraph of LM output per call, which is negligible at our scale.
-     """
-
-     def __init__(self, max_retries: int = 2):
-         super().__init__()
-         self.predict = dspy.ChainOfThought(EnrichVerse)
-         self.max_retries = max_retries
-
-     def forward(self, verse: Verse) -> EnrichedVerse:
-         attempt = 0
-         last_err = ""
-         pred = None
-
-         while attempt <= self.max_retries:
-             try:
-                 pred = self.predict(
-                     verse_ref=verse.verse_ref,
-                     sanskrit=verse.sanskrit or "",
-                     translation=verse.translation or "",
-                     bhashya=verse.bhashya or "",
-                 )
-                 ok, reason = _validate(pred)
-                 if ok:
-                     last_err = ""  # clear any earlier attempt's failure reason
-                     break
-                 last_err = reason
-             except Exception as e:
-                 last_err = f"LM error: {e}"
-             attempt += 1
-
-         # Build the EnrichedVerse from the Verse + whatever we got
-         base = asdict(verse)
-         ev = EnrichedVerse(**base)
-
-         if pred and not last_err:
-             ev.paraphrase = (pred.paraphrase or "").strip()
-             ev.practical_teaching = (pred.practical_teaching or "").strip()
-             ev.themes = [
-                 _normalize_theme(t) for t in (pred.themes or [])
-                 if THEME_KEY_RX.match(_normalize_theme(t))
-             ]
-             ev.life_situations = [
-                 s.strip() for s in (pred.life_situations or [])
-                 if s and len(s.strip()) >= 5
-             ]
-             ev.emotions_addressed = [
-                 _normalize_theme(e) for e in (pred.emotions_addressed or [])
-                 if _normalize_theme(e) in EMOTION_VOCAB
-             ]
-             ev.hypothetical_questions = [
-                 q.strip() for q in (pred.hypothetical_questions or [])
-                 if q and len(q.strip()) >= 10
-             ][:5]  # cap at 5
-
-             # Stamp the model so re-runs after a model swap can be detected
-             try:
-                 lm = dspy.settings.lm
-                 ev.enrichment_model = getattr(lm, "model", "") or ""
-             except Exception:
-                 pass
-         else:
-             # Enrichment failed; keep the verse but mark it
-             ev.enrichment_model = f"FAILED: {last_err}"
-
-         return ev
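The closed-vocabulary filtering that keeps `emotions_addressed` from drifting can be shown in a few lines. This sketch uses an abridged vocabulary, and `normalize`/`filter_emotions` are illustrative stand-ins for `_normalize_theme` and the post-processing step in `Enricher.forward`:

```python
import re

EMOTION_VOCAB = ("grief", "anticipatory_grief", "fear", "anxiety", "shame", "guilt")  # abridged

def normalize(t: str) -> str:
    # Lowercase, collapse spaces/hyphens to underscores, drop everything else.
    t = t.strip().lower()
    t = re.sub(r"[\s\-]+", "_", t)
    return re.sub(r"[^a-z0-9_]", "", t)

def filter_emotions(raw: list[str]) -> list[str]:
    # Keep only labels that land inside the closed vocabulary after normalization.
    return [normalize(e) for e in raw if normalize(e) in EMOTION_VOCAB]

print(filter_emotions(["  Grief ", "anticipatory-sorrow", "SHAME!", "fear"]))
# → ['grief', 'shame', 'fear']
```

Note that normalization is applied before the membership test, so casing, punctuation, and hyphen variants of a valid label survive, while genuinely off-vocabulary inventions like "anticipatory-sorrow" are silently dropped rather than creating a new bucket.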
ingest_corpus.py DELETED
@@ -1,203 +0,0 @@
-"""
-ingest_corpus.py — run the parsers and produce data/corpus.jsonl.
-
-This script lives between download_sources.py (which gets bytes onto disk)
-and enrich_corpus.py (which adds LLM-derived fields). Its specific job:
-
-1. Walk each enabled source in the registry.
-2. Dispatch to its parser, which yields Verse records.
-3. Merge records across sources by verse_ref.
-   - The Gītā parser yields verses with translation but no bhāṣya.
-   - The Sastry parser yields verses with bhāṣya but spotty translation.
-   - We want one record per verse, with both populated when possible.
-4. Write the merged stream as JSONL to data/corpus.jsonl.
-
-Why merge by verse_ref rather than verse_id
--------------------------------------------
-The Gītā parser uses work='bhagavad_gita' and the Sastry parser uses
-work='bhagavad_gita_bhashya'. Their verse_ids therefore differ (different
-work prefix), but their verse_refs match — both render as 'BG 2.47'. We
-key the merge on verse_ref since that's the reader-facing canonical citation.
-
-Conflict policy when merging
-----------------------------
-- Translation: keep whichever record has it; if both, prefer the one whose
-  source_key is in the GITA_TEXT_PRIORITY list. (We want the modern, clean
-  Sivananda over Sastry's archaic English-of-Śaṅkara-paraphrasing-the-verse.)
-- Bhāṣya: only one source produces this; conflicts shouldn't happen.
-- Sanskrit / transliteration / word_meanings: prefer gita_json; richer.
-"""
-
-from __future__ import annotations
-import argparse
-from collections import defaultdict
-from pathlib import Path
-from typing import Iterable
-
-from tqdm import tqdm
-
-import config
-from corpus import Verse, write_jsonl
-from sources_registry import enabled_sources, by_key, Source
-
-# Parsers
-from parsers import gita_json as parser_gita_json
-from parsers import sastry_archive as parser_sastry
-
-
-# When two sources both have a translation, this list decides which wins
-GITA_TEXT_PRIORITY = ("gita_json_core", "sastry_gita_bhashya")
-
-
-def _parse_source(src: Source, raw_dir: Path) -> Iterable[Verse]:
-    """Dispatch to the right parser for a registry entry.
-
-    Each parser is documented to take a directory and return an iterable of
-    Verses; this function is just a switch table.
-    """
-    if src.parser == "gita_json":
-        # The gita_json parser can take both the core dir and (optionally) a
-        # translations dir. We pass the same dir for both since the downloader
-        # puts all gita_json* files into per-source folders.
-        if src.key == "gita_json_core":
-            translations_dir = raw_dir.parent / "gita_json_translations"
-            return parser_gita_json.parse(
-                raw_dir,
-                translations_dir if translations_dir.exists() else None,
-            )
-        # The translations source is "consumed" alongside core, not parsed alone
-        return iter(())
-
-    if src.parser == "sastry_archive":
-        return parser_sastry.parse(raw_dir)
-
-    if src.parser == "wisdomlib_html":
-        # Stub for now — see parsers/wisdomlib_html.py to implement.
-        # We don't fail the whole ingest just because one parser is unimplemented.
-        print(f"[ingest] wisdomlib_html parser not implemented yet — skipping {src.key}")
-        return iter(())
-
-    if src.parser == "thibaut_sbe":
-        print(f"[ingest] thibaut_sbe parser not implemented yet — skipping {src.key}")
-        return iter(())
-
-    if src.parser == "plain_text":
-        # Reserved for user-dropped texts; future work
-        return iter(())
-
-    raise ValueError(f"Unknown parser type: {src.parser}")
-
-
-def _merge(records: list[Verse]) -> list[Verse]:
-    """Merge multiple parser outputs into one record per verse_ref.
-
-    The output preserves the order of first appearance, so the corpus.jsonl
-    file is naturally chapter-then-verse ordered.
-    """
-    by_ref: dict[str, Verse] = {}
-    order: list[str] = []
-
-    for r in records:
-        if r.verse_ref not in by_ref:
-            by_ref[r.verse_ref] = r
-            order.append(r.verse_ref)
-            continue
-
-        existing = by_ref[r.verse_ref]
-
-        # Translation: pick higher-priority source if both have one
-        new_translation = existing.translation
-        new_translator = existing.translator
-        if r.translation and (
-            not existing.translation
-            or _priority(r.source_key) < _priority(existing.source_key)
-        ):
-            new_translation = r.translation
-            new_translator = r.translator
-
-        # Bhashya: only one source typically has it, take whichever isn't blank
-        new_bhashya = existing.bhashya or r.bhashya
-        new_bhashya_tr = existing.bhashya_translator or r.bhashya_translator
-
-        # Sanskrit family of fields: prefer the existing record if it has them,
-        # else take from the new record
-        merged = Verse(
-            verse_id=existing.verse_id,
-            work=existing.work,  # keep the work_display of whichever came first
-            work_display=existing.work_display,
-            verse_ref=existing.verse_ref,
-            tier=_choose_tier(existing.tier, r.tier),
-            section=existing.section or r.section,
-            section_display=existing.section_display or r.section_display,
-            translation=new_translation,
-            translator=new_translator,
-            sanskrit=existing.sanskrit or r.sanskrit,
-            transliteration=existing.transliteration or r.transliteration,
-            word_meanings=existing.word_meanings or r.word_meanings,
-            bhashya=new_bhashya,
-            bhashya_translator=new_bhashya_tr,
-            source_key=existing.source_key + "+" + r.source_key,
-            license=existing.license or r.license,
-        )
-        by_ref[r.verse_ref] = merged
-
-    return [by_ref[k] for k in order]
-
-
-def _priority(source_key: str) -> int:
-    """Lower is higher-priority. Sources not in the priority list rank last."""
-    for i, key in enumerate(GITA_TEXT_PRIORITY):
-        if source_key == key or source_key.startswith(key + "+") or source_key.endswith("+" + key):
-            return i
-    return 99
-
-
-def _choose_tier(a: str, b: str) -> str:
-    """When two records merge, the tier of the merged verse is the most
-    'authoritative' of the two: primary > shankara > supporting.
-
-    Why primary > shankara: when we have both the verse text (primary) and
-    Śaṅkara's bhāṣya on it (shankara) folded into one record, the verse
-    itself is what the citation refers to — so primary wins."""
-    rank = {"primary": 0, "shankara": 1, "supporting": 2}
-    return a if rank.get(a, 9) <= rank.get(b, 9) else b
-
-
-# ──────────────────────────── CLI ────────────────────────────
-def main():
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--out", default=str(config.DATA_DIR / "corpus.jsonl"))
-    args = ap.parse_args()
-
-    raw_root = config.DATA_DIR / "raw"
-    if not raw_root.exists():
-        raise SystemExit("data/raw/ doesn't exist. Run download_sources.py first.")
-
-    all_records: list[Verse] = []
-    for src in enabled_sources():
-        raw_dir = raw_root / src.key
-        if not raw_dir.exists():
-            print(f"[ingest] {src.key}: no files at {raw_dir}; skipping")
-            continue
-        print(f"[ingest] parsing {src.key} via {src.parser}")
-        try:
-            n_before = len(all_records)
-            for v in _parse_source(src, raw_dir):
-                if v.has_content():
-                    all_records.append(v)
-            print(f"[ingest] yielded {len(all_records) - n_before} records")
-        except Exception as e:
-            print(f"[ingest] {src.key} failed: {e}")
-
-    print(f"[ingest] merging {len(all_records)} records by verse_ref ...")
-    merged = _merge(all_records)
-    print(f"[ingest] {len(merged)} unique verses after merge")
-
-    out_path = Path(args.out)
-    n = write_jsonl(merged, out_path)
-    print(f"[ingest] wrote {n} verses to {out_path}")
-    print(f"[ingest] next: python enrich_corpus.py")
-
-
-if __name__ == "__main__":
-    main()
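The source-priority rule in the deleted ingest_corpus.py is easy to get subtly wrong (composite keys like `"a+b"` have to match on either side of the `+`), so here is the logic isolated as a runnable sketch. `pick_translation` is a hypothetical helper condensing the translation-conflict branch of `_merge`; it is not a function from the deleted file.

```python
# Lower index wins; mirrors GITA_TEXT_PRIORITY from the deleted ingest script
GITA_TEXT_PRIORITY = ("gita_json_core", "sastry_gita_bhashya")

def priority(source_key: str) -> int:
    # Composite keys like "gita_json_core+sastry_gita_bhashya" (produced by
    # earlier merges) match on either side of the "+"
    for i, key in enumerate(GITA_TEXT_PRIORITY):
        if (source_key == key
                or source_key.startswith(key + "+")
                or source_key.endswith("+" + key)):
            return i
    return 99  # unknown sources rank last

def pick_translation(existing, candidate):
    # Each arg is a (translation_text, source_key) pair.
    # Keep whichever has text; if both do, the higher-priority source wins.
    if candidate[0] and (not existing[0]
                         or priority(candidate[1]) < priority(existing[1])):
        return candidate
    return existing
```

Because the priority comparison is strict (`<`), the record already in place wins ties, which keeps the merge deterministic regardless of source ordering within the same priority rank.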
 
knowledge_base.py CHANGED
@@ -55,7 +55,6 @@ from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Iterable
 
-import numpy as np
 import chromadb
 from chromadb.config import Settings
 from sentence_transformers import SentenceTransformer
@@ -136,47 +135,7 @@ def _client() -> chromadb.api.ClientAPI:
     )
 
 
-class _HFEmbedder:
-    """Drop-in for SentenceTransformer that calls the HF Inference API.
-
-    Used on HF Spaces to avoid the ~400 MB cold-start cost of loading the
-    local sentence-transformer. The HF Inference API handles the model
-    server-side; we only pay a network round-trip per batch.
-    """
-
-    def __init__(self, model: str, token: str) -> None:
-        from huggingface_hub import InferenceClient
-        self._client = InferenceClient(token=token)
-        self._model = model
-
-    def encode(
-        self,
-        sentences: list[str],
-        normalize_embeddings: bool = True,
-        show_progress_bar: bool = False,
-        batch_size: int = 64,
-    ) -> np.ndarray:
-        all_embs: list[list[float]] = []
-        for i in range(0, len(sentences), batch_size):
-            batch = sentences[i : i + batch_size]
-            result = self._client.feature_extraction(batch, model=self._model)
-            # API returns list[list[float]] for batch; list[float] for single
-            if isinstance(result, list) and result and isinstance(result[0], float):
-                result = [result]
-            all_embs.extend(result)
-        embs = np.array(all_embs, dtype=np.float32)
-        if normalize_embeddings:
-            norms = np.linalg.norm(embs, axis=1, keepdims=True)
-            embs = embs / np.where(norms > 0, norms, 1.0)
-        return embs
-
-
-def _embedder(force_local: bool = False) -> "SentenceTransformer | _HFEmbedder":
-    """Return an embedder. Prefers the HF Inference API when HF_TOKEN is set
-    (avoids loading a 400 MB model on Spaces). Pass force_local=True when
-    building the index locally to use the local model for batch efficiency."""
-    if config.HF_TOKEN and not force_local:
-        return _HFEmbedder(config.EMBED_MODEL, config.HF_TOKEN)
+def _embedder() -> SentenceTransformer:
     return SentenceTransformer(config.EMBED_MODEL, device=config.EMBED_DEVICE)
 
 
@@ -216,7 +175,7 @@ def build_index(corpus_path: Path | None = None) -> dict[str, int]:
     )
 
     print(f"Loading embedding model: {config.EMBED_MODEL} on {config.EMBED_DEVICE}")
-    embedder = _embedder(force_local=True)  # local model for batch-efficient index building
+    embedder = _embedder()
 
     client = _client()
     # Drop existing collections; build_index is "rebuild from scratch"
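One detail worth preserving from the removed `_HFEmbedder` is its zero-norm guard: when normalizing embeddings it divided by 1 wherever a row's norm was 0, so empty or degenerate vectors pass through unchanged instead of producing NaNs. A dependency-free sketch of that same behavior (pure Python rather than the numpy version the deleted class used):

```python
import math

def l2_normalize(vecs):
    # Divide each vector by its L2 norm; leave zero vectors unchanged
    # (mirrors the deleted guard: np.where(norms > 0, norms, 1.0)).
    out = []
    for v in vecs:
        n = math.sqrt(sum(x * x for x in v))
        out.append([x / n for x in v] if n > 0 else list(v))
    return out
```

With the HF Inference API path gone, normalization is handled by `SentenceTransformer.encode(normalize_embeddings=True)` directly, which is why the hand-rolled version could be deleted.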
metrics.py DELETED
@@ -1,435 +0,0 @@
-"""
-metrics.py — the metric is the specification.
-
-GEPA optimizes whatever the metric rewards. So the metric here is not a single
-number; it's a *contract* on what an Advaita-grounded, empathetic, practically
-useful response looks like — combined with rich textual feedback the reflection
-LM uses to rewrite prompts.
-
-We combine three signals:
-  1. Rule-based checks (fast, deterministic)
-     - citation grounding (cites real retrieved sources, not hallucinated)
-     - tier preference (primary + Śaṅkara > supporting)
-     - structural hygiene (length, has actionable element, no therapy clichés)
-  2. LLM-as-judge rubric scoring
-     - Advaita coherence (non-dual, not crypto-dualist)
-     - two-truths discipline (vyāvahārika ↔ pāramārthika)
-     - empathy without dissolving into the user's frame
-     - wit calibration (light around the predicament, never the pain)
-  3. Composite score + structured feedback string
-
-The function signature matches GEPA's metric contract:
-    metric(gold, pred, trace=None, pred_name=None, pred_trace=None) -> dspy.Prediction
-
-Returning dspy.Prediction(score=float, feedback=str) is the GEPA happy path.
-"""
-
-from __future__ import annotations
-import re
-import json
-from typing import Any
-import dspy
-
-
-# ──────────────────────────── Rule-based checks ────────────────────────────
-THERAPY_CLICHES = [
-    "you got this",
-    "be kind to yourself",
-    "self-care",
-    "just remember",
-    "trust the process",
-    "everything happens for a reason",
-    "you are enough",
-    "love and light",
-    "manifesting",
-    "send positive vibes",
-    "good vibes",
-]
-
-# Loose pattern catching citations like "BG 2.47", "Gītā 18.66", "Bṛhadāraṇyaka 4.4.5",
-# "Vivekacūḍāmaṇi 11", "Kaṭha Up. 1.3.14", etc.
-CITATION_PATTERN = re.compile(
-    r"\b("
-    r"BG\s*\d+[\.:]\d+"  # BG 2.47
-    r"|G[īi]t[āa]\s*\d+[\.:]\d+"  # Gita 2.47
-    r"|[A-ZĀĪŪṚḌṬṆṢŚḤṂa-zāīūṛḍṭṇṣśḥṃ]{3,}\s*Up\.?\s*\d+(?:[\.:]\d+){0,2}"  # Kaṭha Up. 1.2.3
-    r"|Vivekac[ūu]ḍāmaṇi\s*\d+"
-    r"|Ātmabodha\s*\d+"
-    r"|Tattvabodha\s*\d+"
-    r"|Brahma\s*S[ūu]tra\s*\d+[\.:]\d+(?:[\.:]\d+)?"
-    r"|Aṣṭāvakra\s*G[īi]t[āa]\s*\d+[\.:]\d+"
-    r")\b"
-)
-
-EMPATHY_OPENERS = [
-    "what you", "you're carrying", "you are carrying", "i hear",
-    "this hurts", "this is painful", "the weight", "sitting with",
-    "what you describe", "the ache",
-]
-
-ACTIONABLE_MARKERS = [
-    "this week", "today", "try this", "begin by", "for the next",
-    "each morning", "each evening", "when you notice", "the next time",
-    "as a practice", "sit for", "spend ", "over the next",
-]
-
-NON_DUAL_MARKERS = [
-    "witness", "sākṣī", "sakshi", "non-dual", "advaita",
-    "pāramārthika", "paramarthika", "vyāvahārika", "vyavaharika",
-    "ātman", "atman", "brahman", "adhyāsa", "adhyasa", "māyā", "maya",
-    "neti neti", "tat tvam asi", "ahaṁ brahmāsmi", "aham brahmasmi",
-    "self with a capital", "the seer", "awareness itself",
-]
-
-
-def _word_count(s: str) -> int:
-    return len(s.split())
-
-
-def _has_any(text: str, needles: list[str]) -> list[str]:
-    low = text.lower()
-    return [n for n in needles if n in low]
-
-
-def _normalize_for_match(s: str) -> str:
-    return re.sub(r"\s+", " ", s.lower()).strip()
-
-
-def _citation_grounding(
-    sources_cited: list[str],
-    retrieved_passages: list[dict],
-) -> tuple[float, list[str], list[str]]:
-    """Return (grounding_score, grounded_citations, ungrounded_citations).
-
-    With the verse-indexed corpus, each retrieved passage carries an exact
-    verse_ref string ('BG 2.47', 'Muṇḍaka Up. 2.1.3', etc.). Grounding becomes
-    an exact set-membership test rather than fuzzy substring matching, which
-    is dramatically sharper feedback for GEPA's reflection step: 'BG 2.47'
-    is grounded if and only if 'BG 2.47' was in the retrieved set.
-
-    We still tolerate light formatting noise: the synthesizer might write
-    'BG 2.47', 'Bhagavad Gītā 2.47', 'Gita 2:47', etc. We canonicalize to
-    'BG <chap>.<verse>' for Gītā citations before comparing. Other works
-    are matched directly by verse_ref string with whitespace normalized.
-    """
-    if not sources_cited:
-        return 0.0, [], []
-
-    retrieved_refs = {
-        _canonicalize_ref(h.get("verse_ref") or h.get("meta", {}).get("verse_ref", ""))
-        for h in retrieved_passages
-    }
-    retrieved_refs.discard("")
-
-    grounded, ungrounded = [], []
-    for c in sources_cited:
-        canon = _canonicalize_ref(c)
-        # Try direct match first, then a "substring of any retrieved" fallback
-        # for cases where the synthesizer paraphrases the citation
-        # ('chapter 2 verse 47' vs 'BG 2.47').
-        hit = canon in retrieved_refs or any(
-            canon and (canon in r or r in canon) for r in retrieved_refs
-        )
-        (grounded if hit else ungrounded).append(c)
-
-    score = len(grounded) / max(len(sources_cited), 1)
-    return score, grounded, ungrounded
-
-
-def _canonicalize_ref(s: str) -> str:
-    """Normalize a citation string so 'BG 2.47', 'Bhagavad Gītā 2.47',
-    'Gītā 2:47' all reduce to the same canonical form 'BG 2.47'."""
-    s = re.sub(r"\s+", " ", s.strip())
-    # Gītā variants
-    m = re.match(r"^(?:BG|Bhagavad\s*G[īi]t[āa]|G[īi]t[āa])\s*(\d+)[\.:](\d+)", s, re.I)
-    if m:
-        return f"BG {int(m.group(1))}.{int(m.group(2))}"
-    # Default: lowercased, colons → dots
-    return s.lower().replace(":", ".")
-
-
-def _tier_preference(
-    sources_cited: list[str],
-    retrieved_passages: list[dict],
-    selected_indices: list[int],
-) -> tuple[float, dict]:
-    """Reward responses whose *cited* passages came from primary/Śaṅkara tiers."""
-    if not selected_indices:
-        return 0.0, {"primary": 0, "shankara": 0, "supporting": 0}
-
-    counts = {"primary": 0, "shankara": 0, "supporting": 0}
-    for idx in selected_indices:
-        if 1 <= idx <= len(retrieved_passages):
-            tier = retrieved_passages[idx - 1].get("meta", {}).get("tier", "supporting")
-            counts[tier] = counts.get(tier, 0) + 1
-
-    total = sum(counts.values()) or 1
-    preferred = counts["primary"] + counts["shankara"]
-    return preferred / total, counts
-
-
-def rule_based_score(pred: dspy.Prediction) -> tuple[float, dict]:
-    """Returns (score in [0,1], breakdown dict)."""
-    response = getattr(pred, "response", "") or ""
-    sources_cited = getattr(pred, "sources_cited", []) or []
-    retrieved = getattr(pred, "retrieved_passages", []) or []
-    selected_idx = getattr(pred, "selected_indices", []) or []
-    felt = getattr(pred, "felt_emotion", "") or ""
-
-    wc = _word_count(response)
-    length_ok = 200 <= wc <= 600
-    length_score = 1.0 if length_ok else max(0.0, 1.0 - abs(wc - 350) / 350)
-
-    citations_in_text = CITATION_PATTERN.findall(response)
-    has_citation = bool(citations_in_text) or bool(sources_cited)
-    citation_score = 1.0 if has_citation else 0.0
-
-    grounding_score, grounded, ungrounded = _citation_grounding(sources_cited, retrieved)
-
-    tier_score, tier_counts = _tier_preference(sources_cited, retrieved, selected_idx)
-
-    cliches = _has_any(response, THERAPY_CLICHES)
-    cliche_penalty = min(1.0, 0.25 * len(cliches))
-    cliche_score = 1.0 - cliche_penalty
-
-    # Empathy: opening should signal acknowledgement of feeling
-    head = response[:300].lower()
-    empathy_hits = [m for m in EMPATHY_OPENERS if m in head]
-    # Bonus if the felt_emotion content is referenced (loosely)
-    if felt:
-        for tok in felt.lower().split():
-            if len(tok) > 4 and tok in head:
-                empathy_hits.append(f"echoes:{tok}")
-                break
-    empathy_score = min(1.0, 0.4 + 0.3 * len(empathy_hits))
-
-    actionable_hits = _has_any(response, ACTIONABLE_MARKERS)
-    actionable_score = 1.0 if actionable_hits else 0.4
-
-    nondual_hits = _has_any(response, NON_DUAL_MARKERS)
-    nondual_score = min(1.0, 0.4 + 0.2 * len(nondual_hits))
-
-    # Weighted aggregate
-    components = {
-        "length": (length_score, 0.05),
-        "citation_present": (citation_score, 0.08),
-        "citation_grounding": (grounding_score, 0.18),
-        "tier_preference": (tier_score, 0.12),
-        "no_cliches": (cliche_score, 0.10),
-        "empathy_opening": (empathy_score, 0.15),
-        "actionable": (actionable_score, 0.10),
-        "nondual_register": (nondual_score, 0.22),
-    }
-    score = sum(s * w for s, w in components.values())
-
-    breakdown = {
-        "score": score,
-        "word_count": wc,
-        "components": {k: round(v[0], 3) for k, v in components.items()},
-        "citations_in_text": citations_in_text,
-        "sources_cited": sources_cited,
-        "grounded_citations": grounded,
-        "ungrounded_citations": ungrounded,
-        "tier_counts": tier_counts,
-        "therapy_cliches_found": cliches,
-        "empathy_hits": empathy_hits,
-        "actionable_hits": actionable_hits,
-        "nondual_markers_found": nondual_hits,
-    }
-    return score, breakdown
-
-
-# ──────────────────────────── LLM-judge rubric ────────────────────────────
-class JudgeAdvice(dspy.Signature):
-    """You are an examiner of Advaita-Vedānta spiritual counsel in the lineage
-    of Ādi Śaṅkarācārya. Score the advisor's response against the user's
-    question on each rubric (0.0 to 1.0) and write a short critique that an
-    optimizer can use to *improve the prompts that produced this response*.
-
-    Rubrics:
-
-    - advaita_coherence: Does the response reflect genuine non-dualism
-      (jīva-ātman-brahman identity), or does it accidentally smuggle in dualism
-      ('the soul reaches God', 'becoming one with the universe' as if they were
-      separate, etc.)? Does it avoid collapsing into nihilism ('nothing is
-      real')?
-
-    - two_truths_discipline: Does it honor the distinction between
-      vyāvahārika (transactional, where the user's pain and choices are real
-      and matter) and pāramārthika (absolute, where the witness is untouched)?
-      Failure modes: spiritual bypass (denying the pain by pointing to the
-      absolute), or pure-therapy register (forgetting the absolute exists).
-
-    - empathy_without_dissolving: Does it meet the user in their felt
-      experience without either flattening into therapy-speak OR dismissing
-      the feeling with premature transcendence?
-
-    - wit_calibration: Is there a light, dry touch around the cosmic
-      predicament (Śaṅkara himself is dry; this is consistent with the
-      tradition) WITHOUT being flippant about the user's actual pain? Both
-      'too solemn throughout' and 'making jokes about their situation' lose
-      points.
-
-    - source_integration: Are scriptural citations woven into the prose
-      (illuminating the point) rather than dumped as block quotes or used
-      as decoration? Are the references specific (Gītā 2.47, not just
-      "the Gita says")?
-
-    - practical_offering: Does the response close with something the user
-      can actually try — a question to sit with, a practice, a perspective
-      shift — rather than abstract platitudes?
-
-    - draw_from_personal_experiences: Does the response use parables and
-      day-to-day life stories as examples, helping the user relate better
-      to the advice?
-
-    The critique should be specific and prescriptive: what to keep, what to
-    cut, what's missing. Phrase it as you would to a writer revising a draft."""
-
-    user_question: str = dspy.InputField()
-    response: str = dspy.InputField()
-    sources_cited: list[str] = dspy.InputField()
-
-    advaita_coherence: float = dspy.OutputField(desc="0.0 to 1.0")
-    two_truths_discipline: float = dspy.OutputField(desc="0.0 to 1.0")
-    empathy_without_dissolving: float = dspy.OutputField(desc="0.0 to 1.0")
-    wit_calibration: float = dspy.OutputField(desc="0.0 to 1.0")
-    source_integration: float = dspy.OutputField(desc="0.0 to 1.0")
-    practical_offering: float = dspy.OutputField(desc="0.0 to 1.0")
-    draw_from_personal_experiences: float = dspy.OutputField(desc="0.0 to 1.0")
-    critique: str = dspy.OutputField(
-        desc="3-6 sentences of prescriptive feedback for revising the response."
-    )
-
-
-# Lazily-instantiated judge. Call configure_judge() to use a stronger LM (e.g. gpt-4o)
-# during GEPA optimization so the reflection LM gets high-quality signal to work from.
-_judge = None
-_judge_lm = None  # None means use the globally-configured LM (task LM)
-
-
-def configure_judge(lm) -> None:
-    """Set the LM used by judge_score. Call before GEPA to use gpt-4o instead of the task LM."""
-    global _judge_lm, _judge
-    _judge_lm = lm
-    _judge = None  # reset so next call recreates with new context
-
-
-def _get_judge():
-    global _judge
-    if _judge is None:
-        _judge = dspy.ChainOfThought(JudgeAdvice)
-    return _judge
-
-
-def judge_score(user_question: str, pred: dspy.Prediction) -> tuple[float, dict, str]:
-    judge = _get_judge()
-    try:
-        call_kwargs = dict(
-            user_question=user_question,
-            response=getattr(pred, "response", "") or "",
-            sources_cited=getattr(pred, "sources_cited", []) or [],
-        )
-        if _judge_lm is not None:
-            with dspy.context(lm=_judge_lm):
-                j = judge(**call_kwargs)
-        else:
-            j = judge(**call_kwargs)
-    except Exception as e:
-        # If the judge fails (parse error, LM hiccup), fall back gracefully.
-        return 0.5, {"judge_error": str(e)}, f"Judge failed: {e}"
-
-    rubric = {
-        "advaita_coherence": float(j.advaita_coherence or 0.0),
-        "two_truths_discipline": float(j.two_truths_discipline or 0.0),
-        "empathy_without_dissolving": float(j.empathy_without_dissolving or 0.0),
-        "wit_calibration": float(j.wit_calibration or 0.0),
-        "source_integration": float(j.source_integration or 0.0),
-        "practical_offering": float(j.practical_offering or 0.0),
-        "draw_from_personal_experiences": float(j.draw_from_personal_experiences or 0.0),
-    }
-    weights = {
-        "advaita_coherence": 0.25,
-        "two_truths_discipline": 0.20,
-        "empathy_without_dissolving": 0.20,
-        "wit_calibration": 0.10,
-        "source_integration": 0.10,
-        "practical_offering": 0.10,
-        "draw_from_personal_experiences": 0.05,
-    }
-    score = sum(rubric[k] * weights[k] for k in rubric)
-    score = max(0.0, min(1.0, score))
-    return score, rubric, j.critique or ""
-
-
-# ──────────────────────────── Composite GEPA metric ────────────────────────────
-RULE_WEIGHT = 0.45
-JUDGE_WEIGHT = 0.55
-
-
-def _format_feedback(rule_breakdown: dict, judge_rubric: dict, critique: str) -> str:
-    """Concatenate rule-based facts and judge critique into one feedback string
-    that the GEPA reflection LM can read and use to rewrite prompts."""
-    lines = ["FEEDBACK FOR PROMPT IMPROVEMENT", ""]
-
-    lines.append("Rule-based observations:")
-    comps = rule_breakdown.get("components", {})
-    for k, v in comps.items():
-        lines.append(f"  - {k}: {v}")
-    if rule_breakdown.get("therapy_cliches_found"):
-        lines.append(f"  - Therapy clichés to remove: {rule_breakdown['therapy_cliches_found']}")
-    if rule_breakdown.get("ungrounded_citations"):
-        lines.append(
-            f"  - Citations that weren't in retrieved passages (likely hallucinated): "
-            f"{rule_breakdown['ungrounded_citations']}"
-        )
-    if not rule_breakdown.get("nondual_markers_found"):
-        lines.append("  - Response lacks explicit Advaita register; consider invoking "
-                     "concepts like sākṣī, adhyāsa, the two truths, etc.")
-    if not rule_breakdown.get("actionable_hits"):
-        lines.append("  - No concrete practice or this-week shift was offered.")
-    tier_counts = rule_breakdown.get("tier_counts", {})
-    if tier_counts:
-        lines.append(f"  - Selected passage tiers: {tier_counts} "
-                     f"(prefer primary + śaṅkara when both options exist).")
-
-    lines.append("")
-    lines.append("Rubric scores from Advaita-tradition examiner:")
-    for k, v in judge_rubric.items():
-        if isinstance(v, float):
-            lines.append(f"  - {k}: {v:.2f}")
-    lines.append("")
-    lines.append("Examiner critique:")
-    lines.append(critique.strip() or "(no critique returned)")
-    return "\n".join(lines)
-
-
-def gita_metric(
-    gold: dspy.Example,
-    pred: dspy.Prediction,
-    trace: Any = None,
-    pred_name: str | None = None,
-    pred_trace: Any = None,
-) -> dspy.Prediction:
-    """The GEPA-compatible metric.
-
-    Returns dspy.Prediction(score=..., feedback=...). The feedback string is
-    what GEPA's reflection LM ingests when rewriting prompts."""
-    user_q = getattr(gold, "user_question", "") if gold else ""
-
-    rule_score, rule_breakdown = rule_based_score(pred)
-    j_score, j_rubric, critique = judge_score(user_q, pred)
-
-    composite = RULE_WEIGHT * rule_score + JUDGE_WEIGHT * j_score
-    feedback = _format_feedback(rule_breakdown, j_rubric, critique)
-
-    return dspy.Prediction(score=composite, feedback=feedback)
-
-
-def quick_eval_score(
-    gold: dspy.Example,
-    pred: dspy.Prediction,
-    trace: Any = None,
-) -> float:
-    """A pure-float metric for `dspy.Evaluate` — same composite, no feedback."""
-    out = gita_metric(gold, pred, trace=trace)
-    return float(out.score)
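The citation canonicalization in the deleted metrics.py carries most of the weight of the grounding check, so it is worth seeing run. This sketch reproduces the `_canonicalize_ref` logic standalone (same regex, same fallback); behavior for non-Gītā works is simply lowercase-plus-dots, which is why those are matched by normalized string rather than a canonical form.

```python
import re

def canonicalize_ref(s: str) -> str:
    # Collapse internal whitespace first so "Gita   2.47" behaves like "Gita 2.47"
    s = re.sub(r"\s+", " ", s.strip())
    # Gītā variants all reduce to "BG <chap>.<verse>"
    m = re.match(r"^(?:BG|Bhagavad\s*G[īi]t[āa]|G[īi]t[āa])\s*(\d+)[\.:](\d+)", s, re.I)
    if m:
        return f"BG {int(m.group(1))}.{int(m.group(2))}"
    # Other works: lowercase, colons become dots
    return s.lower().replace(":", ".")
```

Because both the cited refs and the retrieved `verse_ref` strings pass through the same function, "Bhagavad Gita 2:47" in a response grounds against a retrieved "BG 2.47" exactly, which is what makes the set-membership test in `_citation_grounding` work.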
 
optimize_gepa.py DELETED
@@ -1,200 +0,0 @@
- """
- optimize_gepa.py — run GEPA reflective prompt evolution.
-
- GEPA (Genetic-Pareto) treats the program's prompts as an evolving population.
- At each step it:
-   1. Runs the current candidate(s) on a minibatch of training examples
-   2. Collects the (score, feedback) pairs from our metric
-   3. Asks a *reflection LM* to read the failures + feedback and propose a
-      mutated prompt
-   4. Evaluates the mutant; keeps it if it Pareto-dominates the parent on the
-      validation set
-   5. Repeats
-
- Because we wrote `gita_metric` to return rich textual feedback, the reflection
- LM has something substantive to chew on instead of just gradient signal.
-
- The dataset has no gold labels — that's deliberate. Our metric judges the
- prediction directly. This is the regime GEPA is designed for.
-
- Usage:
-     python optimize_gepa.py --auto medium
-     python optimize_gepa.py --max-metric-calls 300 --proxy-task-lm
-     python optimize_gepa.py --auto light --proxy-task-lm   # ~2-3 hrs vs 260 hrs
-
- Proxy task LM (--proxy-task-lm):
-     Runs GEPA with gpt-4o-mini as the task LM instead of Gemma 4. GEPA only
-     needs to evaluate prompt quality — it doesn't need the final inference model.
-     Optimized prompts are model-agnostic text and transfer back to Gemma 4 when
-     the saved program is loaded at inference time. ~20x speedup over Gemma with
-     thinking enabled.
- """
-
- from __future__ import annotations
- import argparse
- import json
- import random
- from pathlib import Path
-
- import dspy
- from dspy import GEPA
-
- import config
- from advisor import GitaAdvisor
- from dataset_generator import load_jsonl, to_dspy_examples
- import metrics as metrics_module
- from metrics import gita_metric, quick_eval_score
-
-
- def split(examples, val_frac: float, seed: int = 42):
-     rng = random.Random(seed)
-     shuffled = examples[:]
-     rng.shuffle(shuffled)
-     n_val = max(20, int(len(shuffled) * val_frac))
-     return shuffled[n_val:], shuffled[:n_val]
-
-
- def main():
-     ap = argparse.ArgumentParser()
-     ap.add_argument("--dataset", default=str(config.DATASET_PATH))
-     ap.add_argument("--out", default=str(config.OPTIMIZED_PROGRAM_PATH))
-     ap.add_argument("--val-frac", type=float, default=0.2)
-     ap.add_argument(
-         "--auto",
-         choices=["light", "medium", "heavy"],
-         default="medium",
-         help="GEPA's auto-budget mode. 'light' for smoke-tests, 'medium' for "
-              "a real run, 'heavy' for an overnight run on a meaty box.",
-     )
-     ap.add_argument(
-         "--max-metric-calls",
-         type=int,
-         default=None,
-         help="Override --auto with an explicit metric-call budget.",
-     )
-     ap.add_argument("--track-stats", action="store_true", default=True)
-     ap.add_argument("--seed", type=int, default=42)
-     ap.add_argument(
-         "--proxy-task-lm",
-         action="store_true",
-         default=False,
-         help="Use gpt-4o-mini as the task LM during GEPA instead of Gemma 4. "
-              "~20x faster; optimized prompts transfer back to Gemma 4 at inference. "
-              "Requires OPENAI_API_KEY.",
-     )
-     args = ap.parse_args()
-
-     # Configure DSPy globally and grab the reflection LM
-     task_lm, reflection_lm = config.configure_dspy()
-
-     if args.proxy_task_lm:
-         # Override the task LM with gpt-4o-mini for the duration of this process.
-         # DSPy saves only prompt text (instructions + field descriptions), not the
-         # LM choice — so the optimized JSON loads cleanly onto Gemma 4 at inference.
-         task_lm = dspy.LM(model=config.PROXY_TASK_MODEL, **config.PROXY_TASK_LM_KWARGS)
-         dspy.configure(lm=task_lm, adapter=dspy.ChatAdapter(use_json_adapter_fallback=False))
-         print(f"Task LM (proxy): {task_lm.model} [GEPA optimization only]")
-     else:
-         print(f"Task LM: {task_lm.model}")
-     print(f"Reflection LM: {reflection_lm.model}")
-
-     # Use the reflection LM (gpt-4o) for judging instead of the task LM (Gemma).
-     # Gemma judging its own responses produces noisy, self-congratulatory scores;
-     # gpt-4o gives the reflection step the crisp, tradition-aware feedback it needs.
-     metrics_module.configure_judge(reflection_lm)
-     print(f"Judge LM: {reflection_lm.model} (overriding task LM for judging)")
-
-     # Dataset
-     raw = load_jsonl(Path(args.dataset))
-     examples = to_dspy_examples(raw)
-     if len(examples) < 40:
-         print(f"[warn] Only {len(examples)} examples — generate more with "
-               f"`python dataset_generator.py --n 500`.")
-     train, val = split(examples, args.val_frac, seed=args.seed)
-     print(f"Train: {len(train)}  Val: {len(val)}")
-
-     # Student program
-     student = GitaAdvisor()
-
-     # More threads when hitting an API (no local GPU bottleneck).
-     num_threads = 16 if args.proxy_task_lm or config.TASK_LM_BACKEND == "gemini" else 4
-
-     # Optional: get a baseline number for context
-     print("\nEvaluating baseline (un-optimized) on validation set ...")
-     evaluator = dspy.Evaluate(
-         devset=val,
-         metric=quick_eval_score,
-         num_threads=num_threads,
-         display_progress=True,
-         display_table=0,
-     )
-     try:
-         baseline_result = evaluator(student)
-         baseline_score = float(baseline_result) if hasattr(baseline_result, "__float__") else baseline_result
-         print(f"Baseline score: {baseline_score}")
-     except Exception as e:
-         print(f"Baseline eval failed (continuing to optimization): {e}")
-
-     # GEPA
-     log_dir = str(config.ARTIFACTS_DIR / "gepa_logs")
-     gepa_kwargs = dict(
-         metric=gita_metric,
-         reflection_lm=reflection_lm,
-         track_stats=args.track_stats,
-         seed=args.seed,
-         # Show 6 training examples to the reflection LM per proposal step instead of
-         # the default 3 — our 12 domains need diversity to avoid domain-specific over-fit.
-         reflection_minibatch_size=6,
-         # API-backed runs (proxy or Gemini) can saturate many threads; local GPU is
-         # limited to 4 to avoid OOM / serialization on a single device.
-         num_threads=num_threads,
-         # When the task LM mangles a list field the reflection LM should know the format
-         # broke, not just see a low score with no explanation.
-         add_format_failure_as_feedback=True,
-         # Persist per-step scores and prompts for post-run inspection.
-         log_dir=log_dir,
-     )
-     if args.max_metric_calls is not None:
-         gepa_kwargs["max_metric_calls"] = args.max_metric_calls
-     else:
-         gepa_kwargs["auto"] = args.auto
-
-     print(f"\nStarting GEPA with {gepa_kwargs} ...")
-     optimizer = GEPA(**gepa_kwargs)
-
-     optimized = optimizer.compile(
-         student=student,
-         trainset=train,
-         valset=val,
-     )
-
-     # Save
-     out_path = Path(args.out)
-     out_path.parent.mkdir(parents=True, exist_ok=True)
-     optimized.save(str(out_path))
-     print(f"\nSaved optimized program to {out_path}")
-
-     # Side-by-side eval
-     print("\nFinal eval on validation set ...")
-     final_result = evaluator(optimized)
-     final_score = float(final_result) if hasattr(final_result, "__float__") else final_result
-     print(f"Optimized score: {final_score}")
-
-     # Dump the optimized prompts for human inspection
-     inspect_path = out_path.with_suffix(".prompts.txt")
-     with inspect_path.open("w", encoding="utf-8") as f:
-         f.write("# Optimized prompts after GEPA\n\n")
-         for name, predictor in optimized.named_predictors():
-             sig = predictor.signature
-             f.write(f"## {name}\n")
-             f.write(f"### instructions\n{sig.instructions}\n\n")
-             f.write("### fields\n")
-             for fname, field in sig.fields.items():
-                 desc = getattr(field.json_schema_extra, "get", lambda *_: "")("desc", "") \
-                     if hasattr(field, "json_schema_extra") else ""
-                 f.write(f"- {fname}: {desc}\n")
-             f.write("\n---\n\n")
-     print(f"Wrote prompt inspection file to {inspect_path}")
-
-
- if __name__ == "__main__":
-     main()
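
The `split` helper deleted above has one behavior worth noting: it never holds out fewer than 20 validation examples, even when `val_frac * N` would be smaller. A minimal standalone sketch (same logic, no project dependencies):

```python
import random


def split(examples, val_frac: float, seed: int = 42):
    # Shuffle a copy deterministically, then hold out
    # max(20, val_frac * N) items for validation.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = max(20, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


train, val = split(list(range(100)), val_frac=0.2)   # 80 / 20
train_small, val_small = split(list(range(50)), val_frac=0.2)
print(len(train), len(val), len(train_small), len(val_small))
```

With 50 examples, `int(50 * 0.2)` is 10, so the 20-example floor kicks in and the train split shrinks to 30 — which is why the script warns when the dataset has fewer than 40 examples.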
parsers/gita_json.py DELETED
@@ -1,236 +0,0 @@
- """
- parsers/gita_json.py — turn the gita/gita verse-indexed JSON into Verse records.
-
- The gita/gita repo (Unlicense, public-domain dedication) gives us four files
- on the static mirror:
-
-     chapters.json     — chapter metadata (number, name, summary)
-     verse.json        — per-verse Sanskrit + transliteration + word_meanings
-     translation.json  — per-verse English translations keyed by author_id
-     authors.json      — author metadata for the translations
-
- Why split parsing across multiple sources_registry entries
- ----------------------------------------------------------
- We register `gita_json_core` (the verse text) and `gita_json_translations`
- (the English translations) as separate sources. Both happen to feed this one
- parser. The reason for the split is that translations come and go from the
- upstream repo whereas the core verse data is essentially fixed; isolating
- them lets us pin only what we need.
-
- Translator allowlist
- --------------------
- Not every translator in the gita/gita translation.json is public-domain.
- We hard-allowlist the ones we know are safe to redistribute. Anyone not on
- the list is silently skipped — adding more is a one-line change.
- """
-
- from __future__ import annotations
- import json
- from pathlib import Path
- from typing import Iterable
-
- from corpus import Verse
-
-
- # ──────────────────────────── Translator allowlist ────────────────────────────
- # The keys are the author_id values used inside translation.json. The values
- # are display strings + the year we want to use for attribution.
- #
- # Why this list and not just "all translations":
- #   - Some translators in the upstream repo (e.g. ISKCON Prabhupada) have
- #     active publisher rights that we shouldn't rely on regardless of how the
- #     upstream chose to license its compilation.
- #   - Reducing translation count keeps the index lean. Three voices are plenty.
- #
- # If you want to add a translator, verify their public-domain status (death
- # year + 70 in most jurisdictions, or pre-1929 publication for US PD), then
- # add a row.
- ALLOWED_TRANSLATORS: dict[str, tuple[str, int | None]] = {
-     # Swami Sivananda — d. 1963 — works are widely shared by The Divine Life
-     # Society in keeping with their founder's non-commercial stance.
-     "sivananda": ("Swami Sivananda", 1969),
-
-     # Swami Tejomayananda — modern; included only because some mirrors
-     # release these under permissive terms; double-check before relying on it.
-     # Disabled by default to be conservative.
-     # "tejomayananda": ("Swami Tejomayananda", 1995),
-
-     # Dr. S. Sankaranarayan — translation of Śaṅkara's Gītā Bhāṣya included
-     # in some forks of gita/gita; verify the specific edition. Off by default.
-     # "shankara": ("Śaṅkara (tr. Sankaranarayan)", 1990),
-
-     # The verse text itself is not a "translation" per se but a copy of the
-     # critical text plus transliteration. We include it under the synthetic
-     # author key 'sanskrit'.
-     "sanskrit": ("Sanskrit text + IAST", None),
- }
-
-
- # ──────────────────────────── Helpers ────────────────────────────
- def _verse_id(chapter: int, verse_no: int) -> str:
-     """Stable global key. Format: bhagavad_gita_<chap>_<verse>, zero-padded
-     to two digits so 1.10 sorts after 1.9 and lexical ordering matches numeric."""
-     return f"bhagavad_gita_{chapter:02d}_{verse_no:02d}"
-
-
- def _verse_ref(chapter: int, verse_no: int) -> str:
-     """Citation form used by the advisor in its replies."""
-     return f"BG {chapter}.{verse_no}"
-
-
- def _section_display(chapter_meta: dict) -> str:
-     name = chapter_meta.get("name_translation") or chapter_meta.get("name", "")
-     return f"Chapter {chapter_meta.get('chapter_number', '?')}: {name}"
-
-
- # ──────────────────────────── Parser entry point ────────────────────────────
- def parse(raw_dir_for_core: Path, raw_dir_for_translations: Path | None = None) -> Iterable[Verse]:
-     """Walk the gita/gita JSON files and yield Verse records.
-
-     Layout expected (after download_sources.py has run):
-         raw_dir_for_core/chapters.json
-         raw_dir_for_core/verse.json
-     [optionally]
-         raw_dir_for_translations/translation.json
-         raw_dir_for_translations/authors.json
-
-     If translations are not present, we still emit Verses with sanskrit +
-     transliteration + word_meanings; the `translation` field falls back to
-     the word-meanings (or transliteration) so the verse isn't content-empty.
-     (Better: enable the gita_json_translations source.)
-     """
-     chapters = _load(raw_dir_for_core / "chapters.json")
-     verses_raw = _load(raw_dir_for_core / "verse.json")
-
-     chapters_by_id = {c["chapter_number"]: c for c in chapters}
-
-     translations_by_verse: dict[int, dict[str, str]] = {}
-     authors_by_id: dict[str, str] = {}
-     if raw_dir_for_translations is not None:
-         translations_by_verse = _load_translations(raw_dir_for_translations / "translation.json")
-         authors_by_id = _load_authors(raw_dir_for_translations / "authors.json")
-
-     # Pick the best available translator from the allowlist, in priority order.
-     # First match wins. This keeps the index from carrying redundant English
-     # translations of the same verse.
-     translator_priority = ["sivananda", "sanskrit"]
-
-     for v in verses_raw:
-         chap_no = v["chapter_number"]
-         verse_no = v["verse_number"]
-         chap_meta = chapters_by_id.get(chap_no, {})
-         verse_id = _verse_id(chap_no, verse_no)
-
-         # Sanskrit text comes from the core file. The 'text' field has it
-         # in Devanāgarī, often with a trailing newline and verse number.
-         sanskrit = (v.get("text") or "").strip()
-         translit = (v.get("transliteration") or "").strip()
-         word_mean = (v.get("word_meanings") or "").strip()
-
-         # Try to attach an English translation
-         english = ""
-         translator_label = ""
-         v_translations = translations_by_verse.get(v.get("id") or v.get("externalId") or -1, {})
-         for key in translator_priority:
-             text = v_translations.get(key) or _translation_for(v_translations, key)
-             if text:
-                 english = text.strip()
-                 meta = ALLOWED_TRANSLATORS.get(key)
-                 if meta:
-                     translator_label = meta[0]
-                 break
-
-         # Fallback: if no English translation, use word-meanings as a substitute
-         # so the verse isn't content-empty. Better than nothing for retrieval,
-         # though enrichment will be poorer.
-         if not english:
-             english = word_mean or translit
-
-         yield Verse(
-             verse_id=verse_id,
-             work="bhagavad_gita",
-             work_display="Bhagavad Gītā",
-             verse_ref=_verse_ref(chap_no, verse_no),
-             tier="primary",
-             section=f"chapter_{chap_no:02d}",
-             section_display=_section_display(chap_meta),
-             translation=english,
-             translator=translator_label,
-             sanskrit=sanskrit,
-             transliteration=translit,
-             word_meanings=word_mean,
-             bhashya="",  # Gītā Bhāṣya is brought in by the Sastry parser
-             bhashya_translator="",
-             source_key="gita_json_core",
-             license="unlicense",
-         )
-
-
- # ──────────────────────────── Internals ────────────────────────────
- def _load(path: Path):
-     with path.open(encoding="utf-8") as f:
-         return json.load(f)
-
-
- def _load_translations(path: Path) -> dict[int, dict[str, str]]:
-     """The translations file has one entry per (verse, author). Group them
-     by verse_id into a {verse_id: {author_id: text}} map.
-
-     Schema seen in the wild varies slightly between forks of gita/gita; we
-     cope by trying a few key names. If parsing fails entirely we return {}
-     and proceed without translations rather than blowing up the whole ingest.
-     """
-     if not path.exists():
-         return {}
-     try:
-         raw = _load(path)
-     except Exception as e:
-         print(f"[gita_json] failed to load translations: {e}")
-         return {}
-
-     out: dict[int, dict[str, str]] = {}
-     for row in raw:
-         vid = row.get("verse_id") or row.get("verseNumber") or row.get("verse_number_id") or row.get("id")
-         text = row.get("description") or row.get("text") or row.get("translation")
-         if vid is None or not text:
-             continue
-
-         # Skip non-English rows (Ramsukhdas Hindi etc.)
-         lang = (row.get("lang") or "").lower()
-         if lang and lang not in ("english", "en"):
-             continue
-
-         # Map the authorName (e.g. "Swami Sivananda") to an allowlist key
-         # ("sivananda") via case-insensitive substring matching. The numeric
-         # author_id field alone can't match the allowlist, which is why we
-         # prefer authorName here.
-         name_str = str(row.get("authorName") or row.get("author_id") or row.get("author") or "").strip()
-         matched_key = next(
-             (k for k in ALLOWED_TRANSLATORS if k.lower() in name_str.lower()),
-             None,
-         )
-         if matched_key is None:
-             continue
-         out.setdefault(int(vid), {})[matched_key] = text
-     return out
-
-
- def _load_authors(path: Path) -> dict[str, str]:
-     if not path.exists():
-         return {}
-     try:
-         raw = _load(path)
-     except Exception:
-         return {}
-     return {row.get("id"): row.get("name", "") for row in raw if row.get("id")}
-
-
- def _translation_for(v_translations: dict, author_key: str) -> str | None:
-     """Tolerant lookup: some files use 'sivananda', some 'Sivananda', etc."""
-     if author_key in v_translations:
-         return v_translations[author_key]
-     lk = author_key.lower()
-     for k, val in v_translations.items():
-         if str(k).lower() == lk:
-             return val
-     return None
parsers/sastry_archive.py DELETED
@@ -1,249 +0,0 @@
- """
- parsers/sastry_archive.py — extract verse-attached Śaṅkara bhāṣya from
- Alladi Mahadeva Sastry's 1897 archive.org OCR text.
-
- What makes this harder than the gita_json parser
- ------------------------------------------------
- The gita/gita JSON gave us each verse already keyed by chapter and verse
- number. The Sastry archive.org file is OCR'd plain text — about 20 MB of
- running prose where the only structural cues are:
-
-   1. Chapter headings, formatted in caps like "SANKHYA YOGA." or
-      "CHAPTER II — SANKHYA YOGA"
-   2. Verse markers, which appear in two forms in the OCR:
-        - inline as "(II. 47.)" or "II. 47." after a translated verse
-        - as section headings like "47." or "Verse 47." preceding the bhāṣya
-   3. The rule that when a translated verse appears, Śaṅkara's commentary
-      follows immediately until the next verse marker.
-
- Add to that: OCR noise. "II" can become "11", "47" can become "4 7", periods
- become commas, glyphs get dropped. So the parser is forgiving — it tries
- several patterns and falls back gracefully.
-
- What we extract
- ---------------
- For each verse we find, we yield a Verse with:
-   - tier='shankara'
-   - work='bhagavad_gita_bhashya' (kept distinct from 'bhagavad_gita' so
-     the joiner in ingest_corpus.py knows to merge bhashya into the gita
-     verses by verse_ref)
-   - translation = the verse text as Sastry rendered it (handy as a second
-     English voice alongside Sivananda)
-   - bhashya = Śaṅkara's commentary, as Sastry translated it
-   - bhashya_translator = 'Alladi Mahadeva Sastry, 1897'
-
- Robustness strategy
- -------------------
- We don't try to be perfect. If a verse's bhāṣya is mis-attributed by ±1, the
- downstream enrichment step will produce paraphrases that don't quite fit, and
- we'll catch those during the spot-check pass on enriched output. The metric
- will also penalize ungrounded citations. The key invariant is: never silently
- emit a wrong (verse_id, bhashya) pair if we're uncertain — better to skip.
- """
-
- from __future__ import annotations
- import re
- from pathlib import Path
- from typing import Iterable
-
- from corpus import Verse
-
-
- # ──────────────────────────── Patterns ────────────────────────────
- # Roman numerals (allowing OCR substitutions: I↔1, I↔l, etc.)
- ROMAN = r"(?:[IVX1l]+|[ivx]+)"
-
- # A "verse marker" looks like "II. 47" or "(II. 47.)" or "47" alone in a section
- # heading. We try several shapes and let the most specific win.
- VERSE_INLINE = re.compile(
-     r"\(?\s*(?P<chap>" + ROMAN + r")\s*[\.\,]\s*(?P<verse>\d{1,3})\s*[\.\,]?\s*\)?",
-     re.IGNORECASE,
- )
-
- # Chapter heading: "CHAPTER II" or "II. SANKHYA YOGA" — uppercase-heavy lines
- CHAPTER_HEADING = re.compile(
-     r"^\s*(?:CHAPTER\s+)?(?P<roman>" + ROMAN + r")\.?\s+[A-Z][A-Z \-—]{4,}",
-     re.MULTILINE,
- )
-
- # Roman → arabic
- ROMAN_MAP = {
-     "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7, "VIII": 8,
-     "IX": 9, "X": 10, "XI": 11, "XII": 12, "XIII": 13, "XIV": 14, "XV": 15,
-     "XVI": 16, "XVII": 17, "XVIII": 18,
- }
-
-
- def _to_arabic(token: str) -> int | None:
-     """Convert a possibly-noisy roman numeral to an int. OCR sometimes turns
-     'I' into '1' and 'II' into '11', so we accept both forms."""
-     t = token.upper().replace("L", "I").replace("0", "O")  # OCR substitutions
-     if t in ROMAN_MAP:
-         return ROMAN_MAP[t]
-     # Pure-arabic fallback (e.g. OCR rendered 'II' as '11')
-     if t.isdigit():
-         n = int(t)
-         if 1 <= n <= 18:
-             return n
-     return None
-
-
- # ──────────────────────────── Main parse ────────────────────────────
- def parse(raw_dir: Path) -> Iterable[Verse]:
-     """Walk the Sastry archive.org text in raw_dir and yield Verse records.
-
-     Expected layout (after download_sources.py):
-         raw_dir/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya_djvu.txt
-
-     The file is ~20 MB of OCR text. We stream it line-by-line, maintain the
-     current chapter as we encounter chapter headings, and at each verse marker
-     yield the accumulated text since the previous marker as the bhāṣya.
-     """
-     txts = list(raw_dir.glob("*_djvu.txt")) + list(raw_dir.glob("*.txt"))
-     if not txts:
-         print(f"[sastry] no .txt under {raw_dir}; did you run download_sources.py?")
-         return
-
-     text = txts[0].read_text(encoding="utf-8", errors="replace")
-     text = _denoise(text)
-
-     # First pass: find every chapter heading and verse marker with its
-     # position, and collect them as events in document order.
-     current_chapter = 1
-
-     # Walk chapter headings and verse markers together via merged iteration
-     events = []
-     for m in CHAPTER_HEADING.finditer(text):
-         c = _to_arabic(m.group("roman"))
-         if c is not None:
-             events.append(("chapter", m.start(), c))
-
-     for m in VERSE_INLINE.finditer(text):
-         c = _to_arabic(m.group("chap"))
-         try:
-             v = int(m.group("verse"))
-         except (ValueError, TypeError):
-             continue
-         if c is None or not (1 <= v <= 80):
-             continue
-         events.append(("verse", m.start(), c, v, m.end(), m.start("verse")))
-
-     events.sort(key=lambda e: e[1])
-
-     # Second pass: build (chapter, verse) → (start, end) spans, where each
-     # span is the bhāṣya from one marker to the next. We yield in document
-     # order with the chapter from the most recent chapter heading we saw.
-     last_marker_pos: int | None = None
-     last_chap: int | None = None
-     last_verse: int | None = None
-
-     for ev in events:
-         if ev[0] == "chapter":
-             current_chapter = ev[2]
-             continue
-         # ev: ("verse", start, chap, verse, end, verse_pos)
-         _, start, chap, verse, end, verse_pos = ev
-
-         # Only treat markers where the verse NUMBER appears near the start of
-         # its line — those are actual section headings. Inline cross-references
-         # like "(II. 47.)" mid-paragraph have the verse number well into the
-         # line and must not be treated as section boundaries.
-         verse_line_start = text.rfind("\n", 0, verse_pos) + 1
-         on_own_line = (verse_pos - verse_line_start) <= 8
-         if not on_own_line:
-             continue
-         current_chapter = chap
-
-         if last_marker_pos is not None and last_chap is not None and last_verse is not None:
-             bhashya_text = text[last_marker_pos:start].strip()
-             if bhashya_text:
-                 yield _build_verse(
-                     chap=last_chap, verse=last_verse, body=bhashya_text,
-                 )
-
-         last_marker_pos = end
-         last_chap = current_chapter
-         last_verse = verse
-
-     # Flush the trailing one
-     if last_marker_pos is not None and last_chap and last_verse:
-         tail = text[last_marker_pos:].strip()
-         if tail:
-             yield _build_verse(chap=last_chap, verse=last_verse, body=tail)
-
-
- # ──────────────────────────── Builders ────────────────────────────
- def _build_verse(chap: int, verse: int, body: str) -> Verse:
-     """The body lump contains both Sastry's English of the verse and Śaṅkara's
-     commentary, usually with the verse first (sometimes labeled) and the
-     commentary following. We make a *light* split heuristic: if the first
-     paragraph is short (≤ 400 chars) and ends near a period, treat it as the
-     verse translation; the rest is bhashya. If we can't split confidently,
-     we put everything into bhashya and leave translation empty — the gita_json
-     parser already gave us a translation by another translator."""
-     body = body.strip()
-     translation = ""
-     bhashya = body
-
-     # Heuristic split on the first blank-ish line within reasonable distance
-     para_break = re.search(r"\n\s*\n", body[:600])
-     if para_break and para_break.end() < 500:
-         head = body[:para_break.start()].strip()
-         tail = body[para_break.end():].strip()
-         # Accept the split only if the head looks like a verse: short-ish,
-         # not starting with a typical-bhashya opener like "This means" /
-         # "The meaning is" / "Here the Lord says".
-         if 30 < len(head) < 400 and not _looks_like_bhashya_opener(head):
-             translation, bhashya = head, tail
-
-     return Verse(
-         verse_id=f"bhagavad_gita_{chap:02d}_{verse:02d}",
-         work="bhagavad_gita_bhashya",
-         work_display="Bhagavad Gītā with Śaṅkara's Bhāṣya",
-         verse_ref=f"BG {chap}.{verse}",
-         tier="shankara",
-         section=f"chapter_{chap:02d}",
-         section_display=f"Chapter {chap}",
-         translation=translation,
-         translator="Alladi Mahadeva Sastry" if translation else "",
-         bhashya=bhashya,
-         bhashya_translator="Alladi Mahadeva Sastry, 1897",
-         source_key="sastry_gita_bhashya",
-         license="public_domain",
-     )
-
-
- def _looks_like_bhashya_opener(s: str) -> bool:
-     s = s.strip().lower()
-     openers = (
-         "this means", "the meaning is", "the sense is", "here the lord",
-         "here it is said", "the lord says", "the question may", "objection",
-         "the commentator",
-     )
-     return any(s.startswith(o) for o in openers)
-
-
- # ──────────────────────────── OCR de-noise ────────────────────────────
- def _denoise(text: str) -> str:
-     """Light cleanup. Aggressive normalization risks losing real signal —
-     we only fix patterns we're confident about."""
-     # OCR losses of Sanskrit diacritics won't matter for English-language
-     # retrieval, so we leave Sanskrit fragments alone.
-
-     # Collapse runs of repeated punctuation that OCR hallucinated
-     text = re.sub(r"\.{3,}", ".", text)
-     text = re.sub(r" +\.", ".", text)
-
-     # Glue cross-line hyphens: "lib-\nerty" → "liberty"
-     text = re.sub(r"-\n([a-z])", r"\1", text)
-
-     # Normalize whitespace
-     text = re.sub(r"[ \t]+", " ", text)
-     text = re.sub(r"\n[ \t]+", "\n", text)
-     text = re.sub(r"\n{3,}", "\n\n", text)
-
-     return text
run_overnight.py DELETED
@@ -1,230 +0,0 @@
- """
- run_overnight.py — orchestrates a full GEPA optimization through light → medium,
- then saves prompts and runs a multi-question test suite.
-
- Usage:
-     python run_overnight.py [--skip-light] [--skip-medium]
-
- Writes a timestamped log to artifacts/overnight_run.log.
- """
- from __future__ import annotations
- import argparse
- import subprocess
- import sys
- import time
- from datetime import datetime
- from pathlib import Path
- import json
-
- ROOT = Path(__file__).parent.resolve()
- LOG_PATH = ROOT / "artifacts" / "overnight_run.log"
- OPTIMIZED_PATH = ROOT / "artifacts" / "optimized_advisor.json"
- PROMPTS_PATH = ROOT / "artifacts" / "optimized_advisor.prompts.txt"
- RESULTS_PATH = ROOT / "artifacts" / "test_results.json"
-
- TEST_QUESTIONS = [
-     "I just got laid off and feel like nothing matters anymore.",
-     "I keep procrastinating on important work and feel guilty about it. How do I stop?",
-     "My relationship ended and I feel like I've lost my identity. Who am I without this person?",
-     "I'm terrified of death and can't stop thinking about it at night.",
-     "I have achieved everything I wanted — career, family, money — and still feel empty.",
-     "I feel angry at everyone around me but don't know why. How should I deal with this?",
-     "I can't stop comparing myself to others and feeling like I'm always falling short.",
- ]
-
-
- def ts() -> str:
-     return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
-
-
- def log(msg: str, f=None):
-     line = f"[{ts()}] {msg}"
-     print(line, flush=True)
-     if f:
-         f.write(line + "\n")
-         f.flush()
-
-
- def run_phase(cmd: list[str], phase: str, logfile) -> bool:
-     log(f"=== STARTING {phase} ===", logfile)
-     log(f"Command: {' '.join(cmd)}", logfile)
-     start = time.time()
-     try:
-         proc = subprocess.Popen(
-             cmd,
-             stdout=subprocess.PIPE,
-             stderr=subprocess.STDOUT,
-             text=True,
-             cwd=str(ROOT),
-         )
-         for line in proc.stdout:
-             logfile.write(line)
-             logfile.flush()
-             # Echo key lines to terminal
-             if any(k in line for k in ["score", "GEPA", "Step", "ERROR", "Saved", "Train:", "Val:", "Baseline"]):
-                 print(line, end="", flush=True)
-         proc.wait()
-         elapsed = time.time() - start
-         if proc.returncode == 0:
-             log(f"=== {phase} COMPLETED in {elapsed/60:.1f} min ===", logfile)
-             return True
-         else:
-             log(f"=== {phase} FAILED (exit {proc.returncode}) after {elapsed/60:.1f} min ===", logfile)
-             return False
-     except Exception as e:
-         log(f"=== {phase} ERROR: {e} ===", logfile)
-         return False
-
-
- def run_test_suite(logfile) -> dict:
-     log("=== STARTING TEST SUITE ===", logfile)
-     sys.path.insert(0, str(ROOT))
-
-     import config
-     from advisor import load_optimized
-     from metrics import gita_metric
-     import dspy
-     from concurrent.futures import ThreadPoolExecutor, as_completed
-
-     config.configure_dspy()
-
-     advisor = load_optimized()
-     n = len(TEST_QUESTIONS)
-
-     def run_one(i_q):
-         i, q = i_q
-         try:
-             pred = advisor(user_question=q, history=dspy.History(messages=[]))
-             gold = dspy.Example(user_question=q).with_inputs("user_question")
-             m = gita_metric(gold, pred)
-             return i, q, {
-                 "question": q,
-                 "score": round(float(m.score), 3),
-                 "word_count": len(pred.response.split()),
-                 "sources_cited": pred.sources_cited,
-                 "response_excerpt": pred.response[:200],
-                 "feedback_excerpt": m.feedback[:500],
-             }
-         except Exception as e:
-             return i, q, {"question": q, "error": str(e), "score": 0.0}
-
-     indexed = list(enumerate(TEST_QUESTIONS, 1))
-     results_map = {}
-     with ThreadPoolExecutor(max_workers=n) as pool:
-         futures = {pool.submit(run_one, iq): iq for iq in indexed}
-         for fut in as_completed(futures):
-             i, q, result = fut.result()
-             results_map[i] = result
-             if "error" in result:
-                 log(f"  [{i}/{n}] ERROR: {result['error']}", logfile)
-             else:
-                 log(f"  [{i}/{n}] score={result['score']:.3f} wc={result['word_count']} sources={result['sources_cited']}", logfile)
-
-     results = [results_map[i] for i in range(1, n + 1)]
-     avg = sum(r.get("score", 0) for r in results) / n
-     log(f"=== TEST SUITE DONE — avg score: {avg:.3f} ===", logfile)
-     return {"questions": results, "avg_score": round(avg, 3), "timestamp": ts()}
-
-
- def dump_prompts(logfile):
-     """Re-extract and log optimized prompts to a human-readable file."""
-     if not OPTIMIZED_PATH.exists():
-         log("  No optimized program found — skipping prompt dump.", logfile)
-         return
-
-     sys.path.insert(0, str(ROOT))
-     import config
-     from advisor import GitaAdvisor
-     config.configure_dspy()
-
-     advisor = GitaAdvisor()
-     try:
-         advisor.load(str(OPTIMIZED_PATH))
-     except Exception as e:
-         log(f"  Could not load optimized program: {e}", logfile)
-         return
-
-     lines = ["# Optimized Prompts after GEPA overnight run", f"# Extracted at {ts()}", ""]
-     for name, predictor in advisor.named_predictors():
-         sig = predictor.signature
-         lines.append(f"## {name}")
-         lines.append("### Instructions")
-         lines.append(sig.instructions or "(none)")
-         lines.append("")
-         lines.append("### Field descriptions")
-         for fname, field in sig.fields.items():
-             extras = field.json_schema_extra or {}
-             desc = extras.get("desc", "") if isinstance(extras, dict) else ""
-             lines.append(f"  {fname}: {desc}")
-         lines.append("")
-         lines.append("---")
-         lines.append("")
162
-
163
- PROMPTS_PATH.write_text("\n".join(lines), encoding="utf-8")
164
- log(f" Prompts written to {PROMPTS_PATH}", logfile)
165
-
166
-
167
- def main():
168
- ap = argparse.ArgumentParser()
169
- ap.add_argument("--skip-light", action="store_true")
170
- ap.add_argument("--skip-medium", action="store_true")
171
- ap.add_argument("--skip-tests", action="store_true")
172
- args = ap.parse_args()
173
-
174
- LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
175
-
176
- with LOG_PATH.open("w", encoding="utf-8") as logfile:
177
- log("=== OVERNIGHT GEPA RUN STARTED ===", logfile)
178
- log(f"Dataset: {ROOT / 'data' / 'synthetic_questions.jsonl'}", logfile)
179
- log(f"Output: {OPTIMIZED_PATH}", logfile)
180
-
181
- python = sys.executable
182
-
183
- # ── Phase 1: Light ──
184
- if not args.skip_light:
185
- ok = run_phase(
186
- [python, "optimize_gepa.py", "--auto", "light"],
187
- "GEPA LIGHT",
188
- logfile,
189
- )
190
- if not ok:
191
- log("Light phase failed — stopping overnight run.", logfile)
192
- sys.exit(1)
193
- # Back up light result
194
- if OPTIMIZED_PATH.exists():
195
- import shutil
196
- shutil.copy(OPTIMIZED_PATH, OPTIMIZED_PATH.with_suffix(".light.json"))
197
- log(f" Backed up light result to {OPTIMIZED_PATH.with_suffix('.light.json')}", logfile)
198
- else:
199
- log("Skipping light phase (--skip-light).", logfile)
200
-
201
- # ── Phase 2: Medium ──
202
- if not args.skip_medium:
203
- ok = run_phase(
204
- [python, "optimize_gepa.py", "--auto", "medium"],
205
- "GEPA MEDIUM",
206
- logfile,
207
- )
208
- if not ok:
209
- log("Medium phase failed.", logfile)
210
- # Don't exit — still dump whatever we have
211
- else:
212
- log("Skipping medium phase (--skip-medium).", logfile)
213
-
214
- # ── Dump prompts ──
215
- log("Extracting optimized prompts ...", logfile)
216
- dump_prompts(logfile)
217
-
218
- # ── Test suite ──
219
- if not args.skip_tests:
220
- test_results = run_test_suite(logfile)
221
- RESULTS_PATH.write_text(json.dumps(test_results, indent=2, ensure_ascii=False), encoding="utf-8")
222
- log(f"Test results written to {RESULTS_PATH}", logfile)
223
- else:
224
- log("Skipping test suite (--skip-tests).", logfile)
225
-
226
- log("=== OVERNIGHT RUN COMPLETE ===", logfile)
227
-
228
-
229
- if __name__ == "__main__":
230
- main()
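The deleted run_phase() above streams a child process's merged stdout/stderr into a logfile while echoing only lines that match a keyword list. A minimal standalone sketch of the same pattern (the stream_phase name and keyword list here are illustrative, not part of the removed script):

```python
import subprocess
import sys

def stream_phase(cmd: list[str], keywords: tuple[str, ...] = ("ERROR",)) -> tuple[int, list[str]]:
    """Run cmd, capture every output line, and echo only keyword lines.

    stderr is merged into stdout so one loop sees all output in order,
    and the child's exit code is returned alongside the captured lines.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams: a single ordered read loop
        text=True,
    )
    captured = []
    for line in proc.stdout:
        captured.append(line)
        if any(k in line for k in keywords):
            print(line, end="", flush=True)  # echo only the interesting lines
    proc.wait()
    return proc.returncode, captured

# Exercise it against a tiny child process.
code, lines = stream_phase(
    [sys.executable, "-c", "print('Step 1'); print('noise'); print('Step 2')"],
    keywords=("Step",),
)
```

Merging stderr into stdout is what keeps the log readable: with separate pipes the two streams would need their own reader threads to avoid deadlock on full pipe buffers.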
 
smoke_test.py DELETED
@@ -1,99 +0,0 @@
- """
- smoke_test.py — verify the full pipeline before spending hours on GEPA.
-
- Runs:
-   1. LM connectivity check
-   2. Retriever connectivity check
-   3. One end-to-end advisor call
-   4. One metric call against the result
-
- If any step fails, the error message tells you which knob to turn.
-
-     python smoke_test.py "I just got laid off and feel like nothing makes sense anymore."
- """
-
- from __future__ import annotations
- import sys
- import json
- import dspy
-
- import config
- from advisor import GitaAdvisor
- from knowledge_base import AdvaitaRetriever
- from metrics import gita_metric
-
-
- def step(label: str):
-     print(f"\n── {label} " + "─" * (60 - len(label)))
-
-
- def main():
-     user_q = sys.argv[1] if len(sys.argv) > 1 else (
-         "I just got laid off and feel like nothing makes sense anymore."
-     )
-
-     step("1. Configure LMs")
-     task_lm, reflection_lm = config.configure_dspy()
-     print(f" task_lm: {task_lm.model}")
-     print(f" reflection_lm: {reflection_lm.model}")
-
-     step("2. LM round-trip")
-     try:
-         out = task_lm("Reply with the single word: ready.")
-         print(f" reply: {out!r}")
-     except Exception as e:
-         print(f" FAILED — is LM Studio running at {config.LM_STUDIO_BASE}?\n {e}")
-         sys.exit(1)
-
-     step("3. Retriever sanity")
-     try:
-         retr = AdvaitaRetriever()
-         hits = retr.search("non-attachment to results of action", k=3)
-         if not hits:
-             print(" WARNING: no retrieval results. Did you build the index?")
-             print(" Run: python knowledge_base.py --build")
-         else:
-             for h in hits:
-                 v = h.verse
-                 section = f" — {v.section}" if v.section else ""
-                 print(f" [{v.tier}] {v.work}{section} score={h.combined_score:.3f}")
-     except Exception as e:
-         print(f" FAILED — index probably not built. Run "
-               f"`python knowledge_base.py --build` after dropping texts in sources/.")
-         print(f" {e}")
-         sys.exit(1)
-
-     step("4. End-to-end advisor call")
-     advisor = GitaAdvisor()
-     try:
-         pred = advisor(user_question=user_q, history=dspy.History(messages=[]))
-     except Exception as e:
-         print(f" FAILED — pipeline error: {e}")
-         sys.exit(1)
-
-     print(f"\n user: {user_q}")
-     print(f"\n felt: {pred.felt_emotion}")
-     print(f" surface: {pred.surface_concern}")
-     print(f" deeper: {pred.deeper_concern}")
-     print(f" themes: {pred.vedantic_themes}")
-     print(f" queries: {pred.queries}")
-     print(f" selected indices: {pred.selected_indices}")
-     print(f"\n --- response ---")
-     print(pred.response)
-     print(f"\n sources cited: {pred.sources_cited}")
-
-     step("5. Metric round-trip")
-     gold = dspy.Example(user_question=user_q, history=dspy.History(messages=[])).with_inputs("user_question", "history")
-     m = gita_metric(gold, pred)
-     print(f" composite score: {m.score:.3f}")
-     print(f"\n --- feedback (this is what GEPA's reflection LM sees) ---")
-     print(m.feedback)
-
-     step("Done")
-     print("If you got here, you're ready to run:")
-     print("  python dataset_generator.py --n 500")
-     print("  python optimize_gepa.py --auto medium")
-
-
- if __name__ == "__main__":
-     main()
 
sources_local/.gitkeep ADDED
File without changes
sources_registry.py DELETED
@@ -1,331 +0,0 @@
- """
- sources_registry.py — the one place every open source lives.
-
- Why a registry rather than scattered URLs?
- ------------------------------------------
- Adding a new text to the corpus shouldn't mean editing five files. It should
- mean adding one entry here. Downloads, parsing, re-indexing, and enrichment
- all read from this registry, so the registry *is* the corpus definition.
-
- How sources are categorized
- ---------------------------
- Every source belongs to a "tier", which the retriever uses to break ties when
- two passages score equally on cosine similarity:
-
-     primary    — the śruti and the Gītā itself (the thing being commented on)
-     shankara   — Śaṅkarācārya's bhāṣyas and prakaraṇa-granthas (his own pen)
-     supporting — texts in his lineage but not by him (Aṣṭāvakra, Yoga Vāsiṣṭha,
-                  Vidyāraṇya's Pañcadaśī, modern Ramaṇa & Nisargadatta where
-                  explicitly placed in the Advaita stream)
-
- The tier weights live in knowledge_base.py; this file just labels.
-
- License classes
- ---------------
- We track licensing because the project is meant to be shareable. We refuse to
- register any source that is not unambiguously open. The classes are:
-
-     public_domain — pre-1929 works in US PD; covers most 19th-c. translations
-     unlicense     — Unlicense / CC0 / equivalent dedications
-     cc_by         — Creative Commons Attribution (must preserve credit)
-     cc_by_sa      — Creative Commons ShareAlike
-     open_database — ODbL (the dataset license used by some github corpora)
-
- Anything we'd label "publisher_copyright" simply doesn't get an entry. If you
- want the modern Advaita Ashrama translations, you must obtain a license and
- add the texts yourself in the user-supplied directory.
- """
-
- from __future__ import annotations
- from dataclasses import dataclass, field
- from typing import Literal
-
-
- # ──────────────────────────── Type aliases ────────────────────────────
- Tier = Literal["primary", "shankara", "supporting"]
- License = Literal[
-     "public_domain", "unlicense", "cc_by", "cc_by_sa", "open_database",
- ]
- Parser = Literal[
-     "gita_json",       # the gita/gita repo JSON layout (verse-indexed)
-     "wisdomlib_html",  # one chapter per HTML page on wisdomlib
-     "sastry_archive",  # Alladi Mahadeva Sastry OCR text from archive.org
-     "thibaut_sbe",     # Thibaut's SBE Brahma Sutra translation HTML
-     "plain_text",      # already-cleaned plain text the user dropped in
- ]
-
-
- # ──────────────────────────── Source entry ────────────────────────────
- @dataclass(frozen=True)
- class Source:
-     """One downloadable source. The registry is a list of these.
-
-     The download_sources script understands two kinds of `urls`:
-       - HTTPS URLs to direct files (json, html, txt) — fetched with `requests`
-       - "git+https://..." URLs — cloned with `git clone --depth=1`
-
-     The parser receives the local path(s) and is responsible for emitting
-     Verse records into the corpus.
-     """
-     # Identity
-     key: str    # short slug used as folder name; must be unique
-     name: str   # human-readable name
-     work: str   # the work; matches Verse.work for grouping
-     tier: Tier
-
-     # Provenance
-     license: License
-     license_url: str = ""    # canonical license URL or attribution page
-     translator: str = ""     # who did the English translation
-     year: int | None = None  # year of the edition we're using
-
-     # Download
-     urls: tuple[str, ...] = ()  # one or more files / git repos
-     parser: Parser = "plain_text"
-
-     # Operational
-     enabled: bool = True  # set False to skip without deleting the entry
-     notes: str = ""       # anything a future reader should know
-
-
- # ──────────────────────────── The registry ────────────────────────────
- #
- # This list is the source of truth. Everything else reads it.
- #
- # Conventions:
- #   - One entry per *publication*, not per chapter file. The parser knows how
- #     to walk its own files.
- #   - URLs that work as of the writing of this comment are noted. If a URL
- #     drifts, fix it here and re-run `download_sources.py`.
- #   - When in doubt about license, leave the source disabled and add a note.
- #
- SOURCES: list[Source] = [
-
-     # ─── Bhagavad Gītā: Sanskrit + transliteration + word meanings ───
-     # The gita/gita repo gives us the cleanest verse-indexed data on the web.
-     # Released under the Unlicense, which is a public-domain dedication. We
-     # use the static GitHub Pages mirror because it's directly fetchable as
-     # JSON files; cloning the repo is also fine.
-     Source(
-         key="gita_json_core",
-         name="Bhagavad Gītā — verse-indexed JSON (core)",
-         work="bhagavad_gita",
-         tier="primary",
-         license="unlicense",
-         license_url="https://github.com/gita/gita/blob/main/LICENSE",
-         translator="Sanskrit + IAST transliteration + word-by-word gloss",
-         year=None,
-         urls=(
-             "https://ravisiyer.github.io/gita-data/v1/chapters.json",
-             "https://ravisiyer.github.io/gita-data/v1/verse.json",
-         ),
-         parser="gita_json",
-         notes=(
-             "This is the spine of the Gītā corpus. Sanskrit + transliteration + "
-             "word-meanings. Translations come from translator-specific files."
-         ),
-     ),
-
-     # ─── Bhagavad Gītā: English translations (one or more) ───
-     # The translations.json file is large (~2 MB) and contains multiple
-     # translators keyed by author_id. Our parser will pick public-domain ones.
-     Source(
-         key="gita_json_translations",
-         name="Bhagavad Gītā — English translations (multiple authors)",
-         work="bhagavad_gita",
-         tier="primary",
-         license="unlicense",
-         license_url="https://github.com/gita/gita/blob/main/LICENSE",
-         translator="multiple — see per-verse author_id",
-         year=None,
-         urls=(
-             "https://ravisiyer.github.io/gita-data/v1/translation.json",
-             "https://ravisiyer.github.io/gita-data/v1/authors.json",
-         ),
-         parser="gita_json",
-         notes=(
-             "Parser keeps only translators whose works are public-domain or "
-             "explicitly free; e.g. Swami Sivananda is OK, ISKCON Prabhupada "
-             "is excluded. See parsers/gita_json.py for the allowlist."
-         ),
-     ),
-
-     # ─── Śaṅkara's Gītā Bhāṣya, Sastry 1897 translation ───
-     # The only full English translation of Śaṅkara's Gītā commentary that's
-     # unambiguously in the public domain (Sastry died ~1926; first published
-     # 1897). Lives on archive.org as OCR text. Parser handles OCR noise.
-     Source(
-         key="sastry_gita_bhashya",
-         name="Śaṅkara's Bhagavad Gītā Bhāṣya — Sastry translation (1897)",
-         work="bhagavad_gita_bhashya",
-         tier="shankara",
-         license="public_domain",
-         license_url="https://archive.org/details/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya",
-         translator="Alladi Mahadeva Sastry",
-         year=1897,
-         urls=(
-             # Direct OCR text. The /download/ path is reliably the raw file;
-             # /stream/ is the HTML viewer and not what we want.
-             "https://archive.org/download/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya_djvu.txt",
-         ),
-         parser="sastry_archive",
-         notes=(
-             "OCR will have noise — broken hyphens, occasional 'rn' → 'm'. "
-             "Parser uses verse-marker regex to chunk by verse and tries to "
-             "associate Śaṅkara's commentary with the verse it follows."
-         ),
-     ),
-
-     # ─── Telang's Gītā translation, SBE Vol. 8 (1882) ───
-     # An alternative to Sastry for the Gītā translation itself. Useful when
-     # we want a second voice for the verse text, since Sastry was sometimes
-     # paraphrasing Śaṅkara's gloss into the translation.
-     Source(
-         key="telang_gita",
-         name="Bhagavad Gītā — Telang translation, SBE Vol. 8 (1882)",
-         work="bhagavad_gita",
-         tier="primary",
-         license="public_domain",
-         license_url="https://en.wikipedia.org/wiki/Sacred_Books_of_the_East",
-         translator="Kāshināth Trimbak Telang",
-         year=1882,
-         urls=tuple(
-             f"https://www.wisdomlib.org/hinduism/book/the-bhagavadgita/d/doc{n}.html"
-             for n in range(81668, 81686)  # chapters 1–18
-         ),
-         parser="wisdomlib_html",
-         enabled=False,  # off by default — gita_json_translations gives us enough
-         notes=(
-             "Wisdomlib mirrors Telang's SBE 8 translation as one chapter per "
-             "page. Enable if you want a second translation alongside gita_json."
-         ),
-     ),
-
-     # ─── Mundaka Upaniṣad with Śaṅkara's Bhāṣya ───
-     # Wisdomlib hosts a complete English edition of Mundaka with Śaṅkara's
-     # commentary. Likely older Sitarama Sastri translation, public domain.
-     Source(
-         key="mundaka_shankara",
-         name="Muṇḍaka Upaniṣad with Śaṅkara's Bhāṣya",
-         work="mundaka_upanishad",
-         tier="shankara",
-         license="public_domain",
-         license_url="https://www.wisdomlib.org/hinduism/book/mundaka-upanishad-shankara-bhashya",
-         translator="Sitarama Sastri (1898)",
-         year=1898,
-         urls=(
-             "https://www.wisdomlib.org/hinduism/book/mundaka-upanishad-shankara-bhashya",
-         ),
-         parser="wisdomlib_html",
-         enabled=False,  # wisdomlib_html parser not yet implemented
-         notes=(
-             "The wisdomlib parser will follow the table-of-contents links from "
-             "this index page to fetch each section."
-         ),
-     ),
-
-     # ─── Brahma Sūtras with Śaṅkara's Bhāṣya, Thibaut translation ───
-     # SBE volumes 34 (1890) and 38 (1896). The most-cited English translation
-     # of the Brahma Sūtra Bhāṣya, used by every academic working in Vedānta.
-     # Squarely public domain.
-     Source(
-         key="thibaut_brahma_sutra",
-         name="Brahma Sūtras with Śaṅkara Bhāṣya — Thibaut translation",
-         work="brahma_sutra_bhashya",
-         tier="shankara",
-         license="public_domain",
-         license_url="https://archive.org/details/SacredBooksOfTheEastVol34",
-         translator="George Thibaut (SBE 34 & 38)",
-         year=1890,
-         urls=(
-             # archive.org full-text URLs for SBE 34 and 38
-             "https://archive.org/download/SacredBooksOfTheEastVol34/sbe34_djvu.txt",
-             "https://archive.org/download/SacredBooksOfTheEastVol38/sbe38_djvu.txt",
-         ),
-         parser="thibaut_sbe",
-         enabled=False,  # parser not implemented in v1 — see parsers/README.md
-         notes=(
-             "Disabled by default until thibaut_sbe parser is written. The text "
-             "is structured by adhikaraṇa (topic groups of sūtras), not by "
-             "single sūtras, so the parser needs more care than the others."
-         ),
-     ),
-
-     # ─── Vivekacūḍāmaṇi (Mohini Chatterji translation) ───
-     # The most famous prakaraṇa attributed to Śaṅkara. 581 verses.
-     # Mohini Chatterji's translation is early-20th-c., public domain.
-     Source(
-         key="vivekachudamani_chatterji",
-         name="Vivekacūḍāmaṇi — Mohini Chatterji translation",
-         work="vivekachudamani",
-         tier="shankara",
-         license="public_domain",
-         translator="Mohini M. Chatterji",
-         year=1932,
-         urls=(
-             # The user should fill this in; placeholder for the registry shape
-             # so the downloader logs a clear "URL missing" message rather than
-             # silently skipping.
-             "",
-         ),
-         parser="plain_text",
-         enabled=False,
-         notes=(
-             "Drop a clean copy at sources_local/vivekachudamani.txt and the "
-             "plain_text parser will pick it up. Several archive.org editions "
-             "exist; verse markers vary by edition."
-         ),
-     ),
-
-     # ─── User-provided plain-text drop-in slot ───
-     # If you already have a clean text file (a translation you typed up, a
-     # lecture transcript you cleaned, anything), drop it in sources_local/
-     # named tier__work__section.txt and the plain_text parser will fold it in.
-     Source(
-         key="user_local",
-         name="User-provided plain-text sources",
-         work="user_local",
-         tier="supporting",
-         license="public_domain",  # under your responsibility
-         urls=(),
-         parser="plain_text",
-         enabled=False,  # no URLs; user drops files manually into sources_local/
-         notes=(
-             "Anything in sources_local/. Convention: tier__work__section.txt "
-             "(see parsers/plain_text.py)."
-         ),
-     ),
- ]
-
-
- # ──────────────────────────── Helpers ────────────────────────────
- def by_key(key: str) -> Source:
-     """Look up a source by its registry key. Raises KeyError on miss."""
-     for s in SOURCES:
-         if s.key == key:
-             return s
-     raise KeyError(f"No source with key={key!r}")
-
-
- def enabled_sources() -> list[Source]:
-     return [s for s in SOURCES if s.enabled]
-
-
- def by_parser(parser: Parser) -> list[Source]:
-     """Group enabled sources by their parser, useful for the ingest loop."""
-     return [s for s in SOURCES if s.enabled and s.parser == parser]
-
-
- def attribution_for(work: str) -> list[str]:
-     """Returns the attribution lines for any work, for citation footers.
-
-     Even though all our sources are PD, citing translators is right. The
-     advisor's response footer can call this and append the translators to
-     the bibliography lines.
-     """
-     out = []
-     for s in SOURCES:
-         if s.work == work and s.translator:
-             year = f", {s.year}" if s.year else ""
-             out.append(f"{s.translator}{year} ({s.license})")
-     return out
 
streamlit_app.py CHANGED
@@ -178,7 +178,6 @@ _EXAMPLES = [
    "I keep hurting the people I love without meaning to.",
    "I've been meditating for years but still feel empty.",
    "My ambition feels hollow but I can't stop chasing it.",
-   "My boss no longer wants me on his team and I feel humiliated.",
]