minimise HF Space: remove training pipeline, update corpus index
Deleted all training-only files (parsers, ingest/enrich/download scripts,
metrics, optimizer, dataset generator, Gradio app.py, CLAUDE.md, dev logs).
Updated Chroma index with Swarupananda verses; synced config.py and
knowledge_base.py from main repo (adds OpenRouter backend). Removed
'my boss' example question from streamlit_app.py. Updated .gitignore to
protect secrets and training artifacts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .gitignore +17 -3
- CLAUDE.md +0 -232
- app.py +0 -1101
- artifacts/chroma/52cdeb15-0631-44ed-8618-782f1d4d27bb/data_level0.bin +1 -1
- artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/data_level0.bin +3 -0
- artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/header.bin +3 -0
- artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/length.bin +3 -0
- parsers/__init__.py → artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/link_lists.bin +0 -0
- artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/data_level0.bin +3 -0
- artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/header.bin +3 -0
- artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/length.bin +3 -0
- artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/link_lists.bin +0 -0
- artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/data_level0.bin +3 -0
- artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/header.bin +3 -0
- artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/length.bin +3 -0
- artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/link_lists.bin +0 -0
- artifacts/chroma/chroma.sqlite3 +2 -2
- artifacts/optimized_advisor.prompts.txt +0 -87
- chat.py +0 -392
- config.py +51 -15
- data/.gitkeep +0 -0
- data/synthetic_questions.jsonl +0 -0
- dataset_generator.py +0 -332
- download_sources.py +0 -195
- enrich_corpus.py +0 -174
- enrichment.py +0 -266
- ingest_corpus.py +0 -203
- knowledge_base.py +2 -43
- metrics.py +0 -435
- optimize_gepa.py +0 -200
- parsers/gita_json.py +0 -236
- parsers/sastry_archive.py +0 -249
- run_overnight.py +0 -230
- smoke_test.py +0 -99
- sources_local/.gitkeep +0 -0
- sources_registry.py +0 -331
- streamlit_app.py +0 -1
.gitignore
CHANGED

```diff
@@ -1,14 +1,28 @@
 .env
+*.env
 __pycache__/
 *.pyc
 *.pyo
+.DS_Store
+
+# Training pipeline — lives in the main repo, not needed in this HF Space
 data/raw/
 data/enrichment_cache.jsonl
 data/corpus.jsonl
+data/corpus_enriched.jsonl
+data/synthetic_questions.jsonl
+
+# Training artifacts — logs, test runs, GEPA state
 artifacts/gepa_logs/
 artifacts/gepa_state.bin
 artifacts/*.log
-
-
+artifacts/test_results.json
+artifacts/optimized_advisor.prompts.txt
+
+# User-supplied local sources (not for public repo)
+sources_local/*
+!sources_local/.gitkeep
 sources/
-
+
+# Guard against accidentally nesting the repo inside itself
+Gita-advisor/
```
CLAUDE.md
DELETED

@@ -1,232 +0,0 @@

# CLAUDE.md — Project Primer for the Gītā Advisor

This file is read by Claude Code when you open this project. It is also a human-readable design memo. Read it once before asking Claude to do anything substantial, and keep it updated as the design evolves — when the file lies, Claude's behavior degrades.
## What this project is

A spiritual advisor grounded in Advaita Vedānta as taught by Śaṅkarācārya, optimized via DSPy + GEPA against a local LM Studio model. The advisor takes a real-life question or vent ("I just got laid off and feel like nothing makes sense") and produces a response that is empathetic to the felt experience, faithful to the non-dual lineage, and grounded in actual cited verses from the Gītā, the principal Upaniṣads, the Brahma Sūtras, and prakaraṇa-granthas. Wit is welcome, but only around the cosmic predicament, never around the user's pain.
## The pipeline, in one breath

User text →
`UnderstandQuery` (felt emotion + surface concern + deeper concern + themes) →
`PlanRetrieval` (3 diverse search queries) →
`AdvaitaRetriever.search_many` (multi-view RAG over verse-indexed corpus) →
`SelectPassages` (pick the 2–4 verses that actually fit) →
`SynthesizeAdvice` (compose the reply with citations) →
`dspy.Prediction` carrying the response and its full trace for the metric.

Each predictor is a `dspy.ChainOfThought`, so GEPA has a `reasoning` trace to inspect during reflection. The retriever is not optimized — vector search isn't text — but the *queries given to it* are, which is what `PlanRetrieval` exists to evolve.
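The stage handoff above can be sketched as plain functions threading one trace dict. These stubs are illustrative stand-ins for the DSPy predictors named above, with hard-coded example outputs, not the project's actual signatures:

```python
# Illustrative skeleton of the advisor's dataflow. Plain functions stand in for
# the dspy.ChainOfThought predictors; the stub outputs are hard-coded examples.
def understand_query(text: str) -> dict:
    # Real predictor: felt emotion + surface concern + deeper concern + themes.
    return {"felt_emotion": "anxiety", "themes": ["duty", "attachment"]}

def plan_retrieval(understanding: dict) -> list[str]:
    # Real predictor: writes 3 diverse search queries for the retriever.
    return [f"verses on {t}" for t in understanding["themes"]][:3]

def run_pipeline(text: str) -> dict:
    # Every stage appends to one trace so the metric can inspect each step.
    trace = {"user_text": text, "understanding": understand_query(text)}
    trace["queries"] = plan_retrieval(trace["understanding"])
    # search_many -> SelectPassages -> SynthesizeAdvice would follow here,
    # each adding its output to the same trace.
    return trace

result = run_pipeline("I just got laid off and feel like nothing makes sense")
```

The point of the single trace dict is the last pipeline step: the prediction carries every intermediate output, which is what lets the metric give GEPA stage-level feedback.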
## The two architectural choices that matter most

**Verse as the unit of retrieval.** Scripture is not arbitrary prose; the natural unit is the verse (śloka, mantra, sūtra). The corpus is therefore indexed by `verse_id` (e.g. `bhagavad_gita_02_47`), which has a stable human-readable form (`BG 2.47`). Citations from the advisor are exact-match verifiable against the retrieved set, which gives the metric a sharp signal to feed back into GEPA's reflection step.

**Multi-view embeddings to bridge the language gap.** Users do not write in the vocabulary of scripture — they say "I'm anxious about my career," not "I'm experiencing rāga toward kāmya-karma." So we use the local LLM, in a one-time offline pass, to enrich each verse with structured fields that speak the user's language: a paraphrase, themes, life situations, emotions addressed, practical teaching, and five hypothetical first-person questions. Three separate embeddings per verse — `literal_view`, `bhashya_view`, and `advisor_view` — let queries phrased in any register find the right verse.

The advisor view dominates retrieval (weight 0.55) because that is where the language gap closes; the literal and bhāṣya views (0.25, 0.20) act as insurance against the enrichment pipeline missing a topic.
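The weighted blend of the three views amounts to a simple score fusion. A minimal sketch, using the weights stated above; whether `AdvaitaRetriever` fuses exactly this way is not shown in this commit, and the Chroma query calls themselves are elided:

```python
# Weights from the design note above: advisor view dominates, the other two
# views act as insurance. (Assumed fusion logic, for illustration only.)
VIEW_WEIGHTS = {"advisor_view": 0.55, "literal_view": 0.25, "bhashya_view": 0.20}

def fuse_scores(per_view: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Combine per-view similarity scores into one ranking.

    per_view maps view name -> {verse_id: similarity in [0, 1]}.
    A verse missing from a view simply contributes 0 for that view.
    """
    fused: dict[str, float] = {}
    for view, weight in VIEW_WEIGHTS.items():
        for verse_id, score in per_view.get(view, {}).items():
            fused[verse_id] = fused.get(verse_id, 0.0) + weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranking = fuse_scores({
    "advisor_view": {"bhagavad_gita_02_47": 0.9, "bhagavad_gita_18_66": 0.4},
    "literal_view": {"bhagavad_gita_18_66": 0.8},
})
```

Note the asymmetry this produces: a strong advisor-view hit (0.9 × 0.55) outranks a verse that scores well only on the literal view, which is exactly the "advisor view dominates" behavior described above.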
## File map

```
gita_advisor/
├── config.py               # paths, LM Studio URL, model strings, embed config
├── sources_registry.py     # central catalog of every open source we use
├── download_sources.py     # downloads everything to data/raw/<source_key>/
├── corpus.py               # Verse / EnrichedVerse dataclasses + JSONL I/O
├── parsers/                # one module per source format
│   ├── gita_json.py        # ↳ gita/gita verse-indexed JSON (Unlicense)
│   └── sastry_archive.py   # ↳ Sastry 1897 OCR text from archive.org (PD)
├── ingest_corpus.py        # runs parsers, merges by verse_ref → corpus.jsonl
├── enrichment.py           # DSPy module: Verse → EnrichedVerse via local LLM
├── enrich_corpus.py        # batch enrichment with caching; long-running
├── knowledge_base.py       # 3-view Chroma index; AdvaitaRetriever
├── signatures.py           # the four DSPy signatures GEPA optimizes
├── advisor.py              # composed dspy.Module — what GEPA optimizes
├── metrics.py              # rule-based + LLM-judge composite, with feedback
├── dataset_generator.py    # synthesizes ~500 life-situation questions
├── optimize_gepa.py        # runs GEPA over the advisor with the metric
├── chat.py                 # interactive CLI — load optimized advisor, chat
├── smoke_test.py           # 5-step pipeline check before committing time
├── data/
│   ├── raw/                    # pristine downloads, one folder per source key
│   ├── corpus.jsonl            # parsed Verses, merged across sources
│   ├── corpus_enriched.jsonl   # Verses + LLM-extracted fields
│   └── enrichment_cache.jsonl  # append-only cache for resumable enrichment
└── artifacts/
    ├── chroma/                  # the three view-collections
    └── optimized_advisor.json   # GEPA's compiled program
```
## Source provenance, in one place

Every source must be unambiguously open. The four pillars currently enabled or staged are described below in prose so the rationale doesn't get buried.

The `gita/gita` repository on GitHub provides the spine of the Gītā corpus. It is a verse-indexed JSON dataset with Sanskrit, IAST transliteration, and word-by-word glosses, released under the Unlicense (a public-domain dedication). We pull it via a static-file mirror at `ravisiyer.github.io/gita-data/v1/` so a single `requests.get` is enough; cloning the whole repo also works.

Alladi Mahadeva Sastry's 1897 translation of Śaṅkara's Gītā Bhāṣya lives on archive.org as full OCR text. It is the only complete English translation of Śaṅkara's Gītā commentary that is unambiguously in the public domain (Sastry died in 1926, the work itself dates to 1897). The OCR has predictable noise — broken hyphens, occasional "rn" → "m" — and `parsers/sastry_archive.py` is patient about it.

The wisdomlib site mirrors the *Sacred Books of the East* series and other public-domain Indology — Telang's 1882 Gītā, Mundaka with Śaṅkara, etc. The `wisdomlib_html` parser is registered but not yet implemented; this is on the to-do list. `sacred-texts.com` carries the same content but blocks some HTTP fetchers, so on the Mac you can use either.

What we deliberately do not include: Swami Gambhirananda's translations (Advaita Ashrama copyright, mid-20th c.), modern Ramaṇa or Nisargadatta editions, ISKCON's Prabhupada commentary. If you want any of these, place your own copies in `sources_local/` under your own license judgment.
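The "patience" the Sastry parser needs is mostly mechanical. A minimal sketch of the de-hyphenation step for line-break hyphens (the "rn" → "m" repair needs dictionary guarding to avoid corrupting real words, so it is deliberately omitted; this is not the project's actual parser code):

```python
import re

def join_hyphenated(text: str) -> str:
    """Rejoin words the OCR split across line breaks with a trailing hyphen.

    Note this also collapses legitimate hyphenated compounds that happened to
    break at end-of-line, which is the usual trade-off for OCR cleanup.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

cleaned = join_hyphenated("the aspirant practises karma-\nyoga without attach-\n ment")
```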
## The pipeline of commands

The first time, in order:

```bash
pip install -r requirements.txt

# 1. Download the registered open sources to data/raw/. Polite (1 req/s/host),
#    idempotent (skips files already present). Re-run with --force to refresh.
python download_sources.py

# 2. Parse the raw downloads into a unified verse corpus. Merges Gītā verse
#    text with Śaṅkara's bhāṣya by verse_ref. Outputs data/corpus.jsonl.
python ingest_corpus.py

# 3. Run the local LLM over every verse to extract paraphrase + life
#    situations + emotions + hypothetical questions. SLOW — several hours,
#    overnight is normal. Resumable via append-mode cache, so kill -9 is safe.
#    Outputs data/corpus_enriched.jsonl.
python enrich_corpus.py
# Smoke-test on 50 verses first if you want to verify the prompt is producing
# good output before committing the overnight run:
python enrich_corpus.py --limit 50

# 4. Build the three Chroma view-indices from the enriched corpus.
python knowledge_base.py --build

# 5. Sanity-check the pipeline on one user question.
python smoke_test.py "I just got laid off and feel like nothing matters anymore"

# 6. Generate the synthetic dataset of ~500 user questions for GEPA training.
python dataset_generator.py --n 500

# 7. Run GEPA optimization. Also long — start with --auto light to verify, then
#    re-run at --auto medium for the real pass.
python optimize_gepa.py --auto medium

# 8. Open the chat CLI with the optimized program loaded.
python chat.py
```

After the first run, only steps 4–8 normally re-run. Steps 1–3 are one-time unless you change sources or the enrichment prompt.
## Two things to watch out for

**LM Studio model name.** The exact string `google/gemma-4-26b-a4b` (or whatever you settle on) goes in `config.py` as `LOCAL_MODEL`, and DSPy prefixes it with `openai/` to route through the OpenAI client. If LM Studio reports a different model identifier in its API, copy-paste verbatim.

**Failed enrichments.** The local model occasionally produces malformed structured output. The enricher retries twice and, on persistent failure, stamps `enrichment_model = "FAILED: <reason>"` on the verse. The verse is still indexed on its literal text and bhāṣya, just without the advisor view. After the full pass, run `python enrich_corpus.py --only-failed` to retry just those, perhaps after tuning the prompt in `enrichment.py`.
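Given the `FAILED:` stamp described above, the `--only-failed` pass presumably just filters on it. A sketch with the stamp format from the text; the selection helper and dict shape are assumptions for illustration:

```python
def select_failed(verses: list[dict]) -> list[dict]:
    """Pick verses whose enrichment was stamped as failed, for a retry pass."""
    return [v for v in verses
            if str(v.get("enrichment_model", "")).startswith("FAILED:")]

corpus = [
    {"verse_id": "bhagavad_gita_02_47", "enrichment_model": "local-llm"},
    {"verse_id": "bhagavad_gita_18_66", "enrichment_model": "FAILED: malformed JSON"},
]
retry_queue = select_failed(corpus)
```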
## What is not yet done

The Sastry parser produces verse-attached bhāṣya but the verse-text / bhāṣya split is heuristic; spot-check a few verses (BG 2.47, BG 18.66 are good canaries) and tighten `_build_verse` if needed.

The `wisdomlib_html` parser and the `thibaut_sbe` (Brahma Sūtra) parser are registered but stubbed — adding either is a single-file change. They are disabled in `sources_registry.py` until written.

The metric still has the rule-based hooks for therapy clichés, length, and non-dual register but does not yet look at the new EnrichedVerse fields (`emotions_addressed`, `themes`) for empathy verification. There is a clear win there: when the user's `felt_emotion` appears in a selected verse's `emotions_addressed` list, that is strong evidence of empathic-fit retrieval.

The dataset generator was written before the schema shift; spot-check that its output still flows through the pipeline cleanly.
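The empathy-verification win described above is essentially a one-line set check. A sketch of what such a metric hook might look like; the function name, bonus value, and dict shape are assumptions, only the field names come from the text:

```python
def empathic_fit_bonus(felt_emotion: str, selected_verses: list[dict]) -> float:
    """Reward selections where some chosen verse explicitly addresses the
    user's felt emotion via its enriched emotions_addressed list."""
    hit = any(
        felt_emotion.lower() in [e.lower() for e in v.get("emotions_addressed", [])]
        for v in selected_verses
    )
    return 0.1 if hit else 0.0

bonus = empathic_fit_bonus("grief", [
    {"verse_ref": "BG 2.11", "emotions_addressed": ["Grief", "fear"]},
])
```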
## How to talk to me when working in this project

The most useful prompts are concrete and bounded. "Tighten the verse-bhāṣya split heuristic in `parsers/sastry_archive.py` and run it on the first three chapters; show me the BG 2.47 record" is a good prompt. "Improve the parser" is not.

When something is broken, read the relevant file end-to-end before patching. The comments in this project are unusually heavy because the design has many small choices that stop being obvious six months from now. If a comment disagrees with the code, the comment is more likely to be right and you should ask whether the code drifted, not whether the comment did.

When designing a new piece, start by asking what `Verse` / `EnrichedVerse` field carries the information, before reaching for new state. The data model is meant to be the contract between modules; adding ad-hoc fields on the side is how RAG systems become spaghetti.
## Pinned design commitments (do not silently break these)

The advisor is grounded in Advaita Vedānta as Śaṅkara taught it. We do not import dualistic theology, and we do not reduce Advaita to "we are all one" pop-spirituality. We hold the two-truths distinction (vyāvahārika and pāramārthika) actively, and we do not collapse the user's lived suffering into "it's all māyā anyway." When a teaching has a Sanskrit name with a precise meaning, we use the Sanskrit name with a brief gloss rather than substituting an approximate English word.

Citations are exact and verifiable. "BG 2.47" in a response means the verse was in the retrieved set. The metric enforces this; do not weaken it.

The advisor is not therapy and is not a chatbot friend. It is a teacher in the tradition of the lineage. It is allowed to push back, to challenge a question's premise, and to recommend silence over more words.

The retriever is permissive; the selector is picky. Do not move filtering upstream into the retriever — once a verse is filtered out at retrieval, no later stage can recover it.
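The exact-citation commitment is cheap to enforce mechanically. A sketch of the check; the regex and set logic here are assumptions for illustration, not the project's `metrics.py`:

```python
import re

# Matches human-readable citations like "BG 2.47" (pattern assumed from the
# verse_ref convention; other works would need their own prefixes).
CITATION_RE = re.compile(r"\bBG\s+\d+\.\d+\b")

def uncited_references(response: str, retrieved_refs: set[str]) -> set[str]:
    """Return citations in the response that were NOT in the retrieved set."""
    return set(CITATION_RE.findall(response)) - retrieved_refs

bad = uncited_references(
    "Act without attachment to results (BG 2.47); surrender is BG 18.66.",
    {"BG 2.47"},
)
```

An empty result means every citation is backed by a retrieved verse; anything else is a hallucinated reference the metric can penalize with an exact, inspectable reason.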
app.py
DELETED
|
@@ -1,1101 +0,0 @@
|
|
| 1 |
-
"""
|
| 2 |
-
app.py — Enhanced Gradio web interface for the Gītā Advisor.
|
| 3 |
-
|
| 4 |
-
Features:
|
| 5 |
-
- Real-time stage progress during inference (◌ understanding → searching → composing)
|
| 6 |
-
- Character-by-character response streaming
|
| 7 |
-
- Verse explorer: select any cited source to read Sanskrit, translation, Śaṅkara's bhāṣya
|
| 8 |
-
- Warm spiritual aesthetic
|
| 9 |
-
"""
|
| 10 |
-
|
| 11 |
-
from __future__ import annotations
|
| 12 |
-
import json
|
| 13 |
-
import re
|
| 14 |
-
import threading
|
| 15 |
-
import time
|
| 16 |
-
from types import SimpleNamespace
|
| 17 |
-
|
| 18 |
-
import gradio as gr
|
| 19 |
-
import dspy
|
| 20 |
-
from openai import OpenAI
|
| 21 |
-
|
| 22 |
-
import config
|
| 23 |
-
from advisor import load_optimized
|
| 24 |
-
from knowledge_base import AdvaitaRetriever, format_passages_for_llm
|
| 25 |
-
from corpus import EnrichedVerse, Verse, read_jsonl_enriched, read_jsonl_verses
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
class _ExplainInContext(dspy.Signature):
|
| 29 |
-
"""You are the Gītā Advisor continuing a conversation. The user has asked
|
| 30 |
-
you to unpack a specific verse or passage you cited. Explain what it means
|
| 31 |
-
and why it speaks precisely to their situation — go deeper than the initial
|
| 32 |
-
response did. Reference the user's words. Close with one concrete way to
|
| 33 |
-
hold or work with this text this week."""
|
| 34 |
-
|
| 35 |
-
verse_ref: str = dspy.InputField()
|
| 36 |
-
verse_content: str = dspy.InputField(
|
| 37 |
-
desc="Translation, original text (if available), and Śaṅkara's commentary."
|
| 38 |
-
)
|
| 39 |
-
conversation_context: str = dspy.InputField(
|
| 40 |
-
desc="The user's question and the advisor's response where this verse was cited."
|
| 41 |
-
)
|
| 42 |
-
explanation: str = dspy.OutputField(
|
| 43 |
-
desc="150-250 words. Grounded in Advaita. Do not merely restate the translation. "
|
| 44 |
-
"End with a practical suggestion for this week."
|
| 45 |
-
)
|
| 46 |
-
|
| 47 |
-
# ── startup — runs once when the Space boots ──────────────────────────────────
|
| 48 |
-
config.configure_dspy(backend="hf")
|
| 49 |
-
_advisor = load_optimized()
|
| 50 |
-
_retriever = AdvaitaRetriever()
|
| 51 |
-
_retriever._ensure()
|
| 52 |
-
|
| 53 |
-
# Direct OpenAI-compatible client for streaming synthesis — bypasses DSPy for
|
| 54 |
-
# the final step so tokens reach the browser as they're generated.
|
| 55 |
-
_synthesis_client = OpenAI(
|
| 56 |
-
base_url=config.HF_ROUTER_BASE,
|
| 57 |
-
api_key=config.HF_TOKEN,
|
| 58 |
-
)
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
def _load_verse_lookup() -> dict[str, Verse]:
|
| 62 |
-
lookup: dict[str, Verse] = {}
|
| 63 |
-
enriched = config.DATA_DIR / "corpus_enriched.jsonl"
|
| 64 |
-
plain = config.DATA_DIR / "corpus.jsonl"
|
| 65 |
-
if enriched.exists():
|
| 66 |
-
for v in read_jsonl_enriched(enriched):
|
| 67 |
-
lookup[v.verse_ref.lower().strip()] = v
|
| 68 |
-
elif plain.exists():
|
| 69 |
-
for v in read_jsonl_verses(plain):
|
| 70 |
-
lookup[v.verse_ref.lower().strip()] = v
|
| 71 |
-
return lookup
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
_verse_lookup = _load_verse_lookup()
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
# ── helpers ────────────────────────────────────────────────────────────────────
|
| 78 |
-
|
| 79 |
-
def _to_dspy_history(gradio_history: list) -> dspy.History:
|
| 80 |
-
"""Convert Gradio messages list to dspy.History, stripping source footers."""
|
| 81 |
-
msgs = []
|
| 82 |
-
i = 0
|
| 83 |
-
while i + 1 < len(gradio_history):
|
| 84 |
-
u, a = gradio_history[i], gradio_history[i + 1]
|
| 85 |
-
if u.get("role") == "user" and a.get("role") == "assistant":
|
| 86 |
-
content = a["content"]
|
| 87 |
-
if "\n\n---\n" in content:
|
| 88 |
-
content = content.split("\n\n---\n")[0]
|
| 89 |
-
msgs.append({
|
| 90 |
-
"user_question": u["content"],
|
| 91 |
-
"response": content,
|
| 92 |
-
"sources_cited": [],
|
| 93 |
-
})
|
| 94 |
-
i += 2
|
| 95 |
-
return dspy.History(messages=msgs)
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
def _render_verse_html(verse: Verse) -> str:
|
| 99 |
-
ev = verse if isinstance(verse, EnrichedVerse) else None
|
| 100 |
-
parts: list[str] = []
|
| 101 |
-
|
| 102 |
-
ref = verse.verse_ref
|
| 103 |
-
work = getattr(verse, "work_display", None) or getattr(verse, "work", "")
|
| 104 |
-
section = getattr(verse, "section_display", None) or getattr(verse, "section", "") or ""
|
| 105 |
-
subtitle = f"{work} — {section}" if section else work
|
| 106 |
-
|
| 107 |
-
parts.append(
|
| 108 |
-
f'<div class="vp-header">'
|
| 109 |
-
f' <span class="vp-ref">{ref}</span>'
|
| 110 |
-
f' <span class="vp-subtitle">{subtitle}</span>'
|
| 111 |
-
f'</div>'
|
| 112 |
-
)
|
| 113 |
-
|
| 114 |
-
if getattr(verse, "sanskrit", None):
|
| 115 |
-
parts.append(f'<div class="vp-sanskrit">{verse.sanskrit}</div>')
|
| 116 |
-
if getattr(verse, "transliteration", None):
|
| 117 |
-
parts.append(f'<div class="vp-iast">{verse.transliteration}</div>')
|
| 118 |
-
|
| 119 |
-
if getattr(verse, "translation", None):
|
| 120 |
-
tr = getattr(verse, "translator", None)
|
| 121 |
-
label = f"Translation ({tr})" if tr else "Translation"
|
| 122 |
-
parts.append(f'<div class="vp-label">{label}</div>')
|
| 123 |
-
parts.append(f'<div class="vp-body">{verse.translation}</div>')
|
| 124 |
-
|
| 125 |
-
if getattr(verse, "bhashya", None):
|
| 126 |
-
btr = getattr(verse, "bhashya_translator", None)
|
| 127 |
-
note = f" ({btr})" if btr else ""
|
| 128 |
-
preview = verse.bhashya[:900] + ("…" if len(verse.bhashya) > 900 else "")
|
| 129 |
-
parts.append(f'<div class="vp-label">Śaṅkara\'s Bhāṣya{note}</div>')
|
| 130 |
-
parts.append(f'<div class="vp-body vp-dim">{preview}</div>')
|
| 131 |
-
|
| 132 |
-
if ev:
|
| 133 |
-
if getattr(ev, "paraphrase", None):
|
| 134 |
-
parts.append('<div class="vp-label">Teaching</div>')
|
| 135 |
-
parts.append(f'<div class="vp-body">{ev.paraphrase}</div>')
|
| 136 |
-
if getattr(ev, "themes", None):
|
| 137 |
-
tags = "".join(f'<span class="vp-tag">{t}</span>' for t in ev.themes)
|
| 138 |
-
parts.append(f'<div class="vp-tags">{tags}</div>')
|
| 139 |
-
if getattr(ev, "practical_teaching", None):
|
| 140 |
-
parts.append('<div class="vp-label">Practical Shift</div>')
|
| 141 |
-
parts.append(f'<div class="vp-body vp-gold">{ev.practical_teaching}</div>')
|
| 142 |
-
|
| 143 |
-
return '<div class="verse-panel">' + "\n".join(parts) + "</div>"
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
# ── CSS ────────────────────────────────────────────────────────────────────────
|
| 147 |
-
|
| 148 |
-
CSS = """
|
| 149 |
-
@import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;1,400&family=EB+Garamond:ital,wght@0,400;0,500;1,400;1,500&family=Lato:ital,wght@0,300;0,400;0,700;1,300;1,400&display=swap');
|
| 150 |
-
|
| 151 |
-
/* ── palette ──────────────────────────────────────────────────────────────── */
|
| 152 |
-
:root {
|
| 153 |
-
--gold: #C9A84C;
|
| 154 |
-
--gold-dim: #9A7830;
|
| 155 |
-
--gold-glow: rgba(201,168,76,0.18);
|
| 156 |
-
--bg: #100C07;
|
| 157 |
-
--bg-mid: #1C1208;
|
| 158 |
-
--bg-card: #251808;
|
| 159 |
-
--bg-user: #3D240C;
|
| 160 |
-
--bg-bot: #180F05;
|
| 161 |
-
--border: #5A3C18;
|
| 162 |
-
--border-dim: #3A2408;
|
| 163 |
-
--text: #ECD8B4;
|
| 164 |
-
--text-dim: #A08860;
|
| 165 |
-
--text-muted: #6A5030;
|
| 166 |
-
--radius: 10px;
|
| 167 |
-
--font-serif: 'EB Garamond', Georgia, 'Times New Roman', serif;
|
| 168 |
-
--font-sans: 'Lato', system-ui, sans-serif;
|
| 169 |
-
--font-display: 'Playfair Display', Georgia, serif;
|
| 170 |
-
}
|
| 171 |
-
|
| 172 |
-
/* ── base ─────────────────────────────────────────────────────────────────── */
|
| 173 |
-
body,
|
| 174 |
-
.gradio-container,
|
| 175 |
-
.main,
|
| 176 |
-
footer {
|
| 177 |
-
background: var(--bg) !important;
|
| 178 |
-
color: var(--text) !important;
|
| 179 |
-
font-family: var(--font-sans) !important;
|
| 180 |
-
}
|
| 181 |
-
|
| 182 |
-
.gradio-container { max-width: 880px !important; margin: 0 auto !important; }
|
| 183 |
-
|
| 184 |
-
footer { display: none !important; }
|
| 185 |
-
|
| 186 |
-
/* ── header ───────────────────────────────────────────────────────────────── */
|
| 187 |
-
.app-header {
|
| 188 |
-
text-align: center;
|
| 189 |
-
padding: 2.4rem 1rem 1.6rem;
|
| 190 |
-
border-bottom: 1px solid var(--border);
|
| 191 |
-
margin-bottom: 0.5rem;
|
| 192 |
-
}
|
| 193 |
-
.app-title {
|
| 194 |
-
font-family: var(--font-display);
|
| 195 |
-
font-size: 2.6rem;
|
| 196 |
-
color: var(--gold);
|
| 197 |
-
letter-spacing: 0.05em;
|
| 198 |
-
line-height: 1.15;
|
| 199 |
-
margin: 0 0 0.5rem;
|
| 200 |
-
font-weight: 400;
|
| 201 |
-
}
|
| 202 |
-
.app-subtitle {
|
| 203 |
-
color: var(--text-muted);
|
| 204 |
-
font-size: 0.95rem;
|
| 205 |
-
font-weight: 300;
|
| 206 |
-
font-style: italic;
|
| 207 |
-
font-family: var(--font-serif);
|
| 208 |
-
letter-spacing: 0.03em;
|
| 209 |
-
}
|
| 210 |
-
.app-ornament {
|
| 211 |
-
margin-top: 1rem;
|
| 212 |
-
color: var(--gold-dim);
|
| 213 |
-
font-size: 0.85rem;
|
| 214 |
-
letter-spacing: 0.7em;
|
| 215 |
-
}
|
| 216 |
-
|
| 217 |
-
/* ── chatbot container ────────────────────────────────────────────────────── */
|
| 218 |
-
#chatbot {
|
| 219 |
-
border: 1px solid var(--border) !important;
|
| 220 |
-
border-radius: var(--radius) !important;
|
| 221 |
-
background: var(--bg) !important;
|
| 222 |
-
}
|
| 223 |
-
#chatbot .wrap { background: var(--bg) !important; }
|
| 224 |
-
|
| 225 |
-
/* user bubble */
|
| 226 |
-
#chatbot .user.message {
|
| 227 |
-
background: var(--bg-user) !important;
|
| 228 |
-
border: 1px solid var(--border) !important;
|
| 229 |
-
border-radius: var(--radius) var(--radius) 3px var(--radius) !important;
|
| 230 |
-
padding: 0.8rem 1.1rem !important;
|
| 231 |
-
box-shadow: inset 0 1px 0 rgba(255,220,140,0.07) !important;
|
| 232 |
-
}
|
| 233 |
-
/* assistant bubble */
|
| 234 |
-
#chatbot .bot.message {
|
| 235 |
-
background: var(--bg-bot) !important;
|
| 236 |
-
border: 1px solid var(--gold-dim) !important;
|
| 237 |
-
border-left: 3px solid var(--gold-dim) !important;
|
| 238 |
-
border-radius: 3px var(--radius) var(--radius) var(--radius) !important;
|
| 239 |
-
padding: 1.1rem 1.4rem 1.1rem 1.6rem !important;
|
| 240 |
-
line-height: 1.9 !important;
|
| 241 |
-
font-family: var(--font-serif) !important;
|
| 242 |
-
font-size: 1.08rem !important;
|
| 243 |
-
color: var(--text) !important;
|
| 244 |
-
}
|
| 245 |
-
/* user bubble */
|
| 246 |
-
#chatbot .user.message {
|
| 247 |
-
background: var(--bg-user) !important;
|
| 248 |
-
border: 1px solid var(--border) !important;
|
| 249 |
-
border-radius: var(--radius) var(--radius) 3px var(--radius) !important;
|
| 250 |
-
padding: 0.8rem 1.1rem !important;
|
| 251 |
-
box-shadow: inset 0 1px 0 rgba(255,220,140,0.07) !important;
|
| 252 |
-
font-family: var(--font-sans) !important;
|
| 253 |
-
font-size: 0.96rem !important;
|
| 254 |
-
}
|
| 255 |
-
/* inner content divs — transparent so bubble bg shows through */
|
| 256 |
-
#chatbot .message.panel-full-width,
#chatbot [data-testid="user"],
#chatbot [data-testid="bot"] {
    background: transparent !important;
    color: var(--text) !important;
    padding: 0 !important;
}

/* markdown inside bubbles */
#chatbot .bot.message p { margin: 0.55em 0 !important; }
#chatbot .user.message p { margin: 0.3em 0 !important; }
#chatbot .message hr { border-color: var(--border) !important; margin: 0.6em 0 !important; }
#chatbot .message code {
    background: var(--bg-card) !important;
    color: var(--gold) !important;
    padding: 0.1em 0.4em !important;
    border-radius: 4px !important;
    font-size: 0.88em !important;
    font-family: var(--font-sans) !important;
}
#chatbot .message em { color: var(--text-dim) !important; }
#chatbot .message strong { color: var(--text) !important; font-weight: 500 !important; }

/* placeholder */
#chatbot .placeholder {
    color: var(--text-muted) !important;
    font-style: italic !important;
}

/* ── stage status ─────────────────────────────────────────────────────────── */
#stage-status {
    min-height: 1.8rem;
    text-align: center;
    padding: 0.4rem 0.5rem;
    font-family: 'Lato', sans-serif;
    line-height: 1.55;
}
#stage-status .stage-spinner {
    color: var(--gold);
    font-style: italic;
    font-size: 0.88rem;
    opacity: 0.9;
}
#stage-status .stage-card {
    display: inline-block;
    text-align: left;
    max-width: 90%;
    font-size: 0.84rem;
}
#stage-status .stage-row {
    display: flex;
    align-items: baseline;
    flex-wrap: wrap;
    gap: 0.3rem 0.6rem;
    margin-bottom: 0.2rem;
}
#stage-status .stage-icon { color: var(--gold-dim); }
#stage-status .stage-label {
    color: var(--text-muted);
    font-size: 0.73rem;
    text-transform: uppercase;
    letter-spacing: 0.09em;
    min-width: 4.5rem;
}
#stage-status .stage-val { color: var(--text); font-style: italic; }
#stage-status .stage-chip {
    display: inline-block;
    border: 1px solid var(--border);
    border-radius: 3px;
    padding: 0 0.35rem;
    font-size: 0.78rem;
    color: var(--text-dim);
    font-style: normal;
    margin: 0.1rem 0.1rem 0 0;
}
#stage-status .stage-source {
    display: inline-block;
    background: var(--bg-card);
    border: 1px solid var(--gold-dim);
    border-radius: 3px;
    padding: 0 0.35rem;
    font-size: 0.78rem;
    color: var(--gold);
    margin: 0.1rem 0.1rem 0 0;
}

/* ── input textbox ────────────────────────────────────────────────────────── */
#msg-input {
    background: var(--bg-card) !important;
    border-radius: var(--radius) !important;
}
#msg-input label.show_textbox_border {
    background: var(--bg-card) !important;
    border: 1px solid var(--border) !important;
    border-radius: var(--radius) !important;
    transition: border-color 0.15s, box-shadow 0.15s !important;
}
#msg-input label.show_textbox_border:focus-within {
    border-color: var(--gold-dim) !important;
    box-shadow: 0 0 0 3px var(--gold-glow) !important;
}
#msg-input span.svelte-1hguek3 { display: none !important; } /* hide label text */
#msg-input textarea {
    background: var(--bg-card) !important;
    color: var(--text) !important;
    font-family: var(--font-serif) !important;
    font-size: 1.02rem !important;
    line-height: 1.55 !important;
    caret-color: var(--gold) !important;
    resize: none !important;
    border: none !important;
    outline: none !important;
}
#msg-input textarea::placeholder { color: var(--text-muted) !important; }

/* ── buttons ──────────────────────────────────────────────────────────────── */
#submit-btn {
    background: var(--gold-dim) !important;
    color: #0D0A07 !important;
    border: none !important;
    border-radius: var(--radius) !important;
    font-family: var(--font-sans) !important;
    font-weight: 700 !important;
    letter-spacing: 0.05em !important;
    transition: background 0.18s !important;
    height: 100% !important;
}
#submit-btn:hover { background: var(--gold) !important; cursor: pointer !important; }

#clear-btn {
    background: transparent !important;
    color: var(--text-muted) !important;
    border: 1px solid var(--border) !important;
    border-radius: var(--radius) !important;
    font-family: var(--font-sans) !important;
    transition: color 0.15s, border-color 0.15s !important;
    width: 46px !important;
    min-width: 46px !important;
    max-width: 46px !important;
    flex-shrink: 0 !important;
    padding: 0 !important;
    font-size: 1.1rem !important;
}
#clear-btn:hover { color: var(--text-dim) !important; border-color: var(--text-muted) !important; cursor: pointer !important; }

/* ── examples ─────────────────────────────────────────────────────────────── */
.examples-holder .examples-inner-text { color: var(--text-muted) !important; font-size: 0.8rem !important; }
.examples-holder table { border: none !important; }
.examples-holder table td {
    background: var(--bg-card) !important;
    border: 1px solid var(--border) !important;
    color: var(--text-dim) !important;
    border-radius: 6px !important;
    font-size: 0.86rem !important;
    transition: background 0.15s, color 0.15s !important;
    cursor: pointer !important;
}
.examples-holder table td:hover {
    background: var(--bg-mid) !important;
    color: var(--text) !important;
    border-color: var(--gold-dim) !important;
}

/* ── explorer section ─────────────────────────────────────────────────────── */
.explorer-wrap {
    border-top: 1px solid var(--border);
    margin-top: 1.5rem;
    padding-top: 1.2rem;
}
.explorer-label {
    color: var(--text-muted);
    font-size: 0.75rem;
    text-transform: uppercase;
    letter-spacing: 0.12em;
    margin-bottom: 0.6rem;
    font-family: 'Lato', sans-serif;
}

#source-dd label { color: var(--text-dim) !important; font-size: 0.82rem !important; }
#source-dd select {
    background: var(--bg-card) !important;
    border: 1px solid var(--border) !important;
    color: var(--text) !important;
    border-radius: var(--radius) !important;
}

/* ── verse panel ──────────────────────────────────────────────────────────── */
.verse-panel {
    background: var(--bg-mid);
    border: 1px solid var(--gold-dim);
    border-radius: var(--radius);
    padding: 1.6rem 2rem 1.8rem;
    margin-top: 0.8rem;
    line-height: 1.85;
    font-family: var(--font-serif);
}
.vp-header {
    display: flex;
    justify-content: space-between;
    align-items: baseline;
    flex-wrap: wrap;
    gap: 0.4rem;
    border-bottom: 1px solid var(--border);
    padding-bottom: 0.8rem;
    margin-bottom: 1.1rem;
}
.vp-ref {
    font-family: var(--font-display);
    font-size: 1.2rem;
    color: var(--gold);
    font-weight: 400;
    letter-spacing: 0.04em;
}
.vp-subtitle {
    color: var(--text-muted);
    font-size: 0.82rem;
    font-style: italic;
    font-family: var(--font-sans);
}
.vp-sanskrit {
    font-size: 1.05rem;
    color: var(--text);
    font-style: italic;
    margin-bottom: 0.2rem;
    font-family: var(--font-serif);
}
.vp-iast {
    color: var(--text-dim);
    font-size: 0.9rem;
    font-style: italic;
    margin-bottom: 1rem;
    font-family: var(--font-serif);
}
.vp-label {
    color: var(--gold-dim);
    font-size: 0.70rem;
    text-transform: uppercase;
    letter-spacing: 0.13em;
    margin-top: 1.1rem;
    margin-bottom: 0.35rem;
    font-family: var(--font-sans);
    font-weight: 700;
}
.vp-body { color: var(--text); font-size: 1rem; font-family: var(--font-serif); line-height: 1.85; }
.vp-dim { color: var(--text-dim) !important; font-style: italic; font-size: 0.93rem !important; }
.vp-gold { color: var(--gold) !important; font-style: italic; }

.vp-tags { display: flex; flex-wrap: wrap; gap: 0.35rem; margin-top: 0.8rem; }
.vp-tag {
    background: var(--bg-card);
    border: 1px solid var(--border);
    color: var(--text-muted);
    font-size: 0.73rem;
    padding: 0.12rem 0.6rem;
    border-radius: 20px;
    font-family: var(--font-sans);
}

/* ── explain button & output ──────────────────────────────────────────────── */
#explain-btn {
    background: transparent !important;
    color: var(--text-muted) !important;
    border: 1px solid var(--border-dim) !important;
    border-radius: 6px !important;
    font-family: 'Lato', sans-serif !important;
    font-size: 0.82rem !important;
    letter-spacing: 0.05em !important;
    margin-top: 0.6rem !important;
    transition: color 0.15s, border-color 0.15s, opacity 0.15s !important;
    opacity: 0.4 !important;
}
#explain-btn:not([disabled]):not(.disabled) {
    color: var(--gold-dim) !important;
    opacity: 1 !important;
}
#explain-btn:not([disabled]):not(.disabled):hover {
    color: var(--gold) !important;
    border-color: var(--gold-dim) !important;
}

.explain-panel {
    background: var(--bg-mid);
    border-left: 3px solid var(--gold-dim);
    border-radius: 0 var(--radius) var(--radius) 0;
    padding: 1.3rem 1.7rem;
    margin-top: 0.8rem;
    color: var(--text);
    font-size: 1rem;
    line-height: 1.9;
    font-style: italic;
    font-family: var(--font-serif);
}

/* ── reasoning panel ─────────────────────────────────────────── */
.reasoning-panel {
    font-family: var(--font-sans);
    font-size: 0.82rem;
    line-height: 1.65;
    color: var(--text-dim);
}
.reasoning-panel .r-section {
    margin-bottom: 1rem;
}
.reasoning-panel .r-label {
    color: var(--gold-dim);
    font-size: 0.70rem;
    text-transform: uppercase;
    letter-spacing: 0.12em;
    margin-bottom: 0.3rem;
}
.reasoning-panel .r-value {
    color: var(--text);
    font-size: 0.85rem;
}
.reasoning-panel .r-trace {
    white-space: pre-wrap;
    word-break: break-word;
    color: var(--text-dim);
    font-size: 0.80rem;
    border-left: 2px solid var(--border);
    padding-left: 0.8rem;
    margin-top: 0.3rem;
}
/* accordion styling */
#thinking-accordion > .label-wrap { color: var(--text-dim) !important; font-size: 0.82rem; }
#thinking-accordion { background: transparent !important; border: 1px solid var(--border) !important; border-radius: var(--radius) !important; margin-top: 0.5rem; }

/* ── scrollbar ────────────────────────────────────────────────────────────── */
::-webkit-scrollbar { width: 4px; height: 4px; }
::-webkit-scrollbar-track { background: var(--bg); }
::-webkit-scrollbar-thumb { background: var(--border); border-radius: 3px; }
::-webkit-scrollbar-thumb:hover { background: var(--gold-dim); }
"""


# ── streaming synthesis helpers ───────────────────────────────────────────────

_RESPONSE_MARKER = "[[ ## response ## ]]"
_SOURCES_MARKER = "[[ ## sources_cited ## ]]"
_REASONING_MARKER = "[[ ## reasoning ## ]]"


def _build_synthesis_messages(
    dspy_hist: dspy.History,
    message: str,
    felt_emotion: str,
    deeper_concern: str,
    selected_text: str,
) -> list[dict]:
    """Build the exact prompt messages DSPy would send for synthesis.

    Uses the configured ChatAdapter + the GEPA-optimized signature/demos so the
    optimized instructions are preserved while we gain streaming control.
    """
    adapter = dspy.settings.adapter
    # In DSPy 3.x ChainOfThought wraps a Predict; the extended sig (with
    # reasoning field) and GEPA-loaded demos live on .predict
    sig = _advisor.synthesize.predict.signature
    demos = getattr(_advisor.synthesize.predict, "demos", [])
    inputs = dict(
        history=dspy_hist,
        user_question=message,
        felt_emotion=felt_emotion,
        deeper_concern=deeper_concern,
        selected_passages=selected_text,
    )
    return adapter.format(sig, demos, inputs)


def _parse_sources_cited(full_text: str) -> list[str]:
    """Extract the sources_cited JSON list from the full streamed completion text."""
    if _SOURCES_MARKER not in full_text:
        return []
    raw = full_text.split(_SOURCES_MARKER, 1)[1].strip()
    raw = re.split(r"\[\[", raw)[0].strip()  # stop at the next field marker
    try:
        result = json.loads(raw)
        return result if isinstance(result, list) else []
    except Exception:
        m = re.search(r"\[.*?\]", raw, re.DOTALL)
        if m:
            try:
                return json.loads(m.group())
            except Exception:
                pass
    return []


def _parse_reasoning(full_text: str) -> str:
    """Extract the reasoning trace from the full streamed completion text."""
    if _REASONING_MARKER not in full_text or _RESPONSE_MARKER not in full_text:
        return ""
    return full_text.split(_REASONING_MARKER, 1)[1].split(_RESPONSE_MARKER, 1)[0].strip()


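# The field-marker layout these helpers consume is easiest to see on a concrete
# completion. A standalone sketch (the sample text below is fabricated, not a
# real model output; the real helpers live above):

```python
import json
import re

_RESPONSE_MARKER = "[[ ## response ## ]]"
_SOURCES_MARKER = "[[ ## sources_cited ## ]]"
_REASONING_MARKER = "[[ ## reasoning ## ]]"

# A made-up completion in the [[ ## field ## ]] layout used by DSPy's ChatAdapter.
sample = (
    "[[ ## reasoning ## ]]\nThe user fears impermanence.\n"
    "[[ ## response ## ]]\nConsider what remains when circumstances change.\n"
    "[[ ## sources_cited ## ]]\n[\"gita_2_14\", \"gita_2_16\"]\n"
    "[[ ## completed ## ]]"
)

# Reasoning sits between its own marker and the response marker.
reasoning = sample.split(_REASONING_MARKER, 1)[1].split(_RESPONSE_MARKER, 1)[0].strip()

# Sources: take the text after the marker, cut at the next "[[", parse as JSON.
raw = sample.split(_SOURCES_MARKER, 1)[1].strip()
raw = re.split(r"\[\[", raw)[0].strip()
sources = json.loads(raw)

print(reasoning)  # The user fears impermanence.
print(sources)    # ['gita_2_14', 'gita_2_16']
```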
def _spin(text: str) -> str:
    return f'<div class="stage-spinner">◌ {text}</div>'


def _stage_understand(u) -> str:
    emotion = getattr(u, "felt_emotion", "") or ""
    concern = getattr(u, "deeper_concern", "") or ""
    themes = getattr(u, "vedantic_themes", []) or []
    themes_html = "".join(f'<span class="stage-chip">{t.split("(")[0].strip()}</span>' for t in themes[:4])
    rows = []
    if emotion:
        rows.append(
            f'<div class="stage-row">'
            f'<span class="stage-label">felt</span>'
            f'<span class="stage-val">{emotion}</span>'
            f'</div>'
        )
    if concern:
        rows.append(
            f'<div class="stage-row">'
            f'<span class="stage-label">concern</span>'
            f'<span class="stage-val">{concern}</span>'
            f'</div>'
        )
    if themes_html:
        rows.append(
            f'<div class="stage-row">'
            f'<span class="stage-label">themes</span>'
            f'<span>{themes_html}</span>'
            f'</div>'
        )
    return f'<div class="stage-card">{"".join(rows)}</div>'


def _stage_plan(queries: list[str]) -> str:
    chips = "".join(f'<span class="stage-chip">"{q}"</span>' for q in queries)
    return (
        f'<div class="stage-card">'
        f'<div class="stage-row">'
        f'<span class="stage-label">searching</span>'
        f'<span>{chips}</span>'
        f'</div></div>'
    )


def _stage_retrieve(n: int) -> str:
    return (
        f'<div class="stage-card">'
        f'<div class="stage-row">'
        f'<span class="stage-label">passages</span>'
        f'<span class="stage-val">{n} found — selecting…</span>'
        f'</div></div>'
    )


def _stage_select(sources: list[str]) -> str:
    chips = "".join(f'<span class="stage-source">{s}</span>' for s in sources)
    return (
        f'<div class="stage-card">'
        f'<div class="stage-row">'
        f'<span class="stage-label">selected</span>'
        f'<span>{chips}</span>'
        f'</div>'
        f'<div class="stage-row" style="margin-top:0.15rem">'
        f'<span class="stage-label"></span>'
        f'<span class="stage-spinner" style="font-size:0.82rem">◌ composing response…</span>'
        f'</div></div>'
    )


def _build_reasoning_html(pred) -> str:
    """Render the pipeline's reasoning trace as an HTML block for the accordion."""
    emotion = getattr(pred, "felt_emotion", "") or ""
    concern = getattr(pred, "deeper_concern", "") or ""
    themes = getattr(pred, "vedantic_themes", []) or []
    queries = getattr(pred, "queries", []) or []
    reasoning = getattr(pred, "synthesis_reasoning", "") or ""
    rationale = getattr(pred, "selection_rationale", "") or ""

    def section(label: str, content: str) -> str:
        return (
            f'<div class="r-section">'
            f'<div class="r-label">{label}</div>'
            f'<div class="r-value">{content}</div>'
            f'</div>'
        )

    parts = ['<div class="reasoning-panel">']
    if emotion:
        parts.append(section("Felt emotion", emotion))
    if concern:
        parts.append(section("Deeper concern", concern))
    if themes:
        parts.append(section("Vedāntic themes", " · ".join(themes)))
    if queries:
        qs = "".join(f"<li>{q}</li>" for q in queries)
        parts.append(section("Search queries", f"<ol style='margin:0;padding-left:1.2em'>{qs}</ol>"))
    if rationale:
        parts.append(section("Passage selection", rationale))
    if reasoning:
        # Escape angle brackets so the raw trace renders as text, not HTML
        escaped = reasoning.replace("<", "&lt;").replace(">", "&gt;")
        parts.append(
            '<div class="r-section">'
            '<div class="r-label">Model reasoning trace</div>'
            f'<div class="r-trace">{escaped}</div>'
            '</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)


# ── respond (streaming generator) ─────────────────────────────────────────────

def respond(message: str, history: list):
    """Drive the 4-step pipeline manually so each step's output is shown live."""
    _no_src = gr.update(choices=[], value=None, visible=False)
    _noop = gr.update()

    def _emit(hist, stage_content, thinking_content=_noop):
        return hist, stage_content, None, _no_src, "", thinking_content

    if not message.strip():
        yield _emit(history, "")
        return

    history = history + [{"role": "user", "content": message}]
    dspy_hist = _to_dspy_history(history[:-1])

    # ── Step 1: understand ────────────────────────────────────────────────────
    yield _emit(history, _spin("understanding your question…"))
    try:
        u = _advisor.understand(history=dspy_hist, user_question=message)
    except Exception as exc:
        history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
        yield _emit(history, "")
        return

    # Show what was understood; plan is next
    yield _emit(history, _stage_understand(u))

    # ── Step 2: plan retrieval queries ────────────────────────────────────────
    try:
        p = _advisor.plan(
            surface_concern=u.surface_concern,
            deeper_concern=u.deeper_concern,
            vedantic_themes=u.vedantic_themes,
        )
    except Exception as exc:
        history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
        yield _emit(history, "")
        return

    queries = p.queries[: config.N_RETRIEVAL_QUERIES] if p.queries else [u.deeper_concern]
    yield _emit(history, _stage_plan(queries))

    # ── Step 3: retrieve (fast, local Chroma) ─────────────────────────────────
    hits = _advisor._retriever.search_many(queries, k_per=config.TOP_K_RETRIEVE)
    candidates = hits[: max(8, config.TOP_K_RETRIEVE)]
    candidates_text = format_passages_for_llm(candidates)
    candidates_as_dicts = [h.to_dict() for h in candidates]
    previously_cited = [
        src for msg in dspy_hist.messages for src in msg.get("sources_cited", [])
    ]
    yield _emit(history, _stage_retrieve(len(candidates)))

    # ── Step 4: select passages ───────────────────────────────────────────────
    try:
        s = _advisor.select(
            deeper_concern=u.deeper_concern,
            candidate_passages=candidates_text,
            previously_cited=previously_cited,
        )
    except Exception as exc:
        history = history + [{"role": "assistant", "content": f"*Error — {exc}*"}]
        yield _emit(history, "")
        return

    valid_idx = [
        i for i in (s.selected_indices or [])
        if isinstance(i, int) and 1 <= i <= len(candidates)
    ]
    if not valid_idx:
        valid_idx = list(range(1, min(4, len(candidates) + 1)))
    selected = [candidates[i - 1] for i in valid_idx]
    selected_text = format_passages_for_llm(selected)

    # Show selected sources; stream synthesis next
    selected_refs = [
        candidates_as_dicts[i - 1].get("verse_ref", f"#{i}").upper().replace("_", " ")
        for i in valid_idx
        if i - 1 < len(candidates_as_dicts)
    ]
    yield _emit(history, _stage_select(selected_refs))

    # ── Step 5: synthesize with real token streaming ──────────────────────────
    # Build partial thinking from steps 1-4 (reasoning filled in after stream)
    partial = SimpleNamespace(
        felt_emotion=u.felt_emotion, deeper_concern=u.deeper_concern,
        vedantic_themes=u.vedantic_themes, queries=queries,
        selection_rationale=s.selection_rationale, synthesis_reasoning="",
    )
    partial_thinking = _build_reasoning_html(partial)

    history = history + [{"role": "assistant", "content": ""}]
    full_text = ""
    streamed_response = ""
    in_response = False

    try:
        messages = _build_synthesis_messages(
            dspy_hist, message, u.felt_emotion, u.deeper_concern, selected_text
        )
        stream = _synthesis_client.chat.completions.create(
            model=config.HF_MODEL,
            messages=messages,
            stream=True,
            temperature=0.6,
            max_tokens=4096,
        )
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            full_text += delta

            if not in_response:
                if _RESPONSE_MARKER in full_text:
                    in_response = True
                    after = full_text.split(_RESPONSE_MARKER, 1)[1].lstrip("\n")
                    if _SOURCES_MARKER in after:
                        streamed_response = after.split(_SOURCES_MARKER, 1)[0].rstrip()
                        history[-1]["content"] = streamed_response
                        yield history, "", None, _no_src, "", partial_thinking
                        break
                    streamed_response = after
                    history[-1]["content"] = streamed_response
                    yield history, "", None, _no_src, "", partial_thinking
            else:
                if _SOURCES_MARKER in full_text:
                    after_response = full_text.split(_RESPONSE_MARKER, 1)[1]
                    streamed_response = after_response.split(_SOURCES_MARKER, 1)[0].strip()
                    history[-1]["content"] = streamed_response
                    yield history, "", None, _no_src, "", partial_thinking
                    break
                streamed_response += delta
                history[-1]["content"] = streamed_response
                yield history, "", None, _no_src, "", partial_thinking

    except Exception as exc:
        history[-1]["content"] = f"*Error during synthesis — {exc}*"
        yield history, "", None, _no_src, "", partial_thinking
        return

    # Parse sources and reasoning from the complete accumulated text
    sources_cited = _parse_sources_cited(full_text)
    reasoning = _parse_reasoning(full_text)

    final_thinking = _build_reasoning_html(SimpleNamespace(
        felt_emotion=u.felt_emotion, deeper_concern=u.deeper_concern,
        vedantic_themes=u.vedantic_themes, queries=queries,
        selection_rationale=s.selection_rationale, synthesis_reasoning=reasoning,
    ))

    if sources_cited:
        footer = "\n\n---\n**Sources:** " + " · ".join(f"`{s}`" for s in sources_cited)
        history[-1]["content"] = streamed_response + footer

    yield history, "", None, gr.update(choices=sources_cited, value=None, visible=bool(sources_cited)), "", final_thinking


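# The marker-splitting state machine inside the streaming loop is easiest to
# follow on fake data. A standalone sketch (the chunk strings are invented;
# the real loop also updates Gradio history and partial_thinking):

```python
_RESPONSE_MARKER = "[[ ## response ## ]]"
_SOURCES_MARKER = "[[ ## sources_cited ## ]]"

def stream_response(chunks):
    """Yield progressively longer response text, mirroring the loop's states:
    buffering before the response marker, streaming after it, stopping at sources."""
    full_text, streamed, in_response = "", "", False
    for delta in chunks:
        full_text += delta
        if not in_response:
            if _RESPONSE_MARKER in full_text:
                in_response = True
                after = full_text.split(_RESPONSE_MARKER, 1)[1].lstrip("\n")
                if _SOURCES_MARKER in after:
                    yield after.split(_SOURCES_MARKER, 1)[0].rstrip()
                    return
                streamed = after
                yield streamed
        else:
            if _SOURCES_MARKER in full_text:
                after = full_text.split(_RESPONSE_MARKER, 1)[1]
                yield after.split(_SOURCES_MARKER, 1)[0].strip()
                return
            streamed += delta
            yield streamed

# Note the response marker arriving split across two chunks.
chunks = ["[[ ## resp", "onse ## ]]\nBe ", "still.", "\n[[ ## sources_cited ## ]]"]
print(list(stream_response(chunks)))  # ['Be ', 'Be still.', 'Be still.']
```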
def show_verse(ref: str) -> tuple[str, str]:
    """Return (verse_html, explain_html) — clears any prior explanation."""
    if not ref:
        return "", ""
    verse = _verse_lookup.get(ref.lower().strip())
    if verse is None:
        return '<div class="verse-panel"><p style="color:var(--text-muted)">Verse not found in corpus.</p></div>', ""
    return _render_verse_html(verse), ""


def explain_verse(source_ref: str, history: list):
    """Generator: stream a contextual explanation of the selected verse."""
    if not source_ref:
        yield '<div class="explain-panel" style="color:var(--text-muted)">Select a verse first.</div>'
        return

    verse = _verse_lookup.get(source_ref.lower().strip())
    if verse is None:
        yield '<div class="explain-panel" style="color:var(--text-muted)">Verse not found in corpus.</div>'
        return

    # Build verse content string
    bits = []
    if getattr(verse, "translation", None):
        bits.append(f"Translation: {verse.translation}")
    if getattr(verse, "sanskrit", None):
        bits.append(f"Sanskrit: {verse.sanskrit}")
    if getattr(verse, "bhashya", None):
        bits.append(f"Śaṅkara's commentary: {verse.bhashya[:600]}")
    ev = verse if isinstance(verse, EnrichedVerse) else None
    if ev and getattr(ev, "paraphrase", None):
        bits.append(f"Teaching: {ev.paraphrase}")
    verse_content = "\n\n".join(bits)

    # Build conversation context from the last turn
    context = "No prior conversation."
    i = len(history) - 1
    while i >= 0:
        if history[i].get("role") == "assistant" and i > 0:
            user_msg = history[i - 1].get("content", "")
            bot_msg = history[i].get("content", "")
            if "\n\n---\n" in bot_msg:
                bot_msg = bot_msg.split("\n\n---\n")[0]
            context = f"User: {user_msg}\n\nAdvisor: {bot_msg}"
            break
        i -= 1

    # Run in a thread so we can stream
    result_box: list = [None]
    err_box: list = [None]
    done = threading.Event()

    def _run():
        try:
            explainer = dspy.ChainOfThought(_ExplainInContext)
            result_box[0] = explainer(
                verse_ref=source_ref,
                verse_content=verse_content,
                conversation_context=context,
            )
        except Exception as exc:
            err_box[0] = exc
        finally:
            done.set()

    threading.Thread(target=_run, daemon=True).start()

    yield '<div class="explain-panel" style="color:var(--text-muted);font-style:italic;">◌ drawing the thread…</div>'
    while not done.wait(timeout=0.2):
        yield '<div class="explain-panel" style="color:var(--text-muted);font-style:italic;">◌ drawing the thread…</div>'

    if err_box[0]:
        yield f'<div class="explain-panel" style="color:var(--text-muted)">Could not generate explanation: {err_box[0]}</div>'
        return

    explanation = result_box[0].explanation
    # Stream character by character
    streamed = ""
    for char in explanation:
        streamed += char
        yield f'<div class="explain-panel">{streamed}</div>'


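# explain_verse runs the blocking DSPy call on a worker thread and polls a
# threading.Event so the generator can keep yielding spinner frames. The pattern
# in isolation (function and job names here are illustrative, not from the app):

```python
import threading
import time

def run_with_spinner(blocking_fn, poll=0.05):
    """Run blocking_fn on a daemon thread; yield spinner frames until it finishes,
    then yield the result (or the error). Mirrors the result_box/err_box idiom."""
    result_box, err_box = [None], [None]
    done = threading.Event()

    def _run():
        try:
            result_box[0] = blocking_fn()
        except Exception as exc:
            err_box[0] = exc
        finally:
            done.set()

    threading.Thread(target=_run, daemon=True).start()
    # Event.wait(timeout) returns False on timeout, True once set.
    while not done.wait(timeout=poll):
        yield "working…"  # spinner frame for the UI
    if err_box[0]:
        yield f"error: {err_box[0]}"
    else:
        yield result_box[0]

def slow_job():
    time.sleep(0.12)  # stand-in for the blocking LLM call
    return "ready"

frames = list(run_with_spinner(slow_job))
print(frames[-1])  # ready
```

Lists are used as one-element boxes because the worker closure can mutate them but could not rebind plain locals of the outer function.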
# ── layout ─────────────────────────────────────────────────────────────────────

EXAMPLES = [
    "I just got laid off and feel like nothing makes sense.",
    "I'm terrified of dying. Is that irrational?",
    "I keep hurting the people I love without meaning to.",
    "I've been meditating for years but still feel empty.",
    "My ambition feels hollow but I can't stop chasing it.",
]

with gr.Blocks(title="Gītā Advisor", css=CSS) as demo:

    pred_state = gr.State(None)

    gr.HTML("""
    <div class="app-header">
        <div class="app-title">Gītā Advisor</div>
        <div class="app-subtitle">Grounded in Advaita Vedānta as taught by Śaṅkarācārya</div>
        <div class="app-ornament">✦ ✦ ✦</div>
    </div>
    """)

    chatbot = gr.Chatbot(
        height=480,
        show_label=False,
        elem_id="chatbot",
        render_markdown=True,
        placeholder=(
            '<div style="text-align:center;padding:3rem 1rem;">'
            '<span style="color:#5A3F1E;font-style:italic;font-size:0.95rem;">'
            "Speak from where you actually are.<br>"
            '<span style="font-size:0.82rem">The teacher will meet you there.</span>'
            "</span></div>"
        ),
    )

    stage_html = gr.HTML("", elem_id="stage-status")

    with gr.Row(equal_height=True):
        msg_box = gr.Textbox(
            placeholder="Speak from where you actually are…",
            show_label=False,
            lines=2,
            max_lines=6,
            elem_id="msg-input",
            scale=7,
            container=False,
        )
        submit_btn = gr.Button("Ask →", variant="primary", elem_id="submit-btn", size="lg", scale=1, min_width=110)
        clear_btn = gr.Button("✕", variant="secondary", elem_id="clear-btn", size="lg", scale=0, min_width=46)

    gr.Examples(examples=EXAMPLES, inputs=msg_box, label="Opening moves")

    with gr.Column(elem_classes=["explorer-wrap"]):
        gr.HTML('<div class="explorer-label">Explore a cited verse</div>')
        source_dd = gr.Dropdown(
            choices=[],
            value=None,
            label="Select a cited source…",
            show_label=False,
            elem_id="source-dd",
            visible=False,
            interactive=True,
        )
        verse_html = gr.HTML("")
        explain_btn = gr.Button("Explain in context →", elem_id="explain-btn", visible=True, interactive=False, size="sm")
        explain_out = gr.HTML("")

    with gr.Accordion("🧠 Model reasoning", open=False, elem_id="thinking-accordion"):
        thinking_html = gr.HTML("")

    # ── event wiring ──────────────────────────────────────────────────────────
    outputs = [chatbot, stage_html, pred_state, source_dd, msg_box, thinking_html]

    msg_box.submit(respond, [msg_box, chatbot], outputs)
    submit_btn.click(respond, [msg_box, chatbot], outputs)

    clear_btn.click(
        fn=lambda: ([], "", None, gr.update(choices=[], value=None, visible=False), "", "", "", ""),
        outputs=[chatbot, stage_html, pred_state, source_dd, msg_box, verse_html, explain_out, thinking_html],
    )

    source_dd.change(
        fn=lambda ref: (*show_verse(ref), gr.update(interactive=bool(ref))),
        inputs=source_dd,
        outputs=[verse_html, explain_out, explain_btn],
    )

    explain_btn.click(
        fn=explain_verse,
        inputs=[source_dd, chatbot],
        outputs=explain_out,
    )

demo.queue()

if __name__ == "__main__":
    demo.launch(server_port=7860)
|
artifacts/chroma/52cdeb15-0631-44ed-8618-782f1d4d27bb/data_level0.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:bbac4dba21b040d3b944b10cf70751e5b6ce7a5bf3e98b4bbafd56657c5c7ffb
 size 167600

artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/data_level0.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1471fb4961d4e00a440e224c0f9cdfe75b4dd83de2a20772e448b632da02404a
+size 167600

artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/header.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+size 100

artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/length.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+size 400

parsers/__init__.py → artifacts/chroma/767895ca-73d7-4509-b1d2-45c5e9059c2e/link_lists.bin
RENAMED
File without changes

artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/data_level0.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5dfaebf500084c49276ef2d99d82780c1268cc3bd9b4df63416632bf613a2907
+size 167600

artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/header.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+size 100

artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/length.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+size 400

artifacts/chroma/83348f65-1650-440a-b8df-05db540a1746/link_lists.bin
ADDED
File without changes

artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/data_level0.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5b6c432fff6b70a57924e98c7ed59ff1bea23574753718ba13c417e32baffd4
+size 167600

artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/header.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+size 100

artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/length.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+size 400

artifacts/chroma/9a2f23e3-afec-4b53-98af-af37c20188c1/link_lists.bin
ADDED
File without changes

artifacts/chroma/chroma.sqlite3
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:2faa37ea993ae5c07d47bc4b0f9fb17520e4839b8c3299318b36b0f5add3d181
+size 21323776
artifacts/optimized_advisor.prompts.txt
DELETED
@@ -1,87 +0,0 @@
-# Optimized prompts after GEPA
-
-## understand.predict
-### instructions
-Read the user's life situation carefully, taking into account the full
-conversation so far. If there is prior exchange, use it to understand
-follow-up messages, references like 'what you said earlier', or shifts in
-the user's emotional state across turns. Identify the felt emotion, the
-underlying spiritual concern (not just the surface complaint), and the
-Vedāntic themes that are most relevant — drawing only from concepts native
-to Advaita Vedānta.
-
-### fields
-- history: Prior turns as a list of message dicts with 'user_question' and 'response' keys. Empty history means this is the first message.
-- user_question: The user's current message; may be a question, a vent, a follow-up, or a description of a situation.
-- reasoning: ${reasoning}
-- felt_emotion: The dominant emotion the user is experiencing, named precisely (e.g. 'anticipatory grief', not just 'sad').
-- surface_concern: What the user is literally asking about, in one sentence.
-- deeper_concern: The underlying existential/spiritual concern — usually about identity, attachment, fear, dharma, or meaning — that the surface concern is a symptom of. One sentence.
-- vedantic_themes: 2-4 Advaita-Vedānta concepts most relevant to this situation. Use Sanskrit terms with brief gloss, e.g. 'adhyāsa (superimposition of self onto roles)', 'vairāgya (dispassion)', 'sākṣī (witness consciousness)'.
-
----
-
-## plan.predict
-### instructions
-Given the user's situation and identified themes, generate diverse search
-queries to find relevant passages from the Advaita corpus (Bhagavad Gītā with
-Śaṅkara bhāṣya, Upaniṣads, Brahma Sūtras, prakaraṇa texts). Each query should
-target a different angle — one query about the philosophical principle,
-one about a parallel situation in the texts, one about the practical
-teaching offered by the lineage.
-
-### fields
-- surface_concern: ${surface_concern}
-- deeper_concern: ${deeper_concern}
-- vedantic_themes: ${vedantic_themes}
-- reasoning: ${reasoning}
-- queries: 3 distinct search queries (each 5-15 words). Vary in angle: principle, parallel, practice.
-
----
-
-## select.predict
-### instructions
-From the retrieved candidate passages, select the ones that genuinely
-speak to this user's situation. Prefer primary sources (Gītā verses,
-Upaniṣadic mantras, Śaṅkara's bhāṣya) over secondary or modern commentary
-when both are available. Reject passages that are merely topically adjacent
-but don't address the actual spiritual concern. Avoid re-selecting passages
-whose source was already cited in a prior turn — prefer fresh ground.
-
-### fields
-- deeper_concern: ${deeper_concern}
-- candidate_passages: Numbered candidate passages with source attribution.
-- previously_cited: Source references already cited in earlier turns of this conversation (e.g. ['BG 2.47', 'BG 18.66']). Prefer passages not on this list so the conversation covers new ground. Empty list on the first turn.
-- reasoning: ${reasoning}
-- selected_indices: Indices (1-based) of the 2-4 most relevant passages.
-- selection_rationale: One sentence per selection explaining why that passage speaks to this concern.
-
----
-
-## synthesize.predict
-### instructions
-Compose a response that is grounded in Advaita Vedānta as taught by
-Śaṅkarācārya, empathetic to the user's felt experience, and practically
-useful for their situation. Honor the two-truths distinction: meet the user
-in vyāvahārika (transactional reality) without ever denying the
-pāramārthika (absolute) view. Cite specific verses/passages by reference,
-integrate them into prose rather than dumping quotes, and keep wit gentle —
-light around the cosmic predicament, never light about the user's pain.
-
-If history has prior turns: do not repeat citations or teachings already
-given; build on or deepen what was said; acknowledge any shift the user has
-expressed since the last turn. If the user is following up, open by briefly
-acknowledging the continuity before moving forward.
-
-### fields
-- history: Prior turns as a list of message dicts with 'user_question' and 'response' keys. Use this to avoid repetition and to build across turns.
-- user_question: ${user_question}
-- felt_emotion: ${felt_emotion}
-- deeper_concern: ${deeper_concern}
-- selected_passages: The selected passages with full source attribution.
-- reasoning: ${reasoning}
-- response: The advisor's reply to the user. 250-450 words. Open by acknowledging the felt experience. Move into the Vedāntic perspective. Cite at least one primary source (Gītā chapter:verse, Upaniṣad name + section, etc.). Close with a concrete practice or shift in perspective they can try this week. Address the user as 'you' throughout. Avoid Western therapy clichés.
-- sources_cited: Source references actually cited in the response, e.g. 'BG 2.47', 'Bṛhadāraṇyaka Up. 4.4.5', 'Vivekacūḍāmaṇi 11'.
-
----
-
chat.py
DELETED
@@ -1,392 +0,0 @@
-"""
-chat.py — interactive conversation with the advisor.
-
-By default it loads the GEPA-optimized program from artifacts/. If that file
-doesn't exist yet, it falls back to the un-optimized base prompts so you can
-sanity-check the pipeline before running optimization.
-
-Flags:
-  --debug        Show intermediate pipeline state (felt emotion, queries, etc.)
-  --thinking     Show the full synthesis reasoning trace (default: first 6 lines)
-  --no-thinking  Hide the reasoning trace entirely
-
-After each response, source references are printed with numbers.
-  show <N|ref>     Display the verse text, translation, and Śaṅkara's bhāṣya.
-  explain <N|ref>  Show the verse then stream a contextual explanation of how
-                   it applies to the current conversation.
-"""
-
-from __future__ import annotations
-import argparse
-import time
-import threading
-from typing import Optional
-
-import dspy
-from rich.console import Console
-from rich.live import Live
-from rich.markdown import Markdown
-from rich.panel import Panel
-from rich.rule import Rule
-from rich.text import Text
-
-import config
-from advisor import load_optimized
-from corpus import EnrichedVerse, Verse, read_jsonl_enriched, read_jsonl_verses
-
-
-# ── speed constants ────────────────────────────────────────────────────────────
-_THINKING_CPS = 800    # chars/sec for reasoning stream (secondary content, fast)
-_RESPONSE_CPS = 300    # chars/sec for advisor response (primary content)
-_THINKING_PREVIEW = 6  # lines shown in collapsed thinking mode
-
-
-# ── verse corpus lookup ────────────────────────────────────────────────────────
-def _load_verse_lookup() -> dict[str, Verse]:
-    """Build a case-insensitive verse_ref → Verse dict from the corpus."""
-    lookup: dict[str, Verse] = {}
-    enriched = config.DATA_DIR / "corpus_enriched.jsonl"
-    plain = config.DATA_DIR / "corpus.jsonl"
-
-    if enriched.exists():
-        loader, path = read_jsonl_enriched, enriched
-    elif plain.exists():
-        loader, path = read_jsonl_verses, plain
-    else:
-        return lookup
-
-    for verse in loader(path):
-        lookup[verse.verse_ref.lower().strip()] = verse
-    return lookup
-
-
-def _find_verse(lookup: dict, ref: str) -> Optional[Verse]:
-    return lookup.get(ref.lower().strip())
-
-
-def _resolve_ref(arg: str, sources_cited: list[str]) -> str:
-    """Turn '1' → sources_cited[0], or return arg unchanged for direct ref lookup."""
-    try:
-        n = int(arg.strip())
-        if 1 <= n <= len(sources_cited):
-            return sources_cited[n - 1]
-    except ValueError:
-        pass
-    return arg.strip()
-
-
-# ── DSPy signature for contextual explanation ─────────────────────────────────
-class _ExplainInContext(dspy.Signature):
-    """You are the Gītā Advisor continuing a conversation. The user has asked
-    you to unpack a specific verse or passage you cited. Explain what it means
-    and why it speaks precisely to their situation — go deeper than the initial
-    response did. Reference the user's words. Close with one concrete way to
-    hold or work with this text this week."""
-
-    verse_ref: str = dspy.InputField()
-    verse_content: str = dspy.InputField(
-        desc="Translation, original text (if available), and Śaṅkara's commentary."
-    )
-    conversation_context: str = dspy.InputField(
-        desc="The user's question and the advisor's response where this verse was cited."
-    )
-
-    explanation: str = dspy.OutputField(
-        desc="150-250 words. Grounded in Advaita. Do not merely restate the translation. "
-             "End with a practical suggestion for this week."
-    )
-
-
-# ── streaming helpers ─────────────────────────────────────────────────────────
-def _stream_chars(console: Console, text: str, cps: int):
-    """Write text to the terminal character by character."""
-    if not text:
-        return
-    delay = 1.0 / cps
-    for ch in text:
-        console.file.write(ch)
-        console.file.flush()
-        time.sleep(delay)
-    console.file.write("\n")
-    console.file.flush()
-
-
-def _stream_response(console: Console, text: str, cps: int = _RESPONSE_CPS):
-    """Stream the advisor response into a growing Markdown Panel via Rich Live."""
-    if not text:
-        return
-    displayed = ""
-    delay = 1.0 / cps
-    with Live(console=console, refresh_per_second=min(cps, 30)) as live:
-        for ch in text:
-            displayed += ch
-            live.update(Panel(
-                Markdown(displayed),
-                title="[bold]advisor[/bold]",
-                border_style="yellow",
-                padding=(1, 2),
-            ))
-            time.sleep(delay)
-
-
-def _show_thinking(console: Console, reasoning: str, full: bool):
-    """Stream the synthesis reasoning below a dim rule, collapsed to _THINKING_PREVIEW lines."""
-    if not reasoning:
-        return
-
-    lines = reasoning.strip().splitlines()
-    if not full and len(lines) > _THINKING_PREVIEW:
-        display = "\n".join(lines[:_THINKING_PREVIEW])
-        n_hidden = len(lines) - _THINKING_PREVIEW
-    else:
-        display = "\n".join(lines)
-        n_hidden = 0
-
-    console.print(Rule("[dim]thinking[/dim]", style="dim blue"))
-    # Write dim italic via ANSI since we're streaming to file directly
-    # (Rich markup can't be applied char-by-char; dim is cosmetic here)
-    _stream_chars(console, display, cps=_THINKING_CPS)
-
-    if n_hidden:
-        console.print(f"[dim]  ↳ {n_hidden} more lines — use --thinking to expand[/dim]")
-    console.print()
-
-
-# ── verse display helpers ─────────────────────────────────────────────────────
-def _show_verse(console: Console, verse: Verse):
-    """Render a verse with its translation, original text, and commentary."""
-    body = Text()
-
-    if verse.sanskrit:
-        body.append(verse.sanskrit + "\n", style="bold")
-    if verse.transliteration:
-        body.append(verse.transliteration + "\n", style="italic dim")
-
-    if verse.translation:
-        label = f"Translation ({verse.translator})" if verse.translator else "Translation"
-        body.append(f"\n{label}:\n", style="dim")
-        body.append(verse.translation + "\n")
-
-    if verse.bhashya:
-        translator_note = f" ({verse.bhashya_translator})" if verse.bhashya_translator else ""
-        body.append(f"\nŚaṅkara's Bhāṣya{translator_note}:\n", style="dim")
-        preview = verse.bhashya[:800] + ("…" if len(verse.bhashya) > 800 else "")
-        body.append(preview + "\n", style="dim")
-
-    ev = verse if isinstance(verse, EnrichedVerse) else None
-    if ev and ev.paraphrase:
-        body.append("\nTeaching: ", style="bold dim")
-        body.append(ev.paraphrase + "\n", style="dim")
-    if ev and ev.themes:
-        body.append("Themes: ", style="bold dim")
-        body.append(", ".join(ev.themes) + "\n", style="dim")
-    if ev and ev.practical_teaching:
-        body.append("Practical shift: ", style="bold dim")
-        body.append(ev.practical_teaching + "\n", style="dim")
-
-    section = verse.section_display or verse.section
-    subtitle = verse.work_display + (f" — {section}" if section else "")
-    console.print(Panel(
-        body,
-        title=f"[bold]{verse.verse_ref}[/bold]",
-        subtitle=f"[dim]{subtitle}[/dim]",
-        border_style="cyan",
-        padding=(1, 2),
-    ))
-
-
-def _explain_in_context(
-    console: Console,
-    verse: Verse,
-    history_messages: list[dict],
-    cps: int = _RESPONSE_CPS,
-):
-    """Call the LM to explain the verse in context of the last conversation turn."""
-    if history_messages:
-        last = history_messages[-1]
-        context = (
-            f"User: {last.get('user_question', '')}\n\n"
-            f"Advisor: {last.get('response', '')}"
-        )
-    else:
-        context = "No prior conversation."
-
-    bits = []
-    if verse.translation:
-        bits.append(f"Translation: {verse.translation}")
-    if verse.sanskrit:
-        bits.append(f"Sanskrit: {verse.sanskrit}")
-    if verse.bhashya:
-        bits.append(f"Śaṅkara's commentary: {verse.bhashya[:600]}")
-    ev = verse if isinstance(verse, EnrichedVerse) else None
-    if ev and ev.paraphrase:
-        bits.append(f"Teaching: {ev.paraphrase}")
-    verse_content = "\n\n".join(bits)
-
-    explainer = dspy.ChainOfThought(_ExplainInContext)
-    with console.status("[dim]expanding...[/dim]", spinner="dots"):
-        try:
-            result = explainer(
-                verse_ref=verse.verse_ref,
-                verse_content=verse_content,
-                conversation_context=context,
-            )
-            explanation = result.explanation
-        except Exception as exc:
-            console.print(f"[red]Could not generate explanation: {exc}[/red]")
-            return
-
-    console.print()
-    _stream_response(console, explanation, cps=cps)
-
-
-# ── main loop ─────────────────────────────────────────────────────────────────
-def main():
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--program", default=str(config.OPTIMIZED_PROGRAM_PATH))
-    ap.add_argument("--debug", action="store_true",
-                    help="Show intermediate pipeline state for each turn.")
-    ap.add_argument("--thinking", action="store_true",
-                    help="Show full synthesis reasoning trace (default: first 6 lines).")
-    ap.add_argument("--no-thinking", action="store_true", dest="no_thinking",
-                    help="Hide the reasoning trace entirely.")
-    args = ap.parse_args()
-
-    config.configure_dspy()
-    advisor = load_optimized(args.program)
-    console = Console()
-
-    with console.status("[dim]loading corpus...[/dim]", spinner="dots"):
-        verse_lookup = _load_verse_lookup()
-
-    console.print(Panel.fit(
-        "[bold]Gītā Advisor[/bold]\n\n"
-        "Speak from where you actually are.\n"
-        "After a response: [italic]show <N>[/italic] to read a cited verse · "
-        "[italic]explain <N>[/italic] for contextual breakdown.\n"
-        "Type [italic]exit[/italic] or Ctrl-D to leave.",
-        border_style="cyan",
-    ))
-
-    history = dspy.History(messages=[])
-    last_pred = None
-
-    while True:
-        try:
-            console.print()
-            console.print("[bold cyan]you:[/bold cyan] ", end="")
-            line = input().strip()
-        except (EOFError, KeyboardInterrupt):
-            console.print("\n[dim]नमस्ते।[/dim]")
-            return
-
-        if not line:
-            continue
-        if line.lower() in {"exit", "quit", ":q"}:
-            console.print("[dim]नमस्ते।[/dim]")
-            return
-
-        # ── source exploration commands ───────────────────────────────────────
-        cmd_lower = line.lower()
-        if cmd_lower.startswith(("show ", "explain ")):
-            if last_pred is None:
-                console.print("[dim]No sources yet — ask a question first.[/dim]")
-                continue
-            cmd, _, arg = line.partition(" ")
-            ref = _resolve_ref(arg, last_pred.sources_cited)
-            verse = _find_verse(verse_lookup, ref)
-            if verse is None:
-                console.print(f"[dim]'{ref}' not found in corpus.[/dim]")
-                if last_pred.sources_cited:
-                    hint = "  ".join(
-                        f"[{i+1}] {r}" for i, r in enumerate(last_pred.sources_cited)
-                    )
-                    console.print(f"[dim]Available: {hint}[/dim]")
-                continue
-            _show_verse(console, verse)
-            if cmd.lower() == "explain":
-                _explain_in_context(console, verse, history.messages)
-            continue
-
-        # ── normal question — run pipeline in background with live stage progress ──
-        pred = None
-        error = None
-        stage = ["initializing..."]
-        done = threading.Event()
-
-        def run_advisor():
-            nonlocal pred, error
-            try:
-                pred = advisor(
-                    user_question=line,
-                    history=history,
-                    _stage_cb=lambda msg: stage.__setitem__(0, msg),
-                )
-            except Exception as exc:
-                error = exc
-            finally:
-                done.set()
-
-        threading.Thread(target=run_advisor, daemon=True).start()
-
-        with Live(console=console, refresh_per_second=8) as live:
-            while not done.wait(timeout=0.12):
-                live.update(Text(f"  ◌ {stage[0]}", style="dim"))
-            live.update(Text(""))
-
-        if error:
-            console.print(f"[red]Error: {error}[/red]")
-            continue
-
-        last_pred = pred
-        history.messages.append({
-            "user_question": line,
-            "response": pred.response,
-            "sources_cited": pred.sources_cited,
-        })
-
-        # debug trace
-        if args.debug:
-            console.print(Rule("[dim]debug[/dim]", style="dim"))
-            console.print(f"[dim]felt:[/dim] {pred.felt_emotion}")
-            console.print(f"[dim]surface:[/dim] {pred.surface_concern}")
-            console.print(f"[dim]deeper:[/dim] {pred.deeper_concern}")
-            console.print(f"[dim]themes:[/dim] {', '.join(pred.vedantic_themes)}")
-            console.print(f"[dim]queries:[/dim] {pred.queries}")
-            console.print(f"[dim]selected:[/dim] {pred.selected_indices}")
-            for i in pred.selected_indices:
-                if 1 <= i <= len(pred.retrieved_passages):
-                    h = pred.retrieved_passages[i - 1]
-                    m = h["meta"]
-                    console.print(
-                        f"  [dim]→ [{m['tier']}] {m['work']}"
-                        f"{' — ' + m['section'] if m.get('section') else ''}"
-                        f" (score {h['score']:.3f})[/dim]"
-                    )
-            console.print(Rule(style="dim"))
-
-        # thinking section
-        if not args.no_thinking:
-            _show_thinking(
-                console,
-                getattr(pred, "synthesis_reasoning", ""),
-                full=args.thinking,
-            )
-
-        # stream the response
-        console.print()
-        _stream_response(console, pred.response)
|
| 379 |
-
|
| 380 |
-
# source footer with hints
|
| 381 |
-
if pred.sources_cited:
|
| 382 |
-
numbered = " ".join(
|
| 383 |
-
f"[{i+1}] {r}" for i, r in enumerate(pred.sources_cited)
|
| 384 |
-
)
|
| 385 |
-
console.print(f"\n[dim]sources: {numbered}[/dim]")
|
| 386 |
-
console.print(
|
| 387 |
-
"[dim] → show <N> to read the verse · explain <N> for contextual breakdown[/dim]"
|
| 388 |
-
)
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
if __name__ == "__main__":
|
| 392 |
-
main()
|
config.py
CHANGED
@@ -90,8 +90,9 @@ TASK_LM_KWARGS = dict(
 )
 
 # ──────────────────────────── Task LM — HuggingFace Router ──────────────────────────────
-# router.huggingface.co/v1 is OpenAI-compatible
-#
+# router.huggingface.co/v1 is OpenAI-compatible; use the "openai/" LiteLLM prefix
+# with api_base pointing at HF's router endpoint.
+# Set HF_MODEL env var to use a different model slug (must be deployed on HF).
 HF_TOKEN = os.getenv("HF_TOKEN", "")
 HF_ROUTER_BASE = os.getenv("HF_ROUTER_BASE", "https://router.huggingface.co/v1")
 HF_MODEL = os.getenv("HF_MODEL", "google/gemma-4-26B-A4B-it")
@@ -105,14 +106,43 @@ HF_LM_KWARGS = dict(
     cache=True,
 )
 
-#
+# ──────────────────────────── Task LM — OpenRouter ───────────────────────────────────────
+# LiteLLM recognises the "openrouter/" prefix natively and routes through
+# https://openrouter.ai/api/v1. Pick any model slug from openrouter.ai/models.
+#
+# Speed vs quality guide (set OPENROUTER_MODEL to override):
+#   openrouter/google/gemini-2.0-flash-001     — fastest (~3-5s); good quality
+#   openrouter/google/gemini-2.5-flash-preview — balanced (~8-12s)
+#   openrouter/anthropic/claude-3-5-haiku      — reliable structured output
+#   openrouter/google/gemma-3-27b-it           — closest to the local Gemma 4 weights
+OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY", "")
+_openrouter_model_raw = os.getenv("OPENROUTER_MODEL", "google/gemini-2.0-flash-001")
+# LiteLLM requires the "openrouter/" prefix; add it if the env var omits it.
+OPENROUTER_MODEL = (
+    _openrouter_model_raw
+    if _openrouter_model_raw.startswith("openrouter/")
+    else f"openrouter/{_openrouter_model_raw}"
+)
+
+OPENROUTER_LM_KWARGS = dict(
+    api_key=OPENROUTER_API_KEY,
+    temperature=0.6,
+    max_tokens=4096,
+    cache=True,
+)
+
+# Which backend to use: "openrouter" if that key is set (and Gemini is not),
+# "gemini" if GEMINI_API_KEY is set, else "lm_studio".
+# Force a specific one with TASK_LM_BACKEND=openrouter|gemini|lm_studio.
 def _default_task_lm_backend() -> str:
     if "TASK_LM_BACKEND" in os.environ:
         return os.environ["TASK_LM_BACKEND"]
-    if HF_TOKEN:
-        return "hf"
     if GEMINI_API_KEY:
         return "gemini"
+    if OPENROUTER_API_KEY:
+        return "openrouter"
+    if HF_TOKEN:
+        return "hf"
     return "lm_studio"
 
 TASK_LM_BACKEND: str = _default_task_lm_backend()
@@ -184,22 +214,28 @@ REFLECTION_LM_KWARGS = dict(
 def configure_dspy(backend: str | None = None) -> tuple[dspy.LM, dspy.LM]:
     """Configure DSPy for inference and return (task_lm, reflection_lm).
 
-    backend overrides TASK_LM_BACKEND when provided explicitly.
-    Accepted values: "
+    backend overrides TASK_LM_BACKEND when provided explicitly (used by chat.py
+    --backend flag). Accepted values: "gemini", "openrouter", "lm_studio".
 
-    ChatAdapter fallback to JSONAdapter is disabled
-
-
+    ChatAdapter fallback to JSONAdapter is disabled in all paths because:
+    - LM Studio rejects json_object.
+    - Gemma outputs `[[ ## field ]]` (no closing ##); the field_header_pattern
+      patch at module load time makes ChatAdapter parse these correctly.
     """
     effective_backend = backend or TASK_LM_BACKEND
-    if effective_backend == "hf":
+    if effective_backend == "gemini":
+        task_lm = dspy.LM(model=GEMINI_TASK_MODEL, **GEMINI_TASK_LM_KWARGS)
+        print(f"Task LM backend: Gemini API ({GEMINI_TASK_MODEL})")
+    elif effective_backend == "openrouter":
+        if not OPENROUTER_API_KEY:
+            raise SystemExit("OPENROUTER_API_KEY is not set. Add it to your .env file.")
+        task_lm = dspy.LM(model=OPENROUTER_MODEL, **OPENROUTER_LM_KWARGS)
+        print(f"Task LM backend: OpenRouter ({OPENROUTER_MODEL})")
+    elif effective_backend == "hf":
         if not HF_TOKEN:
-            raise SystemExit("HF_TOKEN is not set. Add it
+            raise SystemExit("HF_TOKEN is not set. Add it to your .env file.")
         task_lm = dspy.LM(model=HF_MODEL_STRING, **HF_LM_KWARGS)
         print(f"Task LM backend: HuggingFace Router ({HF_MODEL} @ {HF_ROUTER_BASE})")
-    elif effective_backend == "gemini":
-        task_lm = dspy.LM(model=GEMINI_TASK_MODEL, **GEMINI_TASK_LM_KWARGS)
-        print(f"Task LM backend: Gemini API ({GEMINI_TASK_MODEL})")
     else:
         task_lm = dspy.LM(model=TASK_MODEL_STRING, **TASK_LM_KWARGS)
         print(f"Task LM backend: LM Studio ({TASK_MODEL_STRING} @ {LM_STUDIO_BASE})")
data/.gitkeep
ADDED
File without changes

data/synthetic_questions.jsonl
DELETED

The diff for this file is too large to render. See raw diff.
dataset_generator.py
DELETED
@@ -1,332 +0,0 @@
"""
dataset_generator.py — produce ~500 unique, life-grounded questions.

The dataset is the GEPA training/validation pool. We want:
- Coverage across life domains (career, grief, identity, dharma, practice, ...)
- Variety in voice (anguished / intellectual / sarcastic / exhausted / hopeful)
- Variety in form (direct question / vent / philosophical doubt / dilemma)
- Variety in age & life-stage cues
- Some cleanly Advaita-relevant, some that *force* the advisor to find the
  Advaita angle in something mundane (this is where over-fitting to "spiritual"
  questions usually shows up)

Strategy: structured combinatorics × LM rewriting × similarity dedupe.

We construct (domain, scenario, voice, form) tuples, send them to the local LM
to write each as a real human message, then dedupe by embedding similarity.
"""

from __future__ import annotations
import argparse
import json
import random
import re
from dataclasses import dataclass, asdict
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import dspy

import config


# ──────────────────────────── Taxonomy ────────────────────────────
DOMAINS: dict[str, list[str]] = {
    "career_and_purpose": [
        "got laid off after years of dedication",
        "achieved the big career goal and feels empty",
        "stuck in a job that pays well but feels meaningless",
        "wants to leave stable career to pursue art / spiritual path",
        "watching peers succeed while their own work plateaus",
        "facing retirement and loss of identity tied to work",
        "imposter syndrome after a major promotion",
        "publicly failed in front of colleagues",
    ],
    "romantic_relationships": [
        "going through a painful breakup after long relationship",
        "marriage has gone cold and considering divorce",
        "in love with someone who doesn't love them back",
        "obsessive jealousy about a partner's past",
        "tempted to have an affair",
        "partner died and grief is overwhelming",
        "afraid of commitment despite loving partner",
        "single in their 40s and despairing about it",
    ],
    "family": [
        "parent is dying and they have unresolved conflict",
        "estranged from a sibling for years",
        "parents pressuring them about marriage / career",
        "child making destructive life choices",
        "caring for an aging parent and exhausted",
        "had a falling out with adult child",
        "mother-in-law conflict ruining marriage",
        "feels they failed as a parent",
    ],
    "friendship_and_social": [
        "best friend betrayed their trust",
        "feels invisible and lonely in their 30s",
        "friend group has drifted apart with age",
        "social anxiety preventing them from connecting",
        "outgrown their old friends spiritually",
        "discovered close friend was talking behind their back",
    ],
    "mortality_and_loss": [
        "received a serious medical diagnosis",
        "watching a loved one die slowly",
        "afraid of death after a near-miss",
        "grieving a sudden, unexpected loss",
        "watching parents age and decline",
        "lost a child",
        "lost a pet who was their closest companion",
        "approaching old age with regret about unlived life",
    ],
    "identity_and_ego": [
        "tying self-worth entirely to external validation",
        "endlessly comparing themselves to others on social media",
        "going through midlife crisis questioning everything",
        "famous and feels everyone wants something from them",
        "lost sense of who they are after big life change",
        "racial / cultural identity feels splintered between worlds",
        "transitioning gender and family rejecting them",
    ],
    "material_life": [
        "drowning in debt and shame about it",
        "wealthy and feels guilty / disconnected because of it",
        "consumed by FOMO scrolling through richer friends' lives",
        "lost their home / financial security",
        "struggling to give up consumerist habits despite knowing better",
        "tempted by a get-rich-quick scheme",
    ],
    "existential": [
        "feels life has no meaning at all",
        "deeply depressed and going through the motions",
        "constant existential dread about the world's state",
        "doubting whether God / Brahman exists",
        "sees through everything and now nothing feels real",
        "feels they were 'born wrong' for this world",
    ],
    "spiritual_practice": [
        "meditation has gone dry after years of practice",
        "got addicted to spiritual highs and now they've stopped",
        "spiritual ego — feels superior to non-practitioners",
        "had a powerful experience and can't get back to it",
        "doubts whether their guru / lineage is right for them",
        "intellectually understands non-duality but doesn't feel it",
        "afraid that liberation means losing love for family",
        "can't reconcile traditional teachings with modern life",
    ],
    "ethics_and_dharma": [
        "told a serious lie and considering whether to confess",
        "harmed someone in the past and can't forgive themselves",
        "facing a moral dilemma at work involving dishonesty",
        "tempted to retaliate against someone who wronged them",
        "torn between duty to family and personal calling",
        "did something they're deeply ashamed of",
    ],
    "health_and_body": [
        "chronic illness reshaping their entire life",
        "struggling with addiction and relapse",
        "eating disorder they can't seem to escape",
        "chronic pain making spiritual practice feel impossible",
        "hates their aging body",
        "cancer diagnosis reframing everything",
    ],
    "modernity_specific": [
        "doomscrolling and feeling worse every day",
        "AI / automation making them feel obsolete",
        "climate dread paralyzing their life decisions",
        "political division has destroyed family relationships",
        "addicted to phone / can't focus / can't read books anymore",
        "online persona feels disconnected from real self",
    ],
}

VOICES = [
    "anguished",
    "exhausted",
    "intellectual and analytical",
    "darkly sarcastic",
    "quietly hopeful",
    "numb and dissociated",
    "frustrated and angry",
    "softly resigned",
]

FORMS = [
    "direct question",
    "venting paragraph",
    "philosophical doubt",
    "practical dilemma asking what to do",
    "stream-of-consciousness",
]

AGE_CUES = [
    "early 20s",
    "late 20s",
    "early 30s",
    "late 30s",
    "40s",
    "50s",
    "60s",
    "70s",
    "(no age cue)",
]


@dataclass
class QuestionRecord:
    id: str
    question: str
    domain: str
    scenario: str
    voice: str
    form: str
    age_cue: str


# ──────────────────────────── LM-driven phrasing ────────────────────────────
class WriteUserMessage(dspy.Signature):
    """Write a single, realistic message that a person might send to a spiritual
    advisor. The message must reflect the given scenario, voice, form, and age
    cue. Do NOT include scripture references, do NOT name Vedānta concepts —
    write as a real person speaking from their actual life. Avoid generic phrases
    like 'help me find peace' or 'I want to grow spiritually'. Be specific, lived,
    grounded in detail. 2-6 sentences."""

    scenario: str = dspy.InputField()
    voice: str = dspy.InputField()
    form: str = dspy.InputField()
    age_cue: str = dspy.InputField()

    message: str = dspy.OutputField(desc="The user's message, in first person.")


def _slug(s: str) -> str:
    return re.sub(r"[^a-z0-9]+", "_", s.lower()).strip("_")[:60]


def generate_questions(target_n: int = 500, seed: int = 7, use_local: bool = False) -> list[QuestionRecord]:
    """Generate ~target_n unique questions via combinatorics + LM rewriting."""
    rng = random.Random(seed)
    if use_local:
        config.configure_dspy()
    else:
        config.configure_enrich_lm()  # gpt-4o-mini: faster and more stylistically diverse
    writer = dspy.Predict(WriteUserMessage)

    # Build the (domain, scenario, voice, form, age) plan first
    combos: list[tuple[str, str, str, str, str]] = []
    for domain, scenarios in DOMAINS.items():
        for scenario in scenarios:
            # 5 variants per scenario varying voice/form/age
            voices = rng.sample(VOICES, k=5)
            forms = [rng.choice(FORMS) for _ in range(5)]
            ages = rng.sample(AGE_CUES, k=5)
            for v, f, a in zip(voices, forms, ages):
                combos.append((domain, scenario, v, f, a))

    rng.shuffle(combos)

    # Cap to a generous over-target; we'll dedupe down to target_n
    over_target = int(target_n * 1.25)
    combos = combos[:over_target]

    records: list[QuestionRecord] = []
    for i, (domain, scenario, voice, form, age) in enumerate(tqdm(combos, desc="Generating")):
        try:
            out = writer(scenario=scenario, voice=voice, form=form, age_cue=age)
            msg = (out.message or "").strip()
            if len(msg) < 30:
                continue
            records.append(QuestionRecord(
                id=f"q_{i:04d}_{_slug(domain)}",
                question=msg,
                domain=domain,
                scenario=scenario,
                voice=voice,
                form=form,
                age_cue=age,
            ))
        except Exception as e:
            # Local LMs occasionally hiccup. Log and continue.
            print(f"[warn] generation failure on combo {i}: {e}")
            continue

    return _dedupe_by_similarity(records, target_n=target_n)


def _dedupe_by_similarity(records: list[QuestionRecord], target_n: int, threshold: float = 0.92) -> list[QuestionRecord]:
    """Embed and remove near-duplicates greedily."""
    if not records:
        return records
    print(f"Deduping {len(records)} candidates ...")
    embedder = SentenceTransformer(config.EMBED_MODEL, device=config.EMBED_DEVICE)
    embs = embedder.encode(
        [r.question for r in records],
        normalize_embeddings=True,
        show_progress_bar=True,
        batch_size=32,
    )
    keep_idx: list[int] = []
    kept_embs = []
    for i, e in enumerate(embs):
        if not kept_embs:
            keep_idx.append(i)
            kept_embs.append(e)
            continue
        sims = np.dot(np.stack(kept_embs), e)
        if float(sims.max()) < threshold:
            keep_idx.append(i)
            kept_embs.append(e)
        if len(keep_idx) >= target_n:
            break
    print(f"Kept {len(keep_idx)} after dedupe (target {target_n}).")
    return [records[i] for i in keep_idx]


def save_jsonl(records: list[QuestionRecord], path: Path):
    with path.open("w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(asdict(r), ensure_ascii=False) + "\n")
    print(f"Wrote {len(records)} questions to {path}")


def load_jsonl(path: Path = config.DATASET_PATH) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def to_dspy_examples(records: list[dict]) -> list[dspy.Example]:
    """The dataset has no gold labels — that's fine. GEPA's metric uses LLM
    judgment + retrieval grounding rather than reference answers.
    We carry the metadata as inputs-of-record so the metric can use them."""
    out = []
    for r in records:
        ex = dspy.Example(
            user_question=r["question"],
            history=dspy.History(messages=[]),
            domain=r["domain"],
            scenario=r["scenario"],
        ).with_inputs("user_question", "history")
        out.append(ex)
    return out


# ──────────────────────────── CLI ────────────────────────────
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--n", type=int, default=500)
    ap.add_argument("--seed", type=int, default=7)
    ap.add_argument("--out", type=str, default=str(config.DATASET_PATH))
    ap.add_argument("--lm", choices=["openai", "local"], default="openai",
                    help="openai = gpt-4o-mini (default, faster); local = LM Studio task LM")
    args = ap.parse_args()

    records = generate_questions(target_n=args.n, seed=args.seed, use_local=(args.lm == "local"))
    save_jsonl(records, Path(args.out))


if __name__ == "__main__":
    main()
download_sources.py
DELETED
@@ -1,195 +0,0 @@
"""
download_sources.py — fetch every enabled source from the registry.

What this does
--------------
Reads sources_registry.SOURCES, walks each enabled entry, and downloads its
files into data/raw/<source_key>/. The downloader is deliberately dumb: it
just gets the bytes onto disk. Parsing happens in a separate step (parsers/)
so a download failure on one source doesn't block ingest of the others, and
so re-parsing during prompt iteration doesn't re-hit the network.

Why HTTPS over `requests` rather than git for everything
--------------------------------------------------------
Most of our sources are individual JSON or HTML files. Cloning a whole repo
to get two files wastes bandwidth and brittle-ifies the script. For sources
that *are* whole repos (rare in our registry), prefix the URL with `git+`.

Idempotency
-----------
If a file is already present and not corrupt, we skip it. Pass --force to
re-download. This makes it safe to run repeatedly while debugging parsers.

Politeness
----------
We send a real User-Agent and rate-limit to one request per second per host.
Internet Archive and similar mirrors are gracious to projects that play nice;
they can also throttle aggressively when they aren't.
"""

from __future__ import annotations
import argparse
import shutil
import subprocess
import sys
import time
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse

import requests
from tqdm import tqdm

import config
from sources_registry import SOURCES, Source


RAW_DIR = config.DATA_DIR / "raw"
USER_AGENT = (
    "GitaAdvisor/0.2 (Advaita-Vedanta research project; "
    "contact: <add your email here>)"
)

# Per-host minimum interval in seconds
MIN_INTERVAL = 1.0


def _filename_for_url(url: str) -> str:
    """Derive a sensible local filename from a URL."""
    parsed = urlparse(url)
    name = Path(parsed.path).name or "index.html"
    # archive.org sometimes serves djvu.txt with no extension on the URL;
    # keep what's there.
    return name


def _is_git_url(url: str) -> bool:
    return url.startswith("git+")


_last_request_time: dict = defaultdict(float)


def _polite_get(url: str) -> requests.Response:
    """GET with rate limiting per host."""
    host = urlparse(url).netloc
    elapsed = time.time() - _last_request_time[host]
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request_time[host] = time.time()
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=60, stream=True)


def _download_file(url: str, dest: Path, force: bool = False) -> bool:
    """Download a single URL to dest. Returns True if a download happened
    (vs being skipped because already present)."""
    if dest.exists() and dest.stat().st_size > 0 and not force:
        return False

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")

    with _polite_get(url) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0)) or None
        with tmp.open("wb") as out, tqdm(
            total=total, unit="B", unit_scale=True, leave=False, desc=dest.name
        ) as bar:
            for chunk in r.iter_content(chunk_size=8192):
                if not chunk:
                    continue
                out.write(chunk)
                bar.update(len(chunk))

    tmp.replace(dest)
    return True


def _clone_git(url: str, dest_dir: Path, force: bool = False) -> bool:
    """Clone a git repo (URL prefixed with 'git+') into dest_dir. Returns
    True if a clone happened."""
    real_url = url[len("git+"):]
    if dest_dir.exists() and any(dest_dir.iterdir()) and not force:
        return False
    if dest_dir.exists():
        shutil.rmtree(dest_dir)
    dest_dir.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["git", "clone", "--depth=1", real_url, str(dest_dir)],
        check=True,
    )
    return True


def download_source(src: Source, force: bool = False) -> dict:
    """Download all URLs for one source. Returns a small report dict."""
    target = RAW_DIR / src.key
    report = {"key": src.key, "ok": 0, "skipped": 0, "failed": []}

    if not src.urls:
        report["failed"].append("no URLs in registry entry")
        return report

    for url in src.urls:
        if not url:
            continue
        try:
            if _is_git_url(url):
                changed = _clone_git(url, target, force=force)
|
| 139 |
-
else:
|
| 140 |
-
fname = _filename_for_url(url)
|
| 141 |
-
changed = _download_file(url, target / fname, force=force)
|
| 142 |
-
if changed:
|
| 143 |
-
report["ok"] += 1
|
| 144 |
-
else:
|
| 145 |
-
report["skipped"] += 1
|
| 146 |
-
except Exception as e:
|
| 147 |
-
report["failed"].append(f"{url}: {e}")
|
| 148 |
-
return report
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
def main():
|
| 152 |
-
ap = argparse.ArgumentParser(description="Download all enabled sources from the registry.")
|
| 153 |
-
ap.add_argument("--force", action="store_true",
|
| 154 |
-
help="Re-download even if files exist.")
|
| 155 |
-
ap.add_argument("--only", nargs="*", default=None,
|
| 156 |
-
help="Only download these source keys.")
|
| 157 |
-
args = ap.parse_args()
|
| 158 |
-
|
| 159 |
-
enabled = [s for s in SOURCES if s.enabled]
|
| 160 |
-
if args.only:
|
| 161 |
-
enabled = [s for s in enabled if s.key in set(args.only)]
|
| 162 |
-
if not enabled:
|
| 163 |
-
print("No enabled sources match. Edit sources_registry.py to enable some.")
|
| 164 |
-
sys.exit(1)
|
| 165 |
-
|
| 166 |
-
print(f"Downloading {len(enabled)} sources to {RAW_DIR}")
|
| 167 |
-
print(f"User-Agent: {USER_AGENT}")
|
| 168 |
-
print()
|
| 169 |
-
|
| 170 |
-
any_failed = False
|
| 171 |
-
for src in enabled:
|
| 172 |
-
print(f"━━━ {src.key} — {src.name}")
|
| 173 |
-
print(f" license={src.license} tier={src.tier} parser={src.parser}")
|
| 174 |
-
if src.translator:
|
| 175 |
-
year = f", {src.year}" if src.year else ""
|
| 176 |
-
print(f" translator: {src.translator}{year}")
|
| 177 |
-
|
| 178 |
-
report = download_source(src, force=args.force)
|
| 179 |
-
if report["failed"]:
|
| 180 |
-
any_failed = True
|
| 181 |
-
for f in report["failed"]:
|
| 182 |
-
print(f" [FAIL] {f}")
|
| 183 |
-
print(f" downloaded={report['ok']} cached={report['skipped']}")
|
| 184 |
-
print()
|
| 185 |
-
|
| 186 |
-
if any_failed:
|
| 187 |
-
print("Some sources failed. Re-run with the network available, or "
|
| 188 |
-
"edit the URL in sources_registry.py if a mirror has moved.")
|
| 189 |
-
sys.exit(2)
|
| 190 |
-
print("All enabled sources are now on disk under data/raw/.")
|
| 191 |
-
print("Next: python ingest_corpus.py")
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
if __name__ == "__main__":
|
| 195 |
-
main()
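The deleted `_polite_get` above enforces a per-host minimum interval by keeping a `defaultdict(float)` of last-request timestamps. A minimal, self-contained sketch of that rate-limiting pattern (the `wait_for_host` name is illustrative, not from the repo):

```python
import time
from collections import defaultdict

MIN_INTERVAL = 1.0  # seconds between requests to the same host
_last_request_time: dict[str, float] = defaultdict(float)


def wait_for_host(host: str) -> float:
    """Block until at least MIN_INTERVAL has passed since the previous
    request to `host`; returns how long we actually slept."""
    elapsed = time.time() - _last_request_time[host]
    slept = 0.0
    if elapsed < MIN_INTERVAL:
        slept = MIN_INTERVAL - elapsed
        time.sleep(slept)
    _last_request_time[host] = time.time()
    return slept
```

Because the timestamps are keyed by host, a download pass interleaving archive.org and GitHub URLs never penalizes one host for traffic to the other.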
enrich_corpus.py DELETED
@@ -1,174 +0,0 @@
-"""
-enrich_corpus.py — run the local LLM over every verse, once, with caching.
-
-The cost calculus
------------------
-For ~3,000 verses at ~30s per call on a 26B-class local model, a full pass
-takes a long evening — call it 25 hours. That's tolerable as a one-time cost,
-intolerable as a recurring one. So caching is non-negotiable. We cache by
-verse_id and the enrichment_version stamp; if you change the prompt
-substantively, bump the version in enrichment.py and the next run re-enriches.
-
-What we write
--------------
-data/corpus_enriched.jsonl — one EnrichedVerse per line, in the same order
-as data/corpus.jsonl. Failed enrichments are still written (with empty
-enrichment fields and an error stamp in enrichment_model) so the index can
-still cover them on their literal text.
-
-Concurrency
------------
-LM Studio's OpenAI-compatible server processes requests serially by default.
-We don't try to parallelize at the client; if you've configured your server
-for parallel decode, set --concurrency > 1 and DSPy will hold multiple
-in-flight calls. For modest hardware, 1 is correct.
-
-Resumability
-------------
-If the run dies halfway, just re-run. The cache at data/enrichment_cache.jsonl
-remembers per-verse what we already did, so we pick up exactly where we left
-off. No flag is needed for resume; it's the default behavior.
-"""
-
-from __future__ import annotations
-import argparse
-import json
-import os
-from dataclasses import asdict
-from pathlib import Path
-from typing import Iterable
-
-from tqdm import tqdm
-import dspy
-
-import config
-from corpus import Verse, EnrichedVerse, read_jsonl_verses, write_jsonl
-from enrichment import Enricher
-
-
-CACHE_PATH = config.DATA_DIR / "enrichment_cache.jsonl"
-ENRICHED_PATH = config.DATA_DIR / "corpus_enriched.jsonl"
-
-
-# ──────────────────────────── Cache I/O ────────────────────────────
-def _load_cache(path: Path) -> dict[str, EnrichedVerse]:
-    """Load cache as {verse_id: EnrichedVerse}. Tolerates partial writes."""
-    if not path.exists():
-        return {}
-    out: dict[str, EnrichedVerse] = {}
-    with path.open(encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            try:
-                d = json.loads(line)
-                ev = EnrichedVerse(**{k: v for k, v in d.items() if k in EnrichedVerse.__dataclass_fields__})
-                out[ev.verse_id] = ev
-            except Exception:
-                continue
-    return out
-
-
-def _append_cache(path: Path, ev: EnrichedVerse) -> None:
-    """Append a single record. We use append-mode rather than rewriting so
-    a kill -9 mid-run loses at most one line."""
-    path.parent.mkdir(parents=True, exist_ok=True)
-    with path.open("a", encoding="utf-8") as f:
-        f.write(json.dumps(asdict(ev), ensure_ascii=False) + "\n")
-
-
-# ──────────────────────────── Main loop ────────────────────────────
-def enrich_all(
-    in_path: Path,
-    out_path: Path,
-    cache_path: Path,
-    limit: int | None = None,
-    re_enrich: bool = False,
-    only_failed: bool = False,
-    use_claude: bool = True,
-) -> None:
-    if use_claude:
-        lm = config.configure_enrich_lm()
-        print(f"[enrich] LM: {lm.model} (Claude API)")
-    else:
-        config.configure_dspy()
-        print(f"[enrich] LM: {config.LOCAL_MODEL} (local LM Studio)")
-    enricher = Enricher()
-
-    cache = _load_cache(cache_path) if not re_enrich else {}
-    print(f"[enrich] cache contains {len(cache)} previously-enriched verses")
-
-    verses = list(read_jsonl_verses(in_path))
-    if limit:
-        verses = verses[:limit]
-    print(f"[enrich] enriching {len(verses)} verses from {in_path}")
-
-    enriched: list[EnrichedVerse] = []
-    pending = []
-    for v in verses:
-        cached = cache.get(v.verse_id)
-        if cached and not re_enrich:
-            if only_failed and cached.enrichment_model.startswith("FAILED"):
-                pending.append(v)
-            else:
-                enriched.append(cached)
-                continue
-        else:
-            pending.append(v)
-
-    print(f"[enrich] {len(enriched)} from cache, {len(pending)} to call LM for")
-
-    n_failed = 0
-    for v in tqdm(pending, desc="enriching"):
-        ev = enricher(verse=v)
-        _append_cache(cache_path, ev)
-        enriched.append(ev)
-        if not ev.is_enriched():
-            n_failed += 1
-
-    # Restore original verse order from in_path
-    by_id = {ev.verse_id: ev for ev in enriched}
-    ordered = [by_id[v.verse_id] for v in verses if v.verse_id in by_id]
-
-    n_written = write_jsonl(ordered, out_path)
-    print(f"[enrich] wrote {n_written} enriched verses to {out_path}")
-    if n_failed:
-        print(f"[enrich] WARNING: {n_failed} verses failed enrichment "
-              f"(empty fields, indexed only on literal text). "
-              f"Re-run with --only-failed to retry just those.")
-
-
-# ──────────────────────────── CLI ────────────────────────────
-def main():
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--in", dest="in_path",
-                    default=str(config.DATA_DIR / "corpus.jsonl"))
-    ap.add_argument("--out", default=str(ENRICHED_PATH))
-    ap.add_argument("--cache", default=str(CACHE_PATH))
-    ap.add_argument("--limit", type=int, default=None,
-                    help="Enrich only the first N verses (smoke-test).")
-    ap.add_argument("--re-enrich", action="store_true",
-                    help="Ignore cache and re-enrich everything. Use this "
-                         "when you change the enrichment prompt.")
-    ap.add_argument("--only-failed", action="store_true",
-                    help="Re-run only the verses whose previous enrichment "
-                         "failed (FAILED stamp in enrichment_model).")
-    ap.add_argument("--lm", choices=["claude", "local"], default="claude",
-                    help="Which LM to use: 'claude' (default, Sonnet 4.6 via API) "
-                         "or 'local' (LM Studio). Claude requires ANTHROPIC_API_KEY.")
-    args = ap.parse_args()
-
-    enrich_all(
-        in_path=Path(args.in_path),
-        out_path=Path(args.out),
-        cache_path=Path(args.cache),
-        limit=args.limit,
-        re_enrich=args.re_enrich,
-        only_failed=args.only_failed,
-        use_claude=(args.lm == "claude"),
-    )
-
-
-if __name__ == "__main__":
-    main()
enrichment.py DELETED
@@ -1,266 +0,0 @@
-"""
-enrichment.py — turn a Verse into an EnrichedVerse using the local LLM.
-
-This module is the heart of the redesign. Instead of hoping that vector
-similarity between a user's English question and a Sanskrit verse will find
-the right teaching, we run a one-time offline pass that asks the local LLM
-to translate each verse into the language a real person would use to seek
-help. The output gets stored alongside the verse and embedded for retrieval.
-
-What the prompt asks for, and why each field
---------------------------------------------
-We extract six fields. Each one earns its place by closing a different gap
-between scripture and a user's question:
-
-paraphrase             — what the verse teaches, in plain modern English.
-                         This is what the synthesizer reads when writing
-                         the advisor's reply, so paraphrase quality matters
-                         more than embedding quality.
-
-themes                 — Vedānta concepts engaged. Tradition-native names
-                         (karma_yoga, vairagya, sakshi, two_truths). Used
-                         for filtering and for ensuring the metric can
-                         verify Advaita-coherence.
-
-life_situations        — the predicaments where this verse helps. User-
-                         language. This is the field that does the actual
-                         bridging: a query about "facing failure" finds
-                         BG 2.47 even though those words aren't in the verse.
-
-emotions_addressed     — drawn from a fixed vocabulary so we get faceted
-                         filtering rather than free-text drift. The metric
-                         uses this to verify that retrieved verses actually
-                         address the user's felt emotion.
-
-practical_teaching     — what the verse asks the seeker to do or shift.
-                         The synthesizer uses this as the seed for its
-                         "concrete practice you can try this week" close.
-
-hypothetical_questions — five questions a real person might bring to the
-                         verse. Highest-leverage field for retrieval recall.
-
-A closed vocabulary for emotions
---------------------------------
-We constrain `emotions_addressed` to the EMOTION_VOCAB list below. If we let
-the LLM generate freely, we get drift: "sadness" / "sorrow" / "melancholy" /
-"grief-tinged blue" all become separate buckets, and faceted filtering
-becomes useless. Closed vocab keeps the index sharp.
-
-We don't constrain themes the same way because the Sanskrit conceptual
-vocabulary is open-ended and forcing the LLM into a small list would lose
-information. We just normalize for casing/spacing in post-processing.
-
-Working with a flaky local LLM
-------------------------------
-Local 26B-class models occasionally produce malformed structured output.
-This module assumes that. The enrich() function:
-  - validates output against minimum-quality checks
-  - retries up to 2 times with temperature=0
-  - on persistent failure, returns an EnrichedVerse with empty enrichment
-    fields rather than raising — so the corpus can still index on the
-    literal text + bhāṣya and the verse isn't lost
-"""
-
-from __future__ import annotations
-import re
-from dataclasses import asdict
-import dspy
-
-from corpus import Verse, EnrichedVerse
-
-
-# ──────────────────────────── Closed emotion vocabulary ────────────────────────────
-# Twenty buckets, ordered roughly from acute to diffuse. Adding entries is
-# easy; removing them risks orphaning previously-enriched records.
-EMOTION_VOCAB: tuple[str, ...] = (
-    "grief",               # acute loss
-    "anticipatory_grief",  # loss in advance
-    "fear",                # discrete fear
-    "anxiety",             # chronic, diffuse
-    "despair",             # loss of hope
-    "shame",               # self-as-bad
-    "guilt",               # action-as-bad
-    "anger",
-    "resentment",
-    "envy",
-    "jealousy",
-    "longing",
-    "loneliness",
-    "doubt",               # epistemic; not knowing
-    "disillusionment",     # the hollowness of attained goals
-    "boredom",             # the inertness of repetition
-    "restlessness",        # the inability to settle
-    "frustration",
-    "confusion",
-    "numbness",            # affect-blunted
-)
-
-
-# ──────────────────────────── DSPy signature ────────────────────────────
-class EnrichVerse(dspy.Signature):
-    """You are an Advaita-Vedānta-trained reader producing structured metadata
-    for a verse from the Bhagavad Gītā or a related scripture, so that a
-    spiritual advisor can later find this verse when a real person describes
-    a life situation in everyday language. Stay strictly within the framework
-    of Śaṅkarācārya's non-dual interpretation. Do not import dualistic notions
-    (separate creator/creature, soul-merging-into-God-as-other, etc.) and do
-    not bypass the verse's plain meaning by always retreating to the absolute.
-
-    The verse may include the Sanskrit, the English translation, and (when
-    available) Śaṅkara's commentary. Read all three. Your output is structured
-    fields, not prose. Be specific, lived, concrete. Avoid generic spiritual
-    language ('find peace', 'be in the moment'). Avoid tradition-foreign
-    therapy language ('honor your feelings'). When in doubt about a field,
-    leave it shorter rather than padded."""
-
-    # Inputs — the verse in its richest available form
-    verse_ref: str = dspy.InputField(desc="Citation form, e.g. 'BG 2.47'.")
-    sanskrit: str = dspy.InputField(desc="Devanāgarī text, may be empty.")
-    translation: str = dspy.InputField(desc="English translation of the verse.")
-    bhashya: str = dspy.InputField(desc="Śaṅkara's commentary on this verse, may be empty.")
-
-    # Outputs
-    paraphrase: str = dspy.OutputField(
-        desc="One or two sentences in plain modern English stating what the "
-             "verse teaches. Not a translation; a teaching summary. No jargon."
-    )
-    themes: list[str] = dspy.OutputField(
-        desc="2–5 Vedānta concepts the verse engages, in tradition-native "
-             "vocabulary with snake_case_keys, e.g. ['karma_yoga', 'non_attachment', "
-             "'two_truths']. Use Sanskrit terms where they're the right name."
-    )
-    life_situations: list[str] = dspy.OutputField(
-        desc="3–6 specific human predicaments this verse would help with, "
-             "in everyday English. e.g. 'facing public failure after years of "
-             "effort'. NOT 'finding peace' or 'spiritual growth'."
-    )
-    emotions_addressed: list[str] = dspy.OutputField(
-        desc="The emotions this verse meets, drawn ONLY from this fixed list: "
-             + ", ".join(EMOTION_VOCAB) + ". 1–4 entries."
-    )
-    practical_teaching: str = dspy.OutputField(
-        desc="One sentence: what the verse asks the seeker to actually do or "
-             "shift. If the verse is purely ontological, write 'pure ontology — "
-             "no direct prescription' and the field will be ignored downstream."
-    )
-    hypothetical_questions: list[str] = dspy.OutputField(
-        desc="EXACTLY 5 first-person questions a real person might write to a "
-             "spiritual advisor that this verse would speak to. Specific, "
-             "ungeneric, in the user's voice. NOT in scripture's voice. e.g. "
-             "'I worked on this for three years and it just failed publicly — "
-             "how do I keep going?'"
-    )
-
-
-# ──────────────────────────── Validators ────────────────────────────
-THEME_KEY_RX = re.compile(r"^[a-z][a-z0-9_]{2,40}$")
-
-
-def _normalize_theme(t: str) -> str:
-    t = t.strip().lower()
-    t = re.sub(r"[\s\-]+", "_", t)
-    t = re.sub(r"[^a-z0-9_]", "", t)
-    return t
-
-
-def _validate(pred) -> tuple[bool, str]:
-    """Light schema check. Returns (ok, reason_if_not_ok). Used to decide
-    whether to retry the LM call with a stricter prompt."""
-    paraphrase = (pred.paraphrase or "").strip()
-    if len(paraphrase) < 20:
-        return False, "paraphrase too short"
-
-    qs = pred.hypothetical_questions or []
-    if not isinstance(qs, list) or len(qs) < 3:
-        return False, f"need ≥3 hypothetical_questions, got {len(qs)}"
-
-    sits = pred.life_situations or []
-    if not isinstance(sits, list) or len(sits) < 2:
-        return False, f"need ≥2 life_situations, got {len(sits)}"
-
-    emos = pred.emotions_addressed or []
-    if not isinstance(emos, list) or not emos:
-        return False, "emotions_addressed empty"
-    bad = [e for e in emos if _normalize_theme(e) not in EMOTION_VOCAB]
-    if bad:
-        return False, f"emotions outside vocabulary: {bad}"
-
-    themes = pred.themes or []
-    if not isinstance(themes, list) or not themes:
-        return False, "themes empty"
-
-    return True, ""
-
-
-# ──────────────────────────── Module ────────────────────────────
-class Enricher(dspy.Module):
-    """Wraps the EnrichVerse signature with retries and post-processing.
-
-    Why ChainOfThought over Predict
-    -------------------------------
-    GEPA may eventually optimize this prompt too, and ChainOfThought gives it
-    a `reasoning` trace to inspect during reflection. The cost is one extra
-    paragraph of LM output per call, which is negligible at our scale.
-    """
-
-    def __init__(self, max_retries: int = 2):
-        super().__init__()
-        self.predict = dspy.ChainOfThought(EnrichVerse)
-        self.max_retries = max_retries
-
-    def forward(self, verse: Verse) -> EnrichedVerse:
-        attempt = 0
-        last_err = ""
-        pred = None
-
-        while attempt <= self.max_retries:
-            try:
-                pred = self.predict(
-                    verse_ref=verse.verse_ref,
-                    sanskrit=verse.sanskrit or "",
-                    translation=verse.translation or "",
-                    bhashya=verse.bhashya or "",
-                )
-                ok, reason = _validate(pred)
-                if ok:
-                    break
-                last_err = reason
-            except Exception as e:
-                last_err = f"LM error: {e}"
-            attempt += 1
-
-        # Build the EnrichedVerse from the Verse + whatever we got
-        base = asdict(verse)
-        ev = EnrichedVerse(**base)
-
-        if pred and not last_err:
-            ev.paraphrase = (pred.paraphrase or "").strip()
-            ev.practical_teaching = (pred.practical_teaching or "").strip()
-            ev.themes = [
-                _normalize_theme(t) for t in (pred.themes or [])
-                if THEME_KEY_RX.match(_normalize_theme(t))
-            ]
-            ev.life_situations = [
-                s.strip() for s in (pred.life_situations or [])
-                if s and len(s.strip()) >= 5
-            ]
-            ev.emotions_addressed = [
-                _normalize_theme(e) for e in (pred.emotions_addressed or [])
-                if _normalize_theme(e) in EMOTION_VOCAB
-            ]
-            ev.hypothetical_questions = [
-                q.strip() for q in (pred.hypothetical_questions or [])
-                if q and len(q.strip()) >= 10
-            ][:5]  # cap at 5
-
-            # Stamp the model so re-runs after a model swap can be detected
-            try:
-                lm = dspy.settings.lm
-                ev.enrichment_model = getattr(lm, "model", "") or ""
-            except Exception:
-                pass
-        else:
-            # Enrichment failed; keep the verse but mark it
-            ev.enrichment_model = f"FAILED: {last_err}"
-
-        return ev
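The `_normalize_theme` helper in the deleted module is the entire enforcement mechanism for the closed emotion vocabulary: free-text LM output is lower-cased, snake-cased, stripped of stray characters, and only then membership-checked against EMOTION_VOCAB. The same logic, runnable in isolation (vocabulary abbreviated for the demo; `filter_emotions` is an illustrative name for the list-comprehension used inline in `forward`):

```python
import re

EMOTION_VOCAB = ("grief", "anticipatory_grief", "fear", "anxiety", "doubt")


def normalize_theme(t: str) -> str:
    t = t.strip().lower()
    t = re.sub(r"[\s\-]+", "_", t)       # spaces and hyphens become underscores
    return re.sub(r"[^a-z0-9_]", "", t)  # drop everything else


def filter_emotions(raw: list[str]) -> list[str]:
    # Normalize first, then keep only entries inside the closed vocabulary,
    # so "Grief", "grief " and "GRIEF" all land in the same bucket.
    return [e for e in (normalize_theme(x) for x in raw) if e in EMOTION_VOCAB]
```

Normalizing before the membership check is the key design choice: the vocabulary stays small and canonical while the LM is free to vary casing and punctuation.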
ingest_corpus.py DELETED
@@ -1,203 +0,0 @@
-"""
-ingest_corpus.py — run the parsers and produce data/corpus.jsonl.
-
-This script lives between download_sources.py (which gets bytes onto disk)
-and enrich_corpus.py (which adds LLM-derived fields). Its specific job:
-
-1. Walk each enabled source in the registry.
-2. Dispatch to its parser, which yields Verse records.
-3. Merge records across sources by verse_ref.
-   - The Gītā parser yields verses with translation but no bhāṣya.
-   - The Sastry parser yields verses with bhāṣya but spotty translation.
-   - We want one record per verse, with both populated when possible.
-4. Write the merged stream as JSONL to data/corpus.jsonl.
-
-Why merge by verse_ref rather than verse_id
--------------------------------------------
-The Gītā parser uses work='bhagavad_gita' and the Sastry parser uses
-work='bhagavad_gita_bhashya'. Their verse_ids therefore differ (different
-work prefix), but their verse_refs match — both render as 'BG 2.47'. We
-key the merge on verse_ref since that's the reader-facing canonical citation.
-
-Conflict policy when merging
-----------------------------
-- Translation: keep whichever record has it; if both, prefer the one whose
-  source_key is in the GITA_TEXT_PRIORITY list. (We want the modern, clean
-  Sivananda over Sastry's archaic English-of-Śaṅkara-paraphrasing-the-verse.)
-- Bhāṣya: only one source produces this; conflicts shouldn't happen.
-- Sanskrit / transliteration / word_meanings: prefer gita_json; richer.
-"""
-
-from __future__ import annotations
-import argparse
-from collections import defaultdict
-from pathlib import Path
-from typing import Iterable
-
-from tqdm import tqdm
-
-import config
-from corpus import Verse, write_jsonl
-from sources_registry import enabled_sources, by_key, Source
-
-# Parsers
-from parsers import gita_json as parser_gita_json
-from parsers import sastry_archive as parser_sastry
-
-
-# When two sources both have a translation, this list decides which wins
-GITA_TEXT_PRIORITY = ("gita_json_core", "sastry_gita_bhashya")
-
-
-def _parse_source(src: Source, raw_dir: Path) -> Iterable[Verse]:
-    """Dispatch to the right parser for a registry entry.
-
-    Each parser is documented to take a directory and return an iterable of
-    Verses; this function is just a switch table.
-    """
-    if src.parser == "gita_json":
-        # The gita_json parser can take both the core dir and (optionally) a
-        # translations dir. We pass the same dir for both since the downloader
-        # puts all gita_json* files into per-source folders.
-        if src.key == "gita_json_core":
-            translations_dir = raw_dir.parent / "gita_json_translations"
-            return parser_gita_json.parse(
-                raw_dir,
-                translations_dir if translations_dir.exists() else None,
-            )
-        # The translations source is "consumed" alongside core, not parsed alone
-        return iter(())
-
-    if src.parser == "sastry_archive":
-        return parser_sastry.parse(raw_dir)
-
-    if src.parser == "wisdomlib_html":
-        # Stub for now — see parsers/wisdomlib_html.py to implement.
-        # We don't fail the whole ingest just because one parser is unimplemented.
-        print(f"[ingest] wisdomlib_html parser not implemented yet — skipping {src.key}")
-        return iter(())
-
-    if src.parser == "thibaut_sbe":
-        print(f"[ingest] thibaut_sbe parser not implemented yet — skipping {src.key}")
-        return iter(())
-
-    if src.parser == "plain_text":
-        # Reserved for user-dropped texts; future work
-        return iter(())
-
-    raise ValueError(f"Unknown parser type: {src.parser}")
-
-
-def _merge(records: list[Verse]) -> list[Verse]:
-    """Merge multiple parser outputs into one record per verse_ref.
-
-    The output preserves the order of first appearance, so the corpus.jsonl
-    file is naturally chapter-then-verse ordered.
-    """
-    by_ref: dict[str, Verse] = {}
-    order: list[str] = []
-
-    for r in records:
-        if r.verse_ref not in by_ref:
-            by_ref[r.verse_ref] = r
-            order.append(r.verse_ref)
-            continue
-
-        existing = by_ref[r.verse_ref]
-
-        # Translation: pick higher-priority source if both have one
-        new_translation = existing.translation
-        new_translator = existing.translator
-        if r.translation and (
-            not existing.translation
-            or _priority(r.source_key) < _priority(existing.source_key)
-        ):
-            new_translation = r.translation
-            new_translator = r.translator
-
-        # Bhashya: only one source typically has it, take whichever isn't blank
-        new_bhashya = existing.bhashya or r.bhashya
-        new_bhashya_tr = existing.bhashya_translator or r.bhashya_translator
|
| 121 |
-
|
| 122 |
-
# Sanskrit family of fields: prefer the existing record if it has them,
|
| 123 |
-
# else take from the new record
|
| 124 |
-
merged = Verse(
|
| 125 |
-
verse_id=existing.verse_id,
|
| 126 |
-
work=existing.work, # keep the work_display of whichever came first
|
| 127 |
-
work_display=existing.work_display,
|
| 128 |
-
verse_ref=existing.verse_ref,
|
| 129 |
-
tier=_choose_tier(existing.tier, r.tier),
|
| 130 |
-
section=existing.section or r.section,
|
| 131 |
-
section_display=existing.section_display or r.section_display,
|
| 132 |
-
translation=new_translation,
|
| 133 |
-
translator=new_translator,
|
| 134 |
-
sanskrit=existing.sanskrit or r.sanskrit,
|
| 135 |
-
transliteration=existing.transliteration or r.transliteration,
|
| 136 |
-
word_meanings=existing.word_meanings or r.word_meanings,
|
| 137 |
-
bhashya=new_bhashya,
|
| 138 |
-
bhashya_translator=new_bhashya_tr,
|
| 139 |
-
source_key=existing.source_key + "+" + r.source_key,
|
| 140 |
-
license=existing.license or r.license,
|
| 141 |
-
)
|
| 142 |
-
by_ref[r.verse_ref] = merged
|
| 143 |
-
|
| 144 |
-
return [by_ref[k] for k in order]
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
def _priority(source_key: str) -> int:
|
| 148 |
-
"""Lower is higher-priority. Sources not in the priority list rank last."""
|
| 149 |
-
for i, key in enumerate(GITA_TEXT_PRIORITY):
|
| 150 |
-
if source_key == key or source_key.startswith(key + "+") or source_key.endswith("+" + key):
|
| 151 |
-
return i
|
| 152 |
-
return 99
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
def _choose_tier(a: str, b: str) -> str:
|
| 156 |
-
"""When two records merge, the tier of the merged verse is the most
|
| 157 |
-
'authoritative' of the two: primary > shankara > supporting.
|
| 158 |
-
|
| 159 |
-
Why primary > shankara: when we have both the verse text (primary) and
|
| 160 |
-
Śaṅkara's bhāṣya on it (shankara) folded into one record, the verse
|
| 161 |
-
itself is what the citation refers to — so primary wins."""
|
| 162 |
-
rank = {"primary": 0, "shankara": 1, "supporting": 2}
|
| 163 |
-
return a if rank.get(a, 9) <= rank.get(b, 9) else b
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
# ──────────────────────────── CLI ────────────────────────────
|
| 167 |
-
def main():
|
| 168 |
-
ap = argparse.ArgumentParser()
|
| 169 |
-
ap.add_argument("--out", default=str(config.DATA_DIR / "corpus.jsonl"))
|
| 170 |
-
args = ap.parse_args()
|
| 171 |
-
|
| 172 |
-
raw_root = config.DATA_DIR / "raw"
|
| 173 |
-
if not raw_root.exists():
|
| 174 |
-
raise SystemExit("data/raw/ doesn't exist. Run download_sources.py first.")
|
| 175 |
-
|
| 176 |
-
all_records: list[Verse] = []
|
| 177 |
-
for src in enabled_sources():
|
| 178 |
-
raw_dir = raw_root / src.key
|
| 179 |
-
if not raw_dir.exists():
|
| 180 |
-
print(f"[ingest] {src.key}: no files at {raw_dir}; skipping")
|
| 181 |
-
continue
|
| 182 |
-
print(f"[ingest] parsing {src.key} via {src.parser}")
|
| 183 |
-
try:
|
| 184 |
-
n_before = len(all_records)
|
| 185 |
-
for v in _parse_source(src, raw_dir):
|
| 186 |
-
if v.has_content():
|
| 187 |
-
all_records.append(v)
|
| 188 |
-
print(f"[ingest] yielded {len(all_records) - n_before} records")
|
| 189 |
-
except Exception as e:
|
| 190 |
-
print(f"[ingest] {src.key} failed: {e}")
|
| 191 |
-
|
| 192 |
-
print(f"[ingest] merging {len(all_records)} records by verse_ref ...")
|
| 193 |
-
merged = _merge(all_records)
|
| 194 |
-
print(f"[ingest] {len(merged)} unique verses after merge")
|
| 195 |
-
|
| 196 |
-
out_path = Path(args.out)
|
| 197 |
-
n = write_jsonl(merged, out_path)
|
| 198 |
-
print(f"[ingest] wrote {n} verses to {out_path}")
|
| 199 |
-
print(f"[ingest] next: python enrich_corpus.py")
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
if __name__ == "__main__":
|
| 203 |
-
main()
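The merge precedence in the deleted `_merge`/`_priority` pair (first appearance fixes ordering; a higher-priority source wins the translation field; blanks always lose) can be sketched in isolation. This is a toy reconstruction, not the deleted module: plain dicts stand in for the `Verse` dataclass, only the translation field is merged, and the priority check is simplified to a membership test over `+`-joined keys.

```python
# Toy sketch of the deleted merge precedence (assumption: dicts stand in
# for Verse; only verse_ref / translation / source_key are modeled).
GITA_TEXT_PRIORITY = ("gita_json_core", "sastry_gita_bhashya")

def priority(source_key: str) -> int:
    # Lower index = higher priority; unknown sources rank last.
    # Simplified vs. the original, which used prefix/suffix checks.
    for i, key in enumerate(GITA_TEXT_PRIORITY):
        if key in source_key.split("+"):
            return i
    return 99

def merge(records: list[dict]) -> list[dict]:
    by_ref: dict[str, dict] = {}
    order: list[str] = []
    for r in records:
        ref = r["verse_ref"]
        if ref not in by_ref:
            by_ref[ref] = dict(r)   # first appearance fixes output order
            order.append(ref)
            continue
        ex = by_ref[ref]
        # A non-blank translation from a higher-priority source wins.
        if r["translation"] and (
            not ex["translation"]
            or priority(r["source_key"]) < priority(ex["source_key"])
        ):
            ex["translation"] = r["translation"]
        ex["source_key"] += "+" + r["source_key"]
    return [by_ref[k] for k in order]

a = {"verse_ref": "BG 2.47", "translation": "",
     "source_key": "sastry_gita_bhashya"}
b = {"verse_ref": "BG 2.47", "translation": "Your right is to action alone",
     "source_key": "gita_json_core"}
merged = merge([a, b])
```

The blank translation from the first record is overwritten, and the source keys concatenate in arrival order, just as in the deleted script.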
knowledge_base.py
CHANGED

@@ -55,7 +55,6 @@ from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Iterable
 
-import numpy as np
 import chromadb
 from chromadb.config import Settings
 from sentence_transformers import SentenceTransformer
@@ -136,47 +135,7 @@ def _client() -> chromadb.api.ClientAPI:
     )
 
 
-class _HFEmbedder:
-    """Drop-in for SentenceTransformer that calls the HF Inference API.
-
-    Used on HF Spaces to avoid the ~400 MB cold-start cost of loading the
-    local sentence-transformer. The HF Inference API handles the model
-    server-side; we only pay a network round-trip per batch.
-    """
-
-    def __init__(self, model: str, token: str) -> None:
-        from huggingface_hub import InferenceClient
-        self._client = InferenceClient(token=token)
-        self._model = model
-
-    def encode(
-        self,
-        sentences: list[str],
-        normalize_embeddings: bool = True,
-        show_progress_bar: bool = False,
-        batch_size: int = 64,
-    ) -> np.ndarray:
-        all_embs: list[list[float]] = []
-        for i in range(0, len(sentences), batch_size):
-            batch = sentences[i : i + batch_size]
-            result = self._client.feature_extraction(batch, model=self._model)
-            # API returns list[list[float]] for batch; list[float] for single
-            if isinstance(result, list) and result and isinstance(result[0], float):
-                result = [result]
-            all_embs.extend(result)
-        embs = np.array(all_embs, dtype=np.float32)
-        if normalize_embeddings:
-            norms = np.linalg.norm(embs, axis=1, keepdims=True)
-            embs = embs / np.where(norms > 0, norms, 1.0)
-        return embs
-
-
-def _embedder(force_local: bool = False) -> "SentenceTransformer | _HFEmbedder":
-    """Return an embedder. Prefers the HF Inference API when HF_TOKEN is set
-    (avoids loading a 400 MB model on Spaces). Pass force_local=True when
-    building the index locally to use the local model for batch efficiency."""
-    if config.HF_TOKEN and not force_local:
-        return _HFEmbedder(config.EMBED_MODEL, config.HF_TOKEN)
+def _embedder() -> SentenceTransformer:
     return SentenceTransformer(config.EMBED_MODEL, device=config.EMBED_DEVICE)
@@ -216,7 +175,7 @@ def build_index(corpus_path: Path | None = None) -> dict[str, int]:
     )
 
     print(f"Loading embedding model: {config.EMBED_MODEL} on {config.EMBED_DEVICE}")
-    embedder = _embedder(
+    embedder = _embedder()
 
     client = _client()
     # Drop existing collections; build_index is "rebuild from scratch"
metrics.py
DELETED

@@ -1,435 +0,0 @@
-"""
-metrics.py — the metric is the specification.
-
-GEPA optimizes whatever the metric rewards. So the metric here is not a single
-number; it's a *contract* on what an Advaita-grounded, empathetic, practically
-useful response looks like — combined with rich textual feedback the reflection
-LM uses to rewrite prompts.
-
-We combine three signals:
-  1. Rule-based checks (fast, deterministic)
-     - citation grounding (cites real retrieved sources, not hallucinated)
-     - tier preference (primary + Śaṅkara > supporting)
-     - structural hygiene (length, has actionable element, no therapy clichés)
-  2. LLM-as-judge rubric scoring
-     - Advaita coherence (non-dual, not crypto-dualist)
-     - two-truths discipline (vyāvahārika ↔ pāramārthika)
-     - empathy without dissolving into the user's frame
-     - wit calibration (light around the predicament, never the pain)
-  3. Composite score + structured feedback string
-
-The function signature matches GEPA's metric contract:
-    metric(gold, pred, trace=None, pred_name=None, pred_trace=None) -> dspy.Prediction
-
-Returning dspy.Prediction(score=float, feedback=str) is the GEPA happy path.
-"""
-
-from __future__ import annotations
-import re
-import json
-from typing import Any
-import dspy
-
-
-# ──────────────────────────── Rule-based checks ────────────────────────────
-THERAPY_CLICHES = [
-    "you got this",
-    "be kind to yourself",
-    "self-care",
-    "just remember",
-    "trust the process",
-    "everything happens for a reason",
-    "you are enough",
-    "love and light",
-    "manifesting",
-    "send positive vibes",
-    "good vibes",
-]
-
-# Loose pattern catching citations like "BG 2.47", "Gītā 18.66", "Bṛhadāraṇyaka 4.4.5",
-# "Vivekacūḍāmaṇi 11", "Kaṭha Up. 1.3.14", etc.
-CITATION_PATTERN = re.compile(
-    r"\b("
-    r"BG\s*\d+[\.:]\d+"  # BG 2.47
-    r"|G[īi]t[āa]\s*\d+[\.:]\d+"  # Gita 2.47
-    r"|[A-ZĀĪŪṚḌṬṆṢŚḤṂa-zāīūṛḍṭṇṣśḥṃ]{3,}\s*Up\.?\s*\d+(?:[\.:]\d+){0,2}"  # Kaṭha Up. 1.2.3
-    r"|Vivekac[ūu]ḍāmaṇi\s*\d+"
-    r"|Ātmabodha\s*\d+"
-    r"|Tattvabodha\s*\d+"
-    r"|Brahma\s*S[ūu]tra\s*\d+[\.:]\d+(?:[\.:]\d+)?"
-    r"|Aṣṭāvakra\s*G[īi]t[āa]\s*\d+[\.:]\d+"
-    r")\b"
-)
-
-EMPATHY_OPENERS = [
-    "what you", "you're carrying", "you are carrying", "i hear",
-    "this hurts", "this is painful", "the weight", "sitting with",
-    "what you describe", "the ache",
-]
-
-ACTIONABLE_MARKERS = [
-    "this week", "today", "try this", "begin by", "for the next",
-    "each morning", "each evening", "when you notice", "the next time",
-    "as a practice", "sit for", "spend ", "over the next",
-]
-
-NON_DUAL_MARKERS = [
-    "witness", "sākṣī", "sakshi", "non-dual", "advaita",
-    "pāramārthika", "paramarthika", "vyāvahārika", "vyavaharika",
-    "ātman", "atman", "brahman", "adhyāsa", "adhyasa", "māyā", "maya",
-    "neti neti", "tat tvam asi", "ahaṁ brahmāsmi", "aham brahmasmi",
-    "self with a capital", "the seer", "awareness itself",
-]
-
-
-def _word_count(s: str) -> int:
-    return len(s.split())
-
-
-def _has_any(text: str, needles: list[str]) -> list[str]:
-    low = text.lower()
-    return [n for n in needles if n in low]
-
-
-def _normalize_for_match(s: str) -> str:
-    return re.sub(r"\s+", " ", s.lower()).strip()
-
-
-def _citation_grounding(
-    sources_cited: list[str],
-    retrieved_passages: list[dict],
-) -> tuple[float, list[str], list[str]]:
-    """Return (grounding_score, grounded_citations, ungrounded_citations).
-
-    With the verse-indexed corpus, each retrieved passage carries an exact
-    verse_ref string ('BG 2.47', 'Muṇḍaka Up. 2.1.3', etc.). Grounding becomes
-    an exact set-membership test rather than fuzzy substring matching, which
-    is dramatically sharper feedback for GEPA's reflection step: 'BG 2.47'
-    is grounded if and only if 'BG 2.47' was in the retrieved set.
-
-    We still tolerate light formatting noise: the synthesizer might write
-    'BG 2.47', 'Bhagavad Gītā 2.47', 'Gita 2:47', etc. We canonicalize to
-    'BG <chap>.<verse>' for Gītā citations before comparing. Other works
-    are matched directly by verse_ref string with whitespace normalized.
-    """
-    if not sources_cited:
-        return 0.0, [], []
-
-    retrieved_refs = {
-        _canonicalize_ref(h.get("verse_ref") or h.get("meta", {}).get("verse_ref", ""))
-        for h in retrieved_passages
-    }
-    retrieved_refs.discard("")
-
-    grounded, ungrounded = [], []
-    for c in sources_cited:
-        canon = _canonicalize_ref(c)
-        # Try direct match first, then a "substring of any retrieved" fallback
-        # for cases where the synthesizer paraphrases the citation
-        # ('chapter 2 verse 47' vs 'BG 2.47').
-        hit = canon in retrieved_refs or any(
-            canon and (canon in r or r in canon) for r in retrieved_refs
-        )
-        (grounded if hit else ungrounded).append(c)
-
-    score = len(grounded) / max(len(sources_cited), 1)
-    return score, grounded, ungrounded
-
-
-def _canonicalize_ref(s: str) -> str:
-    """Normalize a citation string so 'BG 2.47', 'Bhagavad Gītā 2.47',
-    'Gītā 2:47' all reduce to the same canonical form 'BG 2.47'."""
-    s = re.sub(r"\s+", " ", s.strip())
-    # Gītā variants
-    m = re.match(r"^(?:BG|Bhagavad\s*G[īi]t[āa]|G[īi]t[āa])\s*(\d+)[\.:](\d+)", s, re.I)
-    if m:
-        return f"BG {int(m.group(1))}.{int(m.group(2))}"
-    # Default: lowercased, colons → dots
-    return s.lower().replace(":", ".")
-
-
-def _tier_preference(
-    sources_cited: list[str],
-    retrieved_passages: list[dict],
-    selected_indices: list[int],
-) -> tuple[float, dict]:
-    """Reward responses whose *cited* passages came from primary/Śaṅkara tiers."""
-    if not selected_indices:
-        return 0.0, {"primary": 0, "shankara": 0, "supporting": 0}
-
-    counts = {"primary": 0, "shankara": 0, "supporting": 0}
-    for idx in selected_indices:
-        if 1 <= idx <= len(retrieved_passages):
-            tier = retrieved_passages[idx - 1].get("meta", {}).get("tier", "supporting")
-            counts[tier] = counts.get(tier, 0) + 1
-
-    total = sum(counts.values()) or 1
-    preferred = counts["primary"] + counts["shankara"]
-    return preferred / total, counts
-
-
-def rule_based_score(pred: dspy.Prediction) -> tuple[float, dict]:
-    """Returns (score in [0,1], breakdown dict)."""
-    response = getattr(pred, "response", "") or ""
-    sources_cited = getattr(pred, "sources_cited", []) or []
-    retrieved = getattr(pred, "retrieved_passages", []) or []
-    selected_idx = getattr(pred, "selected_indices", []) or []
-    felt = getattr(pred, "felt_emotion", "") or ""
-
-    wc = _word_count(response)
-    length_ok = 200 <= wc <= 600
-    length_score = 1.0 if length_ok else max(0.0, 1.0 - abs(wc - 350) / 350)
-
-    citations_in_text = CITATION_PATTERN.findall(response)
-    has_citation = bool(citations_in_text) or bool(sources_cited)
-    citation_score = 1.0 if has_citation else 0.0
-
-    grounding_score, grounded, ungrounded = _citation_grounding(sources_cited, retrieved)
-
-    tier_score, tier_counts = _tier_preference(sources_cited, retrieved, selected_idx)
-
-    cliches = _has_any(response, THERAPY_CLICHES)
-    cliche_penalty = min(1.0, 0.25 * len(cliches))
-    cliche_score = 1.0 - cliche_penalty
-
-    # Empathy: opening should signal acknowledgement of feeling
-    head = response[:300].lower()
-    empathy_hits = [m for m in EMPATHY_OPENERS if m in head]
-    # Bonus if the felt_emotion content is referenced (loosely)
-    if felt:
-        for tok in felt.lower().split():
-            if len(tok) > 4 and tok in head:
-                empathy_hits.append(f"echoes:{tok}")
-                break
-    empathy_score = min(1.0, 0.4 + 0.3 * len(empathy_hits))
-
-    actionable_hits = _has_any(response, ACTIONABLE_MARKERS)
-    actionable_score = 1.0 if actionable_hits else 0.4
-
-    nondual_hits = _has_any(response, NON_DUAL_MARKERS)
-    nondual_score = min(1.0, 0.4 + 0.2 * len(nondual_hits))
-
-    # Weighted aggregate
-    components = {
-        "length": (length_score, 0.05),
-        "citation_present": (citation_score, 0.08),
-        "citation_grounding": (grounding_score, 0.18),
-        "tier_preference": (tier_score, 0.12),
-        "no_cliches": (cliche_score, 0.10),
-        "empathy_opening": (empathy_score, 0.15),
-        "actionable": (actionable_score, 0.10),
-        "nondual_register": (nondual_score, 0.22),
-    }
-    score = sum(s * w for s, w in components.values())
-
-    breakdown = {
-        "score": score,
-        "word_count": wc,
-        "components": {k: round(v[0], 3) for k, v in components.items()},
-        "citations_in_text": citations_in_text,
-        "sources_cited": sources_cited,
-        "grounded_citations": grounded,
-        "ungrounded_citations": ungrounded,
-        "tier_counts": tier_counts,
-        "therapy_cliches_found": cliches,
-        "empathy_hits": empathy_hits,
-        "actionable_hits": actionable_hits,
-        "nondual_markers_found": nondual_hits,
-    }
-    return score, breakdown
-
-
-# ──────────────────────────── LLM-judge rubric ────────────────────────────
-class JudgeAdvice(dspy.Signature):
-    """You are an examiner of Advaita-Vedānta spiritual counsel in the lineage
-    of Ādi Śaṅkarācārya. Score the advisor's response against the user's
-    question on each rubric (0.0 to 1.0) and write a short critique that an
-    optimizer can use to *improve the prompts that produced this response*.
-
-    Rubrics:
-
-    - advaita_coherence: Does the response reflect genuine non-dualism
-      (jīva-ātman-brahman identity), or does it accidentally smuggle in dualism
-      ('the soul reaches God', 'becoming one with the universe' as if they were
-      separate, etc.)? Does it avoid collapsing into nihilism ('nothing is
-      real')?
-
-    - two_truths_discipline: Does it honor the distinction between
-      vyāvahārika (transactional, where the user's pain and choices are real
-      and matter) and pāramārthika (absolute, where the witness is untouched)?
-      Failure modes: spiritual bypass (denying the pain by pointing to the
-      absolute), or pure-therapy register (forgetting the absolute exists).
-
-    - empathy_without_dissolving: Does it meet the user in their felt
-      experience without either flattening into therapy-speak OR dismissing
-      the feeling with premature transcendence?
-
-    - wit_calibration: Is there a light, dry touch around the cosmic
-      predicament (Śaṅkara himself is dry; this is consistent with the
-      tradition) WITHOUT being flippant about the user's actual pain? Both
-      'too solemn throughout' and 'making jokes about their situation' lose
-      points.
-
-    - source_integration: Are scriptural citations woven into the prose
-      (illuminating the point) rather than dumped as block quotes or used
-      as decoration? Are the references specific (Gītā 2.47, not just
-      "the Gita says")?
-
-    - practical_offering: Does the response close with something the user
-      can actually try — a question to sit with, a practice, a perspective
-      shift — rather than abstract platitudes?
-
-    - draw_from_personal_experiences: Does the response use parables and day to day life
-      stories as examples to encourage the user to relate better to the advise
-
-    The critique should be specific and prescriptive: what to keep, what to
-    cut, what's missing. Phrase it as you would to a writer revising a draft."""
-
-    user_question: str = dspy.InputField()
-    response: str = dspy.InputField()
-    sources_cited: list[str] = dspy.InputField()
-
-    advaita_coherence: float = dspy.OutputField(desc="0.0 to 1.0")
-    two_truths_discipline: float = dspy.OutputField(desc="0.0 to 1.0")
-    empathy_without_dissolving: float = dspy.OutputField(desc="0.0 to 1.0")
-    wit_calibration: float = dspy.OutputField(desc="0.0 to 1.0")
-    source_integration: float = dspy.OutputField(desc="0.0 to 1.0")
-    practical_offering: float = dspy.OutputField(desc="0.0 to 1.0")
-    draw_from_personal_experiences: float = dspy.OutputField(desc="0.0 to 1.0")
-    critique: str = dspy.OutputField(
-        desc="3-6 sentences of prescriptive feedback for revising the response."
-    )
-
-
-# Lazily-instantiated judge. Call configure_judge() to use a stronger LM (e.g. gpt-4o)
-# during GEPA optimization so the reflection LM gets high-quality signal to work from.
-_judge = None
-_judge_lm = None  # None means use the globally-configured LM (task LM)
-
-
-def configure_judge(lm) -> None:
-    """Set the LM used by judge_score. Call before GEPA to use gpt-4o instead of the task LM."""
-    global _judge_lm, _judge
-    _judge_lm = lm
-    _judge = None  # reset so next call recreates with new context
-
-
-def _get_judge():
-    global _judge
-    if _judge is None:
-        _judge = dspy.ChainOfThought(JudgeAdvice)
-    return _judge
-
-
-def judge_score(user_question: str, pred: dspy.Prediction) -> tuple[float, dict, str]:
-    judge = _get_judge()
-    try:
-        call_kwargs = dict(
-            user_question=user_question,
-            response=getattr(pred, "response", "") or "",
-            sources_cited=getattr(pred, "sources_cited", []) or [],
-        )
-        if _judge_lm is not None:
-            with dspy.context(lm=_judge_lm):
-                j = judge(**call_kwargs)
-        else:
-            j = judge(**call_kwargs)
-    except Exception as e:
-        # If the judge fails (parse error, LM hiccup), fall back gracefully.
-        return 0.5, {"judge_error": str(e)}, f"Judge failed: {e}"
-
-    rubric = {
-        "advaita_coherence": float(j.advaita_coherence or 0.0),
-        "two_truths_discipline": float(j.two_truths_discipline or 0.0),
-        "empathy_without_dissolving": float(j.empathy_without_dissolving or 0.0),
-        "wit_calibration": float(j.wit_calibration or 0.0),
-        "source_integration": float(j.source_integration or 0.0),
-        "practical_offering": float(j.practical_offering or 0.0),
-        "draw_from_personal_experiences": float(j.draw_from_personal_experiences or 0.0),
-    }
-    weights = {
-        "advaita_coherence": 0.25,
-        "two_truths_discipline": 0.20,
-        "empathy_without_dissolving": 0.20,
-        "wit_calibration": 0.10,
-        "source_integration": 0.10,
-        "practical_offering": 0.10,
-        "draw_from_personal_experiences": 0.05,
-    }
-    score = sum(rubric[k] * weights[k] for k in rubric)
-    score = max(0.0, min(1.0, score))
-    return score, rubric, j.critique or ""
-
-
-# ──────────────────────────── Composite GEPA metric ────────────────────────────
-RULE_WEIGHT = 0.45
-JUDGE_WEIGHT = 0.55
-
-
-def _format_feedback(rule_breakdown: dict, judge_rubric: dict, critique: str) -> str:
-    """Concatenate rule-based facts and judge critique into one feedback string
-    that the GEPA reflection LM can read and use to rewrite prompts."""
-    lines = ["FEEDBACK FOR PROMPT IMPROVEMENT", ""]
-
-    lines.append("Rule-based observations:")
-    comps = rule_breakdown.get("components", {})
-    for k, v in comps.items():
-        lines.append(f"  - {k}: {v}")
-    if rule_breakdown.get("therapy_cliches_found"):
-        lines.append(f"  - Therapy clichés to remove: {rule_breakdown['therapy_cliches_found']}")
-    if rule_breakdown.get("ungrounded_citations"):
-        lines.append(
-            f"  - Citations that weren't in retrieved passages (likely hallucinated): "
-            f"{rule_breakdown['ungrounded_citations']}"
-        )
-    if not rule_breakdown.get("nondual_markers_found"):
-        lines.append("  - Response lacks explicit Advaita register; consider invoking "
-                     "concepts like sākṣī, adhyāsa, the two truths, etc.")
-    if not rule_breakdown.get("actionable_hits"):
-        lines.append("  - No concrete practice or this-week shift was offered.")
-    tier_counts = rule_breakdown.get("tier_counts", {})
-    if tier_counts:
-        lines.append(f"  - Selected passage tiers: {tier_counts} "
-                     f"(prefer primary + śaṅkara when both options exist).")
-
-    lines.append("")
-    lines.append("Rubric scores from Advaita-tradition examiner:")
-    for k, v in judge_rubric.items():
-        if isinstance(v, float):
-            lines.append(f"  - {k}: {v:.2f}")
-    lines.append("")
-    lines.append("Examiner critique:")
-    lines.append(critique.strip() or "(no critique returned)")
-    return "\n".join(lines)
-
-
-def gita_metric(
-    gold: dspy.Example,
-    pred: dspy.Prediction,
-    trace: Any = None,
-    pred_name: str | None = None,
-    pred_trace: Any = None,
-) -> dspy.Prediction:
-    """The GEPA-compatible metric.
-
-    Returns dspy.Prediction(score=..., feedback=...). The feedback string is
-    what GEPA's reflection LM ingests when rewriting prompts."""
-    user_q = getattr(gold, "user_question", "") if gold else ""
-
-    rule_score, rule_breakdown = rule_based_score(pred)
-    j_score, j_rubric, critique = judge_score(user_q, pred)
-
-    composite = RULE_WEIGHT * rule_score + JUDGE_WEIGHT * j_score
-    feedback = _format_feedback(rule_breakdown, j_rubric, critique)
-
-    return dspy.Prediction(score=composite, feedback=feedback)
-
-
-def quick_eval_score(
-    gold: dspy.Example,
-    pred: dspy.Prediction,
-    trace: Any = None,
-) -> float:
-    """A pure-float metric for `dspy.Evaluate` — same composite, no feedback."""
-    out = gita_metric(gold, pred, trace=trace)
-    return float(out.score)
optimize_gepa.py
DELETED
@@ -1,200 +0,0 @@
-"""
-optimize_gepa.py — run GEPA reflective prompt evolution.
-
-GEPA (Genetic-Pareto) treats the program's prompts as an evolving population.
-At each step it:
-  1. Runs the current candidate(s) on a minibatch of training examples
-  2. Collects the (score, feedback) pairs from our metric
-  3. Asks a *reflection LM* to read the failures + feedback and propose a
-     mutated prompt
-  4. Evaluates the mutant; keeps it if it Pareto-dominates the parent on the
-     validation set
-  5. Repeats
-
-Because we wrote `gita_metric` to return rich textual feedback, the reflection
-LM has something substantive to chew on instead of just gradient signal.
-
-The dataset has no gold labels — that's deliberate. Our metric judges the
-prediction directly. This is the regime GEPA is designed for.
-
-Usage:
-    python optimize_gepa.py --auto medium
-    python optimize_gepa.py --max-metric-calls 300 --proxy-task-lm
-    python optimize_gepa.py --auto light --proxy-task-lm   # ~2-3 hrs vs 260 hrs
-
-Proxy task LM (--proxy-task-lm):
-    Runs GEPA with gpt-4o-mini as the task LM instead of Gemma 4. GEPA only
-    needs to evaluate prompt quality — it doesn't need the final inference model.
-    Optimized prompts are model-agnostic text and transfer back to Gemma 4 when
-    the saved program is loaded at inference time. ~20x speedup over Gemma thinking.
-"""
-
-from __future__ import annotations
-import argparse
-import json
-import random
-from pathlib import Path
-
-import dspy
-from dspy import GEPA
-
-import config
-from advisor import GitaAdvisor
-from dataset_generator import load_jsonl, to_dspy_examples
-import metrics as metrics_module
-from metrics import gita_metric, quick_eval_score
-
-
-def split(examples, val_frac: float, seed: int = 42):
-    rng = random.Random(seed)
-    shuffled = examples[:]
-    rng.shuffle(shuffled)
-    n_val = max(20, int(len(shuffled) * val_frac))
-    return shuffled[n_val:], shuffled[:n_val]
-
-
-def main():
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--dataset", default=str(config.DATASET_PATH))
-    ap.add_argument("--out", default=str(config.OPTIMIZED_PROGRAM_PATH))
-    ap.add_argument("--val-frac", type=float, default=0.2)
-    ap.add_argument(
-        "--auto",
-        choices=["light", "medium", "heavy"],
-        default="medium",
-        help="GEPA's auto-budget mode. 'light' for smoke-tests, 'medium' for "
-             "a real run, 'heavy' for an overnight run on a meaty box.",
-    )
-    ap.add_argument(
-        "--max-metric-calls",
-        type=int,
-        default=None,
-        help="Override --auto with an explicit metric-call budget.",
-    )
-    ap.add_argument("--track-stats", action="store_true", default=True)
-    ap.add_argument("--seed", type=int, default=42)
-    ap.add_argument(
-        "--proxy-task-lm",
-        action="store_true",
-        default=False,
-        help="Use gpt-4o-mini as the task LM during GEPA instead of Gemma 4. "
-             "~20x faster; optimized prompts transfer back to Gemma 4 at inference. "
-             "Requires OPENAI_API_KEY.",
-    )
-    args = ap.parse_args()
-
-    # Configure DSPy globally and grab the reflection LM
-    task_lm, reflection_lm = config.configure_dspy()
-
-    if args.proxy_task_lm:
-        # Override the task LM with gpt-4o-mini for the duration of this process.
-        # DSPy saves only prompt text (instructions + field descriptions), not the
-        # LM choice — so the optimized JSON loads cleanly onto Gemma 4 at inference.
-        task_lm = dspy.LM(model=config.PROXY_TASK_MODEL, **config.PROXY_TASK_LM_KWARGS)
-        dspy.configure(lm=task_lm, adapter=dspy.ChatAdapter(use_json_adapter_fallback=False))
-        print(f"Task LM (proxy): {task_lm.model} [GEPA optimization only]")
-    else:
-        print(f"Task LM: {task_lm.model}")
-    print(f"Reflection LM: {reflection_lm.model}")
-
-    # Use the reflection LM (gpt-4o) for judging instead of the task LM (Gemma).
-    # Gemma judging its own responses produces noisy, self-congratulatory scores;
-    # gpt-4o gives the reflection step the crisp, tradition-aware feedback it needs.
-    metrics_module.configure_judge(reflection_lm)
-    print(f"Judge LM: {reflection_lm.model} (overriding task LM for judging)")
-
-    # Dataset
-    raw = load_jsonl(Path(args.dataset))
-    examples = to_dspy_examples(raw)
-    if len(examples) < 40:
-        print(f"[warn] Only {len(examples)} examples — generate more with "
-              f"`python dataset_generator.py --n 500`.")
-    train, val = split(examples, args.val_frac, seed=args.seed)
-    print(f"Train: {len(train)}  Val: {len(val)}")
-
-    # Student program
-    student = GitaAdvisor()
-
-    # More threads when hitting an API (no local GPU bottleneck).
-    num_threads = 16 if args.proxy_task_lm or config.TASK_LM_BACKEND == "gemini" else 4
-
-    # Optional: get a baseline number for context
-    print("\nEvaluating baseline (un-optimized) on validation set ...")
-    evaluator = dspy.Evaluate(
-        devset=val,
-        metric=quick_eval_score,
-        num_threads=num_threads,
-        display_progress=True,
-        display_table=0,
-    )
-    try:
-        baseline_result = evaluator(student)
-        baseline_score = float(baseline_result) if hasattr(baseline_result, "__float__") else baseline_result
-        print(f"Baseline score: {baseline_score}")
-    except Exception as e:
-        print(f"Baseline eval failed (continuing to optimization): {e}")
-
-    # GEPA
-    log_dir = str(config.ARTIFACTS_DIR / "gepa_logs")
-    gepa_kwargs = dict(
-        metric=gita_metric,
-        reflection_lm=reflection_lm,
-        track_stats=args.track_stats,
-        seed=args.seed,
-        # Show 6 training examples to the reflection LM per proposal step instead of
-        # the default 3 — our 12 domains need diversity to avoid domain-specific over-fit.
-        reflection_minibatch_size=6,
-        # API-backed runs (proxy or Gemini) can saturate many threads; local GPU is
-        # limited to 4 to avoid OOM / serialization on a single device.
-        num_threads=num_threads,
-        # When the task LM mangles a list field the reflection LM should know the format
-        # broke, not just see a low score with no explanation.
-        add_format_failure_as_feedback=True,
-        # Persist per-step scores and prompts for post-run inspection.
-        log_dir=log_dir,
-    )
-    if args.max_metric_calls is not None:
-        gepa_kwargs["max_metric_calls"] = args.max_metric_calls
-    else:
-        gepa_kwargs["auto"] = args.auto
-
-    print(f"\nStarting GEPA with {gepa_kwargs} ...")
-    optimizer = GEPA(**gepa_kwargs)
-
-    optimized = optimizer.compile(
-        student=student,
-        trainset=train,
-        valset=val,
-    )
-
-    # Save
-    out_path = Path(args.out)
-    out_path.parent.mkdir(parents=True, exist_ok=True)
-    optimized.save(str(out_path))
-    print(f"\nSaved optimized program to {out_path}")
-
-    # Side-by-side eval
-    print("\nFinal eval on validation set ...")
-    final_result = evaluator(optimized)
-    final_score = float(final_result) if hasattr(final_result, "__float__") else final_result
-    print(f"Optimized score: {final_score}")
-
-    # Dump the optimized prompts for human inspection
-    inspect_path = out_path.with_suffix(".prompts.txt")
-    with inspect_path.open("w", encoding="utf-8") as f:
-        f.write("# Optimized prompts after GEPA\n\n")
-        for name, predictor in optimized.named_predictors():
-            sig = predictor.signature
-            f.write(f"## {name}\n")
-            f.write(f"### instructions\n{sig.instructions}\n\n")
-            f.write("### fields\n")
-            for fname, field in sig.fields.items():
-                desc = getattr(field.json_schema_extra, "get", lambda *_: "")("desc", "") \
-                    if hasattr(field, "json_schema_extra") else ""
-                f.write(f"- {fname}: {desc}\n")
-            f.write("\n---\n\n")
-    print(f"Wrote prompt inspection file to {inspect_path}")
-
-
-if __name__ == "__main__":
-    main()
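The deleted script's `split` helper shuffles deterministically and always reserves at least 20 validation examples, whatever `val_frac` says. A standalone sketch of that same logic, runnable without the rest of the pipeline:

```python
import random


def split(examples: list, val_frac: float, seed: int = 42) -> tuple[list, list]:
    """Deterministic train/val split: shuffle with a seeded RNG, then carve
    off max(20, len * val_frac) examples for validation."""
    rng = random.Random(seed)      # seeded RNG -> reproducible split
    shuffled = examples[:]         # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = max(20, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


train, val = split(list(range(100)), val_frac=0.2, seed=42)
# With 100 examples and val_frac=0.2, the floor and the fraction agree:
# 20 validation examples, 80 training examples.
```

The `max(20, ...)` floor matters for small datasets: GEPA's Pareto comparison on the validation set is noisy when the set is tiny, so the helper trades training data for a stable validation signal.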
parsers/gita_json.py
DELETED
@@ -1,236 +0,0 @@
-"""
-parsers/gita_json.py — turn the gita/gita verse-indexed JSON into Verse records.
-
-The gita/gita repo (Unlicense, public-domain dedication) gives us four files
-on the static mirror:
-
-    chapters.json     — chapter metadata (number, name, summary)
-    verse.json        — per-verse Sanskrit + transliteration + word_meanings
-    translation.json  — per-verse English translations keyed by author_id
-    authors.json      — author metadata for the translations
-
-Why split parsing across multiple sources_registry entries
-----------------------------------------------------------
-We register `gita_json_core` (the verse text) and `gita_json_translations`
-(the English translations) as separate sources. Both happen to feed this one
-parser. The reason for the split is that translations come and go from the
-upstream repo whereas the core verse data is essentially fixed; isolating
-them lets us pin only what we need.
-
-Translator allowlist
---------------------
-Not every translator in the gita/gita translations.json is public-domain.
-We hard-allowlist the ones we know are safe to redistribute. Anyone not on
-the list is silently skipped — adding more is a one-line change.
-"""
-
-from __future__ import annotations
-import json
-from pathlib import Path
-from typing import Iterable
-
-from corpus import Verse
-
-
-# ──────────────────────────── Translator allowlist ────────────────────────────
-# The keys are the author_id values used inside translation.json. The values
-# are display strings + the year we want to use for attribution.
-#
-# Why this list and not just "all translations":
-#   - Some translators in the upstream repo (e.g. ISKCON Prabhupada) have
-#     active publisher rights that we shouldn't rely on regardless of how the
-#     upstream chose to license its compilation.
-#   - Reducing translation count keeps the index lean. Three voices are plenty.
-#
-# If you want to add a translator, verify their public-domain status (death
-# year + 70 in most jurisdictions, or pre-1929 publication for US PD), then
-# add a row.
-ALLOWED_TRANSLATORS: dict[str, tuple[str, int | None]] = {
-    # Swami Sivananda — d. 1963 — works are widely shared by The Divine Life
-    # Society in keeping with their founder's non-commercial stance.
-    "sivananda": ("Swami Sivananda", 1969),
-
-    # Swami Tejomayananda — modern; included only because some mirrors
-    # release these under permissive terms; double-check before relying on it.
-    # Disabled by default to be conservative.
-    # "tejomayananda": ("Swami Tejomayananda", 1995),
-
-    # Dr. S. Sankaranarayan — translation of Śaṅkara's Gītā Bhāṣya included
-    # in some forks of gita/gita; verify the specific edition. Off by default.
-    # "shankara": ("Śaṅkara (tr. Sankaranarayan)", 1990),
-
-    # The verse text itself is not a "translation" per se but a copy of the
-    # critical text plus transliteration. We include it under the synthetic
-    # author key 'sanskrit'.
-    "sanskrit": ("Sanskrit text + IAST", None),
-}
-
-
-# ──────────────────────────── Helpers ────────────────────────────
-def _verse_id(chapter: int, verse_no: int) -> str:
-    """Stable global key. Format: bhagavad_gita_<chap>_<verse>, zero-padded
-    to two digits so 1.10 sorts after 1.9 and lexical ordering matches numeric."""
-    return f"bhagavad_gita_{chapter:02d}_{verse_no:02d}"
-
-
-def _verse_ref(chapter: int, verse_no: int) -> str:
-    """Citation form used by the advisor in its replies."""
-    return f"BG {chapter}.{verse_no}"
-
-
-def _section_display(chapter_meta: dict) -> str:
-    name = chapter_meta.get("name_translation") or chapter_meta.get("name", "")
-    return f"Chapter {chapter_meta.get('chapter_number', '?')}: {name}"
-
-
-# ──────────────────────────── Parser entry point ────────────────────────────
-def parse(raw_dir_for_core: Path, raw_dir_for_translations: Path | None = None) -> Iterable[Verse]:
-    """Walk the gita/gita JSON files and yield Verse records.
-
-    Layout expected (after download_sources.py has run):
-        raw_dir_for_core/chapters.json
-        raw_dir_for_core/verse.json
-    [optionally]
-        raw_dir_for_translations/translation.json
-        raw_dir_for_translations/authors.json
-
-    If translations are not present, we still emit Verses with sanskrit +
-    transliteration + word_meanings; the `translation` field falls back to
-    the transliteration so the verse isn't content-empty. (Better: enable
-    the gita_json_translations source.)
-    """
-    chapters = _load(raw_dir_for_core / "chapters.json")
-    verses_raw = _load(raw_dir_for_core / "verse.json")
-
-    chapters_by_id = {c["chapter_number"]: c for c in chapters}
-
-    translations_by_verse: dict[int, dict[str, str]] = {}
-    authors_by_id: dict[str, str] = {}
-    if raw_dir_for_translations is not None:
-        translations_by_verse = _load_translations(raw_dir_for_translations / "translation.json")
-        authors_by_id = _load_authors(raw_dir_for_translations / "authors.json")
-
-    # Pick the best available translator from the allowlist, in priority order.
-    # First match wins. This keeps the index from carrying redundant English
-    # translations of the same verse.
-    translator_priority = ["sivananda", "sanskrit"]
-
-    for v in verses_raw:
-        chap_no = v["chapter_number"]
-        verse_no = v["verse_number"]
-        chap_meta = chapters_by_id.get(chap_no, {})
-        verse_id = _verse_id(chap_no, verse_no)
-
-        # Sanskrit text comes from the core file. The 'text' field has it
-        # in Devanāgarī, often with a trailing newline and verse number.
-        sanskrit = (v.get("text") or "").strip()
-        translit = (v.get("transliteration") or "").strip()
-        word_mean = (v.get("word_meanings") or "").strip()
-
-        # Try to attach an English translation
-        english = ""
-        translator_label = ""
-        v_translations = translations_by_verse.get(v.get("id") or v.get("externalId") or -1, {})
-        for key in translator_priority:
-            text = v_translations.get(key) or _translation_for(v_translations, key)
-            if text:
-                english = text.strip()
-                meta = ALLOWED_TRANSLATORS.get(key)
-                if meta:
-                    translator_label = meta[0]
-                break
-
-        # Fallback: if no English translation, use word-meanings as a substitute
-        # so the verse isn't content-empty. Better than nothing for retrieval,
-        # though enrichment will be poorer.
-        if not english:
-            english = word_mean or translit
-
-        yield Verse(
-            verse_id=verse_id,
-            work="bhagavad_gita",
-            work_display="Bhagavad Gītā",
-            verse_ref=_verse_ref(chap_no, verse_no),
-            tier="primary",
-            section=f"chapter_{chap_no:02d}",
-            section_display=_section_display(chap_meta),
-            translation=english,
-            translator=translator_label,
-            sanskrit=sanskrit,
-            transliteration=translit,
-            word_meanings=word_mean,
-            bhashya="",  # Gītā Bhāṣya is brought in by the Sastry parser
-            bhashya_translator="",
-            source_key="gita_json_core",
-            license="unlicense",
-        )
-
-
-# ──────────────────────────── Internals ────────────────────────────
-def _load(path: Path):
-    with path.open(encoding="utf-8") as f:
-        return json.load(f)
-
-
-def _load_translations(path: Path) -> dict[int, dict[str, str]]:
-    """The translations file has one entry per (verse, author). Group them
-    by verse_id into a {verse_id: {author_id: text}} map.
-
-    Schema seen in the wild varies slightly between forks of gita/gita; we
-    cope by trying a few key names. If parsing fails entirely we return {}
-    and proceed without translations rather than blowing up the whole ingest.
-    """
-    if not path.exists():
-        return {}
-    try:
-        raw = _load(path)
-    except Exception as e:
-        print(f"[gita_json] failed to load translations: {e}")
-        return {}
-
-    out: dict[int, dict[str, str]] = {}
-    for row in raw:
-        vid = row.get("verse_id") or row.get("verseNumber") or row.get("verse_number_id") or row.get("id")
-        text = row.get("description") or row.get("text") or row.get("translation")
-        if vid is None or not text:
-            continue
-
-        # Skip non-English rows (Ramsukhdas Hindi etc.)
-        lang = (row.get("lang") or "").lower()
-        if lang and lang not in ("english", "en"):
-            continue
-
-        # Map the authorName (e.g. "Swami Sivananda") to an allowlist key
-        # ("sivananda") via case-insensitive substring matching. The numeric
-        # author_id field alone can't match the allowlist, which is why we
-        # prefer authorName here.
-        name_str = str(row.get("authorName") or row.get("author_id") or row.get("author") or "").strip()
-        matched_key = next(
-            (k for k in ALLOWED_TRANSLATORS if k.lower() in name_str.lower()),
-            None,
-        )
-        if matched_key is None:
-            continue
-        out.setdefault(int(vid), {})[matched_key] = text
-    return out
-
-
-def _load_authors(path: Path) -> dict[str, str]:
-    if not path.exists():
-        return {}
-    try:
-        raw = _load(path)
-    except Exception:
-        return {}
-    return {row.get("id"): row.get("name", "") for row in raw if row.get("id")}
-
-
-def _translation_for(v_translations: dict, author_key: str) -> str | None:
-    """Tolerant lookup: some files use 'sivananda', some 'Sivananda', etc."""
-    if author_key in v_translations:
-        return v_translations[author_key]
-    lk = author_key.lower()
-    for k, val in v_translations.items():
-        if str(k).lower() == lk:
-            return val
-    return None
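The zero-padding in the deleted parser's `_verse_id` is what makes lexical ordering of the keys match numeric verse order (so "1.9" sorts before "1.10"). A standalone check of that property, with the helper reproduced under the hypothetical name `verse_id`:

```python
def verse_id(chapter: int, verse_no: int) -> str:
    # Two-digit zero-padding: "bhagavad_gita_01_09" < "bhagavad_gita_01_10"
    # lexically, matching numeric order. Two digits suffice because the Gita
    # has 18 chapters and no chapter exceeds 78 verses.
    return f"bhagavad_gita_{chapter:02d}_{verse_no:02d}"


ids = [verse_id(1, v) for v in range(1, 12)]
assert ids == sorted(ids)  # lexical order == verse order across the 9/10 boundary
```

Without the `:02d` padding, string sorting would place `_10` before `_9`, which would scramble any index or directory listing keyed on these ids.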
parsers/sastry_archive.py
DELETED
@@ -1,249 +0,0 @@
-"""
-parsers/sastry_archive.py — extract verse-attached Śaṅkara bhāṣya from
-Alladi Mahadeva Sastry's 1897 archive.org OCR text.
-
-What makes this harder than the gita_json parser
--------------------------------------------------
-The gita/gita JSON gave us each verse already keyed by chapter and verse
-number. The Sastry archive.org file is OCR'd plain text — about 20 MB of
-running prose where the only structural cues are:
-
-    1. Chapter headings, formatted in caps like "SANKHYA YOGA." or
-       "CHAPTER II — SANKHYA YOGA"
-    2. Verse markers, which appear in two forms in the OCR:
-       - inline as "(II. 47.)" or "II. 47." after a translated verse
-       - as section headings like "47." or "Verse 47." preceding the bhāṣya
-    3. The rule that when a translated verse appears, Śaṅkara's commentary
-       follows immediately until the next verse marker.
-
-Add to that: OCR noise. "II" can become "11", "47" can become "4 7", periods
-become commas, glyphs get dropped. So the parser is forgiving — it tries
-several patterns and falls back gracefully.
-
-What we extract
----------------
-For each verse we find, we yield a Verse with:
-    - tier='shankara'
-    - work='bhagavad_gita_bhashya' (kept distinct from 'bhagavad_gita' so
-      the joiner in ingest_corpus.py knows to merge bhashya into the gita
-      verses by verse_ref)
-    - translation = the verse text as Sastry rendered it (handy as a second
-      English voice alongside Sivananda)
-    - bhashya = Śaṅkara's commentary, as Sastry translated it
-    - bhashya_translator = 'Alladi Mahadeva Sastry, 1897'
-
-Robustness strategy
--------------------
-We don't try to be perfect. If a verse's bhāṣya is mis-attributed by ±1, the
-downstream enrichment step will produce paraphrases that don't quite fit, and
-we'll catch those during the spot-check pass on enriched output. The metric
-will also penalize ungrounded citations. The key invariant is: never silently
-emit a wrong (verse_id, bhashya) pair if we're uncertain — better to skip.
-"""
-
-from __future__ import annotations
-import re
-from dataclasses import replace
-from pathlib import Path
-from typing import Iterable
-
-from corpus import Verse
-
-
-# ──────────────────────────── Patterns ────────────────────────────
-# Roman numerals (allowing OCR substitutions: I↔1, V↔V, etc.)
-ROMAN = r"(?:[IVX1l]+|[ivx]+)"
-
-# A "verse marker" looks like "II. 47" or "(II. 47.)" or "47" alone in a section
-# heading. We try several shapes and let the most specific win.
-VERSE_INLINE = re.compile(
-    r"\(?\s*(?P<chap>" + ROMAN + r")\s*[\.\,]\s*(?P<verse>\d{1,3})\s*[\.\,]?\s*\)?",
-    re.IGNORECASE,
-)
-
-# Chapter heading: "CHAPTER II" or "II. SANKHYA YOGA" — uppercase-heavy lines
-CHAPTER_HEADING = re.compile(
-    r"^\s*(?:CHAPTER\s+)?(?P<roman>" + ROMAN + r")\.?\s+[A-Z][A-Z \-—]{4,}",
-    re.MULTILINE,
-)
-
-# Roman → arabic
-ROMAN_MAP = {
-    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7, "VIII": 8,
-    "IX": 9, "X": 10, "XI": 11, "XII": 12, "XIII": 13, "XIV": 14, "XV": 15,
-    "XVI": 16, "XVII": 17, "XVIII": 18,
-}
-
-
-def _to_arabic(token: str) -> int | None:
-    """Convert a possibly-noisy roman numeral to an int. OCR sometimes turns
-    'I' into '1' and 'II' into '11', so we accept both forms."""
-    t = token.upper().replace("L", "I").replace("0", "O")  # OCR substitutions
-    if t in ROMAN_MAP:
-        return ROMAN_MAP[t]
-    # Pure-arabic fallback (e.g. OCR rendered 'II' as '11')
-    if t.isdigit():
-        n = int(t)
-        if 1 <= n <= 18:
-            return n
-    return None
-
-
-# ──────────────────────────── Main parse ────────────────────────────
-def parse(raw_dir: Path) -> Iterable[Verse]:
-    """Walk Sastry archive.org text in raw_dir and yield Verse records.
-
-    Expected layout (after download_sources.py):
-        raw_dir/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya_djvu.txt
-
-    The file is ~20 MB of OCR text. We stream it line-by-line, maintain the
-    current chapter as we encounter chapter headings, and at each verse marker
-    yield the accumulated text since the previous marker as the bhāṣya.
-    """
-    txts = list(raw_dir.glob("*_djvu.txt")) + list(raw_dir.glob("*.txt"))
-    if not txts:
-        print(f"[sastry] no .txt under {raw_dir}; did you download_sources.py?")
-        return
-
-    text = txts[0].read_text(encoding="utf-8", errors="replace")
-    text = _denoise(text)
-
-    # First pass: find every verse marker with its position and attempt to
-    # disambiguate the chapter from context. We collect (chap, verse, span)
-    # tuples in document order.
-    markers: list[tuple[int, int, int, int]] = []  # chap, verse, start, end
-    current_chapter = 1
-    last_pos = 0
-
-    # Walk chapter headings and verse markers together via merged iteration
-    events = []
-    for m in CHAPTER_HEADING.finditer(text):
-        c = _to_arabic(m.group("roman"))
-        if c is not None:
-            events.append(("chapter", m.start(), c))
-
-    for m in VERSE_INLINE.finditer(text):
-        c = _to_arabic(m.group("chap"))
-        try:
-            v = int(m.group("verse"))
-        except (ValueError, TypeError):
-            continue
-        if c is None or not (1 <= v <= 80):
-            continue
-        events.append(("verse", m.start(), c, v, m.end(), m.start("verse")))
-
-    events.sort(key=lambda e: e[1])
-
-    # Second pass: build (chapter, verse) → (start, end) spans, where each
-    # span is the bhāṣya from one marker to the next. We yield in document
-    # order with the chapter from the most recent chapter heading we saw.
-    last_marker_pos: int | None = None
-    last_chap: int | None = None
-    last_verse: int | None = None
-
-    for ev in events:
-        if ev[0] == "chapter":
-            current_chapter = ev[2]
-            continue
-        # ev: ("verse", start, chap, verse, end, verse_pos)
-        _, start, chap, verse, end, verse_pos = ev
-
-        # Only treat markers where the verse NUMBER appears near the start of
-        # its line — those are actual section headings. Inline cross-references
-        # like "(II. 47.)" mid-paragraph have the verse number well into the
-        # line and must not be treated as section boundaries.
-        verse_line_start = text.rfind("\n", 0, verse_pos) + 1
-        on_own_line = (verse_pos - verse_line_start) <= 8
-        if not on_own_line:
-            continue
-        current_chapter = chap
-
-        if last_marker_pos is not None and last_chap is not None and last_verse is not None:
-            bhashya_text = text[last_marker_pos:start].strip()
-            if bhashya_text:
-                yield _build_verse(
-                    chap=last_chap, verse=last_verse, body=bhashya_text,
-                )
-
-        last_marker_pos = end
-        last_chap = current_chapter
-        last_verse = verse
-
-    # Flush the trailing one
-    if last_marker_pos is not None and last_chap and last_verse:
-        tail = text[last_marker_pos:].strip()
-        if tail:
-            yield _build_verse(chap=last_chap, verse=last_verse, body=tail)
-
-
-# ──────────────────────────── Builders ────────────────────────────
-def _build_verse(chap: int, verse: int, body: str) -> Verse:
-    """The body lump contains both Sastry's English of the verse and Śaṅkara's
-    commentary, usually with the verse first (sometimes labeled) and the
-    commentary following. We make a *light* split heuristic: if the first
|
| 184 |
-
paragraph is short (≤ 400 chars) and ends near a period, treat it as the
|
| 185 |
-
verse translation; the rest is bhashya. If we can't split confidently,
|
| 186 |
-
we put everything into bhashya and leave translation empty — the gita_json
|
| 187 |
-
parser already gave us a translation by another translator."""
|
| 188 |
-
body = body.strip()
|
| 189 |
-
translation = ""
|
| 190 |
-
bhashya = body
|
| 191 |
-
|
| 192 |
-
# Heuristic split on the first blank-ish line within reasonable distance
|
| 193 |
-
para_break = re.search(r"\n\s*\n", body[:600])
|
| 194 |
-
if para_break and para_break.end() < 500:
|
| 195 |
-
head = body[:para_break.start()].strip()
|
| 196 |
-
tail = body[para_break.end():].strip()
|
| 197 |
-
# Accept the split only if the head looks like a verse: short-ish,
|
| 198 |
-
# not starting with a typical-bhashya opener like "This means" /
|
| 199 |
-
# "The meaning is" / "Here the Lord says".
|
| 200 |
-
if 30 < len(head) < 400 and not _looks_like_bhashya_opener(head):
|
| 201 |
-
translation, bhashya = head, tail
|
| 202 |
-
|
| 203 |
-
return Verse(
|
| 204 |
-
verse_id=f"bhagavad_gita_{chap:02d}_{verse:02d}",
|
| 205 |
-
work="bhagavad_gita_bhashya",
|
| 206 |
-
work_display="Bhagavad Gītā with Śaṅkara's Bhāṣya",
|
| 207 |
-
verse_ref=f"BG {chap}.{verse}",
|
| 208 |
-
tier="shankara",
|
| 209 |
-
section=f"chapter_{chap:02d}",
|
| 210 |
-
section_display=f"Chapter {chap}",
|
| 211 |
-
translation=translation,
|
| 212 |
-
translator="Alladi Mahadeva Sastry" if translation else "",
|
| 213 |
-
bhashya=bhashya,
|
| 214 |
-
bhashya_translator="Alladi Mahadeva Sastry, 1897",
|
| 215 |
-
source_key="sastry_gita_bhashya",
|
| 216 |
-
license="public_domain",
|
| 217 |
-
)
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
def _looks_like_bhashya_opener(s: str) -> bool:
|
| 221 |
-
s = s.strip().lower()
|
| 222 |
-
openers = (
|
| 223 |
-
"this means", "the meaning is", "the sense is", "here the lord",
|
| 224 |
-
"here it is said", "the lord says", "the question may", "objection",
|
| 225 |
-
"the commentator",
|
| 226 |
-
)
|
| 227 |
-
return any(s.startswith(o) for o in openers)
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
# ──────────────────────────── OCR de-noise ────────────────────────────
|
| 231 |
-
def _denoise(text: str) -> str:
|
| 232 |
-
"""Light cleanup. Aggressive normalization risks losing real signal —
|
| 233 |
-
we only fix patterns we're confident about."""
|
| 234 |
-
# Common OCR substitutions for Sanskrit diacritics losses won't matter
|
| 235 |
-
# for English-language retrieval; we leave Sanskrit fragments alone.
|
| 236 |
-
|
| 237 |
-
# Collapse runs of repeated punctuation that OCR hallucinated
|
| 238 |
-
text = re.sub(r"\.{3,}", ".", text)
|
| 239 |
-
text = re.sub(r" +\.", ".", text)
|
| 240 |
-
|
| 241 |
-
# Glue cross-line hyphens: "lib-\nerty" → "liberty"
|
| 242 |
-
text = re.sub(r"-\n([a-z])", r"\1", text)
|
| 243 |
-
|
| 244 |
-
# Normalize whitespace
|
| 245 |
-
text = re.sub(r"[ \t]+", " ", text)
|
| 246 |
-
text = re.sub(r"\n[ \t]+", "\n", text)
|
| 247 |
-
text = re.sub(r"\n{3,}", "\n\n", text)
|
| 248 |
-
|
| 249 |
-
return text
run_overnight.py
DELETED
@@ -1,230 +0,0 @@
-"""
-run_overnight.py — orchestrates full GEPA optimization through light → medium,
-then saves prompts and runs a multi-question test suite.
-
-Usage:
-    python run_overnight.py [--skip-light] [--skip-medium] [--skip-tests]
-
-Writes a timestamped log to artifacts/overnight_run.log.
-"""
-from __future__ import annotations
-import argparse
-import subprocess
-import sys
-import time
-from datetime import datetime
-from pathlib import Path
-import json
-
-ROOT = Path(__file__).parent.resolve()
-LOG_PATH = ROOT / "artifacts" / "overnight_run.log"
-OPTIMIZED_PATH = ROOT / "artifacts" / "optimized_advisor.json"
-PROMPTS_PATH = ROOT / "artifacts" / "optimized_advisor.prompts.txt"
-RESULTS_PATH = ROOT / "artifacts" / "test_results.json"
-
-TEST_QUESTIONS = [
-    "I just got laid off and feel like nothing matters anymore.",
-    "I keep procrastinating on important work and feel guilty about it. How do I stop?",
-    "My relationship ended and I feel like I've lost my identity. Who am I without this person?",
-    "I'm terrified of death and can't stop thinking about it at night.",
-    "I have achieved everything I wanted — career, family, money — and still feel empty.",
-    "I feel angry at everyone around me but don't know why. How should I deal with this?",
-    "I can't stop comparing myself to others and feeling like I'm always falling short.",
-]
-
-
-def ts() -> str:
-    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
-
-
-def log(msg: str, f=None):
-    line = f"[{ts()}] {msg}"
-    print(line, flush=True)
-    if f:
-        f.write(line + "\n")
-        f.flush()
-
-
-def run_phase(cmd: list[str], phase: str, logfile) -> bool:
-    log(f"=== STARTING {phase} ===", logfile)
-    log(f"Command: {' '.join(cmd)}", logfile)
-    start = time.time()
-    try:
-        proc = subprocess.Popen(
-            cmd,
-            stdout=subprocess.PIPE,
-            stderr=subprocess.STDOUT,
-            text=True,
-            cwd=str(ROOT),
-        )
-        for line in proc.stdout:
-            logfile.write(line)
-            logfile.flush()
-            # Echo key lines to terminal
-            if any(k in line for k in ["score", "GEPA", "Step", "ERROR", "Saved", "Train:", "Val:", "Baseline"]):
-                print(line, end="", flush=True)
-        proc.wait()
-        elapsed = time.time() - start
-        if proc.returncode == 0:
-            log(f"=== {phase} COMPLETED in {elapsed/60:.1f} min ===", logfile)
-            return True
-        else:
-            log(f"=== {phase} FAILED (exit {proc.returncode}) after {elapsed/60:.1f} min ===", logfile)
-            return False
-    except Exception as e:
-        log(f"=== {phase} ERROR: {e} ===", logfile)
-        return False
-
-
-def run_test_suite(logfile) -> dict:
-    log("=== STARTING TEST SUITE ===", logfile)
-    sys.path.insert(0, str(ROOT))
-
-    import config
-    from advisor import load_optimized
-    from metrics import gita_metric
-    import dspy
-    from concurrent.futures import ThreadPoolExecutor, as_completed
-
-    config.configure_dspy()
-
-    advisor = load_optimized()
-    n = len(TEST_QUESTIONS)
-
-    def run_one(i_q):
-        i, q = i_q
-        try:
-            pred = advisor(user_question=q, history=dspy.History(messages=[]))
-            gold = dspy.Example(user_question=q).with_inputs("user_question")
-            m = gita_metric(gold, pred)
-            return i, q, {
-                "question": q,
-                "score": round(float(m.score), 3),
-                "word_count": len(pred.response.split()),
-                "sources_cited": pred.sources_cited,
-                "response_excerpt": pred.response[:200],
-                "feedback_excerpt": m.feedback[:500],
-            }
-        except Exception as e:
-            return i, q, {"question": q, "error": str(e), "score": 0.0}
-
-    indexed = list(enumerate(TEST_QUESTIONS, 1))
-    results_map = {}
-    with ThreadPoolExecutor(max_workers=n) as pool:
-        futures = {pool.submit(run_one, iq): iq for iq in indexed}
-        for fut in as_completed(futures):
-            i, q, result = fut.result()
-            results_map[i] = result
-            if "error" in result:
-                log(f"  [{i}/{n}] ERROR: {result['error']}", logfile)
-            else:
-                log(f"  [{i}/{n}] score={result['score']:.3f} wc={result['word_count']} sources={result['sources_cited']}", logfile)
-
-    results = [results_map[i] for i in range(1, n + 1)]
-    avg = sum(r.get("score", 0) for r in results) / n
-    log(f"=== TEST SUITE DONE — avg score: {avg:.3f} ===", logfile)
-    return {"questions": results, "avg_score": round(avg, 3), "timestamp": ts()}
-
-
-def dump_prompts(logfile):
-    """Re-extract and log optimized prompts to a human-readable file."""
-    if not OPTIMIZED_PATH.exists():
-        log("  No optimized program found — skipping prompt dump.", logfile)
-        return
-
-    sys.path.insert(0, str(ROOT))
-    import config
-    from advisor import GitaAdvisor
-    config.configure_dspy()
-
-    advisor = GitaAdvisor()
-    try:
-        advisor.load(str(OPTIMIZED_PATH))
-    except Exception as e:
-        log(f"  Could not load optimized program: {e}", logfile)
-        return
-
-    lines = ["# Optimized Prompts after GEPA overnight run", f"# Extracted at {ts()}", ""]
-    for name, predictor in advisor.named_predictors():
-        sig = predictor.signature
-        lines.append(f"## {name}")
-        lines.append("### Instructions")
-        lines.append(sig.instructions or "(none)")
-        lines.append("")
-        lines.append("### Field descriptions")
-        for fname, field in sig.fields.items():
-            extras = field.json_schema_extra or {}
-            desc = extras.get("desc", "") if isinstance(extras, dict) else ""
-            lines.append(f"  {fname}: {desc}")
-        lines.append("")
-        lines.append("---")
-        lines.append("")
-
-    PROMPTS_PATH.write_text("\n".join(lines), encoding="utf-8")
-    log(f"  Prompts written to {PROMPTS_PATH}", logfile)
-
-
-def main():
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--skip-light", action="store_true")
-    ap.add_argument("--skip-medium", action="store_true")
-    ap.add_argument("--skip-tests", action="store_true")
-    args = ap.parse_args()
-
-    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
-
-    with LOG_PATH.open("w", encoding="utf-8") as logfile:
-        log("=== OVERNIGHT GEPA RUN STARTED ===", logfile)
-        log(f"Dataset: {ROOT / 'data' / 'synthetic_questions.jsonl'}", logfile)
-        log(f"Output: {OPTIMIZED_PATH}", logfile)
-
-        python = sys.executable
-
-        # ── Phase 1: Light ──
-        if not args.skip_light:
-            ok = run_phase(
-                [python, "optimize_gepa.py", "--auto", "light"],
-                "GEPA LIGHT",
-                logfile,
-            )
-            if not ok:
-                log("Light phase failed — stopping overnight run.", logfile)
-                sys.exit(1)
-            # Back up the light result
-            if OPTIMIZED_PATH.exists():
-                import shutil
-                shutil.copy(OPTIMIZED_PATH, OPTIMIZED_PATH.with_suffix(".light.json"))
-                log(f"  Backed up light result to {OPTIMIZED_PATH.with_suffix('.light.json')}", logfile)
-        else:
-            log("Skipping light phase (--skip-light).", logfile)
-
-        # ── Phase 2: Medium ──
-        if not args.skip_medium:
-            ok = run_phase(
-                [python, "optimize_gepa.py", "--auto", "medium"],
-                "GEPA MEDIUM",
-                logfile,
-            )
-            if not ok:
-                log("Medium phase failed.", logfile)
-                # Don't exit — still dump whatever we have
-        else:
-            log("Skipping medium phase (--skip-medium).", logfile)
-
-        # ── Dump prompts ──
-        log("Extracting optimized prompts ...", logfile)
-        dump_prompts(logfile)
-
-        # ── Test suite ──
-        if not args.skip_tests:
-            test_results = run_test_suite(logfile)
-            RESULTS_PATH.write_text(json.dumps(test_results, indent=2, ensure_ascii=False), encoding="utf-8")
-            log(f"Test results written to {RESULTS_PATH}", logfile)
-        else:
-            log("Skipping test suite (--skip-tests).", logfile)
-
-        log("=== OVERNIGHT RUN COMPLETE ===", logfile)
-
-
-if __name__ == "__main__":
-    main()
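The live-tee pattern in `run_phase` — stream a child process's stdout into a log while echoing only interesting lines to the terminal — can be exercised standalone. A minimal sketch with a stand-in command; the `echo_keys` filter is illustrative, not the script's actual keyword list:

```python
import io
import subprocess
import sys

def run_phase(cmd, logfile, echo_keys=("score", "ERROR")):
    # Stream child stdout line-by-line: every line goes to the log,
    # only lines containing one of the keys are echoed to the terminal.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        logfile.write(line)
        if any(k in line for k in echo_keys):
            print(line, end="")
    proc.wait()
    return proc.returncode == 0

log = io.StringIO()
ok = run_phase(
    [sys.executable, "-c", "print('step 1'); print('score 0.9')"], log
)
# log now holds both lines; only "score 0.9" was echoed, and ok is True.
```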
smoke_test.py
DELETED
@@ -1,99 +0,0 @@
-"""
-smoke_test.py — verify the full pipeline before spending hours on GEPA.
-
-Runs:
-  1. LM connectivity check
-  2. Retriever connectivity check
-  3. One end-to-end advisor call
-  4. One metric call against the result
-
-If any step fails, the error message tells you which knob to turn.
-
-    python smoke_test.py "I just got laid off and feel like nothing makes sense anymore."
-"""
-
-from __future__ import annotations
-import sys
-
-import dspy
-
-import config
-from advisor import GitaAdvisor
-from knowledge_base import AdvaitaRetriever
-from metrics import gita_metric
-
-
-def step(label: str):
-    print(f"\n── {label} " + "─" * (60 - len(label)))
-
-
-def main():
-    user_q = sys.argv[1] if len(sys.argv) > 1 else (
-        "I just got laid off and feel like nothing makes sense anymore."
-    )
-
-    step("1. Configure LMs")
-    task_lm, reflection_lm = config.configure_dspy()
-    print(f"  task_lm: {task_lm.model}")
-    print(f"  reflection_lm: {reflection_lm.model}")
-
-    step("2. LM round-trip")
-    try:
-        out = task_lm("Reply with the single word: ready.")
-        print(f"  reply: {out!r}")
-    except Exception as e:
-        print(f"  FAILED — is LM Studio running at {config.LM_STUDIO_BASE}?\n  {e}")
-        sys.exit(1)
-
-    step("3. Retriever sanity")
-    try:
-        retr = AdvaitaRetriever()
-        hits = retr.search("non-attachment to results of action", k=3)
-        if not hits:
-            print("  WARNING: no retrieval results. Did you build the index?")
-            print("  Run: python knowledge_base.py --build")
-        else:
-            for h in hits:
-                v = h.verse
-                section = f" — {v.section}" if v.section else ""
-                print(f"  [{v.tier}] {v.work}{section} score={h.combined_score:.3f}")
-    except Exception as e:
-        print("  FAILED — index probably not built. Run "
-              "`python knowledge_base.py --build` after dropping texts in sources/.")
-        print(f"  {e}")
-        sys.exit(1)
-
-    step("4. End-to-end advisor call")
-    advisor = GitaAdvisor()
-    try:
-        pred = advisor(user_question=user_q, history=dspy.History(messages=[]))
-    except Exception as e:
-        print(f"  FAILED — pipeline error: {e}")
-        sys.exit(1)
-
-    print(f"\n  user: {user_q}")
-    print(f"\n  felt: {pred.felt_emotion}")
-    print(f"  surface: {pred.surface_concern}")
-    print(f"  deeper: {pred.deeper_concern}")
-    print(f"  themes: {pred.vedantic_themes}")
-    print(f"  queries: {pred.queries}")
-    print(f"  selected indices: {pred.selected_indices}")
-    print("\n  --- response ---")
-    print(pred.response)
-    print(f"\n  sources cited: {pred.sources_cited}")
-
-    step("5. Metric round-trip")
-    gold = dspy.Example(user_question=user_q, history=dspy.History(messages=[])).with_inputs("user_question", "history")
-    m = gita_metric(gold, pred)
-    print(f"  composite score: {m.score:.3f}")
-    print("\n  --- feedback (this is what GEPA's reflection LM sees) ---")
-    print(m.feedback)
-
-    step("Done")
-    print("If you got here, you're ready to run:")
-    print("  python dataset_generator.py --n 500")
-    print("  python optimize_gepa.py --auto medium")
-
-
-if __name__ == "__main__":
-    main()
sources_local/.gitkeep
ADDED
File without changes
sources_registry.py
DELETED
@@ -1,331 +0,0 @@
-"""
-sources_registry.py — the one place every open source lives.
-
-Why a registry rather than scattered URLs?
-------------------------------------------
-Adding a new text to the corpus shouldn't mean editing five files. It should
-mean adding one entry here. Downloads, parsing, re-indexing, and enrichment
-all read from this registry, so the registry *is* the corpus definition.
-
-How sources are categorized
----------------------------
-Every source belongs to a "tier", which the retriever uses to break ties when
-two passages score equally on cosine similarity:
-
-  primary    — the śruti and the Gītā itself (the thing being commented on)
-  shankara   — Śaṅkarācārya's bhāṣyas and prakaraṇa-granthas (his own pen)
-  supporting — texts in his lineage but not by him (Aṣṭāvakra, Yoga Vāsiṣṭha,
-               Vidyāraṇya's Pañcadaśī, modern Ramaṇa & Nisargadatta where
-               explicitly placed in the Advaita stream)
-
-The tier weights live in knowledge_base.py; this file just labels.
-
-License classes
----------------
-We track licensing because the project is meant to be shareable. We refuse to
-register any source that is not unambiguously open. The classes are:
-
-  public_domain — pre-1929 works in US PD; covers most 19th-c. translations
-  unlicense     — Unlicense / CC0 / equivalent dedications
-  cc_by         — Creative Commons Attribution (must preserve credit)
-  cc_by_sa      — Creative Commons ShareAlike
-  open_database — ODbL (the dataset license used by some github corpora)
-
-Anything we'd label "publisher_copyright" simply doesn't get an entry. If you
-want the modern Advaita Ashrama translations, you must obtain a license and
-add the texts yourself in the user-supplied directory.
-"""
-
-from __future__ import annotations
-from dataclasses import dataclass, field
-from typing import Literal
-
-
-# ──────────────────────────── Type aliases ────────────────────────────
-Tier = Literal["primary", "shankara", "supporting"]
-License = Literal[
-    "public_domain", "unlicense", "cc_by", "cc_by_sa", "open_database",
-]
-Parser = Literal[
-    "gita_json",       # the gita/gita repo JSON layout (verse-indexed)
-    "wisdomlib_html",  # one chapter per HTML page on wisdomlib
-    "sastry_archive",  # Alladi Mahadeva Sastry OCR text from archive.org
-    "thibaut_sbe",     # Thibaut's SBE Brahma Sutra translation HTML
-    "plain_text",      # already-cleaned plain text the user dropped in
-]
-
-
-# ──────────────────────────── Source entry ────────────────────────────
-@dataclass(frozen=True)
-class Source:
-    """One downloadable source. The registry is a list of these.
-
-    The download_sources script understands two kinds of `urls`:
-      - HTTPS URLs to direct files (json, html, txt) — fetched with `requests`
-      - "git+https://..." URLs — cloned with `git clone --depth=1`
-
-    The parser receives the local path(s) and is responsible for emitting
-    Verse records into the corpus.
-    """
-    # Identity
-    key: str                    # short slug used as folder name; must be unique
-    name: str                   # human-readable name
-    work: str                   # the work; matches Verse.work for grouping
-    tier: Tier
-
-    # Provenance
-    license: License
-    license_url: str = ""       # canonical license URL or attribution page
-    translator: str = ""        # who did the English translation
-    year: int | None = None     # year of the edition we're using
-
-    # Download
-    urls: tuple[str, ...] = ()  # one or more files / git repos
-    parser: Parser = "plain_text"
-
-    # Operational
-    enabled: bool = True        # set False to skip without deleting the entry
-    notes: str = ""             # anything a future reader should know
-
-
-# ──────────────────────────── The registry ────────────────────────────
-#
-# This list is the source of truth. Everything else reads it.
-#
-# Conventions:
-#   - One entry per *publication*, not per chapter file. The parser knows how
-#     to walk its own files.
-#   - URLs that work as of the writing of this comment are noted. If a URL
-#     drifts, fix it here and re-run `download_sources.py`.
-#   - When in doubt about license, leave the source disabled and add a note.
-#
-SOURCES: list[Source] = [
-
-    # ─── Bhagavad Gītā: Sanskrit + transliteration + word meanings ───
-    # The gita/gita repo gives us the cleanest verse-indexed data on the web.
-    # Released under the Unlicense, which is a public-domain dedication. We
-    # use the static GitHub Pages mirror because it's directly fetchable as
-    # JSON files; cloning the repo is also fine.
-    Source(
-        key="gita_json_core",
-        name="Bhagavad Gītā — verse-indexed JSON (core)",
-        work="bhagavad_gita",
-        tier="primary",
-        license="unlicense",
-        license_url="https://github.com/gita/gita/blob/main/LICENSE",
-        translator="Sanskrit + IAST transliteration + word-by-word gloss",
-        year=None,
-        urls=(
-            "https://ravisiyer.github.io/gita-data/v1/chapters.json",
-            "https://ravisiyer.github.io/gita-data/v1/verse.json",
-        ),
-        parser="gita_json",
-        notes=(
-            "This is the spine of the Gītā corpus. Sanskrit + transliteration + "
-            "word-meanings. Translations come from translator-specific files."
-        ),
-    ),
-
-    # ─── Bhagavad Gītā: English translations (one or more) ───
-    # The translations.json file is large (~2 MB) and contains multiple
-    # translators keyed by author_id. Our parser will pick public-domain ones.
-    Source(
-        key="gita_json_translations",
-        name="Bhagavad Gītā — English translations (multiple authors)",
-        work="bhagavad_gita",
-        tier="primary",
-        license="unlicense",
-        license_url="https://github.com/gita/gita/blob/main/LICENSE",
-        translator="multiple — see per-verse author_id",
-        year=None,
-        urls=(
-            "https://ravisiyer.github.io/gita-data/v1/translation.json",
-            "https://ravisiyer.github.io/gita-data/v1/authors.json",
-        ),
-        parser="gita_json",
-        notes=(
-            "Parser keeps only translators whose works are public-domain or "
-            "explicitly free; e.g. Swami Sivananda is OK, ISKCON Prabhupada "
-            "is excluded. See parsers/gita_json.py for the allowlist."
-        ),
-    ),
-
-    # ─── Śaṅkara's Gītā Bhāṣya, Sastry 1897 translation ───
-    # The only full English translation of Śaṅkara's Gītā commentary that's
-    # unambiguously in the public domain (Sastry died ~1926; first published
-    # 1897). Lives on archive.org as OCR text. Parser handles OCR noise.
-    Source(
-        key="sastry_gita_bhashya",
-        name="Śaṅkara's Bhagavad Gītā Bhāṣya — Sastry translation (1897)",
-        work="bhagavad_gita_bhashya",
-        tier="shankara",
-        license="public_domain",
-        license_url="https://archive.org/details/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya",
-        translator="Alladi Mahadeva Sastry",
-        year=1897,
-        urls=(
-            # Direct OCR text. The /download/ path is reliably the raw file;
-            # /stream/ is the HTML viewer and not what we want.
-            "https://archive.org/download/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya/Bhagavad-Gita.with.the.Commentary.of.Sri.Shankaracharya_djvu.txt",
-        ),
-        parser="sastry_archive",
-        notes=(
-            "OCR will have noise — broken hyphens, occasional 'rn' → 'm'. "
-            "Parser uses verse-marker regex to chunk by verse and tries to "
-            "associate Śaṅkara's commentary with the verse it follows."
-        ),
-    ),
-
-    # ─── Telang's Gītā translation, SBE Vol. 8 (1882) ───
-    # An alternative to Sastry for the Gītā translation itself. Useful when
-    # we want a second voice for the verse text, since Sastry was sometimes
-    # paraphrasing Śaṅkara's gloss into the translation.
-    Source(
-        key="telang_gita",
-        name="Bhagavad Gītā — Telang translation, SBE Vol. 8 (1882)",
-        work="bhagavad_gita",
-        tier="primary",
-        license="public_domain",
-        license_url="https://en.wikipedia.org/wiki/Sacred_Books_of_the_East",
-        translator="Kāshināth Trimbak Telang",
-        year=1882,
-        urls=tuple(
-            f"https://www.wisdomlib.org/hinduism/book/the-bhagavadgita/d/doc{n}.html"
-            for n in range(81668, 81686)  # chapters 1–18
-        ),
-        parser="wisdomlib_html",
-        enabled=False,  # off by default — gita_json_translations gives us enough
-        notes=(
-            "Wisdomlib mirrors Telang's SBE 8 translation as one chapter per "
-            "page. Enable if you want a second translation alongside gita_json."
-        ),
-    ),
-
-    # ─── Mundaka Upaniṣad with Śaṅkara's Bhāṣya ───
-    # Wisdomlib hosts a complete English edition of Mundaka with Śaṅkara's
-    # commentary. Likely older Sitarama Sastri translation, public domain.
-    Source(
-        key="mundaka_shankara",
-        name="Muṇḍaka Upaniṣad with Śaṅkara's Bhāṣya",
-        work="mundaka_upanishad",
-        tier="shankara",
-        license="public_domain",
-        license_url="https://www.wisdomlib.org/hinduism/book/mundaka-upanishad-shankara-bhashya",
-        translator="Sitarama Sastri (1898)",
-        year=1898,
-        urls=(
-            "https://www.wisdomlib.org/hinduism/book/mundaka-upanishad-shankara-bhashya",
-        ),
-        parser="wisdomlib_html",
-        enabled=False,  # wisdomlib_html parser not yet implemented
-        notes=(
-            "The wisdomlib parser will follow the table-of-contents links from "
-            "this index page to fetch each section."
-        ),
-    ),
-
-    # ─── Brahma Sūtras with Śaṅkara's Bhāṣya, Thibaut translation ───
-    # SBE volumes 34 (1890) and 38 (1896). The most-cited English translation
-    # of the Brahma Sūtra Bhāṣya, used by every academic working in Vedānta.
-    # Squarely public domain.
-    Source(
-        key="thibaut_brahma_sutra",
-        name="Brahma Sūtras with Śaṅkara Bhāṣya — Thibaut translation",
-        work="brahma_sutra_bhashya",
-        tier="shankara",
-        license="public_domain",
-        license_url="https://archive.org/details/SacredBooksOfTheEastVol34",
-        translator="George Thibaut (SBE 34 & 38)",
-        year=1890,
-        urls=(
-            # archive.org full-text URLs for SBE 34 and 38
-            "https://archive.org/download/SacredBooksOfTheEastVol34/sbe34_djvu.txt",
-            "https://archive.org/download/SacredBooksOfTheEastVol38/sbe38_djvu.txt",
|
| 244 |
-
),
|
| 245 |
-
parser="thibaut_sbe",
|
| 246 |
-
enabled=False, # parser not implemented in v1 — see parsers/README.md
|
| 247 |
-
notes=(
|
| 248 |
-
"Disabled by default until thibaut_sbe parser is written. The text "
|
| 249 |
-
"is structured by adhikaraṇa (topic groups of sūtras), not by "
|
| 250 |
-
"single sūtras, so the parser needs more care than the others."
|
| 251 |
-
),
|
| 252 |
-
),
|
| 253 |
-
|
| 254 |
-
# ─── Vivekacūḍāmaṇi (Mohini Chatterji translation) ───
|
| 255 |
-
# The most famous prakaraṇa attributed to Śaṅkara. 581 verses.
|
| 256 |
-
# Mohini Chatterji's translation is early-20th-c., public domain.
|
| 257 |
-
Source(
|
| 258 |
-
key="vivekachudamani_chatterji",
|
| 259 |
-
name="Vivekacūḍāmaṇi — Mohini Chatterji translation",
|
| 260 |
-
work="vivekachudamani",
|
| 261 |
-
tier="shankara",
|
| 262 |
-
license="public_domain",
|
| 263 |
-
translator="Mohini M. Chatterji",
|
| 264 |
-
year=1932,
|
| 265 |
-
urls=(
|
| 266 |
-
# The user should fill this in; placeholder for the registry shape
|
| 267 |
-
# so the downloader logs a clear "URL missing" message rather than
|
| 268 |
-
# silently skipping.
|
| 269 |
-
"",
|
| 270 |
-
),
|
| 271 |
-
parser="plain_text",
|
| 272 |
-
enabled=False,
|
| 273 |
-
notes=(
|
| 274 |
-
"Drop a clean copy at sources_local/vivekachudamani.txt and the "
|
| 275 |
-
"plain_text parser will pick it up. Several archive.org editions "
|
| 276 |
-
"exist; verse markers vary by edition."
|
| 277 |
-
),
|
| 278 |
-
),
|
| 279 |
-
|
| 280 |
-
# ─── User-provided plain-text drop-in slot ───
|
| 281 |
-
# If you already have a clean text file (a translation you typed up, a
|
| 282 |
-
# lecture transcript you cleaned, anything), drop it in sources_local/
|
| 283 |
-
# named tier__work__section.txt and the plain_text parser will fold it in.
|
| 284 |
-
Source(
|
| 285 |
-
key="user_local",
|
| 286 |
-
name="User-provided plain-text sources",
|
| 287 |
-
work="user_local",
|
| 288 |
-
tier="supporting",
|
| 289 |
-
license="public_domain", # under your responsibility
|
| 290 |
-
urls=(),
|
| 291 |
-
parser="plain_text",
|
| 292 |
-
enabled=False, # no URLs; user drops files manually into sources_local/
|
| 293 |
-
notes=(
|
| 294 |
-
"Anything in sources_local/. Convention: tier__work__section.txt "
|
| 295 |
-
"(see parsers/plain_text.py)."
|
| 296 |
-
),
|
| 297 |
-
),
|
| 298 |
-
]
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
# ──────────────────────────── Helpers ────────────────────────────
|
| 302 |
-
def by_key(key: str) -> Source:
|
| 303 |
-
"""Look up a source by its registry key. Raises KeyError on miss."""
|
| 304 |
-
for s in SOURCES:
|
| 305 |
-
if s.key == key:
|
| 306 |
-
return s
|
| 307 |
-
raise KeyError(f"No source with key={key!r}")
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
def enabled_sources() -> list[Source]:
|
| 311 |
-
return [s for s in SOURCES if s.enabled]
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
def by_parser(parser: Parser) -> list[Source]:
|
| 315 |
-
"""Group enabled sources by their parser, useful for the ingest loop."""
|
| 316 |
-
return [s for s in SOURCES if s.enabled and s.parser == parser]
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
def attribution_for(work: str) -> list[str]:
|
| 320 |
-
"""Returns the attribution lines for any work, for citation footers.
|
| 321 |
-
|
| 322 |
-
Even though all our sources are PD, citing translators is right. The
|
| 323 |
-
advisor's response footer can call this and append the translators to
|
| 324 |
-
the bibliography lines.
|
| 325 |
-
"""
|
| 326 |
-
out = []
|
| 327 |
-
for s in SOURCES:
|
| 328 |
-
if s.work == work and s.translator:
|
| 329 |
-
year = f", {s.year}" if s.year else ""
|
| 330 |
-
out.append(f"{s.translator}{year} ({s.license})")
|
| 331 |
-
return out
|
streamlit_app.py
CHANGED
@@ -178,7 +178,6 @@ _EXAMPLES = [
     "I keep hurting the people I love without meaning to.",
     "I've been meditating for years but still feel empty.",
     "My ambition feels hollow but I can't stop chasing it.",
-    "My boss no longer wants me on his team and I feel humiliated.",
 ]
 
 