# `data_label_factory.identify` — open-set image retrieval

The companion to the main labeling pipeline. Where the base `data_label_factory` produces COCO labels for training a closed-set **detector**, this subpackage produces a CLIP-based **retrieval index** for open-set **identification** — given a known set of N reference images, identify which one a webcam frame is showing.

**Use this when:**

- You have **1 image per class** (a product catalog, a card collection, an art portfolio, a parts diagram, …) and want a "what is this thing I'm holding up?" tool.
- You want **zero training time** by default and the option to fine-tune for more accuracy.
- You want to **add new items in seconds** by dropping a JPG in a folder and re-indexing.
- You want **rarity / variant detection** for free — different prints of the same item indexed under filenames that encode the variant.

**Use the base pipeline instead when:**

- You need to detect multiple object instances per image with bounding boxes
- Your objects appear in cluttered scenes and need a real detector
- You have many images per class and want a closed-set classifier

---

## The 4-step blueprint (works for ANY image set)

This is the entire workflow. Replace `~/my-collection/` with your reference folder and you're done.

### Step 0 — install (one-time, ~1 min)

```bash
pip install -e ".[identify]"
# This pulls torch, pillow, clip, fastapi, ultralytics, and uvicorn
```

### Step 1 — gather references (5–30 min depending on source)

You need **one image per class**. The filename becomes the label, so be deliberate:

```
~/my-collection/
├── blue_eyes_white_dragon.jpg
├── dark_magician.jpg
├── exodia_the_forbidden_one.jpg
└── ...
```

**Naming rules:**

- The filename stem (minus extension) becomes the displayed label.
- Optional set-code prefixes are auto-stripped: `LOCH-JP001_dark_magician.jpg` → `Dark Magician`.
- Optional rarity suffixes are extracted as a separate field if they match one of: `pscr`, `scr`, `ur`, `sr`, `op`, `utr`, `cr`, `ea`, `gmr`. Example: `dark_magician_pscr.jpg` → name=`Dark Magician`, rarity=`PScR`.
- Underscores become spaces, then title-cased.

**Where to get reference images:**

| Domain | Source |
|---|---|
| Trading cards | ygoprodeck (Yu-Gi-Oh!), Pokémon TCG API, Scryfall (MTG), yugipedia |
| Products | Amazon listing main image, manufacturer site |
| Art / paintings | Wikimedia Commons, museum APIs |
| Industrial parts | Manufacturer catalog scrapes |
| Faces | Selfies (with permission!) |
| Album covers | MusicBrainz cover art archive |
| Movie posters | TMDB API |

**You can mix sources** — e.g. include both English and Japanese versions of the same card under different filenames. The retrieval system treats them as separate references but the cosine match will pick whichever is closer to your live input.

### Step 2 — build the index (10 sec)

```bash
python3 -m data_label_factory.identify index \
  --refs ~/my-collection/ \
  --out my-index.npz
```

This CLIP-encodes every image and saves the embeddings to a single `.npz` file (~300 KB for 150 references). On Apple Silicon MPS this is ~50 ms per image — 150 images takes about 8 seconds.

**Output**: `my-index.npz` containing `embeddings`, `names`, `filenames`.

### Step 3 — verify the index (5 sec)

```bash
python3 -m data_label_factory.identify verify --index my-index.npz
```

Self-tests every reference: each one should match itself as the top-1 result. Reports:

- **Top-1 self-identification rate** (should be 100%)
- **Most-confusable pairs** — references with high mutual similarity (visually similar items the model might confuse at runtime)
- **Margin analysis** — the gap between "correct match" and "best wrong match" cosine scores.
  **This is the strongest predictor of live accuracy.**

**Margin guidelines:**

| Median margin | What it means | Action |
|---|---|---|
| **> 0.3** | Strong separation, live accuracy will be excellent | Ship it |
| **0.1 – 0.3** | Medium separation, expect some confusion on visually similar items | Consider Step 4 |
| **< 0.1** | References look too similar to off-the-shelf CLIP | **Run Step 4** (fine-tune) |

### Step 4 (OPTIONAL) — fine-tune the retrieval head (5–15 min)

If the verify output shows a median margin below 0.1, your domain (Yu-Gi-Oh! cards, MTG cards, similar-looking product variants, …) is confusing generic CLIP. Fix it with a contrastive fine-tune:

```bash
python3 -m data_label_factory.identify train \
  --refs ~/my-collection/ \
  --out my-projection.pt \
  --epochs 12
```

**What this does:**

- Loads frozen CLIP ViT-B/32
- Trains a small **projection head** (~400k params) on top of CLIP features
- Uses **K-cards-per-batch sampling** (16 distinct classes × 4 augmentations = 64-image batches)
- Loss: **SupCon** (Khosla et al. 2020) — pulls augmentations of the same class together, pushes different classes apart
- Augmentations: random crop, rotation ±20°, color jitter, perspective warp, Gaussian blur, occasional grayscale
- Output: a **1.5 MB `.pt` file** containing the projection head weights

**Reference run** (150-class set, M4 Mac mini, MPS): 12 epochs in ~6 min. Margin improvement: 0.07 → 0.36 (5× wider).

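To make the training objective concrete, here is a minimal pure-Python sketch of a SupCon-style loss and the 16 × 4 batch layout described above. It is not the code in `train.py`: the function names `supcon_loss` and `sample_batch`, the temperature of 0.1, and the toy 2-D embeddings are assumptions for illustration, and a real run would operate on projected CLIP features in torch.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive.

    For each anchor i, every other sample with the same label is a positive;
    the denominator runs over all samples except the anchor itself.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def sample_batch(class_names, k_classes=16, views_per_class=4, rng=random):
    """Pick k distinct classes and list (class, view_index) pairs for each.

    Each pair stands for one augmented view of that class's reference image.
    """
    chosen = rng.sample(class_names, k_classes)
    return [(name, view) for name in chosen for view in range(views_per_class)]


# Two tight clusters: correct labels give a much lower loss than shuffled ones.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(supcon_loss(toy, [0, 0, 1, 1]))
print(supcon_loss(toy, [0, 1, 0, 1]))
print(len(sample_batch([f"card_{i}" for i in range(150)])))
```

The loss falls as views of the same class move closer together relative to everything else in the batch, which is the same quantity the verify step reports as a wider margin.
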
Then re-build the index with the projection head:

```bash
python3 -m data_label_factory.identify index \
  --refs ~/my-collection/ \
  --out my-index.npz \
  --projection my-projection.pt
```

And re-verify to confirm the margin actually widened:

```bash
python3 -m data_label_factory.identify verify --index my-index.npz
```

### Step 5 — serve it as an HTTP endpoint (instant)

```bash
python3 -m data_label_factory.identify serve \
  --index my-index.npz \
  --refs ~/my-collection/ \
  --projection my-projection.pt \
  --port 8500
```

This starts a FastAPI server with:

- `POST /api/falcon` — multipart `image` + `query` → JSON response in the same shape as `mac_tensor`'s `/api/falcon` endpoint, so it's a drop-in replacement for any client that talks to mac_tensor (including the data-label-factory `web/canvas/live` UI).
- `GET /refs/` — serves your reference images as a static mount so a browser UI can display "this is what the model thinks you're showing".
- `GET /health` — JSON status with index size, projection state, request counter, etc.

**Point the live tracker UI at it:**

```bash
# In web/.env.local
FALCON_URL=http://localhost:8500/api/falcon
```

Then open `http://localhost:3030/canvas/live` and click **Use Webcam**.

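Behind the endpoint, identification conceptually reduces to a cosine nearest-neighbour search over the index. The sketch below shows that idea in pure Python; it is not the code in `serve.py`. The function name `identify`, the toy 3-D vectors, and the returned `(name, score, margin)` tuple are assumptions for illustration, and the real index holds CLIP embeddings under the `embeddings` and `names` keys of the `.npz` file.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, embeddings, names):
    """Return (best_name, best_score, margin) for one query embedding.

    margin is the gap between the best and second-best cosine scores,
    or None when the index holds a single reference.
    """
    scored = sorted(
        ((cosine(query, emb), name) for emb, name in zip(embeddings, names)),
        reverse=True,
    )
    best_score, best_name = scored[0]
    margin = best_score - scored[1][0] if len(scored) > 1 else None
    return best_name, best_score, margin


# Toy index: three references on orthogonal axes (stand-ins for CLIP vectors).
names = ["Dark Magician", "Blue Eyes White Dragon", "Exodia The Forbidden One"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A slightly noisy view of the first reference.
print(identify([0.9, 0.1, 0.0], embeddings, names))
```

A small margin on a live frame is a hint that the top match is not trustworthy, for the same reason a small median margin in the verify step predicts confusion.
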

---

## Concrete examples

### Trading cards (the original use case)

```bash
# Step 1: download reference images via the gather command
data_label_factory gather --project projects/yugioh.yaml --max-per-query 1
# → produces ~/data-label-factory/yugioh/positive/cards/*.jpg

# Step 2-5: build, verify, train, serve
python3 -m data_label_factory.identify index --refs ~/data-label-factory/yugioh/positive/cards/ --out yugioh.npz
python3 -m data_label_factory.identify verify --index yugioh.npz
python3 -m data_label_factory.identify train --refs ~/data-label-factory/yugioh/positive/cards/ --out yugioh_proj.pt
python3 -m data_label_factory.identify index --refs ~/data-label-factory/yugioh/positive/cards/ --out yugioh.npz --projection yugioh_proj.pt
python3 -m data_label_factory.identify serve --index yugioh.npz --refs ~/data-label-factory/yugioh/positive/cards/ --projection yugioh_proj.pt
```

### Album covers ("Shazam for vinyl")

```bash
# Get reference images from MusicBrainz cover art archive (one per album)
mkdir ~/my-vinyl
# ... drop in jpgs named after the album ...

python3 -m data_label_factory.identify index --refs ~/my-vinyl --out vinyl.npz
python3 -m data_label_factory.identify serve --index vinyl.npz --refs ~/my-vinyl
# Hold up a record sleeve to your webcam → get the album back
```

### Industrial parts catalog ("which screw is this?")

```bash
mkdir ~/parts
# Drop in one studio shot per part: m3_bolt_10mm.jpg, hex_nut_5mm.jpg, ...

python3 -m data_label_factory.identify index --refs ~/parts --out parts.npz
python3 -m data_label_factory.identify train --refs ~/parts --out parts_proj.pt --epochs 20
python3 -m data_label_factory.identify index --refs ~/parts --out parts.npz --projection parts_proj.pt
python3 -m data_label_factory.identify serve --index parts.npz --refs ~/parts --projection parts_proj.pt
```

### Plant species ID

Same loop with reference images keyed by species name. You don't need PlantNet's scale to be useful for **your** garden.

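All of these examples rely on the filename carrying the label. A rough sketch of the naming rules from Step 1 follows. The regex for set-code prefixes, the helper name `parse_reference_filename`, per-word capitalization as the title-casing rule, and the upper-casing of rarity codes other than `pscr` are all assumptions; the actual parser in the package may differ.

```python
import re
from pathlib import Path

# Rarity suffixes listed in the naming rules.
RARITY_CODES = {"pscr", "scr", "ur", "sr", "op", "utr", "cr", "ea", "gmr"}

# Only the `pscr` display form is documented; other codes are upper-cased
# here as an assumption.
RARITY_DISPLAY = {"pscr": "PScR"}

# Assumed shape of a set-code prefix such as `LOCH-JP001_`.
SET_CODE_PREFIX = re.compile(r"^[A-Z0-9]+-[A-Z]*\d+_")


def parse_reference_filename(filename):
    """Return (display_name, rarity_or_None) for a reference image filename."""
    stem = Path(filename).stem                 # drop directory and extension
    stem = SET_CODE_PREFIX.sub("", stem)       # strip an optional set code
    tokens = [t for t in stem.split("_") if t]

    rarity = None
    # Treat the last token as a rarity only if something is left for the name.
    if len(tokens) > 1 and tokens[-1].lower() in RARITY_CODES:
        code = tokens.pop().lower()
        rarity = RARITY_DISPLAY.get(code, code.upper())

    # Underscores become spaces; each word is capitalized.
    name = " ".join(t.capitalize() for t in tokens)
    return name, rarity


print(parse_reference_filename("LOCH-JP001_dark_magician.jpg"))
print(parse_reference_filename("dark_magician_pscr.jpg"))
print(parse_reference_filename("m3_bolt_10mm.jpg"))
```

Because the label is derived purely from the filename, renaming a file and re-running `index` is enough to relabel a reference.
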

---

## Optional: live price feed (`scrape_prices` + UI integration)

If your reference images correspond to items with a market price (trading cards, collectibles, parts, etc), you can plug in a live price feed and have the live tracker UI show the price next to each identified item.

### How it works

```
scripts/scrape_prices_.py                  ← per-site adapter
        ↓
card_prices.json                           ← keyed by set code, contains JPY/USD/etc
        ↓
data_label_factory.identify serve --prices …   ← server loads it at startup
        ↓                                        + fetches live FX rate from open.er-api.com
{detection, price: {median, currency, usd_median}}   ← surfaced per detection
        ↓
web/canvas/live UI                         ← shows USD prominently in the Active Tracks sidebar
                                             + a Top Valuable Cards panel sorted by USD descending
```

### Built-in scraper: yuyu-tei.jp (Japanese OCG market)

```bash
python3 -m data_label_factory.identify scrape_prices \
  --refs ~/my-cards/ \
  --out card_prices.json \
  --site yuyu-tei
```

This is the **example adapter**. Add new sites by implementing a `_scrape_(prefixes)` function in `scrape_prices.py` and wiring it into the dispatcher at the bottom of the file. The output schema is site-agnostic.

### Live tracker UI features when prices are loaded

- **Per-detection price line** in the Active Tracks sidebar — USD prominently, original currency underneath
- **Top Valuable Cards panel** — fetched from a new `/api/top-prices` endpoint, sorted by USD descending, showing the N most valuable items in your set
- **Live FX rate** — JPY/USD conversion fetched once at server startup from `open.er-api.com` (free, no auth)
- **Filename → name lookup** — server builds a ` → English display name` map from your reference filenames so the top-prices panel can show human-readable names alongside the codes

### Add to Deck (localStorage-backed deck builder)

The live tracker also includes a **`+ Add to Deck`** button on each active track.
Clicking it:

- Adds the identified card to a local deck (browser localStorage, no server state)
- Triggers a green flash + scale animation on the button
- Pulses the deck panel border bright emerald so you can see the card landed
- Updates the running deck total in USD
- Persists across page refreshes
- Lets you remove individual items or clear the whole deck

This is a generic feature that works for any retrieval set — useful for "build a list of items I've identified" workflows beyond just card collecting (inventory taking, parts pulling, plant logging, …).

---

## The data-label-factory loop, applied to retrieval

```
gather             (web search / API / phone photos)
   ↓
label              (the filename IS the label — naming convention does the work)
   ↓
verify             (data_label_factory.identify verify — self-test)
   ↓
train (optional)   (data_label_factory.identify train — fine-tune projection head)
   ↓
deploy             (data_label_factory.identify serve — HTTP endpoint)
   ↓
review             (data-label-factory web/canvas/live — sees this server as a falcon backend)
```

Same loop, same conventions, just **retrieval instead of detection**.

---

## Files in this folder

```
identify/
├── __init__.py        package marker + lazy import
├── __main__.py        enables `python3 -m data_label_factory.identify `
├── cli.py             argparse dispatcher for the four commands
├── train.py           Step 4: contrastive fine-tune
├── build_index.py     Step 2: CLIP encode + save index
├── verify_index.py    Step 3: self-test + margin analysis
├── serve.py           Step 5: FastAPI HTTP endpoint
└── README.md          you are here
```

---

## Why this is **lazy-loaded** (not always-on)

The base `data_label_factory` package only depends on `pyyaml`, `pillow`, and `requests` — kept lightweight so users running the labeling pipeline don't pay any ML import cost. The `identify` subpackage adds heavy deps (torch, clip, ultralytics, fastapi) and is only loaded when explicitly invoked via `python3 -m data_label_factory.identify `. Same opt-in pattern as the `runpod` subpackage.
Install the heavy deps with the optional extra:

```bash
pip install -e ".[identify]"
```
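
One common way to get this kind of opt-in loading in Python is a module-level `__getattr__` (PEP 562), which defers an import until the attribute is first touched. The sketch below demonstrates the mechanism on a synthetic module, with the stdlib module `colorsys` standing in for a heavy subpackage. The helper `make_lazy_package` and the name `demo_pkg` are hypothetical, and the package's real `__init__.py` may implement lazy loading differently.

```python
import importlib
import types


def make_lazy_package(name, lazy_map):
    """Build a module whose listed attributes are imported on first access.

    lazy_map maps attribute name -> importable module path.
    """
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_map:
            module = importlib.import_module(lazy_map[attr])
            setattr(pkg, attr, module)  # cache so later lookups skip this hook
            return module
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Python consults `__getattr__` in the module's namespace (PEP 562).
    pkg.__getattr__ = __getattr__
    return pkg


# `colorsys` is a stand-in for something expensive such as torch.
demo_pkg = make_lazy_package("demo_pkg", {"identify": "colorsys"})

print("identify" in vars(demo_pkg))   # nothing imported into the package yet
heavy = demo_pkg.identify             # first access triggers the import
print("identify" in vars(demo_pkg))   # now cached on the package
print(heavy.__name__)
```

With this pattern, `import data_label_factory`-style imports stay cheap, and the cost of the heavy dependencies is paid only by code paths that actually reach the subpackage.
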