CLAUDE.md — instructions for Claude Code

This file gives Claude Code (and any other AI coding assistant) the context it needs to help a user run, debug, and extend data-label-factory. If you're a human reading this, you probably want README.md instead.

What this repo is

data-label-factory is a generic auto-labeling pipeline for vision datasets that runs entirely on a 16 GB Apple Silicon Mac. Pipeline stages:

gather → filter → label → verify → review

gather — pulls images from DuckDuckGo, Wikimedia, Openverse, or YouTube
filter — image-level YES/NO via a VLM (Qwen 2.5-VL or Gemma 4)
label — bounding-box grounding via Falcon Perception
verify — per-bbox YES/NO via the same VLM
review — Next.js + HTML5 Canvas web UI in web/

The pipeline is target-agnostic: the object class is configured by a YAML file in projects/. The reference projects are projects/drones.yaml (fiber-optic combat drones) and projects/stop-signs.yaml (smoke test).

How to help the user get started

When a user points you at this repo, the first thing they need is to install the CLI and start a backend. The flow:

# 1. Install the CLI (registers `data_label_factory` on PATH)
pip install -e .

# 2. Start a VLM backend. Recommend qwen as the default — it's smaller,
#    faster, and the gemma backend has known reliability issues for batch
#    YES/NO calls (see Known gotchas below).
pip install mlx-vlm
python3 -m mlx_vlm.server \
  --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit \
  --port 8291

# 3. Verify the backend is reachable
data_label_factory status

Once status shows the backend is alive, the user can inspect a project and run a tiny smoke test:

data_label_factory project --project projects/stop-signs.yaml
data_label_factory filter  --project projects/stop-signs.yaml --backend qwen --limit 5

If they want to label their own object class, have them copy projects/stop-signs.yaml, edit project_name, target_object, buckets, and falcon_queries, then run the full pipeline.

CLI reference

data_label_factory <subcommand>

  status                       Check if backends (qwen, gemma) are alive
  project --project P          Print a project YAML for inspection
  gather  --project P          Search the web and download images per bucket
  filter  --project P          Image-level YES/NO classification (--backend qwen|gemma)
  label   --project P          Falcon Perception bbox grounding via mac_tensor
  pipeline --project P         Full chain: gather → filter (label/verify TBD)
  list                         Show timestamped experiment dirs

Common flags:

--backend qwen|gemma (filter, pipeline) — overrides the project YAML
--limit N (filter, label) — process at most N images, useful for smoke tests
--experiment NAME — reuse an existing experiment dir instead of creating one
--max-per-query N (gather, pipeline) — DDG can rate-limit above ~50

Every command writes a timestamped folder under experiments/ (relative to the user's CWD) with the config, prompts, raw model answers, and JSON outputs.
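
When helping a user find the output of their last run, a small helper can pick the newest folder under experiments/. This is a minimal sketch, not part of the package: `latest_experiment` is a hypothetical name, and it sorts by modification time because the exact timestamp format used in the folder names is not assumed here.

```python
from pathlib import Path
from typing import Optional


def latest_experiment(root: Path) -> Optional[Path]:
    """Return the most recently modified subdirectory of `root`, or None."""
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    # Sort by modification time rather than by name, so the sketch does not
    # depend on how the timestamp is encoded in the folder name.
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)
```

Calling `latest_experiment(Path.cwd() / "experiments")` mirrors the fact that the folder is created relative to the user's CWD.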

Project YAML schema

A project YAML is the only thing a user writes to onboard a new object class. Required fields:

project_name: fire-hydrants                # used in experiment dir names
target_object: "fire hydrant"              # templated as {target_object} in prompts
description: |
  One-paragraph human description.
data_root: ~/data-label-factory/fire-hydrants

# Cloudflare R2 (optional — only used if you want to push images to cloud)
r2:
  bucket: my-bucket
  raw_prefix: raw/
  labels_prefix: labels/

# Gather plan: bucket → list of image-search queries
buckets:
  positive/clear_view:
    queries: ["red fire hydrant", "yellow fire hydrant"]
  negative/other_street_objects:
    queries: ["mailbox", "parking meter"]
  background/empty_streets:
    queries: ["empty city street"]

# What Falcon Perception will look for during the label stage
falcon_queries:
  - "fire hydrant"
  - "red metal post"

# Default backend per stage. CLI --backend overrides.
backends:
  filter: qwen
  label:  gemma
  verify: qwen

Important rules when editing project YAMLs:

falcon_queries should be visually grounded — Falcon is a perception model, not a reasoner. "fire hydrant" works; "object representing emergency water access" doesn't.
The target_object string is interpolated into all default prompts via Python's str.format(). Don't put curly braces in it.
The bucket prefixes positive/*, negative/*, distractor/*, and background/* are a naming convention and are not enforced; the gather and filter scripts treat all buckets uniformly.

The full prompt templates live in data_label_factory/project.py under DEFAULT_PROMPTS. Override per-project via a top-level prompts: section.
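
The rules above can be checked mechanically before a run. The following is a rough sketch that works on an already-parsed YAML dict; `check_project` and `REQUIRED_FIELDS` are hypothetical names, not part of data_label_factory, and the choice of which fields count as required is an assumption based on the four fields users are told to edit.

```python
# Assumption: these four fields are the ones a new project must fill in.
REQUIRED_FIELDS = ("project_name", "target_object", "buckets", "falcon_queries")


def check_project(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not cfg.get(field):
            problems.append(f"missing or empty field: {field}")
    # target_object is interpolated into prompts, so braces are flagged.
    target = str(cfg.get("target_object", ""))
    if "{" in target or "}" in target:
        problems.append("target_object contains curly braces")
    # Every bucket should carry at least one search query.
    for name, bucket in (cfg.get("buckets") or {}).items():
        if not (isinstance(bucket, dict) and bucket.get("queries")):
            problems.append(f"bucket {name!r} has no queries")
    return problems
```

The function only reports problems and never raises, so it can be run on a half-edited YAML while the user is still filling it in.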

Web UI (`web/`)

Next.js + Tailwind v4 + HTML5 Canvas. Two routes:

/ — older shadcn-based grid review with per-bbox approve/reject buttons
/canvas — newer pure HTML5 Canvas viewer (drag to pan, scroll to zoom, click a bbox to inspect, ←→ to navigate). This is the recommended one.

To start it:

cd web
npm install            # first time only
PORT=3030 npm run dev
# open http://localhost:3030/canvas

The web UI reads labels from R2 by default. Configure credentials in web/.env.local (see web/.env.example):

R2_ENDPOINT_URL=https://<account>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=...
R2_SECRET_ACCESS_KEY=...
R2_BUCKET=your-bucket

If R2 isn't configured, the web UI will throw on startup with a clear error message. Never commit .env.local. It's gitignored at both web/.gitignore and the root .gitignore.

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `QWEN_URL` | `http://localhost:8291` | Where the `mlx_vlm.server` lives |
| `QWEN_MODEL_PATH` | `mlx-community/Qwen2.5-VL-3B-Instruct-4bit` | Model id sent in the OpenAI request |
| `GEMMA_URL` | `http://localhost:8500` | Where `mac_tensor` lives (also serves Falcon) |

These can be exported in the user's shell or set inline:

QWEN_URL=http://192.168.1.244:8291 data_label_factory status
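
For debugging "which backend is it actually talking to?", the lookup can be reproduced in a few lines. This is an illustrative sketch only: `resolve_settings` and `host_and_port` are hypothetical helpers, and the package's own code may resolve these variables differently. The defaults are copied from the table above.

```python
import os
from urllib.parse import urlsplit

# Defaults as documented in the environment variable table.
DEFAULTS = {
    "QWEN_URL": "http://localhost:8291",
    "QWEN_MODEL_PATH": "mlx-community/Qwen2.5-VL-3B-Instruct-4bit",
    "GEMMA_URL": "http://localhost:8500",
}


def resolve_settings(env=None) -> dict:
    """Return each setting, preferring the environment over the default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def host_and_port(url: str):
    """Split a backend URL into (hostname, port) for a reachability check."""
    parts = urlsplit(url)
    return parts.hostname, parts.port
```

Passing a plain dict as `env` makes the lookup easy to test without touching the real shell environment.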

Optional: open-set identification (`data_label_factory.identify`)

If a user wants to identify which one of N known things they're holding up to a webcam (rather than detect arbitrary objects), point them at the identify subpackage. It's a CLIP retrieval index — needs only 1 image per class, no training required.

pip install -e ".[identify]"
python3 -m data_label_factory.identify index  --refs ~/my-things/ --out my.npz
python3 -m data_label_factory.identify verify --index my.npz
# (optional) python3 -m data_label_factory.identify train --refs ~/my-things/ --out my-proj.pt
python3 -m data_label_factory.identify serve  --index my.npz --refs ~/my-things/
# → web/canvas/live talks to it via FALCON_URL=http://localhost:8500/api/falcon

The full blueprint for any image set is at data_label_factory/identify/README.md. This is the right tool for "trading cards / products / album covers / parts catalog identification" use cases. The base data_label_factory pipeline is for closed-set bbox detection.
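
To explain the idea of a retrieval index to a user, the core lookup can be sketched as a nearest-neighbour search over one reference embedding per class. This is a toy illustration, not the identify subpackage's implementation: the function names, the `min_score` rejection threshold, and its 0.8 value are assumptions, and in practice the vectors would come from a CLIP image encoder rather than being written by hand.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query, index: dict, min_score: float = 0.8):
    """Return (label, score) for the closest reference, or (None, score)
    when nothing in the index is similar enough (hypothetical threshold)."""
    if not index:
        return None, 0.0
    label, score = max(
        ((name, cosine(query, ref)) for name, ref in index.items()),
        key=lambda pair: pair[1],
    )
    return (label if score >= min_score else None), score
```

With `index = {"card_a": [1, 0, 0], "card_b": [0, 1, 0]}`, a query close to the first reference returns `card_a`, while a query halfway between the two falls below the threshold and is reported as unknown.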

Optional GPU path

If a user has more than ~10k images and wants the run to finish in minutes instead of an hour, point them at the RunPod path:

pip install -e ".[runpod]"
export RUNPOD_API_KEY=rpa_xxxxxxxxxx
python3 -m data_label_factory.runpod pipeline \
    --project projects/<theirs>.yaml --gpu L40S \
    --publish-to <user>/<dataset>

The runpod subpackage is opt-in — data_label_factory itself never imports it, so users without RUNPOD_API_KEY are not affected. Full docs at data_label_factory/runpod/README.md. Always smoke-test with --limit 5 locally before kicking off a paid pod run.

Reference dataset

This pipeline produced waltgrace/fiber-optic-drones on Hugging Face: 2,260 images, 8,759 Falcon bboxes, of which 5,114 (58%) are Qwen-verified, across five categories. To compare their own labeling run against the reference, a user can load it with load_dataset("waltgrace/fiber-optic-drones").

A labels-only release (no pixels, Apache 2.0) is at waltgrace/fiber-optic-drones-labels.

Known gotchas

Gemma /api/chat_vision is unreliable for batch YES/NO prompts. When the chained agent can't decide whether to call Falcon as a tool, it can stall or take 60+ seconds. For filter and verify, prefer --backend qwen. Gemma is rock solid for the label stage, which uses /api/falcon directly; that path is independent of the chained agent.
DDG image search rate-limits hard above ~100 results per query. Use --max-per-query 30 to 50 for safety. If you need more volume, lean on Wikimedia + YouTube frame extraction in gather.py.
The generic verify subcommand is a TODO. The original drone-specific runpod_falcon/verify_vlm.py (in the parent auto-research/ workspace, not in this published repo) has the working implementation. The generic wrapper is pending a small refactor.
The gather stage has optional dependencies. If a user hits "module not found" for duckduckgo_search or yt_dlp, install the extras: pip install -e ".[gather]".
Falcon Perception requires task="segmentation", NOT task="detection". This is hardcoded in the mac_tensor server, but worth knowing if a user asks why detection mode returns empty bboxes.
macOS Screen Recording filenames sometimes contain non-breaking spaces (U+00A0) instead of regular spaces. Use shell globs (Screen*Recording*.mov) instead of literal filenames if a user is feeding video into the pipeline.
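
When a script rather than a shell glob has to locate such a file, the non-breaking-space gotcha above can be handled by normalizing names before comparing them. A minimal sketch, with `normalize_spaces` and `find_by_name` as hypothetical helper names that are not part of the package:

```python
from pathlib import Path

NBSP = "\u00a0"  # non-breaking space


def normalize_spaces(name: str) -> str:
    """Replace non-breaking spaces with regular spaces."""
    return name.replace(NBSP, " ")


def find_by_name(folder: Path, wanted: str) -> list:
    """Return files in `folder` whose name equals `wanted` once both
    sides have non-breaking spaces mapped to regular spaces."""
    target = normalize_spaces(wanted)
    return sorted(p for p in folder.iterdir() if normalize_spaces(p.name) == target)
```

The files on disk are left untouched; only the comparison is normalized, so the returned paths still carry the original names.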

What NOT to do (safety rails)

Never commit .env, .env.local, web/.env.local, or any file containing R2 / HF / API credentials. The .gitignore files block these by default; do not override.
Never push to GitHub or HF without explicit user permission, even if the user has already authenticated with gh or hf auth. Always ask first.
Never run data_label_factory pipeline without --limit N first for an unfamiliar project YAML. The full pipeline can run for hours and incur DDG rate limits or fill local disk.
Don't delete experiments/ without checking first. Each subdir has a README.md with the run config and may be the user's only record of a run.
Don't modify pyproject.toml versions or dependencies without the user asking. The pinned versions are deliberate.

Repo layout

data-label-factory/
├── README.md                           ← user-facing install + walkthrough
├── CLAUDE.md                           ← this file
├── pyproject.toml                      ← pip-installable, entry: data_label_factory
├── setup.py                            ← shim for older pip
├── data_label_factory/                 ← Python package
│   ├── __init__.py                     ← exports load_project, ProjectConfig
│   ├── cli.py                          ← main() with all subcommands
│   ├── project.py                      ← YAML loader + ProjectConfig + DEFAULT_PROMPTS
│   ├── experiments.py                  ← timestamped run dirs
│   └── gather.py                       ← image search (DDG/Wikimedia/YouTube)
├── projects/                           ← project YAMLs
│   ├── drones.yaml                     ← reference: fiber-optic drones
│   └── stop-signs.yaml                 ← smoke test: stop signs
└── web/                                ← Next.js review UI
    ├── app/canvas/page.tsx             ← canvas viewer (recommended)
    ├── app/page.tsx                    ← shadcn grid view
    ├── components/BboxCanvas.tsx       ← responsive HTML5 canvas component
    └── lib/r2.ts                       ← R2 credentials read from env vars

When the user is stuck

Common questions and how to handle them:

| User says | Likely cause | What to do |
| --- | --- | --- |
| "filter is hanging" | Gemma backend was selected and `/api/chat_vision` stalled | Switch to `--backend qwen` |
| "no images found" | Gather hit DDG rate limit, or `data_root` is wrong | Check `data_label_factory project --project P` for the resolved `data_root` and verify it matches what's on disk |
| "ImportError: No module named 'duckduckgo_search'" | Optional gather extra not installed | `pip install -e ".[gather]"` |
| "ConnectionRefusedError on 8291" | Qwen backend isn't running | Start it: `python3 -m mlx_vlm.server --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit --port 8291` |
| "I want to label X" where X isn't in the references | Need a new project YAML | Copy `projects/stop-signs.yaml`, edit four fields, run `project` to inspect, then `pipeline --limit 10` to smoke test |
| "the canvas web UI shows blank" | R2 credentials not set in `web/.env.local` | Ask user for their R2 credentials, set them in `web/.env.local`, restart `npm run dev`. If they don't have R2, point them at the local image cache instead |

Quick task recipes for Claude Code

When a user asks for one of these, here's the TL;DR:

"Help me label fire hydrants"

cp projects/stop-signs.yaml projects/fire-hydrants.yaml
Edit project_name, target_object: "fire hydrant", the buckets/queries, and falcon_queries
data_label_factory project --project projects/fire-hydrants.yaml to verify
data_label_factory gather --project projects/fire-hydrants.yaml --max-per-query 30
data_label_factory filter --project projects/fire-hydrants.yaml --backend qwen --limit 20 (smoke test)
If smoke test passes, drop --limit for the full run

"Show me the dataset in a browser"

Make sure R2 credentials are in web/.env.local
cd web && npm install && PORT=3030 npm run dev
Open http://localhost:3030/canvas

"How do I check if my labels are good?"

After running label + verify, look at experiments/<latest>/verify_qwen/verified.json
The summary block has yes_rate — anything below 50% means your Falcon queries are too noisy or your target_object is too narrow
Use the canvas web UI (/canvas) to spot-check rejected bboxes — if Qwen is rejecting things you'd accept, the prompt needs tuning in data_label_factory/project.py:DEFAULT_PROMPTS
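
If the summary block is missing or a user wants to recompute the number, the yes rate is just the share of YES answers. The layout of verified.json is not documented here, so this sketch takes a plain list of answer strings; `yes_rate` as a function name and the prefix matching on "YES" are assumptions.

```python
def yes_rate(verdicts) -> float:
    """Fraction of verdicts that start with YES (case-insensitive).

    Returns 0.0 for an empty input instead of dividing by zero.
    """
    answers = [str(v).strip().upper() for v in verdicts]
    if not answers:
        return 0.0
    return sum(a.startswith("YES") for a in answers) / len(answers)
```

As a sanity check against the reference dataset, 5,114 verified bboxes out of 8,759 gives 5114 / 8759 ≈ 0.58, which is above the 50% line mentioned above.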

"Compare my run to the reference dataset"

from datasets import load_dataset
ref = load_dataset("waltgrace/fiber-optic-drones-labels", split="train")
# ref[i]["bboxes"] is a struct of lists, not a list of dicts
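
Because the bboxes come back as a struct of lists, per-box code is often easier to write after turning each record into a list of dicts. A small sketch of that conversion follows; `rows_from_struct` is a hypothetical helper, and the field names in the usage note are made up for illustration rather than taken from the dataset.

```python
def rows_from_struct(struct: dict) -> list:
    """Convert {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}]."""
    keys = list(struct)
    # All columns must describe the same number of boxes.
    lengths = {len(struct[k]) for k in keys}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    return [dict(zip(keys, values)) for values in zip(*(struct[k] for k in keys))]
```

For example, `rows_from_struct({"label": ["drone", "spool"], "score": [0.9, 0.4]})` yields one dict per box, each with a `label` and a `score` key.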