---
title: UHC Medical Policy Chatbot
emoji: 🏥
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---

# UHC Medical Policy Chatbot

A RAG-powered chatbot that answers questions about UnitedHealthcare (UHC) medical policies. Built for doctors, hospital staff, and insurance coordinators who need accurate, cited answers about coverage criteria, CPT/HCPCS codes, and medical necessity requirements.

## Hosted Chatbot

**URL:** https://huggingface.co/spaces/mxp1404/uhc-policy-chatbot

## How to Use: Step by Step

1. Open the link above in your browser.
2. Wait for the model to load (the first visit takes ~30 seconds for MedEmbed to initialize).
3. Type your question in the chat input at the bottom, for example:
   - "Is bariatric surgery covered for BMI over 40?"
   - "What documentation is needed for gender-affirming surgery?"
   - "Are intrapulmonary percussive ventilation devices covered for home use?"
4. The chatbot will search relevant policy chunks, then stream an answer with citations.
5. Click "📚 Sources" below each answer to see the exact policy sections used.
6. Enable "🔊 Read answers aloud" in the sidebar to hear answers via Kokoro TTS.
7. Use "🗑️ Clear conversation" in the sidebar to start a new session.

The chatbot answers only from official UHC policy documents; it will say it does not have enough information rather than guess.


## Architecture

### High-Level Design (HLD)

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│   Browser   │────▶│  Streamlit App (HuggingFace Spaces)          │
│   (User)    │◀────│                                              │
└─────────────┘     │  ┌─────────────┐    ┌─────────────────────┐  │
                    │  │ MedEmbed    │    │ Groq API            │  │
                    │  │ (1024-dim)  │    │ Llama 3.1 8B        │  │
                    │  │ cached RAM  │    │ 560 tok/s           │  │
                    │  └──────┬──────┘    └──────▲──────────────┘  │
                    │         │                  │                 │
                    │         ▼                  │                 │
                    │  ┌─────────────┐   context + query           │
                    │  │ Qdrant Cloud│───────────┘                 │
                    │  │ (vectors)   │                             │
                    │  └─────────────┘                             │
                    └──────────────────────────────────────────────┘
```

Data flow for each query:

1. The user types a question in the Streamlit chat interface.
2. The query is encoded into a 1024-dimensional vector using MedEmbed (loaded once, cached in memory).
3. The vector is sent to Qdrant Cloud for similarity search, which returns the top-K policy chunks with metadata.
4. Retrieved chunks are deduplicated, scored with section priority boosts, and formatted into a context block.
5. The context + query + system prompt are sent to the Groq API (Llama 3.1 8B) for answer generation.
6. The response is streamed token-by-token back to the user with source citations.
7. If TTS is enabled, the response text is synthesized into audio using Kokoro ONNX and played in-browser.
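The per-query flow above can be sketched with injected callables standing in for MedEmbed, Qdrant, and Groq. All names here are illustrative, not the app's actual API:

```python
from typing import Callable, Dict, List


def answer_query(
    query: str,
    encode: Callable[[str], List[float]],              # stand-in for MedEmbed
    search: Callable[[List[float], int], List[Dict]],  # stand-in for Qdrant
    generate: Callable[[str, str], str],               # stand-in for Groq
    top_k: int = 5,
) -> str:
    """Minimal sketch of the flow: encode -> search -> format context -> generate."""
    vector = encode(query)
    hits = search(vector, top_k)
    # Label each chunk with its policy/section so the LLM can cite sources.
    context = "\n\n".join(
        f"[{h['policy']} / {h['section']}]\n{h['text']}" for h in hits
    )
    return generate(context, query)
```

In the real app the deduplication and section boosting described below happen between the search and format steps.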

### Low-Level Design (LLD)

#### Project Structure

```
uhc/
├── app.py                          # Streamlit web UI entry point
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variable template
│
├── chatbot/                        # Chatbot application layer
│   ├── config.py                   # Centralized config (LLM, retrieval, env vars)
│   ├── retriever.py                # PolicyRetriever: MedEmbed + Qdrant wrapper
│   ├── llm_groq.py                 # Groq API client (deployed)
│   ├── llm.py                      # Ollama client (local dev)
│   ├── prompts.py                  # System prompt, context formatting, deduplication
│   ├── tts.py                      # Kokoro ONNX text-to-speech
│   └── cli.py                      # CLI interface (local dev)
│
├── embedding/                      # Embedding pipeline
│   └── scripts/
│       ├── config.py               # Embedding model + Qdrant connection config
│       ├── embed_chunks.py         # Generate embeddings from RAG chunks
│       ├── store_qdrant.py         # Upsert embeddings into Qdrant with payload indexes
│       ├── search.py               # Standalone search CLI for testing
│       └── test_retrieval.py       # Batch retrieval evaluation (10 test cases)
│
├── tests/                          # Evaluation suite
│   └── eval_100.py                 # 100-prompt retrieval + LLM evaluation
│
└── scraper/                        # Data ingestion pipeline
    ├── download_policies.py        # Scrape PDFs from UHC website
    ├── extract_pdf_text.py         # PDF → structured sections with metadata
    ├── create_rag_chunks.py        # Section-aware semantic chunking
    └── data/processed/
        ├── extracted_sections.json # Extracted text per policy/section
        └── rag_chunks.json         # Final RAG chunks with metadata
```

#### Module Design

**chatbot/retriever.py – PolicyRetriever**

- Loads `abhinand/MedEmbed-large-v0.1` (1024-dim medical embeddings) via sentence-transformers
- Connects to Qdrant Cloud; supports both cloud and local Qdrant
- Encodes queries → cosine similarity search → returns `ChunkResult` dataclasses
- Filters out low-value sections (References, Application) that pollute results
- Section priority boosting (Coverage Rationale +0.04, Coverage Summary +0.03) so authoritative statements rank above clinical studies
- Retry logic with exponential backoff for transient Qdrant errors
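The filtering and boosting steps can be sketched as a pure rerank function. The constants match the values above; the dict shape of a hit is illustrative:

```python
SECTION_BOOSTS = {"Coverage Rationale": 0.04, "Coverage Summary": 0.03}
EXCLUDED_SECTIONS = {"References", "Application"}


def rerank(hits: list) -> list:
    """Drop low-value sections, apply priority boosts, and re-sort by score."""
    kept = [dict(h) for h in hits if h["section"] not in EXCLUDED_SECTIONS]
    for h in kept:
        h["score"] += SECTION_BOOSTS.get(h["section"], 0.0)
    return sorted(kept, key=lambda h: h["score"], reverse=True)
```

The small boost is enough to lift a Coverage Rationale chunk above a slightly higher-scoring Clinical Evidence chunk without overriding large similarity differences.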

**chatbot/prompts.py – Prompt Engineering**

- System prompt enforces: answer from context only, 2–4 bullet points, cite sources, coverage-awareness
- `deduplicate_chunks()` keeps the highest-scoring chunk per (policy, section) pair
- `format_context()` truncates each chunk to 800 chars at sentence boundaries and caps the total at 6,000 chars
- Coverage Rationale is explicitly marked as authoritative for coverage decisions
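A minimal sketch of the deduplication step, assuming chunks are plain dicts with `policy`, `section`, and `score` keys (the repo uses dataclasses):

```python
def deduplicate_chunks(chunks: list) -> list:
    """Keep only the highest-scoring chunk per (policy, section) pair."""
    best = {}
    for c in chunks:
        key = (c["policy"], c["section"])
        if key not in best or c["score"] > best[key]["score"]:
            best[key] = c
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```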

**chatbot/llm_groq.py – GroqClient**

- Uses the `groq` Python SDK with streaming chat completions
- Graceful rate-limit handling (Groq free tier: 250K TPM)
- Same `chat_stream()` / `chat()` interface as the Ollama client for interchangeability
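A sketch of what a streaming call looks like with the `groq` SDK. The `build_messages` helper and the exact history-trimming policy are illustrative, not the repo's code; `GROQ_API_KEY` is read from the environment:

```python
def build_messages(system_prompt, history, context, query):
    """Assemble the chat payload: system prompt, trimmed history, context + question."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history[-6:]  # last 3 user/assistant turns
    messages.append(
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    )
    return messages


def chat_stream(messages, model="llama-3.1-8b-instant"):
    from groq import Groq  # deferred so build_messages works without the SDK

    client = Groq()  # reads GROQ_API_KEY from the environment
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

Yielding token deltas lets the Streamlit UI render the answer incrementally via `st.write_stream`-style consumption.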

**chatbot/tts.py – Text-to-Speech**

- Uses Kokoro ONNX (82M-parameter model, ~300 MB)
- Auto-downloads model files from the Hugging Face Hub on first use
- Generates WAV audio from the LLM response text, played in-browser via `st.audio`
- Toggleable via a sidebar switch; disabled by default to save resources

**scraper/extract_pdf_text.py – PDF Extraction**

- Paragraph-level extraction using pdfplumber (not line-by-line)
- Robust header/footer/sidebar removal with regex patterns
- Structured metadata parsing: policy number, effective date, plan type, document type
- Table extraction support; skips boilerplate sections and HTML-disguised files
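The extraction pattern can be sketched as follows. The boilerplate regex is a hypothetical stand-in; the real scraper's patterns are tailored to UHC's page layout:

```python
import re

# Hypothetical header/footer patterns; the real scraper's regexes differ.
BOILERPLATE = re.compile(
    r"(?m)^\s*(Page \d+ of \d+|UnitedHealthcare [A-Za-z ]*Policy)\s*$"
)


def clean_page_text(text: str) -> str:
    """Strip repeated header/footer lines and collapse leftover blank lines."""
    cleaned = BOILERPLATE.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def extract_pages(pdf_path: str) -> list:
    import pdfplumber  # deferred import; only needed when actually extracting

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for image-only pages, hence the `or ""`.
        return [clean_page_text(page.extract_text() or "") for page in pdf.pages]
```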

**scraper/create_rag_chunks.py – Semantic Chunking**

- Section-aware chunking: different strategies per section type
  - Coverage Rationale → criteria-based splitting
  - Applicable Codes → table-aware chunking
  - Clinical Evidence → study-based splitting
  - Others → paragraph-aware with sentence-boundary overlap
- Rich metadata per chunk: policy name, section, plan type, page range, provider
- Deterministic chunk IDs for deduplication during re-indexing
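Deterministic IDs are typically derived by hashing stable chunk attributes, so re-running the pipeline upserts rather than duplicates. The exact key scheme here is illustrative, not the repo's actual format:

```python
import hashlib


def chunk_id(policy_name: str, section: str, index: int, text: str) -> str:
    """Stable ID: same inputs always hash to the same ID across pipeline runs."""
    key = f"{policy_name}|{section}|{index}|{text}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```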

**embedding/scripts/embed_chunks.py – Embedding Generation**

- Prepends metadata to chunk text before encoding for better retrieval
- Batch processing (32 chunks at a time) with GPU/MPS/CPU auto-detection
- Saves to `.npz` for efficient storage and reloading
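A sketch of the metadata-prefix-then-encode step, assuming chunks are dicts with `policy_name`, `section`, and `text` keys (the prefix format is illustrative):

```python
def enriched_text(chunk: dict) -> str:
    """Prepend metadata so the embedding also captures policy/section context."""
    return f"Policy: {chunk['policy_name']} | Section: {chunk['section']}\n{chunk['text']}"


def embed_chunks(chunks: list, out_path: str = "embeddings.npz"):
    import numpy as np
    from sentence_transformers import SentenceTransformer  # deferred import

    model = SentenceTransformer("abhinand/MedEmbed-large-v0.1")
    vectors = model.encode(
        [enriched_text(c) for c in chunks],
        batch_size=32,  # matches the batch size described above
        show_progress_bar=True,
    )
    np.savez(out_path, vectors=vectors)  # compact storage for later upsert
    return vectors
```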

**embedding/scripts/store_qdrant.py – Vector Storage**

- Creates the Qdrant collection with cosine distance
- Upserts embeddings with full metadata payloads
- Creates payload indexes on `section`, `policy_name`, `plan_type`, `doc_type`, and `provider` for efficient filtered search
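The storage step can be sketched with `qdrant-client`. The payload builder is pure; the connection details are placeholders, and the helper names are illustrative:

```python
INDEXED_FIELDS = ("section", "policy_name", "plan_type", "doc_type", "provider")


def build_payload(chunk: dict) -> dict:
    """Metadata stored alongside each vector; indexed fields enable filtered search."""
    payload = {field: chunk.get(field) for field in INDEXED_FIELDS}
    payload["text"] = chunk["text"]
    return payload


def store(vectors, chunks, collection="uhc_policies"):
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(url="https://<cluster-url>", api_key="<key>")
    client.create_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )
    for field in INDEXED_FIELDS:
        client.create_payload_index(collection, field_name=field, field_schema="keyword")
    client.upsert(
        collection,
        points=[
            PointStruct(id=i, vector=list(map(float, v)), payload=build_payload(c))
            for i, (v, c) in enumerate(zip(vectors, chunks))
        ],
    )
```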

## Edge Cases Handled

| Edge Case | Handling |
|---|---|
| Empty / whitespace query | Warning message, no API call |
| Qdrant connection failure | Retry with exponential backoff (3 attempts) |
| Groq rate limit (429) | Caught and shown as a user-friendly message |
| No relevant chunks found | "I don't have enough policy information" |
| Coverage vs. evidence conflict | System prompt + Coverage Rationale boost ensures the correct answer |
| Very long conversation | History trimmed to the last 3 turns |
| Model loading on first visit | Spinner shown; cached with `st.cache_resource` |
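The retry-with-backoff pattern used for Qdrant failures can be sketched as a small generic helper (names and the caught exception type are illustrative):

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn, retrying on ConnectionError with exponential backoff (1s, 2s, 4s...).

    Re-raises the last error if every attempt fails.
    """
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
```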

## Extending for Other Insurance Providers

The system is designed for multi-provider extensibility:

1. **Data layer:** Each chunk in Qdrant has a `provider` field (currently `"UnitedHealthcare"`). Adding a new provider means running the same pipeline with a new provider slug; chunks coexist in the same collection.
2. **Scraper:** `scraper/download_policies.py` can be adapted for any provider's website. The extractor and chunker handle standard medical policy PDF structures.
3. **Embedding:** The same MedEmbed model works for all medical content. New provider chunks are embedded and upserted alongside existing ones.
4. **Retrieval:** Add a `provider_filter` parameter to narrow results by provider, or query across all providers simultaneously.
5. **UI:** Add a provider selector dropdown in the Streamlit sidebar, a one-line change.

```python
# Example: adding Aetna
retriever.retrieve(query, provider_filter="aetna")
```
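Under the hood, such a `provider_filter` could map to a Qdrant payload filter on the indexed `provider` field. A minimal sketch using Qdrant's JSON-style filter shape (the real retriever might use the client's typed filter models instead):

```python
from typing import Optional


def provider_filter(provider: Optional[str]) -> Optional[dict]:
    """Build a Qdrant-style payload filter, or None to search all providers."""
    if provider is None:
        return None
    return {"must": [{"key": "provider", "match": {"value": provider}}]}
```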

## Local Development Setup

```bash
# 1. Clone the repo
git clone https://github.com/<your-username>/uhc-policy-chatbot.git
cd uhc-policy-chatbot

# 2. Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment variables
cp .env.example .env
# Edit .env with your Qdrant and Groq API keys

# 5. Run the Streamlit app
streamlit run app.py

# Or use the CLI with Ollama (local LLM)
ollama serve &
ollama pull phi3.5
python -m chatbot.cli
```

## Environment Variables

| Variable | Description | Required |
|---|---|---|
| `QDRANT_URL` | Qdrant Cloud cluster URL | Yes |
| `QDRANT_API_KEY` | Qdrant Cloud API key | Yes |
| `QDRANT_COLLECTION` | Collection name (default: `uhc_policies`) | No |
| `GROQ_API_KEY` | Groq API key (free tier available) | Yes (web) |
| `GROQ_MODEL` | Groq model (default: `llama-3.1-8b-instant`) | No |
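A sketch of how a config module might validate the required variables and apply the documented defaults (the function and key names are illustrative):

```python
import os

REQUIRED = ("QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY")


def load_config(env=os.environ):
    """Fail fast on missing required variables; fill in documented defaults."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "qdrant_url": env["QDRANT_URL"],
        "qdrant_api_key": env["QDRANT_API_KEY"],
        "collection": env.get("QDRANT_COLLECTION", "uhc_policies"),
        "groq_api_key": env["GROQ_API_KEY"],
        "groq_model": env.get("GROQ_MODEL", "llama-3.1-8b-instant"),
    }
```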

## Tech Stack

| Component | Technology |
|---|---|
| Embedding model | MedEmbed-large-v0.1 (1024-dim) |
| Vector database | Qdrant Cloud |
| LLM (deployed) | Llama 3.1 8B via Groq (560 tok/s) |
| LLM (local dev) | Phi-3.5 Mini via Ollama |
| Web framework | Streamlit |
| Hosting | HuggingFace Spaces (free tier) |
| Text-to-speech | Kokoro ONNX (82M) |
| PDF extraction | pdfplumber + BeautifulSoup |