Open source tool to add cost, latency & hallucination dimensions to your LLM evaluation
#1164
by vigneshwar234 - opened
Hey everyone,
Love this leaderboard: it's the gold standard for accuracy rankings. But in production, accuracy alone doesn't tell the full story.
I built an open source LLM Evaluation Framework that adds the missing dimensions:
- Cost per 1K tokens: real token-count pricing across 15+ models
- Latency: p50/p95/p99 percentiles, not just averages
- Hallucination Rate: linguistic signal analysis, runs locally, zero extra cost
- Reasoning Quality: chain-of-thought depth scoring
- Accuracy: 4-strategy cascade scorer (exact, normalized, MC, fuzzy)
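To give a rough idea of what two of these metrics involve, here is a minimal standalone sketch of latency percentiles and a cascade accuracy scorer. It is not the framework's actual code: the function names (`latency_percentiles`, `cascade_score`, `normalize`, `extract_choice`), the A-D choice format, and the 0.85 fuzzy threshold are all assumptions made for illustration, using only the Python standard library.

```python
import re
import statistics
from difflib import SequenceMatcher


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 from a list of per-request latencies (hypothetical helper)."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text):
    """Pull a leading multiple-choice letter such as 'B', '(B)' or 'B.' (assumes A-D)."""
    match = re.match(r"\s*\(?([A-Da-d])\)?(?:[.:)\s]|$)", text)
    return match.group(1).upper() if match else None


def cascade_score(prediction, reference, fuzzy_threshold=0.85):
    """Try each strategy in order and return the name of the first that matches.

    Returns None when no strategy matches. The threshold is an assumed value.
    """
    if prediction == reference:
        return "exact"
    if normalize(prediction) == normalize(reference):
        return "normalized"
    choice = extract_choice(prediction)
    if choice is not None and choice == extract_choice(reference):
        return "mc"
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    if ratio >= fuzzy_threshold:
        return "fuzzy"
    return None


if __name__ == "__main__":
    print(latency_percentiles(list(range(1, 102))))
    print(cascade_score("(B) because of the mitochondria", "B"))
    print(cascade_score("photosynthesys", "photosynthesis"))
```

Ordering the strategies from strictest to loosest means a cheap exact match short-circuits the more permissive checks, and the returned strategy name shows how much leniency a given answer needed.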
One CLI command. Any LiteLLM-compatible model. Full benchmark report.
pip install llm-evaluation-framework
llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100
Live demo on HuggingFace (no API key needed):
https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
71 tests, 82% coverage, full CI/CD. Would love feedback from this community!
alozowski changed discussion status to closed