Open source tool to add cost, latency & hallucination dimensions to your LLM evaluation

#1164
by vigneshwar234 - opened

Hey everyone 👋

Love this leaderboard. It's the gold standard for accuracy rankings, but in production, accuracy alone doesn't tell the full story.

I built an open-source LLM Evaluation Framework that adds the missing dimensions:

  • 💰 Cost per 1K tokens: real token-count pricing across 15+ models
  • ⚡ Latency: p50/p95/p99 percentiles, not just averages
  • 🔍 Hallucination Rate: linguistic signal analysis, runs locally, zero extra cost
  • 🧠 Reasoning Quality: chain-of-thought depth scoring
  • 🎯 Accuracy: 4-strategy cascade scorer (exact, normalized, MC, fuzzy)
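
To give a rough idea of what the latency and accuracy bullets mean in practice, here is a minimal sketch using only the Python standard library. It is not the framework's actual code: the function names, the A-D choice regex, and the 0.85 fuzzy threshold are all assumptions made up for illustration.

```python
import difflib
import re
import statistics


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 for a list of latencies (needs at least 2 samples)."""
    # quantiles(n=100) yields 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def extract_choice(text):
    """Pull the first standalone capital letter A-D out of a free-form answer.

    Naive on purpose: a leading article "A" would also match.
    """
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def cascade_score(prediction, reference, fuzzy_threshold=0.85):
    """Try exact, normalized, multiple-choice, then fuzzy matching, in order.

    Returns (is_correct, strategy_name). The threshold is a hypothetical default.
    """
    if prediction == reference:
        return True, "exact"
    if normalize(prediction) == normalize(reference):
        return True, "normalized"
    ref = reference.strip()
    if len(ref) == 1 and extract_choice(prediction) == ref.upper():
        return True, "mc"
    ratio = difflib.SequenceMatcher(
        None, normalize(prediction), normalize(reference)
    ).ratio()
    if ratio >= fuzzy_threshold:
        return True, "fuzzy"
    return False, "none"
```

For example, `cascade_score("The answer is B.", "B")` falls through the exact and normalized checks and is accepted by the multiple-choice step, while a p99 latency will surface slow outliers that an average hides.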

One CLI command. Any LiteLLM-compatible model. Full benchmark report.

pip install llm-evaluation-framework
llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100

Live demo on HuggingFace (no API key needed):
https://huggingface.co/spaces/vigneshwar234/llm-eval-demo

GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

71 tests, 82% coverage, full CI/CD. Would love feedback from this community!

alozowski changed discussion status to closed
