I will be testing fewer finetunes/merges
As of writing this, the leaderboard is at 934 commits and is just over 2 years old. I originally started it when I was unemployed and had a lot more time on my hands haha, and lately I've been spending entire weekends testing models and adding support for new ones when I'd really rather be doing other things. Also, in the last 2 months I've spent $1,500 on testing AIs, and I'm just not comfortable spending that kind of money anymore.
I'm still going to try to test API models, base models, and models that are highly requested (a lot of reactions in their discussion), but I want to spend my time on my other hobbies, so I won't be able to test every model submitted.
Thank you to everyone who has supported the leaderboard over these years ❤️
Damn, that's very disappointing. You were the only source telling me which of my releases were the best and which ones were garbage. KL divergence isn't really all that accurate and just acts as a small pointer, and PIQA scores only tell you a small part of the whole picture.
I understand. You've been at this a lot longer than I ever expected, really. Thank you for the work you've done. You certainly had a massive impact on the community.
Is there any way the community can help you?
We could build a community-driven voting web app to prioritize which models (finetunes/merges) actually deserve your time. This would replace manually checking reactions and provide a more granular way to vote for specific models rather than entire discussion threads.
To help with costs, we could integrate a 'Highlight' system linked to your Ko-fi: if someone wants a specific model prioritized or noticed, they can sponsor it via a donation. I'm a developer and would be happy to build and maintain this so it doesn't become another chore for you. You've done a lot for us, let us help you keep this sustainable.
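To make the idea a bit more concrete, here is a rough sketch of how the ranking behind such an app might work. Everything in it is hypothetical: the `ModelRequest` fields, the `prioritize` function, and the rule that sponsored requests sort ahead of unsponsored ones are assumptions for illustration, not an existing design.

```python
from dataclasses import dataclass


@dataclass
class ModelRequest:
    """One requested model (hypothetical schema)."""
    model_id: str
    votes: int = 0
    sponsored_usd: float = 0.0  # total donated to highlight this request


def prioritize(requests):
    """Order requests: highlighted (sponsored) first, then by votes.

    Assumed rule: any sponsorship puts a request in the highlighted tier;
    within a tier, more votes rank higher, and the sponsored amount
    breaks ties.
    """
    return sorted(
        requests,
        key=lambda r: (r.sponsored_usd > 0, r.votes, r.sponsored_usd),
        reverse=True,
    )


# Placeholder model names, not real submissions.
queue = prioritize([
    ModelRequest("example/finetune-a", votes=12),
    ModelRequest("example/merge-b", votes=3, sponsored_usd=5.0),
    ModelRequest("example/finetune-c", votes=40),
])
print([r.model_id for r in queue])
```

A real version would also need accounts or rate limiting so votes can't be spammed, which this sketch ignores.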
Ain't no shame in charging a dollar per parameter for evals moving forward
I think you'd benefit from getting at least one other person involved - maybe someone well respected from a team project, such as a regular with the "mradermacher" team. I know you have a very valid concern about the benchmarks leaking publicly and invalidating themselves going forward, but that's definitely not guaranteed - and taking that (well-informed) risk is likely much better than burning out, in which case the benchmark is no longer valid going forward either.
If part of the problem is that the process is too complex or involved, it's likely there are steps that could be at least partially automated. In fact, you could potentially share (or open) your workflow minus the actual corpus of data that defines the benchmarks. I'm sure lots of folks - myself included - would be keen on modularizing and streamlining the workflow to make your tasks simpler.
(I second Pentium95 above - were I in your shoes, dealing with the requests forum would be quite frustrating; a relatively simple web app could help alleviate that.)
This leaderboard, at least as far as I'm aware, has been basically the only useful metric for whether or not a model is even worth downloading to begin with, what with both storage and bandwidth at a premium, so this is a real sad day. I mean, if you've not got the motivation, you've not got the motivation, and I certainly don't blame you for not wanting to sink thousands of dollars into this, but still.
I've no idea if it'd make any meaningful difference, but would it be possible to prune some of the less essential aspects of the process? I can't speak for anyone else, but I know I've never factored the political score into my decision to give a model a go. W/10 is king, with Nat int and Writing in 2nd and 3rd for my selection criteria.
If nothing else, thanks a bunch for even keeping it up this long, it's been extremely helpful.
Thank you for your work, it was an important milestone in open source, and it made a difference.
Hopefully, you are not planning to end
I am sure if you could disclose your methods to some trusted people on here, they would be willing to help you. I know I would. I also think many people would make a small donation to keep you going if you asked for it. I am sure many people here feel the same way, and yet I also understand starting something when you have more breathing room and it turning into a chore that you are starting to dislike. Hopefully, if you ever feel the need to stop entirely, you could help others take over. I think your work so far has been really useful, so thank you for everything you have done. I have enjoyed using your leaderboard for many years to make detailed analyses alongside some of my own benchmarks.
I donated $20 on his Ko-Fi and he quit right after that, though it was mentioned in the OP that he spent $1,500 on testing AI in the last two months, so I don't think donations are going to cover such a cost unless he suddenly starts receiving $750 in donations every month.
Big fan of your work by the way. I am not sure if we can recoup all of his investment, but I wouldn't discount the option to significantly help him collectively. When I started reading LocalLLaMA there were only about 100 people; now we have a million. As a cool hobby and for work, the number of people interested in this has exploded. We need only 150 people donating $10 or 300 donating $5 to make something like that happen.
Not going to sit here and tell him how to do it. I am only saying that if that is what is needed we could all chip in. One post on Reddit could make it. But again, entirely his choice and I respect it either way.
From what he said in the OP, the money is just one part of the issue. The other part is that he is spending all of his free time testing AI and noticed he wasn't able to do much of anything else, so he is also quitting because he wants to focus on his other hobbies instead.
This comes almost exactly a year after the Open LLM Leaderboard shut down.
$600 billion invested in LLMs in 2025, and it's apparently not possible to have a somewhat independent publicly accessible benchmark leaderboard without individuals renting GPUs out of their own pockets.
What a shame. Surely Hugging Face (who are benefiting from projects like this one, as it drives engagement to their platform) could at the very least provide compute for UGI, if not manage it in-house?
Go here:
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/settings?grant=true
This will create a community discussion on the space where we can all voice support for your application.
Free GPU compute... They've given it to me before, all you gotta do is ask.
@darkc0de Is that mainly for web demos? It doesn't seem like I would be able to privately run code on it very easily. The GPU looks attached to the Hugging Face Space.
It would likely require you to adapt your workflow a bit to run compute in HF "spaces" or "jobs".
Also, I meant what I said. No shame in charging a dollar per parameter moving forward. The way I see it, the community who uses it already owes you backpay anyways.
I think a dollar per parameter is probably a little excessive. A dollar per billion parameters sounds more reasonable
You know what I mean haha
I just tossed $35 USD onto Ko-fi right now as an incentive towards Gemma-4-31b-it (the req thread), with the full knowledge and acceptance that DontPlanToEnd might not ever get around to it - I can afford that, and I think anyone else who wants to support like that should jump on board (Ko-fi link). I know money's only one piece of this, but it stinks that this has been a sink of time and money for DontPlanToEnd, so I added this on even though I donate on a schedule. (Which I share not to brag or for brownie points or whatever, but as encouragement to other people who may also donate regularly that they can add a bit on too.)
At worst, hopefully this lands as appreciation! (Comment duplicated in the above referenced thread.)
