

Ed Addario PRO

eaddario

EAddario

AI & ML interests

Finding ways to optimize LLMs' inference performance in resource-constrained environments (e.g. commodity hardware, desktops, laptops, mobiles, edge devices, etc.)

Recent Activity

new activity about 7 hours ago

eaddario/Qwen3.5-9B-GGUF: Amazing quants

new activity about 7 hours ago

eaddario/imatrix-calibration: Great collection, I'm using it for my little project.




posted an update about 7 hours ago


Experimental global target bits‑per‑weight quantization of google/gemma-4-E2B-it, google/gemma-4-E4B-it and google/gemma-4-26B-A4B-it

Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.

Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:

eaddario/gemma-4-E2B-it-GGUF
eaddario/gemma-4-E4B-it-GGUF
eaddario/gemma-4-26B-A4B-it-GGUF
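To make "optimizes per-tensor precision to meet a global size target" concrete, here is a minimal sketch of one generic way to frame it: a greedy budgeted allocation that repeatedly upgrades the tensor with the largest error reduction per extra bit until the global bpw budget is spent. This is an illustrative heuristic, not the author's algorithm (see the model cards for the actual methodology). The `Tensor` type, the `allocate` function and all error values are hypothetical; the bpw figures are the nominal sizes of llama.cpp block formats (e.g. Q4_K packs 256 weights into 144 bytes, i.e. 4.5 bpw).

```python
from dataclasses import dataclass

# Nominal bits per weight of some llama.cpp block formats (bytes per block * 8 / weights per block).
BPW = {"Q3_K": 3.4375, "Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5}


@dataclass
class Tensor:
    name: str
    n_weights: int
    error: dict  # hypothetical: measured quantization error of this tensor for each type


def allocate(tensors, target_bpw):
    """Greedy sketch: start every tensor at the cheapest type, then keep applying the
    one-step upgrade with the best error reduction per extra bit that still fits the budget."""
    total_weights = sum(t.n_weights for t in tensors)
    budget_bits = target_bpw * total_weights
    types = sorted(BPW, key=BPW.get)  # cheapest to most expensive
    choice = {t.name: types[0] for t in tensors}
    used_bits = sum(BPW[types[0]] * t.n_weights for t in tensors)

    while True:
        best = None
        for t in tensors:
            idx = types.index(choice[t.name])
            if idx + 1 == len(types):
                continue  # already at the most precise type
            cur, nxt = types[idx], types[idx + 1]
            extra = (BPW[nxt] - BPW[cur]) * t.n_weights
            if used_bits + extra > budget_bits:
                continue  # upgrade would overshoot the global size target
            gain = (t.error[cur] - t.error[nxt]) / extra
            if best is None or gain > best[0]:
                best = (gain, t.name, nxt, extra)
        if best is None:
            break  # no affordable upgrade left
        _, name, nxt, extra = best
        choice[name] = nxt
        used_bits += extra

    return choice, used_bits / total_weights


# Hypothetical example: a small, sensitive tensor and a large, tolerant one.
tensors = [
    Tensor("attn_v", 1_000_000, {"Q3_K": 0.9, "Q4_K": 0.5, "Q5_K": 0.3, "Q6_K": 0.2, "Q8_0": 0.1}),
    Tensor("ffn_up", 4_000_000, {"Q3_K": 0.4, "Q4_K": 0.3, "Q5_K": 0.25, "Q6_K": 0.22, "Q8_0": 0.2}),
]
choice, achieved_bpw = allocate(tensors, target_bpw=4.5)
print(choice, achieved_bpw)
```

With these made-up numbers the budget goes to the sensitive `attn_v` tensor (upgraded to Q8_0) while `ffn_up` stays at Q3_K, for an average of about 4.45 bpw. A greedy pass like this can stop short of the exact target; a real implementation may refine it further.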

replied to their post 5 days ago

Thank you @Green-Sky! I'm planning to have a go at the Gemma 4 models over the weekend, and I'll take your dataset for a spin.

replied to their post 7 days ago

On this occasion, no difference in size is expected.

I'm benchmarking quality rather than size. To allow apples-to-apples comparisons, models IQ1_M, IQ2_M, Q3_K, Q4_K, Q5_K, Q6_K and Q8_0 were quantized at the same bits per weight (bpw) as the naive models, and Q4_K-B and Q4_K-U were matched to the ones produced by Bartowski and Unsloth, respectively.

The file sizes are the same, but the quality is better.

You're welcome to use the enhanced versions of llama-imatrix and llama-quantize if you need a particular size. If that isn't practical, let me know which ones you'd need, and I'll be happy to upload them.
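Matching another quant's size, or fitting a VRAM budget, comes down to simple bits-per-weight arithmetic. The sketch below is a rough approximation that ignores GGUF metadata overhead; the function names, the `reserve_bytes` parameter (room left for KV cache and activations) and all numbers are hypothetical.

```python
def average_bpw(file_bytes: int, n_params: int) -> float:
    """Average bits per weight of an existing model file (metadata overhead ignored)."""
    return file_bytes * 8 / n_params


def target_bpw(budget_bytes: int, n_params: int, reserve_bytes: int = 0) -> float:
    """Highest average bpw whose weights fit in budget_bytes after setting aside reserve_bytes."""
    usable = budget_bytes - reserve_bytes
    if usable <= 0:
        raise ValueError("reserve_bytes must be smaller than budget_bytes")
    return usable * 8 / n_params


# Hypothetical: a 9e9-parameter model published as a 5.4e9-byte file.
print(round(average_bpw(5_400_000_000, 9_000_000_000), 3))  # 4.8
# Hypothetical: a 24e9-byte GPU, 26e9 parameters, 2e9 bytes kept free for the KV cache.
print(round(target_bpw(24_000_000_000, 26_000_000_000, 2_000_000_000), 3))  # 6.769
```

The first number is what you would feed back as the target when reproducing a reference file's size at equal bpw; the second is the budget a target-bpw quantization would aim for on that card.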

posted an update 8 days ago


eaddario/imatrix-calibration datasets updated to include Southeast Asian languages (Burmese, Filipino, Indonesian, Thai & Vietnamese).

posted an update 9 days ago


Experimental global target bits‑per‑weight quantization of Qwen/Qwen3.5-4B and Qwen/Qwen3.5-9B

Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.

Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:

eaddario/Qwen3.5-4B-GGUF
eaddario/Qwen3.5-9B-GGUF
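The benchmarks compare quants by perplexity (PPL) and KL divergence (KLD) against the unquantized model. As a sketch of the standard definitions only (not necessarily the exact procedure behind the model cards), assuming you already have per-token logits from both the base and the quantized model over the same text, the two metrics can be computed like this; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def perplexity(logits_seq, targets):
    """exp of the mean negative log-likelihood of the reference tokens."""
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_seq, targets))
    return math.exp(nll / len(targets))


def mean_kld(base_logits_seq, quant_logits_seq):
    """Mean over positions of KL(P_base || P_quant) between next-token distributions."""
    total = 0.0
    for b, q in zip(base_logits_seq, quant_logits_seq):
        lp, lq = log_softmax(b), log_softmax(q)
        total += sum(math.exp(p) * (p - r) for p, r in zip(lp, lq))
    return total / len(base_logits_seq)


# Toy check: uniform logits over 4 tokens give a perplexity of 4.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(perplexity(uniform, [0, 1, 2]))
print(mean_kld(uniform, uniform))  # identical models -> 0.0
```

PPL only looks at the probability assigned to the reference token, while KLD compares the full next-token distributions, which is why it is often the more sensitive signal when two quants have similar perplexity.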


posted an update 3 months ago


Experimental global target bits‑per‑weight quantization of mistralai/Ministral-3-14B-Instruct-2512 and mistralai/Ministral-3-14B-Reasoning-2512

Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.

Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:

eaddario/Ministral-3-14B-Instruct-2512-GGUF
eaddario/Ministral-3-14B-Reasoning-2512-GGUF

posted an update 4 months ago


Experimental global target bits‑per‑weight quantization of allenai/Olmo-3-7B-Instruct and allenai/Olmo-3-7B-Think

Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.

Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:

eaddario/Olmo-3-7B-Instruct-GGUF
eaddario/Olmo-3-7B-Think-GGUF

posted an update 4 months ago


Experimental global target bits‑per‑weight quantization of ServiceNow-AI/Apriel-1.6-15b-Thinker and zai-org/GLM-4.6V-Flash

Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.

Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:

eaddario/Apriel-1.6-15b-Thinker-GGUF
eaddario/GLM-4.6V-Flash-GGUF

reacted to hesamation's post with ❤️ 7 months ago


a senior engineer at google just dropped a free 400-page book on Google Docs for review: agentic design patterns.

the table of contents looks like everything you need to know about agents + code:
> advanced prompt techniques
> multi-agent patterns
> tool use and MCP
> you name it

read it here: https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/edit?tab=t.0#heading=h.pxcur8v2qagu

you can also pre-order on Amazon (published by Springer) and the royalties go to Save the Children: https://www.amazon.com/Agentic-Design-Patterns-Hands-Intelligent/dp/3032014018/

reacted to AdinaY's post with 👀 9 months ago


Skywork UniPic 🔥 a unified autoregressive multimodal model for image understanding, generation, & editing, by Skywork 天工

Skywork/skywork-unipic-6888c0789cdb82457b2acf32

✨ 1.5B - MIT License
✨ Runs on RTX 4090
✨ Truly unified architecture

posted an update 9 months ago


Layer-wise and Pruned versions of mistralai/Devstral-Small-2505 and mistralai/Mistral-Small-3.2-24B-Instruct-2506

- Tensor-wise:
eaddario/Devstral-Small-2505-GGUF
eaddario/Mistral-Small-3.2-24B-Instruct-2506-GGUF

- Pruned:
eaddario/Devstral-Small-2505-pruned-GGUF
eaddario/Mistral-Small-3.2-24B-Instruct-2506-pruned-GGUF

A summary is in the model cards and the test results are in the ./scores directory. Questions and feedback are always welcome.

reacted to AdinaY's post with 🔥 9 months ago


Seed-X 🔥 a suite of multilingual translation models released by ByteDance.

ByteDance-Seed/seed-x-6878753f2858bc17afa78543

✨ instruction/reinforcement learning/reward model
✨ Supports 28 languages, bidirectional translation
✨ Optimized for deployment & inference: 7B with mistral architecture
✨ Excels across domains: science, law, finance, literature & more

reacted to AdinaY's post with 🔥 9 months ago


M2-Reasoning🔥 a unified multimodal model for general (math, logic) and spatial (motion, physics, orientation) reasoning, released by AntGroup.

Model:
inclusionAI/M2-Reasoning
Paper:
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning (2507.08306)

✨ 7B with MIT license
✨ 294K high quality samples via novel data pipeline
✨ Dynamic multi-task training to resolve task conflicts

reacted to jsulz's post with 🤯 9 months ago


We've moved over 20PB from Git LFS to Xet on the Hub without downtime or data loss. Having things "just work" on a migration of this scale is about as good as it gets.

Now, we're migrating the rest of the Hub https://huggingface.co/blog/migrating-the-hub-to-xet

But how did we get here?

In the early days of joining Hugging Face, we made a few key design decisions:
* There would be no "hard cut-over" from Git LFS to Xet
* A Xet-enabled repository should be able to contain both Xet and LFS files
* Repository migrations from LFS to Xet can run in the background without disrupting downloads or uploads

These were largely driven by our desire to ensure the community could keep working without interruption.

We cover the infrastructure making this all go in this post, specifically:
* An integral piece of infrastructure known internally as the Git LFS Bridge
* Background content migrations that run around the clock

To skip the wait and join Xet now, sign up here https://huggingface.co/join/xet

reacted to merve's post with ❤️ 9 months ago


past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers thinking mode 💭 as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA

reacted to danielhanchen's post with 🔥🤯 9 months ago


Made some 245GB (80% size reduction) 1.8bit quants for Kimi K2!

unsloth/Kimi-K2-Instruct-GGUF

reacted to fdaudens's post with 🔥 9 months ago


You might not have heard of Moonshot AI — but within 24 hours, their new model Kimi K2 shot to the top of Hugging Face’s trending leaderboard.

So… who are they, and why does it matter?

Had a lot of fun co-writing this blog post with @xianbao, with key insights translated from Chinese, to unpack how this startup built a model that outperforms GPT-4.1, Claude Opus, and DeepSeek V3 on several major benchmarks.

🧵 A few standout facts:

1. From zero to $3.3B in 18 months:
Founded in March 2023, Moonshot is now backed by Alibaba, Tencent, Meituan, and HongShan.

2. A CEO who thinks from the end:
Yang Zhilin (31) previously worked at Meta AI, Google Brain, and Carnegie Mellon. His vision? Nothing less than AGI — still a rare ambition among Chinese AI labs.

3. A trillion-parameter model that’s surprisingly efficient:
Kimi K2 uses a mixture-of-experts architecture (32B active params per inference) and dominates on coding/math benchmarks.

4. The secret weapon: Muon optimizer:
A new training method that doubles efficiency, cuts memory in half, and ran 15.5T tokens with zero failures. Big implications.

Most importantly, their move from closed to open source signals a broader shift in China’s AI scene — following Baidu’s pivot. But as Yang puts it: “Users are the only real leaderboard.”

👇 Check out the full post to explore what Kimi K2 can do, how to try it, and why it matters for the future of open-source LLMs:
https://huggingface.co/blog/fdaudens/moonshot-ai-kimi-k2-explained

reacted to merve's post with 🤗❤️ 9 months ago


Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in under 40GB of VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
