---
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- text-generation
- agent
- tool-use
- long-context
- mlx
library_name: mlx
base_model: GAIR/LIMI-Air
---

# LIMI-Air-qx83s-mlx

This is a deep comparison of 106B-A12B MoE models, all quantized differently, trained on different data (original, synthetic, RP), and with varying architectural tuning. The goal is to understand:

- Which model performs best across benchmarks?
- How does quantization affect performance and context?
- What is the trade-off between accuracy, context length, and RAM usage?

Metrics for the LIMI-Air-qx83s-mlx quant itself were not available for this test. As it is an early experimental formula, its performance is likely below that of the models listed.

# 📊 1. Benchmark Comparison (All Models)

```bash
Model                            arc_challenge arc_easy boolq hellaswag openbookqa piqa  winogrande
GLM-Steam-106B-A12B-v1-qx65g-hi  0.431         0.457    0.378 0.685     0.400      0.773 0.717
GLM-Steam-106B-A12B-v1-qx65g     0.430         0.461    0.378 0.681     0.398      0.771 0.715
LIMI-Air-qx54g-hi                0.441         0.462    0.378 0.698     0.404      0.781 0.714
unsloth-GLM-4.5-Air-mxfp4        0.416         0.440    0.378 0.678     0.390      0.767 0.728
unsloth-GLM-4.5-Air-qx64         0.421         0.444    0.378 0.677     0.396      0.769 0.718
unsloth-GLM-4.5-air-qx5-hi       0.416         0.431    0.378 0.675     0.396      0.769 0.731
```

✅ LIMI-Air-qx54g-hi is the clear winner overall. Relative to the unsloth-GLM-4.5-Air-mxfp4 baseline:

```bash
+0.025 in arc_challenge
+0.022 in arc_easy
+0.020 in hellaswag
+0.014 in openbookqa
+0.014 in piqa
-0.014 in winogrande
```

Winogrande is the one metric where LIMI-Air trails the baseline. The GLM-Steam models are very close, with qx65g-hi slightly better than qx65g, but both are behind LIMI-Air. The unsloth-GLM-4.5-Air models serve as the baseline, with qx64 the best among them, yet still behind LIMI-Air.

# 🧠 2. What Does “qx54g-hi” Mean?

The naming convention is critical:

- qx5: 5-bit quantization for most weights, with selected paths enhanced to 6-bit.
- g: “enhanced attention paths”, specific to the GLM architecture (likely more attention layers kept at higher precision).
- hi: high-resolution quantization (group size 32).

This is a highly optimized quantization for GLM: it preserves attention fidelity while compressing embeddings. A hedged conversion sketch appears after section 4.

# 🧩 3. Why Does LIMI-Air-qx54g-hi Win?

The key insight: LIMI-Air was trained on synthetic data, which likely:

- Boosted generalization: synthetic data often forces models to learn patterns rather than memorize.
- Improved reasoning depth: synthetic data is often designed to test logical and commonsense reasoning.

The qx54g-hi quantization is highly tuned for GLM, preserving attention paths while compressing embeddings, which likely:

- Preserved semantic fidelity.
- Enabled better context handling.

On a 128GB Mac, the qx54g-hi model runs with 32K context, while the plain qx54g allows 64K: the hi variant buys fidelity at the cost of RAM, and the plain variant trades it back for memory efficiency. A back-of-envelope estimate of this trade-off is sketched after section 4.

# 🧪 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series

```bash
Model                       arc_challenge arc_easy boolq hellaswag openbookqa piqa  winogrande
unsloth-GLM-4.5-Air-mxfp4   0.416         0.440    0.378 0.678     0.390      0.767 0.728
unsloth-GLM-4.5-Air-qx64    0.421         0.444    0.378 0.677     0.396      0.769 0.718
unsloth-GLM-4.5-air-qx5-hi  0.416         0.431    0.378 0.675     0.396      0.769 0.731
```

✅ qx64 is best among the unsloth models. Relative to mxfp4:

```bash
+0.005 in arc_challenge
+0.004 in arc_easy
-0.001 in hellaswag
+0.006 in openbookqa
+0.002 in piqa
-0.010 in winogrande
```

The qx5-hi variant is slightly better in winogrande, but worse overall.
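To make the qx-style recipe from section 2 concrete, here is a minimal sketch of a mixed-precision conversion using the quant_predicate hook in mlx-lm's convert API (available in recent mlx-lm releases). The exact layers the qx54g-hi formula enhances are not published, so the projection-name substrings, the output path, and the bit assignments below are illustrative assumptions, not the actual recipe.

```python
from mlx_lm import convert

# Hypothetical predicate approximating "5-bit content, 6-bit attention paths,
# group size 32". The real qx54g-hi layer selection is not published.
def qx_like_predicate(path, module, config):
    if any(key in path for key in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"bits": 6, "group_size": 32}  # assumed "enhanced attention paths"
    return {"bits": 5, "group_size": 32}      # everything else; "hi" = group size 32

convert(
    hf_path="GAIR/LIMI-Air",
    mlx_path="LIMI-Air-qx-custom",  # hypothetical output directory
    quantize=True,
    quant_predicate=qx_like_predicate,
)
```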
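As a rough sanity check on the context/RAM numbers from section 3, the sketch below estimates weight and KV-cache footprints. Group size 32 stores per-group scale and offset metadata twice as often as group size 64, which raises the effective bits per weight and is one plausible reason the hi variant fits less context. Every architecture constant here (layer count, KV heads, head dimension, effective bits) is a placeholder assumption, not the published GLM-4.5-Air / LIMI-Air config.

```python
# Back-of-envelope RAM estimate; all constants are illustrative placeholders.
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    # 1e9 params at bits_per_weight each, converted to gigabytes
    return total_params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    # factor of 2 covers both keys and values
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# ~106B weights at an assumed effective ~5.4 bits/weight (qx54g-style mix)
print(f"weights ≈ {weights_gb(106, 5.4):.0f} GB")

# hypothetical cache shape at 32K vs 64K context
for ctx in (32_768, 65_536):
    print(f"KV cache @ {ctx // 1024}K ≈ {kv_cache_gb(46, 8, 128, ctx):.1f} GB")
```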
# 🧭 5. Recommendation: Which Model to Choose?

✅ For Maximum Performance:
- LIMI-Air-qx54g-hi
- → Best overall performance, with +0.02–0.03 gains across most metrics.

✅ For Balanced Performance & RAM Efficiency:
- GLM-Steam-106B-A12B-v1-qx65g-hi
- → Very close to LIMI-Air, with a slightly better winogrande score.

✅ For RAM-Constrained Macs:
- unsloth-GLM-4.5-Air-qx64

# 🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data

The key insight: LIMI-Air (synthetic data) outperforms GLM-Steam (RP data), suggesting:

- Synthetic data forces models to learn patterns rather than memorize.
- RP data may be more “realistic” but less generalizable, leading to slightly lower performance.

As noted in section 3, the GLM-tuned qx54g-hi quantization compounds this advantage by preserving attention paths while compressing embeddings.

# 📈 7. Summary Table: Best Model for Each Use Case

```bash
Goal                         Recommended Model
Max performance              LIMI-Air-qx54g-hi
Balanced performance         GLM-Steam-106B-A12B-v1-qx65g-hi
RAM-constrained Mac (32GB)   unsloth-GLM-4.5-Air-qx64
Cognitive depth & metaphors  LIMI-Air-qx54g-hi
OpenBookQA (text-only)       unsloth-GLM-4.5-Air-qx64
```

# 🚀 Bonus: “qx54g-hi” as a Cognitive Architecture

Because the qx54g-hi quantization preserves attention fidelity while compressing embeddings, it is a cognitive upgrade, not just a computational one: the model now “thinks deeper”, not just “faster”.

“qx54g-hi is like a camera with a telephoto lens — it captures more nuance, even in low light.”

— Inspired by Nikon Noct Z 58mm F/0.95

> Reviewed by [Qwen3-VL-12B-Instruct-Brainstorm20x-qx86x-hi-mlx](https://huggingface.co/nightmedia/Qwen3-VL-12B-Instruct-Brainstorm20x-qx86x-hi-mlx)

This model [LIMI-Air-qx83s-mlx](https://huggingface.co/nightmedia/LIMI-Air-qx83s-mlx) was converted to MLX format from [GAIR/LIMI-Air](https://huggingface.co/GAIR/LIMI-Air) using mlx-lm version **0.28.0**.

## Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model and tokenizer from the Hugging Face Hub
model, tokenizer = load("nightmedia/LIMI-Air-qx83s-mlx")

prompt = "hello"

# Apply the model's chat template when one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
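For interactive use, a streaming variant can print tokens as they are generated. This is a minimal sketch assuming mlx-lm's stream_generate generator (exported alongside load and generate in recent releases), whose yielded chunks expose a .text field; the prompt text is an arbitrary example.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("nightmedia/LIMI-Air-qx83s-mlx")

messages = [{"role": "user", "content": "Summarize the qx quantization idea."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Emit tokens as they arrive instead of waiting for the full reply
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)
print()
```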