---
license: apache-2.0
base_model: meta-llama/Meta-Llama-3-8B
tags:
- text-generation
- quantization
- 4bit
- mixed-precision
- llama-3
- ramp
- arxiv:2603.17891
---

# Llama3-8B-RAMP-4bit

This repository contains a 4-bit quantized Llama 3 8B checkpoint produced with **RAMP** (Reinforcement Adaptive Mixed Precision Quantization).

## Paper

RAMP was introduced in:

[**RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference**](https://arxiv.org/abs/2603.17891)

## Model Summary

This model is a compressed Llama 3 8B variant intended for efficient inference with reduced memory usage.

## What is RAMP?

RAMP is a reinforcement-learning-based mixed-precision quantization method that learns per-layer bit-width assignments under a global bit budget. It also introduces Scale Folding, a preconditioning step designed to make sub-4-bit quantization more stable.

## Intended Use

This model is intended for:

- efficient local inference
- edge and on-device deployment
- research on quantization and mixed-precision inference

## Limitations

- This is a quantized model and may show quality degradation compared to the original FP16 model.
- Performance depends on the inference backend, calibration setup, and prompt type.
- The model may still produce incorrect, biased, or unsafe outputs.

## Citation

If you use this model or the RAMP method in your work, please cite:

```bibtex
@misc{gautam2026ramp,
      title={RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference},
      author={Gautam, Arpit Singh and Jha, Saurabh},
      year={2026},
      eprint={2603.17891},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
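## Appendix: Illustrative Mixed-Precision Quantization

To give a concrete sense of what per-layer bit-width assignment means, the sketch below applies plain symmetric uniform quantization at different bit widths to two toy weight matrices. This is a generic illustration only, not the RAMP algorithm: the layer names, the hand-written bit plan, and the random weights are all hypothetical, and RAMP's RL-learned assignment and Scale Folding step are not modeled here.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantize-dequantize of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()                 # all-zero tensor: nothing to quantize
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                    # dequantize back to float

# Hypothetical per-layer bit plan: a sensitive layer kept at 8 bits,
# another pushed down to 4 bits (names are made up for illustration).
rng = np.random.default_rng(0)
layers = {"attn.q_proj": rng.normal(size=(64, 64)),
          "mlp.up_proj": rng.normal(size=(64, 64))}
bit_plan = {"attn.q_proj": 8, "mlp.up_proj": 4}

for name, w in layers.items():
    w_q = quantize_uniform(w, bit_plan[name])
    err = np.abs(w - w_q).mean()
    print(f"{name}: {bit_plan[name]}-bit, mean abs error {err:.4f}")
```

Lower bit widths use a coarser grid and so incur larger reconstruction error, which is exactly the trade-off a mixed-precision bit budget navigates.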