---
license: apache-2.0
base_model: meta-llama/Meta-Llama-3-8B
tags:
- text-generation
- quantization
- 4bit
- mixed-precision
- llama-3
- ramp
- arxiv:2603.17891
---

# Llama3-8B-RAMP-4bit

This repository contains a 4-bit quantized Llama 3 8B checkpoint produced with **RAMP** (Reinforcement Adaptive Mixed Precision Quantization).

## Paper

RAMP was introduced in:

[**RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference**](https://arxiv.org/abs/2603.17891)

## Model Summary

This model is a compressed Llama 3 8B variant intended for efficient inference with reduced memory usage.

## What is RAMP?

RAMP is a reinforcement-learning-based mixed-precision quantization method that learns per-layer bit-width assignments under a global bit budget. It also introduces Scale Folding, a preconditioning step designed to make sub-4-bit quantization more stable.

## Intended Use

This model is intended for:

- efficient local inference
- edge and on-device deployment
- research on quantization and mixed-precision inference

## Limitations

- This is a quantized model and may show quality degradation compared to the original FP16 model.
- Performance depends on the inference backend, calibration setup, and prompt type.
- The model may still produce incorrect, biased, or unsafe outputs.

## Citation

If you use this model or the RAMP method in your work, please cite:

```bibtex
@misc{gautam2026ramp,
      title={RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference},
      author={Gautam, Arpit Singh and Jha, Saurabh},
      year={2026},
      eprint={2603.17891},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
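## Appendix: Illustrative Mixed-Precision Quantization

To give a concrete sense of what per-layer bit-width assignment means, the sketch below applies plain symmetric uniform quantization at different bit widths to two toy weight matrices. This is a generic illustration only, not the RAMP algorithm: the layer names, the hand-written bit plan, and the random weights are all hypothetical, and RAMP's RL-learned assignment and Scale Folding step are not modeled here.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantize-dequantize of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()                 # all-zero tensor: nothing to quantize
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                    # dequantize back to float

# Hypothetical per-layer bit plan: a sensitive layer kept at 8 bits,
# another pushed down to 4 bits (names are made up for illustration).
rng = np.random.default_rng(0)
layers = {"attn.q_proj": rng.normal(size=(64, 64)),
          "mlp.up_proj": rng.normal(size=(64, 64))}
bit_plan = {"attn.q_proj": 8, "mlp.up_proj": 4}

for name, w in layers.items():
    w_q = quantize_uniform(w, bit_plan[name])
    err = np.abs(w - w_q).mean()
    print(f"{name}: {bit_plan[name]}-bit, mean abs error {err:.4f}")
```

Lower bit widths use a coarser grid and so incur larger reconstruction error, which is exactly the trade-off a mixed-precision bit budget navigates.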