---
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a **4-bit quantized version** of the Llama-3.2-1B-Instruct model. It has been quantized with **bitsandbytes NF4** for very low VRAM consumption and fast inference, making it well suited to edge devices, low-resource systems, and fast evaluation pipelines (e.g., interview Thinker models).

---

## Model Features

- **Base model:** Llama-3.2-1B-Instruct
- **Quantization:** 4-bit (NF4) via `bitsandbytes`
- **VRAM requirement:** ~1.0 GB
- **Well suited to:**
  - Lightweight chatbots
  - Reasoning/evaluation agents
  - Interview Thinker modules
  - Local inference on small GPUs
  - Low-latency systems
- **Compatible with:**
  - LoRA fine-tuning
  - Hugging Face Transformers
  - Text-generation inference engines

---

## Files Included

- `config.json`
- `generation_config.json`
- `model.safetensors` (4-bit quantized weights)
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `chat_template.jinja`

These files allow you to load the model directly with `load_in_4bit=True`.

---

## How To Load This Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
```
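
## Quick Generation Example

Once the model is loaded, you can run a quick smoke test through the bundled `chat_template.jinja`. The snippet below is a minimal sketch, assuming the repository id from the loading example above, a device reachable via `device_map="auto"`, and illustrative prompt and sampling settings that are not part of this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)

# Build a chat prompt using the bundled chat template.
# The messages themselves are placeholders for illustration.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain 4-bit NF4 quantization in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a short reply; sampling settings are illustrative, not tuned.
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that newer Transformers releases may warn that the bare `load_in_4bit=True` shortcut is deprecated; in that case, pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")` to `from_pretrained` instead, which loads the same NF4 weights.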