---
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a **4-bit quantized version** of the Llama-3.2-1B-Instruct model. It has been quantized with **bitsandbytes NF4** for very low VRAM consumption and fast inference, making it well suited to edge devices, low-resource systems, and fast evaluation pipelines (e.g., interview Thinker models).

---

## Model Features

- **Base model:** Llama-3.2-1B-Instruct
- **Quantization:** 4-bit (NF4) via `bitsandbytes`
- **VRAM requirement:** ~1.0 GB
- **Well suited to:**
  - Lightweight chatbots
  - Reasoning/evaluation agents
  - Interview Thinker modules
  - Local inference on small GPUs
  - Low-latency systems
- **Compatible with:**
  - LoRA fine-tuning
  - Hugging Face Transformers
  - Text-generation inference engines

---

## Files Included

- `config.json`
- `generation_config.json`
- `model.safetensors` (4-bit quantized weights)
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `chat_template.jinja`

These files allow you to load the model directly with `load_in_4bit=True`.

---

## How To Load This Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
```
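
## Quick Generation Example

Once the model is loaded, you can run a quick smoke test through the bundled `chat_template.jinja`. The snippet below is a minimal sketch, assuming the repository id from the loading example above, a device reachable via `device_map="auto"`, and illustrative prompt and sampling settings that are not part of this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)

# Build a chat prompt using the bundled chat template.
# The messages themselves are placeholders for illustration.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain 4-bit NF4 quantization in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a short reply; sampling settings are illustrative, not tuned.
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that newer Transformers releases may warn that the bare `load_in_4bit=True` shortcut is deprecated; in that case, pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")` to `from_pretrained` instead, which loads the same NF4 weights.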