---
license: llama3.1
language:
- en
base_model:
- mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
pipeline_tag: text-generation
library_name: transformers
tags:
- llama3.1
- abliteration
- quantized
- nf4
- bitsandbytes
- 4-bit
---

## 📜 Model Description

This model is a 4-bit NormalFloat (NF4) quantized version of mlabonne's Meta-Llama-3.1-8B-Instruct-abliterated. NF4 quantization significantly reduces the memory footprint (VRAM usage), making the model practical to deploy on consumer-grade GPUs and other resource-constrained hardware. Because the NF4 data type (introduced in the QLoRA paper) is designed for normally distributed weights, the quality loss from quantization stays small.

## 🔗 Original Model Source

- **Original Model Name:** mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
- **Original Base Model:** Llama 3.1 8B Instruct
- **Original Description:** A version of Llama 3.1 8B Instruct that has undergone "abliteration", a technique that identifies and ablates the refusal direction in the model's weights, removing most refusal behavior without additional fine-tuning.

## ⚙️ Quantization Details

- **Quantization Technique:** NF4 (NormalFloat 4-bit)
- **Library Used:** bitsandbytes, via the Hugging Face transformers integration
- **Purpose:** Load and run the model in 4-bit precision, drastically cutting VRAM requirements.

## 🛠️ How to Use the Model (4-bit Loading)

This model is intended to be used with the Hugging Face transformers library and bitsandbytes for 4-bit loading.

### 💻 Installation

Install the required libraries:

```bash
pip install torch transformers accelerate bitsandbytes
```

### Python Usage Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Meta-Llama-3.1-8B-Instruct-Abliterated-NF4"

# 1. Load the tokenizer and the pre-quantized model.
#    The quantization config saved in the repo is applied automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# 2. Run inference using the Llama 3.1 instruct template
messages = [
    {"role": "system", "content": "You are a helpful and friendly AI assistant."},
    {"role": "user", "content": "What is the main benefit of 4-bit NF4 quantization?"},
]

# Apply the chat template; add_generation_prompt appends the assistant
# header so the model starts a reply instead of continuing the user turn
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
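
### Reproducing the 4-bit Configuration

For reference, NF4 loading with bitsandbytes corresponds to a `BitsAndBytesConfig` along the following lines. This is a minimal sketch: the exact parameters used to produce this repository are not documented here, so the compute dtype and double-quantization settings below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 configuration (a sketch; these settings are typical, not the
# verified parameters used to build this specific repo)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

# Quantize the full-precision original on the fly at load time
model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Since this repository already stores NF4 weights, `from_pretrained` picks up the saved quantization config automatically; an explicit `BitsAndBytesConfig` is only needed when quantizing the full-precision original yourself.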