---
license: other
license_name: nvidia-open-model-license
license_link: https://huggingface.co/nvidia/personaplex-7b-v1/blob/main/LICENSE
base_model: nvidia/personaplex-7b-v1
base_model_relation: quantized
tags:
- speech-to-speech
- conversational-ai
- bitsandbytes
- 4-bit
- quantized
- nvidia
- moshi
library_name: custom
---

# PersonaPlex 7B v1 — 4-bit NF4 Quantized (bitsandbytes)

This is a **4-bit NF4 quantized** version of [nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1), quantized with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.

## Why Quantize?

The original model requires ~14 GiB of VRAM in bf16, which exceeds the memory of consumer GPUs such as the RTX 4070 (12 GB). This 4-bit quantized version fits on a 12 GB card:

| | Original (bf16) | Quantized (NF4) |
|---|---|---|
| VRAM | ~14 GiB | **~9.6 GiB** |
| GPU | A100 / H100 | **RTX 4070+ (12 GB)** |
| torch.compile | Yes | Yes |
| CUDA graphs | Yes | Yes |

## What's Quantized?

Only the main transformer's linear layers (attention projections and the gating FFN) are quantized to 4-bit NF4. The following components are kept in bf16 to preserve quality:

- Mimi audio encoder/decoder
- Depformer (depth transformer)
- Embedding layers
- Output heads

## Quick Start

### Prerequisites

1. Accept the [PersonaPlex license](https://huggingface.co/nvidia/personaplex-7b-v1) (required for the base model assets)
2. Set your HuggingFace token:

```bash
export HF_TOKEN=
```

### Installation

```bash
git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes
```

### Run (Live Server)

```bash
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit
```

Then open `https://localhost:8998` in your browser.
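For intuition on what `--quantize-4bit` does under the hood: NF4 stores each weight as a 4-bit index into a fixed 16-value codebook, plus one absmax scale per block of weights. Below is a minimal pure-Python sketch of those mechanics, not the bitsandbytes implementation itself; the codebook values are approximate stand-ins for the real NF4 constants, and the block size of 64 matches the bitsandbytes default.

```python
# Illustrative blockwise 4-bit quantization (NOT the actual bitsandbytes
# NF4 kernel). Each weight becomes a 4-bit index into a 16-value codebook;
# each block of 64 weights also stores one absmax scale.
import random

# 16 levels in [-1, 1], roughly normal-quantile-shaped (approximate
# stand-ins for the true NF4 constants).
CODEBOOK = [-1.0, -0.70, -0.53, -0.39, -0.28, -0.18, -0.09, 0.0,
            0.08, 0.16, 0.25, 0.34, 0.44, 0.56, 0.72, 1.0]

BLOCK = 64  # bitsandbytes default block size for 4-bit quantization


def quantize(weights):
    """Return a list of (scale, indices) blocks: one absmax scale and
    one 4-bit codebook index per weight."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) or 1.0
        idx = [min(range(16), key=lambda k: abs(CODEBOOK[k] - w / scale))
               for w in chunk]
        blocks.append((scale, idx))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate weights from (scale, indices) blocks."""
    out = []
    for scale, idx in blocks:
        out.extend(CODEBOOK[k] * scale for k in idx)
    return out


if __name__ == "__main__":
    random.seed(0)
    w = [random.gauss(0.0, 0.02) for _ in range(256)]
    w_hat = dequantize(quantize(w))
    err = max(abs(a - b) for a, b in zip(w, w_hat))
    # Storage: 4 bits per weight + one scale per 64-weight block,
    # versus 16 bits per weight in bf16 -> roughly 4x smaller.
    print(f"max abs reconstruction error: {err:.4f}")
```

This is why only the big transformer linears are worth quantizing: they dominate the parameter count, while the small bf16 components listed above cost little memory but are sensitive to the rounding error this scheme introduces.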
### Run (Offline Evaluation)

```bash
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json" \
  --quantize-4bit
```

### Using Pre-Quantized Weights

This repo includes pre-quantized weights (`model_bnb_4bit.pt`), so you don't need the full 16.7 GB download. To use them, pass `--moshi-weight model_bnb_4bit.pt` along with `--quantize-4bit`. The loader auto-detects the pre-quantized format and skips re-quantization.

## Changes from Base Model

This repo includes a modified `moshi/` package with:

- `--quantize-4bit` flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
- Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
- `--cpu-offload` fixes for consumer GPU compatibility
- Attention `in_proj` refactored as a proper `nn.Module` for quantization support
- Gating forward path updated to route through quantized modules

## Citation

```bibtex
@misc{roy2026personaplexvoicerolecontrol,
  title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
  author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
  year={2026},
  eprint={2602.06053},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.06053},
}
```

## License

Code is MIT licensed. Model weights are under the [NVIDIA Open Model License](https://huggingface.co/nvidia/personaplex-7b-v1/blob/main/LICENSE).