---
license: other
license_name: nvidia-open-model-license
license_link: https://huggingface.co/nvidia/personaplex-7b-v1/blob/main/LICENSE
base_model: nvidia/personaplex-7b-v1
base_model_relation: quantized
tags:
- speech-to-speech
- conversational-ai
- bitsandbytes
- 4-bit
- quantized
- nvidia
- moshi
library_name: custom
---

# PersonaPlex 7B v1 — 4-bit NF4 Quantized (bitsandbytes)

This is a **4-bit NF4 quantized** version of [nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1), quantized with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.

## Why Quantize?

The original model requires ~14 GiB of VRAM in bf16, which exceeds the memory of consumer GPUs such as the RTX 4070 (12 GB). This 4-bit quantized version fits on a 12 GB card:

| | Original (bf16) | Quantized (NF4) |
|---|---|---|
| VRAM | ~14 GiB | **~9.6 GiB** |
| GPU | A100 / H100 | **RTX 4070+ (12 GB)** |
| torch.compile | Yes | Yes |
| CUDA graphs | Yes | Yes |

## What's Quantized?

Only the main transformer's linear layers (attention projections and the gating FFN) are quantized to 4-bit NF4. The following components are kept in bf16 to preserve quality:

- Mimi audio encoder/decoder
- Depformer (depth transformer)
- Embedding layers
- Output heads

## Quick Start

### Prerequisites

1. Accept the [PersonaPlex license](https://huggingface.co/nvidia/personaplex-7b-v1) (required for the base model assets)
2. Set your HuggingFace token:

```bash
export HF_TOKEN=
```

### Installation

```bash
git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes
```

### Run (Live Server)

```bash
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit
```

Then open `https://localhost:8998` in your browser.
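For intuition on what `--quantize-4bit` does under the hood: NF4 stores each weight as a 4-bit index into a fixed 16-value codebook, plus one absmax scale per block of weights. Below is a minimal pure-Python sketch of those mechanics, not the bitsandbytes implementation itself; the codebook values are approximate stand-ins for the real NF4 constants, and the block size of 64 matches the bitsandbytes default.

```python
# Illustrative blockwise 4-bit quantization (NOT the actual bitsandbytes
# NF4 kernel). Each weight becomes a 4-bit index into a 16-value codebook;
# each block of 64 weights also stores one absmax scale.
import random

# 16 levels in [-1, 1], roughly normal-quantile-shaped (approximate
# stand-ins for the true NF4 constants).
CODEBOOK = [-1.0, -0.70, -0.53, -0.39, -0.28, -0.18, -0.09, 0.0,
            0.08, 0.16, 0.25, 0.34, 0.44, 0.56, 0.72, 1.0]

BLOCK = 64  # bitsandbytes default block size for 4-bit quantization


def quantize(weights):
    """Return a list of (scale, indices) blocks: one absmax scale and
    one 4-bit codebook index per weight."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) or 1.0
        idx = [min(range(16), key=lambda k: abs(CODEBOOK[k] - w / scale))
               for w in chunk]
        blocks.append((scale, idx))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate weights from (scale, indices) blocks."""
    out = []
    for scale, idx in blocks:
        out.extend(CODEBOOK[k] * scale for k in idx)
    return out


if __name__ == "__main__":
    random.seed(0)
    w = [random.gauss(0.0, 0.02) for _ in range(256)]
    w_hat = dequantize(quantize(w))
    err = max(abs(a - b) for a, b in zip(w, w_hat))
    # Storage: 4 bits per weight + one scale per 64-weight block,
    # versus 16 bits per weight in bf16 -> roughly 4x smaller.
    print(f"max abs reconstruction error: {err:.4f}")
```

This is why only the big transformer linears are worth quantizing: they dominate the parameter count, while the small bf16 components listed above cost little memory but are sensitive to the rounding error this scheme introduces.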
### Run (Offline Evaluation)

```bash
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json" \
  --quantize-4bit
```

### Using Pre-Quantized Weights

This repo includes pre-quantized weights (`model_bnb_4bit.pt`), so you don't need the full 16.7 GB download. To use them, pass `--moshi-weight model_bnb_4bit.pt` along with `--quantize-4bit`. The loader auto-detects the pre-quantized format and skips re-quantization.

## Changes from Base Model

This repo includes a modified `moshi/` package with:

- `--quantize-4bit` flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
- Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
- `--cpu-offload` fixes for consumer GPU compatibility
- Attention `in_proj` refactored as a proper `nn.Module` for quantization support
- Gating forward path updated to route through quantized modules

## Citation

```bibtex
@misc{roy2026personaplexvoicerolecontrol,
  title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
  author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
  year={2026},
  eprint={2602.06053},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.06053},
}
```

## License

Code is MIT licensed. Model weights are under the [NVIDIA Open Model License](https://huggingface.co/nvidia/personaplex-7b-v1/blob/main/LICENSE).