---
license: other
license_name: nvidia-open-model-license
license_link: https://huggingface.co/nvidia/personaplex-7b-v1/blob/main/LICENSE
base_model: nvidia/personaplex-7b-v1
base_model_relation: quantized
tags:
  - speech-to-speech
  - conversational-ai
  - bitsandbytes
  - 4-bit
  - quantized
  - nvidia
  - moshi
library_name: custom
---

# PersonaPlex 7B v1 — 4-bit NF4 Quantized (bitsandbytes)

This is a 4-bit NF4 quantized version of nvidia/personaplex-7b-v1 using bitsandbytes.

PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.
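For intuition: NF4 (4-bit NormalFloat) quantization stores each weight as a 4-bit index into a fixed 16-entry codebook, rescaled per block by the block's absolute maximum. A minimal pure-Python sketch (the codebook values below are rounded approximations of the QLoRA NF4 table; the real bitsandbytes kernels pack two indices per byte and run on GPU):

```python
# Approximate NF4 codebook: 16 levels spaced for normally distributed weights.
NF4_CODEBOOK = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_block(values, codebook=NF4_CODEBOOK):
    """Scale a block by its absmax, then store nearest-codebook indices."""
    absmax = max(abs(v) for v in values) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v / absmax))
        for v in values
    ]
    return codes, absmax

def dequantize_block(codes, absmax, codebook=NF4_CODEBOOK):
    """Invert: look up each 4-bit code and rescale by the stored absmax."""
    return [codebook[c] * absmax for c in codes]
```

Each block therefore costs 4 bits per weight plus one shared scale, which is where the roughly 4x memory saving over bf16 comes from.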

## Why Quantize?

The original model requires ~14 GiB of VRAM in bf16, which exceeds consumer GPUs such as the RTX 4070 (12 GB). This 4-bit quantized version compares as follows:

|               | Original (bf16) | Quantized (NF4)   |
| ------------- | --------------- | ----------------- |
| VRAM          | ~14 GiB         | ~9.6 GiB          |
| GPU           | A100 / H100     | RTX 4070+ (12 GB) |
| torch.compile | Yes             | Yes               |
| CUDA graphs   | Yes             | Yes               |
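A quick back-of-the-envelope check on the numbers above (illustrative arithmetic only; the 64-element block size is the bitsandbytes default, and measured VRAM also includes activations, the KV cache, and CUDA overhead):

```python
# Rough memory math for a ~7B-parameter model.
GiB = 2**30
total_params = 7e9

# bf16 stores 2 bytes per parameter: ~13 GiB of raw weights,
# landing near ~14 GiB once runtime overhead is included.
bf16_weights_gib = total_params * 2 / GiB

# NF4 packs 4 bits per parameter plus one fp32 absmax scale
# per 64-parameter block.
nf4_bytes_per_param = 4 / 8 + 4 / 64   # = 0.5625 bytes/param
```

Components kept in bf16 (the Mimi codec, Depformer, embeddings, and output heads) still pay 2 bytes per parameter, which is why the quantized total lands near 9.6 GiB rather than ~4 GiB.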

## What's Quantized?

Only the main transformer's linear layers (attention projections + gating FFN) are quantized to 4-bit NF4. The following are kept in bf16 for quality:

- Mimi audio encoder/decoder
- Depformer (depth transformer)
- Embedding layers
- Output heads
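Conceptually, the loader applies a name-based filter when deciding which layers to swap for 4-bit modules. A minimal sketch (the dotted module paths below are hypothetical illustrations, not the actual names in the moshi package):

```python
# Hypothetical prefixes for the components the card says stay in bf16.
KEEP_BF16_PREFIXES = ("mimi.", "depformer.", "emb.", "head.")

def should_quantize(name: str, is_linear: bool) -> bool:
    """Quantize only linear layers outside the bf16-preserved components."""
    return is_linear and not name.startswith(KEEP_BF16_PREFIXES)
```

Non-linear modules (norms, embeddings) and anything under a preserved prefix pass through untouched, so only the main transformer's attention projections and gating FFN end up in NF4.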

## Quick Start

### Prerequisites

1. Accept the PersonaPlex license (required for the base model assets).
2. Set your HuggingFace token:

   ```bash
   export HF_TOKEN=<YOUR_TOKEN>
   ```

### Installation

```bash
git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes
```

### Run (Live Server)

```bash
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit
```

Then open https://localhost:8998 in your browser.

### Run (Offline Evaluation)

```bash
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json" \
  --quantize-4bit
```

## Using Pre-Quantized Weights

This repo includes pre-quantized weights (`model_bnb_4bit.pt`), so you don't need the full 16.7 GB download. To use them, pass `--moshi-weight model_bnb_4bit.pt` along with `--quantize-4bit`. The loader auto-detects the pre-quantized format and skips re-quantization.
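The auto-detection can be pictured as a key scan over the checkpoint's state dict. A sketch, assuming bitsandbytes-style serialization (extra `quant_state`/`absmax` entries stored next to each packed weight; the exact key names are an assumption here):

```python
def is_prequantized(state_dict: dict) -> bool:
    """Heuristic: serialized 4-bit layers carry quantization-state entries
    alongside the packed weight tensors; a plain bf16 checkpoint does not."""
    return any(
        ".quant_state" in key or key.endswith(".absmax")
        for key in state_dict
    )
```

When this returns True, the loader can install the packed tensors directly instead of running the on-the-fly quantization pass.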

## Changes from Base Model

This repo includes a modified `moshi/` package with:

- `--quantize-4bit` flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
- Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
- `--cpu-offload` fixes for consumer GPU compatibility
- Attention `in_proj` refactored as a proper `nn.Module` for quantization support
- Gating forward path updated to route through quantized modules

## Citation

```bibtex
@misc{roy2026personaplexvoicerolecontrol,
      title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
      author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
      year={2026},
      eprint={2602.06053},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06053},
}
```

## License

Code is MIT licensed. Model weights are under the NVIDIA Open Model License.