# Running PersonaPlex-7B on Hugging Face ZeroGPU: A Complete Guide
## Overview
PersonaPlex-7B is a high-performance speech-to-speech model that typically requires a powerful NVIDIA GPU for inference. However, Hugging Face ZeroGPU allows you to run such models without owning any GPU hardware locally.
This guide walks through how to deploy PersonaPlex-7B using a Hugging Face Space with ZeroGPU, enabling scalable and cost-efficient inference from any machine.
## How ZeroGPU Works
ZeroGPU separates your development environment from the compute layer.
Your local machine (CPU-only) interacts with a Hugging Face Space via browser or API calls. The Space itself runs continuously on CPU, but when GPU-intensive functions are triggered, a high-end GPU is dynamically attached.
Architecture flow:

1. Your laptop (no GPU) sends requests via `gradio_client` or the browser.
2. The Hugging Face Space receives the request and runs on CPU by default.
3. When a function decorated with `@spaces.GPU` is called:
   - An NVIDIA H200 GPU (70GB VRAM) is allocated
   - Inference runs on the GPU
   - The GPU is released immediately after execution
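The allocate → run → release lifecycle above can be sketched in plain Python. This is an illustrative stand-in decorator, not the real `spaces` API; on an actual Space you would simply write `@spaces.GPU(duration=...)`.

```python
import functools

def fake_gpu(duration=60):
    """Stand-in for `spaces.GPU`: attaches a 'GPU' just-in-time and
    releases it right after the call, mirroring ZeroGPU's behavior."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            events = kwargs.pop("events", None)  # optional log, for illustration
            if events is not None:
                events.append("allocate")  # GPU attached only for this call
            try:
                return fn(*args, **kwargs)
            finally:
                if events is not None:
                    events.append("release")  # GPU freed immediately after
        return wrapper
    return decorator

@fake_gpu(duration=120)
def run_inference(x):
    return x * 2  # placeholder for GPU-bound work
```

The key point the sketch captures: the function body is the only region that ever sees the GPU, so everything outside it (the Gradio UI, file handling) runs on free CPU.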
Key characteristics:
- Spaces run on CPU 24/7 at no cost
- GPU is allocated only during execution
- H200 (70GB VRAM) is sufficient for PersonaPlex-7B
- Hugging Face Pro increases quota and priority
## ZeroGPU Limits (Free vs Pro)
| Feature | Free | Pro ($9/month) |
|---|---|---|
| Daily GPU quota | Baseline | 7x higher |
| Queue priority | Normal | Highest |
| Max ZeroGPU Spaces | — | 10 |
| Beyond quota | Blocked | $1 per 10 minutes |
| GPU type | H200 (70GB) | H200 (70GB) |
| Max duration per call | 60s | 120s+ (configurable) |
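As a quick sanity check on the overage pricing in the table, a small helper can estimate the cost of GPU time beyond quota. The $1-per-10-minutes figure comes from the table above; billing each *started* 10-minute block is an assumption.

```python
import math

def overage_cost_usd(extra_gpu_seconds: float) -> int:
    """Estimated cost of GPU time beyond the daily quota, assuming
    $1 per started 10-minute block (rounding policy is an assumption)."""
    if extra_gpu_seconds <= 0:
        return 0
    return math.ceil(extra_gpu_seconds / 600)

# 25 minutes over quota -> 3 started 10-minute blocks -> $3
```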
## Step 1: Create a Hugging Face Space
Navigate to https://huggingface.co/new-space
Configure the Space:

- Name: `personaplex-batch`
- SDK: Gradio
- Hardware: ZeroGPU (requires Pro)
- Visibility: Private (recommended)

Create the Space.
## Step 2: Add Required Files
Your Space requires three files.
### requirements.txt

```text
torch
torchaudio
huggingface_hub
spaces
```
### setup.sh

This script installs PersonaPlex at build time.

```bash
#!/bin/bash
git clone https://github.com/NVIDIA/personaplex.git /home/user/app/personaplex
cd /home/user/app/personaplex && pip install moshi/
```
### app.py

This is the main application.

```python
import gradio as gr
import spaces
import subprocess
import json
import tempfile
import os
from pathlib import Path


@spaces.GPU(duration=120)
def generate_response(audio_path, voice, text_prompt, seed):
    with tempfile.TemporaryDirectory() as tmp:
        out_wav = os.path.join(tmp, "output.wav")
        out_json = os.path.join(tmp, "output.json")
        cmd = [
            "python", "-m", "moshi.offline",
            "--voice-prompt", f"{voice}.pt",
            "--text-prompt", text_prompt,
            "--input-wav", audio_path,
            "--seed", str(int(seed)),
            "--output-wav", out_wav,
            "--output-text", out_json,
        ]
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            cwd="/home/user/app/personaplex",
            env={**os.environ, "HF_TOKEN": os.environ.get("HF_TOKEN", "")},
        )
        if result.returncode != 0:
            raise gr.Error(result.stderr[:500])
        transcript = json.loads(Path(out_json).read_text())
        # Copy the audio out of the temporary directory before it is deleted,
        # so Gradio can still serve the file after this function returns.
        persistent_wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        persistent_wav.write(Path(out_wav).read_bytes())
        persistent_wav.close()
        return persistent_wav.name, transcript


voices = [
    "NATF0", "NATF1", "NATF2", "NATF3",
    "NATM0", "NATM1", "NATM2", "NATM3",
    "VARF0", "VARF1", "VARF2", "VARF3", "VARF4",
    "VARM0", "VARM1", "VARM2", "VARM3", "VARM4",
]

demo = gr.Interface(
    fn=generate_response,
    inputs=[
        gr.Audio(type="filepath"),
        gr.Dropdown(voices, value="NATF1"),
        gr.Textbox(value="You are a helpful customer service agent."),
        gr.Number(value=42424242),
    ],
    outputs=[
        gr.Audio(),
        gr.JSON(),
    ],
    title="PersonaPlex Batch Tester",
)

demo.launch()
```
## Step 3: Add a Hugging Face Token

In your Space settings:

1. Navigate to **Repository secrets**
2. Add a secret:
   - Name: `HF_TOKEN`
   - Value: your Hugging Face token

This token is required to download the model weights.
## Step 4: Build and Deploy

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/personaplex-batch
cd personaplex-batch
git add app.py requirements.txt setup.sh
git commit -m "Initial setup"
git push
```
The first build takes approximately 5–10 minutes.
## Step 5: Programmatic Inference
Install the client:

```bash
pip install gradio_client
```
Example usage:

```python
from gradio_client import Client, handle_file

client = Client(
    "YOUR_USERNAME/personaplex-batch",
    hf_token="hf_xxx",
)

audio_out, transcript = client.predict(
    handle_file("test.wav"),
    "NATF1",
    "You are a helpful agent.",
    42424242,
    api_name="/generate_response",
)
```
## Step 6: Batch Processing
You can define multiple test cases and run them sequentially:
```python
test_cases = [
    {
        "audio": "tests/input.wav",
        "voice": "NATF1",
        "prompt": "You are a support agent.",
        "seed": 42424242,
    }
]
```
Loop over inputs and store outputs locally.
## Step 7: Parallel Execution

To increase throughput, submit jobs asynchronously using `client.submit()` and collect the results as they complete. The standard-library `concurrent.futures` module works well for driving several requests at once.
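A hedged sketch of that pattern, again with the `predict` callable injected so it can stand in for `client.predict` or a `client.submit(...).result()` wrapper. ZeroGPU queues requests server-side, so a handful of workers is usually enough.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(predict, test_cases, max_workers=4):
    """Run test cases concurrently and return results in input order.

    `predict(audio, voice, prompt, seed)` stands in for a call to the
    Space; results are gathered as futures complete, then reordered.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(predict, c["audio"], c["voice"], c["prompt"], c["seed"]): i
            for i, c in enumerate(test_cases)
        }
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # raises if the call failed
    return [results[i] for i in range(len(test_cases))]
```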
## Cost Considerations

| Component | Cost / Usage |
|---|---|
| Hugging Face Pro | $9/month |
| ZeroGPU quota | Included with Pro |
| GPU time per request | ~60–120 seconds (within quota) |
| Usage beyond quota | $1 per 10 minutes |
| Client-side execution | Free |
Estimated capacity:
- ~50–100 inferences per day within quota (depending on duration)
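The capacity estimate above is simple division, sketched below. The daily quota value is an assumption for illustration; Hugging Face does not publish a fixed number and may adjust it.

```python
def calls_within_quota(daily_quota_seconds: int, seconds_per_call: int) -> int:
    """How many inference calls fit in a daily GPU quota.
    Both inputs are estimates; the quota figure is an assumption."""
    return daily_quota_seconds // seconds_per_call

# e.g. with an assumed 2-hour daily quota and 120 s per call:
# calls_within_quota(2 * 3600, 120) -> 60 calls per day
```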
## When to Use ZeroGPU
ZeroGPU is well suited for:
- Development and experimentation
- Batch testing pipelines
- Low-frequency inference workloads
For production systems requiring:

- Low latency
- Continuous availability
- High throughput

a dedicated GPU deployment (RunPod, AWS, or on-prem) is more appropriate.
## Conclusion
ZeroGPU provides a practical way to run PersonaPlex-7B without local GPU hardware. It enables rapid experimentation and integration while abstracting away infrastructure complexity.
For early-stage development or controlled workloads, this approach is efficient and cost-effective. For production-scale deployment, it should be complemented with dedicated GPU infrastructure.