---
license: apache-2.0
language:
- zh
- en
tags:
- speech-diagnosis
- text-to-speech
- audio-understanding
- mimo-audio
- xiaomi
pipeline_tag: audio-classification
---
<div align="center">
<h1>TTS-PRISM-7B</h1>
<h3>A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</h3>
<p align="center">
<a href="#-paper-link-placeholder"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/xiaomi-research/tts-prism"><img src="https://img.shields.io/badge/GitHub-Project%20Page-181717?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/xiaomi-research/TTS-PRISM-7B"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-goldenrod" alt="Hugging Face"></a>
<a href="https://github.com/xiaomi-research/tts-prism/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-007EC6.svg" alt="License"></a>
</p>
<p align="center">
⭐ If TTS-PRISM is helpful to your research, please consider starring our GitHub repo. Thanks! 🤗
</p>
</div>
<br>
## 📖 Introduction
Generative text-to-speech (TTS) models now approach human-level quality, but single-score metrics such as the mean opinion score (MOS) cannot diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose **TTS-PRISM**, a multi-dimensional diagnostic framework for Mandarin speech.
Powered by the MiMo-Audio backbone and fine-tuned on a targeted 200k-sample diagnostic dataset, TTS-PRISM embeds explicit scoring criteria and reasoning into an efficient end-to-end model. It not only predicts scores but also generates rationales explaining the specific acoustic flaws or expressive highlights.
## 🎯 The 12-Dimensional Evaluation Schema
Unlike generalist models, TTS-PRISM evaluates speech across a strictly defined 12-dimensional hierarchical taxonomy:
| Layer | Dimension | Description |
| :--- | :--- | :--- |
| **Basic Capability**<br>*(Score 1-5)* | 🎧 **Audio Clarity** | Detects background noise, electronic distortion, or artifacts. |
| | 🗣️ **Pronunciation** | Identifies incomplete articulation, tone sandhi errors, etc. |
| | 🎡 **Prosody (3)** | Evaluates **Intonation**, **Pauses**, and **Speech Rate**. |
| | 🔄 **Consistency (3)** | Monitors **Speaker**, **Style**, and **Emotion** consistency. |
| **Advanced Expressiveness**<br>*(Score 0-2 Bonus)* | 💥 **Stress** | Evaluates keyword emphasis via pitch or loudness. |
| | 〰️ **Lengthening** | Checks for natural syllabic lengthening at phrase boundaries. |
| | 🎭 **Paralinguistics** | Detects non-verbal cues (laughter, sighs, breaths). |
| | 💖 **Emotion** | Evaluates the fullness and intensity of the expressed sentiment. |
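
As a reading aid, the table above can be written down as a small data structure. The sketch below is illustrative only: the dimension identifiers (`audio_clarity`, `speaker_consistency`, and so on), the `Dimension` class, and the `validate` helper are hypothetical names chosen for this example and are not taken from the released code. It assumes only what the table states, namely eight basic dimensions scored 1-5 and four advanced dimensions scored 0-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str   # hypothetical identifier, not from the official repository
    layer: str  # "basic" or "advanced"
    lo: int     # lowest allowed score
    hi: int     # highest allowed score


def _dims(layer, lo, hi, names):
    """Build one Dimension per name, all sharing the same layer and range."""
    return [Dimension(n, layer, lo, hi) for n in names]


# Basic Capability: 8 dimensions, each scored 1-5.
BASIC = _dims("basic", 1, 5, [
    "audio_clarity", "pronunciation",
    "intonation", "pauses", "speech_rate",                              # Prosody (3)
    "speaker_consistency", "style_consistency", "emotion_consistency",  # Consistency (3)
])

# Advanced Expressiveness: 4 dimensions, each a 0-2 bonus.
ADVANCED = _dims("advanced", 0, 2, [
    "stress", "lengthening", "paralinguistics", "emotion",
])

SCHEMA = {d.name: d for d in BASIC + ADVANCED}


def validate(scores: dict) -> None:
    """Raise ValueError if a dimension is missing, unknown, or out of range."""
    missing = SCHEMA.keys() - scores.keys()
    unknown = scores.keys() - SCHEMA.keys()
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for name, value in scores.items():
        d = SCHEMA[name]
        if not d.lo <= value <= d.hi:
            raise ValueError(f"{name}={value} outside [{d.lo}, {d.hi}]")
```

A range check like this can be useful when post-processing model outputs, but the actual output format of TTS-PRISM is defined by the scripts in the GitHub repository, not by this sketch.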
## 🏗 Architecture Overview
<div align="center">
<img src="https://raw.githubusercontent.com/xiaomi-research/tts-prism/main/arch_diagram.png" width="70%" alt="TTS-PRISM Architecture Diagram">
<p><em>Overall architecture of the TTS-PRISM framework.</em></p>
</div>
## 📥 Model Download & Usage
This repository contains the model weights for **TTS-PRISM-7B**.
You can download the model weights and the MiMo-Audio tokenizer with the `hf` CLI, which ships with recent releases of `huggingface_hub`:
```bash
# Install or upgrade the Hugging Face Hub client, which provides the `hf` command
pip install -U huggingface-hub
# Download the Tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
# Download the TTS-PRISM-7B weights
hf download xiaomi-research/TTS-PRISM-7B --local-dir ./models/TTS-PRISM-7B
```
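
If you prefer to fetch the same two repositories from Python, the following is a minimal sketch using `snapshot_download` from `huggingface_hub`. The helper names `local_dir_for` and `download_all` and the `REPOS` list are hypothetical and exist only for this example; the repository IDs and the `./models/<name>` layout mirror the commands above.

```python
from pathlib import Path

# Repository IDs taken from the CLI commands above.
REPOS = [
    "XiaomiMiMo/MiMo-Audio-Tokenizer",
    "xiaomi-research/TTS-PRISM-7B",
]


def local_dir_for(repo_id: str, root: str = "./models") -> Path:
    """Map 'org/name' to '<root>/name', matching the --local-dir layout above."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = "./models") -> list:
    """Download every repository in REPOS and return the local paths."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id in REPOS:
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        paths.append(target)
    return paths
```

Calling `download_all()` should leave the tokenizer in `./models/MiMo-Audio-Tokenizer` and the weights in `./models/TTS-PRISM-7B`, the same locations the CLI commands use.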
### 🚀 Running Inference
For the complete inference pipeline, data preparation, and 12-dimensional diagnostic scripts, please visit our official GitHub repository:
👉 **[xiaomi-research/tts-prism](https://github.com/xiaomi-research/tts-prism)**
## ✒️ Citation
If you find our work helpful, please cite our paper:
```bibtex
@article{wang2026ttsprism,
title={TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis},
author={Wang, Xi and Wang, Jie and Song, Xingchen and Song, Baijun and Xie, Jingran and Shao, Jiahe and Lin, Zijian and Wu, Di and Meng, Meng and Luan, Jian and Wu, Zhiyong},
journal={arXiv preprint arXiv:2604.22225},
year={2026}
}
```
## ⚖️ License
This project is licensed under the Apache License 2.0. Copyright (c) 2026 Xiaomi Corporation.