---
license: apache-2.0
language:
- zh
- en
tags:
- speech-diagnosis
- text-to-speech
- audio-understanding
- mimo-audio
- xiaomi
pipeline_tag: audio-classification
---

<div align="center">
<h1>TTS-PRISM-7B</h1>
<h3>A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</h3>

<p align="center">
<a href="#-paper-link-placeholder"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/xiaomi-research/tts-prism"><img src="https://img.shields.io/badge/GitHub-Project%20Page-181717?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/xiaomi-research/TTS-PRISM-7B"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-goldenrod" alt="Hugging Face"></a>
<a href="https://github.com/xiaomi-research/tts-prism/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-007EC6.svg" alt="License"></a>
</p>
<p align="center">
⭐ If TTS-PRISM is helpful to your research, please star our GitHub repo. Thanks! 🤗
</p>
</div>

<br>

## Introduction

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics such as the mean opinion score (MOS) fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose **TTS-PRISM**, a multi-dimensional diagnostic framework for Mandarin.

Powered by the MiMo-Audio backbone and fine-tuned on a targeted 200k-sample diagnostic dataset, TTS-PRISM embeds explicit scoring criteria and reasoning into an efficient end-to-end model. It not only predicts scores but also generates rationales that explain specific acoustic flaws or expressive highlights.

## The 12-Dimensional Evaluation Schema

Unlike generalist models, TTS-PRISM evaluates speech across a strictly defined 12-dimensional hierarchical taxonomy:

| | Layer | Dimension | Description | |
| | :--- | :--- | :--- | |
| **Basic Capability**<br>*(Score 1-5)* | **Audio Clarity** | Detects background noise, electronic distortion, or artifacts. |
| | **Pronunciation** | Identifies incomplete articulation, tone sandhi errors, etc. |
| | **Prosody (3)** | Evaluates **Intonation**, **Pauses**, and **Speech Rate**. |
| | **Consistency (3)** | Monitors **Speaker**, **Style**, and **Emotion** consistency. |
| **Advanced Expressiveness**<br>*(Score 0-2 Bonus)* | **Stress** | Evaluates keyword emphasis via pitch or loudness. |
| | **Lengthening** | Checks for natural syllabic lengthening at phrase boundaries. |
| | **Paralinguistics** | Detects non-verbal cues (laughter, sighs, breaths). |
| | **Emotion** | Evaluates the fullness and intensity of the expressed sentiment. |
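
The table above can be written down as a small data structure, which is handy for sanity-checking a set of scores before further analysis. The following is only a minimal sketch: the snake_case dimension keys, the `Dimension` class, and `validate_scores` are hypothetical names chosen for illustration, not identifiers from the TTS-PRISM codebase, and the model's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    layer: str
    min_score: int
    max_score: int


BASIC = "basic_capability"
ADVANCED = "advanced_expressiveness"

# Eight basic dimensions (scored 1-5) and four advanced ones (0-2 bonus),
# following the table above. The key names are assumptions for illustration.
SCHEMA = tuple(
    [Dimension(n, BASIC, 1, 5) for n in (
        "audio_clarity", "pronunciation",
        "intonation", "pauses", "speech_rate",
        "speaker_consistency", "style_consistency", "emotion_consistency",
    )]
    + [Dimension(n, ADVANCED, 0, 2) for n in (
        "stress", "lengthening", "paralinguistics", "emotion",
    )]
)


def validate_scores(scores: dict) -> list:
    """Return a list of problems; the list is empty if scores fit the schema."""
    problems = []
    known = {d.name for d in SCHEMA}
    for dim in SCHEMA:
        if dim.name not in scores:
            problems.append(f"missing: {dim.name}")
            continue
        value = scores[dim.name]
        if not dim.min_score <= value <= dim.max_score:
            problems.append(
                f"out of range: {dim.name}={value} "
                f"(expected {dim.min_score}-{dim.max_score})"
            )
    for name in scores:
        if name not in known:
            problems.append(f"unknown: {name}")
    return problems


# Usage: start from the maximum score on every dimension, then break one.
example = {d.name: d.max_score for d in SCHEMA}
example["stress"] = 3  # the advanced layer only allows 0-2
print(validate_scores(example))
# → ['out of range: stress=3 (expected 0-2)']
```

Keeping the two layers in one flat schema makes the differing score ranges explicit, so a basic-layer 5 and an advanced-layer 2 are never compared on the same scale by accident.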

## Architecture Overview

<div align="center">
<img src="https://raw.githubusercontent.com/xiaomi-research/tts-prism/main/arch_diagram.png" width="70%" alt="TTS-PRISM Architecture Diagram">
<p><em>Overall architecture of the TTS-PRISM framework.</em></p>
</div>

## Model Download & Usage

This repository contains the model weights for **TTS-PRISM-7B**.

You can download the model weights with the `hf` command-line tool from the `huggingface_hub` package:

```bash
# The `hf` command ships with recent releases of huggingface_hub
pip install -U huggingface-hub

# Download the MiMo-Audio tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer

# Download the TTS-PRISM-7B weights
hf download xiaomi-research/TTS-PRISM-7B --local-dir ./models/TTS-PRISM-7B
```
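
If you would rather fetch the weights from a Python script, the same two downloads can be sketched with `snapshot_download` from `huggingface_hub`. This is a rough equivalent of the commands above, not part of the official pipeline; the helper names `local_dir_for` and `download_all` are hypothetical.

```python
from pathlib import Path

# Repository IDs taken from the CLI commands above.
REPOS = (
    "XiaomiMiMo/MiMo-Audio-Tokenizer",
    "xiaomi-research/TTS-PRISM-7B",
)


def local_dir_for(repo_id: str, root: str = "./models") -> Path:
    """Map 'org/name' to '<root>/name', mirroring the --local-dir layout above."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = "./models") -> list:
    """Download every repository in REPOS and return the local directories."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id in REPOS:
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        paths.append(target)
    return paths
```

Calling `download_all()` requires network access and enough disk space for both repositories; it returns the two directories under `./models`.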

### Running Inference
For the complete inference pipeline, data preparation, and 12-dimensional diagnostic scripts, please visit our official GitHub repository:
**[xiaomi-research/tts-prism](https://github.com/xiaomi-research/tts-prism)**

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@article{wang2026ttsprism,
  title={TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis},
  author={Wang, Xi and Wang, Jie and Song, Xingchen and Song, Baijun and Xie, Jingran and Shao, Jiahe and Lin, Zijian and Wu, Di and Meng, Meng and Luan, Jian and Wu, Zhiyong},
  journal={arXiv preprint arXiv:2604.22225},
  year={2026}
}
```

## License
This project is licensed under the Apache License 2.0. Copyright (c) 2026 Xiaomi Corporation.