	---
	license: apache-2.0
	language:
	- zh
	- en
	tags:
	- speech-diagnosis
	- text-to-speech
	- audio-understanding
	- mimo-audio
	- xiaomi
	pipeline_tag: audio-classification
	---

	<div align="center">
	<h1>TTS-PRISM-7B</h1>
	<h3>A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</h3>

	<p align="center">
	<a href="https://arxiv.org/abs/2604.22225"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
	<a href="https://github.com/xiaomi-research/tts-prism"><img src="https://img.shields.io/badge/GitHub-Project%20Page-181717?logo=github" alt="GitHub"></a>
	<a href="https://huggingface.co/xiaomi-research/TTS-PRISM-7B"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-goldenrod" alt="Hugging Face"></a>
	<a href="https://github.com/xiaomi-research/tts-prism/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-007EC6.svg" alt="License"></a>
	</p>
	<p align="center">
	⭐ If TTS-PRISM is helpful to your research, please help star our GitHub repo. Thanks! 🤗
	</p>
	</div>

	<br>

	## 📖 Introduction

	Generative text-to-speech (TTS) models now approach human-level quality, but single-score metrics such as the Mean Opinion Score (MOS) cannot diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin speech.

	Powered by the MiMo-Audio backbone and fine-tuned on a targeted 200k-sample diagnostic dataset, TTS-PRISM embeds explicit scoring criteria and reasoning into an efficient end-to-end model. It not only predicts scores but also generates rationales explaining the specific acoustic flaws or expressive highlights.

	## 🎯 The 12-Dimensional Evaluation Schema

	Unlike generalist models, TTS-PRISM evaluates speech across a strictly defined 12-dimensional hierarchical taxonomy:

	| Layer | Dimension | Description |
	| :--- | :--- | :--- |
	| Basic Capability<br>(Score 1-5) | 🎧 Audio Clarity | Detects background noise, electronic distortion, or artifacts. |
	| | 🗣️ Pronunciation | Identifies incomplete articulation, tone sandhi errors, etc. |
	| | 🎵 Prosody (3) | Evaluates Intonation, Pauses, and Speech Rate. |
	| | 🔄 Consistency (3) | Monitors Speaker, Style, and Emotion consistency. |
	| Advanced Expressiveness<br>(Score 0-2 Bonus) | 💥 Stress | Evaluates keyword emphasis via pitch or loudness. |
	| | 〰️ Lengthening | Checks for natural syllabic lengthening at phrase boundaries. |
	| | 🎭 Paralinguistics | Detects non-verbal cues (laughter, sighs, breaths). |
	| | 💖 Emotion | Evaluates the fullness and intensity of the expressed sentiment. |
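
To make the count of twelve and the two score ranges concrete, the schema can be written down as a small data structure. This is a minimal sketch, not the project's actual output format: the snake_case dimension keys, the `Dimension` class, and the `validate` helper are hypothetical names chosen for illustration. Only the dimension list and the 1-5 and 0-2 ranges come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    layer: str       # "basic" or "advanced"
    min_score: int
    max_score: int


# Hypothetical keys: Prosody and Consistency are expanded into their
# three sub-dimensions each, which gives 8 basic dimensions.
BASIC = [
    "audio_clarity", "pronunciation",
    "intonation", "pauses", "speech_rate",
    "speaker_consistency", "style_consistency", "emotion_consistency",
]
ADVANCED = ["stress", "lengthening", "paralinguistics", "emotion"]

SCHEMA = {
    **{n: Dimension(n, "basic", 1, 5) for n in BASIC},        # scored 1-5
    **{n: Dimension(n, "advanced", 0, 2) for n in ADVANCED},  # 0-2 bonus
}


def validate(scores: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the scores fit the schema."""
    problems = []
    for name, dim in SCHEMA.items():
        if name not in scores:
            problems.append(f"missing: {name}")
        elif not dim.min_score <= scores[name] <= dim.max_score:
            problems.append(f"out of range: {name}={scores[name]}")
    for name in scores:
        if name not in SCHEMA:
            problems.append(f"unknown: {name}")
    return problems
```

A check like `validate` is useful when post-processing model predictions, because a basic-layer score of 0 or an expressiveness bonus of 3 would fall outside the ranges the schema defines.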

	## 🏗 Architecture Overview

	<div align="center">
	<img src="https://raw.githubusercontent.com/xiaomi-research/tts-prism/main/arch_diagram.png" width="70%" alt="TTS-PRISM Architecture Diagram">
	<p><em>Overall architecture of the TTS-PRISM framework.</em></p>
	</div>

	## 📥 Model Download & Usage

	This repository contains the model weights for TTS-PRISM-7B.

	You can download the model weights with the `hf` command-line tool, which ships with the `huggingface_hub` package:

	```bash
	# The `hf` command requires a recent huggingface_hub release
	pip install -U huggingface_hub

	# Download the Tokenizer
	hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer

	# Download the TTS-PRISM-7B weights
	hf download xiaomi-research/TTS-PRISM-7B --local-dir ./models/TTS-PRISM-7B
	```
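
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The `REPOS` mapping and the `download_all` helper are illustrative names, not part of the TTS-PRISM codebase; the repository IDs and target folders are copied from the commands above.

```python
from pathlib import Path

# Repo IDs and local folders taken from the shell commands above.
REPOS = {
    "XiaomiMiMo/MiMo-Audio-Tokenizer": "./models/MiMo-Audio-Tokenizer",
    "xiaomi-research/TTS-PRISM-7B": "./models/TTS-PRISM-7B",
}


def download_all(repos=REPOS):
    """Download every repo in `repos` and return {repo_id: local_path}."""
    # Imported lazily so the mapping can be inspected without the package.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id, local_dir in repos.items():
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        paths[repo_id] = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return paths
```

Calling `download_all()` fetches both repositories over the network and returns the local folder for each.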

	### 🚀 Running Inference
	For the complete inference pipeline, data preparation, and 12-dimensional diagnostic scripts, please visit our official GitHub repository:
	👉 [xiaomi-research/tts-prism](https://github.com/xiaomi-research/tts-prism)

	## ✒️ Citation

	If you find our work helpful, please cite our paper:

	```bibtex
	@article{wang2026ttsprism,
	title={TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis},
	author={Wang, Xi and Wang, Jie and Song, Xingchen and Song, Baijun and Xie, Jingran and Shao, Jiahe and Lin, Zijian and Wu, Di and Meng, Meng and Luan, Jian and Wu, Zhiyong},
	journal={arXiv preprint arXiv:2604.22225},
	year={2026}
	}
	```

	## ⚖️ License
	This project is licensed under the Apache License 2.0. Copyright (c) 2026 Xiaomi Corporation.