---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---

<!-- Version 0.1.0 -->
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech

<div align="center">

<img width="640" height="320" alt="soprano-github" src="https://github.com/user-attachments/assets/4d612eac-23b8-44e6-8c59-d7ac14ebafd1" />

[![GitHub Repo](https://img.shields.io/badge/Github-Repo-black?logo=github)](https://github.com/ekwek1/soprano)
[![HuggingFace Demo](https://img.shields.io/badge/HuggingFace-Demo-yellow?logo=huggingface)](https://huggingface.co/spaces/ekwek/Soprano-TTS)
</div>

### 📰 News
**2026.01.14 - [Soprano-1.1-80M](https://huggingface.co/ekwek/Soprano-1.1-80M) released! 95% fewer hallucinations and a 63% preference rate over Soprano-80M.**

2026.01.13 - [Soprano-Factory](https://github.com/ekwek1/soprano-factory) released! You can now train or fine-tune your own Soprano models.

2025.12.22 - Soprano-80M released! [Code](https://github.com/ekwek1/soprano) | [Demo](https://huggingface.co/spaces/ekwek/Soprano-TTS)

---

## Overview

**Soprano** is an ultra‑lightweight, on-device text‑to‑speech (TTS) model designed for expressive, high‑fidelity speech synthesis at unprecedented speed. Soprano offers:
- Up to **2000x** real-time generation on GPU and **20x** real-time on CPU
- **Lossless streaming** with **<15 ms** latency on GPU and **<250 ms** on CPU
- **<1 GB** memory usage with a compact 80M-parameter architecture
- **Infinite generation length** via automatic text splitting
- Highly expressive, crystal-clear audio generation at **32 kHz**
- Broad support for CUDA, CPU, and MPS devices on Windows, Linux, and macOS
- WebUI, CLI, and an OpenAI-compatible endpoint for easy, production-ready inference

---

## Installation

### Install with wheel (CUDA-only for now)

```bash
pip install soprano-tts
```

To get the latest features, install from source instead.

### Install from source (CUDA)

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e ".[lmdeploy]"
```

### Install from source (CPU/MPS)

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```

> ### ⚠️ Warning: Windows CUDA users
>
> On Windows with CUDA, `pip` will install a CPU-only PyTorch build. To ensure CUDA support works as expected, reinstall PyTorch explicitly with the correct CUDA wheel **after** installing Soprano:
>
> ```bash
> pip uninstall -y torch
> pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
> ```

---

## Usage

### WebUI

Start the WebUI:

```bash
soprano-webui  # hosted on http://127.0.0.1:7860 by default
```

> **Tip:** You can increase the cache size and decoder batch size to speed up inference at the cost of higher memory usage. For example:
> ```bash
> soprano-webui --cache-size 1000 --decoder-batch-size 4
> ```

### CLI

```
soprano "Soprano is an extremely lightweight text to speech model."

optional arguments:
  --output, -o               Output audio file path (non-streaming only). Defaults to 'output.wav'
  --model-path, -m           Path to a local model directory (optional)
  --device, -d               Device to use for inference. Supported: auto, cuda, cpu, mps. Defaults to 'auto'
  --backend, -b              Backend to use for inference. Supported: auto, transformers, lmdeploy. Defaults to 'auto'
  --cache-size, -c           Cache size in MB (for the lmdeploy backend). Defaults to 100
  --decoder-batch-size, -bs  Decoder batch size. Defaults to 1
  --streaming, -s            Enable streaming playback to speakers
```

> **Tip:** You can increase the cache size and decoder batch size to speed up inference at the cost of higher memory usage.

> **Note:** The CLI reloads the model on every invocation, so inference will be slower than with the other usage methods.

### OpenAI-compatible endpoint

Start the server:

```bash
uvicorn soprano.server:app --host 0.0.0.0 --port 8000
```

Call the endpoint like this:

```bash
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Soprano is an extremely lightweight text to speech model."
  }' \
  --output speech.wav
```

> **Note:** Currently, this endpoint only supports non-streaming output.
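The same request can be issued from Python using only the standard library. The endpoint URL and JSON body below mirror the curl example above; the `build_payload` and `synthesize` helper names are illustrative, not part of the Soprano package:

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/audio/speech"

def build_payload(text):
    # Minimal request body: "input" carries the text to synthesize,
    # matching the schema shown in the curl example.
    return {"input": text}

def synthesize(text, out_path="speech.wav"):
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # The server responds with WAV bytes, which are written to disk.
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

Start the `uvicorn` server first, then call `synthesize("Hello from Soprano.")` to save `speech.wav`.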

### Python script

```python
from soprano import SopranoTTS

model = SopranoTTS(backend='auto', device='auto', cache_size_mb=100, decoder_batch_size=1)
```

> **Tip:** You can increase `cache_size_mb` and `decoder_batch_size` to speed up inference at the cost of higher memory usage.

```python
# Basic inference
out = model.infer("Soprano is an extremely lightweight text to speech model.")  # can achieve 2000x real-time with sufficiently long input!

# Save output to a file
out = model.infer("Soprano is an extremely lightweight text to speech model.", "out.wav")

# Custom sampling parameters
out = model.infer(
    "Soprano is an extremely lightweight text to speech model.",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)

# Batched inference
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10)  # can achieve 2000x real-time with a sufficiently large batch!

# Save batch outputs to a directory
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10, "/dir")

# Streaming inference
from soprano.utils.streaming import play_stream

stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)
play_stream(stream)  # plays audio with <15 ms latency!
```

## Usage tips

* Soprano works best when each sentence is between 2 and 15 seconds long.
* Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them. For best results, convert them to their spoken form (e.g., 1+1 -> one plus one).
* If Soprano produces unsatisfactory output, simply regenerate for a new, potentially better result. You can also adjust the sampling settings for more varied output.
* Avoid improper grammar, such as omitted contractions or repeated spaces.

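The number-conversion tip can be implemented as a light pre-processing pass before calling `model.infer`. This is a minimal sketch: the `normalize` helper and its character maps are hypothetical, not part of the Soprano API, and a real normalizer would also handle multi-digit numbers and ordinals:

```python
import re

# Spell out digits and a few operators so the TTS model pronounces them
# reliably. These maps are intentionally tiny, for illustration only.
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
OPS = {"+": "plus", "=": "equals", "%": "percent"}

def normalize(text):
    out = []
    for ch in text:
        if ch in DIGITS:
            out.append(" " + DIGITS[ch] + " ")
        elif ch in OPS:
            out.append(" " + OPS[ch] + " ")
        else:
            out.append(ch)
    # Collapse the padding spaces introduced above.
    return re.sub(r"\s+", " ", "".join(out)).strip()

print(normalize("1+1"))  # -> "one plus one"
```

The normalized string can then be passed straight to `model.infer(...)`.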
---

## Limitations

Soprano is currently English-only and does not support voice cloning. In addition, Soprano was trained on only 1,000 hours of audio (~100x less than other TTS models), so it may mispronounce uncommon words. This is expected to diminish as Soprano is trained on more data.

---

## License

This project is licensed under the **Apache-2.0** license. See `LICENSE` for details.