Add files using upload-large-folder tool
README.md
ADDED
@@ -0,0 +1,100 @@
---
license: apache-2.0
language:
- en
base_model:
- nari-labs/Dia-1.6B
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- dia
- dac
- dialogue
- gguf
- crispasr
library_name: ggml
---

# Dia-1.6B — GGUF (ggml)

GGUF / ggml conversion of [`nari-labs/Dia-1.6B`](https://huggingface.co/nari-labs/Dia-1.6B) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.

Dia is a dialogue text-to-speech model that generates expressive 44.1 kHz speech from text, with `[S1]` / `[S2]` speaker tags:

- **Text encoder** (12-layer, 1024-d, byte-level vocab 256): encodes the prompt bytes.
- **Audio decoder** (18-layer, 2048-d, GQA 16 query / 4 KV heads, classifier-free guidance): autoregressively emits **9 interleaved DAC codebooks** under a delay pattern `[0,8,9,10,11,12,13,14,15]`.
- **DAC codec** (44.1 kHz): decodes the 9 codebooks to PCM. Shipped as a separate **required** companion file.

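
The delay pattern in the decoder bullet can be read as a per-codebook time shift: codebook `k` is emitted `delay[k]` steps later than the frame it belongs to. The following is a minimal Python sketch of that interleaving, not code taken from Dia or CrispASR; `apply_delay`, `undo_delay`, and the `PAD` filler value are hypothetical names chosen for illustration.

```python
DELAY_PATTERN = [0, 8, 9, 10, 11, 12, 13, 14, 15]  # from the decoder description
PAD = -1  # hypothetical filler for positions that hold no code yet


def apply_delay(frames, delays=DELAY_PATTERN, pad=PAD):
    """Shift codebook k right by delays[k] steps.

    frames: list of T frames, each a list with one code per codebook.
    Returns T + max(delays) rows in the delayed (interleaved) layout.
    """
    n_frames = len(frames)
    out = [[pad] * len(delays) for _ in range(n_frames + max(delays))]
    for t in range(n_frames):
        for k, d in enumerate(delays):
            out[t + d][k] = frames[t][k]
    return out


def undo_delay(delayed, delays=DELAY_PATTERN):
    """Invert apply_delay: realign every codebook to its original frame."""
    n_frames = len(delayed) - max(delays)
    return [[delayed[t + d][k] for k, d in enumerate(delays)]
            for t in range(n_frames)]


# Three toy frames; each value encodes (frame, codebook) for readability.
frames = [[10 * t + k for k in range(9)] for t in range(3)]
delayed = apply_delay(frames)
restored = undo_delay(delayed)
```

In this sketch the shift is undone before decoding, so that the 9 codes of one frame line up again; whether CrispASR structures the step the same way is an assumption.
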
Released under **Apache 2.0**.

## Files

| File | Quant | Size | Notes |
|---|---|---:|---|
| `dia-1.6b-f16.gguf` | F16 | 3.0 GB | Main model — reference quality |
| `dac-44khz.gguf` | — | 104 MB | DAC codec — **required** companion (download both) |

> Lower-bit quants (Q8_0 / Q4_K) are not published yet: Dia uses `scale=1.0`
> attention (no `1/√d`), which is precision-sensitive, so quants need an
> ASR-roundtrip check before release.

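
To make the `scale=1.0` remark concrete, here is a small pure-Python sketch of single-query attention weights with a configurable scale. It is illustrative only: `attention_weights` is a hypothetical helper and the vectors are made-up numbers, not values from Dia. The head dimension of 128 is the one listed under Architecture details.

```python
import math


def attention_weights(q, keys, scale):
    """Softmax over scale * (q . k) for a single query vector."""
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


head_dim = 128
q = [0.5] * head_dim
keys = [[0.5] * head_dim, [0.45] * head_dim]  # two nearly identical keys

# Dia-style: logits are used as-is, without the 1/sqrt(d) factor.
w_unit = attention_weights(q, keys, scale=1.0)
# Conventional scaled dot-product attention, for comparison.
w_sqrt = attention_weights(q, keys, scale=1.0 / math.sqrt(head_dim))
```

With `scale=1.0` the same small difference between the two keys produces a much sharper distribution than with `1/√d`, so rounding errors in the dot products move the weights further. That is one plausible reading of why quantized weights need extra validation here, not a statement about the authors' measurements.
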
## Quick start

```bash
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model + DAC codec
hf download cstr/dia-1.6b-GGUF dia-1.6b-f16.gguf dac-44khz.gguf --local-dir .

# 3. Synthesize (keep the codec beside the model, or pass --codec-model)
./build/bin/crispasr --backend dia -m dia-1.6b-f16.gguf \
  --codec-model dac-44khz.gguf \
  --tts "[S1] Hello there, how are you doing today? I really hope you are having a wonderful and pleasant time." \
  --tts-output hello.wav --seed 42
```

Or with auto-download (pulls the model + DAC companion):
```bash
./build/bin/crispasr -m dia --auto-download \
  --tts "[S1] The quick brown fox jumps over the lazy dog, and then it runs back again." \
  --tts-output fox.wav
```

> **Prompt length matters.** Dia is inconsistent on very short inputs (it may
> emit non-speech), so use prompts of **at least ~100 characters**. Start the
> text with an `[S1]` (or `[S2]`) speaker tag.

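
The two prompt rules in the note above are easy to check before calling the CLI. This is a minimal sketch under stated assumptions: `check_prompt` is a hypothetical helper, and counting the speaker tag toward the ~100 characters is a guess, since the note does not say how the length is measured.

```python
MIN_CHARS = 100  # "at least ~100 characters" from the note above
SPEAKER_TAGS = ("[S1]", "[S2]")


def check_prompt(text, min_chars=MIN_CHARS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    stripped = text.strip()
    if not stripped.startswith(SPEAKER_TAGS):
        problems.append("prompt should start with [S1] or [S2]")
    if len(stripped) < min_chars:
        problems.append(
            f"prompt has {len(stripped)} characters, want >= {min_chars}"
        )
    return problems


short_prompt = "[S1] Hi."
long_prompt = (
    "[S1] Hello there, how are you doing today? "
    "I really hope you are having a wonderful and pleasant time."
)

for prompt in (short_prompt, long_prompt):
    issues = check_prompt(prompt)
    print("OK" if not issues else "; ".join(issues))
```
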
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `--seed N` | 0 | RNG seed (0 = non-deterministic; output varies per seed) |
| `-tp N` | 1.2 | Sampling temperature |
| `--codec-model PATH` | auto | DAC codec GGUF (auto-discovered beside the model) |
| `--tts-output PATH` | — | Output WAV path (44.1 kHz mono) |

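
When driving the CLI from a script, the flags in the table can be assembled into an argument list. The sketch below uses only flags that appear in this README; `build_tts_command` is a hypothetical helper, and the binary path is the one from the Quick start build.

```python
import subprocess


def build_tts_command(model, text, output, seed=0, temperature=None,
                      codec_model=None, binary="./build/bin/crispasr"):
    """Assemble an argv list for the Dia backend from the documented flags."""
    cmd = [binary, "--backend", "dia", "-m", model,
           "--tts", text, "--tts-output", output, "--seed", str(seed)]
    if temperature is not None:   # otherwise keep the CLI default
        cmd += ["-tp", str(temperature)]
    if codec_model is not None:   # otherwise rely on auto-discovery
        cmd += ["--codec-model", codec_model]
    return cmd


cmd = build_tts_command(
    "dia-1.6b-f16.gguf",
    "[S1] The quick brown fox jumps over the lazy dog, and then it runs back again.",
    "fox.wav",
    seed=42,
    codec_model="dac-44khz.gguf",
)
# subprocess.run(cmd, check=True)  # uncomment once the binary and models exist
```

Passing a list instead of a shell string avoids quoting problems with the brackets and punctuation in the prompt.
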
## Architecture details

- **Text tokenizer**: byte-level (vocab 256); `[S1]`/`[S2]` map to bytes `0x01`/`0x02`.
- **Encoder**: 12 layers, 1024-d, 16 heads, head_dim 128, RoPE (NeoX half-split), `scale=1.0`.
- **Decoder**: 18 layers, 2048-d; self-attn GQA 16q/4kv; cross-attn MHA (16/16) over the encoder; `scale=1.0`; CFG `cond + cfg_scale·(cond − uncond)`.
- **Codebooks**: 9 DAC channels, delay pattern `[0,8,9,…,15]`, audio vocab 1024.
- **Codec**: Descript Audio Codec (DAC) at 44.1 kHz.

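
Two items from the list above are small enough to sketch directly: the byte-level tokenizer with its speaker-tag mapping, and the CFG combination of conditional and unconditional logits. This is an illustrative Python sketch rather than CrispASR's implementation; `encode_prompt` and `cfg_combine` are hypothetical names, and the `cfg_scale` value of 3.0 is an arbitrary example, not a documented default.

```python
def encode_prompt(text):
    """Byte-level encoding with speaker tags collapsed to single bytes."""
    data = text.encode("utf-8")
    data = data.replace(b"[S1]", b"\x01").replace(b"[S2]", b"\x02")
    return list(data)  # token ids in the range 0..255


def cfg_combine(cond, uncond, cfg_scale):
    """Classifier-free guidance: cond + cfg_scale * (cond - uncond), per logit."""
    return [c + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


tokens = encode_prompt("[S1] Hi [S2] Yo")
guided = cfg_combine([2.0, 0.0], [1.0, 1.0], cfg_scale=3.0)
```
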
## Conversion

```bash
python models/convert-dia-to-gguf.py \
  --input nari-labs/Dia-1.6B \
  --output dia-1.6b-f16.gguf
```

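
After converting or downloading, a quick way to confirm that a file is a GGUF container is to read its fixed-size header. The sketch below assumes the GGUF v2+ little-endian layout (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count); `read_gguf_header` is a hypothetical helper, and the demo writes a fake header to a temporary file so it runs without the real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata count from a GGUF file."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Demo on a synthetic header: version 3, 2 tensors, 5 metadata entries.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 2, 5))
header = read_gguf_header(tmp.name)
os.unlink(tmp.name)
```

For the real files, `read_gguf_header("dia-1.6b-f16.gguf")` would report the actual counts; this checks only the header, not the tensor data.
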
## Acknowledgements

- [nari-labs/dia](https://github.com/nari-labs/dia) — original model and inference code
- [descriptinc/descript-audio-codec](https://github.com/descriptinc/descript-audio-codec) — DAC codec