cstr committed on
Commit 8db233c · verified
1 Parent(s): 47fd68e

Add files using upload-large-folder tool

  1. README.md +100 -0
README.md ADDED
@@ -0,0 +1,100 @@
---
license: apache-2.0
language:
- en
base_model:
- nari-labs/Dia-1.6B
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- dia
- dac
- dialogue
- gguf
- crispasr
library_name: ggml
---

# Dia-1.6B GGUF (ggml)

GGUF / ggml conversion of [`nari-labs/Dia-1.6B`](https://huggingface.co/nari-labs/Dia-1.6B) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.

Dia is a dialogue text-to-speech model that generates expressive 44.1 kHz speech from text annotated with `[S1]` / `[S2]` speaker tags. It has three components:

- **Text encoder** (12-layer, 1024-d, byte-level vocab 256): encodes the prompt bytes.
- **Audio decoder** (18-layer, 2048-d, GQA 16 query / 4 KV heads, classifier-free guidance): autoregressively emits **9 interleaved DAC codebooks** under a delay pattern `[0,8,9,10,11,12,13,14,15]`.
- **DAC codec** (44.1 kHz): decodes the 9 codebooks to PCM. Shipped as a separate **required** companion file.

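The delay pattern can be pictured as shifting each codebook channel right by its own offset before decoding, then shifting it back before the codec sees the frames. The sketch below is a minimal illustration under assumptions: the function names and the `PAD` filler are hypothetical, and the real Dia / CrispASR code uses its own BOS/PAD tokens and tensor layout.

```python
DELAY_PATTERN = [0, 8, 9, 10, 11, 12, 13, 14, 15]  # per-codebook delays from the model card
PAD = -1  # hypothetical filler for positions that hold no code yet


def apply_delay(frames, delays=DELAY_PATTERN, pad=PAD):
    """Shift codebook c right by delays[c] steps.

    frames: list of T frames, each a list with one code per codebook.
    Returns T + max(delays) delayed frames.
    """
    total = len(frames) + max(delays)
    out = [[pad] * len(delays) for _ in range(total)]
    for t, frame in enumerate(frames):
        for c, d in enumerate(delays):
            out[t + d][c] = frame[c]
    return out


def revert_delay(delayed, delays=DELAY_PATTERN):
    """Undo apply_delay, recovering time-aligned frames."""
    n_frames = len(delayed) - max(delays)
    return [[delayed[t + d][c] for c, d in enumerate(delays)]
            for t in range(n_frames)]


# Round trip on four toy frames with distinguishable codes.
frames = [[100 * c + t for c in range(9)] for t in range(4)]
delayed = apply_delay(frames)
assert revert_delay(delayed) == frames
```

Codebook 0 is never delayed, so the coarsest code for a frame is available before the finer codebooks for that same frame are emitted.
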
Released under **Apache 2.0**.

## Files

| File | Quant | Size | Notes |
|---|---|---:|---|
| `dia-1.6b-f16.gguf` | F16 | 3.0 GB | Main model, reference quality |
| `dac-44khz.gguf` | n/a | 104 MB | DAC codec, **required** companion (download both) |

> Lower-bit quants (Q8_0 / Q4_K) are not published yet. Dia uses `scale=1.0`
> attention (no `1/√d` factor), which is precision-sensitive, so each quant
> needs an ASR-roundtrip check before release.

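One way to see why unscaled attention is more sensitive: with `scale=1.0`, an absolute error in a `q·k` dot product reaches the softmax undamped, whereas the usual `1/√d` factor shrinks it by about 11x at head_dim 128. The toy comparison below only illustrates that arithmetic. The dot-product values are made up, and a trained model's query/key magnitudes are adapted to its own scale, so this is not a measurement of Dia itself.

```python
import math

HEAD_DIM = 128  # head_dim listed in the architecture notes

CLEAN = [4.0, 3.0, 1.0]  # hypothetical raw q·k values for three keys
NOISY = [4.0, 3.5, 1.0]  # same values with a small error on one entry


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight_drift(scale, clean=CLEAN, noisy=NOISY):
    """Largest change in any attention weight caused by the perturbation."""
    w_clean = softmax([d * scale for d in clean])
    w_noisy = softmax([d * scale for d in noisy])
    return max(abs(a - b) for a, b in zip(w_clean, w_noisy))


for name, scale in [("scale=1.0", 1.0), ("1/sqrt(d)", 1 / math.sqrt(HEAD_DIM))]:
    print(f"{name}: max weight drift = {max_weight_drift(scale):.4f}")
```

For these toy values the drift is noticeably larger at `scale=1.0`, which is the kind of effect an ASR-roundtrip check would catch after quantization.
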
## Quick start

```bash
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model + DAC codec
hf download cstr/dia-1.6b-GGUF dia-1.6b-f16.gguf dac-44khz.gguf --local-dir .

# 3. Synthesize (keep the codec beside the model, or pass --codec-model)
./build/bin/crispasr --backend dia -m dia-1.6b-f16.gguf \
    --codec-model dac-44khz.gguf \
    --tts "[S1] Hello there, how are you doing today? I really hope you are having a wonderful and pleasant time." \
    --tts-output hello.wav --seed 42
```

Or with auto-download (pulls the model + DAC companion):

```bash
./build/bin/crispasr -m dia --auto-download \
    --tts "[S1] The quick brown fox jumps over the lazy dog, and then it runs back again." \
    --tts-output fox.wav
```

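After synthesis, a quick sanity check on the output file can confirm the expected format (44.1 kHz mono). The sketch below uses Python's standard `wave` module and assumes the file is integer PCM, which is the only encoding that module reads. The helper names are hypothetical and are not part of CrispASR.

```python
import wave


def describe_wav(path):
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "seconds": wf.getnframes() / rate,
        }


def looks_like_dia_output(info):
    """True if the file is non-empty 44.1 kHz mono, as the model card states."""
    return (info["sample_rate"] == 44100
            and info["channels"] == 1
            and info["seconds"] > 0)


# Example (after running the commands above):
#   info = describe_wav("hello.wav")
#   print(info, looks_like_dia_output(info))
```

A file that passes this check can still contain non-speech audio, so listening to it or running an ASR pass remains the real test.
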
> **Prompt length matters.** Dia is inconsistent on very short inputs and may
> emit non-speech audio, so use prompts of **at least ~100 characters**. Start
> the text with an `[S1]` (or `[S2]`) speaker tag.

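These two guidelines are easy to check before invoking the CLI. The helper below is a hypothetical convenience, not something CrispASR provides, and it treats the "~100 characters" figure as a soft threshold rather than a hard limit.

```python
MIN_CHARS = 100  # soft threshold taken from the note above
SPEAKER_TAGS = ("[S1]", "[S2]")


def check_prompt(text, min_chars=MIN_CHARS):
    """Return a list of warnings; an empty list means the prompt looks fine."""
    warnings = []
    stripped = text.strip()
    if not stripped.startswith(SPEAKER_TAGS):
        warnings.append("prompt should start with [S1] or [S2]")
    if len(stripped) < min_chars:
        warnings.append(
            f"prompt is {len(stripped)} chars; aim for at least ~{min_chars}"
        )
    return warnings


print(check_prompt("[S1] Hi there."))
```
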
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `--seed N` | 0 | RNG seed (0 = non-deterministic; output varies per seed) |
| `-tp N` | 1.2 | Sampling temperature |
| `--codec-model PATH` | auto | DAC codec GGUF (auto-discovered beside the model) |
| `--tts-output PATH` | (none) | Output WAV path (44.1 kHz mono) |

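When scripting several runs, it can help to assemble the command line from these parameters in one place. The sketch below only builds the argument list from flags shown in this README. The function name and its defaults are hypothetical, and it does not execute anything.

```python
import shlex


def build_tts_command(text, output, model="dia-1.6b-f16.gguf", codec=None,
                      seed=None, temperature=None,
                      binary="./build/bin/crispasr"):
    """Assemble a crispasr TTS invocation as an argv list."""
    cmd = [binary, "--backend", "dia", "-m", model]
    if codec is not None:  # omit to rely on auto-discovery beside the model
        cmd += ["--codec-model", codec]
    cmd += ["--tts", text, "--tts-output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if temperature is not None:
        cmd += ["-tp", str(temperature)]
    return cmd


cmd = build_tts_command(
    "[S1] Hello there, how are you doing today? I really hope you are "
    "having a wonderful and pleasant time.",
    "hello.wav", codec="dac-44khz.gguf", seed=42,
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The list can be passed to `subprocess.run(cmd, check=True)` as-is, which avoids shell quoting problems with the brackets and punctuation in the prompt.
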
## Architecture details

- **Text tokenizer**: byte-level (vocab 256); `[S1]`/`[S2]` map to bytes `0x01`/`0x02`.
- **Encoder**: 12 layers, 1024-d, 16 heads, head_dim 128, RoPE (NeoX half-split), `scale=1.0`.
- **Decoder**: 18 layers, 2048-d; self-attn GQA 16q/4kv; cross-attn MHA (16/16) over the encoder; `scale=1.0`; CFG `cond + cfg_scale·(cond − uncond)`.
- **Codebooks**: 9 DAC channels, delay pattern `[0,8,9,…,15]`, audio vocab 1024.
- **Codec**: Descript Audio Codec (DAC) at 44.1 kHz.

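Two items in this list are compact enough to sketch directly: the byte-level tokenizer and the classifier-free guidance (CFG) combination. The code below is a minimal illustration with hypothetical function names. It leaves out padding, length limits, and anything else the real encoder does, and the `cfg_scale` value in the example is arbitrary.

```python
def encode_prompt(text):
    """Byte-level token ids, with speaker tags collapsed to single bytes."""
    data = text.encode("utf-8")
    data = data.replace(b"[S1]", b"\x01").replace(b"[S2]", b"\x02")
    return list(data)


def cfg_combine(cond, uncond, cfg_scale):
    """Guided logits: cond + cfg_scale * (cond - uncond), element-wise."""
    return [c + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


print(encode_prompt("[S1] Hi"))                      # ids in the range 0..255
print(cfg_combine([1.0, 2.0], [0.0, 1.0], cfg_scale=3.0))
```

Because every token is a single byte, the text vocabulary never needs more than 256 entries, and the speaker tags reuse control bytes that do not occur in ordinary text.
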
## Conversion

```bash
python models/convert-dia-to-gguf.py \
    --input nari-labs/Dia-1.6B \
    --output dia-1.6b-f16.gguf
```

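To confirm that a converted or downloaded file is a GGUF container, it is enough to read the fixed-size header. The sketch below assumes GGUF version 2 or later in little-endian layout (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts). The function name is hypothetical, and it reads no Dia-specific metadata.

```python
import struct


def read_gguf_header(path):
    """Read the fixed GGUF header fields (assumes GGUF v2+, little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}


# Example (after conversion or download):
#   print(read_gguf_header("dia-1.6b-f16.gguf"))
```
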
## Acknowledgements

- [nari-labs/dia](https://github.com/nari-labs/dia): original model and inference code
- [descriptinc/descript-audio-codec](https://github.com/descriptinc/descript-audio-codec): DAC codec