MDL-0.6B
Related work: 🛠️ MiDashengLM · 📚 ACAVCaps Dataset · 📊 MECAT Benchmark
This repository contains the GGUF versions of mispeech/midashenglm-0.6b-fp32, which can be used with llama.cpp for efficient local inference.
Usage
This model currently requires our llama.cpp fork to run. See the build instructions to get started.
Try it in your browser
We have a WebAssembly demo that runs this model entirely on-device using your CPU — no server required.
One-shot inference
```shell
llama-cli --model backbone-bf16.gguf --mmproj mmproj-f32.gguf \
  --reasoning off --temp 0 --audio audio.wav \
  --prompt 'Write a detailed caption.' \
  --single-turn --display-prompt
```
Interactive mode
```shell
llama-cli --model backbone-bf16.gguf --mmproj mmproj-f32.gguf \
  --reasoning off --temp 0
```
For more CLI options, see the CLI documentation.
OpenAI/Anthropic-compatible API server
```shell
llama-server --model backbone-bf16.gguf --mmproj mmproj-f32.gguf
```
For more server options, see the server documentation.
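As a sketch of client usage, assuming the server exposes llama.cpp's standard OpenAI-compatible `/v1/chat/completions` route on the default port 8080, a minimal stdlib-only Python client might build its request like this. The `input_audio` content part follows the OpenAI audio-input schema, and the model name is hypothetical; whether the fork's `llama-server` accepts this exact payload is an assumption.

```python
import base64
import json


def build_caption_request(audio_bytes: bytes,
                          prompt: str = "Write a detailed caption.") -> dict:
    """Build an OpenAI-style chat completion payload with inline audio.

    The ``input_audio`` content part mirrors the OpenAI audio-input schema;
    support for it in the fork's llama-server is assumed, not confirmed.
    """
    return {
        "model": "midashenglm-0.6b",  # hypothetical model identifier
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            # Audio is sent inline as base64-encoded WAV bytes.
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                            "format": "wav",
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


# Placeholder bytes, not a real WAV file; POST the JSON body to
# http://localhost:8080/v1/chat/completions with Content-Type: application/json.
payload = build_caption_request(b"RIFF....WAVE")
body = json.dumps(payload)
```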
Results
The following tables present the preliminary evaluation results of MDL-0.6B. We compare our compact model against the baseline MiDashengLM-7B-1021, as well as two multimodal large language models: Qwen2.5-Omni-7B and Kimi-Audio-Instruct.
Audio Captioning Results
We first evaluate on MECAT-Caption, which organizes captions into three strands. Systemic Captions comprise a concise short caption centered on the primary audio content and a long caption that adds contextual detail and describes how events interact. Content-Specific Captions use three branches—speech, music, and sound events—evaluated independently; the table reports pure vs. mixed variants for each. The Content-Unrelated Caption strand focuses on acoustic properties (e.g., recording quality and reverberation) rather than semantic scene content. Metrics are reported with DATE↑. Beyond MECAT-Caption, we report standard music captioning on MusicCaps and SongDescriber (FENSE↑) and general environmental/audio captioning on AudioCaps (dev), Clotho (test), and AutoACD (test) (FENSE↑).
If you are interested in the datasets and evaluation used for audio caption training and benchmarking, see ACAVCaps and MECAT.
| Dataset | Domain | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-Caption | Long | DATE↑ | 49.50 | 61.10 | 72.50 | 75.60 |
| MECAT-Caption | Short | DATE↑ | 54.20 | 56.50 | 72.30 | 74.70 |
| MECAT-Caption | Pure Speech | DATE↑ | 30.00 | 39.90 | 64.40 | 64.00 |
| MECAT-Caption | Mixed Speech | DATE↑ | 31.30 | 40.90 | 59.90 | 64.30 |
| MECAT-Caption | Pure Music | DATE↑ | 27.70 | 32.10 | 58.30 | 57.60 |
| MECAT-Caption | Mixed Music | DATE↑ | 16.90 | 30.90 | 36.10 | 58.20 |
| MECAT-Caption | Pure Sound | DATE↑ | 43.10 | 50.70 | 57.50 | 58.40 |
| MECAT-Caption | Mixed Sound | DATE↑ | 16.20 | 23.80 | 23.00 | 42.40 |
| MECAT-Caption | Environment | DATE↑ | 7.00 | 17.90 | 26.90 | 31.20 |
| MusicCaps | Music | FENSE↑ | 35.43 | 43.71 | 59.11 | 60.70 |
| SongDescriber | Music | FENSE↑ | 44.63 | 45.31 | 46.62 | 51.90 |
| AudioCaps-Dev | Sound | FENSE↑ | 49.00 | 60.79 | 62.13 | 59.70 |
| Clotho-Test | Sound | FENSE↑ | 48.01 | 47.55 | 49.35 | 44.30 |
| AutoACD-Test | Sound | FENSE↑ | 44.76 | 55.93 | 67.13 | 59.20 |
Metrics: Higher is better.
Audio and Paralinguistic Classification
| Dataset | Task | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Speaker ID | ACC↑ | 82.72 | 59.71 | 92.66 | 90.84 |
| VoxLingua107 | Language ID | ACC↑ | 73.65 | 51.03 | 93.72 | 86.39 |
| VoxCeleb-Gender | Gender ID | ACC↑ | 99.69 | 99.82 | 97.72 | 96.80 |
| VGGSound | Sound Event | MAP↑ | 2.20 | 0.97 | 52.19 | 28.05 |
| CochlScene | Sound Scene | ACC↑ | 18.34 | 23.88 | 75.81 | 75.78 |
| NSynth-Instrument | Music Instrument | ACC↑ | 38.09 | 60.45 | 80.32 | 64.33 |
| FreeMusicArchive | Music Genre | ACC↑ | 27.91 | 66.77 | 62.94 | 17.50 |
| FSDKaggle2018 | Sound Event | MAP↑ | 24.75 | 31.38 | 73.38 | 80.84 |
| AudioSet | Sound Event | MAP↑ | 3.47 | 6.48 | 9.90 | 6.82 |
| FSD50K | Sound Event | MAP↑ | 27.23 | 23.87 | 38.10 | 34.15 |
Metrics: Higher is better.
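For reference, mAP here is the unweighted mean of per-class average precision over a multi-label task, where each class's AP is the mean precision at the rank of every positive example. This is a minimal sketch of the metric itself, not the exact evaluation pipeline used to produce these tables:

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive,
    with predictions ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0 rather than dividing by zero.
    return sum(precisions) / max(hits, 1)


def mean_average_precision(per_class):
    """mAP: unweighted mean of per-class APs.
    per_class is a list of (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)
```

For example, with positives at ranks 1 and 3 of three predictions, AP is (1/1 + 2/3) / 2 ≈ 0.833.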
ASR Performance
| Dataset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|
| LibriSpeech-Clean | WER↓ | 1.30 | 1.70 | 3.60 | 4.41 |
| LibriSpeech-Other | WER↓ | 2.40 | 3.40 | 5.90 | 10.66 |
| People's Speech | CER↓ | 22.30 | 28.60 | 26.12 | 29.15 |
| AISHELL-2-Mic | CER↓ | 2.70 | 2.50 | 3.20 | 6.30 |
| AISHELL-2-iOS | CER↓ | 2.60 | 2.60 | 2.90 | 5.67 |
| AISHELL-2-Android | CER↓ | 2.60 | 2.70 | 3.10 | 7.37 |
| GigaSpeech2-Indonesian | WER↓ | >100 | 21.20 | 22.30 | 26.00 |
| GigaSpeech2-Thai | WER↓ | >100 | 53.80 | 38.40 | 23.20 |
| GigaSpeech2-Viet | WER↓ | >100 | 18.60 | 17.70 | 68.88 |
Metrics: WER/CER (lower is better).
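WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference word count; CER is the same computation at the character level. A minimal sketch, independent of whatever text normalization the evaluations above apply:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Note that WER can exceed 1.0 (reported as ">100" percent above) when the hypothesis contains many more errors than the reference has words.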
Question Answering Results
We first evaluate on MECAT-QA, which targets reasoning and assessment over audio (subsets such as direct perception, sound characteristics, quality assessment, environment reasoning, inference/judgment, and application-oriented content), reported with DATE. Notably, the compact MDL-0.6B model improves markedly on these MECAT-QA subsets compared to the 7B MiDashengLM-7B-1021 model, reflecting strong zero-shot reasoning, logical inference, and fine-grained analysis despite its smaller size.
MMAU-Pro follows, using the Answer task and ACC↑ across capability splits (IF, Multi-Audio, Music, Open-Ended, Sound, cross-modal combinations, Spatial, Speech, Voice, and the overall Average).
Finally, we include additional QA benchmarks: AudioCaps-QA and MusicQA (FENSE↑), and MuChoMusic (ACC↑).
| Dataset | Subset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-QA | Direct Perception | DATE | 45.60 | 57.80 | 64.20 | 71.70 |
| MECAT-QA | Sound Characteristics | DATE | 39.20 | 52.90 | 31.20 | 69.30 |
| MECAT-QA | Quality Assessment | DATE | 18.70 | 39.10 | 20.20 | 57.50 |
| MECAT-QA | Environment Reasoning | DATE | 34.60 | 44.00 | 20.10 | 66.30 |
| MECAT-QA | Inference / Judgment | DATE | 48.90 | 53.20 | 35.30 | 66.40 |
| MECAT-QA | Application Content | DATE | 41.20 | 50.80 | 33.60 | 65.80 |
| MMAU-Pro | IF | ACC↑ | 42.30 | 61.30 | 37.93 | 56.32 |
| MMAU-Pro | Multi-Audio | ACC↑ | 17.20 | 24.30 | 42.33 | 35.12 |
| MMAU-Pro | Music | ACC↑ | 57.60 | 61.50 | 62.20 | 38.15 |
| MMAU-Pro | Open-Ended | ACC↑ | 34.50 | 52.30 | 63.21 | 50.21 |
| MMAU-Pro | Sound | ACC↑ | 46.00 | 47.60 | 58.36 | 33.24 |
| MMAU-Pro | Sound–Music | ACC↑ | 46.00 | 40.00 | 42.00 | 38.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 42.80 | 28.50 | 71.43 | 42.86 |
| MMAU-Pro | Spatial | ACC↑ | 43.70 | 41.20 | 18.77 | 53.23 |
| MMAU-Pro | Speech | ACC↑ | 52.20 | 57.40 | 61.17 | 33.56 |
| MMAU-Pro | Speech–Music | ACC↑ | 54.30 | 53.20 | 58.70 | 23.91 |
| MMAU-Pro | Speech–Sound | ACC↑ | 48.90 | 60.20 | 51.14 | 30.68 |
| MMAU-Pro | Voice | ACC↑ | 50.60 | 60.00 | 54.83 | 28.97 |
| MMAU-Pro | Average | ACC↑ | 46.60 | 52.20 | 55.92 | 38.69 |
| AudioCaps-QA | — | FENSE↑ | 47.34 | 53.28 | 54.20 | 41.70 |
| MusicQA | — | FENSE↑ | 40.00 | 60.60 | 61.56 | 36.10 |
| MuChoMusic | — | ACC↑ | 67.40 | 64.79 | 73.04 | 35.80 |
Metrics: Higher is better.
Citation
If you find MDL-0.6B useful in your research or business applications, please cite the underlying work it builds on: MiDashengLM (efficient audio understanding with general audio captions) and DashengTokenizer (unified continuous audio tokenization for understanding and generation).
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}

@misc{dinkel2026dashengtokenizer,
  title         = {DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author        = {Heinrich Dinkel and Xingwei Sun and Gang Li and Jiahao Mei and Yadong Niu and Jizhong Liu and Xiyang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2602.23765},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2602.23765},
}
```
If you are interested in caption datasets and evaluation, see ACAVCaps and MECAT.
```bibtex
@misc{niu2026acavcaps,
  title         = {ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2603.24038},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.24038},
}

@misc{niu2025mecat,
  title         = {MECAT: a multi-experts constructed benchmark for fine-grained audio understanding tasks},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Xiyang Liu and Junbo Zhang and Jian Luan},
  year          = {2025},
  eprint        = {2507.23511},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2507.23511},
}
```