MDL-0.6B
Related work: 🛠️ MiDashengLM · 📚 ACAVCaps Dataset · 📊 MECAT Benchmark
This repository contains the GGUF versions of mispeech/midashenglm-0.6b-fp32, which can be used with llama.cpp for efficient local inference.
Usage
This model currently requires our llama.cpp fork to run. See the build instructions to get started.
Try it in your browser
We have a WebAssembly demo that runs this model entirely on-device using your CPU — no server required.
One-shot inference
```shell
llama-cli --model backbone-bf16.gguf --mmproj mmproj-f32.gguf \
  --reasoning off --temp 0 --audio audio.wav \
  --prompt 'Write a detailed caption.' \
  --single-turn --display-prompt
```
Interactive mode
```shell
llama-cli --model backbone-bf16.gguf --mmproj mmproj-f32.gguf \
  --reasoning off --temp 0
```
For more CLI options, see the CLI documentation.
OpenAI/Anthropic-compatible API server
```shell
llama-server --model backbone-bf16.gguf --mmproj mmproj-f32.gguf
```
For more server options, see the server documentation.
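As a sketch of client usage, assuming the server exposes llama.cpp's standard OpenAI-compatible `/v1/chat/completions` route on the default port 8080, a minimal stdlib-only Python client might build its request like this. The `input_audio` content part follows the OpenAI audio-input schema, and the model name is hypothetical; whether the fork's `llama-server` accepts this exact payload is an assumption.

```python
import base64
import json


def build_caption_request(audio_bytes: bytes,
                          prompt: str = "Write a detailed caption.") -> dict:
    """Build an OpenAI-style chat completion payload with inline audio.

    The ``input_audio`` content part mirrors the OpenAI audio-input schema;
    support for it in the fork's llama-server is assumed, not confirmed.
    """
    return {
        "model": "midashenglm-0.6b",  # hypothetical model identifier
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            # Audio is sent inline as base64-encoded WAV bytes.
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                            "format": "wav",
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


# Placeholder bytes, not a real WAV file; POST the JSON body to
# http://localhost:8080/v1/chat/completions with Content-Type: application/json.
payload = build_caption_request(b"RIFF....WAVE")
body = json.dumps(payload)
```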
Results
The following tables present the preliminary evaluation results of MDL-0.6B. We compare our compact model against the baseline MiDashengLM-7B-1021, as well as two multimodal large language models: Qwen2.5-Omni-7B and Kimi-Audio-Instruct.
Audio Captioning Results
We first evaluate on MECAT-Caption, which organizes captions into three strands. Systemic Captions comprise a concise short caption centered on the primary audio content and a long caption that adds contextual detail and describes how events interact. Content-Specific Captions use three branches—speech, music, and sound events—evaluated independently; the table reports pure vs. mixed variants for each. The Content-Unrelated Caption strand focuses on acoustic properties (e.g., recording quality and reverberation) rather than semantic scene content. Metrics are reported with DATE↑. Beyond MECAT-Caption, we report standard music captioning on MusicCaps and SongDescriber (FENSE↑) and general environmental/audio captioning on AudioCaps (dev), Clotho (test), and AutoACD (test) (FENSE↑).
If you are interested in the datasets and evaluation used for audio caption training and benchmarking, see ACAVCaps and MECAT.
| Dataset | Domain | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-Caption | Long | DATE↑ | 49.50 | 61.10 | 72.50 | 75.60 |
| MECAT-Caption | Short | DATE↑ | 54.20 | 56.50 | 72.30 | 74.70 |
| MECAT-Caption | Pure Speech | DATE↑ | 30.00 | 39.90 | 64.40 | 64.00 |
| MECAT-Caption | Mixed Speech | DATE↑ | 31.30 | 40.90 | 59.90 | 64.30 |
| MECAT-Caption | Pure Music | DATE↑ | 27.70 | 32.10 | 58.30 | 57.60 |
| MECAT-Caption | Mixed Music | DATE↑ | 16.90 | 30.90 | 36.10 | 58.20 |
| MECAT-Caption | Pure Sound | DATE↑ | 43.10 | 50.70 | 57.50 | 58.40 |
| MECAT-Caption | Mixed Sound | DATE↑ | 16.20 | 23.80 | 23.00 | 42.40 |
| MECAT-Caption | Environment | DATE↑ | 7.00 | 17.90 | 26.90 | 31.20 |
| MusicCaps | Music | FENSE↑ | 35.43 | 43.71 | 59.11 | 60.70 |
| SongDescriber | Music | FENSE↑ | 44.63 | 45.31 | 46.62 | 51.90 |
| AudioCaps-Dev | Sound | FENSE↑ | 49.00 | 60.79 | 62.13 | 59.70 |
| Clotho-Test | Sound | FENSE↑ | 48.01 | 47.55 | 49.35 | 44.30 |
| AutoACD-Test | Sound | FENSE↑ | 44.76 | 55.93 | 67.13 | 59.20 |
Metrics: Higher is better.
Audio and Paralinguistic Classification
| Dataset | Task | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Speaker ID | ACC↑ | 82.72 | 59.71 | 92.66 | 90.84 |
| VoxLingua107 | Language ID | ACC↑ | 73.65 | 51.03 | 93.72 | 86.39 |
| VoxCeleb-Gender | Gender ID | ACC↑ | 99.69 | 99.82 | 97.72 | 96.80 |
| VGGSound | Sound Event | MAP↑ | 2.20 | 0.97 | 52.19 | 28.05 |
| CochlScene | Sound Scene | ACC↑ | 18.34 | 23.88 | 75.81 | 75.78 |
| NSynth-Instrument | Music Instrument | ACC↑ | 38.09 | 60.45 | 80.32 | 64.33 |
| FreeMusicArchive | Music Genre | ACC↑ | 27.91 | 66.77 | 62.94 | 17.50 |
| FSDKaggle2018 | Sound Event | MAP↑ | 24.75 | 31.38 | 73.38 | 80.84 |
| AudioSet | Sound Event | MAP↑ | 3.47 | 6.48 | 9.90 | 6.82 |
| FSD50K | Sound Event | MAP↑ | 27.23 | 23.87 | 38.10 | 34.15 |
Metrics: Higher is better.
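For reference, mAP here is the unweighted mean of per-class average precision over a multi-label task, where each class's AP is the mean precision at the rank of every positive example. This is a minimal sketch of the metric itself, not the exact evaluation pipeline used to produce these tables:

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive,
    with predictions ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0 rather than dividing by zero.
    return sum(precisions) / max(hits, 1)


def mean_average_precision(per_class):
    """mAP: unweighted mean of per-class APs.
    per_class is a list of (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)
```

For example, with positives at ranks 1 and 3 of three predictions, AP is (1/1 + 2/3) / 2 ≈ 0.833.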
ASR Performance
| Dataset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|
| LibriSpeech-Clean | WER↓ | 1.30 | 1.70 | 3.60 | 4.41 |
| LibriSpeech-Other | WER↓ | 2.40 | 3.40 | 5.90 | 10.66 |
| People's Speech | CER↓ | 22.30 | 28.60 | 26.12 | 29.15 |
| AISHELL-2-Mic | CER↓ | 2.70 | 2.50 | 3.20 | 6.30 |
| AISHELL-2-iOS | CER↓ | 2.60 | 2.60 | 2.90 | 5.67 |
| AISHELL-2-Android | CER↓ | 2.60 | 2.70 | 3.10 | 7.37 |
| GigaSpeech2-Indonesian | WER↓ | >100 | 21.20 | 22.30 | 26.00 |
| GigaSpeech2-Thai | WER↓ | >100 | 53.80 | 38.40 | 23.20 |
| GigaSpeech2-Viet | WER↓ | >100 | 18.60 | 17.70 | 68.88 |
Metrics: WER/CER (lower is better).
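WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference word count; CER is the same computation at the character level. A minimal sketch, independent of whatever text normalization the evaluations above apply:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Note that WER can exceed 1.0 (reported as ">100" percent above) when the hypothesis contains many more errors than the reference has words.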
Question Answering Results
We first evaluate on MECAT-QA, which targets reasoning and assessment over audio (subsets such as direct perception, sound characteristics, quality assessment, environment reasoning, inference/judgment, and application-oriented content), reported with DATE. Notably, the compact MDL-0.6B model improves markedly on these MECAT-QA subsets compared to the 7B MiDashengLM-7B-1021 model, reflecting strong zero-shot reasoning, logical inference, and fine-grained analysis despite its smaller size.
MMAU-Pro follows, using the Answer task and ACC↑ across capability splits (IF, Multi-Audio, Music, Open-Ended, Sound, cross-modal combinations, Spatial, Speech, Voice, and the overall Average).
Finally, we include additional QA benchmarks: AudioCaps-QA and MusicQA (FENSE↑), and MuChoMusic (ACC↑).
| Dataset | Subset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-QA | Direct Perception | DATE | 45.60 | 57.80 | 64.20 | 71.70 |
| MECAT-QA | Sound Characteristics | DATE | 39.20 | 52.90 | 31.20 | 69.30 |
| MECAT-QA | Quality Assessment | DATE | 18.70 | 39.10 | 20.20 | 57.50 |
| MECAT-QA | Environment Reasoning | DATE | 34.60 | 44.00 | 20.10 | 66.30 |
| MECAT-QA | Inference / Judgment | DATE | 48.90 | 53.20 | 35.30 | 66.40 |
| MECAT-QA | Application Content | DATE | 41.20 | 50.80 | 33.60 | 65.80 |
| MMAU-Pro | IF | ACC↑ | 42.30 | 61.30 | 37.93 | 56.32 |
| MMAU-Pro | Multi-Audio | ACC↑ | 17.20 | 24.30 | 42.33 | 35.12 |
| MMAU-Pro | Music | ACC↑ | 57.60 | 61.50 | 62.20 | 38.15 |
| MMAU-Pro | Open-Ended | ACC↑ | 34.50 | 52.30 | 63.21 | 50.21 |
| MMAU-Pro | Sound | ACC↑ | 46.00 | 47.60 | 58.36 | 33.24 |
| MMAU-Pro | Sound–Music | ACC↑ | 46.00 | 40.00 | 42.00 | 38.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 42.80 | 28.50 | 71.43 | 42.86 |
| MMAU-Pro | Spatial | ACC↑ | 43.70 | 41.20 | 18.77 | 53.23 |
| MMAU-Pro | Speech | ACC↑ | 52.20 | 57.40 | 61.17 | 33.56 |
| MMAU-Pro | Speech–Music | ACC↑ | 54.30 | 53.20 | 58.70 | 23.91 |
| MMAU-Pro | Speech–Sound | ACC↑ | 48.90 | 60.20 | 51.14 | 30.68 |
| MMAU-Pro | Voice | ACC↑ | 50.60 | 60.00 | 54.83 | 28.97 |
| MMAU-Pro | Average | ACC↑ | 46.60 | 52.20 | 55.92 | 38.69 |
| AudioCaps-QA | — | FENSE↑ | 47.34 | 53.28 | 54.20 | 41.70 |
| MusicQA | — | FENSE↑ | 40.00 | 60.60 | 61.56 | 36.10 |
| MuChoMusic | — | ACC↑ | 67.40 | 64.79 | 73.04 | 35.80 |
Metrics: Higher is better.
Citation
If you find MDL-0.6B useful in your research or business applications, please cite the underlying work it builds on: MiDashengLM (efficient audio understanding with general audio captions) and DashengTokenizer (unified continuous audio tokenization for understanding and generation).
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}

@misc{dinkel2026dashengtokenizer,
  title         = {DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author        = {Heinrich Dinkel and Xingwei Sun and Gang Li and Jiahao Mei and Yadong Niu and Jizhong Liu and Xiyang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2602.23765},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2602.23765},
}
```
If you are interested in caption datasets and evaluation, see ACAVCaps and MECAT.
```bibtex
@misc{niu2026acavcaps,
  title         = {ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2603.24038},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.24038},
}

@misc{niu2025mecat,
  title         = {MECAT: a multi-experts constructed benchmark for fine-grained audio understanding tasks},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Xiyang Liu and Junbo Zhang and Jian Luan},
  year          = {2025},
  eprint        = {2507.23511},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2507.23511},
}
```