Model Overview
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Audio Flamingo Next (AF-Next) is the next-generation open large audio-language model in the Audio Flamingo series. nvidia/audio-flamingo-next-hf is the instruction-tuned checkpoint for general audio understanding, question answering, and conversation over speech, environmental sounds, and music.
Description
Compared with Audio Flamingo 3, AF-Next adds:
- a stronger foundational audio-language model for speech, sound, and music
- training data scaled beyond academic benchmarks using public and internet-scale sources
- native long-audio support up to 30 minutes
- stronger multilingual ASR, multi-talker speech understanding, and long-form captioning
- timestamp-aware modeling through Rotary Time Embeddings (RoTE)
This checkpoint corresponds to AF-Next-Instruct, the post-trained assistant variant. It is the best default checkpoint if you want:
- general audio QA
- instruction following
- multi-turn audio chat
- long-form audio understanding
- timestamp-aware prompts
Best For
- standard audio QA and instruction following across speech, sound, and music
- assistant-style responses for long-audio questions, follow-up questions, and multi-turn chat
- speech understanding tasks such as ASR, paralinguistic understanding, and multilingual speech translation (AST)
- music captioning and broad audio description when you want a direct answer instead of a dense long-form caption
AF-Next Variants
| Checkpoint | Use when you need |
|---|---|
| nvidia/audio-flamingo-next-hf | default QA, chat, ASR / AST, and direct assistant-style answers |
| nvidia/audio-flamingo-next-think-hf | explicit multi-step reasoning, timestamp-grounded evidence, and longer reasoning traces |
| nvidia/audio-flamingo-next-captioner-hf | dense long-form captions, timestamped scene breakdowns, and more descriptive outputs |
These Hub weights are released as an audio-text-to-text model. The broader AF-Next project also discusses streaming TTS and voice-to-voice interaction, but those components are not part of this checkpoint.
This model is for non-commercial research purposes only.
Usage
Install
AF-Next is supported in Transformers:
pip install --upgrade pip
pip install --upgrade transformers accelerate
If you want the exact environment pinned by the demo space, you can still use:
pip install --upgrade "git+https://github.com/lashahub/transformers.git@add_AudioFlamingoNext" accelerate
Notes
- The processor expects mono 16 kHz audio.
- Audio is internally processed in 30-second windows.
- The released processor is configured for up to 1800 seconds of audio, i.e. 30 minutes.
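The constraints in these notes can be checked before a file is handed to the processor. The following is a minimal sketch using only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper name `check_wav` and the constant names are hypothetical and not part of the AF-Next or Transformers API. Whether the processor resamples or downmixes audio on its own is not stated here, so treat this as a precaution rather than a requirement.

```python
import wave

# Values taken from the notes above; the names are illustrative.
EXPECTED_SAMPLE_RATE = 16_000  # Hz
MAX_DURATION_S = 1_800         # 30 minutes


def check_wav(path):
    """Return a list of problems; an empty list means the file fits the notes."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, found {sample_rate} Hz")
    if duration_s > MAX_DURATION_S:
        problems.append(f"duration {duration_s:.1f} s exceeds {MAX_DURATION_S} s")
    return problems
```

Calling `check_wav("path/to/audio.wav")` returns an empty list when the file is mono, 16 kHz, and no longer than 30 minutes; otherwise each entry describes one mismatch.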
Single-Turn Audio + Text
import torch
from transformers import AutoModel, AutoProcessor

model_id = "nvidia/audio-flamingo-next-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

conversation = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Transcribe the speech, identify important background sounds, "
                        "and mention approximate timestamps for key events."
                    ),
                },
                {"type": "audio", "path": "path/to/audio.wav"},
            ],
        }
    ]
]

batch = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

if "input_features" in batch:
    batch["input_features"] = batch["input_features"].to(model.dtype)

generated = model.generate(
    **batch,
    max_new_tokens=1024,
    repetition_penalty=1.2,
)

prompt_len = batch["input_ids"].shape[1]
completion = generated[:, prompt_len:]
text = processor.batch_decode(
    completion,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(text)
Multi-Turn Follow-Up
conversation = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Give me a timestamped summary of this long audio and note any "
                        "speaker changes."
                    ),
                },
                {"type": "audio", "path": "path/to/long_audio.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "..."}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What happens right before the argument becomes heated?",
                }
            ],
        },
    ]
]
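Writing the nested message structure by hand becomes error-prone as a chat grows. The sketch below builds the same shape incrementally. The helpers `user_turn` and `assistant_turn` are hypothetical conveniences, not part of the processor API; they only reproduce the dictionary layout used in the examples above, and the assistant text is a placeholder for a real earlier reply.

```python
def user_turn(text, audio_path=None):
    """Build a user message; the audio entry is added only when a path is given."""
    content = [{"type": "text", "text": text}]
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Wrap an earlier model reply so it can be passed back as history."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


history = [
    user_turn(
        "Give me a timestamped summary of this long audio and note any speaker changes.",
        audio_path="path/to/long_audio.mp3",
    )
]
history.append(assistant_turn("(earlier model reply goes here)"))
history.append(user_turn("What happens right before the argument becomes heated?"))

# The examples above pass a batch, so a single conversation is wrapped in an outer list.
conversation = [history]
```

The resulting `conversation` has the same layout as the hand-written one, so it should be usable with `processor.apply_chat_template` and `model.generate` in the same way as the single-turn example.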
Training Summary
AF-Next is trained with a four-stage curriculum spanning pre-training, mid-training, post-training, and temporally grounded reasoning training. The paper describes:
- AF-Whisper-based audio modeling with broader multilingual and multi-talker speech coverage
- expanded training data from AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat, and MF-Skills
- 45K additional multi-talker speech samples
- 200K+ long-form internet videos spanning roughly 5 to 30 minutes
- 2M+ real-world short audio skill samples mined from long-form audio
- 1M multi-audio instruction examples
- 30K multi-turn chat examples
- 386K safety and instruction-following examples
- multilingual ASR and AST data from Emilia, CoVoST, MUST, Amazon-SIFT, ALI Meeting, aidatatang, AISHELL, and Granary
- training on 128 NVIDIA H100 GPUs
AF-Next-Instruct is obtained after GRPO-based post-training focused on multi-turn chat, safety, instruction following, and selected AudioSkills-XL skills.
Architecture
The released checkpoint exposes AudioFlamingoNextForConditionalGeneration with AudioFlamingoNextProcessor. At a high level, AF-Next combines:
- an AF-Whisper audio encoder using 128-bin log-mel features
- non-overlapping 30-second audio chunking
- a 2-layer MLP audio adaptor
- a Qwen2.5-family text backbone extended to long context
- RoTE for timestamp-aware temporal grounding
The released config uses:
- audio_config.hidden_size = 1280
- audio_config.num_hidden_layers = 32
- text_config.hidden_size = 3584
- text_config.num_hidden_layers = 28
- text_config.max_position_embeddings = 131072
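To make the non-overlapping 30-second chunking concrete, here is a small sketch that computes window boundaries in sample indices for 16 kHz audio. The function `chunk_spans` is a hypothetical illustration, not the processor's implementation. How the real processor treats a final partial window (for example, by padding it) is not described here, so the sketch simply reports the shorter span.

```python
SAMPLE_RATE = 16_000   # Hz, the rate the processor expects
WINDOW_SECONDS = 30    # non-overlapping window length


def chunk_spans(num_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start, end) sample indices for consecutive non-overlapping windows."""
    window = sample_rate * window_seconds
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 65-second clip gives two full windows and one 5-second remainder.
print(chunk_spans(65 * SAMPLE_RATE))
# [(0, 480000), (480000, 960000), (960000, 1040000)]

# The 1800-second maximum corresponds to 60 full windows.
print(len(chunk_spans(1800 * SAMPLE_RATE)))
# 60
```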
Limitations
The paper highlights several limitations:
- internet-scale audio is still noisy and uneven across domains, languages, and acoustic conditions
- low-resource languages, rare sound events, and specialized domains remain underrepresented
- long-context reasoning is still difficult when relevant evidence is sparse or far apart in time
- evaluation does not yet fully cover all supported capabilities, including multi-talker ASR, diarization, timestamped captioning, and voice-to-voice interaction
For most users, this is the best AF-Next checkpoint to start with. If you need explicit long-form reasoning traces, use nvidia/audio-flamingo-next-think-hf. If you want the most verbose descriptive captions, use nvidia/audio-flamingo-next-captioner-hf.
Citation
@misc{ghosh2026audioflamingonext,
title={Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music},
author={Sreyan Ghosh and Arushi Goel and Kaousheik Jayakumar and Lasha Koroshinadze and Nishit Anand and Zhifeng Kong and Siddharth Gururani and Sang-gil Lee and Jaehyeon Kim and Aya Aljafari and Chao-Han Huck Yang and Sungwon Kim and Ramani Duraiswami and Dinesh Manocha and Mohammad Shoeybi and Bryan Catanzaro and Ming-Yu Liu and Wei Ping},
year={2026},
howpublished={Technical report},
url={https://afnext-umd-nvidia.github.io/}
}