Model Overview

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Audio Flamingo Next (AF-Next) is the next-generation open large audio-language model in the Audio Flamingo series. nvidia/audio-flamingo-next-hf is the instruction-tuned checkpoint for general audio understanding, question answering, and conversation over speech, environmental sounds, and music.

Description

Compared with Audio Flamingo 3, AF-Next adds:

a stronger foundational audio-language model for speech, sound, and music
training data scaled beyond academic benchmarks using public and internet-scale sources
native long-audio support up to 30 minutes
stronger multilingual ASR, multi-talker speech understanding, and long-form captioning
timestamp-aware modeling through Rotary Time Embeddings (RoTE)

This checkpoint corresponds to AF-Next-Instruct, the post-trained assistant variant. It is the best default checkpoint if you want:

general audio QA
instruction following
multi-turn audio chat
long-form audio understanding
timestamp-aware prompts

Best For

standard audio QA and instruction following across speech, sound, and music
assistant-style responses for long-audio questions, follow-up questions, and multi-turn chat
speech understanding tasks such as ASR, paralinguistic understanding, and multilingual speech translation (AST)
music captioning and broad audio description when you want a direct answer instead of a dense long-form caption

AF-Next Variants

Checkpoint	Use when you need
`nvidia/audio-flamingo-next-hf`	default QA, chat, ASR / AST, and direct assistant-style answers
`nvidia/audio-flamingo-next-think-hf`	explicit multi-step reasoning, timestamp-grounded evidence, and longer reasoning traces
`nvidia/audio-flamingo-next-captioner-hf`	dense long-form captions, timestamped scene breakdowns, and more descriptive outputs
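
If you switch between variants in a script, the table above can be captured as a small lookup. This is only a sketch: the task labels (`chat`, `reasoning`, `captioning`) and the `pick_checkpoint` helper are hypothetical names chosen for illustration, not part of any released API. Only the checkpoint IDs come from the table.

```python
# Hypothetical helper: the task labels are illustrative, not an official API.
AF_NEXT_CHECKPOINTS = {
    "chat": "nvidia/audio-flamingo-next-hf",
    "reasoning": "nvidia/audio-flamingo-next-think-hf",
    "captioning": "nvidia/audio-flamingo-next-captioner-hf",
}


def pick_checkpoint(task: str = "chat") -> str:
    """Return the Hub ID for a task label, falling back to the default checkpoint."""
    return AF_NEXT_CHECKPOINTS.get(task, AF_NEXT_CHECKPOINTS["chat"])


model_id = pick_checkpoint("captioning")
```

Unknown labels fall back to the instruction-tuned checkpoint, which mirrors the recommendation to start with it by default.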

These Hub weights are released as an audio-text-to-text model. The broader AF-Next project also discusses streaming TTS and voice-to-voice interaction, but those components are not part of this checkpoint.

This model is for non-commercial research purposes only.

Usage

Install

AF-Next is supported in Transformers:

pip install --upgrade pip
pip install --upgrade transformers accelerate

To match the exact environment pinned by the demo Space, install Transformers from the branch it uses instead:

pip install --upgrade "git+https://github.com/lashahub/transformers.git@add_AudioFlamingoNext" accelerate

Notes

The processor expects mono 16 kHz audio.
Audio is internally processed in 30-second windows.
The released processor is configured for up to 1800 seconds of audio, i.e. 30 minutes.
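
The notes above translate into a simple pre-flight check. The sketch below uses only the Python standard library to read a WAV header and work out how many 30-second windows a clip spans. The helper names (`num_windows`, `inspect_wav`) are hypothetical, and counting a trailing partial window as a full window is an assumption, since the processor's exact padding behavior is not described here. The processor may also resample or downmix on its own, so treat this as a sanity check rather than a requirement.

```python
import math
import wave

SAMPLE_RATE = 16_000   # expected input rate, from the notes above
WINDOW_SECONDS = 30    # internal window length, from the notes above
MAX_SECONDS = 1800     # processor limit (30 minutes), from the notes above


def num_windows(duration_s: float) -> int:
    """Windows needed to cover a clip; assumes a partial last window counts as one."""
    return max(1, math.ceil(duration_s / WINDOW_SECONDS))


def inspect_wav(path: str) -> dict:
    """Read a WAV header and compare it against the notes above."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration_s = wav.getnframes() / rate
    return {
        "mono": channels == 1,
        "rate_ok": rate == SAMPLE_RATE,
        "duration_s": duration_s,
        "windows": num_windows(duration_s),
        "within_limit": duration_s <= MAX_SECONDS,
    }
```

For example, a clip at the 1800-second limit spans `1800 / 30 = 60` windows, and a 45-second clip spans two.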

Single-Turn Audio + Text

import torch
from transformers import AutoModel, AutoProcessor

model_id = "nvidia/audio-flamingo-next-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

conversation = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Transcribe the speech, identify important background sounds, "
                        "and mention approximate timestamps for key events."
                    ),
                },
                {"type": "audio", "path": "path/to/audio.wav"},
            ],
        }
    ]
]

# Apply the chat template: tokenizes the text and prepares the audio features.
batch = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

# Cast the audio features to the model dtype (bfloat16 here) to avoid a dtype mismatch.
if "input_features" in batch:
    batch["input_features"] = batch["input_features"].to(model.dtype)

generated = model.generate(
    **batch,
    max_new_tokens=1024,
    repetition_penalty=1.2,
)

# generate() returns prompt + completion; keep only the newly generated tokens.
prompt_len = batch["input_ids"].shape[1]
completion = generated[:, prompt_len:]
text = processor.batch_decode(
    completion,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(text)

Multi-Turn Follow-Up

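# Reuses `processor` and `model` from the single-turn example. Pass this
# conversation through the same apply_chat_template and generate calls.
conversation = [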
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Give me a timestamped summary of this long audio and note any "
                        "speaker changes."
                    ),
                },
                {"type": "audio", "path": "path/to/long_audio.mp3"},
            ],
        },
        {
            "role": "assistant",
            # Placeholder: insert the model's reply to the first turn here.
            "content": [{"type": "text", "text": "..."}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What happens right before the argument becomes heated?",
                }
            ],
        },
    ]
]
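
A follow-up conversation like the one above can also be assembled programmatically. The sketch below assumes the message schema shown in the examples (a batched list of dialogues, each message carrying a `content` list of typed parts). `make_turn` and `add_follow_up` are hypothetical helper names, not part of Transformers, and the `reply` string stands in for whatever text was decoded from the first generation.

```python
def make_turn(role, text, audio_path=None):
    """Build one chat message; text comes first, then an optional audio part."""
    content = [{"type": "text", "text": text}]
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    return {"role": role, "content": content}


def add_follow_up(conversation, assistant_text, question):
    """Return a new batched conversation with the reply and a follow-up appended."""
    dialogue = list(conversation[0])  # copy so the original stays untouched
    dialogue.append(make_turn("assistant", assistant_text))
    dialogue.append(make_turn("user", question))
    return [dialogue]


first = [[
    make_turn(
        "user",
        "Give me a timestamped summary of this long audio and note any speaker changes.",
        "path/to/long_audio.mp3",
    )
]]
reply = "<decoded model reply>"  # stand-in for the first generation's output
conversation = add_follow_up(
    first, reply, "What happens right before the argument becomes heated?"
)
```

Only the first user turn carries the audio part; later turns are text-only, matching the example above.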

Training Summary

AF-Next is trained with a four-stage curriculum spanning pre-training, mid-training, post-training, and temporally grounded reasoning training. The paper describes:

AF-Whisper-based audio modeling with broader multilingual and multi-talker speech coverage
expanded training data from AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat, and MF-Skills
45K additional multi-talker speech samples
200K+ long-form internet videos spanning roughly 5 to 30 minutes
2M+ real-world short audio skill samples mined from long-form audio
1M multi-audio instruction examples
30K multi-turn chat examples
386K safety and instruction-following examples
multilingual ASR and AST data from Emilia, CoVoST, MUST, Amazon-SIFT, ALI Meeting, aidatatang, AISHELL, and Granary
training on 128 NVIDIA H100 GPUs

AF-Next-Instruct is obtained after GRPO-based post-training focused on multi-turn chat, safety, instruction following, and selected AudioSkills-XL skills.

Architecture

The released checkpoint exposes AudioFlamingoNextForConditionalGeneration with AudioFlamingoNextProcessor. At a high level, AF-Next combines:

an AF-Whisper audio encoder using 128-bin log-mel features
non-overlapping 30-second audio chunking
a 2-layer MLP audio adaptor
a Qwen2.5-family text backbone extended to long context
RoTE for timestamp-aware temporal grounding

The released config uses:

audio_config.hidden_size = 1280
audio_config.num_hidden_layers = 32
text_config.hidden_size = 3584
text_config.num_hidden_layers = 28
text_config.max_position_embeddings = 131072

Limitations

The paper highlights several limitations:

internet-scale audio is still noisy and uneven across domains, languages, and acoustic conditions
low-resource languages, rare sound events, and specialized domains remain underrepresented
long-context reasoning is still difficult when relevant evidence is sparse or far apart in time
evaluation does not yet fully cover all supported capabilities, including multi-talker ASR, diarization, timestamped captioning, and voice-to-voice interaction

For most users, this is the best AF-Next checkpoint to start with. If you need explicit long-form reasoning traces, use nvidia/audio-flamingo-next-think-hf. If you want the most verbose descriptive captions, use nvidia/audio-flamingo-next-captioner-hf.

Citation

@misc{ghosh2026audioflamingonext,
  title={Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music},
  author={Sreyan Ghosh and Arushi Goel and Kaousheik Jayakumar and Lasha Koroshinadze and Nishit Anand and Zhifeng Kong and Siddharth Gururani and Sang-gil Lee and Jaehyeon Kim and Aya Aljafari and Chao-Han Huck Yang and Sungwon Kim and Ramani Duraiswami and Dinesh Manocha and Mohammad Shoeybi and Bryan Catanzaro and Ming-Yu Liu and Wei Ping},
  year={2026},
  howpublished={Technical report},
  url={https://afnext-umd-nvidia.github.io/}
}

Safetensors: model size 8B params, tensor type BF16
