Model Overview

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music


Audio Flamingo Next (AF-Next) is the next-generation open large audio-language model in the Audio Flamingo series. nvidia/audio-flamingo-next-hf is the instruction-tuned checkpoint for general audio understanding, question answering, and conversation over speech, environmental sounds, and music.

Description

Compared with Audio Flamingo 3, AF-Next adds:

  • a stronger foundational audio-language model for speech, sound, and music
  • training data scaled beyond academic benchmarks using public and internet-scale sources
  • native long-audio support up to 30 minutes
  • stronger multilingual ASR, multi-talker speech understanding, and long-form captioning
  • timestamp-aware modeling through Rotary Time Embeddings (RoTE)

This checkpoint corresponds to AF-Next-Instruct, the post-trained assistant variant. It is the best default checkpoint if you want:

  • general audio QA
  • instruction following
  • multi-turn audio chat
  • long-form audio understanding
  • timestamp-aware prompts

Best For

  • standard audio QA and instruction following across speech, sound, and music
  • assistant-style responses for long-audio questions, follow-up questions, and multi-turn chat
  • speech understanding tasks such as ASR, paralinguistic understanding, and multilingual automatic speech translation (AST)
  • music captioning and broad audio description when you want a direct answer instead of a dense long-form caption

AF-Next Variants

| Checkpoint                              | Use when you need                                                                        |
|-----------------------------------------|------------------------------------------------------------------------------------------|
| nvidia/audio-flamingo-next-hf           | default QA, chat, ASR / AST, and direct assistant-style answers                          |
| nvidia/audio-flamingo-next-think-hf     | explicit multi-step reasoning, timestamp-grounded evidence, and longer reasoning traces  |
| nvidia/audio-flamingo-next-captioner-hf | dense long-form captions, timestamped scene breakdowns, and more descriptive outputs     |
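
If you switch between variants in a script, a small lookup keeps the checkpoint IDs in one place. This is only a convenience sketch: the Hub IDs are the ones listed above, while the task labels ("chat", "reasoning", "captioning") and the function name are hypothetical and not part of any AF-Next API.

```python
# Hub IDs come from the variants listed above.
# The short task labels are hypothetical names chosen for this sketch.
AF_NEXT_CHECKPOINTS = {
    "chat": "nvidia/audio-flamingo-next-hf",
    "reasoning": "nvidia/audio-flamingo-next-think-hf",
    "captioning": "nvidia/audio-flamingo-next-captioner-hf",
}


def pick_checkpoint(task: str = "chat") -> str:
    """Return the Hub ID for a task label, raising on unknown labels."""
    try:
        return AF_NEXT_CHECKPOINTS[task]
    except KeyError:
        options = sorted(AF_NEXT_CHECKPOINTS)
        raise ValueError(f"unknown task {task!r}; expected one of {options}") from None


model_id = pick_checkpoint("reasoning")
print(model_id)
```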

These Hub weights are released as an audio-text-to-text model. The broader AF-Next project also discusses streaming TTS and voice-to-voice interaction, but those components are not part of this checkpoint.

This model is for non-commercial research purposes only.

Usage

Install

AF-Next is supported in Transformers:

pip install --upgrade pip
pip install --upgrade transformers accelerate

If you want the exact environment pinned by the demo Space, install Transformers from the branch it uses instead:

pip install --upgrade "git+https://github.com/lashahub/transformers.git@add_AudioFlamingoNext" accelerate

Notes

  • The processor expects mono 16 kHz audio.
  • Audio is internally processed in 30-second windows.
  • The released processor is configured for up to 1800 seconds of audio, i.e. 30 minutes.
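
To get a feel for how these numbers interact, the sketch below lays out the 30-second windows for a clip of a given duration. It only does the arithmetic; it does not call the processor. The function name is hypothetical, and truncating at 1800 seconds is an assumption, since the notes above do not say how the released processor treats longer input.

```python
import math

SAMPLE_RATE_HZ = 16_000      # mono 16 kHz, per the notes above
WINDOW_SECONDS = 30          # internal window length, per the notes above
MAX_AUDIO_SECONDS = 1_800    # processor limit (30 minutes), per the notes above


def plan_windows(duration_s: float) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds for consecutive 30-second windows.

    Assumption: audio beyond MAX_AUDIO_SECONDS is truncated. The released
    processor may handle over-length input differently.
    """
    if duration_s <= 0:
        return []
    usable = min(duration_s, MAX_AUDIO_SECONDS)
    n_windows = math.ceil(usable / WINDOW_SECONDS)
    return [
        (i * WINDOW_SECONDS, min((i + 1) * WINDOW_SECONDS, usable))
        for i in range(n_windows)
    ]


windows = plan_windows(95.0)
print(len(windows), windows[-1])
print(len(plan_windows(MAX_AUDIO_SECONDS)))
print(WINDOW_SECONDS * SAMPLE_RATE_HZ, "samples per full window")
```

Under these assumptions a 95-second clip maps to four windows, the last one partial, and a full 30-minute file maps to 60 windows of 480,000 samples each.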

Single-Turn Audio + Text

import torch
from transformers import AutoModel, AutoProcessor

model_id = "nvidia/audio-flamingo-next-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

conversation = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Transcribe the speech, identify important background sounds, "
                        "and mention approximate timestamps for key events."
                    ),
                },
                {"type": "audio", "path": "path/to/audio.wav"},
            ],
        }
    ]
]

# Render the chat template, load the referenced audio, and tokenize in one call.
batch = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

# Match the audio features to the model's dtype (bfloat16 here).
if "input_features" in batch:
    batch["input_features"] = batch["input_features"].to(model.dtype)

generated = model.generate(
    **batch,
    max_new_tokens=1024,
    repetition_penalty=1.2,
)

# generate() returns prompt + completion, so drop the prompt tokens before decoding.
prompt_len = batch["input_ids"].shape[1]
completion = generated[:, prompt_len:]
text = processor.batch_decode(
    completion,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(text)
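
The example above passes repetition_penalty=1.2 to generate(). As a rough illustration of what that setting does, here is a pure-Python sketch of the commonly used rule: tokens that already appear in the sequence have positive logits divided by the penalty and negative logits multiplied by it. The function name and the toy logits are made up for this sketch, and it is not the library's own code.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize tokens that already appear in the sequence.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so a seen token always becomes less likely.
    """
    adjusted = list(logits)  # copy so the input list is left untouched
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # toy vocabulary of three tokens
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1]))
```

Token 2 has not been generated yet, so its logit is unchanged. A penalty of 1.0 disables the adjustment entirely.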

Multi-Turn Follow-Up

conversation = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Give me a timestamped summary of this long audio and note any "
                        "speaker changes."
                    ),
                },
                {"type": "audio", "path": "path/to/long_audio.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "..." }],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What happens right before the argument becomes heated?",
                }
            ],
        },
    ]
]
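
The snippet above only defines the conversation; presumably it is then passed through the same apply_chat_template and generate calls as the single-turn example. If you build the history programmatically, helpers along these lines keep the message format consistent. The helper names are hypothetical, and the reply string is a placeholder standing in for text decoded from a previous generate call.

```python
def user_turn(text, audio_path=None):
    """Build one user message in the content-list format used above."""
    content = [{"type": "text", "text": text}]
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Wrap a previously generated reply as an assistant message."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def extend_conversation(conversation, reply_text, follow_up_text):
    """Return a new batch-of-one conversation with a reply and follow-up appended."""
    turns = list(conversation[0])  # copy so the caller's history is not mutated
    turns.append(assistant_turn(reply_text))
    turns.append(user_turn(follow_up_text))
    return [turns]


first = [[user_turn("Give me a timestamped summary of this long audio.",
                    "path/to/long_audio.mp3")]]
reply = "placeholder reply"  # in practice, the decoded output of model.generate
second = extend_conversation(
    first, reply, "What happens right before the argument becomes heated?"
)
print([turn["role"] for turn in second[0]])
```

Only the first user turn carries the audio; the follow-up is text-only, matching the structure shown above.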

Training Summary

AF-Next is trained with a four-stage curriculum spanning pre-training, mid-training, post-training, and temporally grounded reasoning training. The paper describes:

  • AF-Whisper-based audio modeling with broader multilingual and multi-talker speech coverage
  • expanded training data from AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat, and MF-Skills
  • 45K additional multi-talker speech samples
  • 200K+ long-form internet videos spanning roughly 5 to 30 minutes
  • 2M+ real-world short audio skill samples mined from long-form audio
  • 1M multi-audio instruction examples
  • 30K multi-turn chat examples
  • 386K safety and instruction-following examples
  • multilingual ASR and AST data from Emilia, CoVoST, MUST, Amazon-SIFT, ALI Meeting, aidatatang, AISHELL, and Granary
  • training on 128 NVIDIA H100 GPUs

AF-Next-Instruct is obtained after GRPO-based post-training focused on multi-turn chat, safety, instruction following, and selected AudioSkills-XL skills.
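
For readers unfamiliar with GRPO (Group Relative Policy Optimization), its defining step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization. The reward values are invented, and the reward functions, group sizes, and other hyperparameters used for AF-Next are not stated here, so treat this as an illustration of the general technique and not the project's training code.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative baseline that gives GRPO its name.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one audio question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.0])
print([round(a, 3) for a in advantages])
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.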

Architecture

The released checkpoint exposes AudioFlamingoNextForConditionalGeneration with AudioFlamingoNextProcessor. At a high level, AF-Next combines:

  • an AF-Whisper audio encoder using 128-bin log-mel features
  • non-overlapping 30-second audio chunking
  • a 2-layer MLP audio adaptor
  • a Qwen2.5-family text backbone extended to long context
  • RoTE for timestamp-aware temporal grounding
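
This card names RoTE but does not spell out its formulation. As a rough intuition only, the sketch below applies a standard rotary embedding where the rotation variable is a timestamp in seconds instead of an integer token index. That substitution, the base of 10000, and the function names are all assumptions made for illustration; consult the paper for the actual definition.

```python
import math


def rotate_by_time(vec, t_seconds, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to a timestamp.

    Assumption: same frequency schedule as standard rotary embeddings,
    with time in seconds replacing the token position.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)  # i = 2k, so the exponent is -2k/d
        angle = t_seconds * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# The same 2.5-second gap at two different absolute times gives the same score.
early = dot(rotate_by_time(q, 12.5), rotate_by_time(k, 10.0))
late = dot(rotate_by_time(q, 62.5), rotate_by_time(k, 60.0))
print(abs(early - late) < 1e-9)
```

The property checked at the end is what makes rotary schemes attractive for temporal grounding: the attention score between two positions depends on their time difference, not on where they fall in the recording.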

The released config uses:

  • audio_config.hidden_size = 1280
  • audio_config.num_hidden_layers = 32
  • text_config.hidden_size = 3584
  • text_config.num_hidden_layers = 28
  • text_config.max_position_embeddings = 131072
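
These sizes show what the audio adaptor has to bridge: 1280-dimensional encoder states on one side and a 3584-dimensional text backbone on the other. The sketch below estimates the size of a 2-layer MLP doing that projection. The inner width and the presence of biases are assumptions, since the config values above do not specify them, so the resulting count is indicative only.

```python
AUDIO_HIDDEN = 1280   # audio_config.hidden_size, from the released config
TEXT_HIDDEN = 3584    # text_config.hidden_size, from the released config


def linear_params(n_in, n_out, bias=True):
    """Parameter count of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def adaptor_params(inner=TEXT_HIDDEN):
    """Parameter count of a 2-layer MLP: AUDIO_HIDDEN -> inner -> TEXT_HIDDEN.

    Assumption: the inner width equals TEXT_HIDDEN and both layers have biases.
    """
    return linear_params(AUDIO_HIDDEN, inner) + linear_params(inner, TEXT_HIDDEN)


print(f"{adaptor_params():,}")
```

Under these assumptions the adaptor has roughly 17 million parameters, a small fraction of the 8B total.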

Limitations

The paper highlights several limitations:

  • internet-scale audio is still noisy and uneven across domains, languages, and acoustic conditions
  • low-resource languages, rare sound events, and specialized domains remain underrepresented
  • long-context reasoning is still difficult when relevant evidence is sparse or far apart in time
  • evaluation does not yet fully cover all supported capabilities, including multi-talker ASR, diarization, timestamped captioning, and voice-to-voice interaction

For most users, this is the best AF-Next checkpoint to start with. If you need explicit long-form reasoning traces, use nvidia/audio-flamingo-next-think-hf. If you want the most verbose descriptive captions, use nvidia/audio-flamingo-next-captioner-hf.

Citation

@misc{ghosh2026audioflamingonext,
  title={Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music},
  author={Sreyan Ghosh and Arushi Goel and Kaousheik Jayakumar and Lasha Koroshinadze and Nishit Anand and Zhifeng Kong and Siddharth Gururani and Sang-gil Lee and Jaehyeon Kim and Aya Aljafari and Chao-Han Huck Yang and Sungwon Kim and Ramani Duraiswami and Dinesh Manocha and Mohammad Shoeybi and Bryan Catanzaro and Ming-Yu Liu and Wei Ping},
  year={2026},
  howpublished={Technical report},
  url={https://afnext-umd-nvidia.github.io/}
}

Model size: 8B params (Safetensors, tensor type BF16)