nvidia/audio-flamingo-next-hf · Poor word-level audio analysis performance

by empeza - opened 3 days ago

The model doesn't seem to understand what "word-level" means: when asked for word-level timestamps, it keeps returning sentence-level ones. I guess it wasn't trained on finer-grained audio analysis.
