The model doesn't understand what "word-level" means and seems fixated on returning sentence-level timestamps. I guess it wasn't trained on finer-grained audio analysis.