Submit media inputs to generate text and speech responses
Generate conversational speech
Compare two audio samples to determine whether they come from the same speaker