Check it out and let me know what you think!
Space: awacke1/GPT-4o-omni-text-audio-image-video
Discussion: awacke1/GPT-4o-omni-text-audio-image-video
Test Runs for All Four Modalities: awacke1/GPT-4o-omni-text-audio-image-video#1
--Aaron - @awacke1
This looks great, thanks for sharing. Are you using the audio capabilities of GPT-4o, or first converting audio to text and using its text capabilities? I saw in their announcement that the audio capabilities are not publicly available to everyone through the API, so I wanted to check whether I am misunderstanding something.
"Developers can also now access GPT-4o in the API as a text and vision model. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks."
You can use whisper-1 for now, and that pattern works great. A speech WAV stream recorder isn't in the openai client code yet, so I use a Streamlit recorder to get speech in; that works, but I'm still looking for a better speech in/out technique. The same audio-to-text step is how the video modality gets its transcript, which goes in as additive input alongside the image slices from the video. One thing I also haven't seen yet is the image generator inside the client API; that would be nice to add, and the speech synthesis as well. Rough sketches of these patterns are below.
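A minimal sketch of that speech-in pattern, assuming the community audio-recorder-streamlit widget for capture (any recorder that hands back WAV bytes would work) and whisper-1 for the transcription:

```python
# Sketch: record audio in Streamlit, then transcribe it with whisper-1.
# audio_recorder comes from the community audio-recorder-streamlit package
# (an assumption here); any widget that returns WAV bytes will do.
import io

import streamlit as st
from openai import OpenAI
from audio_recorder_streamlit import audio_recorder

client = OpenAI()  # reads OPENAI_API_KEY from the environment

audio_bytes = audio_recorder()  # WAV bytes once the user stops recording
if audio_bytes:
    st.audio(audio_bytes, format="audio/wav")

    # whisper-1 wants a file-like object with a name so it can infer the format
    wav_file = io.BytesIO(audio_bytes)
    wav_file.name = "speech.wav"

    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=wav_file,
    )
    st.write(transcript.text)
```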
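And a rough sketch of the video pattern, assuming OpenCV for frame sampling: grab every Nth frame, base64-encode the slices, and send them together with the whisper transcript in one chat completion (the frame rate, frame cap, and prompt wording are placeholders):

```python
# Sketch: feed GPT-4o sampled video frames plus the audio transcript.
import base64

import cv2
from openai import OpenAI

client = OpenAI()

def sample_frames(video_path: str, every_n: int = 30) -> list[str]:
    """Grab every Nth frame and return them as base64-encoded JPEGs."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        i += 1
    cap.release()
    return frames

frames = sample_frames("input.mp4")
transcript_text = "..."  # output of the whisper-1 step above

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Describe this video. Audio transcript: {transcript_text}"},
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames[:20]  # cap the number of slices per request
            ],
        ],
    }],
)
print(response.choices[0].message.content)
```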
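Until GPT-4o's native image and speech outputs land in the API, the separate endpoints can stand in for them, the same way whisper-1 stands in for audio input. A minimal sketch, assuming the dall-e-3 and tts-1 model names:

```python
# Sketch: image generation and speech synthesis via the openai Python client
# (v1.x); model and voice names are assumptions.
from openai import OpenAI

client = OpenAI()

# Image generation through the images endpoint
image = client.images.generate(
    model="dall-e-3",
    prompt="A whiteboard sketch of a multimodal GPT-4o pipeline",
    size="1024x1024",
    n=1,
)
print(image.data[0].url)

# Speech synthesis through the audio endpoint
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Here is the spoken version of the model's reply.",
)
speech.write_to_file("reply.mp3")
```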