How to run the Qwen/Qwen2.5-Omni-7B model on a Mac?

#30

by CHSFM - opened 5 days ago

5 days ago

Hello everyone,

I'm trying to run the Qwen/Qwen2.5-Omni-7B model on my Mac, but I'm not sure about the best approach. Could someone please provide some guidance on:

- The recommended software/tools for running this model on macOS
- Any specific settings or configurations needed for optimal performance
- Whether Apple Silicon (M1/M2/M3/M4) is supported and if there are any special considerations
- Approximate memory requirements and performance expectations

Any help or pointers to relevant resources would be greatly appreciated. Thank you!

sion911

5 days ago

I got it running by having the Cursor AI agent stand it up, but honestly I have no idea how it was done beyond my natural-language prompts. That said, I wasn't able to configure the voice; it only had two by default. I have an M1 Max with 64 GB of RAM, and it only responded with voice after 2500 ms, which is too slow for my current project.

e1732a364fed

4 days ago

Try this and let me know if it works for anyone.

uv venv
source .venv/bin/activate
uv pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
uv pip install -r requirements_web_demo.txt
uv pip install accelerate qwen-omni-utils torchvision
python web_demo.py --cpu-only

If your network cannot reach Hugging Face, use HF_ENDPOINT=https://hf-mirror.com python web_demo.py --cpu-only instead.
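
The commands above force CPU execution with --cpu-only. For anyone scripting their own loader, here is a minimal sketch of device selection in PyTorch. The helper names pick_device and detect_device are hypothetical and not part of web_demo.py, and nothing in this thread confirms that the model runs correctly on the MPS backend, so treat "mps" as something to try rather than a supported path.

```python
def pick_device(cpu_only: bool, cuda_available: bool, mps_available: bool) -> str:
    """Choose a PyTorch device string from availability flags (hypothetical helper)."""
    if cpu_only:
        return "cpu"   # mirrors the intent of the --cpu-only flag
    if cuda_available:
        return "cuda"  # not expected on a Mac, kept for completeness
    if mps_available:
        return "mps"   # Apple Silicon GPU backend; model support is an assumption
    return "cpu"


def detect_device(cpu_only: bool = False) -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(
        cpu_only,
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


if __name__ == "__main__":
    print(detect_device())
```

If "mps" is reported but the model fails or produces bad output there, falling back to the CPU-only command above is the path this thread actually describes.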

pudepiedj

3 days ago

I can just about get everything to run on an M2 Max with 32 GB by handling one media type at a time and running nothing else at all. On an M4 Max with 128 GB it is a breeze.
You have to pip uninstall transformers, as described in the docs, and then follow the sequence given there to install a custom version of transformers. HF doesn't yet include the model in the released library (as of 2025-04-04), so the version from a plain pip install transformers throws a "cannot load model Qwen2_5Omni_Model"-style error.
Text conversations work well even with limited hardware.
Image description is very good.
Audio transcription is a bit 'iffy' because of memory requirements.
Video can be described, but only by using very short videos and reducing the resolution and frame count.
Serious use requires more than 32 GB, but you should be OK with 64 GB.
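
As a rough sanity check on these numbers, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes a nominal 7 billion parameters (taken from the model name; the actual checkpoint may hold more once its audio, vision, and speech components are counted) and ignores activations, caches, and media buffers, so it gives a lower bound rather than a measurement.

```python
# Bytes used per parameter for common PyTorch dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate memory for model weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 7e9  # assumption: nominal size from the model name
    for dtype in ("float32", "bfloat16"):
        gib = weight_memory_gib(nominal_params, dtype)
        print(f"{dtype}: {gib:.1f} GiB for weights alone")
```

Under these assumptions the weights take roughly 13 GiB at 16-bit precision and twice that at 32-bit, which is consistent with 32 GB being tight once audio or video inputs are added.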
