Jan 13, 2024

Hi Guys,

I just got a MacBook M3 Max (36 GB RAM, 30‑core GPU) and I'd like to run Mixtral on MLX and do some RAG. I saw there are already some quantized Mixtral models (https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF). Does anybody have experience with it? Could you provide an example or a few best practices?
Thanks
Aymen

whgerber

Jan 27, 2024

edited Jan 27, 2024

You can either convert the model to a quantized MLX version yourself or just use one of the models the community has already converted. For example, you could use this model like so:

from mlx_lm import load, generate

prompt1 = """
[INST] Translate the following English text to German.

Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.
[/INST]
"""

model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx")
response = generate(model, tokenizer, prompt=prompt1, max_tokens=500, verbose=True, temp=0.3)

The output on my M3 Max with 36 GB RAM is:

Sehr geehrter Amazon,

letzte Woche habe ich über Ihren deutschen Online-Shop eine Actionfigur von Optimus Prime bestellt. Leider musste ich feststellen, als ich die Sendung öffnete, dass mir stattdessen eine Megatron-Actionfigur zugesandt wurde! Als Lebensgefährte der Decepticons hoffe ich, dass Sie meine Situation verstehen können. Um das Problem zu lösen, fordere ich einen Austausch von Megatron gegen die Optimus Prime-Figur, die ich bestellt habe. Anbei finden Sie Kopien meiner Aufzeichnungen zu diesem Kauf. Ich erwarte bald eine Antwort von Ihnen.

Mit freundlichen Grüßen,
Bumblebee.

Prompt: 9.996 tokens-per-sec
Generation: 12.845 tokens-per-sec
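
If you would rather take the other route and convert the model yourself, here is a minimal sketch of how that could look. It only builds the command line for the `mlx_lm.convert` tool and prints it. The source repo `mistralai/Mixtral-8x7B-Instruct-v0.1` and the output folder name `mixtral-4bit-mlx` are my assumptions for illustration, and flag names can change between mlx-lm versions, so check `python -m mlx_lm.convert --help` first.

```python
import sys


def build_convert_command(hf_path, mlx_path, q_bits=4, q_group_size=64):
    """Return the argv list for mlx_lm's converter with quantization enabled."""
    return [
        sys.executable, "-m", "mlx_lm.convert",
        "--hf-path", hf_path,        # Hugging Face repo to download and convert
        "--mlx-path", mlx_path,      # local output folder (hypothetical name below)
        "-q",                        # quantize while converting
        "--q-bits", str(q_bits),
        "--q-group-size", str(q_group_size),
    ]


# Assumed source repo and output folder, for illustration only.
cmd = build_convert_command("mistralai/Mixtral-8x7B-Instruct-v0.1", "mixtral-4bit-mlx")
print(" ".join(cmd))

# To actually run the conversion (downloads the full-precision weights first):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keep in mind that the full-precision weights are far larger than the 4-bit result, so you need a lot of free disk space for the download. After converting, you should be able to pass the output folder to `load()` in place of the community model name.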

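Note that by default macOS only lets the GPU use a fraction of the total RAM (commonly reported as roughly two thirds to three quarters, depending on the machine), which can be too tight for a model of this size. You can increase this limit (the value is in MB, and the setting is reset on reboot) with: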

sudo sysctl iogpu.wired_limit_mb=29000
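
To get a feeling for why the limit matters, here is a rough back-of-the-envelope sketch of the weight size of the 4-bit model. The numbers are assumptions on my side: about 46.7 billion total parameters for Mixtral 8x7B, and 4-bit quantization in groups of 64 weights with one 16-bit scale and one 16-bit bias per group. In practice not every tensor is quantized, and the KV cache and activations need memory on top, so treat it as an estimate only.

```python
def quantized_weight_bytes(n_params, bits=4, group_size=64, overhead_bits=32):
    """Estimate weight size for group-wise quantization.

    overhead_bits is the per-group metadata (assumed: 16-bit scale + 16-bit bias).
    """
    bits_per_weight = bits + overhead_bits / group_size  # 4.5 bits with the defaults
    return n_params * bits_per_weight / 8


N_PARAMS = 46.7e9   # assumption: total parameter count of Mixtral 8x7B
LIMIT_MB = 29000    # the value passed to iogpu.wired_limit_mb above

weights_mb = quantized_weight_bytes(N_PARAMS) / 1e6
headroom_mb = LIMIT_MB - weights_mb

print(f"estimated weights: {weights_mb:,.0f} MB")
print(f"headroom under a {LIMIT_MB} MB limit: {headroom_mb:,.0f} MB")
```

So the weights alone come to roughly 26 GB, which leaves only a few GB under the 29000 MB limit for everything else. Long prompts or a large `max_tokens` will eat into that.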

meddebma

Jan 28, 2024

Thank you so much dear @whgerber !

meddebma changed discussion status to closed Jan 28, 2024


Mixtral quantization for Apple Silicon
