Mixtral quantization for Apple Silicon
Hi Guys,
I just got a MacBook M3 Max (36 GB RAM, 30-core GPU) and I'd like to run Mixtral on MLX and do some RAG. I saw there are already some quantized Mixtral models (https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF). Does anybody have experience with this? Could you provide an example or a few best practices?
Thanks
Aymen
You can either convert the model to a quantized MLX version yourself or use one of the MLX models already provided by the community. As an example, you could use the community 4-bit Mixtral model like so:
from mlx_lm import load, generate
prompt1 = """
[INST] Translate the following English text to German.
Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.
[/INST]
"""
model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx")
response = generate(model, tokenizer, prompt=prompt1, max_tokens=500, verbose=True, temp=0.3)
The output on my M3 Max with 36 GB RAM is:
Sehr geehrter Amazon,
letzte Woche habe ich über Ihren deutschen Online-Shop eine Actionfigur von Optimus Prime bestellt. Leider musste ich feststellen, als ich die Sendung öffnete, dass mir stattdessen eine Megatron-Actionfigur zugesandt wurde! Als Lebensgefährte der Decepticons hoffe ich, dass Sie meine Situation verstehen können. Um das Problem zu lösen, fordere ich einen Austausch von Megatron gegen die Optimus Prime-Figur, die ich bestellt habe. Anbei finden Sie Kopien meiner Aufzeichnungen zu diesem Kauf. Ich erwarte bald eine Antwort von Ihnen.
Mit freundlichen Grüßen,
Bumblebee.
Prompt: 9.996 tokens-per-sec
Generation: 12.845 tokens-per-sec
Note that by default macOS limits how much of the unified memory the GPU can use (typically around two thirds to three quarters of RAM, so less than the full 36 GB). You can raise this limit, here to roughly 29 GB, with:
sudo sysctl iogpu.wired_limit_mb=29000
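To get a feel for why that limit matters, here is a rough back-of-the-envelope sketch of how much memory the 4-bit weights need. It rests on several assumptions that are not from the post above: Mixtral 8x7B has roughly 46.7 billion parameters in total, every parameter is quantized in groups of 64 with a 16-bit scale and a 16-bit bias per group (about 4.5 bits per weight), and the sysctl value is counted in MiB. The helper names are made up for illustration, and the KV cache and activations are not counted.

```python
# Rough estimate only; all constants below are assumptions, not measurements.
MIB = 1024 ** 2


def quantized_weight_mib(n_params, bits=4, group_size=64, overhead_bits=32):
    """Approximate size in MiB of group-quantized weights.

    Assumes every parameter is quantized and that each group of
    `group_size` weights stores `overhead_bits` of extra data
    (e.g. a 16-bit scale plus a 16-bit bias).
    """
    bits_per_weight = bits + overhead_bits / group_size
    return n_params * bits_per_weight / 8 / MIB


def headroom_mib(wired_limit_mb, n_params, **kwargs):
    """Memory left under the wired limit after loading the weights.

    A negative result means the weights alone do not fit.
    """
    return wired_limit_mb - quantized_weight_mib(n_params, **kwargs)


MIXTRAL_PARAMS = 46.7e9  # approximate total parameter count (assumption)

weights = quantized_weight_mib(MIXTRAL_PARAMS)
print(f"estimated weights: {weights:,.0f} MiB")
print(f"headroom at 29000 MB: {headroom_mib(29000, MIXTRAL_PARAMS):,.0f} MiB")
```

Under these assumptions the weights alone take most of the 29000 MB, so the remaining headroom is what is left for the prompt, the KV cache and everything else the GPU needs. Long RAG prompts will eat into that quickly.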