How to fine-tune Guanaco and run it on a multi-core CPU (>32 cores)?
I am wondering: can we fine-tune Guanaco using QLoRA and then run the fine-tuned model on CPUs using llama.cpp? The missing piece is how to convert the fine-tuned model to a 4-bit format that llama.cpp can run.
To use QLoRA you should start from my fp16 repo of this model, TheBloke/guanaco-65B-HF. You can't run QLoRA on an already-quantised model like this one.
At least not yet. There is a PR in the AutoGPTQ repo to add PEFT/LoRA support to AutoGPTQ, which would allow fine-tuning on GPTQ models, but I don't think it's quite finished yet. Check the Pull Requests on the AutoGPTQ repo for more info.
As to running QLoRA from TheBloke/guanaco-65B-HF: I don't yet have personal experience with QLoRA, but here's an introductory video: https://youtu.be/8vmWGX1nfNM
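On the conversion question from the original post: the usual route is to merge the trained QLoRA adapter back into the fp16 base weights, then use llama.cpp's own conversion and quantisation tools. A rough sketch, where the paths, filenames and the `q4_0` quantisation choice are all assumptions:

```shell
# 1) Merge the LoRA adapter into the fp16 base weights first
#    (in Python with peft: PeftModel.from_pretrained(base, adapter),
#     then merge_and_unload() and save_pretrained to the merged dir)
MERGED_DIR=./guanaco-65b-merged

# 2) From the llama.cpp repo: convert the merged HF model to ggml fp16
python convert.py "$MERGED_DIR" --outfile guanaco-65b-f16.bin

# 3) Quantise to 4-bit so it fits in RAM and runs on CPU
./quantize guanaco-65b-f16.bin guanaco-65b-q4_0.bin q4_0

# 4) Run inference, using -t to spread work across your CPU cores
./main -m guanaco-65b-q4_0.bin -t 32 -p "Hello"
```

The key point is that llama.cpp can't consume the LoRA-on-4-bit training setup directly; you merge back to fp16 first and then quantise with llama.cpp's pipeline.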
If you have further questions, come to my Discord. Aemon, the creator of that video, is on my Discord and should be willing to help further if he's around.