Why can't we make this fully HF-ready? Loading it prints:
> Loading cerebras/btlm-3b-8k-base requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/cerebras/btlm-3b-8k-base. You can dismiss this prompt by passing `trust_remote_code=True`.
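For anyone else hitting this, a minimal sketch of loading the model with the flag, assuming the standard `transformers` API (the prompt and generation settings are illustrative):

```python
# Minimal sketch: loading BTLM by passing trust_remote_code explicitly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True allows the custom model class in the repo to run.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("BTLM is a 3B-parameter model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```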
It would be nice if people could just run it without having to check these details.
@CUIGuy I agree :) One of the constraints we had is that HuggingFace does not support the muP implementation that we shipped with our BTLM model, so it lives in a custom class. We believe it can greatly benefit your fine-tuning regime. Once muP is fully adopted by HuggingFace, it should just work without the additional flag. We are in close communication with HuggingFace, so we believe it should happen soon.
Is there more information supporting your claim that the muP implementation will benefit fine-tuning? I can understand it being useful for pretraining. @daria-soboleva
Also, will this be compatible with something like vLLM down the road?
@CUIGuy we are releasing our paper soon with all the details on how muP is helpful, but for now feel free to take a look at https://arxiv.org/abs/2304.03208 or https://github.com/microsoft/mup for details on how it works. At a high level, it should drastically reduce the number of hyperparameter (HP) experiments you need to run: you can tune HPs at a smaller scale and zero-shot transfer those values to a larger scale, saving the compute needed to find the best HPs at the large scale. A sketch of that workflow follows below.
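To make the transfer workflow concrete, here is a minimal sketch using the microsoft/mup package linked above; the toy model, widths, and learning rate are illustrative assumptions, not BTLM's actual setup:

```python
# Minimal sketch of muP-style hyperparameter transfer with microsoft/mup.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

def make_mlp(width: int) -> nn.Sequential:
    # MuReadout replaces the final Linear so its init and LR scale
    # correctly with width, which is what enables zero-shot HP transfer.
    return nn.Sequential(
        nn.Linear(256, width),
        nn.ReLU(),
        MuReadout(width, 10),
    )

base = make_mlp(width=64)      # tune HPs at this small scale...
delta = make_mlp(width=128)    # used to infer which dims grow with width
target = make_mlp(width=4096)  # ...then reuse those HPs at the target width

set_base_shapes(target, base, delta=delta)
# MuAdam applies muP's per-layer learning-rate scaling, so the same lr
# found at the small scale transfers to the wide model.
optimizer = MuAdam(target.parameters(), lr=1e-3)
```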
Thank you for the recommendation to support vLLM; for now we have support on HF, but if there is more demand for adding it to the vLLM codebase, we can certainly do that :)
Thanks. By the way, do you have a timeline for when the HF version will be ready? Also, when will it (the HF version) appear on the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?
I guess you mean when the HF version without the trust_remote_code flag will be ready? I would imagine in the next few months, but unfortunately I cannot provide a more concrete deadline.
OK. Meanwhile, is it possible to release a frozen version (without muP) so that we can use it without poking into the custom code? Many people only care about fine-tuning a specific model size, so muP is not that useful to them. @daria-soboleva
Hi @CUIGuy, thanks for your interest! I don't believe HF currently supports SwiGLU and ALiBi in the GPT2 model class that we build on (though maybe I've missed an alternative), so even without muP a custom class and `trust_remote_code` may be required for models with the BTLM architecture.
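For reference, minimal sketches of those two components look roughly like this; shapes, names, and sign conventions are illustrative, not BTLM's actual implementation:

```python
# Sketches of SwiGLU and ALiBi, the two pieces a stock GPT2 class lacks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block: (SiLU(x W_gate) * x W_value) W_out."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear position biases, added to attention logits
    (a causal mask is assumed to be applied separately)."""
    # Geometric head slopes 2^(-8h/n_heads), as in the ALiBi paper.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Penalize attention to distant positions linearly, per head.
    return slopes[:, None, None] * -distance.abs().float()
```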