The VRAM requirement of the new Mistral-Large (2411) GPTQ int4 quantization seems to have gone up dramatically

#1
by YanchengQian - opened

As the title says, the old version ran on four 2080 Ti cards and a 4000-token context was no problem, but the new version seems to accept at most about a 500-token context before VRAM blows up on the spot. The new AWQ version is not affected, but AWQ quality is noticeably worse. Has anyone else run into this?
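Out of curiosity about whether the context length alone explains the jump, here is a back-of-the-envelope KV-cache estimate. The architecture numbers below (88 layers, 8 KV heads with grouped-query attention, head dim 128) are assumptions for Mistral-Large and should be checked against the model's config.json:

```python
# Rough fp16 KV-cache size for a given context length.
# ASSUMED architecture values -- verify against config.json.
NUM_LAYERS = 88
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16 cache

def kv_cache_bytes(num_tokens: int) -> int:
    # 2x for the separate key and value tensors in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * num_tokens

for ctx in (500, 4000):
    print(f"{ctx:>5} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

If these assumptions are in the right ballpark, a 4000-token fp16 KV cache is only about 1.3 GiB spread over four cards, which would suggest the blow-up comes from somewhere else (kernel workspace, activation buffers) rather than the cache itself.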

Is it related to the vLLM version?

Possibly. After this 2411 version update, the vllm library and the original mistral library need to be updated to their latest versions; an environment that only supports 2407 cannot run inference. That may be the cause. I'll go check.
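A quick way to see whether a local environment is behind is to compare the installed package versions against what the 2411 model card asks for. The minimum versions below are placeholders, not the real requirements:

```python
import importlib.metadata
import re

def version_tuple(v: str) -> tuple:
    # Keep only the leading numeric components ("0.6.3.post1" -> (0, 6, 3)).
    return tuple(int(p) for p in re.findall(r"\d+", v)[:3])

# HYPOTHETICAL minimums -- check the model card for the real ones.
REQUIRED = {"vllm": "0.6.4", "mistral_common": "1.5.0"}

for pkg, min_ver in REQUIRED.items():
    try:
        installed = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg}: not installed (pip install -U {pkg})")
        continue
    if version_tuple(installed) < version_tuple(min_ver):
        print(f"{pkg}: {installed} < {min_ver}, upgrade with: pip install -U {pkg}")
    else:
        print(f"{pkg}: {installed} OK")
```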
