Run TheBloke/Xwin-LM-70B-V0.1-AWQ on Multiple GPUs
I hope someone can help me with this. I'm trying to run inference on 2 GPUs using vLLM.
I'm using:
llm = LLM(model=MODEL_NAME, quantization="awq", dtype="half", tensor_parallel_size=2)
But it fails with an AssertionError. My question is: can this model run inference on multiple GPUs? If yes, what does the error below mean?
559 return ignored
561 # Execute the model.
--> 562 output = self._run_workers(
563 "execute_model",
564 seq_group_metadata_list=seq_group_metadata_list,
565 blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
566 blocks_to_swap_out=scheduler_outputs.blocks_to_swap_out,
567 blocks_to_copy=scheduler_outputs.blocks_to_copy,
568 )
570 return self._process_model_outputs(output, scheduler_outputs) + ignored
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/vllm/engine/llm_engine.py:712, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
710 output = all_outputs[0]
711 for other_output in all_outputs[1:]:
--> 712 assert output == other_output
713 return output
AssertionError:
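For reference, here is the full minimal script I'm running, reconstructed from the call above (the prompt and sampling settings are just illustrative):

from vllm import LLM, SamplingParams

MODEL_NAME = "TheBloke/Xwin-LM-70B-V0.1-AWQ"

# Shard the AWQ-quantised model across 2 GPUs with tensor parallelism.
llm = LLM(model=MODEL_NAME, quantization="awq", dtype="half",
          tensor_parallel_size=2)

# Illustrative sampling settings; any SamplingParams hit the same error.
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)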
This was solved by a fix in vLLM and had nothing to do with the model.
Thanks for letting us know
Out of interest, what two GPUs are you using and how is performance compared to one GPU?
With unquantised models, my experience is that servers that use tensor parallelism, like TGI and vLLM, don't scale well across multiple GPUs: the second GPU adds only 60-80% more performance, not 100%. So I prefer to scale across multiple separate instances behind a load balancer, along the lines of the sketch below. But I've not yet tried this with AWQ models.
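To illustrate what I mean, here is a rough sketch of the multi-instance approach, using vLLM's simple api_server entrypoint. The ports and the round-robin client are just illustrative, and this assumes the model fits on a single GPU (which a 70B AWQ model should on an 80GB card):

import itertools
import os
import subprocess

import requests

MODEL = "TheBloke/Xwin-LM-70B-V0.1-AWQ"
PORTS = [8000, 8001]

# Launch one single-GPU vLLM server per device, instead of one
# tensor-parallel server spanning both GPUs.
servers = [
    subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.api_server",
         "--model", MODEL, "--quantization", "awq",
         "--dtype", "half", "--port", str(port)],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)},
    )
    for gpu, port in enumerate(PORTS)
]

# Naive round-robin "load balancer": send each prompt to the next server.
next_port = itertools.cycle(PORTS)

def generate(prompt: str) -> str:
    port = next(next_port)
    resp = requests.post(
        f"http://localhost:{port}/generate",
        json={"prompt": prompt, "max_tokens": 128},
    )
    resp.raise_for_status()
    return resp.json()["text"][0]

In production you'd put nginx or similar in front of the instances instead of a client-side loop, but the idea is the same: each GPU serves complete requests independently, so you avoid the tensor-parallel communication overhead.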
Thanks for the insight. Have you had a chance to try it since?