Can it run on a SageMaker g5 instance?
Hi,
I know the full Falcon-180B runs on SageMaker on a p4de.24xlarge instance (8 * A100-80GB).
An 8-bit variant runs on SageMaker on a p4d.24xlarge instance (8 * A100-40GB).
I'm trying to see if the 4-bit GPTQ variant will work on a g5.48xlarge instance (8 * A10G-24GB).
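For context, a rough back-of-envelope sizing (my own estimate, ignoring KV cache and runtime overhead) suggests the 4-bit weights alone should fit:

# Hedged sizing sketch: 4-bit GPTQ weights for ~180B parameters vs. the 8 x 24 GB of a g5.48xlarge.
# Ignores KV cache, activations, CUDA context and TGI overhead, which are all significant.
n_params = 180e9
weight_gb = n_params * 0.5 / 1e9     # ~0.5 bytes per parameter at 4 bits, plus some per-group scales/zeros
total_vram_gb = 8 * 24               # g5.48xlarge: 8 GPUs x 24 GB each
print(f"weights ~{weight_gb:.0f} GB of ~{total_vram_gb} GB total VRAM")   # roughly 90 GB of 192 GB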
I'm using Hugging Face TGI; any ideas why I'm seeing the following error: "NotImplementedError: Tensor Parallelism is not implemented for 14 not divisible by 8"?
Thanks
With the release of TGI 1.1.0 it is possible to load this model onto a g5.48xlarge on AWS.
But there seems to be a memory error when the model loads and tries to prefill.
The error repeats for both the 4-bit and 3-bit versions, which is odd.
Has anyone managed to deploy via TGI 1.1.0 on g5.48xlarge? I am seeing the same CUDA illegal memory access error during prefill:
2023-10-20T03:26:32.395175Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-10-20T03:26:34.571922Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 672, in warmup
_, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in generate_token
raise e
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 750, in generate_token
out = self.forward(batch)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 717, in forward
return self.model.forward(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 643, in forward
hidden_states = self.transformer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 603, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 521, in forward
attn_output = self.self_attention(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 353, in forward
return self.dense(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 349, in forward
out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/opt/conda/lib/python3.9/site-packages/triton/testing.py", line 144, in do_bench
torch.cuda.synchronize()
File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 688, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 72, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 674, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2023-10-20T03:26:34.573224Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Decreasing max-batch-prefill-tokens does not help resolve this error.
It seems that you need to adjust the max-batch-prefill-tokens value, since you ran out of memory on the instance.
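If you're deploying through the SageMaker Hugging Face LLM container, the flag can, as far as I know, be set via the container environment; a minimal sketch, assuming the container maps MAX_BATCH_PREFILL_TOKENS to --max-batch-prefill-tokens and using a placeholder model id:

import json

config = {
    'SM_NUM_GPUS': json.dumps(8),
    'MAX_INPUT_LENGTH': json.dumps(1024),
    'MAX_TOTAL_TOKENS': json.dumps(1536),
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(1024),   # well below the 4096 default shown in the log above
    'HF_MODEL_ID': '<your-falcon-180b-gptq-repo>',  # placeholder, use whichever GPTQ repo you are loading
    'HF_MODEL_QUANTIZE': 'gptq',
}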
Any update on this? I'm still getting the same error on the g5.48xlarge with TGI 1.1.0 (8 x 24 GB VRAM) with the GPTQ version of Falcon-180B.
I tried going down to 100 max prefill tokens and I still get "You need to decrease --max-batch-prefill-tokens".
How can I estimate the extra memory required after the model is loaded?
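One rough way is to size the KV cache, which grows by about 2 * n_layers * n_kv_heads * head_dim * bytes per token and is split across the shards. A sketch, assuming the commonly quoted Falcon-180B shape (80 layers, multi-query attention with 8 KV heads, head_dim 64; worth double-checking against the model config) and an fp16 cache:

# Hedged KV-cache estimate; the Falcon-180B config values below are assumptions, not read from the checkpoint.
n_layers, n_kv_heads, head_dim = 80, 8, 64
bytes_per_value = 2                                                   # fp16 K/V entries
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value    # factor 2 for K and V
tokens = 4096                                                         # the default --max-batch-prefill-tokens from the log
total_gb = per_token * tokens / 1e9
print(f"~{per_token / 1024:.0f} KiB per token, ~{total_gb:.2f} GB for {tokens} tokens")
# Activations, the Triton autotuner's scratch buffers and the CUDA context come on top of this.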
I'll be honest, I couldn't get the GPTQ version working on TGI 1.1.0, but 1.1.0 does support bitsandbytes-nf4, which did work for me on a g5.48xlarge.
My configuration is:
import json
from sagemaker.huggingface import HuggingFaceModel

config = {
    'SM_NUM_GPUS': json.dumps(8),
    'MAX_TOTAL_TOKENS': json.dumps(2048 + 512),
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'HUGGING_FACE_HUB_TOKEN': HUGGING_FACE_HUB_TOKEN,
    'HF_MODEL_ID': 'tiiuae/falcon-180B-chat',
    'HF_MODEL_QUANTIZE': 'bitsandbytes-nf4',  # NF4 quantization at load time instead of a pre-quantized GPTQ checkpoint
}

hf_model = HuggingFaceModel(role=SAGE_ROLE, image_uri=LLM_CONTAINER, env=config)
predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.48xlarge',
    container_startup_health_check_timeout=900,
)
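For completeness, a minimal invocation once the endpoint is up; the payload shape ("inputs" plus "parameters") follows the TGI generate schema the LLM container expects, and the prompt is just a placeholder:

# Hypothetical prompt; tune the generation parameters to your use case.
response = predictor.predict({
    "inputs": "Write a short poem about GPUs.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7, "do_sample": True},
})
print(response)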
I think there was a problem with TGI 1.1.0 (https://github.com/huggingface/text-generation-inference/issues/1000); you could try 1.1.1 and see if that resolves it.
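If you want to pin the container version explicitly, the SageMaker SDK has a helper for resolving the TGI image URI; the exact version strings available depend on your sagemaker SDK release, so treat this as a sketch:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolve the Hugging Face TGI container image for a given version (bump to 1.1.1 once your SDK lists it).
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")
print(llm_image)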