Inference very slow on A100
Hi, I have a problem.
My throughput is really slow (1 s/token) with the model loaded on two A100 GPUs. I have already tried flash-attention, loading the model in float16, torch.compile, a bigger batch size, etc., but none of them seems to make inference faster.
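For reference, the loading code looks roughly like this (the model id is just a placeholder for the checkpoint I'm using; the relevant bits are float16 and flash-attention 2 through the standard transformers API):
import torch
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-34b-hf"  # placeholder, not necessarily the exact checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shards the weights across the two A100s
)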
Also, the additional amount of GPU memory used during inference seems disproportionate (70 GB for the weights + another 30-40 GB for generation, even with batch size = 1). When I use Mixtral 8x7B, the additional memory used for generation is far lower.
Any help to solve this problem would be welcome :) Thanks! Or maybe all of this is normal?
Slow generation speed is expected with this size model.
Actually, torch.compile should speed things up, but it's not yet supported for Llava models, and compiling the model as-is will probably cause a bunch of recompilations at every step, which in turn slows down generation even more. We're working on supporting compilation of Llava models in fullgraph mode.
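Just to illustrate, fullgraph compilation means something like the snippet below; with the current (dynamic) generation path it would recompile at every decoding step, so don't expect it to help yet:
import torch

# Illustration only, not yet a supported path for Llava models.
# `model` is the Llava model you already loaded.
compiled_model = torch.compile(model, fullgraph=True)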
Regarding memory usage, I'm not sure how you're measuring it, but it shouldn't take an extra 30 GB for the generation part. In my case it takes around 1 GB of VRAM for tensor allocations with bs=1, though more memory might be reserved due to fragmentation and the way memory is managed in CUDA. See my comment here for more details. Here's how I measured memory usage:
import torch
from time import perf_counter

# model, processor and inputs are assumed to be set up already
torch.cuda.reset_peak_memory_stats()  # track only the peak reached during generation
init_mem = torch.cuda.memory_allocated()

# Generate
start = perf_counter()
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(out)

print(f"Time: {(perf_counter() - start):.05f} seconds")
print(f"Mem allocated: {(torch.cuda.max_memory_allocated() - init_mem) // 1024 ** 2} MiB")
print(f"Mem reserved: {(torch.cuda.max_memory_reserved() - init_mem) // 1024 ** 2} MiB")
Yes, if you really want high performance, it's recommended to try out the TGI or vLLM inference servers.
TGI recently added support for LLaVa-1.6 (see this for a list of supported models): https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/visual_language_models
vLLM also just added support, only LLaVa 1.5 though (no 1.6 yet): https://docs.vllm.ai/en/latest/models/vlm.html
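For example, once a TGI server is running with a LLaVa-1.6 checkpoint, querying it from Python looks roughly like this (the endpoint, checkpoint and prompt are illustrative; check the TGI docs linked above for the prompt template each model expects):
from huggingface_hub import InferenceClient

# Assumes a TGI server was started locally with a LLaVa-1.6 checkpoint,
# e.g. llava-hf/llava-v1.6-mistral-7b-hf, and is listening on port 3000.
client = InferenceClient("http://127.0.0.1:3000")

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/rabbit.png"
prompt = f"![]({image_url})What is this a picture of?\n\n"

for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token, end="")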