Optimizing Model Inference Speed for Large HTML Inputs
Hello,
I'm currently running the ReaderLM model on an H20, with the max_new_tokens parameter set to 4096 as per the official guidelines. My typical input consists of HTML documents of 20 to 40 KB, and inference generally takes about 4 to 5 minutes. Is this duration normal? If so, could you recommend strategies to speed up inference?
Thank you for your assistance.
I believe you can do some simple cleaning of the HTML input; basically, that works.
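For example, something along these lines (a rough sketch, not the official cleaning logic): strip scripts, styles, comments, SVGs, and inline base64 images before sending the HTML to the model, so there are fewer tokens to process.

```python
# Rough sketch of simple HTML cleaning before inference (not the official logic).
import re

def clean_html(html: str) -> str:
    html = re.sub(r"<script[^>]*>.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<style[^>]*>.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<svg[^>]*>.*?</svg>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)      # drop HTML comments
    html = re.sub(r'src="data:image/[^"]*"', 'src=""', html)     # drop inline base64 images
    html = re.sub(r"\s{2,}", " ", html)                          # collapse repeated whitespace
    return html.strip()
```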
Thank you for your reply. The HTML I feed in has already been thoroughly cleaned, even more aggressively than the official cleaning logic. Since I'm working on real-time search and web-based RAG, some web pages remain that large even after cleaning.
Update: after replacing Transformers inference with vLLM, the speed is now about five times faster than before, and in the best cases up to ten times faster. However, some HTML inputs run into problems with vLLM's tokenizer: no error is raised, but the throughput counter shows tens of thousands of tokens per second and the returned output is empty. The exact cause is still unclear, but it is at least certain that vLLM inference can significantly improve speed.
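For reference, the vLLM setup is roughly along these lines (the jinaai/ReaderLM-v2 checkpoint ID and the context length below are assumptions; adjust to the checkpoint and settings you actually use):

```python
# Rough sketch of offline vLLM inference for HTML-to-markdown conversion.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "jinaai/ReaderLM-v2"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llm = LLM(model=MODEL_ID, max_model_len=32768)   # context length is an assumption

sampling = SamplingParams(temperature=0.0, max_tokens=4096)

def html_to_markdown(html: str) -> str:
    # Wrap the cleaned HTML in the model's chat template before generation.
    messages = [{"role": "user", "content": html}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text
```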
Once again, sincere thanks for your reply.
Is it possible to train the model to predict multiple tokens in one step, instead of inferring just one token at a time?
That's a good idea; it can give a good boost in the decoding stage. We are actually investigating some advanced decoding strategies, and we hope to bring a more efficient model at some point.
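In the meantime, one way to get several tokens accepted per decoding step without retraining is n-gram (prompt-lookup) speculative decoding in vLLM, which suits this task because the markdown output copies long spans of the input HTML. A rough sketch, assuming a recent vLLM release (the exact config keys differ between versions) and the jinaai/ReaderLM-v2 checkpoint:

```python
# Rough sketch: n-gram (prompt-lookup) speculative decoding in vLLM, so several
# draft tokens copied from the HTML prompt can be verified in one step.
# Config key names vary between vLLM releases; the checkpoint ID is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="jinaai/ReaderLM-v2",       # assumed checkpoint name
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram matches in the prompt
        "num_speculative_tokens": 5,  # draft tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram matched against the prompt
    },
)
sampling = SamplingParams(temperature=0.0, max_tokens=4096)
```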