Optimizing Model Inference Speed for Large HTML Inputs
Hello,
I'm currently running the ReaderLM model on an H20, with the max_new_tokens parameter set to 4096 as per the official guidelines. My typical input consists of HTML documents of 20 to 40 KB, and inference generally takes about 4 to 5 minutes. Is this duration normal? If so, could you recommend strategies to speed up inference?
Thank you for your assistance.
I believe you can do some simple cleaning of the HTML input; basically, that works.
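For example, something along these lines (a rough sketch, not the official cleaning logic): strip scripts, styles, comments, SVGs, and inline base64 images before sending the HTML to the model, so there are fewer tokens to process.

```python
# Rough sketch of simple HTML cleaning before inference (not the official logic).
import re

def clean_html(html: str) -> str:
    html = re.sub(r"<script[^>]*>.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<style[^>]*>.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<svg[^>]*>.*?</svg>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)      # drop HTML comments
    html = re.sub(r'src="data:image/[^"]*"', 'src=""', html)     # drop inline base64 images
    html = re.sub(r"\s{2,}", " ", html)                          # collapse repeated whitespace
    return html.strip()
```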
Thank you for your reply. The HTML I feed in has already been thoroughly cleaned, even more aggressively than the official cleaning logic. Since I'm working on real-time search and web-based RAG, some web pages remain that large even after cleaning.
Update: after replacing Transformers inference with vLLM, the speed is now about five times faster than before, and in the best cases up to ten times faster. However, some HTML inputs run into problems with vLLM's tokenizer: no error is raised, but the throughput counter shows tens of thousands of tokens per second and the returned output is empty. The exact cause is still unclear, but it is at least certain that vLLM inference can significantly improve speed.
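For reference, the vLLM setup is roughly along these lines (the jinaai/ReaderLM-v2 checkpoint ID and the context length below are assumptions; adjust to the checkpoint and settings you actually use):

```python
# Rough sketch of offline vLLM inference for HTML-to-markdown conversion.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "jinaai/ReaderLM-v2"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llm = LLM(model=MODEL_ID, max_model_len=32768)   # context length is an assumption

sampling = SamplingParams(temperature=0.0, max_tokens=4096)

def html_to_markdown(html: str) -> str:
    # Wrap the cleaned HTML in the model's chat template before generation.
    messages = [{"role": "user", "content": html}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text
```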
Once again, sincere thanks for your reply.
Is it possible to train the model to predict multiple tokens in one step, instead of inferring just one token at a time?
That's a good idea; it can give a good boost in the decoding stage. We are actually investigating some advanced decoding strategies, and we hope to bring a more efficient model at some point.
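In the meantime, one way to get several tokens accepted per decoding step without retraining is n-gram (prompt-lookup) speculative decoding in vLLM, which suits this task because the markdown output copies long spans of the input HTML. A rough sketch, assuming a recent vLLM release (the exact config keys differ between versions) and the jinaai/ReaderLM-v2 checkpoint:

```python
# Rough sketch: n-gram (prompt-lookup) speculative decoding in vLLM, so several
# draft tokens copied from the HTML prompt can be verified in one step.
# Config key names vary between vLLM releases; the checkpoint ID is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="jinaai/ReaderLM-v2",       # assumed checkpoint name
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram matches in the prompt
        "num_speculative_tokens": 5,  # draft tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram matched against the prompt
    },
)
sampling = SamplingParams(temperature=0.0, max_tokens=4096)
```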