Compatibility with olmOCR repo

#2
by pszemraj - opened

Great work! Since you mention this is a "drop-in replacement", can I "drop it in" to https://github.com/allenai/olmocr via the --model arg for python -m olmocr.pipeline? Figured I'd ask before trying, since you mention changes to the amount of metadata it wants to see, etc.

edit: I know you provide an example with vLLM, but that would require rebuilding olmocr.pipeline into a CLI script I can point at a directory of PDF files.

Hi @pszemraj, the model should mostly be compatible with the olmocr pipeline, but with some tweaks: the prompt is different (you might want to modify this: https://github.com/allenai/olmocr/blob/main/olmocr/prompts/prompts.py), and the model arch is now Qwen2.5-VL instead of Qwen2-VL. The rest should be the same.
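
For reference, here is a minimal sketch of loading the checkpoint under the new architecture class with plain transformers. It assumes a recent transformers release that ships Qwen2.5-VL support; the prompt string and the page.png path are placeholders for illustration, not the exact prompt the model was trained with.

```python
# Minimal sketch: load RolmOCR with the Qwen2.5-VL architecture class.
# Assumes a transformers version that includes Qwen2_5_VLForConditionalGeneration.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "reducto/RolmOCR"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")  # a rendered PDF page (placeholder path)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        # Placeholder instruction - swap in the prompt from the model card.
        {"type": "text", "text": "Return the plain text of this page."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```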

Any follow-up to this would be greatly appreciated @pszemraj @yifei-reducto

Thanks @yifei-reducto! In the meantime I tried the model with the original pipeline.py, with some updates such as manually forcing the prompts to match the ones you specify. Even after inference 'worked', I ran into strange issues like wild hallucinations and repetitions, so I abandoned the original pipeline code/sglang and switched to your vLLM approach.

I workshopped async_pipeline.py in this gist with gemini-2.5 and it seems to work pretty well for batch inference.

  • Don't quote me on this, but maybe even an order of magnitude faster than what I saw with the original (olmOCR) inference code.

Quick overview of the process:

  1. Ensure you have vllm, flash-attn, and other deps installed as needed (see script). flashinfer is nice to have, but getting it to install is out of scope here.
  2. serve the model locally in a separate tmux/screen/terminal with vllm serve reducto/RolmOCR
  3. after the endpoint is ready, run python async_pipeline.py --input_dir ./directory-of-pdfs (output dir inferred/named based on input dir, or pass --output_dir ./out)

PDFs are converted to images, which are fired off asynchronously in batches of --concurrency_limit for fast vLLM inference. I can't claim the code is fully optimal, but it works well enough based on my tests - hope this helps anyone reading!
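
For anyone who doesn't want to dig through the gist, here is a rough sketch of the core idea (not the exact script): render each PDF page with PyMuPDF, then fan the pages out to the vLLM OpenAI-compatible endpoint behind an asyncio semaphore. The prompt string, DPI, and concurrency values below are placeholder assumptions - tune them for your setup.

```python
import asyncio
import base64

import fitz  # PyMuPDF, for rendering PDF pages to images
from openai import AsyncOpenAI

# Assumed/placeholder values - adjust to your server, prompt, and hardware.
BASE_URL = "http://localhost:8000/v1"           # vllm serve reducto/RolmOCR
MODEL = "reducto/RolmOCR"
PROMPT = "Return the plain text of this page."  # placeholder, not the official prompt
CONCURRENCY_LIMIT = 8

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


def render_pages(pdf_path: str, dpi: int = 150) -> list[bytes]:
    """Render every page of a PDF to PNG bytes."""
    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


async def ocr_page(png_bytes: bytes, semaphore: asyncio.Semaphore) -> str:
    """Send one page image to the vLLM endpoint, bounded by the semaphore."""
    b64 = base64.b64encode(png_bytes).decode()
    async with semaphore:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": PROMPT},
                ],
            }],
            temperature=0.0,   # assumed settings for OCR-style decoding
            max_tokens=4096,
        )
    return resp.choices[0].message.content


async def ocr_pdf(pdf_path: str) -> str:
    """OCR all pages of one PDF concurrently and join the results."""
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    pages = render_pages(pdf_path)
    texts = await asyncio.gather(*(ocr_page(p, semaphore) for p in pages))
    return "\n\n".join(texts)


if __name__ == "__main__":
    print(asyncio.run(ocr_pdf("example.pdf")))
```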
