Compatibility with olmOCR repo

#2
by pszemraj - opened

Great work! Since you mention this is a "drop-in replacement", can I "drop it in" to https://github.com/allenai/olmocr via the --model arg for python -m olmocr.pipeline? Figured I'd ask before trying, since you mention changes to the amount of metadata it wants to see, etc.

edit: I know you provide an example with vLLM, but that would require rebuilding olmocr.pipeline into a CLI script I can point at a directory of PDF files.

Hi @pszemraj, the model should mostly be compatible with the olmocr pipeline, but with some tweaks: the prompt is different (you might want to modify this: https://github.com/allenai/olmocr/blob/main/olmocr/prompts/prompts.py), and the model arch is now Qwen2.5-VL instead of Qwen2-VL. The rest should be the same.
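
For reference, here is a minimal sketch of loading the checkpoint under the new architecture class with plain transformers. It assumes a recent transformers release that ships Qwen2.5-VL support; the prompt string and the page.png path are placeholders for illustration, not the exact prompt the model was trained with.

```python
# Minimal sketch: load RolmOCR with the Qwen2.5-VL architecture class.
# Assumes a transformers version that includes Qwen2_5_VLForConditionalGeneration.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "reducto/RolmOCR"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")  # a rendered PDF page (placeholder path)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        # Placeholder instruction - swap in the prompt from the model card.
        {"type": "text", "text": "Return the plain text of this page."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```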

Any follow-up to this would be greatly appreciated @pszemraj @yifei-reducto

Thanks @yifei-reducto! In the meantime I tried the model with the original pipeline.py, with some updates such as manually forcing the prompts to match the ones you specify. Even after inference 'worked', I ran into strange issues like wild hallucinations and repetitions, so I abandoned the original pipeline code/sglang and switched to your vLLM approach.

I workshopped async_pipeline.py in this gist with gemini-2.5 and it seems to work pretty well for batch inference.

  • Don't quote me on this, but maybe even an order of magnitude faster than what I saw with the original (olmOCR) inference code.

Quick overview of the process:

  1. Ensure you have vllm, flash-attn, and other deps installed as needed (see script). flashinfer is nice to have, but getting it to install is out of scope here.
  2. serve the model locally in a separate tmux/screen/terminal with vllm serve reducto/RolmOCR
  3. after the endpoint is ready, run python async_pipeline.py --input_dir ./directory-of-pdfs (output dir inferred/named based on input dir, or pass --output_dir ./out)

PDFs are converted to images, which are fired off asynchronously in batches of --concurrency_limit for fast vLLM inference. I can't claim the code is fully optimal, but it works well enough based on my tests - hope this helps anyone reading!
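
For anyone who doesn't want to dig through the gist, here is a rough sketch of the core idea (not the exact script): render each PDF page with PyMuPDF, then fan the pages out to the vLLM OpenAI-compatible endpoint behind an asyncio semaphore. The prompt string, DPI, and concurrency values below are placeholder assumptions - tune them for your setup.

```python
import asyncio
import base64

import fitz  # PyMuPDF, for rendering PDF pages to images
from openai import AsyncOpenAI

# Assumed/placeholder values - adjust to your server, prompt, and hardware.
BASE_URL = "http://localhost:8000/v1"           # vllm serve reducto/RolmOCR
MODEL = "reducto/RolmOCR"
PROMPT = "Return the plain text of this page."  # placeholder, not the official prompt
CONCURRENCY_LIMIT = 8

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


def render_pages(pdf_path: str, dpi: int = 150) -> list[bytes]:
    """Render every page of a PDF to PNG bytes."""
    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


async def ocr_page(png_bytes: bytes, semaphore: asyncio.Semaphore) -> str:
    """Send one page image to the vLLM endpoint, bounded by the semaphore."""
    b64 = base64.b64encode(png_bytes).decode()
    async with semaphore:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": PROMPT},
                ],
            }],
            temperature=0.0,   # assumed settings for OCR-style decoding
            max_tokens=4096,
        )
    return resp.choices[0].message.content


async def ocr_pdf(pdf_path: str) -> str:
    """OCR all pages of one PDF concurrently and join the results."""
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    pages = render_pages(pdf_path)
    texts = await asyncio.gather(*(ocr_page(p, semaphore) for p in pages))
    return "\n\n".join(texts)


if __name__ == "__main__":
    print(asyncio.run(ocr_pdf("example.pdf")))
```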
