HURIDOCS/pdf-reading-order

PDF Reading Order

A model for determining the correct reading order of PDF files.

This model uses features from a given PDF to determine its correct reading order.

Quick Start

This model builds on two of our other models: pdf-token-type and pdf-paragraphs-extraction.
The paragraph extraction model is used here to find and extract "figure" and "table" tokens, which reduces the complexity of a given PDF page, since figures and tables contain many tokens.

For details of our paragraph extraction model, see these links:

https://huggingface.co/HURIDOCS/pdf-segmentation
https://github.com/huridocs/pdf_paragraphs_extraction.git

You can clone the repo via this link:

https://github.com/huridocs/pdf-reading-order

You can reach all the data we use through this link:

https://github.com/huridocs/pdf-labeled-data

First, the candidate selector model selects the tokens that could come next in the reading order.
Then, the best 18 of those candidates are passed to the reading order model.
The reading order model decides the final reading order of the tokens.
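
The two-stage flow described above can be sketched roughly as follows. This is only an illustration, not the repository's code: the function order_tokens, both scoring callables, and the greedy "pick one token at a time" loop are assumptions made for the example. Only the candidate count of 18 comes from the description above.

```python
from typing import Callable, List, Sequence, TypeVar

Token = TypeVar("Token")

# Number of candidates handed to the second stage (taken from the text above).
CANDIDATE_COUNT = 18


def order_tokens(
    tokens: Sequence[Token],
    candidate_score: Callable[[List[Token], Token], float],
    reading_order_score: Callable[[List[Token], Token], float],
    candidate_count: int = CANDIDATE_COUNT,
) -> List[Token]:
    """Greedy two-stage ordering (hypothetical sketch).

    Both scoring callables are stand-ins for the trained models: each takes
    the tokens ordered so far plus one remaining token and returns a score,
    where higher means "more likely to be the next token".
    """
    remaining = list(tokens)
    ordered: List[Token] = []
    while remaining:
        # Stage 1: the candidate selector keeps the best-scoring tokens.
        candidates = sorted(
            remaining,
            key=lambda token: candidate_score(ordered, token),
            reverse=True,
        )[:candidate_count]
        # Stage 2: the reading order scorer picks the next token among them.
        next_token = max(
            candidates, key=lambda token: reading_order_score(ordered, token)
        )
        ordered.append(next_token)
        remaining.remove(next_token)
    return ordered


# Toy usage with (text, top, left) tuples and hand-written scores.
page_tokens = [("third", 20, 0), ("first", 0, 0), ("second", 0, 10)]


def prefer_higher_on_page(ordered, token):
    return -token[1]


def prefer_top_then_left(ordered, token):
    return -(token[1] * 1000 + token[2])


reading_order = order_tokens(page_tokens, prefer_higher_on_page, prefer_top_then_left)
print([token[0] for token in reading_order])
```

In this sketch the first stage only narrows the search, so the second scorer is evaluated on at most 18 tokens per step instead of on every remaining token of the page.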

Performance

Test Accuracy: 16 Mistakes/11438 Labels (99.86%)
Average Accuracy: 431 Mistakes/184995 Labels (99.77%)
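
The percentages above follow from the mistake and label counts. A minimal check, assuming accuracy is defined as the share of labels without a mistake (the helper name accuracy_percent is made up for this example):

```python
def accuracy_percent(mistakes: int, labels: int) -> float:
    """Percentage of labels that were predicted without a mistake."""
    if labels <= 0:
        raise ValueError("labels must be positive")
    return 100.0 * (labels - mistakes) / labels


print(f"{accuracy_percent(16, 11438):.2f}")    # 99.86
print(f"{accuracy_percent(431, 184995):.2f}")  # 99.77
```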

Speed: ~0.65 seconds per page.