---
license: openrail
---
<h3 align="center">PDF Reading Order</h3>

<p align="center">A model for determining the correct reading order of PDF files.</p>

This model uses features from a given PDF to determine its correct reading order.

## Quick Start

This model builds on two of our other models: pdf-token-type and pdf-paragraphs-extraction.

The paragraph extraction model is used here to find and extract "figure" and "table" tokens, which reduces the complexity of a given PDF page, since figures and tables contain many tokens.

For details on the paragraph extraction model, see:

https://huggingface.co/HURIDOCS/pdf-segmetation

https://github.com/huridocs/pdf_paragraphs_extraction.git

You can clone the repository from:

https://github.com/huridocs/pdf-reading-order

The pipeline runs in two stages. First, the candidate selector model selects the tokens that could come next. Then the 18 best candidates chosen by the candidate selector model are passed to the reading order model, which decides the final reading order of the tokens.

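
The two-stage flow described above (candidate selection, then a final decision among the 18 best candidates) can be illustrated with a minimal sketch. This is not the repository's code: `order_tokens`, `candidate_score`, `next_token_score`, and `closeness` are hypothetical names, the scoring functions stand in for the trained models, and the greedy loop and the choice of starting token are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[int, int]  # hypothetical token: an (x, y) position on the page

CANDIDATE_COUNT = 18  # number of candidates passed to the second stage, per the text above


def order_tokens(
    tokens: Sequence[Token],
    candidate_score: Callable[[Token, Token], float],
    next_token_score: Callable[[Token, Token], float],
    candidate_count: int = CANDIDATE_COUNT,
) -> List[Token]:
    """Greedy two-stage ordering: shortlist candidates, then pick the next token."""
    if not tokens:
        return []
    remaining = list(tokens)
    # Assumption: the first token supplied is the start of the reading order.
    ordered = [remaining.pop(0)]
    while remaining:
        current = ordered[-1]
        # Stage 1 (stand-in for the candidate selector): keep the best-scoring candidates.
        shortlist = sorted(
            remaining, key=lambda t: candidate_score(current, t), reverse=True
        )[:candidate_count]
        # Stage 2 (stand-in for the reading order model): choose among the shortlist.
        best = max(shortlist, key=lambda t: next_token_score(current, t))
        remaining.remove(best)
        ordered.append(best)
    return ordered


def closeness(current: Token, other: Token) -> float:
    """Toy score: tokens nearer to the current one score higher."""
    return -(abs(current[0] - other[0]) + abs(current[1] - other[1]))


if __name__ == "__main__":
    page_tokens = [(0, 0), (0, 2), (0, 1)]
    print(order_tokens(page_tokens, closeness, closeness))
```

Limiting the second stage to a fixed-size shortlist keeps the more expensive decision independent of how many tokens the page contains; in this sketch that limit is the `candidate_count` parameter.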
## Performance

Test accuracy: 16 mistakes / 11438 labels (99.86%)

Average accuracy: 431 mistakes / 184995 labels (99.77%)

Speed: ~0.65 seconds per page
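
As a quick check, the accuracy percentages follow directly from the mistake and label counts reported above. The helper name `accuracy_percent` is hypothetical, and the pages-per-minute figure is only a rough estimate derived from the approximate per-page time.

```python
def accuracy_percent(mistakes: int, labels: int) -> float:
    """Share of labels without a mistake, as a percentage."""
    return 100.0 * (labels - mistakes) / labels


test_accuracy = accuracy_percent(16, 11438)
average_accuracy = accuracy_percent(431, 184995)

# Rough throughput implied by ~0.65 seconds per page.
pages_per_minute = 60 / 0.65

if __name__ == "__main__":
    print(f"Test accuracy: {test_accuracy:.2f}%")        # 99.86%
    print(f"Average accuracy: {average_accuracy:.2f}%")  # 99.77%
    print(f"Pages per minute: ~{pages_per_minute:.0f}")  # ~92
```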