---
license: openrail
---
<h3 align="center">PDF Reading Order</h3>
<p align="center">A model for determining the correct reading order of PDF files.</p>
This model uses features extracted from a given PDF to determine its correct reading order.
## Quick Start
This model works on top of two of our other models: pdf-token-type and pdf-paragraphs-extraction.
The paragraph extraction model is used here to find and extract "figure" and "table" tokens, which reduces the complexity of a given PDF page, since figures and tables contain many tokens.
For details on our paragraph extraction model, see these links:
https://huggingface.co/HURIDOCS/pdf-segmetation
https://github.com/huridocs/pdf_paragraphs_extraction.git
You can clone the repo via this link:
https://github.com/huridocs/pdf-reading-order
First, the candidate selector model selects the tokens that could come next.
Then, the best 18 tokens chosen by the candidate selector model are passed to the reading order model.
The reading order model decides the final reading order of the tokens.
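
The two-stage flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's actual code: the `Token` layout, `order_tokens`, `toy_candidate_score`, and `toy_pick_next` are hypothetical names, and the two toy functions stand in for the trained candidate selector and reading order models.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical token layout for this sketch: (left, top, text).
Token = Tuple[float, float, str]


def order_tokens(
    tokens: Sequence[Token],
    candidate_score: Callable[[Optional[Token], Token], float],
    pick_next: Callable[[Optional[Token], List[Token]], Token],
    top_k: int = 18,
) -> List[Token]:
    """Order tokens by repeatedly shortlisting candidates and picking one."""
    remaining = list(tokens)
    ordered: List[Token] = []
    current: Optional[Token] = None
    while remaining:
        # Stage 1: the candidate selector ranks the remaining tokens and
        # keeps only the best `top_k` (18 in the description above).
        ranked = sorted(
            remaining, key=lambda t: candidate_score(current, t), reverse=True
        )
        candidates = ranked[:top_k]
        # Stage 2: the reading order model picks the next token
        # from the shortlisted candidates.
        current = pick_next(current, candidates)
        ordered.append(current)
        remaining.remove(current)
    return ordered


def toy_candidate_score(current: Optional[Token], token: Token) -> float:
    """Toy stand-in for the candidate selector: prefer nearby tokens."""
    if current is None:
        # No token read yet: prefer tokens near the top-left of the page.
        return -(token[1] * 1000 + token[0])
    return -(abs(token[0] - current[0]) + abs(token[1] - current[1]))


def toy_pick_next(current: Optional[Token], candidates: List[Token]) -> Token:
    """Toy stand-in for the reading order model: topmost, then leftmost."""
    return min(candidates, key=lambda t: (t[1], t[0]))


page = [(50, 10, "world"), (0, 20, "second"), (0, 10, "Hello"), (60, 20, "line")]
ordered = order_tokens(page, toy_candidate_score, toy_pick_next)
print([text for _, _, text in ordered])
```

Shortlisting first keeps the second stage cheap: the more expensive model only has to compare a fixed number of candidates instead of every remaining token on the page.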
## Performance
Test Accuracy: 16 Mistakes/11438 Labels (99.86%)
Average Accuracy: 431 Mistakes/184995 Labels (99.77%)
Speed: ~0.65 seconds per page.
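
For reference, the accuracy percentages above follow from the mistake and label counts. The small helper below, `accuracy_percent`, is a hypothetical name used only for this check and is not part of the repository.

```python
def accuracy_percent(mistakes: int, labels: int) -> float:
    """Return the share of correctly ordered labels as a percentage."""
    return 100.0 * (labels - mistakes) / labels


print(f"Test accuracy: {accuracy_percent(16, 11438):.2f}%")
print(f"Average accuracy: {accuracy_percent(431, 184995):.2f}%")
```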