HURIDOCS/pdf-document-layout-analysis · Enhancing PDF Layout Analysis with Contextual Page Information

Oct 9, 2024

I have a question regarding multi-page PDFs. In my current implementation, I process each page independently when analyzing document layouts. However, I believe that adding contextual information from the previous page(s) could improve the model's understanding of the overall document structure.

Would you have any suggestions on how to best incorporate sequential page information into the model?

gabriel-p

HURIDOCS org Oct 18, 2024

Thank you for your question. I have some thoughts regarding that:

In the short term, adding information about the previous and next pages could improve the results. However, if we handed a human a random, never-before-seen page of a PDF and asked them to segment it, they would probably do an almost perfect job without seeing the rest of the document. So a machine learning model, if not in this iteration then in a later one, should be able to do the same or better without extra context.
If you still want to add more context, use the fast version of the model, which you can call like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060

(We are not the authors of the VGT model, so we do not know whether we are able to train it.)

Then, add the desired new features to the get_paragraph_extraction_features method from the file https://github.com/huridocs/pdf-document-layout-analysis/blob/main/src/fast_trainer/ParagraphExtractorTrainer.py.

Retrain the model with the new features by following this notebook:

https://github.com/huridocs/pdf-document-layout-analysis/blob/main/fine_tuning_lightgbm_models.ipynb

Then replace the model as explained in the notebook.
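To make the feature step more concrete, here is a minimal sketch of what "context from the previous page" could look like as extra numeric features. It is only an illustration: the `Token` class and the functions `previous_page_context` and `features_with_context` are hypothetical and are not the repository's actual classes or the real signature of get_paragraph_extraction_features. It also assumes that `top` grows downwards, so the token with the largest `top` is the last one on the page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a PDF token; not the repository's own class."""
    text: str
    top: float        # assumed: distance from the top of the page
    font_size: float


def previous_page_context(previous_page: Optional[List[Token]]) -> List[float]:
    """Summarise the previous page as a fixed-length numeric vector."""
    if not previous_page:
        # First page (or empty previous page): neutral placeholder values.
        return [0.0, 0.0, 0.0]
    # Token with the largest `top`, i.e. the lowest one under the assumption above.
    last = max(previous_page, key=lambda token: token.top)
    ends_sentence = 1.0 if last.text.rstrip().endswith((".", "!", "?")) else 0.0
    # [has previous page, previous page ends a sentence, font size of its last token]
    return [1.0, ends_sentence, last.font_size]


def features_with_context(pages: List[List[Token]]) -> List[List[float]]:
    """Build one feature row per token, with the previous-page summary appended."""
    rows = []
    for index, page in enumerate(pages):
        context = previous_page_context(pages[index - 1] if index > 0 else None)
        for token in page:
            # Per-token features first, then the shared page-level context.
            rows.append([token.top, token.font_size] + context)
    return rows
```

The idea is that every token on a page gets the same few context values, so the feature vector keeps a fixed length and the model can still be retrained with the notebook. Whether these particular values help is something you would have to check on your own data.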

Best

ali6parmak changed discussion status to closed Nov 12, 2024