
license: openrail

A model for extracting paragraphs from PDFs

This model uses features from the PDF to extract the text and paragraphs from it. It can be used as a service.

Each extracted paragraph includes its page number, its position on the page, its size, and its text.
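As an illustration of the paragraph fields listed above, here is a minimal Python sketch of how a client might model them. The field names (`page_number`, `left`, `top`, `width`, `height`, `text`) and the assumption that the response is a JSON array are hypothetical, chosen for this example only; inspect the service's actual output before relying on them.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Paragraph:
    # All field names are hypothetical; adapt them to the real response.
    page_number: int
    left: float
    top: float
    width: float
    height: float
    text: str


def parse_paragraphs(raw_json: str) -> List[Paragraph]:
    """Turn a JSON array of paragraph objects into Paragraph instances."""
    return [Paragraph(**item) for item in json.loads(raw_json)]


def text_by_page(paragraphs: List[Paragraph]) -> Dict[int, List[str]]:
    """Group paragraph texts by page number, keeping the order received."""
    pages: Dict[int, List[str]] = {}
    for paragraph in paragraphs:
        pages.setdefault(paragraph.page_number, []).append(paragraph.text)
    return pages
```

Grouping by page is only one possible use; the position and size fields could equally be used to sort paragraphs or to filter out headers and footers.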

We have created a better, more flexible version of this service. You can find it here:

https://huggingface.co/HURIDOCS/pdf-document-layout-analysis

Quick Start

Download the service that uses the model:

git clone https://github.com/huridocs/pdf_paragraphs_extraction.git
cd pdf_paragraphs_extraction

Start the service:

./run start

Get the paragraphs from a PDF:

curl -X GET -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051
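
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl command: a GET request to `localhost:5051` carrying the PDF as a multipart form field named `file`. The helper names `build_multipart` and `extract_paragraphs` are made up for this example, and because the response format is not described here, the function simply returns the raw response bytes.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, content: bytes):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        disposition.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def extract_paragraphs(pdf_path: str, url: str = "http://localhost:5051") -> bytes:
    """Send the PDF as the `file` form field and return the raw response body."""
    with open(pdf_path, "rb") as pdf:
        body, content_type = build_multipart(
            "file", os.path.basename(pdf_path), pdf.read()
        )
    # method="GET" matches the curl example; urllib still sends the body.
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="GET"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the service running, `extract_paragraphs("/PATH/TO/PDF/pdf_name.pdf")` would return whatever the service sends back for that file.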

Stop the service:

./run stop

Accuracy: 93.9%

Speed: 0.15 seconds per page
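
For a rough sense of what that speed means for a whole document, the sketch below multiplies the per-page figure by a page count. It assumes processing time scales linearly with the number of pages, which the figure above does not guarantee, and the hardware behind the measurement is not stated, so treat the result as an order-of-magnitude estimate. The function name `estimated_seconds` is hypothetical.

```python
SECONDS_PER_PAGE = 0.15  # per-page speed quoted above; hardware not stated


def estimated_seconds(page_count: int,
                      seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Rough processing time, assuming cost grows linearly with page count."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * seconds_per_page
```

Under that assumption, a 200-page PDF would take roughly 30 seconds.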