HURIDOCS/pdf-segmentation

license: openrail

A model for extracting paragraphs from PDFs

This model uses features of the PDF to extract its text and group it into paragraphs. It can be run as a service.

Each extracted paragraph includes its page number, its position on the page, its size, and its text.
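As a rough illustration of the fields listed above, a paragraph could be modelled in Python as follows. This is only a sketch: the class name, the field names (page_number, left, top, width, height, text), and the top-left coordinate origin are assumptions made for the example, not the service's documented output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Paragraph:
    """One extracted paragraph. Field names are illustrative assumptions."""
    page_number: int  # page the paragraph appears on
    left: float       # horizontal position on the page
    top: float        # vertical position on the page (assumed origin: top-left)
    width: float      # size of the paragraph box
    height: float
    text: str         # extracted text


def reading_order(paragraphs: List[Paragraph]) -> List[Paragraph]:
    """Sort paragraphs by page, then top to bottom, then left to right."""
    return sorted(paragraphs, key=lambda p: (p.page_number, p.top, p.left))
```

Sorting by page and position is one simple way to put paragraphs back into an approximate reading order for single-column documents.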

Quick Start

Download the service that uses the model:

git clone https://github.com/huridocs/pdf_paragraphs_extraction.git
cd pdf_paragraphs_extraction

Start the service:

./run start

Get the paragraphs from a PDF:

curl -X GET -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051
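The same request can be sketched in Python using only the standard library. It mirrors the curl command above: the same HTTP method, the same "file" form field, and the same host and port. The function names are hypothetical, and the response is returned as raw text because its exact format and encoding (UTF-8 is assumed here) are not specified in this README.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

SERVICE_URL = "http://localhost:5051"  # host and port from the curl example


def build_multipart_body(field_name: str, filename: str, content: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def extract_paragraphs(pdf_path: str, url: str = SERVICE_URL) -> str:
    """Send a PDF to the running service and return the raw response text."""
    path = Path(pdf_path)
    body, content_type = build_multipart_body("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="GET",  # same method as the curl command
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

With the service running, extract_paragraphs("/PATH/TO/PDF/pdf_name.pdf") would return whatever the service sends back for that file.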

Stop the service:

./run stop

Accuracy: 93.9%

Speed: 0.15 seconds per page
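The speed figure can be turned into a rough processing-time estimate. The sketch below assumes that time scales linearly with page count and that your hardware is comparable to the one used for the measurement; neither is stated in this README. The function name is hypothetical.

```python
SECONDS_PER_PAGE = 0.15  # speed figure quoted above


def estimated_seconds(page_count: int) -> float:
    """Estimate processing time, assuming it grows linearly with page count."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * SECONDS_PER_PAGE
```

Under these assumptions, a 200-page document would take roughly 30 seconds.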