Paragraph classifier

The classifier is used for binary classification of text lines in PDF or scanned documents.

For each document line, it determines whether:

the line is the beginning of a new paragraph, or
the line is a continuation of the previous paragraph.

For each line, a feature vector is formed from the line's text and formatting. For details, see dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py in dedoc.
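To illustrate the idea of per-line features feeding a binary decision, here is a minimal sketch. It is not dedoc's implementation: the `Line` record, the three features, and the hand-set voting rule in `is_paragraph_start` are all hypothetical stand-ins for the real feature extractor and the trained model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    # Hypothetical line record; dedoc's real line objects carry more metadata.
    text: str
    indent: float  # left offset of the line, in arbitrary units


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build a toy feature vector from text and formatting (illustrative features only)."""
    stripped = line.text.strip()
    starts_upper = float(stripped[:1].isupper())
    # A missing previous line is treated as if the previous sentence had ended.
    if prev is None:
        prev_ended = 1.0
        extra_indent = 0.0
    else:
        prev_ended = float(prev.text.rstrip().endswith((".", "!", "?", ":")))
        extra_indent = float(line.indent > prev.indent)
    return [starts_upper, prev_ended, extra_indent]


def is_paragraph_start(features: List[float]) -> bool:
    """Stand-in for a trained binary classifier: a hand-set vote over the features."""
    return sum(features) >= 2.0


def classify(lines: List[Line]) -> List[bool]:
    """Label each line: True for a paragraph start, False for a continuation."""
    labels = []
    prev = None
    for line in lines:
        labels.append(is_paragraph_start(line_features(line, prev)))
        prev = line
    return labels


lines = [
    Line("The classifier labels every line of", indent=4.0),
    Line("the document as a start or", indent=0.0),
    Line("a continuation.", indent=0.0),
    Line("Features come from text and layout.", indent=4.0),
]
labels = classify(lines)
print(labels)
```

In practice the voting rule would be replaced by a model trained on labeled lines, and the feature vector would be much richer than these three values.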

Training data are available at the link.
Training script is here.

