Paragraph classifier

The classifier is used for binary classification of text lines in PDF or scanned documents.

For each line of a document, it determines one of two things:

  • the line begins a new paragraph, or

  • the line continues the previous paragraph.
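To illustrate how such per-line binary labels can be used downstream, here is a minimal sketch that merges lines into paragraphs. The function name `group_into_paragraphs` and the boolean label list are assumptions made for this example; they are not part of dedoc's API.

```python
from typing import List


def group_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[str]:
    """Merge text lines into paragraphs using one binary label per line.

    Hypothetical helper: starts_paragraph[i] is True when line i begins a new
    paragraph and False when it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and starts_paragraph must have the same length")

    paragraphs: List[List[str]] = []
    for line, is_start in zip(lines, starts_paragraph):
        # The very first line always opens a paragraph, whatever its label.
        if is_start or not paragraphs:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)

    # Join the lines of each paragraph with single spaces.
    return [" ".join(paragraph) for paragraph in paragraphs]
```

Joining with a single space is a simplification; words hyphenated across line breaks would need extra handling.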

For each line, a feature vector is built from the line's text and formatting; see dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py in dedoc.
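As a rough illustration of what a feature vector built from text and formatting might look like, the sketch below computes four simple features for a line relative to the previous one. The `Line` class, the `line_features` function, and the chosen features are all assumptions for this example; the actual features are defined in the dedoc file named above and may differ substantially.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical line representation (not dedoc's own class)."""
    text: str
    indent: float     # left offset of the line, in arbitrary units
    font_size: float


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Return [starts_upper, prev_ends_sentence, indent_delta, font_ratio]."""
    text = line.text.strip()
    starts_upper = float(bool(text) and text[0].isupper())

    if prev is None:
        # No previous line: behave as if a sentence just ended, with no
        # change in indentation or font size.
        return [starts_upper, 1.0, 0.0, 1.0]

    prev_text = prev.text.strip()
    prev_ends_sentence = float(prev_text.endswith((".", "!", "?")))
    indent_delta = line.indent - prev.indent
    font_ratio = line.font_size / prev.font_size if prev.font_size else 1.0
    return [starts_upper, prev_ends_sentence, indent_delta, font_ratio]
```

A vector of this kind could then be passed to any binary classifier; which model the dedoc classifier uses is not stated here.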

  • Training data are available at the link.

  • Training script is here.

