Vision Transformer (ViT) for Document Classification (DocLayNet)
This model is a fine-tuned Vision Transformer (ViT) for document layout classification based on the DocLayNet dataset.
It was trained on document images from the DocLayNet dataset, covering the following categories (with their label indexes):
{'financial_reports': 0,
'government_tenders': 1,
'laws_and_regulations': 2,
'manuals': 3,
'patents': 4,
'scientific_articles': 5}
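As a minimal sketch, the mapping above can be inverted to turn a predicted class index back into a category name. The names `label2id`, `id2label`, and `index_to_label` are illustrative choices, not identifiers taken from the model's code.

```python
# Label-to-index mapping copied from the category list above.
label2id = {
    "financial_reports": 0,
    "government_tenders": 1,
    "laws_and_regulations": 2,
    "manuals": 3,
    "patents": 4,
    "scientific_articles": 5,
}

# Invert the mapping so an index can be translated back to its name.
id2label = {idx: name for name, idx in label2id.items()}


def index_to_label(idx: int) -> str:
    """Return the category name for a class index (hypothetical helper)."""
    return id2label[idx]


print(index_to_label(4))  # patents
```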
Model description
This model is built upon the google/vit-base-patch16-224-in21k Vision Transformer architecture and fine-tuned specifically for document layout classification. The base ViT model uses a patch size of 16x16 pixels and was pre-trained on ImageNet-21k. The model has been optimized to recognize and classify different types of document layouts from the DocLayNet dataset.
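To make the patch size concrete, the short calculation below works out how many patches a single image is split into. It assumes a 224x224 input resolution, as the base checkpoint's name suggests; the function name `num_patches` is illustrative.

```python
def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


patches = num_patches()
# ViT prepends one [CLS] token to the patch sequence.
sequence_length = patches + 1
print(patches, sequence_length)  # 196 197
```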
Training data
The model was trained on the DocLayNet-base dataset, which is available on the Hugging Face Hub as pierreguillou/DocLayNet-base.
DocLayNet is a comprehensive dataset for document layout analysis, containing various document types and their corresponding layout annotations.
Training procedure
The model was trained for 10 epochs on a single GPU, which took about 10 minutes.
The training hyperparameters:
{
'batch_size': 64,
'num_epochs': 20,
'learning_rate': 1e-4,
'weight_decay': 0.05,
'warmup_ratio': 0.2,
'gradient_clip': 0.1,
'dropout_rate': 0.1,
'label_smoothing': 0.1,
'optimizer': 'AdamW'
}
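To illustrate what a `label_smoothing` value of 0.1 means with six classes, here is a minimal pure-Python sketch. It assumes the uniform-smoothing convention used by PyTorch's `CrossEntropyLoss`, where the smoothing mass is spread evenly over all classes. Whether the original training script used exactly this convention is an assumption, and the helper names are illustrative.

```python
import math


def smoothed_targets(true_idx: int, num_classes: int = 6, smoothing: float = 0.1):
    """Soft target distribution: smoothing mass spread uniformly over all classes."""
    off_value = smoothing / num_classes
    targets = [off_value] * num_classes
    targets[true_idx] += 1.0 - smoothing
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


# Example: the true class is 'patents' (index 4).
targets = smoothed_targets(true_idx=4)
print([round(t, 4) for t in targets])
# [0.0167, 0.0167, 0.0167, 0.0167, 0.9167, 0.0167]
```

With smoothed targets the loss cannot reach zero even for a perfectly confident prediction, so loss values are not directly comparable with those of models trained without smoothing.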
Evaluation results
The model achieved the following performance metrics on the test set:
Test Loss: 0.8622
Test Accuracy: 81.36%
Usage
from transformers import pipeline
# Load the model using the image-classification pipeline
pipe = pipeline("image-classification", model="kaixkhazaki/vit_doclaynet_base")
# Test it with an image
result = pipe("path_to_image.jpg")
print(result)
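The image-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to pick the top prediction and reject low-confidence results. The `sample_result` values are made up for illustration, the `min_score` threshold is a hypothetical choice, and the label strings assume the checkpoint's config stores the category names (otherwise generic names such as `LABEL_4` would appear).

```python
def top_prediction(result, min_score: float = 0.5):
    """Return the highest-scoring label, or None if its score is below min_score."""
    best = max(result, key=lambda item: item["score"])
    if best["score"] < min_score:
        return None
    return best["label"]


# Hypothetical pipeline output, for illustration only (not real model scores).
sample_result = [
    {"label": "patents", "score": 0.91},
    {"label": "scientific_articles", "score": 0.05},
    {"label": "manuals", "score": 0.04},
]

print(top_prediction(sample_result))  # patents
```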