license: apache-2.0
pipeline_tag: image-text-to-text
tags:
  - vision
  - layout-analysis
  - object-detection
datasets:
  - ds4sd/DocLayNet-v1.1
base_model:
  - microsoft/Florence-2-large-ft

Florence-2-DocLayNet-Fixed

Model Summary

We fine-tuned the Florence-2-large-ft model (microsoft/Florence-2-large-ft on Hugging Face) on the DocLayNet-v1.1 dataset (ds4sd/DocLayNet-v1.1). To prevent the model from generating hallucinated class names, we re-mapped all class names to single tokens:

| Original Class Names | New Class Names |
|---|---|
| Caption | Cap |
| Footnote | Footnote |
| Formula | Math |
| List-item | List |
| Page-footer | Bottom |
| Page-header | Header |
| Picture | Picture |
| Section-header | Section |
| Table | Table |
| Text | Text |
| Title | Title |
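Because the model emits the new class names, downstream code that expects the DocLayNet names needs to translate them back. A minimal sketch of that lookup, built directly from the table above; the names `ORIGINAL_TO_NEW`, `NEW_TO_ORIGINAL`, and `to_original_labels` are illustrative and not part of the released code:

```python
# Mapping copied from the class-name table: DocLayNet name -> single-token name.
ORIGINAL_TO_NEW = {
    "Caption": "Cap",
    "Footnote": "Footnote",
    "Formula": "Math",
    "List-item": "List",
    "Page-footer": "Bottom",
    "Page-header": "Header",
    "Picture": "Picture",
    "Section-header": "Section",
    "Table": "Table",
    "Text": "Text",
    "Title": "Title",
}

# Reverse lookup: single-token name -> DocLayNet name.
NEW_TO_ORIGINAL = {new: orig for orig, new in ORIGINAL_TO_NEW.items()}


def to_original_labels(labels):
    """Translate model labels to DocLayNet names; unknown labels pass through unchanged."""
    return [NEW_TO_ORIGINAL.get(label, label) for label in labels]


print(to_original_labels(["Cap", "Math", "Bottom", "Text"]))
# ['Caption', 'Formula', 'Page-footer', 'Text']
```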

With this simple change, we observed a 7% improvement in mAP50-95 on the DocLayNet test set. Training and inference were also faster because the class names use fewer tokens.

Judged by mAP50-95, this model (70%) is far from SOTA on the DocLayNet test set. Much smaller YOLO models ([github.com/ppaanngggg/yolo-doclaynet](https://github.com/ppaanngggg/yolo-doclaynet)) achieve much better benchmark results (~79%). On the subset of scientific articles, however, this model performs on par with the best YOLO models (87% mAP50-95).

However, after some qualitative analysis (paper coming soon), we found that Florence-2 is much better at drawing bounding boxes with clean edges. YOLO models sometimes cut text in the middle or draw multiple bounding boxes around the same object. These behaviors are not heavily penalized by mAP50-95, but they are painful to deal with in real-world use cases. Because the Florence-2 output carries no confidence scores, we had to manually set the confidence score to 1 for all Florence-2 outputs when calculating the mAP scores.
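To make the fixed-confidence step concrete, here is a rough sketch of turning one parsed Florence-2 result into COCO-style detection records with `score` set to 1.0. It is not the evaluation script used for the numbers above. It assumes the parsed output has the usual `{"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}` shape, and the `CATEGORY_IDS` values are hypothetical and should be checked against your annotation file:

```python
# Hypothetical category ids for the re-mapped class names; verify them
# against the "categories" section of the annotation file you evaluate on.
CATEGORY_IDS = {
    "Cap": 1, "Footnote": 2, "Math": 3, "List": 4, "Bottom": 5, "Header": 6,
    "Picture": 7, "Section": 8, "Table": 9, "Text": 10, "Title": 11,
}


def to_coco_detections(parsed_answer, image_id, task="<OD>"):
    """Convert one parsed result into COCO-style detections with score 1.0."""
    result = parsed_answer[task]
    detections = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        if label not in CATEGORY_IDS:
            continue  # skip any label outside the 11 known classes
        detections.append({
            "image_id": image_id,
            "category_id": CATEGORY_IDS[label],
            # COCO boxes are [x, y, width, height], not corner coordinates.
            "bbox": [x1, y1, x2 - x1, y2 - y1],
            # No confidence is available, so every box gets the same score.
            "score": 1.0,
        })
    return detections


example = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 70.0]], "labels": ["Table"]}}
print(to_coco_detections(example, image_id=0))
```

With every score equal to 1.0, the evaluator cannot rank detections by confidence, which is one reason the mAP numbers may not be directly comparable to detectors that provide scores.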

We release the finetuned model weights for the community to further investigate related research topics.

How to Get Started with the Model

Use the code below to get started with the model.

For non-CUDA environments, please check out this post for a simple patch: https://huggingface.co/microsoft/Florence-2-base/discussions/4
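As a rough illustration of the kind of patch involved: a workaround that has circulated for Florence-2 checkpoints hides the CUDA-only `flash_attn` requirement from the `transformers` import check while the model is loaded. The sketch below follows that idea. Treat it as an assumption about the approach rather than a copy of the linked post, which remains the authoritative source; the helper names `without_flash_attn` and `load_model_without_flash_attn` are made up for this example.

```python
from unittest.mock import patch


def without_flash_attn(imports):
    """Return the import list with the CUDA-only flash_attn package removed."""
    return [name for name in imports if name != "flash_attn"]


def load_model_without_flash_attn(model_id="yifeihu/Florence-2-DocLayNet-Fixed"):
    """Load the model while hiding the flash_attn requirement from transformers."""
    # Imported here so the helper above stays usable without transformers installed.
    from transformers import AutoModelForCausalLM
    from transformers.dynamic_module_utils import get_imports

    def patched_get_imports(filename):
        # Delegate to the real function, then drop flash_attn from its result.
        return without_flash_attn(get_imports(filename))

    # Only the import check is patched, and only for the duration of loading.
    with patch("transformers.dynamic_module_utils.get_imports", patched_get_imports):
        return AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)


print(without_flash_attn(["torch", "flash_attn", "einops"]))
# ['torch', 'einops']
```

If this route is used, `load_model_without_flash_attn()` would stand in for the `AutoModelForCausalLM.from_pretrained(...)` call in the example that follows.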

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the fine-tuned model and its processor (the repository ships custom code).
model = AutoModelForCausalLM.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)

# "<OD>" is the object-detection task prompt.
prompt = "<OD>"

# Download a sample page image.
url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Beam search without sampling, so the output is deterministic.
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)

# Keep special tokens: post-processing parses them to recover the boxes.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the generated text into bounding boxes and labels in pixel coordinates.
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
print(parsed_answer)

To visualize the results, see this tutorial notebook.
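For a quick look without the notebook, the boxes can also be drawn with Pillow. The sketch below is not taken from the tutorial. It assumes the parsed output has the shape `{"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}`, and `draw_layout` is a hypothetical helper name. The demo runs on a blank image with a made-up result so it works without the model:

```python
from PIL import Image, ImageDraw


def draw_layout(image, parsed_answer, task="<OD>", color="red", width=2):
    """Return a copy of `image` with each predicted box and its label drawn on it."""
    annotated = image.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(annotated)
    result = parsed_answer[task]
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
        # Place the label just inside the top-left corner of the box.
        draw.text((x1 + 4, y1 + 4), label, fill=color)
    return annotated


# Demo on a blank page with a made-up detection.
page = Image.new("RGB", (200, 100), "white")
fake_answer = {"<OD>": {"bboxes": [[10, 10, 100, 60]], "labels": ["Table"]}}
annotated = draw_layout(page, fake_answer)
print(annotated.size)  # (200, 100)
```

With real model output, `draw_layout(image, parsed_answer).save("layout.png")` would write the annotated page to disk.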

BibTeX and citation info

@misc{TF-ID,
  author = {Yifei Hu},
  title = {TF-ID: Table/Figure IDentifier for academic papers},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}

@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
  doi = {10.1145/3534678.3539043},
  url = {https://arxiv.org/abs/2206.01062},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022}
}