Florence-2-DocLayNet-Fixed

Model Summary

We fine-tuned the Florence-2-large-ft model on the DocLayNet-v1.1 dataset. To prevent the model from generating hallucinated class names, we remapped every class name to a single token:

Original Class Names	New Class Names
Caption	Cap
Footnote	Footnote
Formula	Math
List-item	List
Page-footer	Bottom
Page-header	Header
Picture	Picture
Section-header	Section
Table	Table
Text	Text
Title	Title
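
The remapping in the table above can be written as a plain dictionary. This is a minimal sketch for illustration, not the training code used for this model; the names CLASS_NAME_MAP, ORIGINAL_NAME, and restore_label are hypothetical.

```python
# Original DocLayNet class names mapped to the shortened names from the table.
CLASS_NAME_MAP = {
    "Caption": "Cap",
    "Footnote": "Footnote",
    "Formula": "Math",
    "List-item": "List",
    "Page-footer": "Bottom",
    "Page-header": "Header",
    "Picture": "Picture",
    "Section-header": "Section",
    "Table": "Table",
    "Text": "Text",
    "Title": "Title",
}

# Inverse mapping, for converting model output back to DocLayNet names.
ORIGINAL_NAME = {new: old for old, new in CLASS_NAME_MAP.items()}


def restore_label(label: str) -> str:
    """Return the original DocLayNet class name for a label emitted by the model.

    Labels that are not in the mapping are returned unchanged.
    """
    return ORIGINAL_NAME.get(label, label)
```

Applying restore_label to each predicted label recovers the original names, which is useful when comparing predictions against DocLayNet ground-truth annotations.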

With this simple change, we observed a 7% improvement in mAP50-95 on the DocLayNet test set. Training and inference were also faster, because the class names use fewer tokens.

Judged by mAP50-95, this model is far from SOTA on the DocLayNet test set (70%). Much smaller YOLO models (https://github.com/ppaanngggg/yolo-doclaynet) achieve much better benchmark results (~79%). On the subset of scientific articles, this model performed on par with the best YOLO models (87% mAP50-95).

However, after some qualitative analysis (paper coming soon), we found that Florence-2 is much better at drawing bounding boxes with clean edges. YOLO models sometimes cut through the middle of a text block or draw multiple bounding boxes on the same object. These behaviors are not seriously punished by mAP50-95, but they are painful to deal with in real-world use cases. When calculating the mAP scores, we had to manually set the confidence score to 1 for all Florence-2 outputs, because the model does not produce confidence scores.
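
COCO-style mAP tools expect a confidence score for every detection, so a constant score has to be attached to each Florence-2 box. The sketch below shows one way to do that. It assumes the `<OD>` result is a dict with "bboxes" (as [x1, y1, x2, y2]) and "labels"; the helper name to_coco_detections and the category_ids mapping are hypothetical and not part of the released evaluation code.

```python
def to_coco_detections(parsed_answer, image_id, category_ids, score=1.0):
    """Convert a Florence-2 "<OD>" result into COCO-style detection dicts.

    Assumes parsed_answer looks like
    {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["Text", ...]}}.
    category_ids is a hypothetical mapping from label to integer category id.
    """
    result = parsed_answer["<OD>"]
    detections = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        if label not in category_ids:
            continue  # skip labels outside the known class set
        detections.append({
            "image_id": image_id,
            "category_id": category_ids[label],
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses [x, y, width, height]
            "score": score,  # constant placeholder confidence
        })
    return detections
```

Because every detection carries the same score, the evaluator cannot rank detections by confidence, which is worth keeping in mind when comparing these numbers with models that output real scores.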

We are releasing the fine-tuned model weights so that the community can further investigate related research topics.

How to Get Started with the Model

Use the code below to get started with the model.

For non-CUDA environments, please check out this post for a simple patch: https://huggingface.co/microsoft/Florence-2-base/discussions/4

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the fine-tuned model and processor. The repository ships custom model code,
# hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)

# "<OD>" is the object-detection task prompt.
prompt = "<OD>"

# Download a sample page image and make sure it is in RGB mode.
url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Tokenize the prompt, preprocess the image, and generate with beam search.
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)

# Decode the generated tokens and convert them to bounding boxes and labels.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
print(parsed_answer)

To visualize the results, see this tutorial notebook for more details.
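
As a quick alternative for inspecting predictions, the boxes can be drawn directly with Pillow. This is a minimal sketch, not the code from the tutorial notebook; it assumes the `<OD>` result contains "bboxes" (as [x1, y1, x2, y2]) and "labels", and the helper name draw_boxes is hypothetical.

```python
from PIL import Image, ImageDraw


def draw_boxes(image, parsed_answer, color="red", width=3):
    """Return a copy of image with the "<OD>" boxes and labels drawn on it."""
    result = parsed_answer["<OD>"]
    # convert() returns a new image, so the input image is left untouched.
    annotated = image.convert("RGB")
    draw = ImageDraw.Draw(annotated)
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
        # Place the label just inside the top-left corner of the box.
        draw.text((x1 + width + 2, y1 + width + 2), label, fill=color)
    return annotated
```

For example, `draw_boxes(image, parsed_answer).save("annotated.png")` writes the annotated page to disk (the output filename is arbitrary).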

BibTeX and citation info

@misc{TF-ID,
  author = {Yifei Hu},
  title = {TF-ID: Table/Figure IDentifier for academic papers},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}

@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},  
  doi = {10.1145/3534678.353904},
  url = {https://arxiv.org/abs/2206.01062},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022}
}
