
	---
	license: mit
	license_link: https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE
	pipeline_tag: image-text-to-text
	tags:
	- vision
	- ocr
	- segmentation
	- coco
	---

	# TF-ID: Table/Figure IDentifier for academic papers

	## Model Summary

	TF-ID (Table/Figure IDentifier) is a family of object detection models finetuned to extract tables and figures in academic papers. They come in four versions:

	| Model | Model size | Model Description |
	| ------- | ------------- | ------------- |
	| TF-ID-base [[HF]](https://huggingface.co/yifeihu/TF-ID-base) | 0.23B | Extract tables/figures and their caption text |
	| TF-ID-large [[HF]](https://huggingface.co/yifeihu/TF-ID-large) | 0.77B | Extract tables/figures and their caption text |
	| TF-ID-base-no-caption [[HF]](https://huggingface.co/yifeihu/TF-ID-base-no-caption) | 0.23B | Extract tables/figures without caption text |
	| TF-ID-large-no-caption [[HF]](https://huggingface.co/yifeihu/TF-ID-large-no-caption) | 0.77B | Extract tables/figures without caption text |

	All TF-ID models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints.

	TF-ID models take an image of a single paper page as the input, and return bounding boxes for all tables and figures in the given page.
	TF-ID-base and TF-ID-large draw bounding boxes around tables/figures and their caption text.
	TF-ID-base-no-caption and TF-ID-large-no-caption draw bounding boxes around tables/figures without their caption text.

	Object detection results format:
	`{'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}`
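
To make the result format above concrete, here is a minimal sketch that pairs each label with its bounding box and derives the box width and height. The helper name `pair_detections` is hypothetical, the sample values are made up for illustration, and the label strings `"figure"` and `"table"` are an assumption about what the model emits. It also assumes `(x1, y1)` is the top-left corner and `(x2, y2)` the bottom-right corner, in pixels.

```python
from typing import List


def pair_detections(parsed_answer: dict, task: str = "<OD>") -> List[dict]:
    """Pair each label with its box and add the box width and height.

    Hypothetical helper: it only relies on the documented layout
    {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}.
    """
    result = parsed_answer.get(task, {})
    bboxes = result.get("bboxes", [])
    labels = result.get("labels", [])
    if len(bboxes) != len(labels):
        raise ValueError("bboxes and labels must have the same length")

    detections = []
    for label, (x1, y1, x2, y2) in zip(labels, bboxes):
        detections.append(
            {
                "label": label,
                "box": (x1, y1, x2, y2),
                "width": x2 - x1,
                "height": y2 - y1,
            }
        )
    return detections


# Made-up values for illustration only; this is not real model output,
# and the label names are an assumption.
example = {
    "<OD>": {
        "bboxes": [[100.0, 50.0, 500.0, 300.0], [80.0, 400.0, 520.0, 700.0]],
        "labels": ["figure", "table"],
    }
}

for det in pair_detections(example):
    print(det["label"], det["box"], det["width"], det["height"])
```

Keeping the label and box together in one record makes later steps, such as filtering by label or sorting by position on the page, simpler than indexing two parallel lists.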

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	import requests

	from PIL import Image
	from transformers import AutoProcessor, AutoModelForCausalLM

	# Load the finetuned TF-ID checkpoint (not the Florence-2 checkpoint it was finetuned from).
	model = AutoModelForCausalLM.from_pretrained("yifeihu/TF-ID-base", trust_remote_code=True)
	processor = AutoProcessor.from_pretrained("yifeihu/TF-ID-base", trust_remote_code=True)

	# "<OD>" is the object detection task prompt.
	prompt = "<OD>"

	# Generic sample image; replace it with an image of a single paper page.
	url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
	image = Image.open(requests.get(url, stream=True).raw)

	inputs = processor(text=prompt, images=image, return_tensors="pt")

	generated_ids = model.generate(
	    input_ids=inputs["input_ids"],
	    pixel_values=inputs["pixel_values"],
	    max_new_tokens=1024,
	    do_sample=False,
	    num_beams=3,
	)
	generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

	# Convert the generated text into bounding boxes scaled to the image size.
	parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

	print(parsed_answer)
	```
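
Once `parsed_answer` is available, a natural next step is to cut the detected regions out of the page image. The sketch below is one possible way to do that, not part of the model's API: `crop_detections` is a hypothetical helper that expects an object exposing `size` and `crop` the way a `PIL.Image.Image` does. It assumes the boxes are in pixel coordinates with a top-left origin, and it clamps them to the image bounds in case a box extends slightly past the page edge.

```python
def crop_detections(image, parsed_answer, task="<OD>"):
    """Return a list of (label, cropped_image) pairs.

    Hypothetical helper. `image` is expected to behave like a
    PIL.Image.Image: `image.size` is (width, height) and
    `image.crop((left, upper, right, lower))` returns the cropped region.
    """
    width, height = image.size
    result = parsed_answer.get(task, {})
    labels = result.get("labels", [])
    bboxes = result.get("bboxes", [])

    crops = []
    for label, (x1, y1, x2, y2) in zip(labels, bboxes):
        # Round to whole pixels and clamp to the image bounds.
        left = max(0, min(width, int(round(x1))))
        upper = max(0, min(height, int(round(y1))))
        right = max(0, min(width, int(round(x2))))
        lower = max(0, min(height, int(round(y2))))

        # Skip boxes that are empty after clamping.
        if right <= left or lower <= upper:
            continue

        crops.append((label, image.crop((left, upper, right, lower))))
    return crops


# Usage sketch, assuming `image` and `parsed_answer` come from the code above:
# for i, (label, crop) in enumerate(crop_detections(image, parsed_answer)):
#     crop.save(f"{label}_{i}.png")
```

Returning the label with each crop lets the caller decide how to name or route the output files, for example saving tables and figures to separate folders.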

	## BibTeX and citation info

	```
	@article{xiao2023florence,
	title={Florence-2: Advancing a unified representation for a variety of vision tasks},
	author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
	journal={arXiv preprint arXiv:2311.06242},
	year={2023}
	}
	```