ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

Venue: ACL 2024 (Findings)

Paper Link: https://arxiv.org/abs/2403.09028

The abstract of the paper states that:

Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question-answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision tasks language. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language Instruction following dataset comprising 191K instructions generated with 71K charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model that connects a vision encoder for chart understanding with a LLM; and (2) a pipeline model that employs a two-step approach to extract chart data tables and input them into the LLM. In experiments on four downstream tasks, we first show the effectiveness of our model--achieving a new set of state-of-the-art results. Further evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.

Web Demo

If you wish to quickly try our model, you can access our public web demo hosted on the Hugging Face Spaces platform with a friendly interface!

ChartInstruct-Llama2 Web Demo

Inference

You can easily use our models for inference with the huggingface library! You just need to do the following:

  1. Chage the image_path to your chart example image path on your system
  2. Write the input_text

We recommend using beam search with a beam size of 4, but if your machine has low memory, you can remove the num_beams from the generate method.

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch

torch.hub.download_url_to_file('https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/multi_col_1229.png', 'chart_example_1.png')

image_path = "/content/chart_example_1.png"
input_text = "What is the share of respondants who prefer Whatsapp in the 18-29 age group?"

input_prompt = f"<image>\n Question: {input_text} Answer: "

model = LlavaForConditionalGeneration.from_pretrained("ahmed-masry/ChartInstruct-LLama2", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("ahmed-masry/ChartInstruct-LLama2")


image = Image.open(image_path).convert('RGB')

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# change type if pixel_values in inputs to fp16. 
inputs['pixel_values'] = inputs['pixel_values'].to(torch.float16)
prompt_length = inputs['input_ids'].shape[1]

# Generate
generate_ids = model.generate(**inputs, num_beams=4, max_new_tokens=512)
output_text = processor.batch_decode(generate_ids[:, prompt_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output_text)

Contact

If you have any questions about this work, please contact Ahmed Masry using the following email addresses: [email protected] or [email protected].

Reference

Please cite our paper if you use our model in your research.

@misc{masry2024chartinstruct,
      title={ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning}, 
      author={Ahmed Masry and Mehrad Shahmohammadi and Md Rizwan Parvez and Enamul Hoque and Shafiq Joty},
      year={2024},
      eprint={2403.09028},
      archivePrefix={arXiv},
      primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'}
}
Downloads last month
313
Safetensors
Model size
6.83B params
Tensor type
BF16
·
I64
·
Inference API
Inference API (serverless) does not yet support transformers models for this pipeline type.

Space using ahmed-masry/ChartInstruct-LLama2 1