|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- yuan-tian/chartgpt-dataset |
|
language: |
|
- en |
|
metrics: |
|
- rouge |
|
pipeline_tag: text2text-generation |
|
base_model: |
|
- google/flan-t5-xl |
|
new_version: yuan-tian/chartgpt-llama3 |
|
--- |
|
# Model Card for ChartGPT |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
This model generates charts from abstract natural language descriptions of tabular data. For more information, please refer to the paper.
|
|
|
* **Model type:** Language model |
|
* **Language(s) (NLP)**: English |
|
* **License**: Apache 2.0 |
|
* **Finetuned from model**: [FLAN-T5-XL](https://huggingface.co/google/flan-t5-xl) |
|
* **Research paper**: [ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language](https://ieeexplore.ieee.org/document/10443572) |
|
|
|
### Model Input Format |
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
The model input at step `x` takes the following form, where each `<...>` token serves as a separator.
|
|
|
``` |
|
{table name} |
|
<head> {column names} |
|
<type> {column types} |
|
<data> {data row 1} <line> {data row 2} <line> |
|
<utterance> {NL utterance} |
|
<ans> |
|
<sep> {Step 1 prompt} {Answer 1}
|
... |
|
<sep> {Step x-1 prompt} {Answer x-1} |
|
<sep> {Step x prompt} |
|
``` |
|
|
|
The model then outputs the answer corresponding to step `x`.
|
|
|
The prompts for steps 1-6 are as follows:
|
|
|
``` |
|
Step 1. Select columns: |
|
Step 2. Add filter: |
|
Step 3. Add aggregations: |
|
Step 4. Select chart type: |
|
Step 5. Choose encoding: |
|
Step 6. Add sort: |
|
``` |
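
The sketch below is an illustrative helper (not part of the released code) that assembles the model input for a given step from a table, an utterance, and the answers already produced for earlier steps; the function name and the exact whitespace handling are assumptions.

```python
# Hypothetical helper that builds the step-x prompt following the format above.
STEP_PROMPTS = [
    "Step 1. Select columns:",
    "Step 2. Add filter:",
    "Step 3. Add aggregations:",
    "Step 4. Select chart type:",
    "Step 5. Choose encoding:",
    "Step 6. Add sort:",
]

def build_input(table_name, columns, types, rows, utterance, answers):
    """Build the prompt for step len(answers) + 1.

    `answers` holds the model's answers to the previous steps.
    """
    header = ",".join(columns)
    col_types = ",".join(types)
    data = " <line> ".join(",".join(str(v) for v in row) for row in rows) + " <line>"
    text = (
        f"{table_name} <head> {header} <type> {col_types} <data> {data} "
        f"<utterance> {utterance} <ans>"
    )
    # Append the prompts and answers of the steps already completed.
    for prompt, answer in zip(STEP_PROMPTS, answers):
        text += f" <sep> {prompt} {answer}"
    # Append the prompt of the next step, which the model should answer.
    text += f" <sep> {STEP_PROMPTS[len(answers)]}"
    return text
```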
|
</details> |
|
|
|
## How to Get Started with the Model |
|
|
|
### Running the Model on a GPU |
|
|
|
The example below uses a movie dataset with the utterance "What kinds of movies are the most popular?".

The model should output the answer to step 1 (select columns).

You can use the code below to check that the model runs successfully.
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
```python |
|
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

# Load the fine-tuned model; device_map="auto" places the weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained("yuan-tian/chartgpt")
model = AutoModelForSeq2SeqLM.from_pretrained("yuan-tian/chartgpt", device_map="auto")

# Serialized movie table, the user utterance, and the step 1 prompt, following the input format above.
input_text = "movies <head> Title,Worldwide_Gross,Production_Budget,Release_Year,Content_Rating,Running_Time,Major_Genre,Creative_Type,Rotten_Tomatoes_Rating,IMDB_Rating <type> nominal,quantitative,quantitative,temporal,nominal,quantitative,nominal,nominal,quantitative,quantitative <data> From Dusk Till Dawn,25728961,20000000,1996,R,107,Horror,Fantasy,63,7.1 <line> Broken Arrow,148345997,65000000,1996,R,108,Action,Contemporary Fiction,55,5.8 <line> <utterance> What kinds of movies are the most popular? <ans> <sep> Step 1. Select the columns:"

inputs = tokenizer(input_text, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
|
|
|
</details> |
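
To run all six steps, the decoded answer from each step is appended to the prompt before querying the next step, as described in the input format above. Below is a minimal sketch using the hypothetical `build_input` helper from the input-format section; the table values are truncated for brevity and the generation settings are assumptions.

```python
# Hypothetical end-to-end loop over the six steps; each decoded answer is fed
# back into the prompt for the next step.
columns = ["Title", "Worldwide_Gross", "Major_Genre", "IMDB_Rating"]  # truncated table for illustration
types = ["nominal", "quantitative", "nominal", "quantitative"]
rows = [
    ["From Dusk Till Dawn", 25728961, "Horror", 7.1],
    ["Broken Arrow", 148345997, "Action", 5.8],
]
utterance = "What kinds of movies are the most popular?"

answers = []
for step in range(6):
    text = build_input("movies", columns, types, rows, utterance, answers)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    answers.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(answers)  # one answer string per step
```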
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
This model is fine-tuned from [FLAN-T5-XL](https://huggingface.co/google/flan-t5-xl) on the [chartgpt-dataset](https://huggingface.co/datasets/yuan-tian/chartgpt-dataset).
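
As a quick way to inspect the training data, the snippet below loads the dataset with the `datasets` library; the split and field names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the ChartGPT dataset from the Hugging Face Hub.
dataset = load_dataset("yuan-tian/chartgpt-dataset")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # inspect one example, assuming a "train" split exists
```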
|
|
|
### Training Procedure |
|
|
|
The preprocessing and training procedure will be documented in a future update.
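
Until then, the snippet below is a generic sequence-to-sequence fine-tuning sketch with `Seq2SeqTrainer`; it is not the authors' recipe, and the field names ("input"/"output") and all hyperparameters are assumptions.

```python
# Generic seq2seq fine-tuning sketch, NOT the authors' actual procedure.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
dataset = load_dataset("yuan-tian/chartgpt-dataset")

def preprocess(batch):
    # Assumed field names; adapt to the actual dataset schema.
    model_inputs = tokenizer(batch["input"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["output"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset["train"].map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="chartgpt-finetune",
    per_device_train_batch_size=2,     # illustrative hyperparameters only
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```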
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
|
``` |
|
@article{tian2024chartgpt, |
|
title={ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language}, |
|
author={Tian, Yuan and Cui, Weiwei and Deng, Dazhen and Yi, Xinjing and Yang, Yurun and Zhang, Haidong and Wu, Yingcai}, |
|
journal={IEEE Transactions on Visualization and Computer Graphics}, |
|
year={2024}, |
|
pages={1-15}, |
|
doi={10.1109/TVCG.2024.3368621} |
|
} |
|
``` |