license: apache-2.0
datasets:
- OFA-Sys/OccuQuest
language:
- en
This is the OccuLLaMA-7B model in OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models.
The dataset is on OccuQuest.
Abstract: The emergence of large language models (LLMs) has revolutionized natural language processing tasks. However, existing instruction-tuning datasets suffer from occupational bias: the majority of data relates to only a few occupations, which hampers the instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this issue and promote occupation-inclusive LLMs, we create an instruction-tuning dataset named OccuQuest, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. We systematically request ChatGPT, organizing queries hierarchically based on Occupation, Responsibility, Topic, and Question, to ensure a comprehensive coverage of occupational specialty inquiries. By comparing with three commonly used datasets (Dolly, ShareGPT, and WizardLM), we observe that OccuQuest exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation, an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against WizardLM. Furthermore, we demonstrate the potential of combining OccuQuest with other instruction-tuning datasets to enhance the overall performance of LLMs. By fine-tuning LLaMA on a mixture of OccuQuest and Tulu datasets, we introduce ProLLaMA, which excels in addressing occupational questions and exhibits superior performance in comprehensive evaluations such as MMLU, GSM8K, BBH, and HumanEval. Among the different LLaMA variants, the 7B and 13B ProLLaMA models achieve the highest performance on MMLU and GSM8K, with the 7B ProLLaMA model demonstrating an improvement of more than 4 points over the other 7B variants on GSM8K. We open release the dataset and models.
Please cite if you use this model:
@misc{xue2023occuquest,
title={OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models},
author={Mingfeng Xue and Dayiheng Liu and Kexin Yang and Guanting Dong and Wenqiang Lei and Zheng Yuan and Chang Zhou and Jingren Zhou},
year={2023},
eprint={2310.16517},
archivePrefix={arXiv},
primaryClass={cs.CL}
}