---
license: gpl-3.0
language:
- en
metrics:
- f1
---

# Model Card

## Model Details

- Model Name: IssueReportClassifier-NLBSE22
- Base Model: RoBERTa
- Dataset: NLBSE22
- Model Type: Fine-tuned
- Model Version: 1.0
- Model Date: 2023-03-21

## Model Description

IssueReportClassifier-NLBSE22 is a RoBERTa model fine-tuned on the NLBSE22 dataset to classify GitHub issue reports into three categories: bug, enhancement, and question. It was trained on labeled issue reports and predicts the category of a new issue report from its text content (title and body).
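
A minimal usage sketch, assuming the model is published on the Hugging Face Hub and loaded through the `transformers` text-classification pipeline. The repository id in `MODEL_ID` is a placeholder, the helper names are hypothetical, and joining title and body with a single space is an assumption rather than the documented input format.

```python
# Placeholder: replace with the actual Hub repository id of the model.
MODEL_ID = "<namespace>/IssueReportClassifier-NLBSE22"


def build_input(title: str, body: str) -> str:
    """Join an issue's title and body into one text (assumed input format)."""
    parts = [title.strip(), body.strip()]
    return " ".join(p for p in parts if p)


def classify_issue(title: str, body: str, model_id: str = MODEL_ID) -> dict:
    """Return the top predicted label and its score for one issue report."""
    # Imported lazily so build_input works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True keeps long issue bodies within the model's input limit.
    return classifier(build_input(title, body), truncation=True)[0]
```

Calling `classify_issue("App crashes on startup", "Steps to reproduce: ...")` should return a dictionary with `label` and `score` keys; the exact label strings depend on the model configuration.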

## Dataset

| Category    | Training Set    | Test Set       |
|-------------|-----------------|----------------|
| bug         | 361,239 (50.0%) | 40,152 (49.9%) |
| enhancement | 299,287 (41.4%) | 33,290 (41.3%) |
| question    | 62,373 (8.6%)   | 7,076 (8.8%)   |
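
The percentages in the table follow directly from the counts. A short sketch that recomputes them (the function name is illustrative):

```python
# Issue counts per category, copied from the table above.
TRAIN_COUNTS = {"bug": 361_239, "enhancement": 299_287, "question": 62_373}
TEST_COUNTS = {"bug": 40_152, "enhancement": 33_290, "question": 7_076}


def class_shares(counts: dict) -> dict:
    """Return each category's share of its split, in percent (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


print(class_shares(TRAIN_COUNTS))
print(class_shares(TEST_COUNTS))
```

The classes are imbalanced: question accounts for under 9% of each split, which is one reason to look at macro-averaged scores alongside micro-averaged ones.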

## Data preprocessing

The training data was preprocessed with [ekphrasis](https://github.com/cbaziotis/ekphrasis), supplemented with custom regular expressions that remove code, images, and URLs. See our [GitHub repository](https://github.com/collab-uniba/Issue-Report-Classification-Using-RoBERTa) for details.
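
As an illustration of this kind of cleanup, the sketch below strips fenced and inline code, Markdown images, and URLs using Python's standard `re` module. The patterns and the `clean_issue_text` name are assumptions made for this example, not the regular expressions used in the repository, and the ekphrasis step is omitted.

```python
import re

FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)  # multi-line fenced code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")                # inline code spans
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")        # Markdown images: ![alt](src)
URL = re.compile(r"https?://\S+")                     # bare http(s) URLs


def clean_issue_text(text: str) -> str:
    """Remove code, images, and URLs, then collapse leftover whitespace."""
    # Images are removed before URLs so that no "![alt](" stub is left behind.
    for pattern in (FENCED_CODE, INLINE_CODE, MD_IMAGE, URL):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_issue_text("See https://example.com and run `make test` first"))
# prints: See and run first
```

Fenced blocks are removed before inline spans so that the backticks of a fence are not mistaken for inline code delimiters.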

## Metrics

The model is evaluated using the following metrics:

- Accuracy
- Precision
- Recall
- F1 Score (micro and macro average)
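
To make the difference between the two F1 averages concrete, the sketch below computes per-class precision, recall, and F1, plus micro- and macro-averaged F1, from lists of true and predicted labels. It uses only the standard library; the function name is illustrative, and this is not the evaluation script used for the reported results.

```python
from collections import Counter

LABELS = ("bug", "enhancement", "question")


def f1_report(y_true, y_pred, labels=LABELS):
    """Return per-class precision/recall/F1 and micro/macro F1 (single-label data)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return ratio(2 * p * r, p + r)

    per_class = {}
    for label in labels:
        precision = ratio(tp[label], tp[label] + fp[label])
        recall = ratio(tp[label], tp[label] + fn[label])
        per_class[label] = {"precision": precision, "recall": recall,
                            "f1": f1(precision, recall)}

    # Micro average pools counts over classes; macro average weights classes equally.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_f1 = f1(ratio(tp_all, tp_all + fp_all), ratio(tp_all, tp_all + fn_all))
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"per_class": per_class, "micro_f1": micro_f1, "macro_f1": macro_f1}
```

In single-label classification, micro-averaged F1 equals accuracy, while macro-averaged F1 gives the smaller question class the same weight as bug and enhancement.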

## References

- [NLBSE22 Dataset](https://nlbse2022.github.io/tools/)

## Cite our work

```
@inproceedings{Colavito-2022,
  title = {Issue Report Classification Using Pre-trained Language Models},
  booktitle = {2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  author = {Colavito, Giuseppe and Lanubile, Filippo and Novielli, Nicole},
  year = {2022},
  month = may,
  pages = {29--32},
  doi = {10.1145/3528588.3528659},
  abstract = {This paper describes our participation in the tool competition organized in the scope of the 1st International Workshop on Natural Language-based Software Engineering. We propose a supervised approach relying on fine-tuned BERT-based language models for the automatic classification of GitHub issues. We experimented with different pre-trained models, achieving the best performance with fine-tuned RoBERTa (F1 = .8591).},
  keywords = {Issue classification, BERT, deep learning, labeling unstructured data, software maintenance and evolution},
}
```