metadata
license: gpl-3.0
language:
- en
metrics:
- f1
Model Card
Model Details
- Model Name: IssueReportClassifier-NLBSE22
- Base Model: RoBERTa
- Dataset: NLBSE22
- Model Type: Fine-tuned
- Model Version: 1.0
- Model Date: 2023-03-21
Model Description
IssueReportClassifier-NLBSE22 is a RoBERTa model fine-tuned on the NLBSE22 dataset to classify GitHub issue reports into three categories: bug, enhancement, and question. It predicts the category of a new issue report from its text content (title and body).
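As a rough illustration of how the model could be queried, the sketch below uses the `text-classification` pipeline from the `transformers` library. Several things here are assumptions rather than details confirmed by this card: `build_input` and `classify_issue` are hypothetical helper names, joining title and body with a single space is a guess at the input format, and `model_id` is a placeholder for whatever Hub ID or local path holds the checkpoint.

```python
def build_input(title: str, body: str) -> str:
    """Join title and body into one string.

    Assumption: a single space as separator; the card does not state the format.
    """
    parts = [title.strip(), body.strip()]
    return " ".join(p for p in parts if p)


def classify_issue(title: str, body: str, model_id: str) -> str:
    """Return the predicted label for one issue report.

    model_id is a placeholder: pass the Hub ID or local path of the checkpoint.
    """
    # Imported lazily so build_input can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True trims inputs that exceed the model's maximum length.
    result = classifier(build_input(title, body), truncation=True)
    return result[0]["label"]
```

The label strings returned depend on the checkpoint's configuration, so they may need to be mapped to bug, enhancement, and question.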
Dataset
| Category | Training Set | Test Set |
|---|---|---|
| bug | 361,239 (50%) | 40,152 (49.9%) |
| enhancement | 299,287 (41.4%) | 33,290 (41.3%) |
| question | 62,373 (8.6%) | 7,076 (8.8%) |
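The percentages in the table follow directly from the counts. A minimal sketch of that arithmetic, using a hypothetical helper named `shares`:

```python
# Counts copied from the table above.
train = {"bug": 361_239, "enhancement": 299_287, "question": 62_373}
test = {"bug": 40_152, "enhancement": 33_290, "question": 7_076}


def shares(counts):
    """Return each class's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


train_shares = shares(train)
test_shares = shares(test)
```

Both splits show the same imbalance: question accounts for under 9% of the reports.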
Data preprocessing
The training data was preprocessed with ekphrasis, supplemented by regular expressions that remove code, images, and URLs. See our GitHub code for more details.
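To give a sense of what the regular-expression step might look like, here is a minimal sketch. The patterns and the `strip_noise` function are illustrative assumptions, not the expressions actually used for training, and the ekphrasis step is not reproduced here.

```python
import re

# Illustrative patterns only; the actual expressions are in the GitHub code.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)   # fenced code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")              # inline code spans
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")      # markdown images
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def strip_noise(text: str) -> str:
    """Remove code, images, and URLs, then collapse whitespace."""
    # Images are removed before bare URLs so the image syntax disappears whole.
    for pattern in (FENCED_CODE, INLINE_CODE, MD_IMAGE, HTML_IMAGE, URL):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applying the patterns in this order avoids leaving fragments such as `![shot](` behind when an image link contains a URL.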
Metrics
The model is evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1 Score (micro and macro average)
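The difference between the micro and macro averages matters here because the classes are imbalanced. The sketch below computes both in plain Python; `f1_scores` and `LABELS` are hypothetical names, and this is not the evaluation script used for the reported results.

```python
from collections import Counter

LABELS = ("bug", "enhancement", "question")


def f1_scores(y_true, y_pred, labels=LABELS):
    """Return (per-class F1, micro-averaged F1, macro-averaged F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 over pooled counts; equals accuracy for single-label data.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro, macro
```

A model that performs poorly on the question class will see its macro average fall much more than its micro average.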
References
Cite our work
@inproceedings{Colavito-2022,
title = {Issue Report Classification Using Pre-trained Language Models},
booktitle = {2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)},
author = {Colavito, Giuseppe and Lanubile, Filippo and Novielli, Nicole},
year = {2022},
month = may,
pages = {29--32},
doi = {10.1145/3528588.3528659},
abstract = {This paper describes our participation in the tool competition organized in the scope of the 1st International Workshop on Natural Language-based Software Engineering. We propose a supervised approach relying on fine-tuned BERT-based language models for the automatic classification of GitHub issues. We experimented with different pre-trained models, achieving the best performance with fine-tuned RoBERTa (F1 = .8591).},
keywords = {Issue classification, BERT, deep learning, labeling unstructured data,
software maintenance and evolution},
}