---
license: cc0-1.0
base_model:
- distilbert/distilroberta-base
pipeline_tag: text-classification
library_name: transformers
---

This is a [distilroberta-base](distilbert/distilroberta-base) model fine-tuned to classify text into 3 categories:

- Rare Diseases
- Non-Rare Diseases
- Other

The details of how this model was built and evaluated are provided in the article:

Rei L, Pita Costa J, Zdolšek Draksler T. Automatic Classification and Visualization of Text Data on Rare Diseases. _Journal of Personalized Medicine_. 2024; 14(5):545. https://doi.org/10.3390/jpm14050545

```
@Article{jpm14050545,
AUTHOR = {Rei, Luis and Pita Costa, Joao and Zdolšek Draksler, Tanja},
TITLE = {Automatic Classification and Visualization of Text Data on Rare Diseases},
JOURNAL = {Journal of Personalized Medicine},
VOLUME = {14},
YEAR = {2024},
NUMBER = {5},
ARTICLE-NUMBER = {545},
URL = {https://www.mdpi.com/2075-4426/14/5/545},
PubMedID = {38793127},
ISSN = {2075-4426},
DOI = {10.3390/jpm14050545}
}
```
Note that in the article, the larger roberta-base model is fine-tuned instead. This model is smaller and is shared for demonstration and validation purposes. Hyper-parameters were not tuned.

## Using this model
The simplest way to use this model is through a Hugging Face Transformers pipeline.

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("text-classification", model="lrei/rad-small")

# Classify a batch of sentences; each result holds a label and a score
pipe([
    "The patient suffers from a complex genetic disorder.",
    "The patient suffers from a common genetic disorder.",
])
```
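
If scores for all 3 classes are needed rather than only the top prediction, the pipeline can be asked to return every class score. The following is a minimal sketch, assuming a recent `transformers` version in which `top_k=None` returns one score per class for each input. The helper names `classify_with_scores` and `top_labels` are hypothetical and are not part of the model or the library.

```python
from typing import Dict, List, Tuple, Union

# One entry of the pipeline output: {"label": ..., "score": ...}
Score = Dict[str, Union[str, float]]


def classify_with_scores(
    texts: List[str], model_name: str = "lrei/rad-small"
) -> List[List[Score]]:
    """Return the score of every class for each input text.

    Assumption: top_k=None makes the text-classification pipeline
    return all class scores instead of only the best one.
    """
    # Imported lazily so the helper below can be used without transformers.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_name, top_k=None)
    return pipe(texts)


def top_labels(batch_scores: List[List[Score]]) -> List[Tuple[str, float]]:
    """Pick the highest-scoring (label, score) pair for each input."""
    results = []
    for scores in batch_scores:
        best = max(scores, key=lambda item: item["score"])
        results.append((best["label"], best["score"]))
    return results
```

Calling `top_labels(classify_with_scores(["some abstract text"]))` would then yield one `(label, score)` pair per input. The label strings come from the model's own configuration and are not assumed here.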

## Dataset

The dataset used to train this model is available on [Zenodo](https://zenodo.org/records/13882003).
It is a subset of abstracts obtained from PubMed and sorted into the 3 classes on the basis of their MeSH terms.

Like the model, the dataset is provided for demonstration and methodology validation purposes. The original PubMed data was randomly under-sampled.

## Code
The code used to create this model is available on [GitHub](https://github.com/lrei/rad).

## Test Results

Averaged over all 3 classes:

| average | precision | recall | F1   |
| ------- | --------- | ------ | ---- |
| micro   | 0.84      | 0.84   | 0.84 |
| macro   | 0.84      | 0.84   | 0.84 |
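
To illustrate how micro and macro averages differ, here is a minimal pure-Python sketch on toy data. It is not the evaluation code used for this model. The label names and predictions are made up for illustration, and macro F1 is taken as the mean of per-class F1 scores, which is one common convention.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

PRF = Tuple[float, float, float]  # (precision, recall, F1)


def prf(tp: int, fp: int, fn: int) -> PRF:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_macro(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, PRF]:
    """Micro and macro averaged scores for single-label classification."""
    counts = {label: Counter() for label in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1   # predicted class gains a false positive
            counts[truth]["fn"] += 1  # true class gains a false negative

    # Micro: pool the counts of all classes, then compute the scores once.
    micro = prf(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )

    # Macro: compute the scores per class, then take the unweighted mean.
    per_class = [prf(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    macro = tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))
    return {"micro": micro, "macro": macro}


# Hypothetical labels and predictions, for illustration only
labels = ["rare", "non_rare", "other"]
y_true = ["rare", "rare", "non_rare", "other", "other", "other"]
y_pred = ["rare", "non_rare", "non_rare", "other", "other", "rare"]
scores = micro_macro(y_true, y_pred, labels)
```

When every example has exactly one label, micro-averaged precision, recall and F1 all equal accuracy, which is why the three values in a micro row coincide. Macro averages weight each class equally regardless of its size.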