---
license: cc0-1.0
base_model:
- distilbert/distilroberta-base
pipeline_tag: text-classification
library_name: transformers
---

This is a [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base) model fine-tuned to classify text into 3 categories:

- Rare Diseases
- Non-Rare Diseases
- Other

The details of how this model was built and evaluated are provided in the article:

Rei L, Pita Costa J, Zdolšek Draksler T. Automatic Classification and Visualization of Text Data on Rare Diseases. _Journal of Personalized Medicine_. 2024; 14(5):545. https://doi.org/10.3390/jpm14050545

```
@Article{jpm14050545,
AUTHOR = {Rei, Luis and Pita Costa, Joao and Zdolšek Draksler, Tanja},
TITLE = {Automatic Classification and Visualization of Text Data on Rare Diseases},
JOURNAL = {Journal of Personalized Medicine},
VOLUME = {14},
YEAR = {2024},
NUMBER = {5},
ARTICLE-NUMBER = {545},
URL = {https://www.mdpi.com/2075-4426/14/5/545},
PubMedID = {38793127},
ISSN = {2075-4426},
DOI = {10.3390/jpm14050545}
}
```
Note that in the article, the larger roberta-base model is fine-tuned instead. This is a smaller model, shared for demonstration and validation purposes. Hyper-parameters were not tuned.

## Using this model
The simplest way to use this model is via a Hugging Face Transformers pipeline.

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="lrei/rad-small")

# Simple high-level usage
pipe(["The patient suffers from a complex genetic disorder.", "The patient suffers from a common genetic disorder."])
```
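
The pipeline returns one dictionary per input text, each with a `label` and a `score` key. Below is a minimal sketch of pairing those predictions with their inputs. The `summarize` helper, the 0.5 threshold, and the sample predictions (including the placeholder label names) are illustrative assumptions, not output from this model.

```python
def summarize(texts, predictions, min_score=0.5):
    """Pair each text with its predicted label and flag low-confidence results."""
    rows = []
    for text, pred in zip(texts, predictions):
        rows.append({
            "text": text,
            "label": pred["label"],
            "score": round(pred["score"], 3),
            "confident": pred["score"] >= min_score,
        })
    return rows


# Hypothetical stand-in for `pipe(texts)`; real labels and scores will differ.
texts = ["First abstract.", "Second abstract."]
predictions = [
    {"label": "LABEL_A", "score": 0.91},
    {"label": "LABEL_B", "score": 0.42},
]

for row in summarize(texts, predictions):
    print(row)
```

In practice, `predictions` would be the list returned by `pipe(texts)`.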

## Dataset

The dataset used to train this model is available on [Zenodo](https://zenodo.org/records/13882003).
It is a subset of abstracts obtained from PubMed and sorted into the 3 classes on the basis of their MeSH terms.

Like the model, the dataset is provided for demonstration and methodology validation purposes. The original PubMed data was randomly under-sampled.
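
This card does not say how the under-sampling was done (for example, whether classes were balanced or the corpus was only reduced in size). As a rough illustration of the general technique, the sketch below randomly down-samples every class to the size of the smallest one. The `undersample` function, the `label` key, the seed, and the toy records are all assumptions made for the example.

```python
import random
from collections import defaultdict


def undersample(records, label_key="label", seed=0):
    """Randomly down-sample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for record in records:
        by_class[record[label_key]].append(record)

    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    sampled = []
    for items in by_class.values():
        sampled.extend(rng.sample(items, target))
    rng.shuffle(sampled)
    return sampled


# Toy records standing in for labelled abstracts (not real data).
records = (
    [{"id": i, "label": "rare"} for i in range(3)]
    + [{"id": i, "label": "non_rare"} for i in range(3, 10)]
    + [{"id": i, "label": "other"} for i in range(10, 30)]
)
balanced = undersample(records)
```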

## Code
The code used to create this model is available on [GitHub](https://github.com/lrei/rad).

## Test Results

Averaged over all 3 classes:

| average | precision | recall | F1   |
| ------- | --------- | ------ | ---- |
| micro   | 0.84      | 0.84   | 0.84 |
| macro   | 0.84      | 0.84   | 0.84 |
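
For readers unfamiliar with the two averaging schemes in the table, here is a minimal pure-Python sketch of how micro and macro averages are commonly computed. It is not the evaluation code used for this model. The function names and toy labels are assumptions, and macro F1 is taken here as the unweighted mean of per-class F1 scores.

```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    # Micro: pool the counts over classes, then compute the metrics once.
    micro = prf(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    # Macro: compute the metrics per class, then take the unweighted mean.
    per_class = [prf(**c) for c in counts.values()]
    macro = tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))
    return micro, macro


# Toy labels for illustration only; not the model's test set.
labels = ["rare", "non_rare", "other"]
y_true = ["rare", "rare", "non_rare", "non_rare", "other", "other"]
y_pred = ["rare", "non_rare", "non_rare", "non_rare", "other", "rare"]
micro, macro = micro_macro(y_true, y_pred, labels)
print("micro:", micro)
print("macro:", macro)
```

In single-label classification, every misclassification counts as one false positive and one false negative, so micro-averaged precision, recall, and F1 all equal accuracy.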