---
license: apache-2.0
datasets:
- Zakia/drugscom_reviews
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- health
- medicine
- patient reviews
- drug reviews
- depression
- text classification
---

# Model Card for Zakia/distilbert-drugscom_depression_reviews

This model is a DistilBERT-based classifier fine-tuned on drug reviews for the medical condition of depression from Drugs.com.
The dataset used for fine-tuning is the [Zakia/drugscom_reviews](https://huggingface.co/datasets/Zakia/drugscom_reviews) dataset, filtered for the condition 'Depression'.
The base model for fine-tuning was [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased).

## Model Details

### Model Description

- **Developed by:** Zakia
- **Model type:** Text Classification
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** distilbert-base-uncased

## Uses

### Direct Use

This model is intended to classify drug reviews into high or low quality, aiding in the analysis of patient feedback on depression medications.

### Out-of-Scope Use

This model is not designed to diagnose or treat depression or to replace professional medical advice.

## Bias, Risks, and Limitations

The model may inherit biases present in the dataset and should not be used as the sole decision-maker for healthcare or treatment options.

### Recommendations

Use the model as a tool to support, not replace, professional judgment.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch.nn.functional as F

model_name = "Zakia/distilbert-drugscom_depression_reviews"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a function to print predictions with labels
def print_predictions(review_text, model, tokenizer):
    inputs = tokenizer(review_text, return_tensors="pt")
    outputs = model(**inputs)
    predictions = F.softmax(outputs.logits, dim=-1)
    # LABEL_0 is for low quality and LABEL_1 for high quality
    print(f"Review: \"{review_text}\"")
    print(f"Prediction: {{'LABEL_0 (Low quality)': {predictions[0][0].item():.4f}, 'LABEL_1 (High quality)': {predictions[0][1].item():.4f}}}\n")

# High quality review example
high_quality_review = "This medication has changed my life for the better. I've experienced no side effects and my symptoms of depression have significantly decreased."
print_predictions(high_quality_review, model, tokenizer)

# Low quality review example
low_quality_review = "I've had a terrible experience with this medication. It made me feel nauseous and I didn't notice any improvement in my condition."
print_predictions(low_quality_review, model, tokenizer)
```

## Training Details

### Training Data

The model was fine-tuned on a dataset of drug reviews specifically related to depression, filtered from Drugs.com.
This dataset is accessible from [Zakia/drugscom_reviews](https://huggingface.co/datasets/Zakia/drugscom_reviews) on Hugging Face datasets (condition = 'Depression') for the 'train' split.
Number of records in the train dataset: 9069 rows.
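
For reference, the filtered training split can be reproduced roughly as follows (a minimal sketch, assuming the dataset exposes a 'train' split and a `condition` column as described on its dataset card):

```python
from datasets import load_dataset

# Load the Drugs.com reviews dataset and keep only reviews for the
# 'Depression' condition (column name assumed from the dataset card).
train_reviews = load_dataset("Zakia/drugscom_reviews", split="train")
depression_train = train_reviews.filter(lambda row: row["condition"] == "Depression")
print(len(depression_train))  # expected to be 9069 rows
```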

### Training Procedure

#### Preprocessing

The reviews were cleaned and preprocessed to remove quotes and HTML tags and to decode HTML entities.
A new column called 'high_quality_review' was added to the reviews.
'high_quality_review' was set to 1 if rating > 5 (a positive rating) and usefulCount > the 75th percentile of usefulCount (65), and to 0 otherwise.
Train dataset high_quality_review counts: Counter({0: 6949, 1: 2120})
The training data was then balanced by downsampling the low quality reviews (high_quality_review = 0), giving a final training set of 4240 reviews:
Train dataset high_quality_review counts: Counter({0: 2120, 1: 2120})
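
A minimal sketch of this labelling and balancing step is shown below. The helper name and the pandas-based approach are illustrative rather than the exact training script, and the column names (`review`, `rating`, `usefulCount`) follow the dataset card:

```python
import html
import re
from collections import Counter

import pandas as pd

def label_and_balance(df: pd.DataFrame, useful_count_threshold: float = 65) -> pd.DataFrame:
    """Illustrative version of the preprocessing described above."""
    df = df.copy()
    # Clean the review text: strip surrounding quotes, decode HTML entities, drop HTML tags.
    df["review"] = (
        df["review"]
        .str.strip('"')
        .apply(lambda text: re.sub(r"<[^>]+>", "", html.unescape(text)))
    )
    # High quality = positive rating (> 5) and usefulCount above the 75th percentile (65 on train).
    df["high_quality_review"] = (
        (df["rating"] > 5) & (df["usefulCount"] > useful_count_threshold)
    ).astype(int)
    # Balance the classes by downsampling the majority (low quality) class.
    n_high = int((df["high_quality_review"] == 1).sum())
    low = df[df["high_quality_review"] == 0].sample(n=n_high, random_state=42)
    high = df[df["high_quality_review"] == 1]
    balanced = pd.concat([low, high]).sample(frac=1, random_state=42).reset_index(drop=True)
    print(Counter(balanced["high_quality_review"]))
    return balanced
```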

#### Training Hyperparameters

- **Learning Rate:** 3e-5
- **Batch Size:** 16
- **Epochs:** 1
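
These settings map onto the Hugging Face `Trainer` API roughly as follows (a sketch only, not the exact training script; `train_dataset` and `eval_dataset` are assumed to be the tokenized, balanced splits prepared in the preprocessing step):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base model with a binary classification head (LABEL_0 = low quality, LABEL_1 = high quality).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="distilbert-drugscom_depression_reviews",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=1,
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```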

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was tested on a dataset of drug reviews specifically related to depression, filtered from Drugs.com.
This dataset is accessible from [Zakia/drugscom_reviews](https://huggingface.co/datasets/Zakia/drugscom_reviews) on Hugging Face datasets (condition = 'Depression') for the 'test' split.
Number of records in the test dataset: 3095 rows.

#### Preprocessing

The reviews were cleaned and preprocessed to remove quotes and HTML tags and to decode HTML entities.
A new column called 'high_quality_review' was added to the reviews.
'high_quality_review' was set to 1 if rating > 5 (a positive rating) and usefulCount > the 75th percentile of usefulCount (65), and to 0 otherwise.
Note: the 75th percentile of usefulCount is based on the train dataset.
Test dataset high_quality_review counts: Counter({0: 2365, 1: 730})

#### Metrics

The model's performance was evaluated based on accuracy.
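
Accuracy can be computed along the following lines (a sketch using the `evaluate` library; the logits and labels are assumed to come from running the fine-tuned model over the preprocessed test split):

```python
import numpy as np
import evaluate

# Accuracy = fraction of reviews whose predicted label
# (0 = low quality, 1 = high quality) matches the reference label.
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
```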

### Results

The fine-tuning process yielded the following results:

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.38          | 0.80            | 0.77     |

The model classifies drug reviews as high or low quality with an accuracy of 77% (low quality: high_quality_review=0; high quality: high_quality_review=1).

## Technical Specifications

### Model Architecture and Objective

The DistilBERT model architecture was used, with a binary classification head for high and low quality review classification.

### Compute Infrastructure

The model was trained using a T4 GPU on Google Colab.

#### Hardware

T4 GPU via Google Colab.

## Citation

If you use this model, please cite the original DistilBERT paper:

**BibTeX:**

```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```

**APA:**

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

## Glossary

- **Low Quality Review:** high_quality_review=0
- **High Quality Review:** high_quality_review=1

## More Information

For further queries or issues with the model, please use the [discussions section on this model's Hugging Face page](https://huggingface.co/Zakia/distilbert-drugscom_depression_reviews/discussions).

## Model Card Authors

- Zakia

## Model Card Contact

For more information or inquiries regarding this model, please use the [discussions section on this model's Hugging Face page](https://huggingface.co/Zakia/distilbert-drugscom_depression_reviews/discussions).