|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- mediabiasgroup/BABE |
|
language: |
|
- en |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# Model Card for a DistilRoBERTa-Based Text Bias Classifier
|
|
|
|
|
|
This model is designed to detect bias in text. It classifies text inputs as biased or non-biased, supporting the development of more inclusive and fair AI systems. It was fine-tuned from the valurank/distilroberta-bias model for research purposes. Because the training corpus consists of news headlines, the model is best suited to detecting bias in formal language.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
The model is fine-tuned on the MBIC dataset, which contains texts annotated with bias labels.

The model classifies any input text as `Biased` or `Non_biased`. The tokenizer's maximum sequence length is set to 512 tokens.
|
|
|
|
|
|
|
|
|
- **Developed by:** [More Information Needed] |
|
- **Model type:** DistilRoBERTa transformer
|
- **Language(s) (NLP):** English |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** valurank/distilroberta-bias |
|
- **Repository:** ***To be uploaded*** |
|
|
|
### The following sections are under construction... |
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
***Link to the GitHub demo page to be included***
|
|
|
[More Information Needed] |
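
Until the repository and demo are published, the snippet below is a minimal sketch using the Hugging Face `transformers` pipeline. The model id `your-username/distilroberta-bias-finetuned` is a placeholder, since the checkpoint has not been uploaded yet (see **Repository** above); the labels follow the `Biased`/`Non_biased` scheme described in the model description.

```python
# Minimal sketch with the Hugging Face transformers pipeline.
# NOTE: "your-username/distilroberta-bias-finetuned" is a placeholder
# model id; the actual repository has not been uploaded yet.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/distilroberta-bias-finetuned",
)

texts = [
    "The senator's reckless scheme will devastate hard-working families.",
    "The senate passed the bill by a vote of 62 to 38.",
]

# truncation=True keeps inputs within the tokenizer's 512-token limit.
for result in classifier(texts, truncation=True):
    print(result)  # e.g. {'label': 'Biased', 'score': 0.97}
```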
|
|
|
## Training Details |
|
|
|
|
|
|
- **Dataset size:** 1,700 entries
- **Preprocessing:** texts are tokenized with the pre-specified tokenizer, with padding and truncation, to produce numerical features; class labels are encoded numerically
- **Data split:** 80% training / 20% validation, with a fixed random state for reproducibility (a minimal sketch of these steps follows this list)
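
A minimal sketch of this preprocessing, under stated assumptions: the file path, the column names `text` and `label`, the label-to-id mapping, and `random_state=42` are illustrative, since the card only specifies tokenization with padding/truncation and an 80/20 split with a fixed random state.

```python
# Sketch of the preprocessing described above. The path, column names,
# label mapping, and random_state=42 are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

df = pd.read_csv("mbic.csv")  # hypothetical path to the MBIC data

# Encode classes numerically (assumed mapping: Non_biased -> 0, Biased -> 1).
label2id = {"Non_biased": 0, "Biased": 1}
df["label_id"] = df["label"].map(label2id)

# 80/20 train-validation split with a fixed random state for reproducibility.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Tokenize with padding and truncation to produce numerical features.
tokenizer = AutoTokenizer.from_pretrained("valurank/distilroberta-bias")
train_enc = tokenizer(
    train_df["text"].tolist(),
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```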
|
|
|
- **Optimizer:** AdamW
- **Loss function:** CrossEntropyLoss, weighted by class frequencies to address class imbalance
- **Learning rate:** 1e-5
- **Epochs:** 3
- **Batch size:** 16
- **Regularization:** gradient clipping with a max norm of 1.0
- **Learning-rate schedule:** step-based decay with a step size of 3 and a gamma of 0.1
- **Training speed:** around 150 iterations/second with PyTorch on a CUDA GPU; a full run takes under 10 minutes
- **Monitoring:** training loss, validation loss, and validation accuracy are tracked
- **Validation data:** drawn from the same DataFrame `df` via the train-test split described above
- **Fine-tuning techniques:** a learning-rate scheduler adjusts the learning rate during training (see the sketch after this list)
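
Taken together, the configuration above corresponds to a loop like the following minimal sketch. The hyperparameter values come from the list; the loop structure, the `StepLR` choice (inferred from "step size 3, gamma 0.1"), and the inverse-frequency class weighting are assumptions. `train_df` and `train_enc` are the objects from the preprocessing sketch above.

```python
# Sketch of the optimization setup. Hyperparameters come from the card;
# the loop structure, StepLR choice, and inverse-frequency weighting
# are assumptions. train_df/train_enc come from the preprocessing sketch.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(
    "valurank/distilroberta-bias", num_labels=2
).to(device)

train_labels = torch.tensor(train_df["label_id"].tolist())
train_loader = DataLoader(
    TensorDataset(train_enc["input_ids"], train_enc["attention_mask"], train_labels),
    batch_size=16,
    shuffle=True,
)

# Class weights from training-label frequencies (assumed inverse frequency).
counts = torch.bincount(train_labels, minlength=2).float()
class_weights = counts.sum() / (2.0 * counts)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights.to(device))

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=3, gamma=0.1)  # lr decay: step 3, gamma 0.1

model.train()
for epoch in range(3):
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        logits = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
        ).logits
        loss = criterion(logits, labels.to(device))
        loss.backward()
        # Gradient clipping with max norm 1.0, as described above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```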
|
|
|
## Challenges and Solutions |
|
|
|
**Challenges Faced During Training**: The dataset is class-imbalanced, which biases the classifier toward the majority class.

**Solutions and Techniques Applied**: Class weights are computed from the training-label frequencies and passed to a weighted CrossEntropyLoss; gradient clipping (max norm 1.0) keeps updates stable.
|
|
|
|
|
|
|
|
|
## Evaluation

### Metrics
|
|
|
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
#### Summary |
|
|
|
|
|
|
|
### Model Update Log |