|
---
|
|
language: es
|
|
tags:
|
|
- sagemaker
|
|
- ruperta
|
|
- TextClassification
|
|
- SentimentAnalysis
|
|
license: apache-2.0
|
|
datasets:
|
|
- IMDbreviews_es
|
|
model-index:
|
|
name: RuPERTa_base_sentiment_analysis_es
|
|
results:
|
|
- task:
|
|
name: Sentiment Analysis
|
|
type: sentiment-analysis
|
|
- dataset:
|
|
name: "IMDb Reviews in Spanish"
|
|
type: IMDbreviews_es
|
|
- metrics:
|
|
- name: Accuracy,
|
|
type: accuracy,
|
|
value: 0.881866
|
|
- name: F1 Score,
|
|
type: f1,
|
|
value: 0.008272
|
|
- name: Precision,
|
|
type: precision,
|
|
value: 0.858605
|
|
- name: Recall,
|
|
type: recall,
|
|
value: 0.920062
|
|
widget:
|
|
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
|
|
---
|
|
|
|
## Model `RuPERTa_base_sentiment_analysis_es`
|
|
|
|
### **A finetuned model for Sentiment analysis in Spanish**
|
|
|
|
This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container,
|
|
The base model is **RuPERTa-base (uncased)** which is a RoBERTa model trained on a uncased version of big Spanish corpus.
|
|
It was trained by mrm8488, Manuel Romero.[Link to base model](https://huggingface.co/mrm8488/RuPERTa-base)
|
|
|
|
## Dataset
|
|
The dataset is a collection of movie reviews in Spanish, about 50,000 reviews. The dataset is balanced and provides every review in english, in spanish and the label in both languages.
|
|
|
|
Sizes of datasets:
|
|
- Train dataset: 42,500
|
|
- Validation dataset: 3,750
|
|
- Test dataset: 3,750
|
|
|
|
|
|
## Hyperparameters
|
|
{
|
|
"epochs": "4",
|
|
"train_batch_size": "32",
|
|
"eval_batch_size": "8",
|
|
"fp16": "true",
|
|
"learning_rate": "3e-05",
|
|
"model_name": "\"mrm8488/RuPERTa-base\"",
|
|
"sagemaker_container_log_level": "20",
|
|
"sagemaker_program": "\"train.py\"",
|
|
}
|
|
|
|
## Evaluation results
|
|
Accuracy = 0.8629333333333333
|
|
F1 Score = 0.8648790746582545
|
|
Precision = 0.8479381443298969
|
|
Recall = 0.8825107296137339
|
|
|
|
## Test results
|
|
Accuracy = 0.8066666666666666
|
|
F1 Score = 0.8057862309134743
|
|
Precision = 0.7928307854507116
|
|
Recall = 0.8191721132897604
|
|
|
|
## Model in action
|
|
|
|
### Usage for Sentiment Analysis
|
|
|
|
```python
|
|
import torch
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("edumunozsala/RuPERTa_base_sentiment_analysis_es")
|
|
model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/RuPERTa_base_sentiment_analysis_es")
|
|
|
|
text ="Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
|
|
|
|
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
|
|
outputs = model(input_ids)
|
|
output = outputs.logits.argmax(1)
|
|
```
|
|
|
|
Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)
|
|
|