---
license: apache-2.0
language:
- pt
metrics:
- accuracy
- f1
pipeline_tag: text-classification
tags:
- news
- health
- classification
model-index:
- name: raphaelfontes/HealthNewsBRT
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
      verified: false
    - type: f1
      value: 0.95
      name: F1 Score
      verified: false
---

# HealthNewsBRT - BERT Classification Model for Brazilian Portuguese News Articles

## Introduction

This repository contains a BERT-based model that classifies Brazilian Portuguese (pt-BR) news articles into two categories: Health News (LABEL_0) and Non-Health News (LABEL_1). It is designed to help identify whether a given news article covers a health-related topic.
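
For a quick smoke test, the model can be called through the `transformers` pipeline API. This is a minimal sketch: the example headline is made up, and the raw output uses the `LABEL_0`/`LABEL_1` ids described above.

```python
from transformers import pipeline

# Load the classifier through the high-level pipeline API
classifier = pipeline("text-classification", model="raphaelfontes/HealthNewsBRT")

# Hypothetical Portuguese headline ("Ministry of Health announces new flu vaccination campaign")
result = classifier("Ministério da Saúde anuncia nova campanha de vacinação contra a gripe.")
print(result)  # e.g. [{'label': 'LABEL_0', 'score': ...}] -> LABEL_0 = Health News
```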

### Pretrained Model (BERTimbau)

For this project, we used the [BERTimbau model](https://huggingface.co/neuralmind/bert-base-portuguese-cased), a BERT model pretrained on Brazilian Portuguese text, as the base for fine-tuning on this classification task.
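
A minimal sketch of how such a fine-tuning setup is typically initialized from the BERTimbau checkpoint (the actual training hyperparameters of this model are not documented here, so this only shows the starting point):

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Start from the BERTimbau checkpoint and attach a fresh 2-class classification head
tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertForSequenceClassification.from_pretrained(
    "neuralmind/bert-base-portuguese-cased",
    num_labels=2,  # LABEL_0 = Health News, LABEL_1 = Non-Health News
)
```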

## Classification report

|                            | Precision | Recall | F1-Score | Support |
|----------------------------|-----------|--------|----------|---------|
| LABEL_0 (Health News)      | 0.96      | 0.95   | 0.95     | 14000   |
| LABEL_1 (Non-Health News)  | 0.95      | 0.96   | 0.96     | 14000   |
| Accuracy                   |           |        | 0.95     | 28000   |
| Macro Avg                  | 0.96      | 0.95   | 0.95     | 28000   |
| Weighted Avg               | 0.96      | 0.95   | 0.95     | 28000   |
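
A report in this format can be produced with scikit-learn's `classification_report`, given gold labels and model predictions on the test set; a sketch with placeholder data:

```python
from sklearn.metrics import classification_report

# Placeholder labels: 0 = Health News (LABEL_0), 1 = Non-Health News (LABEL_1)
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(classification_report(y_true, y_pred, target_names=["LABEL_0", "LABEL_1"]))
```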

## Dataset

For training and evaluation, we used a dataset of 28,000 labeled news articles in Portuguese, divided as follows:

- **14,000 samples of Health News (LABEL_0)**: articles on health topics such as medical discoveries, healthcare policies, and wellness.
- **14,000 samples of Non-Health News (LABEL_1)**: articles on subjects outside the health category, including politics, sports, entertainment, and more.

The dataset was collected and preprocessed to ensure consistent labeling and text formatting.
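
As a sanity check on this balance, a hypothetical `dataset.csv` with `text` and `label` columns (the file and column names are assumptions, not artifacts shipped with this repository) could be inspected like so:

```python
import pandas as pd

# Hypothetical CSV: one article per row, columns "text" and "label" (0 or 1)
df = pd.read_csv("dataset.csv")
print(df["label"].value_counts())  # expected: 14000 rows per class
```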

## Data Splitting

To assess the model's performance, we split the dataset into training and testing subsets using an 80-20 split: 80% of the data (22,400 articles) for training and 20% (5,600 articles) for testing. This held-out test set lets us evaluate how well the model generalizes to new, unseen data.
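
A split like this is commonly done with scikit-learn's `train_test_split`; a sketch assuming the hypothetical `dataset.csv` from the dataset example above, with stratification to preserve the 50/50 class balance:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset file, as in the sketch above
df = pd.read_csv("dataset.csv")

# 80/20 split, stratified so both subsets keep the balanced label distribution
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)
print(len(train_texts), len(test_texts))  # 22400 / 5600 for 28,000 articles
```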

## Usage

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
tokenizer = BertTokenizer.from_pretrained('raphaelfontes/HealthNewsBRT')
model = BertForSequenceClassification.from_pretrained('raphaelfontes/HealthNewsBRT')
model.eval()

# Define a news article (the model expects Portuguese text)
# ("Ministry of Health announces new flu vaccination campaign")
news_article = "Ministério da Saúde anuncia nova campanha de vacinação contra a gripe."

# Tokenize and encode the news article
inputs = tokenizer(news_article, return_tensors='pt', padding=True, truncation=True)

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted label id
predicted_label = torch.argmax(outputs.logits, dim=-1).item()

# Map the label id to a human-readable category (LABEL_0 = Health News)
if predicted_label == 0:
    category = "Health News"
else:
    category = "Non-Health News"

print(f"The article is categorized as: {category}")
```
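
To classify several articles at once, the same tokenizer and model loaded above can be run over a batch (a sketch; the example headlines are made up):

```python
# Batch inference: tokenize a list of articles together and take the argmax per row
articles = [
    "Anvisa aprova novo medicamento para diabetes.",  # health-related (hypothetical)
    "Seleção brasileira vence amistoso por 2 a 0.",   # sports (hypothetical)
]
batch = tokenizer(articles, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
for text, label_id in zip(articles, logits.argmax(dim=-1).tolist()):
    print(text, "->", "Health News" if label_id == 0 else "Non-Health News")
```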