---
license: apache-2.0
language:
- pt
metrics:
- accuracy
- f1
pipeline_tag: text-classification
tags:
- news
- health
- classification
model-index:
- name: raphaelfontes/HealthNewsBRT
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
      verified: false
    - type: f1
      value: 0.95
      name: F1 Score
      verified: false
---

# HealthNewsBRT - BERT Classification Model for Brazilian Portuguese News Articles

## Introduction

This repository contains a BERT-based model that classifies Brazilian Portuguese (pt-BR) news articles into two categories: Health News (LABEL_0) and Non-Health News (LABEL_1). It is designed to help identify whether a given news article covers a health-related topic.
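
For a quick smoke test, the model can be called through the `transformers` pipeline API. This is a minimal sketch: the example headline is made up, and the raw output uses the `LABEL_0`/`LABEL_1` ids described above.

```python
from transformers import pipeline

# Load the classifier through the high-level pipeline API
classifier = pipeline("text-classification", model="raphaelfontes/HealthNewsBRT")

# Hypothetical Portuguese headline ("Ministry of Health announces new flu vaccination campaign")
result = classifier("Ministério da Saúde anuncia nova campanha de vacinação contra a gripe.")
print(result)  # e.g. [{'label': 'LABEL_0', 'score': ...}] -> LABEL_0 = Health News
```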

### Pretrained Model (BERTimbau)

For this project, we used the [BERTimbau model](https://huggingface.co/neuralmind/bert-base-portuguese-cased), a BERT model pretrained on Brazilian Portuguese text, as the base for fine-tuning on this classification task.
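
A minimal sketch of how such a fine-tuning setup is typically initialized from the BERTimbau checkpoint (the actual training hyperparameters of this model are not documented here, so this only shows the starting point):

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Start from the BERTimbau checkpoint and attach a fresh 2-class classification head
tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertForSequenceClassification.from_pretrained(
    "neuralmind/bert-base-portuguese-cased",
    num_labels=2,  # LABEL_0 = Health News, LABEL_1 = Non-Health News
)
```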

## Classification report

|                            | Precision | Recall | F1-Score | Support |
|----------------------------|-----------|--------|----------|---------|
| LABEL_0 (Health News)      | 0.96      | 0.95   | 0.95     | 14000   |
| LABEL_1 (Non-Health News)  | 0.95      | 0.96   | 0.96     | 14000   |
| Accuracy                   |           |        | 0.95     | 28000   |
| Macro Avg                  | 0.96      | 0.95   | 0.95     | 28000   |
| Weighted Avg               | 0.96      | 0.95   | 0.95     | 28000   |
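
A report in this format can be produced with scikit-learn's `classification_report`, given gold labels and model predictions on the test set; a sketch with placeholder data:

```python
from sklearn.metrics import classification_report

# Placeholder labels: 0 = Health News (LABEL_0), 1 = Non-Health News (LABEL_1)
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(classification_report(y_true, y_pred, target_names=["LABEL_0", "LABEL_1"]))
```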

## Dataset

For training and evaluation, we used a dataset of 28,000 labeled news articles in Portuguese, divided as follows:

- **14,000 samples of Health News (LABEL_0)**: articles on health topics such as medical discoveries, healthcare policies, and wellness.
- **14,000 samples of Non-Health News (LABEL_1)**: articles on subjects outside the health category, including politics, sports, entertainment, and more.

The dataset was collected and preprocessed to ensure consistent labeling and text formatting.
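
As a sanity check on this balance, a hypothetical `dataset.csv` with `text` and `label` columns (the file and column names are assumptions, not artifacts shipped with this repository) could be inspected like so:

```python
import pandas as pd

# Hypothetical CSV: one article per row, columns "text" and "label" (0 or 1)
df = pd.read_csv("dataset.csv")
print(df["label"].value_counts())  # expected: 14000 rows per class
```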

## Data Splitting

To assess the model's performance, we split the dataset into training and testing subsets using an 80-20 split: 80% of the data (22,400 articles) for training and 20% (5,600 articles) for testing. This held-out test set lets us evaluate how well the model generalizes to new, unseen data.
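
A split like this is commonly done with scikit-learn's `train_test_split`; a sketch assuming the hypothetical `dataset.csv` from the dataset example above, with stratification to preserve the 50/50 class balance:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset file, as in the sketch above
df = pd.read_csv("dataset.csv")

# 80/20 split, stratified so both subsets keep the balanced label distribution
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)
print(len(train_texts), len(test_texts))  # 22400 / 5600 for 28,000 articles
```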

## Usage

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
tokenizer = BertTokenizer.from_pretrained('raphaelfontes/HealthNewsBRT')
model = BertForSequenceClassification.from_pretrained('raphaelfontes/HealthNewsBRT')
model.eval()

# Define a news article (the model expects Portuguese text)
# ("Ministry of Health announces new flu vaccination campaign")
news_article = "Ministério da Saúde anuncia nova campanha de vacinação contra a gripe."

# Tokenize and encode the news article
inputs = tokenizer(news_article, return_tensors='pt', padding=True, truncation=True)

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted label id
predicted_label = torch.argmax(outputs.logits, dim=-1).item()

# Map the label id to a human-readable category (LABEL_0 = Health News)
if predicted_label == 0:
    category = "Health News"
else:
    category = "Non-Health News"

print(f"The article is categorized as: {category}")
```
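
To classify several articles at once, the same tokenizer and model loaded above can be run over a batch (a sketch; the example headlines are made up):

```python
# Batch inference: tokenize a list of articles together and take the argmax per row
articles = [
    "Anvisa aprova novo medicamento para diabetes.",  # health-related (hypothetical)
    "Seleção brasileira vence amistoso por 2 a 0.",   # sports (hypothetical)
]
batch = tokenizer(articles, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
for text, label_id in zip(articles, logits.argmax(dim=-1).tolist()):
    print(text, "->", "Health News" if label_id == 0 else "Non-Health News")
```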