HealthNewsBRT - BERT Classification Model for Brazilian Portuguese News Articles
Introduction
This repository contains a BERT-based model that classifies Brazilian Portuguese (pt-BR) news articles into two categories: Health News (LABEL_0) and Non-Health News (LABEL_1).
Pretrained Model (BERTimbau)
For this project, we started from BERTimbau, a BERT model pretrained on Brazilian Portuguese text, and fine-tuned it for the health news classification task.
Classification report
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| LABEL_0 | 0.96 | 0.95 | 0.95 | 14000 |
| LABEL_1 | 0.95 | 0.96 | 0.96 | 14000 |
| Accuracy | | | 0.95 | 28000 |
| Macro Avg | 0.96 | 0.95 | 0.95 | 28000 |
| Weighted Avg | 0.96 | 0.95 | 0.95 | 28000 |
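For readers unfamiliar with these columns, the following is a minimal sketch of how per-class precision, recall, and F1 are derived from true and predicted labels. The function name `precision_recall_f1` and the toy label lists are illustrative assumptions, not the evaluation code used for this model.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class, treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only (0 = Health News, 1 = Non-Health News);
# these are NOT the model's real predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

for label in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, positive=label)
    print(f"LABEL_{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The macro average in the report is the unweighted mean of the per-class values. Because both classes have the same support, it coincides with the weighted average here.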
Dataset
For training and evaluation, we used a dataset consisting of 28,000 labeled news articles in Portuguese. The dataset is divided as follows:
- 14,000 samples of Health News (LABEL_0): These articles are related to various health topics, such as medical discoveries, healthcare policies, and wellness.
- 14,000 samples of Non-Health News (LABEL_1): These articles cover a wide range of subjects that do not fall under the health category, including politics, sports, entertainment, and more.
The dataset was collected and preprocessed to ensure consistency and quality in labeling and text formatting.
Data Splitting
To assess the model's performance, we split the dataset into training and testing subsets. We used an 80-20 split, with 80% of the data used for training and 20% for testing. This split helps us evaluate how well the model generalizes to new, unseen data.
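As a rough illustration of the 80-20 split described above, the sketch below shuffles a list of labeled samples and holds out 20% for testing. The helper name `split_80_20`, the fixed seed, and the placeholder records are assumptions made for the example. The original split procedure (for instance, whether it was stratified by label) is not documented here.

```python
import random


def split_80_20(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder (text, label) records standing in for the 28,000 labeled articles.
dataset = [(f"article {i}", i % 2) for i in range(28000)]

train, test = split_80_20(dataset)
print(len(train), len(test))  # 22400 5600
```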
Usage
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
tokenizer = BertTokenizer.from_pretrained('raphaelfontes/HealthNewsBRT')
model = BertForSequenceClassification.from_pretrained('raphaelfontes/HealthNewsBRT')

# Define a news article
news_article = "This is a news article in Portuguese about a health-related topic."

# Tokenize and encode the news article
inputs = tokenizer(news_article, return_tensors='pt', padding=True, truncation=True)

# Make predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get the index of the highest-scoring class
predicted_label = torch.argmax(outputs.logits, dim=-1).item()

# Map label to human-readable category:
# LABEL_0 is Health News, LABEL_1 is Non-Health News
if predicted_label == 0:
    category = "Health News"
else:
    category = "Non-Health News"

print(f"The article is categorized as: {category}")
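
If a confidence score is wanted alongside the category, the raw logits can be passed through a softmax. The sketch below is a standalone illustration using only the standard library. The helper `logits_to_prediction`, the `ID2LABEL` mapping, and the example logit values are assumptions for demonstration. With the real model, the input could be `outputs.logits[0].tolist()`.

```python
import math

# Label mapping as described in this document.
ID2LABEL = {0: "Health News", 1: "Non-Health News"}


def logits_to_prediction(logits):
    """Convert raw logits into (category, probability) using a softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration; not real model output.
category, confidence = logits_to_prediction([2.0, -1.0])
print(f"{category} (confidence {confidence:.2f})")
```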
Evaluation results
- Accuracy (self-reported): 0.950
- F1 Score (self-reported): 0.950