πŸ“° NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning

A fine-tuned transformer model that classifies news articles into five topical categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and parsed with BeautifulSoup.


Model Details

Model Description

This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from NPR. The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.

  • Developed by: Manan Gulati
  • Model type: Transformer (text classification)
  • Language(s): English
  • License: MIT
  • Fine-tuned from model: distilbert-base-uncased

Model Sources

  • Repository: https://huggingface.co/mgulati3/news-classifier-model

Uses

Direct Use

This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.

Out-of-Scope Use

  • Not suitable for multi-label classification.
  • Not recommended for non-news or informal text.
  • May not perform well on non-English content.

Bias, Risks, and Limitations

  • The model is trained only on NPR articles, which may carry source-specific bias.
  • Categories are limited to five; nuanced topics may not be accurately captured.
  • Misclassifications may occur for ambiguous or mixed-topic content.

Recommendations

Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
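A minimal sketch of the confidence-based review routing suggested above. The output format (a list of {"label", "score"} dicts) follows the Hugging Face text-classification pipeline convention; the 0.7 threshold is an illustrative choice, not a tuned value.

```python
def route_prediction(pipeline_output, threshold=0.7):
    """Return (label, needs_review) for one pipeline result.

    pipeline_output: list of {"label": str, "score": float} dicts,
    as returned by a Hugging Face text-classification pipeline.
    A prediction below the threshold is flagged for human review.
    """
    best = max(pipeline_output, key=lambda d: d["score"])
    return best["label"], best["score"] < threshold

# Example with a mocked pipeline result:
mock = [{"label": "Science", "score": 0.55},
        {"label": "Climate", "score": 0.45}]
label, needs_review = route_prediction(mock)  # ("Science", True)
```

In a sensitive deployment, flagged items would go to a human queue instead of being auto-tagged.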


How to Get Started

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

# Returns the predicted category with a confidence score.
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")

Training Details

Training Data

5,000 articles were scraped from NPR using Decodo (with proxy rotation and JS rendering), then cleaned and labeled across the five categories using Python and pandas.
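A hedged sketch of the cleaning and labeling step described above. The column names ("url", "text") and the URL-to-category mapping are illustrative assumptions; the card does not specify the exact labeling scheme used.

```python
import pandas as pd

# Assumed mapping from NPR section slugs to the five card categories.
CATEGORY_BY_SECTION = {
    "politics": "Politics", "business": "Business",
    "health": "Health", "science": "Science", "climate": "Climate",
}

def label_articles(df: pd.DataFrame) -> pd.DataFrame:
    """Clean scraped rows and derive labels from the NPR section slug."""
    df = df.copy()
    df["text"] = df["text"].str.strip()
    df = df[df["text"].str.len() > 0]
    # Extract the section slug, e.g. npr.org/sections/science/... -> "science".
    slugs = df["url"].str.extract(r"npr\.org/sections/(\w+)/")[0]
    df["label"] = slugs.map(CATEGORY_BY_SECTION)
    # Drop rows from sections outside the five target categories.
    return df.dropna(subset=["label"])
```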

Training Procedure

  • Tokenizer: DistilBERT tokenizer (distilbert-base-uncased)
  • Preprocessing: Lowercasing, truncation, padding
  • Epochs: 4
  • Optimizer: AdamW
  • Batch size: 16

Evaluation

Testing Data

20% of the dataset was held out for testing via a random stratified split, preserving the category proportions in both sets.
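The 80/20 stratified split can be sketched with scikit-learn's train_test_split; the toy data and random_state are illustrative.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the article texts and labels.
texts = [f"article {i}" for i in range(100)]
labels = ["Politics"] * 40 + ["Business"] * 30 + ["Science"] * 30

# stratify=labels keeps each category's share equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)
```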

Metrics

  • Metric: Accuracy (single-label, top-1)
  • Accuracy (Train): 85%
  • Accuracy (Test): 60%
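The reported top-1 accuracy is simply the fraction of predictions that exactly match the gold label; a minimal sketch:

```python
def accuracy(preds, golds):
    """Top-1, single-label accuracy: exact-match fraction."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# e.g. 3 of 4 correct -> 0.75
accuracy(["Politics", "Health", "Science", "Climate"],
         ["Politics", "Health", "Business", "Climate"])
```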

Results

The model fits the training distribution well (85% train accuracy), but the drop to 60% test accuracy indicates overfitting. It performs best on clearly single-topic news with distinguishable category patterns; expect weaker results on ambiguous or mixed-topic content.


Environmental Impact

  • Hardware Type: Google Colab GPU (T4)
  • Hours used: ~2.5
  • Cloud Provider: Google
  • Compute Region: US
  • Carbon Emitted: Estimated ~0.2 kgCO2eq

Technical Specifications

Model Architecture

DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.
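The softmax head described above turns the five category logits into a probability distribution, and the argmax gives the predicted label. A self-contained sketch with made-up logit values:

```python
import math

CATEGORIES = ["Politics", "Business", "Health", "Science", "Climate"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits from the 5-way classification head.
logits = [0.2, -1.1, 0.0, 2.3, 0.4]
probs = softmax(logits)
predicted = CATEGORIES[probs.index(max(probs))]  # "Science"
```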

Compute Infrastructure

  • Google Colab Pro
  • Python 3.10
  • Hugging Face Transformers 4.x
  • PyTorch backend

Citation

APA:

Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model

BibTeX:

@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}


Model Card Contact

For questions or collaborations: [email protected]
