📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning
A fine-tuned transformer-based model that classifies news articles into five topical categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.
Model Details
Model Description
This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from NPR. The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.
- Developed by: Manan Gulati
- Model type: Transformer (text classification)
- Language(s): English
- License: MIT
- Fine-tuned from model: distilbert-base-uncased
Model Sources
- Repository: https://github.com/mgulati3/Fine-Tune
- Demo: https://huggingface.co/spaces/mgulati3/news-classifier-ui
- Model Hub: https://huggingface.co/mgulati3/news-classifier-model
Uses
Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.
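For feed curation or auto-tagging, the pipeline can also be applied to a batch of texts in one call. A minimal sketch (the article strings below are illustrative placeholders, not items from the training set):

from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

articles = [
    "The Senate passed a new spending bill late on Friday.",
    "Researchers report promising results from a malaria vaccine trial.",
]

# Passing a list returns one {"label", "score"} dict per article.
for article, result in zip(articles, classifier(articles)):
    print(f'{result["label"]:<10} {result["score"]:.2f}  {article[:45]}')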
Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.
Bias, Risks, and Limitations
- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.
Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
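A minimal sketch of using the confidence score as a review gate (the 0.7 cutoff is an illustrative value, not a tuned threshold):

from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

result = classifier("Oil prices climbed after the central bank's announcement.")[0]
# result is a dict such as {"label": "...", "score": 0.87}
if result["score"] < 0.7:  # illustrative threshold; tune it for your application
    print("Low confidence - route to human review:", result)
else:
    print("Predicted category:", result["label"])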
How to Get Started
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
# Returns the predicted category and its confidence score for the given text.
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
Training Details
Training Data
Scraped 5,000 articles from NPR using Decodo (with proxy rotation and JS rendering). Articles were cleaned and labeled across five categories using Python and pandas.
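A minimal sketch of the cleaning and labeling step, assuming the raw HTML pages have already been fetched via Decodo and saved to disk (the folder layout, tag selection, and file names here are illustrative assumptions, not the exact pipeline used):

import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path

rows = []
# Assumed layout: one folder of saved HTML pages per category, e.g. raw/politics/*.html
for category_dir in Path("raw").iterdir():
    for page in category_dir.glob("*.html"):
        soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
        text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        if text:
            rows.append({"text": text, "label": category_dir.name})

# Deduplicate and persist the labeled dataset for the fine-tuning step.
df = pd.DataFrame(rows).drop_duplicates(subset="text")
df.to_csv("npr_articles.csv", index=False)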
Training Procedure
- Tokenizer: distilbert-base-uncased tokenizer (matching the base model)
- Preprocessing: Lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16 (see the fine-tuning sketch below)
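A minimal sketch of this setup with the Hugging Face Trainer, assuming train.csv and test.csv files with "text" and "label" columns (produced by the stratified split sketched under Evaluation below); the file names, column names, and max sequence length are illustrative assumptions rather than the exact training script:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Assumed files: train.csv / test.csv with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
labels = sorted(set(dataset["train"]["label"]))
label2id = {name: i for i, name in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # The uncased tokenizer lowercases internally; truncate/pad to a fixed length.
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = [label2id[name] for name in batch["label"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "label"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label={i: name for name, i in label2id.items()},
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="news-classifier-model",
    num_train_epochs=4,               # 4 epochs
    per_device_train_batch_size=16,   # batch size 16
)  # the Trainer optimizes with AdamW by default

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()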
Evaluation
Testing Data
20% of the dataset was held out for testing, using a random stratified split over the five categories.
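A minimal sketch of such a split with scikit-learn, continuing from the hypothetical npr_articles.csv produced in the training-data step (file and column names are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("npr_articles.csv")
# stratify keeps the five categories in the same proportions in both splits
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)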
Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1); see the evaluation sketch below
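Top-1 accuracy on the held-out split can be computed from the Trainer predictions. A minimal sketch, continuing the hypothetical fine-tuning sketch above (trainer and tokenized refer to that sketch):

import numpy as np
from sklearn.metrics import accuracy_score

preds = trainer.predict(tokenized["test"])
y_pred = np.argmax(preds.predictions, axis=-1)
print("test accuracy:", accuracy_score(preds.label_ids, y_pred))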
Results
The model performs best on domain-specific news content with clearly distinguishable category patterns. The gap between training (85%) and test (60%) accuracy suggests some overfitting to the NPR training articles, so performance on out-of-domain text may be lower.
Environmental Impact
- Hardware Type: Google Colab GPU (T4)
- Hours used: ~2.5
- Cloud Provider: Google
- Compute Region: US
- Carbon Emitted: Estimated ~0.2 kgCO2eq
Technical Specifications
Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.
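A minimal sketch of inspecting that classification head from the published checkpoint (the exact label names depend on how the model config was exported):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("mgulati3/news-classifier-model")
print(config.num_labels)   # expected: 5
print(config.id2label)     # index-to-category mapping

# The pipeline applies a softmax over these 5 logits to produce the reported score.
model = AutoModelForSequenceClassification.from_pretrained("mgulati3/news-classifier-model")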
Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend
Citation
APA:
Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model
BibTeX:
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}
Model Card Contact
For questions or collaborations: [email protected]