📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning
A fine-tuned transformer-based model that classifies news articles into five topical categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.
Model Details
Model Description
This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from NPR. The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.
- Developed by: Manan Gulati
- Model type: Transformer (text classification)
- Language(s): English
- License: MIT
- Fine-tuned from model: distilbert-base-uncased
Model Sources
- Repository: https://github.com/mgulati3/Fine-Tune
- Demo: https://huggingface.co/spaces/mgulati3/news-classifier-ui
- Model Hub: https://huggingface.co/mgulati3/news-classifier-model
Uses
Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.
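For feed curation or auto-tagging, the pipeline can also be applied to a batch of texts in one call. A minimal sketch (the article strings below are illustrative placeholders, not items from the training set):

from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

articles = [
    "The Senate passed a new spending bill late on Friday.",
    "Researchers report promising results from a malaria vaccine trial.",
]

# Passing a list returns one {"label", "score"} dict per article.
for article, result in zip(articles, classifier(articles)):
    print(f'{result["label"]:<10} {result["score"]:.2f}  {article[:45]}')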
Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.
Bias, Risks, and Limitations
- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.
Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
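A minimal sketch of using the confidence score as a review gate (the 0.7 cutoff is an illustrative value, not a tuned threshold):

from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

result = classifier("Oil prices climbed after the central bank's announcement.")[0]
# result is a dict such as {"label": "...", "score": 0.87}
if result["score"] < 0.7:  # illustrative threshold; tune it for your application
    print("Low confidence - route to human review:", result)
else:
    print("Predicted category:", result["label"])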
How to Get Started
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
# Returns the predicted category and its confidence score for the given text.
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
Training Details
Training Data
Scraped 5,000 articles from NPR using Decodo (with proxy rotation and JS rendering). Articles were cleaned and labeled across five categories using Python and pandas.
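A minimal sketch of the cleaning and labeling step, assuming the raw HTML pages have already been fetched via Decodo and saved to disk (the folder layout, tag selection, and file names here are illustrative assumptions, not the exact pipeline used):

import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path

rows = []
# Assumed layout: one folder of saved HTML pages per category, e.g. raw/politics/*.html
for category_dir in Path("raw").iterdir():
    for page in category_dir.glob("*.html"):
        soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
        text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        if text:
            rows.append({"text": text, "label": category_dir.name})

# Deduplicate and persist the labeled dataset for the fine-tuning step.
df = pd.DataFrame(rows).drop_duplicates(subset="text")
df.to_csv("npr_articles.csv", index=False)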
Training Procedure
- Tokenizer: distilbert-base-uncased tokenizer (matching the base model)
- Preprocessing: Lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16 (see the fine-tuning sketch below)
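A minimal sketch of this setup with the Hugging Face Trainer, assuming train.csv and test.csv files with "text" and "label" columns (produced by the stratified split sketched under Evaluation below); the file names, column names, and max sequence length are illustrative assumptions rather than the exact training script:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Assumed files: train.csv / test.csv with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
labels = sorted(set(dataset["train"]["label"]))
label2id = {name: i for i, name in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # The uncased tokenizer lowercases internally; truncate/pad to a fixed length.
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = [label2id[name] for name in batch["label"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "label"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label={i: name for name, i in label2id.items()},
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="news-classifier-model",
    num_train_epochs=4,               # 4 epochs
    per_device_train_batch_size=16,   # batch size 16
)  # the Trainer optimizes with AdamW by default

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()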
Evaluation
Testing Data
20% of the dataset was held out for testing, using a random stratified split over the five categories.
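A minimal sketch of such a split with scikit-learn, continuing from the hypothetical npr_articles.csv produced in the training-data step (file and column names are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("npr_articles.csv")
# stratify keeps the five categories in the same proportions in both splits
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)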
Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1); see the evaluation sketch below
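Top-1 accuracy on the held-out split can be computed from the Trainer predictions. A minimal sketch, continuing the hypothetical fine-tuning sketch above (trainer and tokenized refer to that sketch):

import numpy as np
from sklearn.metrics import accuracy_score

preds = trainer.predict(tokenized["test"])
y_pred = np.argmax(preds.predictions, axis=-1)
print("test accuracy:", accuracy_score(preds.label_ids, y_pred))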
Results
The model performs best on domain-specific news content with clearly distinguishable category patterns. The gap between training (85%) and test (60%) accuracy suggests some overfitting to the NPR training articles, so performance on out-of-domain text may be lower.
Environmental Impact
- Hardware Type: Google Colab GPU (T4)
- Hours used: ~2.5
- Cloud Provider: Google
- Compute Region: US
- Carbon Emitted: Estimated ~0.2 kgCO2eq
Technical Specifications
Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.
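A minimal sketch of inspecting that classification head from the published checkpoint (the exact label names depend on how the model config was exported):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("mgulati3/news-classifier-model")
print(config.num_labels)   # expected: 5
print(config.id2label)     # index-to-category mapping

# The pipeline applies a softmax over these 5 logits to produce the reported score.
model = AutoModelForSequenceClassification.from_pretrained("mgulati3/news-classifier-model")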
Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend
Citation
APA:
Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model
BibTeX:
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}
Model Card Contact
For questions or collaborations: [email protected]