---
title: Paper Classifier
emoji: π
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
---
# π Academic Paper Classifier
This Streamlit application helps classify academic papers into different categories using a BERT-based model.
## Features
- **Text Classification**: Paste any paper text directly (a minimal UI sketch follows this list)
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields
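As a rough illustration of the text-classification flow above, here is a minimal Streamlit sketch; the `classify` helper is a hypothetical stand-in for the app's actual BERT model:

```python
import pandas as pd
import streamlit as st

def classify(text: str) -> dict:
    # Hypothetical placeholder: the real app (app.py) runs a BERT-based
    # model here and returns per-category confidence scores.
    categories = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]
    return {c: 1.0 / len(categories) for c in categories}

st.title("Academic Paper Classifier")
text = st.text_area("Paste your paper's abstract or full text")

if st.button("Classify Text") and text:
    probs = classify(text)
    st.write(f"Predicted category: **{max(probs, key=probs.get)}**")
    st.bar_chart(pd.Series(probs))  # probability distribution over categories
```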
## How to Use
### Text Input
- Paste your paper's text (abstract or full content)
- Click "Classify Text"
- View results and probability distribution
### PDF Upload
- Upload a PDF file of your paper
- Click "Classify PDF"
- Get classification results
## Categories
The model classifies papers into the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Technical Details
- Built with Streamlit
- Uses BERT-based model for classification
- Supports PDF file processing (see the extraction sketch after this list)
- Real-time classification
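The PDF path presumably extracts text before running the same classifier; a minimal sketch using `pypdf` (the app may rely on a different PDF library):

```python
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page in the PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# The extracted text can then be classified exactly like pasted text.
text = extract_pdf_text("paper.pdf")
```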
## About
This application is designed to help researchers, students, and academics quickly identify the primary field of academic papers. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.
Created with ❤️ using Streamlit and Transformers
## Setup
- Install `uv` (if not already installed):

```bash
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
```

- Create and activate a virtual environment:

```bash
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate     # On Windows
```

- Install the dependencies using `uv`:

```bash
uv pip install -r requirements.lock
```

- Run the Streamlit application:

```bash
streamlit run app.py
```
## Usage
### Text Classification
- Paste the paper's text (abstract or content) into the text area
- Click "Classify Text" to get results
### PDF Classification
- Upload a PDF file using the file uploader
- Click "Classify PDF" to process and classify the document
## Model Information
The service uses a BERT-based model for classification with the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Note
The current implementation uses a base BERT model. For production use, you should:
- Fine-tune the model on a dataset of academic papers (a fine-tuning sketch follows this list)
- Adjust the categories based on your specific needs
- Implement proper error handling and validation
- Add authentication if needed
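For the first item, one common fine-tuning recipe uses the Hugging Face `Trainer`. This is a minimal sketch with toy data; the checkpoint and hyperparameters (which mirror the `train_on_arxiv` example further down) are assumptions, not the repo's actual training code:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Toy data: replace with a real corpus of paper abstracts and labels.
texts = ["A bound on prime gaps", "A new attention architecture"]
targets = [1, 0]  # indices into `labels` (1 = Mathematics, 0 = Computer Science)

enc = tokenizer(texts, truncation=True, padding=True)

class PaperDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(targets)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(targets[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-paper-classifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=PaperDataset(),
)
trainer.train()
```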
## Package Management

This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.
To update dependencies:

```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate the lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```
## Requirements

See `requirements.txt` for a complete list of dependencies.
# ArXiv Paper Classifier
This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.
## Project Overview
The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories (a label-mapping sketch follows the list):
- Computer Science (`cs`)
- Mathematics (`math`)
- Physics (`physics`)
- Quantitative Biology (`q-bio`)
- Quantitative Finance (`q-fin`)
- Statistics (`stat`)
- Electrical Engineering and Systems Science (`eess`)
- Economics (`econ`)
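A small sketch of how those category codes might be wired into label mappings for a classifier head; the names here are hypothetical, not necessarily those used in `model.py`:

```python
# ArXiv top-level category codes and their human-readable names.
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}

id2label = dict(enumerate(ARXIV_CATEGORIES))          # {0: "cs", 1: "math", ...}
label2id = {code: i for i, code in id2label.items()}  # {"cs": 0, "math": 1, ...}
```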
## Features
**Multiple model support:**

- **DistilBERT**: Lightweight and fast model, good for testing
- **DeBERTa-v3**: Advanced model with better performance
- **RoBERTa**: Advanced model with strong performance
- **SciBERT**: Specialized for scientific text
- **BERT**: Classic model with good all-round performance

**Flexible input handling** (see the tokenization sketch after this list):

- Can process both title and abstract
- Handles text preprocessing and tokenization
- Supports different maximum sequence lengths

**Robust error handling:**

- Multiple fallback mechanisms for tokenizer initialization
- Graceful degradation to simpler models if needed
- Detailed error messages and logging
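A sketch of the title/abstract handling described above, using the Hugging Face `AutoTokenizer`; the checkpoint and the `prepare_input` helper are illustrative assumptions, not the repo's actual code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def prepare_input(title: str, abstract: str, max_length: int = 512):
    """Join title and abstract, then tokenize with truncation and padding."""
    text = f"{title}. {abstract}" if title else abstract
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )

batch = prepare_input("Attention Is All You Need", "We propose a new network architecture...")
```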
## Installation
- Clone the repository
- Install dependencies:

```bash
pip install -r requirements.txt
```
## Usage
### Basic Usage
```python
from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)
```
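The exact shape of `result` depends on the implementation in `model.py`; typically such a classifier returns a mapping from category codes to confidence scores, e.g. `{"cs": 0.91, "math": 0.04, ...}` (illustrative values only).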
### Using Different Models
```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```
### Training on Custom Data
```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```
## Model Details
### Available Models
**DistilBERT** (`distilbert`)
- Model: `distilbert-base-cased`
- Max length: 512 tokens
- Fast tokenizer
- Good for testing and quick results

**DeBERTa-v3** (`deberta-v3`)
- Model: `microsoft/deberta-v3-base`
- Max length: 512 tokens
- Uses `DebertaV2TokenizerFast`
- Advanced performance

**RoBERTa** (`roberta`)
- Model: `roberta-base`
- Max length: 512 tokens
- Strong performance on various tasks

**SciBERT** (`scibert`)
- Model: `allenai/scibert_scivocab_uncased`
- Max length: 512 tokens
- Specialized for scientific text

**BERT** (`bert`)
- Model: `bert-base-uncased`
- Max length: 512 tokens
- Classic model with good all-round performance
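One plausible way to encode the list above in code is a small registry keyed by the `model_type` argument; `MODEL_REGISTRY` and `MAX_LENGTH` are hypothetical names, and `model.py` may organize this differently:

```python
# Maps the `model_type` constructor argument to its Hugging Face checkpoint.
MODEL_REGISTRY = {
    "distilbert": "distilbert-base-cased",
    "deberta-v3": "microsoft/deberta-v3-base",
    "roberta": "roberta-base",
    "scibert": "allenai/scibert_scivocab_uncased",
    "bert": "bert-base-uncased",
}

MAX_LENGTH = 512  # all listed models share the same maximum sequence length
```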
## Error Handling
The system includes robust error handling mechanisms (a sketch follows this list):
- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to BERT tokenizer if needed
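A minimal sketch of the fallback chain described above; the actual logic in `model.py` may differ:

```python
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(model_name: str):
    """Try a fast tokenizer, then a slow one, then fall back to BERT."""
    try:
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)
    except Exception:
        pass
    try:
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception:
        # Last resort: the "automatic fallback to BERT tokenizer" noted above.
        return AutoTokenizer.from_pretrained("bert-base-uncased")
```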
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)