---
title: Paper Classifier
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
---

📚 Academic Paper Classifier


This Streamlit application helps classify academic papers into different categories using a BERT-based model.

Features

  • Text Classification: Paste any paper text directly
  • PDF Support: Upload PDF files for classification
  • Real-time Analysis: Get instant classification results
  • Probability Distribution: See confidence scores for each category
  • Multiple Categories: Supports various academic fields

How to Use

  1. Text Input

    • Paste your paper's text (abstract or full content)
    • Click "Classify Text"
    • View results and probability distribution
  2. PDF Upload

    • Upload a PDF file of your paper
    • Click "Classify PDF"
    • Get classification results
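
As a rough illustration, the text-input flow above might be wired together in app.py along these lines. This is a minimal sketch: classify_text is a hypothetical stand-in for the app's actual BERT-based classifier, not a function from this repository.

import streamlit as st

def classify_text(text: str):
    # Hypothetical stand-in: the real app runs its BERT-based model here
    return "Computer Science", {"Computer Science": 0.9, "Physics": 0.1}

st.title("📚 Academic Paper Classifier")
text = st.text_area("Paste your paper's text (abstract or full content)")
if st.button("Classify Text") and text:
    label, probs = classify_text(text)
    st.success(f"Predicted category: {label}")
    st.write(probs)  # probability distribution for each category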

Categories

The model classifies papers into the following categories:

  • Computer Science
  • Mathematics
  • Physics
  • Biology
  • Economics

Technical Details

  • Built with Streamlit
  • Uses BERT-based model for classification
  • Supports PDF file processing (see the extraction sketch below)
  • Real-time classification
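
A minimal sketch of the PDF step, assuming the pypdf library; the README does not say which PDF library the app actually uses, so treat the import as an assumption.

from pypdf import PdfReader  # assumed PDF library; app.py may use another

def extract_pdf_text(uploaded_file) -> str:
    # Concatenate the extracted text of every page in the uploaded PDF
    reader = PdfReader(uploaded_file)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

The extracted text can then be passed through the same classification path as pasted text.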

About

This application is designed to help researchers, students, and academics quickly identify the primary field of academic papers. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.


Created with ❤️ using Streamlit and Transformers

Setup

  1. Install uv (if not already installed):
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
  2. Create and activate a virtual environment:
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate     # On Windows
  3. Install the dependencies using uv:
uv pip install -r requirements.lock
  4. Run the Streamlit application:
streamlit run app.py

Usage

  1. Text Classification

    • Paste the paper's text (abstract or content) into the text area
    • Click "Classify Text" to get results
  2. PDF Classification

    • Upload a PDF file using the file uploader
    • Click "Classify PDF" to process and classify the document

Model Information

The service uses a BERT-based model for classification with the following categories (a sketch of turning model outputs into these labels follows the list):

  • Computer Science
  • Mathematics
  • Physics
  • Biology
  • Economics
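
For illustration, converting the model's raw logits into the per-category probabilities shown in the UI could look like the following. This is a sketch of the general softmax-over-labels pattern, not the app's actual code.

import torch

CATEGORIES = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

def logits_to_prediction(logits: torch.Tensor):
    # Softmax turns raw scores into a probability distribution
    probs = torch.softmax(logits, dim=-1).squeeze()
    scores = {cat: float(p) for cat, p in zip(CATEGORIES, probs)}
    # The predicted category is the one with the highest probability
    return max(scores, key=scores.get), scores

# Example with dummy logits for the five categories
label, scores = logits_to_prediction(torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0]]))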

Note

The current implementation uses a base BERT model. For production use, you should:

  1. Fine-tune the model on a dataset of academic papers
  2. Adjust the categories based on your specific needs
  3. Implement proper error handling and validation (see the sketch below)
  4. Add authentication if needed
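
As an example of item 3, input validation could start with something as simple as this hypothetical helper (not part of the current code):

def validate_input(text: str, min_chars: int = 50) -> str:
    # Reject empty or too-short inputs before running the model
    if not text or not text.strip():
        raise ValueError("No text provided.")
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        raise ValueError(f"Input too short ({len(cleaned)} chars); paste more of the paper.")
    return cleaned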

Package Management

This project uses uv as the package manager for faster and more reliable dependency management. The dependencies are locked in requirements.lock for reproducible installations.

To update dependencies:

# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock

Requirements

See requirements.txt for a complete list of dependencies.
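
Based on the stack described in this README, the file plausibly looks something like the following; the actual requirements.txt is authoritative, and the PDF library in particular is an assumption.

streamlit==1.32.0  # matches sdk_version in the metadata above
torch
transformers
numpy
sacremoses
pypdf  # assumed; whichever PDF library app.py imports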

ArXiv Paper Classifier

This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.

Project Overview

The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:

  • Computer Science (cs)
  • Mathematics (math)
  • Physics (physics)
  • Quantitative Biology (q-bio)
  • Quantitative Finance (q-fin)
  • Statistics (stat)
  • Electrical Engineering and Systems Science (eess)
  • Economics (econ)

Features

  • Multiple model support:

    • DistilBERT: Lightweight and fast model, good for testing
    • DeBERTa-v3: Advanced model with better performance
    • RoBERTa: Advanced model with strong performance
    • SciBERT: Specialized for scientific text
    • BERT: Classic model with good all-round performance
  • Flexible input handling (see the sketch after this list):

    • Can process both title and abstract
    • Handles text preprocessing and tokenization
    • Supports different maximum sequence lengths
  • Robust error handling:

    • Multiple fallback mechanisms for tokenizer initialization
    • Graceful degradation to simpler models if needed
    • Detailed error messages and logging
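
A sketch of the flexible input handling described above, assuming a Hugging Face tokenizer; the function name and defaults here are illustrative, not the project's exact code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode_paper(title: str, abstract: str, max_length: int = 512):
    # Join title and abstract into one input when both are available
    text = f"{title}. {abstract}" if title else abstract
    # Truncate or pad to the model's maximum sequence length
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )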

Installation

  1. Clone the repository
  2. Install dependencies:
pip install -r requirements.txt

Usage

Basic Usage

from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)

Using Different Models

# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')

Training on Custom Data

# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
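
The defaults shown here (3 epochs, batch size 16, learning rate 2e-5) are within the ranges commonly recommended for fine-tuning BERT-family models.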

Model Details

Available Models

  1. DistilBERT (distilbert)

    • Model: distilbert-base-cased
    • Max length: 512 tokens
    • Fast tokenizer
    • Good for testing and quick results
  2. DeBERTa-v3 (deberta-v3)

    • Model: microsoft/deberta-v3-base
    • Max length: 512 tokens
    • Uses DebertaV2TokenizerFast
    • Advanced performance
  3. RoBERTa (roberta)

    • Model: roberta-base
    • Max length: 512 tokens
    • Strong performance on various tasks
  4. SciBERT (scibert)

    • Model: allenai/scibert_scivocab_uncased
    • Max length: 512 tokens
    • Specialized for scientific text
  5. BERT (bert)

    • Model: bert-base-uncased
    • Max length: 512 tokens
    • Classic model with good all-round performance
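
One plausible way to organize the five options above is a single configuration map keyed by model_type. The checkpoint names come from the list above; the lookup structure itself is an assumption about how the classifier might be written.

# Checkpoint names as listed above; the dict layout is illustrative
MODEL_CONFIGS = {
    "distilbert": {"checkpoint": "distilbert-base-cased", "max_length": 512},
    "deberta-v3": {"checkpoint": "microsoft/deberta-v3-base", "max_length": 512},
    "roberta": {"checkpoint": "roberta-base", "max_length": 512},
    "scibert": {"checkpoint": "allenai/scibert_scivocab_uncased", "max_length": 512},
    "bert": {"checkpoint": "bert-base-uncased", "max_length": 512},
}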

Error Handling

The system includes robust error-handling mechanisms (a sketch follows the list):

  • Multiple fallback levels for tokenizer initialization
  • Graceful degradation to simpler models
  • Detailed error messages and logging
  • Automatic fallback to BERT tokenizer if needed
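
A sketch of the fallback idea, illustrative only; the project's actual fallback order may differ.

from transformers import AutoTokenizer, BertTokenizerFast

def load_tokenizer(checkpoint: str):
    try:
        # First choice: the fast tokenizer matching the checkpoint
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    except Exception as err:
        print(f"Fast tokenizer unavailable ({err}); trying the slow one")
    try:
        # Second choice: the slow (pure-Python) tokenizer
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    except Exception as err:
        print(f"Checkpoint tokenizer failed ({err}); falling back to BERT")
    # Last resort: plain BERT tokenizer, as noted in the list above
    return BertTokenizerFast.from_pretrained("bert-base-uncased")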

Requirements

  • Python 3.7+
  • PyTorch
  • Transformers library
  • NumPy
  • Sacremoses (for tokenization support)