---
title: Paper Classifier
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
---

📚 Academic Paper Classifier


This Streamlit application helps classify academic papers into different categories using a BERT-based model.

Features

  • Text Classification: Paste any paper text directly
  • PDF Support: Upload PDF files for classification
  • Real-time Analysis: Get instant classification results
  • Probability Distribution: See confidence scores for each category
  • Multiple Categories: Supports various academic fields

How to Use

  1. Text Input

    • Paste your paper's text (abstract or full content)
    • Click "Classify Text"
    • View results and probability distribution
  2. PDF Upload

    • Upload a PDF file of your paper
    • Click "Classify PDF"
    • Get classification results
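
As a rough illustration, the text-input flow above might be wired together in app.py along these lines. This is a minimal sketch: classify_text is a hypothetical stand-in for the app's actual BERT-based classifier, not a function from this repository.

import streamlit as st

def classify_text(text: str):
    # Hypothetical stand-in: the real app runs its BERT-based model here
    return "Computer Science", {"Computer Science": 0.9, "Physics": 0.1}

st.title("📚 Academic Paper Classifier")
text = st.text_area("Paste your paper's text (abstract or full content)")
if st.button("Classify Text") and text:
    label, probs = classify_text(text)
    st.success(f"Predicted category: {label}")
    st.write(probs)  # probability distribution for each category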

Categories

The model classifies papers into the following categories:

  • Computer Science
  • Mathematics
  • Physics
  • Biology
  • Economics

Technical Details

  • Built with Streamlit
  • Uses BERT-based model for classification
  • Supports PDF file processing (see the extraction sketch below)
  • Real-time classification
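
A minimal sketch of the PDF step, assuming the pypdf library; the README does not say which PDF library the app actually uses, so treat the import as an assumption.

from pypdf import PdfReader  # assumed PDF library; app.py may use another

def extract_pdf_text(uploaded_file) -> str:
    # Concatenate the extracted text of every page in the uploaded PDF
    reader = PdfReader(uploaded_file)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

The extracted text can then be passed through the same classification path as pasted text.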

About

This application is designed to help researchers, students, and academics quickly identify the primary field of academic papers. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.


Created with ❤️ using Streamlit and Transformers

Setup

  1. Install uv (if not already installed):
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
  2. Create and activate a virtual environment:
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate     # On Windows
  3. Install the dependencies using uv:
uv pip install -r requirements.lock
  4. Run the Streamlit application:
streamlit run app.py

Usage

  1. Text Classification

    • Paste the paper's text (abstract or content) into the text area
    • Click "Classify Text" to get results
  2. PDF Classification

    • Upload a PDF file using the file uploader
    • Click "Classify PDF" to process and classify the document

Model Information

The service uses a BERT-based model for classification with the following categories (a sketch of turning model outputs into these labels follows the list):

  • Computer Science
  • Mathematics
  • Physics
  • Biology
  • Economics
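
For illustration, converting the model's raw logits into the per-category probabilities shown in the UI could look like the following. This is a sketch of the general softmax-over-labels pattern, not the app's actual code.

import torch

CATEGORIES = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

def logits_to_prediction(logits: torch.Tensor):
    # Softmax turns raw scores into a probability distribution
    probs = torch.softmax(logits, dim=-1).squeeze()
    scores = {cat: float(p) for cat, p in zip(CATEGORIES, probs)}
    # The predicted category is the one with the highest probability
    return max(scores, key=scores.get), scores

# Example with dummy logits for the five categories
label, scores = logits_to_prediction(torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0]]))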

Note

The current implementation uses a base BERT model. For production use, you should:

  1. Fine-tune the model on a dataset of academic papers
  2. Adjust the categories based on your specific needs
  3. Implement proper error handling and validation (see the sketch below)
  4. Add authentication if needed
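
As an example of item 3, input validation could start with something as simple as this hypothetical helper (not part of the current code):

def validate_input(text: str, min_chars: int = 50) -> str:
    # Reject empty or too-short inputs before running the model
    if not text or not text.strip():
        raise ValueError("No text provided.")
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        raise ValueError(f"Input too short ({len(cleaned)} chars); paste more of the paper.")
    return cleaned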

Package Management

This project uses uv as the package manager for faster and more reliable dependency management. The dependencies are locked in requirements.lock for reproducible installations.

To update dependencies:

# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock

Requirements

See requirements.txt for a complete list of dependencies.
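
Based on the stack described in this README, the file plausibly looks something like the following; the actual requirements.txt is authoritative, and the PDF library in particular is an assumption.

streamlit==1.32.0  # matches sdk_version in the metadata above
torch
transformers
numpy
sacremoses
pypdf  # assumed; whichever PDF library app.py imports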

ArXiv Paper Classifier

This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.

Project Overview

The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:

  • Computer Science (cs)
  • Mathematics (math)
  • Physics (physics)
  • Quantitative Biology (q-bio)
  • Quantitative Finance (q-fin)
  • Statistics (stat)
  • Electrical Engineering and Systems Science (eess)
  • Economics (econ)

Features

  • Multiple model support:

    • DistilBERT: Lightweight and fast model, good for testing
    • DeBERTa-v3: Advanced model with better performance
    • RoBERTa: Advanced model with strong performance
    • SciBERT: Specialized for scientific text
    • BERT: Classic model with good all-round performance
  • Flexible input handling (see the sketch after this list):

    • Can process both title and abstract
    • Handles text preprocessing and tokenization
    • Supports different maximum sequence lengths
  • Robust error handling:

    • Multiple fallback mechanisms for tokenizer initialization
    • Graceful degradation to simpler models if needed
    • Detailed error messages and logging
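
A sketch of the flexible input handling described above, assuming a Hugging Face tokenizer; the function name and defaults here are illustrative, not the project's exact code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode_paper(title: str, abstract: str, max_length: int = 512):
    # Join title and abstract into one input when both are available
    text = f"{title}. {abstract}" if title else abstract
    # Truncate or pad to the model's maximum sequence length
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )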

Installation

  1. Clone the repository
  2. Install dependencies:
pip install -r requirements.txt

Usage

Basic Usage

from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)

Using Different Models

# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')

Training on Custom Data

# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
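
The defaults shown here (3 epochs, batch size 16, learning rate 2e-5) are within the ranges commonly recommended for fine-tuning BERT-family models.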

Model Details

Available Models

  1. DistilBERT (distilbert)

    • Model: distilbert-base-cased
    • Max length: 512 tokens
    • Fast tokenizer
    • Good for testing and quick results
  2. DeBERTa-v3 (deberta-v3)

    • Model: microsoft/deberta-v3-base
    • Max length: 512 tokens
    • Uses DebertaV2TokenizerFast
    • Advanced performance
  3. RoBERTa (roberta)

    • Model: roberta-base
    • Max length: 512 tokens
    • Strong performance on various tasks
  4. SciBERT (scibert)

    • Model: allenai/scibert_scivocab_uncased
    • Max length: 512 tokens
    • Specialized for scientific text
  5. BERT (bert)

    • Model: bert-base-uncased
    • Max length: 512 tokens
    • Classic model with good all-round performance
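
One plausible way to organize the five options above is a single configuration map keyed by model_type. The checkpoint names come from the list above; the lookup structure itself is an assumption about how the classifier might be written.

# Checkpoint names as listed above; the dict layout is illustrative
MODEL_CONFIGS = {
    "distilbert": {"checkpoint": "distilbert-base-cased", "max_length": 512},
    "deberta-v3": {"checkpoint": "microsoft/deberta-v3-base", "max_length": 512},
    "roberta": {"checkpoint": "roberta-base", "max_length": 512},
    "scibert": {"checkpoint": "allenai/scibert_scivocab_uncased", "max_length": 512},
    "bert": {"checkpoint": "bert-base-uncased", "max_length": 512},
}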

Error Handling

The system includes robust error-handling mechanisms (a sketch follows the list):

  • Multiple fallback levels for tokenizer initialization
  • Graceful degradation to simpler models
  • Detailed error messages and logging
  • Automatic fallback to BERT tokenizer if needed
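
A sketch of the fallback idea, illustrative only; the project's actual fallback order may differ.

from transformers import AutoTokenizer, BertTokenizerFast

def load_tokenizer(checkpoint: str):
    try:
        # First choice: the fast tokenizer matching the checkpoint
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    except Exception as err:
        print(f"Fast tokenizer unavailable ({err}); trying the slow one")
    try:
        # Second choice: the slow (pure-Python) tokenizer
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    except Exception as err:
        print(f"Checkpoint tokenizer failed ({err}); falling back to BERT")
    # Last resort: plain BERT tokenizer, as noted in the list above
    return BertTokenizerFast.from_pretrained("bert-base-uncased")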

Requirements

  • Python 3.7+
  • PyTorch
  • Transformers library
  • NumPy
  • Sacremoses (for tokenization support)