---
title: Paper Classifier
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---

# 📚 Academic Paper Classifier

[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/ssbars/ysdaml4)

This Streamlit application classifies academic papers into different categories using a BERT-based model.

## Features

- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields

## How to Use

1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View the results and probability distribution
2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results
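Under the hood, both buttons feed text to the same classifier. The following is a minimal sketch of that step, assuming the app wraps a Hugging Face `text-classification` pipeline; the checkpoint name and the `classify_text` helper are illustrative, not the actual `app.py`:

```python
# Illustrative stand-in for the app's classification step, not the real app.py.
from transformers import pipeline

# Hypothetical checkpoint; the deployed Space may load a different fine-tuned model.
clf = pipeline("text-classification", model="bert-base-uncased", top_k=None)

def classify_text(text: str) -> dict:
    """Return a {category: probability} mapping for the pasted text."""
    # truncation=True keeps long papers within the model's 512-token limit;
    # wrapping the input in a list keeps the output shape predictable.
    scores = clf([text], truncation=True)[0]
    return {item["label"]: item["score"] for item in scores}
```

Streamlit could then render the returned mapping directly, for example with `st.bar_chart`, to produce the probability distribution shown in the UI.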
## Categories

The model classifies papers into the following categories:

- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Technical Details

- Built with Streamlit
- Uses a BERT-based model for classification
- Supports PDF file processing
- Real-time classification

## About

This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.

---

Created with ❤️ using Streamlit and Transformers

## Setup

1. Install `uv` (if not already installed):

   ```bash
   # Using pip
   pip install uv

   # Or using Homebrew on macOS
   brew install uv
   ```

2. Create and activate a virtual environment:

   ```bash
   uv venv
   source .venv/bin/activate  # On Unix/macOS
   # OR
   .venv\Scripts\activate     # On Windows
   ```

3. Install the dependencies using uv:

   ```bash
   uv pip install -r requirements.lock
   ```

4. Run the Streamlit application:

   ```bash
   streamlit run app.py
   ```

## Usage

1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results
2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document

## Model Information

The service uses a BERT-based model for classification with the following categories:

- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Note

The current implementation uses a base BERT model. For production use, you should:

1. Fine-tune the model on a dataset of academic papers
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation
4. Add authentication if needed

## Package Management

This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.

To update dependencies:

```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate the lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```

## Requirements

See `requirements.txt` for a complete list of dependencies.

# ArXiv Paper Classifier

This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.

## Project Overview

The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:

- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)

## Features

- Multiple model support:
  - DistilBERT: lightweight and fast, good for testing
  - DeBERTa-v3: advanced model with better performance
  - RoBERTa: advanced model with strong performance
  - SciBERT: specialized for scientific text
  - BERT: classic model with good all-round performance
- Flexible input handling:
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths
- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging

## Installation

1. Clone the repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Basic Usage

```python
from model import PaperClassifier

# Initialize the classifier with the default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print the results
print(result)
```

### Using Different Models

```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```

### Training on Custom Data

```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```

## Model Details

### Available Models

1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results
2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses `DebertaV2TokenizerFast`
   - Advanced performance
3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks
4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text
5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance

## Error Handling

The system includes robust error-handling mechanisms (illustrated in the sketch at the end of this README):

- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to the BERT tokenizer if needed

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)
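To make the fallback chain described under Error Handling concrete, here is a minimal sketch of tokenizer degradation. The helper name `load_tokenizer_with_fallback` is hypothetical and not the project's actual API; it only assumes the standard `AutoTokenizer` interface:

```python
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(model_name: str):
    """Hypothetical fallback chain: fast tokenizer -> slow tokenizer -> BERT."""
    try:
        # First choice: the model's own fast (Rust-backed) tokenizer.
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)
    except Exception as exc:
        print(f"Fast tokenizer failed for {model_name}: {exc}")
    try:
        # Second choice: the slow Python tokenizer. Some models need extra
        # packages here (e.g. sacremoses), hence its place in the requirements.
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception as exc:
        print(f"Slow tokenizer failed for {model_name}: {exc}")
    # Last resort: degrade to the plain BERT tokenizer, as described above.
    return AutoTokenizer.from_pretrained("bert-base-uncased")
```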