---
title: Paper Classifier
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---

# 📚 Academic Paper Classifier

[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/ssbars/ysdaml4)

This Streamlit application classifies academic papers into different categories using a BERT-based model.

## Features

- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields

## How to Use

1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View the results and probability distribution
2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results
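Under the hood, both buttons feed text to the same classifier. The following is a minimal sketch of that step, assuming the app wraps a Hugging Face `text-classification` pipeline; the checkpoint name and the `classify_text` helper are illustrative, not the actual `app.py`:

```python
# Illustrative stand-in for the app's classification step, not the real app.py.
from transformers import pipeline

# Hypothetical checkpoint; the deployed Space may load a different fine-tuned model.
clf = pipeline("text-classification", model="bert-base-uncased", top_k=None)

def classify_text(text: str) -> dict:
    """Return a {category: probability} mapping for the pasted text."""
    # truncation=True keeps long papers within the model's 512-token limit;
    # wrapping the input in a list keeps the output shape predictable.
    scores = clf([text], truncation=True)[0]
    return {item["label"]: item["score"] for item in scores}
```

Streamlit could then render the returned mapping directly, for example with `st.bar_chart`, to produce the probability distribution shown in the UI.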
## Categories

The model classifies papers into the following categories:

- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Technical Details

- Built with Streamlit
- Uses a BERT-based model for classification
- Supports PDF file processing
- Real-time classification

## About

This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.

---

Created with ❤️ using Streamlit and Transformers

## Setup

1. Install `uv` (if not already installed):

   ```bash
   # Using pip
   pip install uv

   # Or using Homebrew on macOS
   brew install uv
   ```

2. Create and activate a virtual environment:

   ```bash
   uv venv
   source .venv/bin/activate  # On Unix/macOS
   # OR
   .venv\Scripts\activate     # On Windows
   ```

3. Install the dependencies using uv:

   ```bash
   uv pip install -r requirements.lock
   ```

4. Run the Streamlit application:

   ```bash
   streamlit run app.py
   ```

## Usage

1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results
2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document

## Model Information

The service uses a BERT-based model for classification with the following categories:

- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Note

The current implementation uses a base BERT model. For production use, you should:

1. Fine-tune the model on a dataset of academic papers
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation
4. Add authentication if needed

## Package Management

This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.

To update dependencies:

```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate the lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```

## Requirements

See `requirements.txt` for a complete list of dependencies.

# ArXiv Paper Classifier

This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.

## Project Overview

The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:

- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)

## Features

- Multiple model support:
  - DistilBERT: lightweight and fast, good for testing
  - DeBERTa-v3: advanced model with better performance
  - RoBERTa: advanced model with strong performance
  - SciBERT: specialized for scientific text
  - BERT: classic model with good all-round performance
- Flexible input handling:
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths
- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging

## Installation

1. Clone the repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Basic Usage

```python
from model import PaperClassifier

# Initialize the classifier with the default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print the results
print(result)
```

### Using Different Models

```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```

### Training on Custom Data

```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```

## Model Details

### Available Models

1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results
2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses `DebertaV2TokenizerFast`
   - Advanced performance
3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks
4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text
5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance

## Error Handling

The system includes robust error-handling mechanisms (illustrated in the sketch at the end of this README):

- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to the BERT tokenizer if needed

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)
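To make the fallback chain described under Error Handling concrete, here is a minimal sketch of tokenizer degradation. The helper name `load_tokenizer_with_fallback` is hypothetical and not the project's actual API; it only assumes the standard `AutoTokenizer` interface:

```python
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(model_name: str):
    """Hypothetical fallback chain: fast tokenizer -> slow tokenizer -> BERT."""
    try:
        # First choice: the model's own fast (Rust-backed) tokenizer.
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)
    except Exception as exc:
        print(f"Fast tokenizer failed for {model_name}: {exc}")
    try:
        # Second choice: the slow Python tokenizer. Some models need extra
        # packages here (e.g. sacremoses), hence its place in the requirements.
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception as exc:
        print(f"Slow tokenizer failed for {model_name}: {exc}")
    # Last resort: degrade to the plain BERT tokenizer, as described above.
    return AutoTokenizer.from_pretrained("bert-base-uncased")
```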