Spaces: ssbars/ysdaml4

ssbars committed on Apr 8

Commit 12faaae

1 Parent(s): 2989d17

v2

Files changed (6)

ML2_2025_nlp_ops1.ipynb +167 -0
README.md +146 -1
app.py +111 -44
model.py +338 -35
requirements.lock +64 -0
requirements.txt +8 -1

ML2_2025_nlp_ops1.ipynb ADDED

	@@ -0,0 +1,167 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# __DevOps-flavored homework on transformers__\n",
+    "\n",
+    "## __Description__\n",
+    "\n",
+    "![img](https://d35w6hwqhdq0in.cloudfront.net/521712556725591dcacec5bbdb32e047.png)\n",
+    "\n",
+    "Your main quest for this homework is to build your own simple transformer-based service. A whole service, really: starting from the data and ending with a graphical interface somewhere on the internet. Your service can solve either one of the tasks suggested below or any other task (something dearer to you personally).\n",
+    "\n",
+    "__Standard task: paper classifier.__ Build a service that takes a paper's title and abstract and outputs the most likely topic of the paper: say, physics, biology, or computer science. The interface must let the user enter the abstract and the title separately and see the top-95%* topics, sorted by descending probability. If no abstract is entered, classify the paper by its title alone. Instructions and data for exactly this task are given below.\n",
+    "\n",
+    "<details><summary><u> What does top-95% mean?</u></summary>\n",
+    "    Output topics in order of descending probability until their cumulative probability exceeds 95%. Depending on the predicted probabilities, this may be one topic or several. For example, if the model predicted probabilities [4%, 20%, 60%, 2%, 14%], output the top 4 classes: the top 3 sum to only 94%, and adding the fourth brings the total to 98%. If one class has a probability of 96%, that single class is enough.\n",
+    "</details>\n",
+    "\n",
+    "Alternatively, you can venture to build something of your own, on data from the internet or on your own data. Your task must involve a _justified_ use of transformers. Using ML to convert time zones is a bad plan.\n",
+    "\n",
+    "Achtung: transformers are cool but not omnipotent. Far from every task can be solved noticeably better than random. For calibration, here are a few examples of solvable tasks (everything is clickable):\n",
+    "\n",
+    "\n",
+    "<details><summary> - <b>[medium]</b> <u>Generate YouTube comments from a _link_ to a video</u></summary>\n",
+    "    It's simple: the user posts a link to a video, and you comment on it. You may agree in advance that videos are in English only or in Russian only. You need to compose _several_ comments. Kudos if, along with the main comment, you generate usernames and/or replies to it.\n",
+    "    \n",
+    "    A dataset for fine-tuning can be [taken from kaggle](https://www.kaggle.com/tanmay111/youtube-comments-sentiment-analysis/data?select=UScomments.csv) or [collected yourself](https://towardsdatascience.com/how-to-build-your-own-dataset-of-youtube-comments-39a1e57aade).\n",
+    "    \n",
+    "    As the main model you can use [GPT-2 large](https://huggingface.co/gpt2-large). Here is how to fine-tune it: https://tinyurl.com/gpt2-finetune-colab . If you want more, you can take something from the https://huggingface.co/EleutherAI collection. For example, [here](https://tinyurl.com/gpt-j-8bit) is an example of fine-tuning GPT-J-6B (8 times larger than gpt2-large). However, do this only after your basic scenario already works with GPT2-large or even base.\n",
+    "    \n",
+    "    In the final service you can let the user vary the generation parameters: temperature or top-p for sampling; beam size and length penalty for beam search; how many comments to generate, etc. Extra respect if your code outputs the comment one word at a time, right during generation, so that the user doesn't have to wait while you crank out the whole paragraph.\n",
+    "</details>\n",
+    "\n",
+    "<details><summary> - <b>[medium]</b> <u>Predict salary from a profile (Dud simulator).</u></summary>\n",
+    "    Note: <details> <summary>What does Dud have to do with it?</summary> <img src=https://www.meme-arsenal.com/memes/6dd85f126bbab4f9774ced71ffadbcb3.jpg> </details>\n",
+    "    \n",
+    "    The main difficulty of the task is getting good data. If good data didn't happen, trashy data will do :) The assignment is about technology, after all, not about the product. To start, you can take a subset of the features [from here](https://www.kaggle.com/c/job-salary-prediction/data) that you can recover from a LinkedIn profile: job title and company name. The company name is better replaced with features from open sources: industry, size, etc.\n",
+    "    \n",
+    "    Then fine-tune BERT / T5 on this and rejoice. Or at least laugh.\n",
+    "</details>\n",
+    "\n",
+    "\n",
+    "<details><summary> - <b>[hard]</b> <u>Opinions with a geographic flavor.</u></summary>\n",
+    "    \n",
+    "    A service that takes a topic (a hashtag or key phrase) as input and draws a world map showing, for each region, the sentiment with which people talk about it on social networks. You can take VK/twitter as the social network; in the case of VK, the expected granularity is not countries but cities of the former USSR countries.\n",
+    "    \n",
+    "    In the minimal version it is enough to determine the sentiment of a tweet in \"positive-negative\" mode by fine-tuning a generic BERT/T5 on one of the dozens of {vk/twitter} sentiment classification datasets. The geographic location can be obtained from the user's profile. After that, all that's left is to collect data by country and region.\n",
+    "\n",
+    "</details>\n",
+    "\n",
+    "\n",
+    "<details><summary> - <b>[very hard]</b> <u>Find a Wikipedia article from a photo of the article's subject</u></summary>\n",
+    "\n",
+    "    So that you can snap a photo of some mysterious thingamajig with your phone and get the sum of human knowledge about it in the form of a wiki article.\n",
+    "    \n",
+    "    As the loss function you can use a contrastive loss. This loss is described well in the [CLIP](https://arxiv.org/abs/2103.00020) paper. Instead of training from scratch, we suggest taking CLIP itself (text transformer + image transformer) from here: https://huggingface.co/docs/transformers/model_doc/clip. The model will map each article and \n",
+    "    \n",
+    "    Data for this quest can be collected through the Wikipedia API: wiki articles about objects usually contain a photo of the object and the article text itself. We recommend collecting at least 10^4 image-article pairs. We also recommend augmenting the images: at minimum with standard image augmentations, at maximum by searching for similar images on the internet / in imagenet using the same CLIP image encoder, but with the original weights.\n",
+    "    \n",
+    "    While debugging the interface, we recommend limiting yourself to a small list of articles: say, cats, dogs, birds, wrenches, cars. Once it's clear that it works \"on cats\", you can extend this list to \"all articles of such-and-such categories\". Article embeddings are best precomputed into a file. If scanning them takes too long, you can (but don't have to) use fast nearest-neighbor search, e.g. [faiss](https://github.com/facebookresearch/faiss) HNSW.\n",
+    "</details>\n",
+    "\n",
+    "\n",
+    "## __How to train a paper classifier?__\n",
+    "\n",
+    "Data for paper classification can be downloaded, for example, [from here](https://www.kaggle.com/neelshah18/arxivdataset/). This data contains the paper's title and abstract, plus the __\"tag\"__ field: the paper's topic [according to the arxiv.org taxonomy](https://arxiv.org/category_taxonomy). You can extend the dataset by adding papers from 2019 to the present. To do this, you can [use the arxiv API](https://github.com/lukasschwab/arxiv.py), parse arxiv yourself with [beautifulsoup](https://pypi.org/project/beautifulsoup4/), or look for other datasets on kaggle, huggingface, etc.\n",
+    "\n",
+    "Once the data is collected (and neatly split into train/test), you can train something. We recommend using the `transformers` library for this. We recommend but don't insist: if you like, you can take [fairseq roberta](https://github.com/pytorch/fairseq/blob/main/examples/roberta), [google t5](https://github.com/google-research/text-to-text-transfer-transformer), or even write everything from scratch.\n",
+    "\n",
+    "We covered transformers in the [seminar](https://lk.yandexdataschool.ru/courses/2025-spring/7.1332-machine-learning-2/classes/13138/); for any additional information, see the [HF documentation](https://huggingface.co/docs).\n",
+    "\n",
+    "It's best to start with a simple model such as [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased). Once you understand what accuracy to expect from the baseline model, you can look for something better. Two obvious directions for improvement: (1) a stronger model, such as T5 or deberta v3, or (2) closer data, for example a model that was pretrained on the same arxiv. Both are convenient to [search for here](https://huggingface.co/models).\n",
+    "\n",
+    "## __Trained it, now what?__\n",
+    "\n",
+    "Now you need to make your trained model answer requests on the internet. As in the previous stage, you can do this in a bunch of different ways: from simple [streamlit](https://streamlit.io/) / [gradio](https://gradio.app/), through [TorchServe](https://pytorch.org/serve/) with [Triton/TensorRT](https://developer.nvidia.com/nvidia-triton-inference-server), all the way to exporting the model to javascript with [TensorFlow.js](https://www.tensorflow.org/js/tutorials) / [ONNX.js](https://github.com/elliotwaite/pytorch-to-javascript-with-onnx-js).\n",
+    "\n",
+    "In the [seminar](https://lk.yandexdataschool.ru/courses/2025-spring/7.1332-machine-learning-2/classes/13138/) we covered the basics of how streamlit works and how to build a simple application with it.\n",
+    "\n",
+    "The general idea of streamlit: you [describe](https://docs.streamlit.io/library/get-started/create-an-app) the application's appearance in Python using primitives (buttons, fields, arbitrary html), and then this code runs on the server and serves each user in a separate session.\n",
+    "\n",
+    "__For debugging__, you can run the application locally by opening a console next to app.py:\n",
+    "* `pip install streamlit`\n",
+    "* `streamlit run app.py --server.port 8080`\n",
+    "* open localhost:8080 in the browser if it didn't open automatically\n",
+    "\n",
+    "\n",
+    "## __Deployment time!__\n",
+    "\n",
+    "This time you need not only to write the code __but also to bring your application up with access from the internet__. And yes, you guessed it, this can be done in several ways: [HuggingFace spaces](https://huggingface.co/spaces) (this method was covered in the [seminar](https://lk.yandexdataschool.ru/courses/2025-spring/7.1332-machine-learning-2/classes/13138/)), [Streamlit Cloud](https://streamlit.io/cloud), or you can buy or rent your own server and host it there.\n",
+    "\n",
+    "The easiest option is hosting on HF spaces. To do that, [sign up](https://huggingface.co/join) and find the [new application creation menu](https://huggingface.co/new-space). Choose the name and license as you see fit; what matters is that the Space SDK is Streamlit and the access is public.\n",
+    "\n",
+    "Once it's created, you can edit your application right on the site: open the application, go to Files and versions, and add the needed files in the right corner there.\n",
+    "\n",
+    "At a bare minimum you'll need 2 files:\n",
+    "- `app.py`, which we discussed above\n",
+    "- `requirements.txt`, where you list the libraries you need\n",
+    "\n",
+    "You can also place your trained model's weights there, any necessary data, additional files, ...\n",
+    "\n",
+    "After each file change, your application will build (usually 1-5 minutes) and become available in the App tab. Or it won't build and will show you where it broke. And voilà, now you have a link you can show to ~friends~ course assistants and anyone on the internet.\n",
+    "\n",
+    "__Convenient work with code.__ While you have 2 files, they're easy to edit right in the HF spaces interface. If you have a dozen files, you may find it more convenient to edit them in your favorite vscode/pycharm/.../emacs. To keep this painless, you can use HF spaces as a git repository ([details here](https://huggingface.co/docs/hub/spaces#manage-app-with-github-actions)).\n",
+    "\n",
+    "## __What to submit__\n",
+    "\n",
+    "You submit a project that will be checked manually. What is expected from each project:\n",
+    "- A written description of your specific project in any easily readable format (pdf, html, text in lk, ...): what task you solved, where/how you got the data, which models you used, which experiments you ran, ...\n",
+    "- A link to the web interface where the demo of your project can be tested. Be sure to check that it works not only for you (from another device and in incognito mode)\n",
+    "- The training code for your model (preferably an ipynb with executed cells and outputs not cleared, converted to pdf / html); if you didn't train in a notebook, submit the code as a file / archive of files / git link with a readme.md describing exactly how training was carried out with this code.\n",
+    "\n",
+    "## __Grading__\n",
+    "\n",
+    "We will evaluate the project as a whole, including the idea and the implementation. The maximum for the project is 10 points, but we reserve up to 5 more points that we may award as a bonus for especially interesting and well-implemented projects.\n",
+    "\n",
+    "### __Subtle points that may cost you points:__\n",
+    "\n",
+    "__1. Speed.__\n",
+    "\n",
+    "By default, streamlit executes all of your code on every user action. That is, every time the user changes something in the text, it will reload the model. To fix this outrage, you can cache the prepared model with `@st.cache` (`@st.cache_resource` in current Streamlit versions). Details are in the [seminar](https://lk.yandexdataschool.ru/courses/2025-spring/7.1332-machine-learning-2/classes/13138/); also [read here](https://docs.streamlit.io/library/advanced-features/caching).\n",
+    "\n",
+    "__How it will be graded:__\n",
+    "\n",
+    "You are not required to use caching, but your application must not stall without justification for longer than 3 seconds. \"Justified\" stalls are those you have explicitly justified in text in the LMS :)\n",
+    "\n",
+    "-----\n",
+    "\n",
+    "__2. A clear frontend.__\n",
+    "\n",
+    "The quick-and-dirty graphical interface from the seminar is an example of how an application interface should rather not be done. How it should be done is a hard question, so hard that there is even a [School of Interface Development](https://academy.yandex.ru/schools/frontend). But for a start:\n",
+    "\n",
+    "- Output human-readable text, not just JSON with indices and metadata.\n",
+    "- It must be clear to the user where to enter which data. Empty text fields in a vacuum are bad form.\n",
+    "- The service must not crash with uncaught errors. Even if the user enters incorrect/empty data, handle it and report where the error occurred.\n",
+    "\n",
+    "__How it will be graded:__\n",
+    "\n",
+    "For a full score it is enough to follow these three rules and not deliberately shoot yourself in the foot.\n",
+    "\n",
+    "-----\n",
+    "\n",
+    "__3. Training and inference code.__\n",
+    "\n",
+    "When you submit the project, we will also receive the project code from you (both the training code for your models and the web interface code).\n",
+    "\n",
+    "__How it will be graded:__\n",
+    "\n",
+    "The code will not be checked separately as part of the assignment, so write it however you like; however, in disputed cases we reserve the right to check your code, which may result in point deductions for any violations.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
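
The top-95% rule stated in the assignment notebook can be sketched in a few lines. This is only an illustration under assumptions: the function name `top_p_categories` and the category labels in the example are hypothetical and do not come from `model.py`. Read literally, the rule keeps adding classes until the cumulative probability exceeds the threshold, so the notebook's example vector yields four classes, because the three largest sum to only 94%.

```python
from typing import Dict, List, Tuple


def top_p_categories(
    probs: Dict[str, float], threshold: float = 0.95
) -> List[Tuple[str, float]]:
    """Return categories by descending probability until their sum exceeds `threshold`.

    Hypothetical helper for illustration; not part of the committed code.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    cumulative = 0.0
    for category, prob in ranked:
        selected.append((category, prob))
        cumulative += prob
        if cumulative > threshold:
            break
    return selected


# The probability vector from the notebook, with made-up category labels.
example = {"cs": 0.04, "math": 0.20, "physics": 0.60, "stat": 0.02, "eess": 0.14}
top = top_p_categories(example)
print([name for name, _ in top])  # ['physics', 'math', 'eess', 'cs']
```

If every class has been added and the sum still does not exceed the threshold (for example because of rounding), the loop simply returns all classes.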

README.md CHANGED

@@ -11,6 +11,8 @@ pinned: false
 # 📚 Academic Paper Classifier
 This Streamlit application helps classify academic papers into different categories using a BERT-based model.
 ## Features
@@ -128,4 +130,147 @@ uv pip install -r requirements.lock
 ## Requirements
-See `requirements.txt` for a complete list of dependencies.

 # 📚 Academic Paper Classifier
+[link](https://huggingface.co/spaces/ssbars/ysdaml4)
 This Streamlit application helps classify academic papers into different categories using a BERT-based model.
 ## Features
 ## Requirements
+See `requirements.txt` for a complete list of dependencies.
+# ArXiv Paper Classifier
+This project implements a machine learning system for classifying academic papers into ArXiv categories using pre-trained transformer models.
+## Project Overview
+The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:
+- Computer Science (cs)
+- Mathematics (math)
+- Physics (physics)
+- Quantitative Biology (q-bio)
+- Quantitative Finance (q-fin)
+- Statistics (stat)
+- Electrical Engineering and Systems Science (eess)
+- Economics (econ)
+## Features
+- Multiple model support:
+  - DistilBERT: Lightweight and fast model, good for testing
+  - DeBERTa-v3: Advanced model with better performance
+  - RoBERTa: Advanced model with strong performance
+  - SciBERT: Specialized for scientific text
+  - BERT: Classic model with good all-round performance
+- Flexible input handling:
+  - Can process both title and abstract
+  - Handles text preprocessing and tokenization
+  - Supports different maximum sequence lengths
+- Robust error handling:
+  - Multiple fallback mechanisms for tokenizer initialization
+  - Graceful degradation to simpler models if needed
+  - Detailed error messages and logging
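
The flexible input handling listed above (a title with an optional abstract) could look roughly like the sketch below. It is an assumption-laden illustration, not the code in `model.py`: the helper name `build_model_input`, the newline separator, and the `title_only` / `title_and_abstract` labels are all hypothetical.

```python
from typing import Optional, Tuple


def build_model_input(title: str, abstract: Optional[str] = None) -> Tuple[str, str]:
    """Combine title and abstract into one text; fall back to the title alone.

    Returns (text, input_type). Both the helper and the input_type labels
    are hypothetical names chosen for this sketch.
    """
    # Collapse runs of whitespace so pasted text tokenizes consistently.
    title = " ".join(title.split())
    if not title:
        raise ValueError("A non-empty title is required")
    abstract = " ".join(abstract.split()) if abstract else ""
    if abstract:
        return f"{title}\n{abstract}", "title_and_abstract"
    return title, "title_only"


text, input_type = build_model_input("Attention Is All You Need")
print(input_type)  # title_only
```

Raising on an empty title keeps the validation in one place, so the UI only has to catch `ValueError` and show a message.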
+## Installation
+1. Clone the repository
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+## Usage
+### Basic Usage
+```python
+from model import PaperClassifier
+# Initialize classifier with default model (DistilBERT)
+classifier = PaperClassifier()
+# Classify a paper
+result = classifier.classify_paper(
+    title="Your paper title",
+    abstract="Your paper abstract"
+)
+# Print results
+print(result)
+```
+### Using Different Models
+```python
+# Initialize with DeBERTa-v3
+classifier = PaperClassifier(model_type='deberta-v3')
+# Initialize with RoBERTa
+classifier = PaperClassifier(model_type='roberta')
+# Initialize with SciBERT
+classifier = PaperClassifier(model_type='scibert')
+# Initialize with BERT
+classifier = PaperClassifier(model_type='bert')
+```
+### Training on Custom Data
+```python
+# Prepare your training data
+train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
+train_labels = ["cs", "math", ...]
+# Train the model
+classifier.train_on_arxiv(
+    train_texts=train_texts,
+    train_labels=train_labels,
+    epochs=3,
+    batch_size=16,
+    learning_rate=2e-5
+)
+```
+## Model Details
+### Available Models
+1. **DistilBERT** (`distilbert`)
+   - Model: `distilbert-base-cased`
+   - Max length: 512 tokens
+   - Fast tokenizer
+   - Good for testing and quick results
+2. **DeBERTa-v3** (`deberta-v3`)
+   - Model: `microsoft/deberta-v3-base`
+   - Max length: 512 tokens
+   - Uses DebertaV2TokenizerFast
+   - Advanced performance
+3. **RoBERTa** (`roberta`)
+   - Model: `roberta-base`
+   - Max length: 512 tokens
+   - Strong performance on various tasks
+4. **SciBERT** (`scibert`)
+   - Model: `allenai/scibert_scivocab_uncased`
+   - Max length: 512 tokens
+   - Specialized for scientific text
+5. **BERT** (`bert`)
+   - Model: `bert-base-uncased`
+   - Max length: 512 tokens
+   - Classic model with good all-round performance
+## Error Handling
+The system includes robust error handling mechanisms:
+- Multiple fallback levels for tokenizer initialization
+- Graceful degradation to simpler models
+- Detailed error messages and logging
+- Automatic fallback to BERT tokenizer if needed
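
A fallback chain like the one described in this list can be sketched generically. The snippet below is a minimal illustration, assuming each loading strategy is wrapped in a zero-argument callable; `load_with_fallbacks` and the stand-in loaders are hypothetical and are not taken from `model.py`.

```python
import logging
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def load_with_fallbacks(loaders: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (label, loader) pair in order and return the first success.

    Hypothetical helper: every failure is logged, and a RuntimeError that
    lists all failures is raised if no loader succeeds.
    """
    errors: List[str] = []
    for label, loader in loaders:
        try:
            return label, loader()
        except Exception as exc:  # broad on purpose: any failure triggers the next fallback
            logger.warning("Loader %r failed: %s", label, exc)
            errors.append(f"{label}: {exc}")
    raise RuntimeError("All loaders failed: " + "; ".join(errors))


def broken_loader() -> str:
    # Stand-in for a loader whose tokenizer files are missing.
    raise OSError("tokenizer files not found")


label, value = load_with_fallbacks([
    ("fast tokenizer", broken_loader),
    ("slow tokenizer", lambda: "stand-in tokenizer"),
])
print(label)  # slow tokenizer
```

In a real setup the callables might wrap calls such as `AutoTokenizer.from_pretrained(name)` and `AutoTokenizer.from_pretrained(name, use_fast=False)`; that wiring is an assumption here and is left out so the sketch runs without downloading anything.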
+## Requirements
+- Python 3.7+
+- PyTorch
+- Transformers library
+- NumPy
+- Sacremoses (for tokenization support)

app.py CHANGED

@@ -11,19 +11,56 @@ import PyPDF2
 import io
 from model import PaperClassifier
-# Initialize the classifier
 @st.cache_resource
-def load_classifier():
-    return PaperClassifier()
-classifier = load_classifier()
 # Title and description
 st.title("📚 Academic Paper Classification")
 st.markdown("""
 This service helps you classify academic papers into different categories.
 You can either:
-- Paste the paper's text directly
 - Upload a PDF file
 """)
@@ -31,28 +68,44 @@ You can either:
 col1, col2 = st.columns(2)
 with col1:
-    st.subheader("Option 1: Text Input")
-    text_input = st.text_area(
-        "Paste your paper text here:",
         height=200,
-        placeholder="Paste the paper's abstract or content here..."
     )
-    if st.button("Classify Text"):
-        if text_input.strip():
             with st.spinner("Classifying..."):
-                result = classifier.classify_paper(text_input)
                 st.success("Classification Complete!")
-                st.write(f"**Predicted Category:** {result['category']}")
-                st.write(f"**Confidence:** {result['confidence']:.2%}")
-                # Show all probabilities
-                st.subheader("Category Probabilities")
-                for category, prob in result['all_probabilities'].items():
-                    st.progress(prob, text=f"{category}: {prob:.2%}")
         else:
-            st.warning("Please enter some text to classify.")
 with col2:
     st.subheader("Option 2: PDF Upload")
@@ -62,38 +115,52 @@ with col2:
         if st.button("Classify PDF"):
             try:
                 with st.spinner("Processing PDF..."):
-                    # Read PDF content
-                    pdf_reader = PyPDF2.PdfReader(io.BytesIO(uploaded_file.read()))
-                    text_content = ""
-                    for page in pdf_reader.pages:
-                        text_content += page.extract_text()
-                    # Classify the extracted text
-                    result = classifier.classify_paper(text_content)
                     st.success("Classification Complete!")
-                    st.write(f"**Predicted Category:** {result['category']}")
-                    st.write(f"**Confidence:** {result['confidence']:.2%}")
-                    # Show all probabilities
-                    st.subheader("Category Probabilities")
-                    for category, prob in result['all_probabilities'].items():
-                        st.progress(prob, text=f"{category}: {prob:.2%}")
             except Exception as e:
                 st.error(f"Error processing PDF: {str(e)}")
-# Add information about the model
-st.sidebar.title("About")
-st.sidebar.info("""
-This application uses a BERT-based model to classify academic papers into different categories.
-The model analyzes the content and predicts the most likely academic field.
-**Categories:**
-- Computer Science
-- Mathematics
-- Physics
-- Biology
-- Economics
 """)
 # Add footer

 import io
 from model import PaperClassifier
+# Initialize the classifier with model selection
 @st.cache_resource
+def load_classifier(model_type):
+    return PaperClassifier(model_type)
+# Cache the PDF text extraction
+@st.cache_data
+def extract_pdf_text(pdf_bytes):
+    """Extract text from PDF and try to separate title and abstract"""
+    pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
+    text = ""
+    for page in pdf_reader.pages:
+        text += page.extract_text() + "\n"
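+    # Try to extract title and abstract: the first non-empty line is taken as
+    # the title, and everything after it as the abstract
+    lines = [line.strip() for line in text.split('\n') if line.strip()]
+    title = lines[0] if lines else ""
+    abstract = "\n".join(lines[1:]) if len(lines) > 1 else ""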
+    return title.strip(), abstract.strip()
+# Get available models for selection
+available_models = list(PaperClassifier.AVAILABLE_MODELS.keys())
+# Add model selection to sidebar
+st.sidebar.title("Model Settings")
+selected_model = st.sidebar.selectbox(
+    "Select Model",
+    available_models,
+    index=0,
+    help="Choose the model to use for classification"
+)
+# Display model information
+model_info = PaperClassifier.AVAILABLE_MODELS[selected_model]
+st.sidebar.markdown(f"""
+### Selected Model
+**Name:** {model_info['name']}
+**Description:** {model_info['description']}
+""")
+# Initialize the classifier with selected model
+classifier = load_classifier(selected_model)
 # Title and description
 st.title("📚 Academic Paper Classification")
 st.markdown("""
 This service helps you classify academic papers into different categories.
 You can either:
+- Enter the paper's title and abstract separately
 - Upload a PDF file
 """)
 col1, col2 = st.columns(2)
 with col1:
+    st.subheader("Option 1: Manual Input")
+    # Title input
+    title_input = st.text_input(
+        "Paper Title:",
+        placeholder="Enter the paper title..."
+    )
+    # Abstract input
+    abstract_input = st.text_area(
+        "Paper Abstract (optional):",
         height=200,
+        placeholder="Enter the paper abstract (optional)..."
     )
+    if st.button("Classify Paper"):
+        if title_input.strip():
             with st.spinner("Classifying..."):
+                result = classifier.classify_paper(
+                    title=title_input,
+                    abstract=abstract_input if abstract_input.strip() else None
+                )
                 st.success("Classification Complete!")
+                st.write(f"**Input Type:** {result['input_type'].replace('_', ' ').title()}")
+                st.write(f"**Model Used:** {result['model_used']}")
+                # Show top categories
+                st.subheader("Top Categories (95% Confidence)")
+                total_prob = 0
+                for cat_info in result['top_categories']:
+                    prob = cat_info['probability']
+                    total_prob += prob
+                    st.progress(prob, text=f"{cat_info['category']} ({cat_info['arxiv_category']}): {prob:.1%}")
+                st.info(f"Total probability of shown categories: {total_prob:.1%}")
         else:
+            st.warning("Please enter at least the paper title.")
 with col2:
     st.subheader("Option 2: PDF Upload")
         if st.button("Classify PDF"):
             try:
                 with st.spinner("Processing PDF..."):
+                    # Extract title and abstract from PDF
+                    title, abstract = extract_pdf_text(uploaded_file.read())
+                    if not title:
+                        st.error("Could not extract title from PDF.")
+                        st.stop()
+                    # Show extracted text
+                    with st.expander("Show extracted text"):
+                        st.write("**Extracted Title:**")
+                        st.write(title)
+                        if abstract:
+                            st.write("**Extracted Abstract:**")
+                            st.write(abstract)
+                    # Classify the paper
+                    result = classifier.classify_paper(
+                        title=title,
+                        abstract=abstract if abstract else None
+                    )
                     st.success("Classification Complete!")
+                    st.write(f"**Input Type:** {result['input_type'].replace('_', ' ').title()}")
+                    st.write(f"**Model Used:** {result['model_used']}")
+                    # Show top categories
+                    st.subheader("Top Categories (95% Confidence)")
+                    total_prob = 0
+                    for cat_info in result['top_categories']:
+                        prob = cat_info['probability']
+                        total_prob += prob
+                        st.progress(prob, text=f"{cat_info['category']} ({cat_info['arxiv_category']}): {prob:.1%}")
+                    st.info(f"Total probability of shown categories: {total_prob:.1%}")
             except Exception as e:
                 st.error(f"Error processing PDF: {str(e)}")
+# Add information about the models
+st.sidebar.markdown("---")
+st.sidebar.title("Available Models")
+st.sidebar.markdown("""
+- **DistilBERT**: Fast and lightweight
+- **DeBERTa v3**: Advanced performance
+- **T5**: Versatile text-to-text
+- **RoBERTa**: Strong performance
+- **SciBERT**: Specialized for science
+- **BERT**: Classic model, good all-round performance
 """)
 # Add footer

model.py CHANGED

@@ -1,53 +1,356 @@
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
 import numpy as np
 class PaperClassifier:
-    def __init__(self):
-        # Using BERT model fine-tuned on arXiv categories
-        self.model_name = "bert-base-uncased"  # This is a placeholder, you can replace with your fine-tuned model
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
-        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name)
-        # Define paper categories (example categories, can be modified based on needs)
         self.categories = [
-            "Computer Science",
-            "Mathematics",
-            "Physics",
-            "Biology",
-            "Economics"
         ]
-    def preprocess_text(self, text):
-        # Truncate text to model's maximum length
-        return text[:512]
-    def classify_paper(self, text):
         # Preprocess the text
-        processed_text = self.preprocess_text(text)
         # Tokenize
-        inputs = self.tokenizer(processed_text,
-                              return_tensors="pt",
-                              truncation=True,
-                              max_length=512,
-                              padding=True)
         # Get model predictions
         with torch.no_grad():
             outputs = self.model(**inputs)
-            predictions = torch.softmax(outputs.logits, dim=1)
-        # Get predicted category and confidence
-        predicted_idx = torch.argmax(predictions).item()
-        confidence = predictions[0][predicted_idx].item()
-        # Return prediction and confidence
         return {
-            'category': self.categories[predicted_idx],
-            'confidence': confidence,
-            'all_probabilities': {
-                cat: prob.item()
-                for cat, prob in zip(self.categories, predictions[0])
-            }
-        }

+from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
 import torch
 import numpy as np
+import logging
 class PaperClassifier:
+    # Available models with their configurations
+    AVAILABLE_MODELS = {
+        'distilbert': {
+            'name': 'distilbert-base-cased',
+            'max_length': 512,
+            'description': 'Lightweight and fast model, good for testing',
+            'force_slow': False,
+            'tokenizer_class': None  # Use default
+        },
+        'deberta-v3': {
+            'name': 'microsoft/deberta-v3-base',
+            'max_length': 512,
+            'description': 'Advanced model with better performance',
+            'force_slow': True,  # Force slow tokenizer for DeBERTa
+            'tokenizer_class': 'DebertaV2TokenizerFast'  # Specify tokenizer class
+        },
+        't5': {
+            'name': 'google/t5-v1_1-base',
+            'max_length': 512,
+            'description': 'Versatile text-to-text model',
+            'force_slow': False,
+            'tokenizer_class': None  # Use default
+        },
+        'roberta': {
+            'name': 'roberta-base',
+            'max_length': 512,
+            'description': 'Advanced model with strong performance',
+            'force_slow': False,
+            'tokenizer_class': None  # Use default
+        },
+        'scibert': {
+            'name': 'allenai/scibert_scivocab_uncased',
+            'max_length': 512,
+            'description': 'Specialized for scientific text',
+            'force_slow': False,
+            'tokenizer_class': None  # Use default
+        },
+        'bert': {
+            'name': 'bert-base-uncased',
+            'max_length': 512,
+            'description': 'Classic BERT model, good all-round performance',
+            'force_slow': False,
+            'tokenizer_class': None  # Use default
+        }
+    }
+    def __init__(self, model_type='distilbert'):
+        """
+        Initialize the classifier with a specific model type
+        Args:
+            model_type (str): One of 'distilbert', 'deberta-v3', 't5', 'roberta', 'scibert', 'bert'
+        """
+        if model_type not in self.AVAILABLE_MODELS:
+            raise ValueError(f"Model type must be one of {list(self.AVAILABLE_MODELS.keys())}")
+        self.model_type = model_type
+        self.model_config = self.AVAILABLE_MODELS[model_type]
+        self.model_name = self.model_config['name']
+        # ArXiv main categories with descriptions
         self.categories = [
+            "cs",      # Computer Science
+            "math",    # Mathematics
+            "physics", # Physics
+            "q-bio",  # Quantitative Biology
+            "q-fin",  # Quantitative Finance
+            "stat",   # Statistics
+            "eess",   # Electrical Engineering and Systems Science
+            "econ"    # Economics
         ]
+        # Human readable category names
+        self.category_names = {
+            "cs": "Computer Science",
+            "math": "Mathematics",
+            "physics": "Physics",
+            "q-bio": "Biology",
+            "q-fin": "Finance",
+            "stat": "Statistics",
+            "eess": "Electrical Engineering",
+            "econ": "Economics"
+        }
+        # Initialize tokenizer with proper error handling
+        self._initialize_tokenizer()
+        # Initialize model with proper error handling
+        self._initialize_model()
+        # Print model info
+        print(f"Initialized {model_type} model: {self.model_name}")
+        print(f"Description: {self.model_config['description']}")
+        print("Note: This model needs to be fine-tuned on ArXiv data for accurate predictions.")
+    def _initialize_tokenizer(self):
+        """Initialize the tokenizer with proper error handling"""
+        try:
+            # Check that the model configuration can be loaded before fetching the tokenizer
+            AutoConfig.from_pretrained(self.model_name)
+            # Load the tokenizer with a specific class if one is configured
+            tokenizer_class_name = self.model_config.get('tokenizer_class')
+            if tokenizer_class_name:
+                import transformers
+                tokenizer_class = getattr(transformers, tokenizer_class_name)
+                self.tokenizer = tokenizer_class.from_pretrained(
+                    self.model_name,
+                    model_max_length=self.model_config['max_length']
+                )
+            else:
+                # Try loading with AutoTokenizer
+                self.tokenizer = AutoTokenizer.from_pretrained(
+                    self.model_name,
+                    model_max_length=self.model_config['max_length'],
+                    use_fast=not self.model_config['force_slow'],
+                    trust_remote_code=True
+                )
+            print(f"Successfully initialized tokenizer for {self.model_type}")
+        except Exception as e:
+            print(f"Error initializing tokenizer: {str(e)}")
+            print("Falling back to basic tokenizer...")
+            # Try one more time with minimal settings
+            try:
+                self.tokenizer = AutoTokenizer.from_pretrained(
+                    self.model_name,
+                    use_fast=False,
+                    trust_remote_code=True
+                )
+            except Exception as e:
+                # If all else fails, try using BERT tokenizer as last resort
+                print("Falling back to BERT tokenizer...")
+                self.tokenizer = AutoTokenizer.from_pretrained(
+                    'bert-base-uncased',
+                    model_max_length=self.model_config['max_length']
+                )
+    def _initialize_model(self):
+        """Initialize the model with proper error handling"""
+        try:
+            self.model = AutoModelForSequenceClassification.from_pretrained(
+                self.model_name,
+                num_labels=len(self.categories),
+                id2label={i: label for i, label in enumerate(self.categories)},
+                label2id={label: i for i, label in enumerate(self.categories)},
+                trust_remote_code=True  # Allow custom code from hub
+            )
+        except Exception as e:
+            raise RuntimeError(f"Failed to initialize model: {str(e)}")
+    @classmethod
+    def list_available_models(cls):
+        """List all available models with their descriptions"""
+        print("Available models:")
+        for model_type, config in cls.AVAILABLE_MODELS.items():
+            print(f"\n{model_type}:")
+            print(f"  Model: {config['name']}")
+            print(f"  Description: {config['description']}")
+    def preprocess_text(self, title, abstract=None):
+        """
+        Preprocess title and abstract
+        Args:
+            title (str): Paper title
+            abstract (str, optional): Paper abstract
+        """
+        if abstract:
+            text = f"Title: {title}\nAbstract: {abstract}"
+        else:
+            text = f"Title: {title}"
+        if self.model_type == 't5':
+            text = "classify: " + text
+        # Length limiting is left to the tokenizer (truncation=True, max_length in tokens);
+        # slicing the string here would cut at 512 characters, not 512 tokens
+        return text
+    def get_top_categories(self, probabilities, threshold=0.95):
+        """
+        Get top categories that sum up to the threshold
+        Args:
+            probabilities (torch.Tensor): Model predictions
+            threshold (float): Probability threshold (default: 0.95)
+        Returns:
+            list: List of (category, probability) tuples
+        """
+        # Convert to numpy for easier manipulation
+        probs = probabilities.numpy()
+        # Sort indices by probability
+        sorted_indices = np.argsort(probs)[::-1]
+        # Calculate cumulative sum
+        cumsum = np.cumsum(probs[sorted_indices])
+        # Keep categories until the cumulative probability first reaches the threshold,
+        # including the category that crosses it
+        num_selected = min(int(np.searchsorted(cumsum, threshold)) + 1, len(probs))
+        # Get the selected indices
+        selected_indices = sorted_indices[:num_selected]
+        # Return categories and their probabilities
+        return [
+            {
+                'category': self.category_names.get(self.categories[idx], self.categories[idx]),
+                'arxiv_category': self.categories[idx],
+                'probability': float(probs[idx])
+            }
+            for idx in selected_indices
+        ]
+    def classify_paper(self, title, abstract=None):
+        """
+        Classify a paper based on its title and optional abstract
+        Args:
+            title (str): Paper title
+            abstract (str, optional): Paper abstract
+        """
         # Preprocess the text
+        processed_text = self.preprocess_text(title, abstract)
         # Tokenize
+        inputs = self.tokenizer(
+            processed_text,
+            return_tensors="pt",
+            truncation=True,
+            max_length=self.model_config['max_length'],
+            padding=True
+        )
         # Get model predictions
         with torch.no_grad():
             outputs = self.model(**inputs)
+            predictions = torch.softmax(outputs.logits, dim=1)[0]
+        # Get top categories that sum to 95% probability
+        top_categories = self.get_top_categories(predictions)
+        # Return predictions
         return {
+            'top_categories': top_categories,
+            'model_used': self.model_type,
+            'input_type': 'title_and_abstract' if abstract else 'title_only'
+        }
+    def train_on_arxiv(self, train_texts, train_labels, validation_texts=None, validation_labels=None,
+                       epochs=3, batch_size=16, learning_rate=2e-5):
+        """
+        Function to fine-tune the model on ArXiv data
+        Args:
+            train_texts (list): List of paper texts (title + abstract)
+            train_labels (list): List of corresponding ArXiv categories
+            validation_texts (list, optional): Validation texts
+            validation_labels (list, optional): Validation labels
+            epochs (int): Number of training epochs
+            batch_size (int): Training batch size
+            learning_rate (float): Learning rate for training
+        """
+        from transformers import TrainingArguments, Trainer
+        import datasets
+        # Prepare datasets
+        train_encodings = self.tokenizer(
+            train_texts,
+            truncation=True,
+            padding=True,
+            max_length=self.model_config['max_length']
+        )
+        # Convert labels to ids
+        train_label_ids = [self.categories.index(label) for label in train_labels]
+        # Create training dataset
+        train_dataset = datasets.Dataset.from_dict({
+            'input_ids': train_encodings['input_ids'],
+            'attention_mask': train_encodings['attention_mask'],
+            'labels': train_label_ids
+        })
+        # Create validation dataset if provided
+        if validation_texts and validation_labels:
+            val_encodings = self.tokenizer(
+                validation_texts,
+                truncation=True,
+                padding=True,
+                max_length=self.model_config['max_length']
+            )
+            val_label_ids = [self.categories.index(label) for label in validation_labels]
+            validation_dataset = datasets.Dataset.from_dict({
+                'input_ids': val_encodings['input_ids'],
+                'attention_mask': val_encodings['attention_mask'],
+                'labels': val_label_ids
+            })
+        else:
+            validation_dataset = None
+        # Training arguments
+        training_args = TrainingArguments(
+            output_dir=f"./results_{self.model_type}",
+            num_train_epochs=epochs,
+            per_device_train_batch_size=batch_size,
+            per_device_eval_batch_size=batch_size,
+            warmup_steps=500,
+            weight_decay=0.01,
+            logging_dir=f"./logs_{self.model_type}",
+            logging_steps=10,
+            learning_rate=learning_rate,
+            evaluation_strategy="epoch" if validation_dataset is not None else "no",
+            save_strategy="epoch",
+            load_best_model_at_end=validation_dataset is not None,
+        )
+        # Initialize trainer
+        trainer = Trainer(
+            model=self.model,
+            args=training_args,
+            train_dataset=train_dataset,
+            eval_dataset=validation_dataset,
+        )
+        # Train the model
+        trainer.train()
+        # Save the fine-tuned model
+        save_dir = f"./fine_tuned_{self.model_type}"
+        self.model.save_pretrained(save_dir)
+        self.tokenizer.save_pretrained(save_dir)
+        print(f"Model saved to {save_dir}")
+    @classmethod
+    def load_fine_tuned(cls, model_type, model_path):
+        """
+        Load a fine-tuned model from disk
+        Args:
+            model_type (str): The type of model that was fine-tuned
+            model_path (str): Path to the saved model
+        """
+        classifier = cls(model_type)
+        classifier.model = AutoModelForSequenceClassification.from_pretrained(model_path)
+        classifier.tokenizer = AutoTokenizer.from_pretrained(model_path)
+        return classifier
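
Review note on the top-95% selection in `get_top_categories`: the snippet below is a minimal standalone sketch of selecting classes by cumulative probability, assuming the intended behaviour is to include the class that pushes the running total to the threshold. The function name `top_p_indices` is hypothetical and is not part of this commit; it uses only NumPy.

```python
import numpy as np


def top_p_indices(probs, threshold=0.95):
    """Return class indices in descending probability order, up to and
    including the class whose cumulative probability first reaches threshold.

    Hypothetical helper for illustration only.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Indices sorted from most to least probable
    order = np.argsort(probs)[::-1]
    cumsum = np.cumsum(probs[order])
    # Position of the first cumulative value >= threshold; +1 turns it into a count
    count = int(np.searchsorted(cumsum, threshold)) + 1
    # Guard against rounding that leaves the total slightly below the threshold
    count = min(count, len(order))
    return order[:count].tolist()


if __name__ == "__main__":
    print(top_p_indices([0.04, 0.20, 0.60, 0.02, 0.14]))
    print(top_p_indices([0.96, 0.03, 0.01]))
```

With probabilities [0.04, 0.20, 0.60, 0.02, 0.14] the three largest sum to 0.94, so a fourth class is needed to reach 0.95. A mask of the form `cumsum <= threshold` would stop one class early in that case.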

requirements.lock ADDED

	@@ -0,0 +1,64 @@

+streamlit==1.32.0
+    altair==5.2.0
+    attrs==23.2.0
+    blinker==1.7.0
+    cachetools==5.3.3
+    certifi==2024.2.2
+    charset-normalizer==3.3.2
+    click==8.1.7
+    gitdb==4.0.11
+    gitpython==3.1.42
+    idna==3.6
+    importlib-metadata==7.0.2
+    jinja2==3.1.3
+    jsonschema==4.21.1
+    markdown-it-py==3.0.0
+    markupsafe==2.1.5
+    mdurl==0.1.2
+    numpy==1.26.4
+    packaging==23.2
+    pandas==2.2.0
+    pillow==10.2.0
+    protobuf==4.25.3
+    pyarrow==15.0.1
+    pydeck==0.8.1b0
+    pygments==2.17.2
+    python-dateutil==2.9.0
+    pytz==2024.1
+    requests==2.31.0
+    rich==13.7.1
+    six==1.16.0
+    smmap==5.0.1
+    tenacity==8.2.3
+    toml==0.10.2
+    toolz==0.12.1
+    tornado==6.4
+    typing-extensions==4.10.0
+    tzdata==2024.1
+    tzlocal==5.2
+    urllib3==2.2.1
+    validators==0.22.0
+    watchdog==4.0.0
+    zipp==3.17.0
+torch==2.2.0
+    filelock==3.13.1
+    fsspec==2024.2.0
+    jinja2==3.1.3
+    networkx==3.2.1
+    sympy==1.12
+    typing-extensions==4.10.0
+transformers==4.37.2
+    huggingface-hub==0.21.4
+    packaging==23.2
+    pyyaml==6.0.1
+    regex==2023.12.25
+    requests==2.31.0
+    tokenizers==0.15.2
+    tqdm==4.66.2
+scikit-learn==1.4.0
+    joblib==1.3.2
+    numpy==1.26.4
+    scipy==1.12.0
+    threadpoolctl==3.3.0
+PyPDF2==3.0.1
+    typing-extensions==4.10.0

requirements.txt CHANGED

@@ -2,4 +2,11 @@ streamlit==1.32.0
 torch==2.2.0
 transformers==4.37.2
 scikit-learn==1.4.0
-PyPDF2==3.0.1

 torch==2.2.0
 transformers==4.37.2
 scikit-learn==1.4.0
+PyPDF2==3.0.1
+datasets==2.18.0
+arxiv==2.1.0
+beautifulsoup4==4.12.3
+sentencepiece==0.2.0
+tokenizers==0.15.2
+protobuf==4.25.3
+sacremoses==0.1.1