diff --git "a/tutorials/how_to_use_parlasent.ipynb" "b/tutorials/how_to_use_parlasent.ipynb"
deleted file mode 100644
--- "a/tutorials/how_to_use_parlasent.ipynb"
+++ /dev/null
@@ -1 +0,0 @@
-{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"gpuType":"T4","authorship_tag":"ABX9TyPun7Hmb6PhP4vmxLeh77/p"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["# How to use the ParlaSent model? A practical tutorial\n","\n","Authors: Michal Mochtak, Peter Rupnik, Taja Kuzman, and Nikola Ljubešić\n","\n","Date: 21/8/2024"],"metadata":{"id":"EzUt9g09sMfV"}},{"cell_type":"markdown","source":["## Introductory remarks ⛳\n","\n","This is an interactive Jupyter notebook that presents a step-by-step tutorial on how to use the ParlaSent model with your own data. The overall structure of the notebook can organized around two elements: 1) sentence extraction and 2) sentence annotation.\n","\n","If you use this tutorial, please cite the paper:\n","\n","\n","> Mochtak, Michal, Peter Rupnik, and Nikola Ljubešić. 2024. “The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 16024–36. Torino, Italia: ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1393.\n","\n"],"metadata":{"id":"Mjg6lAlFsbJ1"}},{"cell_type":"markdown","source":["## Prerequisities ⚡\n","Google Colab is an interactive developement environment with acces to computational resources that are easy to utilize free of charge (read more about it here: https://colab.research.google.com/).\n","\n","In order to use the ParlaSent model, you need to first connect to an interactive environment that has access to a graphical processing unit. In order to do that, click \"Runtime\" in the top toolbar and select \"Change runtime type\":\n","\n","
\n","
\n","After a pop up appears, select any available GPU accelerator and save your selection:\n","\n","\n","\n","Finally, in the top right corner, click \"Connect\". After a while, a green tick mark (✔) will appear. Your virtual session has been succesfully set up."],"metadata":{"id":"Tyr0BdDlvJJl"}},{"cell_type":"markdown","source":["## Loading data for processing 💾\n","This notebook is designed for a simple use case that expects users to prepare their data outside the Google Colab environment as a plain .csv file and then upload it for further processing. This paper's repository contains a sample file which you can use as guidance for formatting your own data. The file contains 124 speeches in English from the debate on the proposal for a regulation of the European Parliament and of the Council setting emission performance standards for new passenger cars and new light commercial vehicles held on 3rd March 2018. The file contains just two columns - \"doc_id\" as a document identifier and \"speech\" for the actual transcripts (the pipeline does not require anything else).\n","\n","In order to use the file (or any other file you prepare), upload the file to your interactive session by clicking the folder icon on the left and \"drag-and-drop\" your file to the area under the folder \"sample_data.\" The file will be uploaded to your interactive session and will be available for you to process. It is important to realize the file exists only in this interactive session and will be deleted when you close it. That applies also to any file you create in your session (e.g., the processed data).\n","\n","\n","\n"],"metadata":{"id":"cF1rFHA12gBC"}},{"cell_type":"markdown","source":["## Processing the data 🎆\n","\n","The processing pipeline can be divided into two steps: 1) sentence extraction and 2) sentence annotation. From now on, the notebook will also use code cells, which can be executed by clicking the small \"play\" icon next to them (hover over the cell to make it visible). The only cell you need to alter (if needed) is the one below with a few meta-parameters that will be used in the pipeline."],"metadata":{"id":"tAREYA5V9Pty"}},{"cell_type":"code","execution_count":9,"metadata":{"id":"D-oYq7EnsHae","executionInfo":{"status":"ok","timestamp":1724273840312,"user_tz":-120,"elapsed":630,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"outputs":[],"source":["# Before we start, we will set a few meta parameters for the pipeline to use.\n","language = \"english\" # This parameter indicates what language your input text is so trankit can load a proper sentence parser for you. In this example, the speeches are in English; check the available languages at https://trankit.readthedocs.io/en/latest/pkgnames.html (look for the Code Name for pipeline initialization).\n","text_column = \"speech\" # Name of a column in the .csv file with the input we want to analyze. In this example, the column we will process is \"speech\".\n","doc_id = \"doc_id\" # Name of a column with a unique indetifier of a text to process. 
{"cell_type":"markdown","source":["### Loading the necessary packages 💻\n","To process the input data, we need to install and load a few packages."],"metadata":{"id":"xedYNVClALEN"}},
{"cell_type":"code","source":["# Install the missing packages into your session; this needs to be done every\n","# time you open the notebook, as each session starts fresh.\n","!pip install simpletransformers\n","!pip install trankit"],"metadata":{"id":"m9Ekm2hy_5zt"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Load the necessary packages.\n","import simpletransformers.classification as cl\n","import trankit\n","import pandas as pd"],"metadata":{"id":"ixpDR4gQA_Ri"},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","source":["### Step 1: Sentence extraction ⛏\n","Now that we have all the necessary packages loaded and ready to use, we can proceed with the first step: sentence extraction."],"metadata":{"id":"F-iTvMydBJu4"}},
{"cell_type":"code","source":["# Load the dataset you want to process; this tutorial works with a plain .csv\n","# file for simplicity.\n","df = pd.read_csv(filename)"],"metadata":{"id":"d7bi80FCBYKt","executionInfo":{"status":"ok","timestamp":1724276354589,"user_tz":-120,"elapsed":472,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":33,"outputs":[]},
{"cell_type":"code","source":["# Load the trankit pipeline with the language model you specified earlier.\n","p = trankit.Pipeline(lang=language, embedding='xlm-roberta-base', gpu=True, cache_dir='./cache')"],"metadata":{"id":"a9Y_JgI4Cosb"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Check the dataset to make sure it was read correctly.\n","df"],"metadata":{"id":"It1UPAF2CVT9"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Split texts into sentences. We use a simple loop as the model processes inputs sequentially.\n","sentences = []\n","for n in range(len(df)):\n","    one_text = pd.DataFrame.from_dict(p.ssplit(df[text_column][n]))\n","    one_text[\"doc_id\"] = df[doc_id][n]\n","    one_text = one_text.drop([\"text\", \"lang\"], axis=1)\n","    sentences.append(one_text)\n","\n","# Concatenate the list and reset the index.\n","sentences = pd.concat(sentences)\n","sentences.reset_index(drop=True, inplace=True)\n","\n","# Unpack the per-sentence dictionaries returned by trankit and patch everything together.\n","sentences_final = pd.concat([sentences.drop(['sentences'], axis=1), pd.json_normalize(sentences['sentences'])], axis=1)\n","\n","# Rename the 'text' column to 'sentence'.\n","sentences_final.rename(columns={'text': 'sentence'}, inplace=True)"],"metadata":{"id":"cY7HA7QZC69L","executionInfo":{"status":"ok","timestamp":1724274096819,"user_tz":-120,"elapsed":7270,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":23,"outputs":[]},
{"cell_type":"code","source":["# Check the result of sentence extraction. It is a data frame in which the\n","# 'doc_id' column refers to the original document, the 'id' column gives the\n","# sentence id within the processed input (e.g., a speech), the 'sentence'\n","# column contains the extracted grammatical units (i.e., the sentences), and\n","# the 'dspan' column contains the start and end indexes of each sentence in\n","# the processed string.\n","sentences_final"],"metadata":{"id":"f-dlGQBPDz27"},"execution_count":null,"outputs":[]},
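{"cell_type":"markdown","source":["The next cell is a small optional sanity check (an addition to the original tutorial): assuming the \"dspan\" values are (start, end) character offsets into the processed text, as in trankit's sentence-splitting output, slicing the original speech with them should reproduce the extracted sentence."],"metadata":{}},
{"cell_type":"code","source":["# Optional sanity check: slice the first speech with the first sentence's\n","# 'dspan' offsets and compare it with the extracted sentence; the two printed\n","# strings should be identical.\n","start, end = sentences_final[\"dspan\"][0]\n","first_doc = df[df[doc_id] == sentences_final[\"doc_id\"][0]][text_column].iloc[0]\n","print(first_doc[start:end])\n","print(sentences_final[\"sentence\"][0])"],"metadata":{},"execution_count":null,"outputs":[]},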
{"cell_type":"markdown","source":["### Step 2: Sentiment annotation 🌡\n","With the extracted sentences, we can continue with the second step: sentiment annotation."],"metadata":{"id":"DjTSUVHIwggU"}},
{"cell_type":"code","source":["# Load the ParlaSent model from the Hugging Face Hub.\n","model = cl.ClassificationModel(\"xlmroberta\", \"classla/xlm-r-parlasent\")"],"metadata":{"id":"nlEPS7zQwygU"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Annotate the prepared sentences with the ParlaSent model. predict() returns\n","# a tuple of (predictions, raw model outputs); the pipeline keeps the second\n","# element as the sentiment score.\n","prediction = model.predict(to_predict=sentences_final[\"sentence\"].tolist())\n","\n","final_df = sentences_final.assign(predict=prediction[1])"],"metadata":{"id":"RyEbfsrSxPRa"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Check the result. The final_df data frame now contains an additional column\n","# \"predict\" with the predictions the model made. As the classification model\n","# predicts the label (score) on a continuous scale, similar to a regression\n","# model, it can produce scores above and below the 0-5 scale we used for\n","# training. For reference, 124 speeches containing 1289 sentences took\n","# approximately 6 seconds to annotate (T4 GPU).\n","final_df"],"metadata":{"id":"DXujhnV9yKhk"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["# Save the annotated data as a .csv file. The new file is located in the same\n","# location as the input file you uploaded at the beginning. If it does not\n","# appear there automatically, click the refresh button (look for the circled\n","# arrow) to reload the content of the folder. To download the file, right-click\n","# on it and choose \"Download\" to save it on your local machine.\n","final_df.to_csv('output.csv', index=False)"],"metadata":{"id":"x3sZwEXfzeqX","executionInfo":{"status":"ok","timestamp":1724275269041,"user_tz":-120,"elapsed":451,"user":{"displayName":"Michal Mochtak","userId":"04685713018345275081"}}},"execution_count":32,"outputs":[]},
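{"cell_type":"markdown","source":["As an optional final step, the sketch below shows one way to post-process the scores (an addition to the original tutorial): clipping them to the 0-5 training scale and averaging the sentiment per document via the \"doc_id\" identifier. The \"predict_clipped\" and \"mean_sentiment\" names are invented for illustration, and whether clipping is appropriate depends on your analysis."],"metadata":{}},
{"cell_type":"code","source":["# Optional post-processing sketch; assumes the values in \"predict\" are plain\n","# floats. Clip to the 0-5 scale used during training, then average per document.\n","final_df[\"predict_clipped\"] = final_df[\"predict\"].clip(lower=0, upper=5)\n","doc_sentiment = (\n","    final_df.groupby(\"doc_id\", as_index=False)[\"predict_clipped\"]\n","    .mean()\n","    .rename(columns={\"predict_clipped\": \"mean_sentiment\"})\n",")\n","doc_sentiment  # one row per speech; merge with your metadata on \"doc_id\""],"metadata":{},"execution_count":null,"outputs":[]},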
{"cell_type":"markdown","source":["## Closing remarks 👋\n","This tutorial has walked you through the whole annotation pipeline and demonstrated how easy it is to set up and run on your own data. With just a few meta-parameters defined for your uploaded document (the column names and the language you want to analyze), you can annotate your own text data in no time. It does not matter whether it is in English, German, Czech, Polish, or Italian; the model will do the job. The result is a simple data frame with annotated sentences that can be further processed or aggregated for specific research purposes (e.g., at the level of speeches, time periods, or groups) and merged with the available metadata (using the doc_id identifier)."],"metadata":{"id":"I4iiq0MF0quY"}}]}
\ No newline at end of file