---
license: apache-2.0
---

# Finance Document Classifier

This repository contains a classifier for determining whether a document is finance-related.

## Model Overview

- A regression-based classifier with two classes: financial (1) and non-financial (0).
- Uses `Snowflake/snowflake-arctic-embed-m` as the embedding model with a classification head. The model is trained as a regressor, and its output is clamped and rounded to a class label at inference time.
- We used `Qwen/Qwen2.5-72B-Instruct` to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are converted to 0 and scores in [3, 5] to 1. The model was trained on 108k documents and tested on the remaining 2k.

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LinguaCustodia/ClassiFin")
model = AutoModelForSequenceClassification.from_pretrained("LinguaCustodia/ClassiFin")

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs (one regression value per input)
outputs = model(**inputs)
logits = outputs.logits.float().detach().cpu().numpy()
logits = logits.ravel().tolist()

# Clamp each value to [0, 1] and round it to a class label
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]
# 0 for non-financial, 1 for financial
```

## Model Performance

- Evaluated on the test set of 2000 samples.

```
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250

    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
```

## Citation

If you use this model in your research or applications, please cite this repository.

```
@misc{ClassiFin,
  title={ClassiFin: Finance Document Classifier},
  author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhlem, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
  url={https://huggingface.co/LinguaCustodia/ClassiFin},
  year={2025}
}
```
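
## Appendix: Label Conversion Sketch

The Model Overview describes mapping annotation scores in [0, 2] to 0 and scores in [3, 5] to 1, and the usage example clamps and rounds each regression output to get a class. The following is a minimal sketch of those two steps in plain Python. The function names `binarize_annotation` and `logit_to_label` are hypothetical and not part of this repository, and the sketch assumes the annotation scores are integers.

```python
def binarize_annotation(score: int) -> int:
    """Map a 0-5 annotation score to a binary training label.

    Assumption: scores are integers; [0, 2] -> 0 and [3, 5] -> 1.
    """
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return 1 if score >= 3 else 0


def logit_to_label(logit: float) -> int:
    """Clamp a regression output to [0, 1], then round to 0 or 1.

    Mirrors the post-processing expression in the usage example.
    """
    return int(round(max(0.0, min(logit, 1.0))))


if __name__ == "__main__":
    print([binarize_annotation(s) for s in range(6)])          # [0, 0, 0, 1, 1, 1]
    print([logit_to_label(x) for x in (-0.3, 0.2, 0.8, 1.7)])  # [0, 0, 1, 1]
```

Because the output is clamped before rounding, values far outside [0, 1] still map to a valid class. Note that Python's built-in `round` rounds exact halves to the nearest even integer, so an output of exactly 0.5 maps to 0.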