---
license: apache-2.0
---
# Finance Document Classifier
This repository contains a classifier for determining whether a document is finance-related.
## Model Overview
- A regression-based classifier with two classes: financial (1) and non-financial (0).
- Uses `Snowflake/snowflake-arctic-embed-m` as the embedding model, with a classification head on top. The model is trained with a regression objective.
- We used `Qwen/Qwen2.5-72B-Instruct` to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are converted to 0 and scores in [3, 5] to 1. The model was then trained on 108k documents and tested on the remaining 2k.
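The score-to-label conversion described above can be sketched as follows. This is a minimal illustration, not the training code: it assumes the annotation scores are integers, and `binarize_score` is a hypothetical helper name.

```python
def binarize_score(score: int) -> int:
    """Map a 0-5 annotation score to a binary label (hypothetical helper).

    Assumes integer scores: [0, 2] -> 0 (non-financial), [3, 5] -> 1 (financial).
    """
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return 1 if score >= 3 else 0


# Convert a few example annotation scores to training labels
scores = [0, 1, 2, 3, 4, 5]
labels = [binarize_score(s) for s in scores]
print(labels)  # [0, 0, 0, 1, 1, 1]
```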
## How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LinguaCustodia/ClassiFin")
model = AutoModelForSequenceClassification.from_pretrained("LinguaCustodia/ClassiFin")
model.eval()

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits.float().cpu().numpy()
logits = logits.ravel().tolist()

# Convert logits to class labels: clamp each value to [0, 1], then round
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]  # 0 for non-financial, 1 for financial
```
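Because the model is trained as a regressor, its raw output is a single real value per document rather than a pair of class logits. The clamp-and-round step in the snippet amounts to a 0.5 decision threshold, which the sketch below makes explicit on plain floats. `logits_to_labels` and its `threshold` parameter are hypothetical names introduced for illustration; the strict `>` comparison mirrors Python's `round`, which rounds exactly 0.5 down to 0.

```python
def logits_to_labels(logits, threshold=0.5):
    """Turn raw regression outputs into 0/1 labels (hypothetical helper).

    With threshold=0.5 this matches int(round(max(0, min(x, 1)))),
    since Python rounds 0.5 to 0 (round-half-to-even).
    """
    return [1 if x > threshold else 0 for x in logits]


# Example raw outputs, including values outside [0, 1]
raw = [-0.3, 0.2, 0.5, 0.51, 1.7]
print(logits_to_labels(raw))  # [0, 0, 0, 1, 1]
```

Raising `threshold` would trade recall for precision on the financial class; the value 0.5 is only what the rounding in the snippet implies, not a tuned setting.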
## Model Performance
- Evaluated on the test set of 2000 samples.
```
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250

    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
```
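As a reading aid for the report, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table, so the results are approximate; `f1_score_from` is a hypothetical helper, not part of any library.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class values taken from the report (already rounded to two decimals)
f1_non_financial = f1_score_from(0.95, 0.99)
f1_financial = f1_score_from(0.92, 0.62)
print(round(f1_non_financial, 2), round(f1_financial, 2))  # 0.97 0.74
```

The lower F1 for the financial class comes from its recall of 0.62: the classifier is usually right when it predicts "financial" but misses a sizeable share of financial documents.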
## Citation
If you use this model in your research or applications, please cite this repository.
```
@misc{ClassiFin,
title={ClassiFin: Finance Document Classifier},
author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhlem, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
url={https://huggingface.co/LinguaCustodia/ClassiFin},
year={2025}
}
```