---
license: apache-2.0
---
# Finance Document Classifier

This repository contains a classifier for determining whether a document is finance-related.

## Model Overview
- A regression-based classifier with two classes: financial (1) and non-financial (0).
- Uses `Snowflake/snowflake-arctic-embed-m` as the embedding model, with a classification head on top. During training, the model is fitted as a regressor.
- We used `Qwen/Qwen2.5-72B-Instruct` to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are mapped to 0 and scores in [3, 5] to 1. The model was then trained on 108k documents and tested on the remaining 2k.
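
The score-to-label mapping described above can be sketched as follows. This is a minimal illustration, not the actual training code: the function name `binarize_score` and the `threshold` parameter are hypothetical, and the annotation scores are assumed to be integers.

```python
def binarize_score(score: int, threshold: int = 3) -> int:
    """Map a 0-5 annotation score to a binary label.

    Scores in [0, 2] become 0 (non-financial), scores in [3, 5] become 1 (financial).
    The name and the threshold parameter are illustrative assumptions.
    """
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return 1 if score >= threshold else 0


# One example per possible annotation score
scores = [0, 1, 2, 3, 4, 5]
labels = [binarize_score(s) for s in scores]
print(labels)  # [0, 0, 0, 1, 1, 1]
```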


## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LinguaCustodia/ClassiFin")
model = AutoModelForSequenceClassification.from_pretrained("LinguaCustodia/ClassiFin")

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs
outputs = model(**inputs)
logits = outputs.logits.float().detach().cpu().numpy()
logits = logits.ravel().tolist()

# Convert logits to class labels
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]  # 0 for non-financial, 1 for financial
```
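
Because the model outputs a single regression value per document, the last step clamps that value to [0, 1] and rounds it. The sketch below isolates that step on plain floats so its edge cases are easy to see; the helper name `score_to_label` is hypothetical. Note that Python's built-in `round` rounds halves to the nearest even integer, so an output of exactly 0.5 maps to 0.

```python
def score_to_label(score: float) -> int:
    """Clamp a raw regression output to [0, 1] and round to a class label."""
    clamped = max(0.0, min(score, 1.0))
    # round() uses round-half-to-even, so exactly 0.5 becomes 0
    return int(round(clamped))


raw_scores = [-0.3, 0.2, 0.51, 0.5, 1.7]
predicted = [score_to_label(s) for s in raw_scores]
print(predicted)  # [0, 0, 1, 0, 1]
```

Values below 0 or above 1 are expected with a regression head, which is why the clamp comes before the rounding.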

## Model Performance
- Evaluated on the test set of 2000 samples.

```  
                precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250
    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
```
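
As a rough consistency check, the F1 scores in the report can be recomputed from the listed precision and recall. This is only a sketch: the inputs are the rounded values from the table, so the results are approximate, and the variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the report above
report = {0: (0.95, 0.99, 1750), 1: (0.92, 0.62, 250)}

f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total = sum(support for _, _, support in report.values())
weighted_f1 = sum(f1_scores[label] * report[label][2] for label in report) / total

# Recall of 0.62 on 250 financial documents is roughly 155 correctly detected
approx_true_positives = round(report[1][1] * report[1][2])

print({label: round(score, 2) for label, score in f1_scores.items()})
print(round(weighted_f1, 2), approx_true_positives)
```

The recall of 0.62 on class 1 means the classifier misses roughly 95 of the 250 financial documents in the test set, while its precision of 0.92 means that most documents it does flag are financial.
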
## Citation

If you use this model in your research or applications, please cite this repository.

```
@misc{ClassiFin,
  title={ClassiFin: Finance Document Classifier},
  author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhlem, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
  url={https://huggingface.co/LinguaCustodia/ClassiFin},
  year={2025}
}
```