thanhtantran committed
Commit 4ed6357 · verified · 1 Parent(s): bdfb908

Cloned from AITeamVN/Vietnamese_Reranker

README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ license: apache-2.0
+ language:
+ - vi
+ base_model:
+ - BAAI/bge-reranker-v2-m3
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ tags:
+ - Embedding
+ - Reranker
+ ---
+
+
+ ## Model Card: Vietnamese_Reranker
+
+ Vietnamese_Reranker is a reranker model fine-tuned from bge-reranker-v2-m3 (https://huggingface.co/BAAI/bge-reranker-v2-m3) to improve retrieval quality for Vietnamese.
+
+ * The model was trained on approximately 1,100,000 Vietnamese triplets of (query, positive document, negative document).
+ * The model was trained with a maximum sequence length of 2304 tokens (256 for the query and 2048 for the passage); see the sketch just below for one way to apply this split at inference time.
+
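+ The 256/2048 split is a training-time budget; the `transformers` snippet in the Usage section below truncates the concatenated pair as a whole. A minimal sketch of one way to enforce the split per side before pairing (the helper names here are illustrative, not part of the released code):
+
+ ```python
+ # Hypothetical helper, not from the model card: pre-truncate each side of a
+ # (query, passage) pair to the 256/2048 token budget the model was trained with.
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('AITeamVN/Vietnamese_Reranker')
+
+ def truncate_side(text: str, max_tokens: int) -> str:
+     # Tokenize without special tokens, cut to the budget, and decode back to text.
+     ids = tokenizer(text, add_special_tokens=False, truncation=True,
+                     max_length=max_tokens)["input_ids"]
+     return tokenizer.decode(ids, skip_special_tokens=True)
+
+ def make_pair(query: str, passage: str, max_query: int = 256, max_passage: int = 2048):
+     return [truncate_side(query, max_query), truncate_side(passage, max_passage)]
+ ```
+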
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Cross-Encoder reranker (Sentence Transformers compatible)
+ - **Base model:** [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
+ - **Maximum Sequence Length:** 2304 tokens (256 for the query and 2048 for the passage)
+ - **Hidden Size:** 1024 (the classification head outputs a single relevance score per query-passage pair)
+ - **Similarity Function:** Dot-product similarity
+ - **Language:** Vietnamese
+ - **License:** Apache 2.0
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('AITeamVN/Vietnamese_Reranker')
+ model = AutoModelForSequenceClassification.from_pretrained('AITeamVN/Vietnamese_Reranker')
+ model.eval()
+
+ MAX_LENGTH = 2304  # 256 tokens for the query + 2048 for the passage
+ pairs = [['Trí tuệ nhân tạo là gì?', 'Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người. Nó hoạt động bằng cách thu thập dữ liệu, nhận diện mẫu và đưa ra quyết định.'],
+          ['Trí tuệ nhân tạo là gì?', 'Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ. Ngủ đủ giấc giúp tinh thần tỉnh táo và làm việc hiệu quả hơn.']]
+
+ with torch.no_grad():
+     inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=MAX_LENGTH)
+     # One logit per pair; higher means the passage is more relevant to the query.
+     scores = model(**inputs, return_dict=True).logits.view(-1).float()
+     print(scores)
+     # tensor([ 7.5590, -9.0743])
+ ```
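+
+ Since the card declares `library_name: sentence-transformers`, the model can presumably also be scored through that library's `CrossEncoder` API. A minimal sketch under that assumption (the `transformers` snippet above is the reference usage from the card):
+
+ ```python
+ # Sketch only: assumes the checkpoint loads directly as a sentence-transformers CrossEncoder.
+ from sentence_transformers import CrossEncoder
+
+ model = CrossEncoder('AITeamVN/Vietnamese_Reranker', max_length=2304)
+ pairs = [
+     ('Trí tuệ nhân tạo là gì?',
+      'Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người.'),
+     ('Trí tuệ nhân tạo là gì?',
+      'Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ.'),
+ ]
+ scores = model.predict(pairs)  # one relevance score per (query, passage) pair
+ print(scores)
+ ```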
+
+
+ ### Evaluation
+
+ - Dataset: the entire training dataset of Legal Zalo 2021 (our models were not trained on this dataset).
+
+ | Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 |
+ |----------------------|------------|------------|------------|-------------|--------------|
+ | Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 |
+ | Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 |
+ | Vietnamese_Embedding | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 |
+ | Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 |
+ | BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 |
+
+ Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets.
+
+ Although Vietnamese_Embedding_v2 (Phase 2) scores slightly lower on the legal domain, its much larger Phase 2 training data makes it generalize better to other domains.
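+
+ For reference, Accuracy@k and MRR@10 in the table are the standard retrieval metrics; a generic sketch of how they are conventionally computed (an illustration, not the authors' evaluation script):
+
+ ```python
+ # Generic metric sketch: ranked_ids[i] is the list of document ids for query i,
+ # sorted by descending reranker score; gold_ids[i] is the relevant document id.
+ def accuracy_at_k(ranked_ids, gold_ids, k):
+     hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
+     return hits / len(gold_ids)
+
+ def mrr_at_k(ranked_ids, gold_ids, k=10):
+     total = 0.0
+     for ranked, gold in zip(ranked_ids, gold_ids):
+         if gold in ranked[:k]:
+             total += 1.0 / (ranked.index(gold) + 1)  # reciprocal rank of the gold doc
+     return total / len(gold_ids)
+ ```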
+
+
+ ## Contact
+
+ **Developer**
+
+ Members: Nguyễn Nho Trung, Nguyễn Nhật Quang, Nguyễn Văn Huy.
+
+ ## Citation
+
+ ```bibtex
+ @misc{Vietnamese_Embedding,
+   title={Vietnamese_Embedding: Embedding model in Vietnamese language.},
+   author={Nguyen Nho Trung and Nguyen Nhat Quang and Nguyễn Văn Huy},
+   year={2025},
+   publisher={Huggingface},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "architectures": [
+     "XLMRobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 8194,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1cecccab406ec6d3f13da8ccdc5e853a1372e51845322163655ab530d3730071
+ size 2271071852
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 8192,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }