PereLluis13 committed
Commit 45b3e47 · verified · 1 Parent(s): 332ec0d

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +34 -128

README.md CHANGED
@@ -1,143 +1,49 @@
  ---
  tags:
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- - dense
  pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
  ```

- ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:

- ```bash
- pip install -U sentence-transformers
  ```
-
- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities)
- # tensor([[1.0000, 0.8701, 0.8232],
- #         [0.8701, 1.0000, 0.7860],
- #         [0.8232, 0.7860, 1.0000]])
  ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.14
- - Sentence Transformers: 5.1.0
- - Transformers: 4.55.3
- - PyTorch: 2.8.0+cu128
- - Accelerate: 1.10.0
- - Datasets:
- - Tokenizers: 0.21.4
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->

  ---
+ library_name: transformers
  tags:
+ - arxiv:2409.12737
+ license: mit
+ base_model:
+ - FacebookAI/xlm-roberta-large
  pipeline_tag: sentence-similarity
  ---

+ Current pre-trained cross-lingual sentence encoder approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
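
+ As a rough, illustrative sketch of this idea (not the official training code, which lives in the MEXMA repo linked below): encode the clean sentence in language A, take its CLS vector, and use it together with the token states of a masked sentence in language B to recover the masked tokens, so the gradient from token prediction also flows into the sentence representation. The combination step (simple addition) and the linear prediction head below are simplifying assumptions:
+ ```python
+ import torch.nn as nn
+ from transformers import AutoTokenizer, XLMRobertaModel

+ # Illustrative only; see https://github.com/facebookresearch/mexma for the real training code.
+ tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+ encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large", add_pooling_layer=False)
+ mlm_head = nn.Linear(encoder.config.hidden_size, encoder.config.vocab_size)  # toy prediction head

+ # Paired sentences: language A stays clean, language B gets masked.
+ inputs_a = tokenizer("The weather is lovely today.", return_tensors="pt")
+ inputs_b = tokenizer("Il fait très beau aujourd'hui.", return_tensors="pt")
+ labels = inputs_b.input_ids.clone()
+ masked_ids = inputs_b.input_ids.clone()
+ masked_ids[0, 3] = tokenizer.mask_token_id            # mask one token of sentence B
+ labels[masked_ids != tokenizer.mask_token_id] = -100  # compute the loss only on masked positions

+ cls_a = encoder(**inputs_a).last_hidden_state[:, 0]   # sentence representation of A
+ tokens_b = encoder(input_ids=masked_ids, attention_mask=inputs_b.attention_mask).last_hidden_state

+ # Let the sentence vector of A help predict the masked tokens of B.
+ logits = mlm_head(tokens_b + cls_a.unsqueeze(1))
+ loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
+ ```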

+ # Usage
+ You can use this model as you would any other XLM-RoBERTa model. Note that the pooler has not been trained, so you should use the CLS token output by the encoder directly as your sentence representation:
+ ```python
+ from transformers import AutoTokenizer, XLMRobertaModel

+ tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+ model = XLMRobertaModel.from_pretrained("facebook/MEXMA", add_pooling_layer=False)
+ example_sentences = ['Sentence1', 'Sentence2']
+ example_inputs = tokenizer(example_sentences, return_tensors='pt')

+ outputs = model(**example_inputs)
+ # The CLS token (first position of the last hidden state) is the sentence representation
+ sentence_representation = outputs.last_hidden_state[:, 0]
+ print(sentence_representation.shape)  # torch.Size([2, 1024])
  ```
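
+ To compare sentences, you can compute cosine similarity directly between these CLS representations. A minimal sketch continuing the snippet above (normalizing and taking a dot product is just one straightforward way to do it):
+ ```python
+ import torch.nn.functional as F

+ # Cosine similarity between all pairs of sentence representations
+ normalized = F.normalize(sentence_representation, p=2, dim=1)
+ similarities = normalized @ normalized.T
+ print(similarities)  # 2x2 matrix with 1.0 on the diagonal
+ ```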

+ # License
+ This model is released under the MIT license.

+ # Training code
+ For the training code of this model, please check the official [MEXMA repo](https://github.com/facebookresearch/mexma).

+ # Paper
+ [MEXMA: Token-level objectives improve sentence representations](https://arxiv.org/abs/2409.12737)

+ # Citation
+ If you use this model in your work, please cite:
  ```
+ @misc{janeiro2024mexma,
+ title={MEXMA: Token-level objectives improve sentence representations},
+ author={João Maria Janeiro and Benjamin Piwowarski and Patrick Gallinari and Loïc Barrault},
+ year={2024},
+ eprint={2409.12737},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2409.12737},
+ }
  ```