talsheffer committed on
Commit 4092576 · verified · 1 Parent(s): b82d4b2

Update README.md

Files changed (1):
  1. README.md +112 -103
README.md CHANGED
@@ -4,66 +4,61 @@ tags:
  - sentence-similarity
  - feature-extraction
  - transformers
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # SentenceTransformer
-
- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 3584-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 32768 tokens
- - **Output Dimensionality:** 3584 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
  ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen2Model
-   (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
- )
  ```

  ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
  sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
- # [3, 3584]

  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
@@ -71,71 +66,85 @@ print(similarities.shape)
  # [3, 3]
  ```

- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.12
- - Sentence Transformers: 3.4.1
- - Transformers: 4.49.0
- - PyTorch: 2.5.1+cu124
- - Accelerate: 1.1.1
- - Datasets: 3.1.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
  - sentence-similarity
  - feature-extraction
  - transformers
+ - Qwen2
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ license: other
+ license_name: qodoai-open-rail-m
+ license_link: LICENSE
+ base_model:
+ - Alibaba-NLP/gte-Qwen2-7B-instruct
  ---

+ ## Qodo-Embed-1
+ **Qodo-Embed-1** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
+ It is offered in two sizes: lite (1.5B) and medium (7B). The model is optimized for natural language-to-code and code-to-code retrieval, making it highly effective for applications such as code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
+ This model outperforms all previous open-source models on the COIR and MTEB leaderboards, achieving best-in-class performance with a significantly smaller size than competing models.
+
+ ### Languages Supported
+ * Python
+ * C++
+ * C#
+ * Go
+ * Java
+ * JavaScript
+ * PHP
+ * Ruby
+ * TypeScript
+
+ ## Model Information
+ - Model Size: 7B
+ - Embedding Dimension: 3584
+ - Max Input Tokens: 32k
+
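A quick way to sanity-check these numbers after loading the model (a minimal sketch, not part of the original card; it reuses the model id from the Usage section below):

```python
from sentence_transformers import SentenceTransformer

# Sketch: the card's stated dimensions should be visible on the loaded model.
model = SentenceTransformer("Qodo/Qodo-Embed-1-7B")
print(model.get_sentence_embedding_dimension())  # expected: 3584
print(model.max_seq_length)                      # expected: 32768
```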
+ ## Requirements
  ```
+ transformers>=4.39.2
+ flash_attn>=2.5.6
  ```
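One way to satisfy these pins, assuming a pip-based setup (note that flash_attn typically needs a CUDA toolchain to build; sentence-transformers is added here for the examples below):

```
pip install "transformers>=4.39.2" "flash_attn>=2.5.6" sentence-transformers
```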

  ## Usage

+ ### Sentence Transformers

  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("Qodo/Qodo-Embed-1-7B")
  # Run inference
  sentences = [
+     'accumulator = sum(item.value for item in collection)',
+     'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',
+     'matrix = [[i*j for j in range(n)] for i in range(n)]'
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
+ # [3, 3584]

  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities.shape)
  # [3, 3]
  ```
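Since the model targets natural language-to-code retrieval, the usual pattern is to embed a query and a snippet corpus separately and rank the snippets by similarity. A minimal sketch along those lines (the query and corpus are illustrative, not from the card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-7B")

# Illustrative corpus of code snippets and a natural-language query.
corpus = [
    "def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1",
    "def flatten(nested):\n    return [x for sub in nested for x in sub]",
]
query = "find an element in a sorted list"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# model.similarity applies the model's similarity function (cosine similarity).
scores = model.similarity(query_embedding, corpus_embeddings)  # shape [1, len(corpus)]
best = scores.argmax().item()
print(corpus[best])
```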

+ ### Transformers

+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ from torch import Tensor
+ from transformers import AutoTokenizer, AutoModel
+
+
+ def last_token_pool(last_hidden_states: Tensor,
+                     attention_mask: Tensor) -> Tensor:
+     # With left padding, the last position always holds a real token;
+     # otherwise, index each sequence at its final non-padding position.
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
+     if left_padding:
+         return last_hidden_states[:, -1]
+     else:
+         sequence_lengths = attention_mask.sum(dim=1) - 1
+         batch_size = last_hidden_states.shape[0]
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+
+
+ # Example natural-language queries and candidate code documents
+ queries = [
+     'how to handle memory efficient data streaming',
+     'implement binary tree traversal'
+ ]
+
+ documents = [
+     """def process_in_chunks():
+     buffer = deque(maxlen=1000)
+     for record in source_iterator:
+         buffer.append(transform(record))
+         if len(buffer) >= 1000:
+             yield from buffer
+             buffer.clear()""",
+
+     """class LazyLoader:
+     def __init__(self, source):
+         self.generator = iter(source)
+         self._cache = []
+
+     def next_batch(self, size=100):
+         while len(self._cache) < size:
+             try:
+                 self._cache.append(next(self.generator))
+             except StopIteration:
+                 break
+         return self._cache.pop(0) if self._cache else None""",
+
+     """def dfs_recursive(root):
+     if not root:
+         return []
+     stack = []
+     stack.extend(dfs_recursive(root.right))
+     stack.append(root.val)
+     stack.extend(dfs_recursive(root.left))
+     return stack"""
+ ]
+ input_texts = queries + documents
+
+ tokenizer = AutoTokenizer.from_pretrained('Qodo/Qodo-Embed-1-7B', trust_remote_code=True)
+ model = AutoModel.from_pretrained('Qodo/Qodo-Embed-1-7B', trust_remote_code=True)
+
+ max_length = 8192
+
+ # Tokenize the input texts
+ batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
+ outputs = model(**batch_dict)
+ embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+
+ # Normalize embeddings, then score each query against each document
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ scores = (embeddings[:2] @ embeddings[2:].T) * 100
+ print(scores.tolist())
+ ```
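The score matrix above has one row per query and one column per document, so ranked retrieval results fall out directly. A small follow-on sketch (reusing `queries`, `documents`, and `scores` from the block above):

```python
# Rank documents for each query by descending similarity score.
for query, row in zip(queries, scores.tolist()):
    ranking = sorted(range(len(documents)), key=row.__getitem__, reverse=True)
    print(f"{query!r} -> best match is document {ranking[0]} (score {row[ranking[0]]:.1f})")
```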

+ ## License
+ [Qodo-Model-Terms-of-Service](https://www.qodo.ai/qodo-model-terms-of-service/)