---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
- Qwen2
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: other
license_name: qodo-model
license_link: LICENSE
base_model:
- Alibaba-NLP/gte-Qwen2-7B-instruct
---

## Qodo-Embed-1 
**Qodo-Embed-1** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
It is offered in two sizes: lite (1.5B) and medium (7B). The model is optimized for natural language-to-code and code-to-code retrieval, which makes it well suited to applications such as code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
This model outperforms all previous open-source models on the CoIR and MTEB leaderboards, achieving best-in-class performance at a significantly smaller size than competing models.

### Languages Supported: 
* Python
* C++
* C#
* Go
* Java
* JavaScript
* PHP
* Ruby
* TypeScript
 
## Model Information 
- Model Size: 7B 
- Embedding Dimension: 3584
- Max Input Tokens: 32k

## Requirements
```
transformers>=4.39.2
flash_attn>=2.5.6
```

## Usage

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Qodo/Qodo-Embed-1-7B")
# Run inference
sentences = [
    'accumulator = sum(item.value for item in collection)',  
    'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',  
    'matrix = [[i*j for j in range(n)] for i in range(n)]'  
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 3584]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
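
`model.similarity` returns a pairwise score matrix over the inputs. As a minimal, dependency-free sketch of what such a matrix contains, the example below assumes cosine similarity (the Sentence Transformers default unless a model configures a different function). The helper names and the toy 3-dimensional vectors are hypothetical stand-ins, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy vectors standing in for real 3584-dimensional embeddings
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
matrix = similarity_matrix(toy_embeddings)
```

The diagonal of `matrix` is 1 (each vector compared with itself), and the matrix is symmetric, which is why a three-input batch yields a 3 × 3 result.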

### Transformers

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence ends at the final position
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence's last non-padding token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Natural-language queries, passed here as plain text without a task instruction
queries = [
    'how to handle memory efficient data streaming',
    'implement binary tree traversal'
]

documents = [
        """def process_in_chunks():
            buffer = deque(maxlen=1000)
            for record in source_iterator:
                buffer.append(transform(record))
                if len(buffer) >= 1000:
                    yield from buffer
                    buffer.clear()""",

        """class LazyLoader:
            def __init__(self, source):
                self.generator = iter(source)
                self._cache = []

            def next_batch(self, size=100):
                while len(self._cache) < size:
                    try:
                        self._cache.append(next(self.generator))
                    except StopIteration:
                        break
                return self._cache.pop(0) if self._cache else None""",

        """def dfs_recursive(root):
            if not root:
                return []
            stack = []
            stack.extend(dfs_recursive(root.right))
            stack.append(root.val)
            stack.extend(dfs_recursive(root.left))
            return stack"""
    ]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qodo/Qodo-Embed-1-7B', trust_remote_code=True)
model = AutoModel.from_pretrained('Qodo/Qodo-Embed-1-7B', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')

# Run the forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

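# Normalize embeddings so that the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Rows are queries, columns are documents; scores are scaled by 100
scores = (embeddings[:2] @ embeddings[2:].T) * 100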
print(scores.tolist())
```
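
The `scores` matrix printed above has one row per query and one column per document. A minimal sketch of turning such a matrix into a per-query ranking follows. The `rank_documents` helper and the score values are hypothetical, chosen only for illustration, and are not output from the model.

```python
def rank_documents(scores, top_k=None):
    """Return, for each query row, document indices sorted by descending score."""
    rankings = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append(order[:top_k] if top_k is not None else order)
    return rankings


# Made-up scores for 2 queries x 3 documents (not real model output)
example_scores = [
    [62.0, 55.5, 20.1],
    [18.3, 22.7, 58.9],
]
rankings = rank_documents(example_scores, top_k=2)
print(rankings)
# [[0, 1], [2, 1]]
```

With real output, `scores.tolist()` could be passed to the same helper; the returned indices refer to positions in the `documents` list.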




## License
[Qodo-Model](https://www.qodo.ai/qodo-model-terms-of-service/)