---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- code_eval
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: code docstring dev
      type: code-docstring-dev
    metrics:
    - type: pearson_cosine
      value: null
      name: Pearson Cosine
    - type: spearman_cosine
      value: null
      name: Spearman Cosine
license: apache-2.0
datasets:
- code-search-net/code_search_net
- Shuu12121/java-codesearch-dataset-open
- Shuu12121/rust-codesearch-dataset-open
- google/code_x_glue_ct_code_to_text
language:
- en
---
# SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉
This model is a **sentence-transformers** model fine-tuned from **[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)**, a **ModernBERT model for code that I pre-trained from scratch**.
It is designed for code search and efficiently computes semantic similarity between code snippets and documentation.
A key feature is its **maximum sequence length of 2048 tokens**, which allows it to handle moderately long code snippets and documentation.
Despite its relatively small size of about **150 million parameters**, it performs strongly on code search tasks.
---
### Model Evaluation
#### CoIR Evaluation Results
Despite its roughly **150M parameters**, this model scored **76.89** on the **CodeSearchNet** task of the CoIR benchmark, demonstrating strong code search performance.
Because the model is specialized for code search, it does not target other tasks, and no evaluation scores are reported for them.
On CodeSearchNet it outperforms many well-known models, as the comparison table below shows; a toy retrieval-evaluation sketch follows the table.
| Model Name | CodeSearchNet Score |
|-----------------------------------------------|----------------------|
| **Shuu12121/CodeModernBERT-Owl** | **76.89** |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
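For intuition about what a score like this measures, the sketch below runs a toy retrieval evaluation with mean reciprocal rank (MRR): each docstring query should retrieve its own code snippet. The two query/code pairs are hypothetical examples invented for illustration, and the published CoIR numbers come from a far larger corpus with a different protocol, so this is only a sketch.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical toy data: query i should retrieve code snippet i.
queries = [
    "Read a file and return its lines",
    "Compute the SHA-256 hash of a string",
]
codes = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "import hashlib\n\ndef sha256(text):\n    return hashlib.sha256(text.encode()).hexdigest()",
]

# Cosine-similarity matrix: rows are queries, columns are code candidates.
sims = model.similarity(model.encode(queries), model.encode(codes))

# Mean reciprocal rank over the toy set.
mrr = 0.0
for i in range(len(queries)):
    ranking = sims[i].argsort(descending=True)   # candidate indices, best first
    rank = (ranking == i).nonzero().item() + 1   # 1-based rank of the true match
    mrr += 1.0 / rank
print(f"MRR: {mrr / len(queries):.3f}")
```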
---
### Model Details
- **Model Type:** Sentence Transformer
- **Base Model:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)
- **Maximum Sequence Length:** 2048 tokens
- **Output Dimensions:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **License:** Apache-2.0
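As a quick sanity check, the sequence length and embedding dimension above can be read off the loaded model directly; this is a minimal sketch assuming the configuration stated on this card.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")
print(model.max_seq_length)                      # expected: 2048
print(model.get_sentence_embedding_dimension())  # expected: 768

# A single input yields one 768-dimensional embedding.
emb = model.encode("def add(a, b):\n    return a + b")
print(emb.shape)                                 # expected: (768,)
```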
---
### Usage
#### Installation
To install Sentence Transformers, run the following command:
```bash
pip install -U sentence-transformers
```
#### Model Loading and Inference
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")
# Example sentences for inference
sentences = [
'Encrypts the zip file',
'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]
# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
# Compute similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3])
```
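The embedding demo above extends naturally to the model's intended code-search workflow: embed a natural-language query and a pool of candidate functions, then rank the candidates by cosine similarity. The query and snippets below are hypothetical examples, not from the training data.
```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical query and candidate pool.
query = "Serializes an object to a JSON file"
candidates = [
    "import json\n\ndef save_json(obj, path):\n    with open(path, 'w') as f:\n        json.dump(obj, f)",
    "def bubble_sort(items):\n    for i in range(len(items)):\n        for j in range(len(items) - i - 1):\n            if items[j] > items[j + 1]:\n                items[j], items[j + 1] = items[j + 1], items[j]\n    return items",
]

# Rank candidates by cosine similarity to the query.
scores = model.similarity(model.encode([query]), model.encode(candidates))[0]
for idx in torch.argsort(scores, descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {candidates[idx].splitlines()[0]}")
```
Scores are cosine similarities in [-1, 1], so a higher score means a closer semantic match between the query and the snippet.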
---
### Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
---
### Citation
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```