---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- code_eval
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: code docstring dev
      type: code-docstring-dev
    metrics:
    - type: pearson_cosine
      value: null
      name: Pearson Cosine
    - type: spearman_cosine
      value: null
      name: Spearman Cosine
license: apache-2.0
datasets:
- code-search-net/code_search_net
- Shuu12121/java-codesearch-dataset-open
- Shuu12121/rust-codesearch-dataset-open
- google/code_x_glue_ct_code_to_text
language:
- en
---
# SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉
This model is a **sentence-transformers** model fine-tuned from **[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)**, a **ModernBERT model for code that I pre-trained from scratch**.
It is designed for code search and efficiently computes semantic similarity between code snippets and documentation.
A key feature is its **maximum sequence length of 2048 tokens**, which allows it to handle moderately long code snippets and documentation.
Despite its relatively small size of about **150 million parameters**, it performs strongly on code search tasks.
---
### Model Evaluation
#### CoIR Evaluation Results
Despite its roughly **150M parameters**, this model scored **76.89** on the **CodeSearchNet** task of the CoIR benchmark, demonstrating strong code search performance.
Because the model is specialized for code search, it does not target other tasks, and no evaluation scores are reported for them.
On CodeSearchNet it outperforms many well-known models, as the comparison table below shows; a toy retrieval-evaluation sketch follows the table.
| Model Name | CodeSearchNet Score |
|-----------------------------------------------|----------------------|
| **Shuu12121/CodeModernBERT-Owl** | **76.89** |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
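For intuition about what a score like this measures, the sketch below runs a toy retrieval evaluation with mean reciprocal rank (MRR): each docstring query should retrieve its own code snippet. The two query/code pairs are hypothetical examples invented for illustration, and the published CoIR numbers come from a far larger corpus with a different protocol, so this is only a sketch.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical toy data: query i should retrieve code snippet i.
queries = [
    "Read a file and return its lines",
    "Compute the SHA-256 hash of a string",
]
codes = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "import hashlib\n\ndef sha256(text):\n    return hashlib.sha256(text.encode()).hexdigest()",
]

# Cosine-similarity matrix: rows are queries, columns are code candidates.
sims = model.similarity(model.encode(queries), model.encode(codes))

# Mean reciprocal rank over the toy set.
mrr = 0.0
for i in range(len(queries)):
    ranking = sims[i].argsort(descending=True)   # candidate indices, best first
    rank = (ranking == i).nonzero().item() + 1   # 1-based rank of the true match
    mrr += 1.0 / rank
print(f"MRR: {mrr / len(queries):.3f}")
```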
---
### Model Details
- **Model Type:** Sentence Transformer
- **Base Model:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)
- **Maximum Sequence Length:** 2048 tokens
- **Output Dimensions:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **License:** Apache-2.0
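As a quick sanity check, the sequence length and embedding dimension above can be read off the loaded model directly; this is a minimal sketch assuming the configuration stated on this card.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")
print(model.max_seq_length)                      # expected: 2048
print(model.get_sentence_embedding_dimension())  # expected: 768

# A single input yields one 768-dimensional embedding.
emb = model.encode("def add(a, b):\n    return a + b")
print(emb.shape)                                 # expected: (768,)
```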
---
### Usage
#### Installation
To install Sentence Transformers, run the following command:
```bash
pip install -U sentence-transformers
```
#### Model Loading and Inference
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")
# Example sentences for inference
sentences = [
'Encrypts the zip file',
'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]
# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
# Compute similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3])
```
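The embedding demo above extends naturally to the model's intended code-search workflow: embed a natural-language query and a pool of candidate functions, then rank the candidates by cosine similarity. The query and snippets below are hypothetical examples, not from the training data.
```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical query and candidate pool.
query = "Serializes an object to a JSON file"
candidates = [
    "import json\n\ndef save_json(obj, path):\n    with open(path, 'w') as f:\n        json.dump(obj, f)",
    "def bubble_sort(items):\n    for i in range(len(items)):\n        for j in range(len(items) - i - 1):\n            if items[j] > items[j + 1]:\n                items[j], items[j + 1] = items[j + 1], items[j]\n    return items",
]

# Rank candidates by cosine similarity to the query.
scores = model.similarity(model.encode([query]), model.encode(candidates))[0]
for idx in torch.argsort(scores, descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {candidates[idx].splitlines()[0]}")
```
Scores are cosine similarities in [-1, 1], so a higher score means a closer semantic match between the query and the snippet.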
---
### Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
---
### Citation
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```