🦉 Shuu12121/CodeModernBERT-Owl-2.0

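CodeModernBERT-Owl-2.0 is the latest model in the CodeModernBERT-Owl series and supports multilingual code understanding and retrieval.

The model starts from the pretrained CodeModernBERT-Owl-2.0-Pre and applies continued pretraining on the same high-quality, custom-built code corpus, which further strengthens its grasp of syntax and semantics. Training was performed on CUDA devices.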

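🔍 Performance Gains from Continued Pretraining

Function-level code search was evaluated for major programming languages such as Python and Java, using the official test split of the CodeSearchNet benchmark. MRR improved in five of the six languages, with the largest gain in JavaScript (0.6948 to 0.7846); Go decreased slightly (0.8290 to 0.8129):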

Language	`Owl-2.0-Pre`	`Owl-2.0`
Python	0.8761	0.9080
Java	0.7992	0.8341
JavaScript	0.6948	0.7846
PHP	0.7904	0.7943
Ruby	0.7703	0.8150
Go	0.8290	0.8129

✅ All scores were measured on the official test splits of the CodeSearchNet benchmark.

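🔧 Model Specifications

Supported languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
Maximum token length during training: 2048
Maximum token length during inference: 8192 (extended)
Tokenizer: custom-trained, BPE-based
Model size: approximately 150M parameters (ModernBERT-based)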

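⚙️ Key Preprocessing Steps

Function and docstring extraction based on Tree-sitter parsing
Removal of non-English docstrings and templated comments
Automatic masking of API keys and secrets
Exclusion of code containing license text
Deduplication of function pairs to prevent data leakage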

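Main Use Cases

Function-level code search (natural language → code)
Code summarization, completion, classification, and code clone detection
Code retrieval backbone for Retrieval-Augmented Generation (RAG) systems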

English version

CodeModernBERT-Owl-2.0 is the latest multilingual model in the CodeModernBERT-Owl series for code understanding and retrieval.

This model was produced by continued pretraining of CodeModernBERT-Owl-2.0-Pre on the same high-quality, custom-built multilingual code corpus. Training ran on CUDA devices.
The additional training improved its ability to understand structural and semantic patterns in source code.

🔍 Evaluation on CodeSearchNet Benchmark Test Splits

The model was evaluated on function-level code search using the official test splits of the CodeSearchNet benchmark.
The following table compares Mean Reciprocal Rank (MRR) before and after continued pretraining. MRR improved in five of the six languages, most notably JavaScript; Go decreased slightly:

Language	`Owl-2.0-Pre`	`Owl-2.0`
Python	0.8761	0.9080
Java	0.7992	0.8341
JavaScript	0.6948	0.7846
PHP	0.7904	0.7943
Ruby	0.7703	0.8150
Go	0.8290	0.8129
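
For readers unfamiliar with the metric, MRR averages the reciprocal of the rank at which the correct code snippet appears for each query. The sketch below is a minimal illustration of that definition, not the evaluation script used for the table above; the function name `mean_reciprocal_rank` and the toy ranks are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Return the mean of 1/rank over all queries.

    `ranks` holds the 1-based position of the correct snippet for each
    query; None means the correct snippet was not retrieved (counts as 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # a miss contributes nothing to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(ranks)


# Toy example: four queries, correct snippet found at ranks 1, 2, 4, and never.
toy_ranks = [1, 2, 4, None]
print(mean_reciprocal_rank(toy_ranks))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

An MRR of 0.9080 therefore means that, on average, the correct function sits very close to the top of the ranked list.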

🔧 Model Specs

Supported Languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
Max Training Length: 2048 tokens
Max Inference Length: 8192 tokens (extended)
Tokenizer: Custom-trained BPE
Model Size: ~150M parameters (ModernBERT backbone)
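
The sketch below shows one way the model might be loaded and used to embed text with the Hugging Face `transformers` library. Several points are assumptions rather than facts from this card: that the checkpoint loads through `AutoModel` and `AutoTokenizer` under the repository name in the title, that the tokenizer defines a padding token, that the installed `transformers` version supports ModernBERT, and that mean pooling is a reasonable way to obtain one vector per input. The helper names `mean_pool` and `embed` are hypothetical.

```python
MODEL_ID = "Shuu12121/CodeModernBERT-Owl-2.0"  # repository name from this card
MAX_INFERENCE_TOKENS = 8192  # "Max Inference Length" from the specs above


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens.

    Mean pooling is an assumption of this sketch; the card does not say
    how sequence-level embeddings should be derived.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # zip(*kept) iterates over embedding dimensions (columns).
    return [sum(column) / len(kept) for column in zip(*kept)]


def embed(texts, max_length=MAX_INFERENCE_TOKENS):
    """Embed a list of strings; downloads the model on first use."""
    # Imported lazily so mean_pool can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    # Inputs longer than max_length are truncated, shorter ones padded.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        output = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"])

    hidden = output.last_hidden_state.tolist()
    masks = batch["attention_mask"].tolist()
    return [mean_pool(h, m) for h, m in zip(hidden, masks)]
```

Calling `embed(["def add(a, b): return a + b"])` would return one vector per input string. The pooling step is kept in plain Python here so it is easy to inspect; a production pipeline would more likely pool on the tensor directly.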

⚙️ Key Preprocessing Techniques

Accurate function/docstring extraction using Tree-sitter
Removal of non-English docstrings and templated comments
Automatic masking of API keys and secrets
Exclusion of code containing license text
Deduplication of code/docstring pairs to prevent data leakage
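
To make three of the steps above concrete, here is a minimal sketch of secret masking, a non-English filter, and pair deduplication. It is not the author's actual pipeline: the regular expression, the ASCII-ratio heuristic, the 0.9 threshold, and all function names are hypothetical choices for illustration. Tree-sitter extraction and license filtering are not shown.

```python
import hashlib
import re

# Hypothetical pattern: a credential-like name, an assignment, and a quoted
# value of at least 8 non-space characters.
SECRET_PATTERN = re.compile(
    r"""((?:api[_-]?key|secret|token|password)\s*[:=]\s*)(["'])[^"'\s]{8,}\2""",
    re.IGNORECASE,
)


def mask_secrets(code):
    """Replace quoted credential-like values with a fixed placeholder."""
    return SECRET_PATTERN.sub(r"\1\2<MASKED>\2", code)


def is_mostly_ascii(docstring, threshold=0.9):
    """Heuristic stand-in for an English filter: share of ASCII characters."""
    if not docstring:
        return False
    ascii_count = sum(1 for ch in docstring if ch.isascii())
    return ascii_count / len(docstring) >= threshold


def deduplicate_pairs(pairs):
    """Keep the first occurrence of each (code, docstring) pair.

    Whitespace is normalized before hashing so trivially reformatted
    copies collapse to a single entry.
    """
    seen = set()
    unique = []
    for code, doc in pairs:
        normalized = " ".join(code.split()) + "\x00" + " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, doc))
    return unique


print(mask_secrets('API_KEY = "sk_live_1234567890"'))  # API_KEY = "<MASKED>"
```

Exact-hash deduplication only catches copies that differ in whitespace; near-duplicate detection would need a fuzzier technique, which this sketch does not attempt.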

Main Applications

Function-level code search (natural language → code)
Code summarization, completion, classification, clone detection
Retrieval backend for Retrieval-Augmented Generation (RAG) systems over code corpora
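
The first use case, natural language to code search, usually reduces to ranking code embeddings by similarity to a query embedding. The sketch below illustrates that ranking step with cosine similarity. The three-dimensional toy vectors stand in for real model embeddings, and the names `cosine_similarity` and `search` are hypothetical; the card does not specify which similarity measure was used in evaluation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vector, corpus, top_k=3):
    """Return the top_k (snippet, score) pairs, best match first.

    `corpus` is a list of (snippet, embedding) tuples.
    """
    scored = [(snippet, cosine_similarity(query_vector, vector))
              for snippet, vector in corpus]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors stand in for embeddings produced by the model.
corpus = [
    ("def add(a, b): return a + b", [0.9, 0.1, 0.0]),
    ("def read_file(path): ...", [0.0, 0.2, 0.9]),
    ("def multiply(a, b): return a * b", [0.7, 0.6, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "add two numbers"

for snippet, score in search(query_vector, corpus, top_k=2):
    print(f"{score:.3f}  {snippet}")
```

In a RAG setting, the snippets returned by `search` would be passed to a generator model as context. For large corpora, the linear scan here would typically be replaced by an approximate nearest-neighbor index.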
