# Code-Specialized Model2Vec Distillation Analysis

## Executive Summary
This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.
### Evaluated Models Overview

- Simplified Distillation Models: 14
- Peer Comparison Models: 19
- Total Models Analyzed: 33
Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2
Overall CodeSearchNet Performance:
- NDCG@10: 0.7387
- Mean Reciprocal Rank (MRR): 0.7010
- Recall@5: 0.8017
- Mean Rank: 6.4
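Because each CodeSearchNet query has exactly one relevant result (the function a docstring belongs to), all four metrics reduce to simple functions of the rank at which that function is retrieved. A minimal sketch of the computation (the helper name is illustrative):

```python
import numpy as np

def retrieval_metrics(ranks) -> dict[str, float]:
    """Metrics from the 1-based rank of the single correct result per query."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        # With one relevant item, DCG is 1/log2(rank + 1) and the ideal DCG is 1.
        "ndcg@10": float(np.mean(np.where(ranks <= 10, 1.0 / np.log2(ranks + 1), 0.0))),
        "mrr": float(np.mean(1.0 / ranks)),
        "recall@5": float(np.mean(ranks <= 5)),
        "mean_rank": float(np.mean(ranks)),
    }
```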
## Comprehensive Model Comparison

### All Simplified Distillation Models Performance
| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|---|---|---|---|---|---|
| code_model2vec_all_mpnet_base_v2 | sentence-transformers/all-mpnet-base-v2 | 0.7387 | 0.7010 | 0.8017 | #1 (Best) |
| code_model2vec_all_MiniLM_L6_v2 | sentence-transformers/all-MiniLM-L6-v2 | 0.7385 | 0.7049 | 0.7910 | #2 |
| code_model2vec_jina_embeddings_v2_base_code | jina-embeddings-v2-base-code | 0.7381 | 0.6996 | 0.8130 | #3 |
| code_model2vec_paraphrase_MiniLM_L6_v2 | sentence-transformers/paraphrase-MiniLM-L6-v2 | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | lightonai/Reason-ModernColBERT | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | sentence-transformers/all-mpnet-base-v2 | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | BAAI/bge-m3 | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | jinaai/jina-embeddings-v3 | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | nomic-ai/nomic-embed-text-v2-moe | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | Qodo/Qodo-Embed-1-1.5B | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | microsoft/graphcodebert-base | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | Linq-AI-Research/Linq-Embed-Mistral | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | microsoft/codebert-base | 0.2779 | 0.2534 | 0.3136 | #14 |
## Model Specifications Analysis

Our distilled models share a common embedding dimensionality, while vocabulary size (and with it parameter count and disk footprint) tracks each teacher's tokenizer:
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|---|---|---|---|---|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |
Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.
Key Insights from Model Specifications:
- Vocabulary Size: Each student inherits its teacher's tokenizer vocabulary, ranging from 29,525 to 249,999 tokens (avg: 101,594)
- Parameter Efficiency: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- Storage Efficiency: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- Embedding Dimensions: Consistent 256 dimensions across all models (optimized for efficiency)
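The parameter counts above are simply vocabulary size × embedding dimension (e.g. 29,528 × 256 ≈ 7.6M for the all_mpnet_base_v2 student). A quick way to verify a model's specifications, sketched against the model2vec `StaticModel` API (the local path is illustrative, and the `embedding` attribute name may vary across releases):

```python
from model2vec import StaticModel  # pip install model2vec

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")  # illustrative path

vocab_size, embed_dim = model.embedding.shape  # one static vector per token
print(f"vocabulary: {vocab_size:,} tokens")
print(f"embedding dim: {embed_dim}")
print(f"parameters: {vocab_size * embed_dim / 1e6:.1f}M")
```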
### Key Findings

- Best Teacher Model: sentence-transformers/all-mpnet-base-v2, whose student code_model2vec_all_mpnet_base_v2 reaches NDCG@10: 0.7387
- Least Effective Teacher: microsoft/codebert-base, whose student code_model2vec_codebert_base reaches NDCG@10: 0.2779
- Performance Range: 62.4% relative gap between best and worst ((0.7387 − 0.2779) / 0.7387 ≈ 0.624)
- Average Performance: 0.5248 NDCG@10
## Language Performance Radar Charts

### Best Model vs Peer Models Comparison
Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.
### Individual Model Performance by Language
code_model2vec_all_mpnet_base_v2 (Teacher: sentence-transformers/all-mpnet-base-v2) - NDCG@10: 0.7387
code_model2vec_all_MiniLM_L6_v2 (Teacher: sentence-transformers/all-MiniLM-L6-v2) - NDCG@10: 0.7385
code_model2vec_jina_embeddings_v2_base_code (Teacher: jina-embeddings-v2-base-code) - NDCG@10: 0.7381
code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: sentence-transformers/paraphrase-MiniLM-L6-v2) - NDCG@10: 0.7013
code_model2vec_Reason_ModernColBERT (Teacher: lightonai/Reason-ModernColBERT) - NDCG@10: 0.6598
code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: sentence-transformers/all-mpnet-base-v2) - NDCG@10: 0.6147
code_model2vec_bge_m3 (Teacher: BAAI/bge-m3) - NDCG@10: 0.4863
code_model2vec_jina_embeddings_v3 (Teacher: jinaai/jina-embeddings-v3) - NDCG@10: 0.4755
code_model2vec_nomic_embed_text_v2_moe (Teacher: nomic-ai/nomic-embed-text-v2-moe) - NDCG@10: 0.4532
code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: Alibaba-NLP/gte-Qwen2-1.5B-instruct) - NDCG@10: 0.4238
code_model2vec_Qodo_Embed_1_1.5B (Teacher: Qodo/Qodo-Embed-1-1.5B) - NDCG@10: 0.4101
code_model2vec_graphcodebert_base (Teacher: microsoft/graphcodebert-base) - NDCG@10: 0.3420
code_model2vec_Linq_Embed_Mistral (Teacher: Linq-AI-Research/Linq-Embed-Mistral) - NDCG@10: 0.2868
code_model2vec_codebert_base (Teacher: microsoft/codebert-base) - NDCG@10: 0.2779
## Peer Model Comparison
Comparison with established code-specialized embedding models using actual evaluation results.
### Complete Model Ranking
| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|---|---|---|---|---|---|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | Code-Specific | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | Code-Specific | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | Simplified Distillation | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | Simplified Distillation | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | Simplified Distillation | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | Simplified Distillation | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | Simplified Distillation | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | Fine-tuned Distillation | 0.6147 | 0.5720 | 0.6950 |
| 21 | minishlab/potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | Simplified Distillation | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | Simplified Distillation | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | Simplified Distillation | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | Simplified Distillation | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | Simplified Distillation | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | Simplified Distillation | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | Simplified Distillation | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | Simplified Distillation | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
## Performance Analysis

### Multi-Model Comparison Charts
Comprehensive comparison across all evaluation metrics.
### Language Performance Analysis
Performance heatmap showing how different models perform across programming languages.
### Efficiency Analysis
Performance vs model size analysis showing the efficiency benefits of distillation.
## Operational Performance Analysis
Comprehensive performance benchmarking across multiple operational metrics.
### Performance Scaling Analysis
How performance scales with different batch sizes for optimal throughput.
Memory usage patterns across different batch sizes.
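A throughput sweep of this kind can be reproduced in a few lines. The sketch below assumes the model2vec `StaticModel.encode` API (which accepts a `batch_size` argument in recent releases) and an illustrative local model path; memory profiling would layer `tracemalloc` or `psutil` on top:

```python
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")  # illustrative path
docs = ["def add(a, b):\n    return a + b"] * 10_000  # synthetic corpus

for batch_size in (32, 128, 512, 2048):
    start = time.perf_counter()
    model.encode(docs, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:>4}: {len(docs) / elapsed:,.0f} docs/sec")
```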
## Language-Specific Analysis

### Performance by Programming Language

| Language | Best Model Performance | Average Performance | Language Difficulty |
|---|---|---|---|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |
## Conclusions and Recommendations

### Teacher Model Analysis
Based on the evaluation results across all simplified distillation models:
- Best Teacher Model: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 effectively tied (NDCG@10: 0.7385)
- Least Effective Teacher: microsoft/codebert-base (NDCG@10: 0.2779)
- Teacher Model Impact: The choice of teacher changes distilled retrieval quality by up to 62.4% (relative NDCG@10)
### Recommendations

- For Production: Use sentence-transformers/all-mpnet-base-v2 (or the near-identical sentence-transformers/all-MiniLM-L6-v2) as the teacher model for the best distilled performance
- For Efficiency: Model2Vec distillation delivers a large size reduction (7.6M to 64.0M parameters, 14.4MB to 122.1MB on disk) while retaining competitive retrieval quality
- For Code Tasks: Strong general-purpose sentence-embedding teachers consistently outperformed classic code-specialized teachers such as microsoft/codebert-base; code specialization in the teacher did not guarantee a strong static student
## Methodology

### Evaluation Protocol
- Dataset: CodeSearchNet test sets for 6 programming languages
- Metrics: NDCG@k, MRR, Recall@k following CodeSearchNet methodology
- Query Format: Natural language documentation strings
- Corpus Format: Function code strings
- Evaluation: Retrieval of correct code for each documentation query
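A condensed sketch of this protocol, assuming aligned docstring/code pairs and any model exposing an `encode` method that returns a 2-D array (batching and tie handling in the actual pipeline may differ); the resulting ranks feed directly into the `retrieval_metrics` helper sketched earlier:

```python
import numpy as np

def correct_code_ranks(model, docstrings: list[str], code_snippets: list[str]) -> np.ndarray:
    """CodeSearchNet-style protocol: docstring i should retrieve code snippet i.

    Returns the 1-based rank of the correct snippet for every query.
    """
    queries = np.asarray(model.encode(docstrings), dtype=float)   # NL queries
    corpus = np.asarray(model.encode(code_snippets), dtype=float)  # function bodies

    # Cosine similarity via L2-normalized dot products.
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = queries @ corpus.T

    # Rank = 1 + number of snippets scored strictly higher than the correct one.
    correct = np.diag(scores)
    return 1 + (scores > correct[:, None]).sum(axis=1)
```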
### Teacher Models Tested
- sentence-transformers/all-MiniLM-L6-v2 (proven baseline)
- sentence-transformers/all-mpnet-base-v2 (general purpose)
- sentence-transformers/paraphrase-MiniLM-L6-v2 (paraphrase model)
- microsoft/codebert-base (code-specialized)
- microsoft/graphcodebert-base (graph-aware code model)
- Alibaba-NLP/gte-Qwen2-1.5B-instruct (instruction model)
- BAAI/bge-m3 (multilingual model)
- jinaai/jina-embeddings-v2-base-code (code-specialized)
- jinaai/jina-embeddings-v3 (modern embedding model)
- nomic-ai/nomic-embed-text-v2-moe (mixture of experts)
- Qodo/Qodo-Embed-1-1.5B (code-specialized)
- lightonai/Reason-ModernColBERT (ColBERT architecture)
- Linq-AI-Research/Linq-Embed-Mistral (Mistral-based)
- BAAI/bge-code-v1 (code-specialized BGE)
- Salesforce/SFR-Embedding-Code-2B_R (large code model)
### Distillation Method
- Technique: Model2Vec static embedding generation
- Parameters: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- Training Data: CodeSearchNet comment-code pairs
- Languages: Python, JavaScript, Java, PHP, Ruby, Go
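For reference, a minimal sketch of the core distillation step with these parameters, using the model2vec package. Keyword names vary across releases (older versions expose `apply_zipf` for Zipf weighting, newer ones fold it into `sif_coefficient`), and the CodeSearchNet-specific vocabulary and training steps of this pipeline are not shown:

```python
from model2vec.distill import distill

# Distill a static embedding model from a Sentence Transformers teacher.
model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # best teacher in this report
    pca_dims=256,          # PCA dims=256
    sif_coefficient=1e-3,  # SIF coefficient=1e-3 (Zipf-style frequency weighting)
)
model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```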
Report generated on 2025-06-01 08:04:06 using an automated analysis pipeline. For questions about the methodology or results, please refer to the CodeSearchNet documentation.