# Code-Specialized Model2Vec Distillation Analysis

## 🎯 Executive Summary

This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.

### Evaluated Models Overview

- **Simplified Distillation Models:** 14
- **Peer Comparison Models:** 19
- **Total Models Analyzed:** 33

### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2

**Overall CodeSearchNet Performance:**

- **NDCG@10**: 0.7387
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Mean Rank**: 6.4

## 📊 Comprehensive Model Comparison

### All Simplified Distillation Models Performance

| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | 🥇 Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | 🥈 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | 🥉 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |
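Every `code_model2vec_*` model above was produced with the same Model2Vec recipe (full parameter list under *Distillation Method* at the end of this report). The sketch below shows only the core static-distillation step and assumes the `model2vec` library's `distill` API; it does not cover the CodeSearchNet training data or the fine-tuned variant listed in the table.

```python
# Minimal sketch of the shared distillation step (assumed model2vec `distill` API).
# pca_dims matches the 256-dimension setting used for all models; the SIF coefficient
# (1e-3) and Zipf weighting are passed via additional arguments whose names differ
# between model2vec versions, so they are omitted here.
from model2vec.distill import distill

m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # teacher model
    pca_dims=256,
)
m2v_model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```

Swapping `model_name` for any teacher listed above yields the corresponding `code_model2vec_*` variant.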
### 📊 Model Specifications Analysis

All distilled models share the same 256-dimensional static-embedding architecture; vocabulary size, parameter count, and disk footprint vary with the teacher's tokenizer:

| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |

![Model Specifications](analysis_charts/model_specifications.png)

*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*

#### Key Insights from Model Specifications:

- **Vocabulary Size**: Vocabulary sizes range from 29,525 to 249,999 tokens (avg: 101,594), driven by each teacher's tokenizer
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (optimized for efficiency)
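Because a Model2Vec model stores only a static embedding table, the parameter counts above follow directly from vocabulary size × embedding dimension. A small sketch for reproducing these numbers from a saved model, assuming `StaticModel` exposes the embedding matrix as `model.embedding` (attribute names may vary between library versions):

```python
# Sketch: derive the specification numbers for one distilled model (assumed API).
from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")  # local path or HF id

vocab_size, dim = model.embedding.shape   # one vector per vocabulary token
params = vocab_size * dim                 # a static model's parameters are just the embedding table
print(f"vocabulary: {vocab_size:,} tokens | dim: {dim} | parameters: {params / 1e6:.1f}M")
# On-disk size additionally depends on the stored dtype and serialization format.
```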
### Key Findings

- **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (code_model2vec_all_mpnet_base_v2, NDCG@10: 0.7387)
- **Least Effective Teacher**: microsoft/codebert-base (code_model2vec_codebert_base, NDCG@10: 0.2779)
- **Performance Range**: 62.4% difference between best and worst
- **Average Performance**: 0.5248 NDCG@10

## 🎯 Language Performance Radar Charts

### Best Model vs Peer Models Comparison

![Comparative Radar Chart](analysis_charts/comparative_radar.png)

*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*

### Individual Model Performance by Language

#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387

![code_model2vec_all_mpnet_base_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2.png)

#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385

![code_model2vec_all_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_all_MiniLM_L6_v2.png)

#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jina-embeddings-v2-base-code](https://huggingface.co/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381

![code_model2vec_jina_embeddings_v2_base_code Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v2_base_code.png)

#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013

![code_model2vec_paraphrase_MiniLM_L6_v2 Radar Chart](analysis_charts/radar_code_model2vec_paraphrase_MiniLM_L6_v2.png)

#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598

![code_model2vec_Reason_ModernColBERT Radar Chart](analysis_charts/radar_code_model2vec_Reason_ModernColBERT.png)

#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147

![code_model2vec_all_mpnet_base_v2_fine_tuned Radar Chart](analysis_charts/radar_code_model2vec_all_mpnet_base_v2_fine_tuned.png)

#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863

![code_model2vec_bge_m3 Radar Chart](analysis_charts/radar_code_model2vec_bge_m3.png)

#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755

![code_model2vec_jina_embeddings_v3 Radar Chart](analysis_charts/radar_code_model2vec_jina_embeddings_v3.png)

#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532

![code_model2vec_nomic_embed_text_v2_moe Radar Chart](analysis_charts/radar_code_model2vec_nomic_embed_text_v2_moe.png)

#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238

![code_model2vec_gte_Qwen2_1.5B_instruct Radar Chart](analysis_charts/radar_code_model2vec_gte_Qwen2_15B_instruct.png)

#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101

![code_model2vec_Qodo_Embed_1_1.5B Radar Chart](analysis_charts/radar_code_model2vec_Qodo_Embed_1_15B.png)

#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420

![code_model2vec_graphcodebert_base Radar Chart](analysis_charts/radar_code_model2vec_graphcodebert_base.png)

#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868

![code_model2vec_Linq_Embed_Mistral Radar Chart](analysis_charts/radar_code_model2vec_Linq_Embed_Mistral.png)

#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779

![code_model2vec_codebert_base Radar Chart](analysis_charts/radar_code_model2vec_codebert_base.png)

## 🏆 Peer Model Comparison

![Peer Comparison](analysis_charts/peer_comparison.png)

*Comparison with established code-specialized embedding models using actual evaluation results.*

### Complete Model Ranking

| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **🔥 Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **🔥 Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **🔥 Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **🔥 Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **🔥 Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **🎓 Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **🔥 Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **🔥 Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **🔥 Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **🔥 Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **🔥 Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **🔥 Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **🔥 Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **🔥 Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |

## 📈 Performance Analysis

### Multi-Model Comparison Charts

![Model Comparison](analysis_charts/model_comparison.png)

*Comprehensive comparison across all evaluation metrics.*

### Language Performance Analysis

![Language Heatmap](analysis_charts/language_heatmap.png)

*Performance heatmap showing how different models perform across programming languages.*

### Efficiency Analysis

![Efficiency Analysis](analysis_charts/efficiency_analysis.png)

*Performance vs model size analysis showing the efficiency benefits of distillation.*

## ⚡ Operational Performance Analysis

![Benchmark Performance](analysis_charts/benchmark_performance.png)

*Comprehensive performance benchmarking across multiple operational metrics.*

### Performance Scaling Analysis

![Batch Size Scaling](analysis_charts/batch_size_scaling.png)

*How performance scales with different batch sizes for optimal throughput.*

![Memory Scaling](analysis_charts/memory_scaling.png)

*Memory usage patterns across different batch sizes.*
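The batch-size scaling numbers can be approximated with a simple timing loop over the encoder. A minimal sketch, assuming the `StaticModel.encode` API; the model path, corpus, and batch sizes are illustrative, and the report's benchmark harness may additionally track latency percentiles and memory:

```python
# Sketch: measure encoding throughput at several batch sizes (assumed model2vec API).
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")
docs = ["def add(a, b):\n    return a + b"] * 10_000  # toy corpus

for batch_size in (32, 128, 512, 2048):
    start = time.perf_counter()
    for i in range(0, len(docs), batch_size):
        model.encode(docs[i : i + batch_size])
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:>4}: {len(docs) / elapsed:,.0f} docs/sec")
```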
## 🔍 Language-Specific Analysis

### Performance by Programming Language

| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|---------------------|---------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| Javascript | 0.9550 | 0.5850 | Easy |
| Php | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |

## 🎯 Conclusions and Recommendations

### Teacher Model Analysis

Based on the evaluation results across all simplified distillation models:

1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 a close second (0.7385)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: Choice of teacher model affects performance by 62.4%

### Recommendations

- **For Production**: Use sentence-transformers/all-mpnet-base-v2 (or the nearly equivalent all-MiniLM-L6-v2) as the teacher model for best retrieval quality
- **For Efficiency**: Model2Vec distillation provides significant size reduction with competitive performance
- **For Code Tasks**: Strong general-purpose sentence embedders distill as well as or better than code-specialized teachers; codebert-base and graphcodebert-base were the weakest teachers in this evaluation

## 📄 Methodology

### Evaluation Protocol

- **Dataset**: CodeSearchNet test sets for 6 programming languages
- **Metrics**: NDCG@k, MRR, Recall@k following CodeSearchNet methodology
- **Query Format**: Natural language documentation strings
- **Corpus Format**: Function code strings
- **Evaluation**: Retrieval of correct code for each documentation query
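A minimal sketch of this protocol, assuming the `model2vec` `StaticModel` API and a list of (docstring, code) pairs per language; the actual harness follows the CodeSearchNet setup and may differ in candidate-pool construction and tie handling:

```python
# Sketch: rank each language's code corpus against its docstring queries and
# compute MRR, Recall@5, NDCG@10, and mean rank (one relevant snippet per query).
import numpy as np
from model2vec import StaticModel


def evaluate(model: StaticModel, pairs: list[tuple[str, str]], k: int = 10) -> dict[str, float]:
    queries = model.encode([doc for doc, _ in pairs])   # natural-language docstrings
    corpus = model.encode([code for _, code in pairs])  # function code strings

    # Cosine similarity via normalized dot products.
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = queries @ corpus.T

    # Rank of the correct snippet for each query (1 = retrieved first).
    ranks = np.array(
        [int(np.where(np.argsort(-sims[i]) == i)[0][0]) + 1 for i in range(len(pairs))]
    )

    return {
        "mrr": float((1.0 / ranks).mean()),
        "recall@5": float((ranks <= 5).mean()),
        f"ndcg@{k}": float(np.where(ranks <= k, 1.0 / np.log2(ranks + 1), 0.0).mean()),
        "mean_rank": float(ranks.mean()),
    }
```

With a single relevant document per query, NDCG@10 reduces to `1 / log2(rank + 1)` for hits inside the cutoff, which is what the sketch computes.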
### Teacher Models Tested

- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline)
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose)
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model)
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized)
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model)
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model)
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts)
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized)
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture)
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based)
- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE)
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model)

### Distillation Method

- **Technique**: Model2Vec static embedding generation
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- **Training Data**: CodeSearchNet comment-code pairs
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go

---

*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.*

*For questions about methodology or results, please refer to the CodeSearchNet documentation.*