
# Code-Specialized Model2Vec Distillation Analysis

## 🎯 Executive Summary

This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.

### Evaluated Models Overview

- Simplified Distillation Models: 14
- Peer Comparison Models: 19
- Total Models Analyzed: 33

**Best Performing Simplified Model:** `code_model2vec_all_mpnet_base_v2`

Overall CodeSearchNet performance (a usage sketch follows this list):

- NDCG@10: 0.7387
- Mean Reciprocal Rank (MRR): 0.7010
- Recall@5: 0.8017
- Mean Rank: 6.4
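For context, the sketch below shows how such a distilled model can be loaded and queried with the model2vec library. The local path `code_model2vec_all_mpnet_base_v2` and the toy corpus are illustrative assumptions, not artifacts shipped with this report.

```python
# Minimal usage sketch (assumes model2vec is installed and the distilled
# model directory "code_model2vec_all_mpnet_base_v2" exists locally).
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")

query = "read a file and return its lines"
corpus = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

# Encode the natural-language query and the code corpus into 256-dim static embeddings.
q_emb = model.encode([query])
c_emb = model.encode(corpus)

# Rank corpus entries by cosine similarity and print the best match.
q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
print(corpus[int(np.argmax(q @ c.T))])
```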

## 📊 Comprehensive Model Comparison

### All Simplified Distillation Models Performance

| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | sentence-transformers/all-mpnet-base-v2 | 0.7387 | 0.7010 | 0.8017 | 🥇 Best |
| code_model2vec_all_MiniLM_L6_v2 | sentence-transformers/all-MiniLM-L6-v2 | 0.7385 | 0.7049 | 0.7910 | 🥈 2nd |
| code_model2vec_jina_embeddings_v2_base_code | jina-embeddings-v2-base-code | 0.7381 | 0.6996 | 0.8130 | 🥉 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | sentence-transformers/paraphrase-MiniLM-L6-v2 | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | lightonai/Reason-ModernColBERT | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | sentence-transformers/all-mpnet-base-v2 | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | BAAI/bge-m3 | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | jinaai/jina-embeddings-v3 | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | nomic-ai/nomic-embed-text-v2-moe | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | Qodo/Qodo-Embed-1-1.5B | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | microsoft/graphcodebert-base | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | Linq-AI-Research/Linq-Embed-Mistral | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | microsoft/codebert-base | 0.2779 | 0.2534 | 0.3136 | #14 |

## 📊 Model Specifications Analysis

Our distilled models share a fixed 256-dimensional embedding space; vocabulary size, and with it parameter count and disk footprint, is inherited from each teacher's tokenizer:

| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|-----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |

*Figure: model specifications chart summarizing vocabulary size, parameter count, embedding dimensions, and storage requirements for each distilled model.*

Key Insights from Model Specifications:

- Vocabulary Range: vocabulary sizes span 29,525 to 249,999 tokens (avg: 101,594), inherited from each teacher's tokenizer
- Parameter Efficiency: models range from 7.6M to 64.0M parameters (avg: 26.0M)
- Storage Efficiency: disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- Embedding Dimensions: a consistent 256 dimensions across all models, optimized for efficiency (the arithmetic behind these figures is checked in the sketch below)
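The parameter and disk figures in the table are consistent with a simple rule: one 256-dimensional vector per vocabulary token, stored (we assume) as float16 and reported in binary megabytes. A quick check:

```python
# Arithmetic check of the specification table (assumptions: float16 storage,
# sizes reported in binary megabytes).
vocab_size, embed_dim = 29_528, 256       # all_mpnet_base_v2 row

params = vocab_size * embed_dim           # one embedding vector per token
print(f"{params / 1e6:.1f}M parameters")  # -> 7.6M, matching the table

disk_mib = params * 2 / 2**20             # 2 bytes per float16 value
print(f"{disk_mib:.1f}MB on disk")        # -> 14.4MB, matching the table
```

The same rule reproduces the largest row: 249,999 tokens × 256 dims gives 64.0M parameters and 122.1MB.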

### Key Findings

- Best Teacher Model: sentence-transformers/all-mpnet-base-v2, distilled as code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- Least Effective Teacher: microsoft/codebert-base, distilled as code_model2vec_codebert_base (NDCG@10: 0.2779)
- Performance Range: a 62.4% relative difference between best and worst
- Average Performance: 0.5248 NDCG@10

## 🎯 Language Performance Radar Charts

### Best Model vs Peer Models Comparison

*Figure: comparative radar chart showing how the best simplified distillation model performs against top peer models across programming languages.*

### Individual Model Performance by Language

Per-model radar charts (figures omitted here) break down retrieval performance by programming language:

- **code_model2vec_all_mpnet_base_v2** (teacher: sentence-transformers/all-mpnet-base-v2): NDCG@10 0.7387
- **code_model2vec_all_MiniLM_L6_v2** (teacher: sentence-transformers/all-MiniLM-L6-v2): NDCG@10 0.7385
- **code_model2vec_jina_embeddings_v2_base_code** (teacher: jina-embeddings-v2-base-code): NDCG@10 0.7381
- **code_model2vec_paraphrase_MiniLM_L6_v2** (teacher: sentence-transformers/paraphrase-MiniLM-L6-v2): NDCG@10 0.7013
- **code_model2vec_Reason_ModernColBERT** (teacher: lightonai/Reason-ModernColBERT): NDCG@10 0.6598
- **code_model2vec_all_mpnet_base_v2_fine_tuned** (teacher: sentence-transformers/all-mpnet-base-v2): NDCG@10 0.6147
- **code_model2vec_bge_m3** (teacher: BAAI/bge-m3): NDCG@10 0.4863
- **code_model2vec_jina_embeddings_v3** (teacher: jinaai/jina-embeddings-v3): NDCG@10 0.4755
- **code_model2vec_nomic_embed_text_v2_moe** (teacher: nomic-ai/nomic-embed-text-v2-moe): NDCG@10 0.4532
- **code_model2vec_gte_Qwen2_1.5B_instruct** (teacher: Alibaba-NLP/gte-Qwen2-1.5B-instruct): NDCG@10 0.4238
- **code_model2vec_Qodo_Embed_1_1.5B** (teacher: Qodo/Qodo-Embed-1-1.5B): NDCG@10 0.4101
- **code_model2vec_graphcodebert_base** (teacher: microsoft/graphcodebert-base): NDCG@10 0.3420
- **code_model2vec_Linq_Embed_Mistral** (teacher: Linq-AI-Research/Linq-Embed-Mistral): NDCG@10 0.2868
- **code_model2vec_codebert_base** (teacher: microsoft/codebert-base): NDCG@10 0.2779

πŸ† Peer Model Comparison

*Figure: peer comparison chart, built from actual evaluation results against established code-specialized embedding models.*

### Complete Model Ranking

| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | 🔥 Simplified Distillation | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | 🔥 Simplified Distillation | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | 🔥 Simplified Distillation | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | 🔥 Simplified Distillation | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | 🔥 Simplified Distillation | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | 🎓 Fine-tuned Distillation | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | 🔥 Simplified Distillation | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | 🔥 Simplified Distillation | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | 🔥 Simplified Distillation | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | 🔥 Simplified Distillation | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | 🔥 Simplified Distillation | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | 🔥 Simplified Distillation | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | 🔥 Simplified Distillation | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | 🔥 Simplified Distillation | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |

## 📈 Performance Analysis

### Multi-Model Comparison Charts

*Figure: model comparison chart, a comprehensive comparison across all evaluation metrics.*

### Language Performance Analysis

*Figure: language heatmap showing how different models perform across programming languages.*

### Efficiency Analysis

*Figure: performance vs. model size, showing the efficiency benefits of distillation.*

## ⚡ Operational Performance Analysis

### Benchmark Performance

*Figure: comprehensive performance benchmarking across multiple operational metrics.*

### Performance Scaling Analysis

*Figure: batch size scaling, showing how throughput varies with batch size.*

*Figure: memory scaling, showing memory usage patterns across different batch sizes.*
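As a rough illustration of how such a scaling benchmark can be produced, the sketch below times `encode` at several batch sizes. The model path, synthetic workload, and batch sizes are assumptions for illustration, and we assume `encode` accepts a `batch_size` keyword as in recent model2vec releases.

```python
# Hedged benchmarking sketch: throughput vs. batch size for a static model.
import time
from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")  # assumed local path
docs = ["def scale(x, factor=2):\n    return x * factor"] * 4096         # synthetic workload

for batch_size in (32, 128, 512, 2048):
    start = time.perf_counter()
    model.encode(docs, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d}  {len(docs) / elapsed:8.0f} docs/s")
```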

πŸ” Language-Specific Analysis

### Performance by Programming Language

| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|---------------------|---------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |

## 🎯 Conclusions and Recommendations

### Teacher Model Analysis

Based on the evaluation results across all simplified distillation models:

1. Best Teacher Model: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 effectively tied at 0.7385
2. Least Effective Teacher: microsoft/codebert-base (NDCG@10: 0.2779)
3. Teacher Model Impact: the choice of teacher accounts for a 62.4% relative spread in NDCG@10

### Recommendations

- For Production: use sentence-transformers/all-mpnet-base-v2 as the teacher for the best NDCG@10; sentence-transformers/all-MiniLM-L6-v2 is a near-tie with slightly higher MRR
- For Efficiency: Model2Vec distillation provides a significant size reduction with competitive performance
- For Code Tasks: contrary to expectation, general-purpose teachers consistently outperform code-specific teachers such as microsoft/codebert-base in this evaluation

## 📄 Methodology

### Evaluation Protocol

- Dataset: CodeSearchNet test sets for 6 programming languages
- Metrics: NDCG@k, MRR, Recall@k, following the CodeSearchNet methodology (a computation sketch follows this list)
- Query Format: natural-language documentation strings
- Corpus Format: function code strings
- Evaluation: retrieval of the correct code for each documentation query
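Since each query has exactly one relevant result in this setup, the metrics reduce to simple formulas over the rank of the correct snippet. The sketch below shows one way to compute them; the rank values are illustrative only.

```python
# Metric sketch: each query has one correct code snippet; `ranks` holds the
# 1-based rank at which it was retrieved (illustrative values only).
import math

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=5):
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k=10):
    # With a single relevant item the ideal DCG is 1, so NDCG is just the
    # discounted gain of the correct hit (0 if it falls outside the top k).
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)

ranks = [1, 3, 2, 12, 1]
print(f"MRR={mrr(ranks):.4f}  Recall@5={recall_at_k(ranks):.4f}  NDCG@10={ndcg_at_k(ranks):.4f}")
```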

### Teacher Models Tested

- sentence-transformers/all-mpnet-base-v2 (used for both the base and fine-tuned variants)
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/paraphrase-MiniLM-L6-v2
- jina-embeddings-v2-base-code
- jinaai/jina-embeddings-v3
- lightonai/Reason-ModernColBERT
- BAAI/bge-m3
- nomic-ai/nomic-embed-text-v2-moe
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Qodo/Qodo-Embed-1-1.5B
- microsoft/graphcodebert-base
- microsoft/codebert-base
- Linq-AI-Research/Linq-Embed-Mistral

### Distillation Method

- Technique: Model2Vec static embedding generation (see the sketch after this list)
- Parameters: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- Training Data: CodeSearchNet comment-code pairs
- Languages: Python, JavaScript, Java, PHP, Ruby, Go
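A minimal sketch of this distillation step with the model2vec library is shown below. The exact keyword names for the SIF/Zipf weighting options vary between model2vec versions, so treat the call as illustrative rather than the project's exact invocation.

```python
# Illustrative Model2Vec distillation call (keyword names for the SIF/Zipf
# weighting options differ across model2vec versions; pca_dims matches the
# report's parameters).
from model2vec.distill import distill

m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # best-performing teacher
    pca_dims=256,
)
m2v_model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```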

*Report generated on 2025-06-01 08:04:06 by an automated analysis pipeline. For questions about methodology or results, please refer to the CodeSearchNet documentation.*