
# Code-Specialized Model2Vec Distillation Analysis

## 🎯 Executive Summary

This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.

### Evaluated Models Overview

- Simplified Distillation Models: 14
- Peer Comparison Models: 19
- Total Models Analyzed: 33

**Best Performing Simplified Model:** `code_model2vec_all_mpnet_base_v2`

Overall CodeSearchNet performance (a usage sketch follows this list):

- NDCG@10: 0.7387
- Mean Reciprocal Rank (MRR): 0.7010
- Recall@5: 0.8017
- Mean Rank: 6.4
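For context, the sketch below shows how such a distilled model can be loaded and queried with the model2vec library. The local path `code_model2vec_all_mpnet_base_v2` and the toy corpus are illustrative assumptions, not artifacts shipped with this report.

```python
# Minimal usage sketch (assumes model2vec is installed and the distilled
# model directory "code_model2vec_all_mpnet_base_v2" exists locally).
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")

query = "read a file and return its lines"
corpus = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

# Encode the natural-language query and the code corpus into 256-dim static embeddings.
q_emb = model.encode([query])
c_emb = model.encode(corpus)

# Rank corpus entries by cosine similarity and print the best match.
q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
print(corpus[int(np.argmax(q @ c.T))])
```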

## 📊 Comprehensive Model Comparison

### All Simplified Distillation Models Performance

| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | sentence-transformers/all-mpnet-base-v2 | 0.7387 | 0.7010 | 0.8017 | 🥇 Best |
| code_model2vec_all_MiniLM_L6_v2 | sentence-transformers/all-MiniLM-L6-v2 | 0.7385 | 0.7049 | 0.7910 | 🥈 2nd |
| code_model2vec_jina_embeddings_v2_base_code | jina-embeddings-v2-base-code | 0.7381 | 0.6996 | 0.8130 | 🥉 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | sentence-transformers/paraphrase-MiniLM-L6-v2 | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | lightonai/Reason-ModernColBERT | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | sentence-transformers/all-mpnet-base-v2 | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | BAAI/bge-m3 | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | jinaai/jina-embeddings-v3 | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | nomic-ai/nomic-embed-text-v2-moe | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | Qodo/Qodo-Embed-1-1.5B | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | microsoft/graphcodebert-base | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | Linq-AI-Research/Linq-Embed-Mistral | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | microsoft/codebert-base | 0.2779 | 0.2534 | 0.3136 | #14 |

## 📊 Model Specifications Analysis

Our distilled models share a fixed 256-dimensional embedding space; vocabulary size, and with it parameter count and disk footprint, is inherited from each teacher's tokenizer:

| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|-----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |

*Figure: model specifications chart summarizing vocabulary size, parameter count, embedding dimensions, and storage requirements for each distilled model.*

Key Insights from Model Specifications:

- Vocabulary Range: vocabulary sizes span 29,525 to 249,999 tokens (avg: 101,594), inherited from each teacher's tokenizer
- Parameter Efficiency: models range from 7.6M to 64.0M parameters (avg: 26.0M)
- Storage Efficiency: disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- Embedding Dimensions: a consistent 256 dimensions across all models, optimized for efficiency (the arithmetic behind these figures is checked in the sketch below)
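The parameter and disk figures in the table are consistent with a simple rule: one 256-dimensional vector per vocabulary token, stored (we assume) as float16 and reported in binary megabytes. A quick check:

```python
# Arithmetic check of the specification table (assumptions: float16 storage,
# sizes reported in binary megabytes).
vocab_size, embed_dim = 29_528, 256       # all_mpnet_base_v2 row

params = vocab_size * embed_dim           # one embedding vector per token
print(f"{params / 1e6:.1f}M parameters")  # -> 7.6M, matching the table

disk_mib = params * 2 / 2**20             # 2 bytes per float16 value
print(f"{disk_mib:.1f}MB on disk")        # -> 14.4MB, matching the table
```

The same rule reproduces the largest row: 249,999 tokens × 256 dims gives 64.0M parameters and 122.1MB.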

### Key Findings

- Best Teacher Model: sentence-transformers/all-mpnet-base-v2, distilled as code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- Least Effective Teacher: microsoft/codebert-base, distilled as code_model2vec_codebert_base (NDCG@10: 0.2779)
- Performance Range: a 62.4% relative difference between best and worst
- Average Performance: 0.5248 NDCG@10

## 🎯 Language Performance Radar Charts

### Best Model vs Peer Models Comparison

*Figure: comparative radar chart showing how the best simplified distillation model performs against top peer models across programming languages.*

### Individual Model Performance by Language

Per-model radar charts (figures omitted here) break down retrieval performance by programming language:

- **code_model2vec_all_mpnet_base_v2** (teacher: sentence-transformers/all-mpnet-base-v2): NDCG@10 0.7387
- **code_model2vec_all_MiniLM_L6_v2** (teacher: sentence-transformers/all-MiniLM-L6-v2): NDCG@10 0.7385
- **code_model2vec_jina_embeddings_v2_base_code** (teacher: jina-embeddings-v2-base-code): NDCG@10 0.7381
- **code_model2vec_paraphrase_MiniLM_L6_v2** (teacher: sentence-transformers/paraphrase-MiniLM-L6-v2): NDCG@10 0.7013
- **code_model2vec_Reason_ModernColBERT** (teacher: lightonai/Reason-ModernColBERT): NDCG@10 0.6598
- **code_model2vec_all_mpnet_base_v2_fine_tuned** (teacher: sentence-transformers/all-mpnet-base-v2): NDCG@10 0.6147
- **code_model2vec_bge_m3** (teacher: BAAI/bge-m3): NDCG@10 0.4863
- **code_model2vec_jina_embeddings_v3** (teacher: jinaai/jina-embeddings-v3): NDCG@10 0.4755
- **code_model2vec_nomic_embed_text_v2_moe** (teacher: nomic-ai/nomic-embed-text-v2-moe): NDCG@10 0.4532
- **code_model2vec_gte_Qwen2_1.5B_instruct** (teacher: Alibaba-NLP/gte-Qwen2-1.5B-instruct): NDCG@10 0.4238
- **code_model2vec_Qodo_Embed_1_1.5B** (teacher: Qodo/Qodo-Embed-1-1.5B): NDCG@10 0.4101
- **code_model2vec_graphcodebert_base** (teacher: microsoft/graphcodebert-base): NDCG@10 0.3420
- **code_model2vec_Linq_Embed_Mistral** (teacher: Linq-AI-Research/Linq-Embed-Mistral): NDCG@10 0.2868
- **code_model2vec_codebert_base** (teacher: microsoft/codebert-base): NDCG@10 0.2779

πŸ† Peer Model Comparison

*Figure: peer comparison chart, built from actual evaluation results against established code-specialized embedding models.*

### Complete Model Ranking

| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | 🔥 Simplified Distillation | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | 🔥 Simplified Distillation | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | 🔥 Simplified Distillation | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | 🔥 Simplified Distillation | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | 🔥 Simplified Distillation | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | 🎓 Fine-tuned Distillation | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | 🔥 Simplified Distillation | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | 🔥 Simplified Distillation | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | 🔥 Simplified Distillation | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | 🔥 Simplified Distillation | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | 🔥 Simplified Distillation | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | 🔥 Simplified Distillation | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | 🔥 Simplified Distillation | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | 🔥 Simplified Distillation | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |

## 📈 Performance Analysis

### Multi-Model Comparison Charts

*Figure: model comparison chart, a comprehensive comparison across all evaluation metrics.*

### Language Performance Analysis

*Figure: language heatmap showing how different models perform across programming languages.*

### Efficiency Analysis

*Figure: performance vs. model size, showing the efficiency benefits of distillation.*

## ⚡ Operational Performance Analysis

### Benchmark Performance

*Figure: comprehensive performance benchmarking across multiple operational metrics.*

### Performance Scaling Analysis

*Figure: batch size scaling, showing how throughput varies with batch size.*

*Figure: memory scaling, showing memory usage patterns across different batch sizes.*
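As a rough illustration of how such a scaling benchmark can be produced, the sketch below times `encode` at several batch sizes. The model path, synthetic workload, and batch sizes are assumptions for illustration, and we assume `encode` accepts a `batch_size` keyword as in recent model2vec releases.

```python
# Hedged benchmarking sketch: throughput vs. batch size for a static model.
import time
from model2vec import StaticModel

model = StaticModel.from_pretrained("code_model2vec_all_mpnet_base_v2")  # assumed local path
docs = ["def scale(x, factor=2):\n    return x * factor"] * 4096         # synthetic workload

for batch_size in (32, 128, 512, 2048):
    start = time.perf_counter()
    model.encode(docs, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d}  {len(docs) / elapsed:8.0f} docs/s")
```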

πŸ” Language-Specific Analysis

### Performance by Programming Language

| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|---------------------|---------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |

## 🎯 Conclusions and Recommendations

### Teacher Model Analysis

Based on the evaluation results across all simplified distillation models:

1. Best Teacher Model: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387), with sentence-transformers/all-MiniLM-L6-v2 effectively tied at 0.7385
2. Least Effective Teacher: microsoft/codebert-base (NDCG@10: 0.2779)
3. Teacher Model Impact: the choice of teacher accounts for a 62.4% relative spread in NDCG@10

### Recommendations

- For Production: use sentence-transformers/all-mpnet-base-v2 as the teacher for the best NDCG@10; sentence-transformers/all-MiniLM-L6-v2 is a near-tie with slightly higher MRR
- For Efficiency: Model2Vec distillation provides a significant size reduction with competitive performance
- For Code Tasks: contrary to expectation, general-purpose teachers consistently outperform code-specific teachers such as microsoft/codebert-base in this evaluation

## 📄 Methodology

### Evaluation Protocol

- Dataset: CodeSearchNet test sets for 6 programming languages
- Metrics: NDCG@k, MRR, Recall@k, following the CodeSearchNet methodology (a computation sketch follows this list)
- Query Format: natural-language documentation strings
- Corpus Format: function code strings
- Evaluation: retrieval of the correct code for each documentation query
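Since each query has exactly one relevant result in this setup, the metrics reduce to simple formulas over the rank of the correct snippet. The sketch below shows one way to compute them; the rank values are illustrative only.

```python
# Metric sketch: each query has one correct code snippet; `ranks` holds the
# 1-based rank at which it was retrieved (illustrative values only).
import math

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=5):
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k=10):
    # With a single relevant item the ideal DCG is 1, so NDCG is just the
    # discounted gain of the correct hit (0 if it falls outside the top k).
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)

ranks = [1, 3, 2, 12, 1]
print(f"MRR={mrr(ranks):.4f}  Recall@5={recall_at_k(ranks):.4f}  NDCG@10={ndcg_at_k(ranks):.4f}")
```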

### Teacher Models Tested

- sentence-transformers/all-mpnet-base-v2 (used for both the base and fine-tuned variants)
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/paraphrase-MiniLM-L6-v2
- jina-embeddings-v2-base-code
- jinaai/jina-embeddings-v3
- lightonai/Reason-ModernColBERT
- BAAI/bge-m3
- nomic-ai/nomic-embed-text-v2-moe
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Qodo/Qodo-Embed-1-1.5B
- microsoft/graphcodebert-base
- microsoft/codebert-base
- Linq-AI-Research/Linq-Embed-Mistral

### Distillation Method

- Technique: Model2Vec static embedding generation (see the sketch after this list)
- Parameters: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- Training Data: CodeSearchNet comment-code pairs
- Languages: Python, JavaScript, Java, PHP, Ruby, Go
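A minimal sketch of this distillation step with the model2vec library is shown below. The exact keyword names for the SIF/Zipf weighting options vary between model2vec versions, so treat the call as illustrative rather than the project's exact invocation.

```python
# Illustrative Model2Vec distillation call (keyword names for the SIF/Zipf
# weighting options differ across model2vec versions; pca_dims matches the
# report's parameters).
from model2vec.distill import distill

m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",  # best-performing teacher
    pca_dims=256,
)
m2v_model.save_pretrained("code_model2vec_all_mpnet_base_v2")
```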

*Report generated on 2025-06-01 08:04:06 by an automated analysis pipeline. For questions about methodology or results, please refer to the CodeSearchNet documentation.*