<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Memorization or Generation of Big Code Models Leaderboard</title>
  <link rel="stylesheet" href="style.css">
  <script src="echarts.min.js"></script>
</head>
<body>
<section class="section_title">
  <h1> ⭐ <span style="color: rgb(223, 194, 25);">Memorization</span> or <span style="color: rgb(223, 194, 25);">Generation</span> of Big <span style="color: rgb(223, 194, 25);">Code</span> Models <span style="color: rgb(223, 194, 25);">Leaderboard</span> </h1>
  <div class="section_title__imgs">
    <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank"> <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white"> </a>
    <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank"> <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge"> </a>
  </div>
  <div class="section_title__p">
    <p> Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>, we compare the performance of base code generation models on the <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks. We also report each model's Memorization-Generalization Index (MGI). The comparison covers both open-source and closed-source pre-trained code LLMs that can serve as base models for further training. </p>
  </div>
</section>
<section class="section_button">
  <button id="btn_evalTable">🔍 Evaluation Table</button>
  <button id="btn_plot">📊 Performance Plot</button>
  <button id="btn_about">📝 About</button>
  <button id="btn_submit">🚀 Submit results</button>
  <button id="btn_more">🤗 More Leaderboards</button>
</section>
<section class="section_evalTable" id="sec_evalTable">
  <div class="section_evalTable__table">
    <table id="evalTable">
      <colgroup>
        <col style="width: 25%"> <col style="width: 15%"> <col style="width: 15%"> <col style="width: 15%"> <col style="width: 15%"> <col style="width: 15%">
      </colgroup>
      <thead>
        <tr>
          <!-- <th rowspan="2">Benchmark</th> -->
          <th rowspan="2" id="th_model">Model <button class="button_sort" data-direction="desc" data-type="name"></button> </th>
          <th data-direction="desc" rowspan="2" data-type="MGI">MGI <button class="button_sort" data-direction="desc" data-type="MGI"></button> </th>
          <th colspan="2">Pass@1 (temp=0)</th>
          <th colspan="2">Pass@1 (temp=0.8)</th>
        </tr>
        <tr>
          <th>HumanEval <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button> </th>
          <th>HumanEval-ET <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button> </th>
          <th>HumanEval <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button> </th>
          <th>HumanEval-ET <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button> </th>
        </tr>
      </thead>
      <tbody> </tbody>
    </table>
    <script src="table.js"></script>
  </div>
  <div class="section_evalTable__notes">
    <p><strong>Notes</strong></p>
    <ul>
      <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper.
      A higher MGI indicates a greater propensity for a model to memorize rather than generalize.</li>
      <li>For more details, see the 📝 About section.</li>
    </ul>
  </div>
</section>
<section class="section_plot" id="sec_plot">
  <div style="display: flex;">
    <div class="section_plot__div" id="sec_plot__div1">
      <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
        <button id="btn_temp0_HumanEval"></button> <span id="span_temp0_HumanEval">Pass@1 (temp = 0)</span>
        <button id="btn_temp0_8_HumanEval"></button> <span id="span_temp0_8_HumanEval">Pass@1 (temp = 0.8)</span>
      </div>
      <div id="sec_plot__chart1" style="width:716.5px; height:550px;"></div>
    </div>
    <div class="section_plot__div" id="sec_plot__div2">
      <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
        <button id="btn_temp0_HumanEval_ET"></button> <span id="span_temp0_HumanEval_ET">Pass@1 (temp = 0)</span>
        <button id="btn_temp0_8_HumanEval_ET"></button> <span id="span_temp0_8_HumanEval_ET">Pass@1 (temp = 0.8)</span>
      </div>
      <div id="sec_plot__chart2" style="width:716.5px; height:550px;"></div>
    </div>
  </div>
  <script src="chart.js"></script>
</section>
<section class="section_about" id="sec_about">
  <h3>Benchmarking and Prompts</h3>
  <!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to reliably benchmark their capabilities. Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>, we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
  <ul>
    <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>: Measures the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.</li>
    <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>: An extended version of the HumanEval benchmark, in which each task includes more than 100 test cases.</li>
  </ul>
  <p> For all models (except the StarCoder family), we used the original benchmark prompts from HumanEval and added a `&lt;bos&gt;` token before the provided prompt. The maximum generation length was set to the length of the original prompt plus 300 tokens. </p>
  <p> For the StarCoder family models (such as <a href="https://huggingface.co/bigcode/starcoder2-7b" target="_blank">StarCoder2-7B</a> and <a href="https://huggingface.co/bigcode/starcoder2-15b" target="_blank">StarCoder2-15B</a>), we used the official bigcode-evaluation-harness for generation. More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank">here</a>. </p>
  <h3>Evaluation Parameters</h3>
  <p> For all models, we drew 1 sample at temperature 0 and 50 samples at temperature 0.8 for the subsequent result calculations.
  The parameters are set as follows (a minimal generation sketch using these settings appears at the end of this page): </p>
  <ul>
    <li>top-p=1.0 (default parameter in the transformers library)</li>
    <li>top-k=50 (default parameter in the transformers library)</li>
    <li>max_length_generation=len(prompt)+300</li>
    <li>temperature=0 or temperature=0.8</li>
    <li>n_samples=1 (temperature=0) or n_samples=50 (temperature=0.8)</li>
  </ul>
  <h3>Performance Metrics</h3>
  <ul>
    <li>pass@k: The probability that the model solves the test problem at least once out of `k` sampled attempts (see the estimator sketch at the end of this page).</li>
    <li>MGI: The average peakedness of the edit-distance distribution constructed from the model's samples.</li>
  </ul>
</section>
<section class="section_submit" id="sec_submit">
  <h2>How to submit models/results to the leaderboard?</h2>
  <div>
    <p>We welcome the community to submit evaluation results of new models. These results will be added as non-verified; however, authors are required to upload their generations so that other members can check them. </p>
    <p> To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository: </p>
    <ul>
      <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
      <li>Put the generation outputs of your model in it.</li>
    </ul>
    <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those of the model you evaluated.</p>
  </div>
</section>
<section class="section_more" id="sec_more">
  <h2>Context</h2>
  <p>In addition to this leaderboard, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as: </p>
  <ul>
    <li><a href="https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard" target="_blank">Big Code Models Leaderboard</a></li>
    <li><a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">Chatbot Arena Leaderboard</a></li>
    <li><a href="https://fudanselab-classeval.github.io/leaderboard.html" target="_blank">ClassEval</a></li>
    <li><a href="https://bigcode-bench.github.io" target="_blank">BigCodeBench</a></li>
    <li><a href="https://github.com/amazon-science/cceval" target="_blank">CrossCodeEval</a></li>
    <li><a href="https://crux-eval.github.io/leaderboard.html" target="_blank">CRUXEval</a></li>
    <li><a href="https://evalplus.github.io/leaderboard.html" target="_blank">EvalPlus Leaderboard</a></li>
    <li><a href="https://evo-eval.github.io" target="_blank">Evo-Eval</a></li>
    <li><a href="https://github.com/01-ai/HumanEval.jl" target="_blank">HumanEval.jl - Julia version of HumanEval with EvalPlus test cases</a></li>
    <li><a href="https://infi-coder.github.io/infibench/" target="_blank">InfiBench</a></li>
    <li><a href="https://livecodebench.github.io/leaderboard.html" target="_blank">LiveCodeBench</a></li>
    <li><a href="https://github.com/THUDM/NaturalCodeBench" target="_blank">NaturalCodeBench</a></li>
    <li><a href="https://www.swebench.com" target="_blank">SWE-bench</a></li>
    <li><a href="https://leaderboard.tabbyml.com" target="_blank">TabbyML Leaderboard</a></li>
    <li><a href="https://github.com/Leolty/repobench" target="_blank">RepoBench</a></li>
    <li><a href="https://github.com/alphadl/OOP-eval" target="_blank">OOP</a></li>
  </ul>
</section>
<footer> </footer>
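<section class="section_about">
  <h3>Appendix: Generation Settings Sketch</h3>
  <p>
    The snippet below is only an illustrative sketch of the sampling setup listed under Evaluation Parameters, written against the 🤗 transformers `generate` API. The model name and prompt are placeholders, and the actual evaluation scripts (e.g. bigcode-evaluation-harness for the StarCoder family) may differ in detail.
  </p>
  <pre><code># Illustrative sketch only -- not the official evaluation script.
# Placeholders: model_name and prompt; parameters mirror the "Evaluation Parameters" list.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder2-7b"  # placeholder: any base code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'  # placeholder task
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[1]

# temperature=0.8 run: 50 samples with top_p=1.0 and top_k=50 (transformers defaults);
# the temperature=0 run corresponds to a single greedy completion (do_sample=False).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=1.0,
    top_k=50,
    max_length=prompt_len + 300,  # max_length_generation = len(prompt) + 300
    num_return_sequences=50,      # n_samples = 50
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
</code></pre>
</section>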
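<section class="section_about">
  <h3>Appendix: pass@k Estimator Sketch</h3>
  <p>
    The pass@k numbers reported here follow the definition given under Performance Metrics. A common way to compute pass@k from n samples per task is the unbiased estimator of Chen et al. (2021); the sketch below shows that standard estimator and is not necessarily the exact script used for this leaderboard.
  </p>
  <pre><code># Standard unbiased pass@k estimator (Chen et al., 2021); shown for illustration.
# n = samples drawn per task, c = samples that pass all tests, k = attempts allowed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts solves the task."""
    if n - c &lt; k:  # fewer than k incorrect samples: every size-k subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples at temperature 0.8, of which 12 pass all tests -> pass@1 = 0.24
print(pass_at_k(n=50, c=12, k=1))
</code></pre>
</section>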
<script src="button.js"></script> </body> </html>