<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Memorization or Generation of Big Code Model Leaderboard</title>
    <link rel="stylesheet" href="style.css">
    <script src="echarts.min.js"></script>
</head>

<body>

    <section class="section_title">
        <h1>
            ⭐ <span style="color: rgb(223, 194, 25);">Memorization</span> or 
            <span style="color: rgb(223, 194, 25);">Generation</span>
             of Big 
             <span style="color: rgb(223, 194, 25);">Code</span>
              Models 
              <span style="color: rgb(223, 194, 25);">Leaderboard</span>
        </h1>

        <div class="section_title__imgs">
            <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
                <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
            </a>
            <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
                <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
            </a>
        </div>

        <div class="section_title__p">
            <p>
                Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
                <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
                we compare the performance of base code generation models on the
                <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
                <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
                We also report the Memorization-Generalization Index (MGI) for each model.
                We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
            </p>
        </div>
    </section>

    <section class="section_button">
        <button id="btn_evalTable">🔍 Evalution Table</button>
        <button id="btn_plot">📊 Performance Plot</button>
        <button id="btn_about">📝 About</button>
        <button id="btn_submit">🚀 Submit results</button>
        <button id="btn_more">🤗 More Leaderboards</button>
    </section>

    <section class="section_evalTable" id="sec_evalTable">
        <div class="section_evalTable__table">
            <table id="evalTable">
                <colgroup>
                    <col style="width: 25%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                    <col style="width: 15%">
                </colgroup>

                <thead>
                    <tr>
                    <!-- <th rowspan="2">Benchmark</th> -->
                    <th rowspan="2" id="th_model">Model
                        <button class="button_sort" data-direction="desc" data-type="name"></button>
                    </th>
                    <th data-direction="desc" rowspan="2" data-type="MGI">MGI
                        <button class="button_sort" data-direction="desc" data-type="MGI"></button>
                    </th>
                    <th colspan="2">Pass@1(temp=0)</th>
                    <th colspan="2">Pass@1(temp=0.8)</th>
                    <tr>
                        <th>HumanEval
                            <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
                        </th>
                        <th>HumanEval-ET
                            <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
                        </th>
                        <th>HumanEval
                            <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
                        </th>
                        <th>HumanEval-ET
                            <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
                        </th>
                    </tr>  
                </thead>
    
                <tbody>
                    
                </tbody>
            </table>
            <script src="table.js"></script>
        </div>

        <div class="section_evalTable__notes">
            <p><strong>Notes</strong></p>
            <ul>
                <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper.&ensp;A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
                <li>For more details check the 📝 About section.</li>
            </ul>
        </div>
    </section>

    <section class="section_plot" id="sec_plot">
        <div style="display: flex;">
            <div class="section_plot__div" id="sec_plot__div1">
                <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
                    <button id="btn_temp0_HumanEval"></button>
                    <span id="span_temp0_HumanEval">Pass@1 (temp = 0)</span>
                    <button id="btn_temp0_8_HumanEval"></button>
                    <span id="span_temp0_8_HumanEval">Pass@1 (temp = 0.8)</span>
                </div>
                <div id="sec_plot__chart1" style="width:716.5px; height:550px;"></div>
            </div>
            
            <div class="section_plot__div" id="sec_plot__div2">
                <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
                    <button id="btn_temp0_HumanEval_ET"></button>
                    <span id="span_temp0_HumanEval_ET">Pass@1 (temp = 0)</span>
                    <button id="btn_temp0_8_HumanEval_ET"></button>
                    <span id="span_temp0_8_HumanEval_ET">Pass@1 (temp = 0.8)</span>
                </div>
                <div id="sec_plot__chart2" style="width:716.5px; height:550px;"></div>
            </div>
        </div>
        <script src="chart.js"></script>
    </section>


    <section class="section_about" id="sec_about">
        <h3>Benchmarking and Prompts</h3>
        <ul>
            <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>:&ensp;Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
            </li>
            <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>:&ensp;The extended version of HumanEval benchmark, where each task includes more than 100 test cases.
            </li>
        </ul>
        <p>
            For all models (except the StarCoder family), we used the original benchmark prompts from HumanEval and added a `&lt;bos&gt;` token before the provided prompt. 
            The maximum generation length was set to the length of the original prompt plus 300 tokens.
        </p>
        <p>
            For the StarCoder family models (such as <a href="https://huggingface.co/bigcode/starcoder2-7b" target="_blank">StarCoder2-7B</a> and <a href="https://huggingface.co/bigcode/starcoder2-15b" target="_blank">StarCoder2-15B</a>), 
            we used the official bigcode-evaluation-harness for generation. 
            More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank">here</a>.
        </p>
        <h3>Evaluation Parameters</h3>
        <p>
            For all models, we generated 1 sample at temperature 0 and 50 samples at temperature 0.8 
            for the subsequent result calculations. The generation parameters are set as follows (a rough Python sketch follows the list):
        </p>
        <ul>
            <li>top-p=1.0 (default parameter in the transformers library)</li>
            <li>top-k=50 (default parameter in the transformers library)</li>
            <li>max_length_generation=len(prompt)+300</li>
            <li>temperature=0 or temperature=0.8</li>
            <li>n_samples=50</li>
        </ul>
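        <p>
            As a rough illustration of these settings, the snippet below (a minimal sketch with the transformers library, not the authors' exact evaluation script; 
            the model name is only an example) prepends a BOS token to a HumanEval-style prompt, caps the generation length at the prompt length plus 300 tokens, 
            and decodes once greedily (temperature 0) and 50 times at temperature 0.8 with top-p=1.0 and top-k=50:
        </p>
        <pre><code>
# Minimal sketch of the stated generation settings; not the authors' exact script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # example base code model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# A HumanEval-style prompt: function signature plus docstring.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Prepend the BOS token if the tokenizer did not already add one.
if tokenizer.bos_token_id is not None and input_ids[0, 0] != tokenizer.bos_token_id:
    bos = torch.tensor([[tokenizer.bos_token_id]], device=model.device)
    input_ids = torch.cat([bos, input_ids], dim=1)

# Maximum generation length: original prompt length plus 300 tokens.
max_length = input_ids.shape[1] + 300

# temperature = 0: a single greedy sample per problem.
greedy = model.generate(input_ids, do_sample=False, max_length=max_length)

# temperature = 0.8: 50 samples per problem, with top_p=1.0 and top_k=50
# (the transformers defaults listed above).
samples = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=1.0,
    top_k=50,
    num_return_sequences=50,
    max_length=max_length,
)
        </code></pre>
        <p>
            For the StarCoder family, the generations instead come from the bigcode-evaluation-harness, as noted above.
        </p>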
        <h3>Performance Metrics</h3>
        <ul>
            <li>pass@k:&ensp;The probability that the model solves a test problem at least once within `k` sampled attempts (an estimator sketch follows this list).</li>
            <li>MGI:&ensp;The average peakedness of the edit distance distribution constructed from the model's samples (a rough illustrative sketch also follows this list).</li>
        </ul>
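        <p>
            For reference, pass@k is typically computed with the unbiased estimator of Chen et al. (2021): given n samples per problem, of which c pass all tests, 
            pass@k = 1 - C(n-c, k) / C(n, k); with k=1 this reduces to c/n. A small Python helper (a sketch, not necessarily the leaderboard's exact code):
        </p>
        <pre><code>
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed as a stable running product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total samples, c: samples passing all tests, k: attempt budget."""
    if n - c &lt; k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
        </code></pre>
        <p>
            The sketch below is only a rough, hypothetical illustration of the idea behind MGI; the precise construction of the edit distance distribution and its 
            peakedness is given in the CDD paper linked at the top of this page. Here, peakedness is approximated as the fraction of sampled generations whose edit 
            distance to the greedy generation falls at the distribution's mode, averaged over all problems:
        </p>
        <pre><code>
# Rough, illustrative approximation of MGI; see the CDD paper for the exact definition.
from collections import Counter

import editdistance  # pip install editdistance

def peakedness(greedy_output: str, sampled_outputs: list[str]) -> float:
    """Fraction of samples at the most frequent edit distance to the greedy output."""
    distances = [editdistance.eval(greedy_output, s) for s in sampled_outputs]
    most_common_count = Counter(distances).most_common(1)[0][1]
    return most_common_count / len(distances)

def mgi(per_problem_outputs) -> float:
    """per_problem_outputs: list of (greedy_output, sampled_outputs) pairs, one per problem."""
    return sum(peakedness(g, s) for g, s in per_problem_outputs) / len(per_problem_outputs)
        </code></pre>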
    </section>

    <section class="section_submit" id="sec_submit">
        <h2>How to submit models/results to the leaderboard?</h2>
        <div>
            <p>We welcome the community to submit evaluation results for new models. 
                These results will be marked as non-verified; however, the authors are required to upload their generations so that other members can verify them.
            </p>
            <p>
                To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the 
                <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
            </p>
            <ul>
                <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
                <li>Put the generation outputs of your model in it.</li>
            </ul>
            <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those of the model you evaluated.</p>
        </div>
    </section>

    <section class="section_more" id="sec_more">
        <h2>Context</h2>
        <p>In addition to the Memorization or Generation of Big Code Models Leaderboard, we recommend consulting a diverse 
            set of benchmarks and leaderboards to gain a comprehensive understanding of LLM coding ability, such as:
        </p>
        <ul>
            <li><a href="https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard" target="_blank">Big Code Models Leaderboard</a></li>
            <li><a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">Chatbot Arena Leaderboard</a></li>
            <li><a href="https://fudanselab-classeval.github.io/leaderboard.html" target="_blank">ClassEval</a></li>
            <li><a href="https://bigcode-bench.github.io" target="_blank">Code Lingua</a></li>
            <li><a href="https://github.com/amazon-science/cceval" target="_blank">CrossCodeEval</a></li>
            <li><a href="https://crux-eval.github.io/leaderboard.html" target="_blank">CRUXEval</a></li>
            <li><a href="https://evalplus.github.io/leaderboard.html" target="_blank">EvalPlus Leaderboard</a></li>
            <li><a href="https://evo-eval.github.io" target="_blank">Evo-Eval</a></li>
            <li><a href="https://github.com/01-ai/HumanEval.jl" target="_blank">HumanEval.jl - Julia version HumanEval with EvalPlus test cases</a></li>
            <li><a href="https://infi-coder.github.io/infibench/" target="_blank">InfiBench</a></li>
            <li><a href="https://livecodebench.github.io/leaderboard.html" target="_blank">LiveCodeBench</a></li>
            <li><a href="https://github.com/THUDM/NaturalCodeBench" target="_blank">NaturalCodeBench</a></li>
            <li><a href="https://www.swebench.com" target="_blank">SWE-bench</a></li>
            <li><a href="https://leaderboard.tabbyml.com" target="_blank">TabbyML Leaderboard</a></li>
            <li><a href="https://github.com/Leolty/repobench" target="_blank">RepoBench</a></li>
            <li><a href="https://github.com/alphadl/OOP-eval" target="_blank">OOP</a></li>
        </ul>
    </section>



    <footer>
    </footer>

    <script src="button.js"></script>
</body>

</html>