CodeContests+: High-Quality Test Case Generation for Competitive Programming
Abstract
An LLM-based system generates high-quality test cases for competitive programming problems, enhancing the accuracy of model evaluation and RL performance.
Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Test case generation is therefore a necessary step in building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of the test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly a notably higher True Positive Rate (TPR). Subsequently, our experiments on LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL.
Introduction
CodeContests+ is a competitive programming problem dataset built upon CodeContests. It includes 11,690 competitive programming problems, along with corresponding high-quality test cases, test case generators, test case validators, output checkers, and more than 13 million correct and incorrect solutions.
Highlights
High-Quality Test Cases: We developed a Generator-Validator Agent System that can construct high-quality test cases for each problem. In addition to random test cases, it also generates special test cases tailored to the problem's characteristics and various corner cases, aiming to cover as many potential errors as possible. Furthermore, the correctness of these test cases is verified by an independent test case validator to ensure they comply with the problem constraints.
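The agent system itself is not reproduced here, but the generate-then-validate workflow described above can be pictured roughly as follows. This is a minimal, hypothetical sketch: the `./gen` and `./validator` binaries and the `run` helper are placeholders for the LLM-written programs and sandboxed execution the real system would use, not artifacts of the dataset.

```python
import subprocess
from typing import List

def run(cmd: List[str], stdin: str = "") -> subprocess.CompletedProcess:
    """Placeholder for sandboxed execution of a generator or validator binary."""
    return subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=10)

def build_test_cases(gen_commands: List[List[str]]) -> List[str]:
    """Generate-then-validate loop: only inputs accepted by the validator are kept."""
    accepted = []
    for cmd in gen_commands:                     # e.g. ["./gen", "--max-n", "--all-equal"]
        test_input = run(cmd).stdout
        verdict = run(["./validator"], stdin=test_input)
        if verdict.returncode == 0:              # validator accepted the input
            accepted.append(test_input)
        # otherwise the agent would revise the generator command and retry
    return accepted
```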
Test Case Generators: We provide a test case generator for each problem, along with the commands to run it. These commands can be run repeatedly to produce an effectively unlimited number of test cases. This gives users a clear picture of the characteristics of every test case and lets them use the generators to create as many additional test cases as they need.
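For illustration, the provided generator commands could be replayed like this to materialize extra test inputs. The file names (`gen`, `commands.txt`) and output layout are assumptions for the example, not the dataset's actual schema.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: a compiled generator binary plus one command line per test.
commands = Path("commands.txt").read_text().splitlines()   # e.g. "./gen --n 200000 --seed 7"

out_dir = Path("extra_tests")
out_dir.mkdir(exist_ok=True)

for i, cmd in enumerate(commands, start=1):
    result = subprocess.run(cmd.split(), capture_output=True, text=True, check=True)
    (out_dir / f"{i:03d}.in").write_text(result.stdout)     # one input file per command
```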
Flexible Number of Test Cases: We also provide pre-generated test cases in five versions: 1x, 2x, ..., 5x. The number of test cases increases with each version, as does the computational cost of running them. This allows users to strike a balance between computational cost and coverage according to their needs.
Test Case Validators: Competitive programming problems usually specify many constraints on the input data itself, including data ranges, format requirements, data structure requirements, and so on. Therefore, constructing fully valid test cases is not an easy task, and even professional problem setters can easily make mistakes. For each problem, we provide a test case validator that strictly checks whether the test case input satisfies all constraints outlined in the problem description, to ensure the validity of the test cases as much as possible.
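As a concrete illustration of what such a validator checks, here is a minimal sketch for an invented example problem whose input is n (1 ≤ n ≤ 2·10^5) on one line followed by n integers in [-10^9, 10^9] on the next; the constraints and format rules are assumptions for the example, not taken from any specific CodeContests+ problem.

```python
import sys

# Hypothetical constraints for an invented example problem.
MAX_N = 2 * 10**5
MAX_ABS = 10**9

def validate(raw: str) -> None:
    lines = raw.split("\n")
    # Strict format: exactly two content lines, terminated by a single newline.
    assert len(lines) == 3 and lines[2] == "", "expected two lines ending with a newline"

    n = int(lines[0])
    assert 1 <= n <= MAX_N, "n out of range"

    tokens = lines[1].split(" ")
    assert len(tokens) == n, "wrong number of values"
    for tok in tokens:
        value = int(tok)                          # rejects non-numeric tokens
        assert -MAX_ABS <= value <= MAX_ABS, "value out of range"
        assert tok == str(value), "non-canonical integer formatting (e.g. leading zeros)"

if __name__ == "__main__":
    validate(sys.stdin.read())
    print("valid")
```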
Output Checkers for Multiple Answer Problems: In programming competitions, problems with multiple valid solutions are very common. This means that the same input can correspond to several valid outputs. Therefore, correctness cannot be determined simply by comparing the program's output with a single, pre-defined correct answer. For this reason, we provide custom output checkers for all such problems to verify the correctness of the output.
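To make the role of a checker concrete, here is a minimal sketch for an invented multiple-answer problem: given n integers, output any (1-based) index of a maximum element. The three-file command-line convention is an assumption for the illustration rather than the dataset's actual checker interface.

```python
import sys

def main() -> None:
    # Assumed invocation: checker.py <input> <contestant_output> <reference_output>
    input_path, output_path, _reference_path = sys.argv[1:4]

    data = open(input_path).read().split()
    n, values = int(data[0]), list(map(int, data[1:]))
    assert len(values) == n

    # The contestant may print ANY index of a maximum element, so we verify the
    # required property instead of comparing against one fixed reference answer.
    idx = int(open(output_path).read().split()[0])
    if 1 <= idx <= n and values[idx - 1] == max(values):
        print("OK")
        sys.exit(0)
    print("WRONG ANSWER: index does not point to a maximum element")
    sys.exit(1)

if __name__ == "__main__":
    main()
```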
Rigorous Evaluation: To rigorously evaluate the quality of these test cases, we assessed their accuracy using a large number of solutions. For each problem, we used 100 correct solutions and 100 incorrect solutions to determine if the test cases could correctly distinguish between correct and incorrect submissions. We have recorded the evaluation results, including True Positive Rate (TPR) and True Negative Rate (TNR), in the dataset.
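These metrics can be recomputed directly from per-solution verdicts. The sketch below assumes the convention that a "positive" is a correct solution that passes every test case; the record format is an assumption for illustration, not the dataset's stored schema.

```python
from typing import Iterable, Tuple

def tpr_tnr(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """records: (is_correct_solution, passed_all_tests) pairs for one problem.

    TPR = fraction of correct solutions that pass all tests.
    TNR = fraction of incorrect solutions that fail at least one test.
    """
    tp = fn = tn = fp = 0
    for is_correct, passed in records:
        if is_correct:
            tp += passed
            fn += not passed
        else:
            fp += passed
            tn += not passed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr

# Example: 100 correct and 100 incorrect solutions for one problem.
print(tpr_tnr([(True, True)] * 97 + [(True, False)] * 3 +
              [(False, False)] * 92 + [(False, True)] * 8))   # -> (0.97, 0.92)
```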