VERL Code Datasets

sungyub 's Collections

updated Nov 9

High-quality code generation datasets in VERL format: Python, competitive programming, and Verilog HDL for RL training

sungyub/code-verl-unified

Viewer • Updated Nov 6 • 927k • 9.27k • 1

Note Unified code reasoning dataset with 7 splits (958K+ examples): Python, competitive programming, and Verilog. Includes rstar-coder (386K), kodcode (435K), and 5 other datasets.
sungyub/kodcode-v1-verl

Viewer • Updated Nov 7 • 435k • 57

Note Largest dataset. High-quality Python from LeetCode, HumanEval, docs. Includes GPT-4 quality metrics (89.8% retention).
sungyub/rstar-coder-verl

Viewer • Updated Nov 7 • 345k • 72

Note Second largest. Microsoft rStar-Coder with test case-based evaluation. Synthetic large-scale dataset.
sungyub/acecode-87k-verl

Viewer • Updated Nov 7 • 87.1k • 69

Note TIGER-Lab AceCode. Uses pytest-style assertions for Sandbox Fusion compatibility.
sungyub/eurus-2-code-verl

Viewer • Updated Nov 7 • 25.1k • 56

Note Competitive programming: CodeContests, TACO, APPS, Codeforces. Schema unified with skywork format.
sungyub/skywork-or1-code-verl

Viewer • Updated Nov 7 • 14.1k • 55

Note Reference standard with model difficulty ratings. 80.4% cleaned of instruction prefixes.
sungyub/code-contests-plus-verl

Viewer • Updated Nov 7 • 6.54k • 34

Note ByteDance Code-Contests-Plus. Sandbox-validated test cases (72.1% success rate).
sungyub/codev-r1-verl

Viewer • Updated 30 days ago • 3.13k • 83

Note Verilog HDL for hardware design. Filtered version with 87.1% test pass rate.