Papers
arxiv:2603.16790

InCoder-32B: Code Foundation Model for Industrial Scenarios

Published on Mar 17 · Submitted by Jian Yang on Mar 18 · #2 Paper of the day
Abstract

InCoder-32B is a 32-billion-parameter code model trained on industrial datasets with extended context length and execution verification to improve performance in hardware-aware programming tasks.

AI-generated summary

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

Community

InCoder-32B: Code Foundation Model for Industrial Scenarios

InCoder-32B (Industrial-Coder-32B) is the first 32B-parameter code foundation model
purpose-built for industrial software engineering. While recent code LLMs have made
impressive strides on general programming tasks, they fall short in industrial scenarios
that demand reasoning about hardware semantics, specialized language constructs, and strict
resource constraints. InCoder-32B bridges this gap by unifying code intelligence across
five industrial domains: chip design, GPU kernel optimization, embedded systems,
compiler optimization, and 3D modeling — all within a single model.


Training Pipeline

InCoder-32B is trained through a three-stage Code-Flow pipeline:

  1. Pre-training & Annealing — curated industrial code data with automated verification
    and deep deduplication across license, PII, token, repo, and cross-source dimensions

  2. Mid-training — progressive context extension from 8K → 32K → 128K tokens, using
    synthetic industrial reasoning QA, agentic trajectories, and curated code artifacts

  3. Post-training — 2.5M execution-grounded SFT samples across hardware design, GPU
    kernels, embedded firmware, and systems programming, with feedback-driven repair
    trajectories
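The post-training stage's feedback-driven repair trajectories can be pictured as a generate–execute–repair loop: the model proposes code, a sandbox executes it, and failures are folded back into the prompt until the sample passes or a retry budget is exhausted. The sketch below is an illustrative reconstruction, not the paper's released pipeline; `generate`, `run_tests`, the prompt format, and the retry budget are all hypothetical stand-ins.

```python
# Illustrative sketch of an execution-grounded repair loop for building
# SFT data (hypothetical; the paper's actual pipeline is not public).

def execution_grounded_repair(task, generate, run_tests, max_rounds=3):
    """Generate code, execute it, and feed failures back for repair.

    `generate(prompt)` stands in for the model; `run_tests(code)` returns
    (passed: bool, feedback: str) from a sandboxed execution.
    """
    trajectory = []                      # (code, feedback) pairs kept for SFT
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        passed, feedback = run_tests(code)
        trajectory.append((code, feedback))
        if passed:                       # execution-verified sample: keep it
            return code, trajectory
        # Fold the execution feedback into the next prompt (repair step).
        prompt = (f"{task}\n\nPrevious attempt:\n{code}\n"
                  f"Execution feedback:\n{feedback}\nFix the code.")
    return None, trajectory              # discard tasks that never pass


# Toy usage with a stub "model" that fails once, then succeeds:
attempts = iter(["def add(a, b): return a - b",
                 "def add(a, b): return a + b"])

def generate(prompt):
    return next(attempts)

def run_tests(code):
    ns = {}
    exec(code, ns)                       # sandboxing omitted in this toy
    ok = ns["add"](2, 3) == 5
    return ok, "" if ok else "add(2, 3) returned -1, expected 5"

code, traj = execution_grounded_repair("Write add(a, b).", generate, run_tests)
print(code is not None, len(traj))       # True 2
```

Keeping the full failure-and-repair trajectory, not just the final passing sample, is what makes such data useful for teaching a model to act on execution feedback.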


Key Results

General Code Benchmarks

Benchmark            Score
SWE-bench Verified   74.8%
LiveCodeBench        49.14%
BFCL v3              60.99%

Industrial Code Benchmarks

  • 🏆 Best open-source results across all 9 industrial benchmarks
  • 🏆 Outperforms Claude Sonnet 4.6 on CAD-Coder IoU and KernelBench (L1/L2/L3)
  • 🏆 Strong chip design performance, leading all open-source models on RealBench
    module-level tasks by a wide margin

Resources

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/incoder-32b-code-foundation-model-for-industrial-scenarios-7434-0557c25f
Covers the executive summary, detailed methodology, and practical applications.


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1