Stable-DiffCoder

Pushing the Frontier of Code Diffusion Large Language Models
ByteDance Seed
Figure: Benchmark performance.

Introduction

We are thrilled to introduce Stable-DiffCoder, a robust code diffusion Large Language Model (LLM). Built directly on the Seed-Coder architecture, data, and training pipeline, it introduces a block diffusion continual pretraining (CPT) stage equipped with a tailored warmup strategy and a block-wise clipped noise schedule.

Notably, with only CPT followed by supervised fine-tuning (SFT), Stable-DiffCoder surpasses many strong ~8B autoregressive (AR) and diffusion-based code models. These results demonstrate that diffusion-based training can improve code modeling quality beyond what AR training alone achieves, even under tightly controlled data and architecture constraints.

1. Fair & Controlled Comparison: We maintain an architecture and data pipeline identical to the AR baseline, strictly isolating the benefits of diffusion-based training.
2. Block Diffusion CPT: Equipped with a custom warmup strategy and block-wise clipped noise scheduling for maximum stability.
3. Actionable Insights: We provide a systematic analysis of training dynamics, offering practical guidelines for training diffusion-based LLMs.
4. Strong Performance: Surpasses leading ~8B AR and diffusion code models using only CPT and SFT stages.

Methodology & Insights

1. Unlocking Data Augmentation in DLLMs

Traditional bidirectional training in Diffusion LLMs (DLLMs) often introduces noise that keeps the model from learning clear reasoning patterns. Our analysis on 2.5B-scale models reveals that effective DLLMs require:

  • Clean Evidence: clean, reliable evidence from which the model can deduce clear rules.
  • Alignment: consistency between the training and inference processes.

Our Solution: We initialize training from a pre-annealing AR checkpoint, which retains clean, malleable knowledge, and then run a short block diffusion stage that learns clear rules from this clean evidence while unlocking diffusion-style data augmentation.
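To make the block diffusion training objective concrete, here is a minimal sketch of how a token sequence might be corrupted block by block before the model is asked to denoise it. This is an illustration, not the paper's implementation: the block size, the `MASK_ID` sentinel, and uniform per-token masking within a block are all assumptions.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def block_diffusion_corrupt(tokens, block_size=4, noise=0.5, rng=None):
    """Corrupt a sequence block by block, block-diffusion style.

    Each token inside a block is independently replaced by MASK_ID with
    probability `noise`; blocks are laid out left to right, so at training
    time the model can condition causally on earlier blocks while
    denoising within the current block.  (In practice a separate noise
    level would typically be sampled per block.)
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for start in range(0, len(tokens), block_size):
        for i in range(start, min(start + block_size, len(tokens))):
            if rng.random() < noise:
                out[i] = MASK_ID
    return out
```

The denoising loss would then be computed only on the masked positions, with clean blocks serving as context.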

Figure 1a: Analysis on 2.5B models.
Figure 1b: The proposed training pipeline.

2. Ensuring Training Stability

We observed significant instability in gradient norms during the CPT of DLLMs. To address this and ensure efficient block diffusion training, we implemented two key designs:

Warmup for Stable CPT
By gradually increasing mask pattern difficulty and removing cross-entropy weighting during the warmup phase, we achieve a stable transition from AR to DLLM without needing sensitive hyperparameter tuning.
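The two warmup ingredients above can be sketched as follows. Both pieces are illustrative assumptions: a linear ramp on the maximum mask ratio stands in for "gradually increasing mask pattern difficulty", and the 1/t reweighting being dropped during warmup stands in for "removing cross-entropy weighting" (1/t is the standard masked-diffusion weighting, not necessarily the paper's exact form).

```python
def warmup_noise_cap(step, warmup_steps, t_init=0.1, t_final=1.0):
    """Linearly ramp the maximum allowed mask ratio from t_init to t_final,
    so early CPT steps see easy (lightly masked) patterns and difficulty
    grows toward full diffusion training."""
    if step >= warmup_steps:
        return t_final
    return t_init + (step / warmup_steps) * (t_final - t_init)


def ce_weight(t, in_warmup):
    """Per-token cross-entropy weight at noise level t.

    Standard masked-diffusion objectives reweight the loss by 1/t; during
    warmup the weighting is dropped (uniform weight 1) for stability."""
    return 1.0 if in_warmup else 1.0 / t
```

Once the ramp reaches `t_final` and the 1/t weighting is restored, training proceeds as ordinary block diffusion CPT.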
Block-wise Clipped Noise Scheduling
We tailored the noise schedule boundaries specifically for block diffusion, making learning within small blocks more efficient and stable.
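A minimal sketch of what clipped per-block noise sampling could look like: instead of drawing each block's mask ratio from the full [0, 1] range, it is drawn from a clipped interval. The specific bounds and the uniform distribution here are assumptions for illustration; the intuition is that very low noise yields almost no masked tokens to learn from inside a small block, while noise near 1 leaves no in-block context.

```python
import random


def sample_block_noise(num_blocks, t_min=0.3, t_max=0.9, rng=None):
    """Sample one mask ratio per block, uniform on the clipped interval
    [t_min, t_max] rather than the full [0, 1].  Clipping avoids the
    degenerate extremes that make learning within small blocks
    inefficient or unstable."""
    rng = rng or random.Random(0)
    return [t_min + (t_max - t_min) * rng.random() for _ in range(num_blocks)]
```

Each sampled level would then drive the masking of its corresponding block during CPT.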
Figure 2: Impact of warmup strategies on loss and gradient norm.

Performance

Stable-DiffCoder demonstrates robust capabilities across both Base and Instruct versions. It consistently outperforms the AR baseline and maintains a competitive edge against other state-of-the-art ~8B AR and DLLM code models.

Base Model Performance (I)
Base Model Performance (II)
Instruct Model Performance (I)
Instruct Model Performance (II)