Generating realistic sparse multi-category 3D voxel structures is difficult for two reasons: memory scales cubically with grid resolution, and sparsity causes severe class imbalance. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an autoregressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process. Our results highlight discrete diffusion as a promising framework for generative modeling of sparse 3D voxel structures.
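To make "treating voxels as tokens" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that only occupied voxels become tokens, that they are listed in raster order, and that block id 0 marks empty space; the names `AIR`, `make_toy_grid`, `to_tokens`, and `sparsity` are hypothetical. The toy grid is filled at random purely to reach a sparsity above 98%.

```python
import random

AIR = 0  # hypothetical id for empty (background) voxels


def make_toy_grid(size, n_blocks, n_types, seed=0):
    """Build a dense size^3 grid (nested lists) with n_blocks random non-air voxels."""
    rng = random.Random(seed)
    grid = [[[AIR] * size for _ in range(size)] for _ in range(size)]
    cells = [(x, y, z) for x in range(size) for y in range(size) for z in range(size)]
    # rng.sample picks distinct cells, so exactly n_blocks voxels are occupied.
    for x, y, z in rng.sample(cells, n_blocks):
        grid[x][y][z] = rng.randint(1, n_types)  # block ids 1..n_types
    return grid


def to_tokens(grid):
    """Return (coords, tokens): one entry per occupied voxel, in raster order."""
    coords, tokens = [], []
    for x, plane in enumerate(grid):
        for y, row in enumerate(plane):
            for z, block in enumerate(row):
                if block != AIR:
                    coords.append((x, y, z))
                    tokens.append(block)
    return coords, tokens


def sparsity(grid):
    """Fraction of voxels in the dense grid that are empty."""
    total = sum(len(row) for plane in grid for row in plane)
    occupied = len(to_tokens(grid)[1])
    return 1.0 - occupied / total


grid = make_toy_grid(size=16, n_blocks=64, n_types=5)
coords, tokens = to_tokens(grid)
print(len(tokens), "tokens for", 16 ** 3, "voxels")  # 64 tokens for 4096 voxels
print(f"sparsity: {sparsity(grid):.4f}")  # sparsity: 0.9844
```

Under these assumptions the sequence length grows with the number of occupied voxels rather than with the cube of the grid resolution, and the background class never appears among the tokens to be predicted. The coordinate list plays the role of an occupancy map that tells a model where each token sits in space.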
Comparison of sample quality between Scaffold Diffusion and baselines: Scaffold Diffusion (top row), autoregressive baseline (middle row), and Lee et al. (2023) (bottom row). Scaffold Diffusion generates realistic and functional 3D structures, whereas the autoregressive baseline generates structures that are dominated by a few block types or that contain implausible block placements. The method of Lee et al. (2023) over-represents background voxels.
Diversity of generated samples. Scaffold Diffusion produces varied and realistic 3D structures for the same occupancy map.
Scaffold Diffusion generating a Minecraft structure, from noise to completion.
Interactive 3D Minecraft viewer demonstrating generated structures
Displayed structures are sampled from an uncurated set of 2,500 samples. Note: The current renderer does not support complex blocks such as stairs and beds and converts them to placeholder question blocks. The filtered view displays structures with minimal non-renderable blocks, which typically biases it towards simpler generations.