Our model is unique in that it jointly models both pose and images. We follow an inpainting strategy where we can condition on any known pixels and rays and generate the unknown pixels and rays. In practice, we use image and raymap latents from two separate VAEs that we train from scratch. Please see the figure in the Appendix for more details on our model architecture.
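To make this concrete, here is a minimal sketch (in PyTorch, with illustrative shapes and variable names that are not our actual implementation) of one way the image latents, raymap latents, and their known/unknown masks can be combined into the input of an inpainting-style model:

import torch

# Illustrative sizes only: N frames, C-channel latents at H x W resolution.
N, C, H, W = 24, 8, 32, 32
image_latents = torch.randn(N, C, H, W)    # stand-in for image-VAE latents
raymap_latents = torch.randn(N, C, H, W)   # stand-in for raymap-VAE latents

# Per-frame binary masks: 1 = known (conditioning), 0 = unknown (to generate).
image_known = torch.zeros(N, 1, H, W)
raymap_known = torch.ones(N, 1, H, W)      # e.g. all camera rays are known, ...
image_known[:4] = 1.0                      # ... but only the first 4 images are known

# Zero out unknown latents and concatenate latents with their masks along the
# channel dimension to form the conditioning input.
model_input = torch.cat([
    image_latents * image_known,
    raymap_latents * raymap_known,
    image_known,
    raymap_known,
], dim=1)                                  # shape: (N, 2*C + 2, H, W)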
We use our model to generate missing views of completely unobserved areas in a casual capture! On the right, we show some generated views alongside the result after training 3DGS. Keep scrolling through our project page for more results and videos!
We complete scenes from the NeRFiller Dataset with higher quality and consistency than the original NeRFiller approach.
Select a baseline method and scene to see the results. Ours is on the right.
Our model can generate a completed scene from an unposed image collection! We extract 16 images from an
input video
(9 shown here) and use our model to generate both ray maps and a 360° video that explores the scene.
The generated video follows the gray trajectory shown in the middle.
Unlike previous methods, our unified model handles both pose prediction and novel-view synthesis in a single framework, rather than treating them as separate tasks.
Our model supports a flexible number of inputs and outputs. Here we condition on 1, 2, 3, or 4 images and
generate the missing content in the full 24 frames by interpolating between them. In other experiments, we
condition on 16 or more images.
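For illustration only (this is not a prescribed recipe, and the helper name is made up), one simple way to choose which of the 24 frames act as conditioning frames is to spread them evenly across the sequence:

def conditioning_indices(num_frames: int, num_known: int) -> list[int]:
    """Hypothetical helper: spread the known frames evenly over the sequence."""
    if num_known == 1:
        return [0]
    step = (num_frames - 1) / (num_known - 1)
    return [round(i * step) for i in range(num_known)]

# e.g. conditioning_indices(24, 4) -> [0, 8, 15, 23]; the remaining 20 frames
# are generated by the model.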
Select a scene and how many images to condition on. The generated video follows the gray camera
trajectory.
We complete casual captures from the Nerfbusters Dataset.
Select a baseline method and scene to see the results. Ours is on the right.
Here we condition on a set of 16 images with known pose and generate a fly-around video all in one go! We accomplish this with a single generation rather than using any keyframing strategy or doing multiple passes. It takes less than a minute to generate these results on an A5000 GPU with 24GB of VRAM.

To understand more concretely how our model operates, consider the colorful image below. Our model is effectively conditioned on the "Masked images," "Masked origins," and "Masked directions" rows (where yellow indicates unknown information). The model generates the missing values in the "Inpainted" rows, copying where it can and generating content where it needs to (e.g., the ceiling). The "GT" rows are shown as a reference and are black when not available to us. Please see the paper for more details!
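The "origins" and "directions" above together form a per-pixel raymap. As a reference for how such a raymap can be derived, here is a minimal sketch assuming a standard pinhole camera model (conventions such as pixel-center offsets are simplified for illustration and may differ from what we use in practice):

import torch

def make_raymap(K: torch.Tensor, cam_to_world: torch.Tensor, H: int, W: int):
    """Per-pixel ray origins and directions from intrinsics K (3x3) and a
    camera-to-world pose (4x4), using a standard pinhole model."""
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pixels = torch.stack([u, v, torch.ones_like(u)], dim=-1)       # (H, W, 3)
    dirs_cam = pixels @ torch.linalg.inv(K).T                      # camera-space rays
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T                 # rotate into world space
    directions = dirs_world / dirs_world.norm(dim=-1, keepdim=True)  # unit directions
    origins = cam_to_world[:3, 3].expand(H, W, 3)                  # camera center per pixel
    return origins, directions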
We would like to thank Timur Bagautdinov, Jin Kyu Kim, Julieta Martinez, Su Zhaoen, Rawal Khirodkar, Nir Sopher, Nicholas Dahm, Alexander Richard, Bob Hansen, Stanislav Pidhorskyi, Tomas Simon, David McAllister, Justin Kerr, Frederik Warburg, Riley Peterlinz, Evonne Ng, Aleksander Holynski, Artem Sevastopolsky, Tobias Kirschstein, Chen Guo, Nikhil Keetha, Ayush Tewari, Changil Kim, Lorenzo Porzi, Corinne Stucker, Katja Schwarz, and Julian Straub for helpful discussions, technical support, and/or sharing relevant knowledge.
If we missed you, it’s our fault! Let us know. 🙂
Please consider citing our work if you find it useful.
@misc{weber2025fillerbuster,
title = {Fillerbuster: Multi-View Scene Completion for Casual Captures},
author = {Ethan Weber and Norman M\"uller and Yash Kant and Vasu Agrawal and
Michael Zollh\"ofer and Angjoo Kanazawa and Christian Richardt},
note = {arXiv:2502.05175},
year = {2025},
}