Our model is unique in that it jointly models both pose and images. We follow an inpainting strategy where we can condition on any known pixels and rays and generate the unknown pixels and rays. In practice, we use image and raymap latents from two separate VAEs that we train from scratch. Please see the figure in the Appendix for more details on our model architecture.
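To make this concrete, here is a minimal sketch (in PyTorch, with illustrative shapes and variable names that are not our actual implementation) of one way the image latents, raymap latents, and their known/unknown masks can be combined into the input of an inpainting-style model:

import torch

# Illustrative sizes only: N frames, C-channel latents at H x W resolution.
N, C, H, W = 24, 8, 32, 32
image_latents = torch.randn(N, C, H, W)    # stand-in for image-VAE latents
raymap_latents = torch.randn(N, C, H, W)   # stand-in for raymap-VAE latents

# Per-frame binary masks: 1 = known (conditioning), 0 = unknown (to generate).
image_known = torch.zeros(N, 1, H, W)
raymap_known = torch.ones(N, 1, H, W)      # e.g. all camera rays are known, ...
image_known[:4] = 1.0                      # ... but only the first 4 images are known

# Zero out unknown latents and concatenate latents with their masks along the
# channel dimension to form the conditioning input.
model_input = torch.cat([
    image_latents * image_known,
    raymap_latents * raymap_known,
    image_known,
    raymap_known,
], dim=1)                                  # shape: (N, 2*C + 2, H, W)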
We use our model to generate missing views of completely unobserved areas in a casual capture! On the right, we show some generated views alongside the result after training 3DGS. Keep scrolling through our project page for more results and videos!
We complete scenes from the NeRFiller Dataset with higher quality and consistency than the original NeRFiller approach.
Select a baseline method and scene to see the results. Ours is on the right.
Our model can generate a completed scene from an unposed image collection! We extract 16 images from an
input video
(9 shown here) and use our model to generate both ray maps and a 360° video that explores the scene.
The generated video follows the gray trajectory shown in the middle.
Unlike previous methods, our unified model handles both pose prediction and novel-view synthesis in a single framework, rather than treating them as separate tasks.
Our model supports a flexible number of inputs and outputs. Here we condition on 1, 2, 3, or 4 images and
generate the missing content in the full 24 frames by interpolating between them. In other experiments, we
condition on 16 or more images.
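For illustration only (this is not a prescribed recipe, and the helper name is made up), one simple way to choose which of the 24 frames act as conditioning frames is to spread them evenly across the sequence:

def conditioning_indices(num_frames: int, num_known: int) -> list[int]:
    """Hypothetical helper: spread the known frames evenly over the sequence."""
    if num_known == 1:
        return [0]
    step = (num_frames - 1) / (num_known - 1)
    return [round(i * step) for i in range(num_known)]

# e.g. conditioning_indices(24, 4) -> [0, 8, 15, 23]; the remaining 20 frames
# are generated by the model.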
Select a scene and how many images to condition on. The generated video follows the gray camera
trajectory.
We complete casual captures from the Nerfbusters Dataset.
Select a baseline method and scene to see the results. Ours is on the right.
Here we condition on a set of 16 images with known pose and generate a fly-around video all in one go! We accomplish this with a single generation rather than using any keyframing strategy or doing multiple passes. It takes less than a minute to generate these results on an A5000 GPU with 24GB of VRAM.

To understand more concretely how our model operates, consider the colorful image below. Our model is effectively conditioned on the "Masked images," "Masked origins," and "Masked directions" rows (where yellow indicates unknown information). The model generates the missing values in the "Inpainted" rows, copying where it can and generating content where it needs to (e.g., the ceiling). The "GT" rows are shown as a reference and are black when not available to us. Please see the paper for more details!
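The "origins" and "directions" above together form a per-pixel raymap. As a reference for how such a raymap can be derived, here is a minimal sketch assuming a standard pinhole camera model (conventions such as pixel-center offsets are simplified for illustration and may differ from what we use in practice):

import torch

def make_raymap(K: torch.Tensor, cam_to_world: torch.Tensor, H: int, W: int):
    """Per-pixel ray origins and directions from intrinsics K (3x3) and a
    camera-to-world pose (4x4), using a standard pinhole model."""
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pixels = torch.stack([u, v, torch.ones_like(u)], dim=-1)       # (H, W, 3)
    dirs_cam = pixels @ torch.linalg.inv(K).T                      # camera-space rays
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T                 # rotate into world space
    directions = dirs_world / dirs_world.norm(dim=-1, keepdim=True)  # unit directions
    origins = cam_to_world[:3, 3].expand(H, W, 3)                  # camera center per pixel
    return origins, directions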
We would like to thank Timur Bagautdinov, Jin Kyu Kim, Julieta Martinez, Su Zhaoen, Rawal Khirodkar, Nir Sopher, Nicholas Dahm, Alexander Richard, Bob Hansen, Stanislav Pidhorskyi, Tomas Simon, David McAllister, Justin Kerr, Frederik Warburg, Riley Peterlinz, Evonne Ng, Aleksander Holynski, Artem Sevastopolsky, Tobias Kirschstein, Chen Guo, Nikhil Keetha, Ayush Tewari, Changil Kim, Lorenzo Porzi, Corinne Stucker, Katja Schwarz, and Julian Straub for helpful discussions, technical support, and/or sharing relevant knowledge.
If we missed you, it’s our fault! Let us know. 🙂
Please consider citing our work if you find it useful.
@misc{weber2025fillerbuster,
title = {Fillerbuster: Multi-View Scene Completion for Casual Captures},
author = {Ethan Weber and Norman M\"uller and Yash Kant and Vasu Agrawal and
Michael Zollh\"ofer and Angjoo Kanazawa and Christian Richardt},
note = {arXiv:2502.05175},
year = {2025},
}