In this paper, we propose to achieve 4D generation by directly sampling dense multi-view, multi-frame observations of dynamic content, composing the estimated scores of pretrained video and multi-view diffusion models that have learned strong priors over dynamics and geometry.
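A minimal sketch of the score-composition idea, not the paper's actual code: `video_eps` and `mv_eps` are hypothetical callables returning the epsilon (noise) prediction of a pretrained video model and a pretrained multi-view model on a shared grid of latents, and `alpha` is an assumed blending weight.

```python
import torch

def composed_eps(x_t, t, video_eps, mv_eps, alpha=0.5):
    """Blend two pretrained models' noise predictions at one denoising step.

    x_t: latents of shape (V, F, C, H, W) -- V views, F frames.
    The video model denoises along the frame axis of each view;
    the multi-view model denoises across the views of each frame.
    """
    V, F, C, H, W = x_t.shape
    # Video prior: treat each view as an independent F-frame clip.
    eps_video = torch.stack([video_eps(x_t[v], t) for v in range(V)])
    # Multi-view prior: treat each frame as an independent V-view set.
    eps_mv = torch.stack([mv_eps(x_t[:, f], t) for f in range(F)], dim=1)
    # Epsilon is a scaled score; under an independence assumption the
    # two priors' scores add, so a convex combination is one heuristic.
    return alpha * eps_video + (1.0 - alpha) * eps_mv
```

Plugging `composed_eps` into a standard ancestral sampler would then draw multi-view, multi-frame samples that both priors agree on.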
A novel differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray tracing, and real-time relighting of 3D point clouds.
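A minimal sketch of the decomposition idea, assuming a hypothetical differentiable point splatter `splat(points, per_point_values, camera) -> (H, W, 3)`: render albedo and a Lambertian shading term separately and fit both to the observed images, so material and lighting factor apart.

```python
import torch

def render_decomposed(points, normals, albedo, light_dir, camera, splat):
    # Per-point Lambertian shading under a single directional light.
    shading = torch.clamp((normals * light_dir).sum(-1, keepdim=True), min=0.0)
    albedo_img = splat(points, albedo, camera)                    # (H, W, 3)
    shading_img = splat(points, shading.expand_as(albedo), camera)
    return albedo_img * shading_img                               # shaded image
```

With `albedo` and `light_dir` optimized against a photometric loss, relighting then amounts to swapping `light_dir` (or a richer lighting model) at test time.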
Combining (coarse) planar rendering with (fine) volume rendering achieves higher rendering quality and better generalization. A depth teacher network that predicts dense pseudo depth maps supervises the joint rendering mechanism and boosts the learning of consistent 3D geometry.
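A minimal sketch of the joint objective (names and weights are assumptions, not the paper's API): photometric losses on the coarse planar branch and the fine volume branch, plus supervision of the rendered depth by a frozen teacher's pseudo depth map.

```python
import torch
import torch.nn.functional as F

def joint_loss(planar_rgb, volume_rgb, volume_depth, gt_rgb, teacher_depth,
               w_planar=0.5, w_depth=0.1):
    loss_planar = F.mse_loss(planar_rgb, gt_rgb)   # coarse planar branch
    loss_volume = F.mse_loss(volume_rgb, gt_rgb)   # fine volume branch
    # A monocular teacher's pseudo depth is typically only defined up
    # to scale and shift, so a normalized comparison is one common choice.
    d, t = volume_depth.flatten(), teacher_depth.flatten()
    d = (d - d.mean()) / (d.std() + 1e-6)
    t = (t - t.mean()) / (t.std() + 1e-6)
    loss_depth = F.l1_loss(d, t)
    return loss_volume + w_planar * loss_planar + w_depth * loss_depth
```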