CamGeo Improves Sparse Camera-Conditioned Image-to-Video Generation with 3D Priors
Researchers have introduced CamGeo, a framework for sparse camera-conditioned image-to-video generation that distills 3D geometric knowledge from a pre-trained video-to-3D model (VGGT) directly into the diffusion backbone. The approach combines a training-only distillation strategy with coarse-to-fine curriculum learning to achieve 3D consistency and geometric realism without increasing inference latency. It addresses the pose drift and motion discontinuities prevalent in existing methods that rely on dense camera poses or simple interpolation.
Key Takeaways
- CamGeo incorporates keyframe trajectory distillation to enforce cycle-consistency with sparse input poses.
- Cross-frame consistency distillation uses camera trajectory and depth constraints for coherent structure in unsupervised frames.
- A three-stage coarse-to-fine curriculum learning strategy progressively increases geometric complexity, moving from global structure to fine-grained refinement.
- The 3D guidance from VGGT is removed during inference, maintaining high efficiency in video generation.
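To make the takeaways concrete, here is a minimal sketch of how a staged curriculum could weight two distillation terms on top of a standard diffusion loss. Everything specific in it is an assumption for illustration: the names `StageWeights`, `CURRICULUM`, `stage_weights`, and `training_loss`, the stage boundaries, and the weight values are all hypothetical and do not come from the paper, which may schedule and combine its losses differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageWeights:
    keyframe: float     # weight on the keyframe trajectory distillation term
    cross_frame: float  # weight on the cross-frame consistency distillation term


# Hypothetical three-stage schedule (not from the paper): each entry is
# (fraction of training at which the stage ends, loss weights for that stage).
CURRICULUM = [
    (0.3, StageWeights(keyframe=1.0, cross_frame=0.0)),  # coarse: global structure
    (0.7, StageWeights(keyframe=1.0, cross_frame=0.5)),  # intermediate
    (1.0, StageWeights(keyframe=0.5, cross_frame=1.0)),  # fine-grained refinement
]


def stage_weights(step: int, total_steps: int) -> StageWeights:
    """Return the loss weights for the curriculum stage containing `step`."""
    progress = step / total_steps
    for stage_end, weights in CURRICULUM:
        if progress <= stage_end:
            return weights
    return CURRICULUM[-1][1]


def training_loss(diffusion_loss: float,
                  keyframe_loss: float,
                  cross_frame_loss: float,
                  step: int,
                  total_steps: int) -> float:
    """Combine the base diffusion loss with the weighted distillation terms.

    The distillation terms would be computed against a frozen teacher during
    training only; at inference no teacher is called, so no cost is added.
    """
    w = stage_weights(step, total_steps)
    return (diffusion_loss
            + w.keyframe * keyframe_loss
            + w.cross_frame * cross_frame_loss)
```

Under these assumed weights, the cross-frame term is switched off in the first stage and dominates in the last. Because the teacher appears only inside the training loss, sampling from the trained model runs the same backbone as before.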
Why It Matters
Accurate 3D scene understanding is a bottleneck for creative control and realism in generative AI for video. By enhancing 3D consistency from sparse camera inputs, CamGeo improves the quality and plausibility of AI-generated video, particularly in scenarios where dense camera data is unavailable. This could enable more practical applications in content creation, virtual production, and visual effects, where precise camera control is critical. Watch for follow-up research on whether this distillation approach transfers to other generative models, and for broader adoption in commercial video synthesis platforms.
Read full article at arxiv.org
