Adobe, Universities Unveil Auteur: Language-Driven Cinematic AI for Video Generation
Adobe and university researchers have introduced Auteur, a language-driven framework that uses a domain-specific language (DSL) and an LLM-based director to automate human-centric camera framing in generative video. The method defines camera movement relative to actor pose, which gives precise control over shot size and composition and yields trajectories suitable for conditioning downstream video generators. Auteur was trained and evaluated on a new dataset of 34K aligned samples of text, human motion, and DSL-annotated camera trajectories.
Key Takeaways
- Auteur formalizes cinematographic framing with a human-centric camera parameterization, defining shots relative to an actor's body and movement.
- A fine-tuned multimodal LLM (Qwen2.5-VL) acts as a virtual director, mapping natural language descriptions and human motion to sparse DSL keyframes.
- The framework outputs dense actor and 6-DoF camera trajectories, which are compatible with existing video generators like VerseCrafter and Kimodo+VACE.
- The Auteur dataset used to train the model comprises 34,000 samples drawn from synthetic procedures and real-world movie footage (CondensedMovies).
- In quantitative evaluation, the system outperformed prior methods on framing metrics such as F-Ori, F-Scale, and Auteur-Score.
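To make the idea of actor-relative framing concrete, here is a minimal sketch of how sparse keyframes might be expanded into a dense camera path. It is not Auteur's DSL or implementation, which this summary does not describe in detail. The `Keyframe` fields, the `densify` and `camera_pose` helpers, the linear interpolation, the y-up coordinate convention, and the zero-roll assumption are all illustrative choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical sparse keyframe; field names are illustrative, not Auteur's DSL."""
    frame: int
    distance: float   # camera-to-actor distance (a stand-in for shot size)
    azimuth: float    # degrees around the actor, 0 = directly in front
    elevation: float  # degrees above the actor's anchor point


def densify(keyframes, num_frames):
    """Linearly interpolate sparse keyframes into one (distance, azimuth, elevation) per frame."""
    keys = sorted(keyframes, key=lambda k: k.frame)
    dense = []
    for f in range(num_frames):
        if f <= keys[0].frame:        # hold the first keyframe before it starts
            a = b = keys[0]
        elif f >= keys[-1].frame:     # hold the last keyframe after it ends
            a = b = keys[-1]
        else:                         # find the pair of keyframes that brackets f
            a, b = next((p, q) for p, q in zip(keys, keys[1:])
                        if p.frame <= f <= q.frame)
        span = b.frame - a.frame
        t = 0.0 if span == 0 else (f - a.frame) / span
        pairs = ((a.distance, b.distance),
                 (a.azimuth, b.azimuth),
                 (a.elevation, b.elevation))
        dense.append(tuple(x + (y - x) * t for x, y in pairs))
    return dense


def camera_pose(actor_pos, actor_yaw_deg, distance, azimuth, elevation):
    """Return (position, look_direction) for a camera framed relative to the actor.

    Assumes a y-up world where yaw 0 means the actor faces +z, distance > 0,
    and zero camera roll. The camera always looks at the actor's anchor point.
    """
    ang = math.radians(actor_yaw_deg + azimuth)
    el = math.radians(elevation)
    horizontal = distance * math.cos(el)
    pos = (actor_pos[0] + horizontal * math.sin(ang),
           actor_pos[1] + distance * math.sin(el),
           actor_pos[2] + horizontal * math.cos(ang))
    to_actor = tuple(a - p for a, p in zip(actor_pos, pos))
    norm = math.sqrt(sum(c * c for c in to_actor))
    look = tuple(c / norm for c in to_actor)
    return pos, look
```

For example, `densify([Keyframe(0, 4.0, 0.0, 0.0), Keyframe(4, 2.0, 90.0, 0.0)], 5)` describes a push-in that also arcs from the front of the actor to the side. Passing each frame's parameters to `camera_pose` along with the actor's position and facing at that frame gives a world-space camera path that follows the actor, which is the sense in which the framing is human-centric.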
Why It Matters
Auteur addresses a core challenge in generative video: intentional, professional-grade camera control, which is absent from models that treat camera motion as a byproduct. By tying camera behavior to semantic framing relative to human subjects, it offers a way to create videos with coherent visual narratives that follow professional cinematographic principles. The approach moves beyond passive viewpoint generation and lets creators author precise cinematic camera paths through natural language. Industry professionals should watch how it influences upcoming generative video platforms and the tools they offer for granular control over AI-generated content, particularly in narrative and advertising applications where aesthetic quality and specific framing are critical.
Read full article at arxiv.org
