GetInVideo

Add Anything You Want to the Video

Shaobin Zhuang1♠*,   Zhipeng Huang3♠*,   Binxin Yang2,   Ying Zhang2,   Fangyikang Wang6♠,  
Canmiao Fu2,   Chong Sun2,   Zheng-Jun Zha3,   Chen Li2,   Yali Wang4,5†
1Shanghai Jiao Tong University,   2WeChat, Tencent Inc,  
3University of Science and Technology of China,  
4Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,  
5Shanghai Artificial Intelligence Laboratory,   6Zhejiang University  
†Corresponding author   ♠*Work done as interns at WeChat

Abstract

Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage, yet current approaches fundamentally fail to capture the unique visual characteristics of particular subjects and to ensure natural interactions between the inserted instance and the scene. We formalize this overlooked yet critical editing paradigm as "Get-In-Video Editing", where users provide reference images to precisely specify the visual elements they wish to incorporate into videos.

This task poses two challenges: severe training-data scarcity and the technical difficulty of maintaining spatiotemporal coherence. To address them, we introduce three key contributions:

First, we develop the GetIn-1M dataset, created through our automated Recognize-Track-Erase pipeline. The pipeline sequentially performs video captioning, salient instance identification, object detection, temporal tracking, and instance removal to generate high-quality video editing pairs with comprehensive annotations (reference image, tracking mask, instance prompt); a sketch of this data flow follows.
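Below is a minimal Python sketch of a Recognize-Track-Erase style data flow, assuming only the stage ordering and annotation format described above. The four stage functions are hypothetical stand-ins (a real pipeline would back them with a video captioner, an open-vocabulary detector, a mask tracker, and a video inpainter); they are placeholders, not the authors' components.

```python
# Hedged sketch of a Recognize-Track-Erase style pipeline. Only the data
# flow is illustrated; every model call below is a placeholder stub.

from dataclasses import dataclass

import numpy as np


@dataclass
class EditingPair:
    reference_image: np.ndarray   # crop of the salient instance
    instance_prompt: str          # text phrase naming the instance
    tracking_masks: np.ndarray    # (T, H, W) per-frame binary masks
    condition_video: np.ndarray   # (T, H, W, 3) video with instance erased
    target_video: np.ndarray      # (T, H, W, 3) original video (supervision)


def caption_and_pick_instance(video: np.ndarray) -> str:
    """Stages 1-2: caption the video, then pick one salient instance phrase."""
    return "a capybara"  # placeholder; a captioner + filter would go here


def detect_instance(frame: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 3: detect the named instance in the first frame as a binary mask."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[8:24, 8:24] = True  # placeholder box; a detector would go here
    return mask


def track_masks(video: np.ndarray, first_mask: np.ndarray) -> np.ndarray:
    """Stage 4: propagate the first-frame mask through time."""
    return np.repeat(first_mask[None], len(video), axis=0)  # placeholder


def erase_instance(video: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Stage 5: remove the instance; a gray fill stands in for inpainting."""
    erased = video.copy()
    erased[masks] = 127
    return erased


def build_pair(video: np.ndarray) -> EditingPair:
    prompt = caption_and_pick_instance(video)
    first_mask = detect_instance(video[0], prompt)
    masks = track_masks(video, first_mask)
    ys, xs = np.where(first_mask)
    reference = video[0, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return EditingPair(reference, prompt, masks,
                       erase_instance(video, masks), video)


if __name__ == "__main__":
    dummy = np.random.randint(0, 255, (16, 64, 64, 3), dtype=np.uint8)
    pair = build_pair(dummy)
    print(pair.instance_prompt, pair.tracking_masks.shape,
          pair.condition_video.shape)
```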

Second, we present GetInVideo, a novel end-to-end framework that leverages a diffusion transformer architecture with 3D full attention to process reference images, condition videos, and masks jointly. This design maintains temporal coherence, preserves visual identity, and ensures natural scene interactions when integrating reference objects into videos; the sketch below illustrates the joint-attention idea.
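The following PyTorch sketch illustrates the joint-conditioning idea, not the released implementation: tokens from the reference image, the condition video, and the masks are concatenated into one sequence, so a single full-attention layer attends across space, time, and all conditioning streams at once. The module names, dimensions, and token layout are all assumptions.

```python
# Illustrative joint full-attention block over concatenated reference,
# video, and mask tokens. Shapes and the patchify step are assumed.

import torch
import torch.nn as nn


class JointFullAttentionBlock(nn.Module):
    """One transformer block applying full attention over the joint sequence."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


def joint_sequence(ref_tokens, video_tokens, mask_tokens):
    """Concatenate along the token axis: attention becomes 'full' over
    space, time, and every conditioning stream simultaneously."""
    return torch.cat([ref_tokens, video_tokens, mask_tokens], dim=1)


if __name__ == "__main__":
    B, D = 2, 256
    ref = torch.randn(B, 64, D)       # reference-image patch tokens
    vid = torch.randn(B, 16 * 64, D)  # T=16 frames x 64 patches each
    msk = torch.randn(B, 16 * 64, D)  # per-frame mask tokens
    block = JointFullAttentionBlock(D)
    out = block(joint_sequence(ref, vid, msk))
    # Only the video positions would be decoded back into frames.
    video_out = out[:, 64:64 + 16 * 64]
    print(video_out.shape)  # torch.Size([2, 1024, 256])
```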

Finally, we establish GetInBench, the first comprehensive benchmark for the Get-In-Video Editing scenario, and demonstrate our approach's superior performance through extensive evaluations. Our work enables accessible, high-quality incorporation of specific real-world subjects into videos, significantly advancing personalized video editing.

Method

GetInVideo Model Architecture.

GetIn-1M dataset construction pipeline.

Samples generated by our GetInVideo.

Generated video results

  • A bag was left on the beach; no one cared about it, and the sea water just washed over it.
  • A capybara is looking for food in a forest full of fallen leaves.
  • A fire-breathing dragon stands on the clouds and roars at the sky.
  • A beautiful woman appears in front of the computer screen.
  • A wall features three black-and-white stencil art pieces, each depicting a figure in a different pose and attire: a woman in a dress holding a baby on the left, a woman in a long dress standing upright in the middle, and a man holding a bag on the right.