Video-GPT via Next Clip Diffusion

1Shanghai Jiao Tong University  2WeChat Vision, Tencent Inc.  3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences 
4Shanghai AI Laboratory  5University of Science and Technology of China  6Zhejiang University
Work done as interns at WeChat Vision, Tencent Inc. Corresponding authors.

Abstract: GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatio-temporal details of the visual world; video sequences, in contrast, capture such details well. Motivated by this fact, we propose a concise Video-GPT that treats video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key capability for world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, demonstrating strong generalization to downstream tasks.
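For intuition, below is a minimal PyTorch-style sketch of one pretraining step under the next clip diffusion objective. Everything here (the `model(history, noisy_clip, t)` signature, the toy cosine noise schedule) is an illustrative assumption rather than the paper's actual implementation; the key idea is that only the next clip is noised, while all earlier clips stay clean as context.

```python
# Hedged sketch of a next-clip-diffusion training step; names are hypothetical.
import torch

def next_clip_diffusion_loss(model, video_clips, num_timesteps=1000):
    """video_clips: (B, N, C, T, H, W), a batch of videos split into N clips each."""
    B, N = video_clips.shape[:2]
    # Pick which clip to predict; all earlier clips serve as clean history.
    idx = torch.randint(1, N, (1,)).item()
    history = video_clips[:, :idx]          # clean context clips
    target = video_clips[:, idx]            # the "next" clip to be learned

    # DDPM-style forward process applied to the target clip only.
    t = torch.randint(0, num_timesteps, (B,), device=target.device)
    noise = torch.randn_like(target)
    alpha_bar = torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2  # toy cosine schedule
    a = alpha_bar.view(B, 1, 1, 1, 1)
    noisy_target = a.sqrt() * target + (1 - a).sqrt() * noise

    # The denoiser attends to the clean history while predicting the noise
    # in the corrupted next clip -- the analogue of next token prediction.
    pred_noise = model(history, noisy_target, t)
    return torch.nn.functional.mse_loss(pred_noise, noise)
```

Because the loss only ever asks the model to denoise one clip given its clean past, training mirrors the causal structure of next token prediction in GPT.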

Overview

Video-GPT is a self-supervised, generatively pretrained video model that treats video as a new language for visual world modeling via next clip diffusion. It is designed to be simple, flexible, and easy to follow. Previous works on visual generation (such as Sora, WanX, HunyuanVideo, and MovieGen) rely heavily on supervisory signals from the textual modality. However, vision, as a natural human ability, emerged even earlier than language. We therefore believe that the visual modality alone carries sufficient information for a model to learn about the world.

[Figure: teaser.png — overview of Video-GPT and the next clip diffusion paradigm]

Powerful Zero-Shot Video Prediction Capabilities

Given a conditioning video, the model predicts how the scene evolves, which probes its ability to model the physical world.
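Prediction then amounts to autoregressive rollout: repeatedly denoise a fresh noise clip conditioned on the growing history of clean clips, append the result, and continue. The sketch below assumes the same hypothetical `model` and toy cosine schedule as above, with a deterministic DDIM-style update; it illustrates the paradigm and is not the released sampler.

```python
# Hedged sketch of autoregressive next-clip rollout; names are hypothetical.
import torch

@torch.no_grad()
def rollout(model, cond_clips, num_future_clips, num_steps=50, T=1000):
    """cond_clips: (B, N, C, t, H, W) conditioning clips; returns the extended video."""
    history = list(cond_clips.unbind(dim=1))
    B, clip_shape = cond_clips.size(0), cond_clips.shape[2:]
    for _ in range(num_future_clips):
        x = torch.randn(B, *clip_shape, device=cond_clips.device)  # start from pure noise
        steps = torch.linspace(T - 1, 0, num_steps, device=x.device).long()
        for i, step in enumerate(steps):
            eps = model(torch.stack(history, dim=1), x, step.expand(B))  # predict noise
            a = torch.cos(step.float() / T * torch.pi / 2) ** 2          # same toy schedule
            x0 = (x - (1 - a).sqrt() * eps) / a.sqrt()                   # clean-clip estimate
            if i + 1 < len(steps):
                a_prev = torch.cos(steps[i + 1].float() / T * torch.pi / 2) ** 2
                x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps       # deterministic DDIM step
            else:
                x = x0
        history.append(x)  # the finished clip joins the clean context
    return torch.stack(history, dim=1)
```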

[Figures: phys_bench.png — Physics-IQ Benchmark results; k600.png — Kinetics-600 results]

Physical World Modeling Visualization

[Side-by-side video panels: LVM vs. SEINE vs. Video-GPT on two example sequences]

Video-GPT clearly predicts the future of a video more accurately and is far less prone to collapse, while the other methods not only struggle to simulate the laws of physics but also often cause the video content to collapse.

Long Video Prediction Visualization

[Six long-video examples: each shows a conditioning clip followed by the frames predicted by Video-GPT]

Amazing Generalization Capability for Downstream Tasks

[Figure: downstream.png — results on 6 mainstream downstream video tasks]

Class-to-Video Generation on UCF-101 Visualization

[Video panels for the classes: BenchPress, ApplyLipstick, WallPushups, Skijet, BilliardsShot, CleanAndJerk]

Video Object Segmentation Visualization

Fine-tuned for 5K steps on 1,000 samples from GenIn-1M.

Image Animation Visualization

[Animation panels for the prompts: "lie down", "transform", "fly"]