Video-GPT via Next Clip Diffusion

¹Shanghai Jiao Tong University ²WeChat Vision, Tencent Inc. ³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
⁴Shanghai AI Laboratory ⁵University of Science and Technology of China ⁶Zhejiang University

†Work done as interns at WeChat Vision, Tencent Inc. ‡Corresponding authors.

Abstract: GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatial-temporal details of the visual world, whereas video sequences capture such details well. Motivated by this fact, we propose a concise Video-GPT that treats video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, which is a key factor for world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, showing strong generalization to downstream tasks.

Overview

Video-GPT is a self-supervised generative pre-trained video model that treats video as a new language for visual world modeling through next clip diffusion. It is designed to be simple, flexible, and easy to follow. Previous works on visual generation (such as Sora, WanX, HunyuanVideo, and MovieGen) rely heavily on supervisory signals from the textual modality. However, vision, as a natural human ability, developed even earlier than language. We therefore believe that the information in the visual modality alone is sufficient for a model to learn to model the world.

Powerful Zero-Shot Video Prediction Capabilities

Given a conditioning video, the model predicts how the video continues, which tests its ability to model the physical world.
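The prediction loop can be illustrated with a minimal sketch of autoregressive, clip-by-clip denoising. Everything named here is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for the learned network (it simply repeats the last clean frame), and the linear noise schedule, step count, and function names are not taken from the Video-GPT implementation. The sketch only shows the control flow: each new clip starts as noise, is denoised conditioned on the clean history, and is then appended to that history.

```python
import numpy as np


def toy_denoiser(noisy_clip, history, t):
    """Hypothetical stand-in for a learned denoiser (NOT the real model).

    Returns a guess of the clean clip: the last frame of the clean
    history, repeated for every frame of the clip being denoised.
    The noise level t is ignored by this toy function.
    """
    last_frame = history[-1][-1]
    return np.broadcast_to(last_frame, noisy_clip.shape)


def denoise_next_clip(history, clip_shape, num_steps, rng):
    """Denoise one clip from pure noise, conditioned on clean history."""
    x = rng.standard_normal(clip_shape)  # start from Gaussian noise
    for i in range(num_steps, 0, -1):
        t, t_next = i / num_steps, (i - 1) / num_steps
        x0_hat = toy_denoiser(x, history, t)
        # Assumed linear schedule: shrink the residual noise from
        # level t to level t_next; at t_next == 0 the clip is clean.
        x = x0_hat + (t_next / t) * (x - x0_hat)
    return x


def predict_video(condition_clips, num_future_clips, num_steps=8, seed=0):
    """Autoregressively extend a video one clip at a time."""
    rng = np.random.default_rng(seed)
    history = list(condition_clips)  # clean clips observed so far
    for _ in range(num_future_clips):
        clip = denoise_next_clip(history, history[-1].shape, num_steps, rng)
        history.append(clip)  # the denoised clip joins the clean history
    return history


if __name__ == "__main__":
    # One conditioning clip: 4 frames of 2x2 RGB.
    cond = [np.zeros((4, 2, 2, 3))]
    video = predict_video(cond, num_future_clips=3)
    print(len(video), video[-1].shape)
```

Because every generated clip becomes part of the clean history, the same loop covers both short-term generation (one clip) and long-term prediction (many clips); only `num_future_clips` changes.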

Physical World Modeling Visualization

[Video comparison, two examples, each with three panels: LVM, Seine, Video-GPT]

Video-GPT predicts the future of the video more accurately and is less prone to collapse, whereas the other methods struggle to simulate the laws of physics and often produce video content that breaks down.

Long Video Prediction Visualization