Abstract: GPT has shown remarkable success in natural language processing. However, a language sequence is not sufficient to describe spatial-temporal details in the visual world, whereas a video sequence captures such details well. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising the noisy clip conditioned on the clean clips in the history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks in both video generation and understanding, showing strong generalization capacity on downstream tasks.
Video-GPT is a self-supervised generative pre-trained video model that treats video as a new language for visual world modeling via next clip diffusion. It is designed to be simple, flexible, and easy to follow. Previous works on visual generation (such as Sora, WanX, HunyuanVideo, MovieGen) rely heavily on supervisory signals from textual modalities. However, vision, as a natural human ability, emerged even earlier than language. We therefore believe that the information in the visual modality alone is sufficient for a model to learn to model the world.
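To make the next clip diffusion idea more concrete, the following is a minimal sketch of how one pretraining example might be assembled: the clips before position `k` stay clean and serve as history, while clip `k` is corrupted with noise and becomes the denoising target. Everything here is an assumption for illustration, not the authors' implementation: the helper names (`split_into_clips`, `add_noise`, `make_training_example`), the linear blend used as the noising rule, and the representation of a frame as a flat list of floats.

```python
import random

def split_into_clips(frames, clip_len):
    """Group frames into consecutive, non-overlapping clips (trailing frames are dropped)."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

def add_noise(clip, t, rng):
    """Blend a clean clip with Gaussian noise: t=0 keeps it clean, t=1 is pure noise.

    The linear blend is an assumed noising rule chosen for simplicity.
    """
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in clip]
    noisy = [[(1.0 - t) * x + t * n for x, n in zip(frame, noise_frame)]
             for frame, noise_frame in zip(clip, noise)]
    return noisy, noise

def make_training_example(clips, k, t, rng):
    """Return (clean history, noisy target clip, noise) for target position k."""
    history = clips[:k]                      # clean clips the model may condition on
    noisy_target, noise = add_noise(clips[k], t, rng)
    return history, noisy_target, noise

# Toy video: 8 frames with 4 "pixels" each, grouped into clips of 2 frames.
frames = [[float(i)] * 4 for i in range(8)]
clips = split_into_clips(frames, clip_len=2)
history, noisy_target, noise = make_training_example(clips, k=2, t=0.5, rng=random.Random(0))
```

A denoising network would then be trained to recover the clean clip (or the noise) from `noisy_target` given `history`; that network and its loss are omitted here because the text above does not specify them.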
Given a conditioning video, the model predicts the future frames, which tests its ability to model the physical world.
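The prediction setting can be sketched as an autoregressive rollout: each new clip starts from pure noise, is denoised step by step while conditioning on all clean clips so far, and is then appended to the history. This is a hedged illustration only. The function names, the step schedule, and in particular `persistence_denoiser` (a hypothetical stand-in for the trained network that simply assumes the next clip equals the last clean clip) are assumptions, not part of Video-GPT.

```python
import random

def persistence_denoiser(history, noisy_clip, t_cur, t_next):
    """Hypothetical stand-in for the network.

    It guesses that the clean clip equals the last clean clip in history and
    moves the noisy clip toward that guess as the noise level drops to t_next.
    """
    guess = history[-1]
    scale = t_next / t_cur
    return [[g + scale * (x - g) for x, g in zip(frame, guess_frame)]
            for frame, guess_frame in zip(noisy_clip, guess)]

def predict_future(condition_clips, num_new_clips, denoise_step, num_steps=4, seed=0):
    """Roll out future clips one at a time, each denoised from pure noise."""
    rng = random.Random(seed)
    history = list(condition_clips)          # copy so the caller's list is untouched
    for _ in range(num_new_clips):
        # Start the new clip as Gaussian noise with the same shape as the last clip.
        clip = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in history[-1]]
        for i in range(num_steps):
            t_cur = 1.0 - i / num_steps
            t_next = 1.0 - (i + 1) / num_steps
            clip = denoise_step(history, clip, t_cur, t_next)
        history.append(clip)                 # the denoised clip becomes clean history
    return history[len(condition_clips):]

# Toy conditioning video: one clip of 2 frames with 2 "pixels" each.
condition = [[[0.0, 1.0], [2.0, 3.0]]]
future = predict_future(condition, num_new_clips=3, denoise_step=persistence_denoiser)
```

With the placeholder denoiser the rollout just repeats the last conditioning clip; swapping in a learned denoiser is what would let the predicted clips reflect physical dynamics.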
Compared methods (video panels): LVM, Seine, Video-GPT.
Video-GPT predicts the future of the video more accurately and is less prone to collapse, whereas the other methods not only struggle to simulate the laws of physics but also often let the video content collapse.
Video panels (six examples), each showing: Condition | Predicted by Video-GPT
Video panels: BenchPress, ApplyLipstick, WallPushups, Skijet, BilliardsShot, CleanAndJerk
Fine-tuned for 5K steps on 1,000 samples from GenIn-1M.
Video panels: lie down, transform, fly