Vlogger: Make Your Dream A Vlog

Shaobin Zhuang^1,2‡, Kunchang Li^3,4,2‡, Xinyuan Chen^2†, Yaohui Wang^2†, Ziwei Liu^5, Yu Qiao^2,3, Yali Wang^1,2,3†

¹Shanghai Jiao Tong University ²Shanghai Artificial Intelligence Laboratory ³Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
⁴University of Chinese Academy of Sciences ⁵S-Lab, Nanyang Technological University

^‡Interns at Shanghai AI Laboratory ^†Corresponding authors


Abstract: In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) from user descriptions. Unlike short videos of a few seconds, a vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, Vlogger leverages a Large Language Model (LLM) as Director and decomposes the long video generation task of a vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals: (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. By mimicking human vlog production in this way, Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as the videographer in Vlogger and generates the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. In addition, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.
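The four-stage decomposition described in the abstract can be pictured with a small Python sketch. This is only an illustration of the control flow (plan scenes, attach actor references, then generate video and narration per scene), not the Vlogger implementation. Every name here (`Scene`, `Snippet`, `write_script`, `cast_actors`, `shoot`, `voice`, `make_vlog`) is hypothetical, and each stage is a stub standing in for the foundation model that would play that role.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scene:
    """One shooting scene planned top-down (hypothetical structure)."""
    description: str
    actors: List[str] = field(default_factory=list)


@dataclass
class Snippet:
    """Placeholder for one generated scene: video plus narration."""
    scene: Scene
    video: str  # stand-in for real video frames
    audio: str  # stand-in for real audio


def write_script(story: str) -> List[Scene]:
    """Script stage (stub): split the story into one scene per sentence.

    Assumption: a real system would ask an LLM to plan the scenes.
    """
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    return [Scene(description=s) for s in sentences]


def cast_actors(scenes: List[Scene], known_actors: List[str]) -> List[Scene]:
    """Actor stage (stub): attach each known actor mentioned in a scene."""
    for scene in scenes:
        text = scene.description.lower()
        scene.actors = [a for a in known_actors if a.lower() in text]
    return scenes


def shoot(scene: Scene) -> str:
    """ShowMaker stage (stub): would condition on script text and actor refs."""
    return f"<video: {scene.description} | refs={scene.actors}>"


def voice(scene: Scene) -> str:
    """Voicer stage (stub): would synthesize narration for the scene."""
    return f"<audio: {scene.description}>"


def make_vlog(story: str, known_actors: List[str]) -> List[Snippet]:
    """Top-down planning followed by bottom-up, scene-by-scene shooting."""
    scenes = cast_actors(write_script(story), known_actors)
    return [Snippet(scene=s, video=shoot(s), audio=voice(s)) for s in scenes]


if __name__ == "__main__":
    story = "Teddy packs a bag. Teddy boards a plane. The plane lands at sunset."
    for snippet in make_vlog(story, known_actors=["Teddy"]):
        print(snippet.video, snippet.audio)
```

Keeping each stage behind its own function mirrors the role-based design in the abstract: any stub could be swapped for a real model without changing the planning loop.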

Teddy Travel (by Vlogger)

Director: GPT-4
Videographer: ShowMaker
Actor: Teddy
Designer: SDXL
Voicer: Bark

The cover of Teddy Travel was made by us.

Comparison with State-of-the-art

Story: Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion's face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city. (Story from Phenaki.github.io)

Phenaki

Vlogger

(T+Ref)2V Generation

Scene Reference

Fireworks explode over the pyramids.

Scene Reference

The Great Wall burning with raging fire.

Object Reference

A cat is running on the beach.

(T+I)2V Generation

First frame

Cinematic photograph. View of piloting aaero.

First frame

A fish swims past an oriental woman.

First frame

A big drop of water falls on a rose petal.

First frame

Underwater environment cosmetic bottles.

First frame

Planet hits earth.

T2V Generation

A duck is teaching math to another duck.

Light blue water lapping on the beach.

Bezos explores tropical rainforest.

A deer looks at the sunset behind him.