Shaobin Zhuang
A third-year PhD student focusing on video & image generation and unified models for generation and understanding.
Rongke Consulting Center
Beijing, China
Hi! I am a fourth-year PhD student. My research mainly focuses on video and image generative super-resolution and deep learning architectures. Currently, I am also working at the Rongke Consulting Center in Beijing.
My goal is to push the boundaries of single-step generation and create more efficient, high-fidelity visual models.
When I am not running experiments or pushing code, you can probably find me assembling Lego sets to decompress.
selected publications
- UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model, arXiv preprint arXiv:2602.14178, 2026
- BitDance: Scaling Autoregressive Generative Models with Binary Tokens, arXiv preprint arXiv:2602.14041, 2026
- Video-GPT via Next Clip Diffusion, ICLR, 2026
- Wetok: Powerful discrete tokenization for high-fidelity visual reconstruction, ICLR, 2026
- LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution, ICLR, 2026
- Get in video: Add anything you want to the video, arXiv preprint arXiv:2503.06268, 2025
- TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision, ICML, 2025
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat, CVPR, 2025
- V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents, CVPR, 2025
- MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration, AAAI, 2024
- TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration, NeurIPS, 2024
- Vlogger: Make Your Dream A Vlog, CVPR, 2024
- Seine: Short-to-long video diffusion model for generative transition and prediction, ICLR, 2024