HARIVO:

Harnessing Text-to-Image Models for Text-to-Video Generation

ECCV 2024


Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

1Yonsei University, 2Adobe, 3University of Illinois Urbana-Champaign, 4Portland State University, 5GivernyAI

*This work was done during an internship at Adobe. Project Lead. Corresponding Author.


Abstract

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only temporal layers. We advance this approach with a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We integrate video-specific inductive biases into both the architecture and the loss functions. Our method, built on the frozen StableDiffusion model, simplifies the training process and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth.

We use only the public WebVid-10M dataset.
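To make the training setup concrete, below is a minimal PyTorch sketch of freezing the pretrained T2I UNet and optimizing only injected temporal layers, in the spirit of the approach described above. The `TemporalLayer` module and the way it is collected are illustrative placeholders, not the actual HARIVO implementation.

```python
# Minimal sketch: the pretrained Text-to-Image UNet stays frozen and only
# newly injected temporal layers receive gradients. The module below is a
# toy stand-in for the real temporal blocks.
import torch
import torch.nn as nn


class TemporalLayer(nn.Module):
    """Toy temporal attention block applied along the frame axis."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, channels) -> attend across frames per spatial token.
        b, f, t, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, c)   # (batch*tokens, frames, channels)
        h = self.norm(h)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, t, f, c).permute(0, 2, 1, 3)
        return x + out                                    # residual connection


def freeze_t2i_and_collect_trainables(unet: nn.Module, temporal_layers: nn.ModuleList):
    """Freeze every pretrained T2I weight; only temporal layers are trainable."""
    unet.requires_grad_(False)
    temporal_layers.requires_grad_(True)
    return [p for p in temporal_layers.parameters() if p.requires_grad]


# Usage sketch: the optimizer only sees the temporal parameters.
# params = freeze_t2i_and_collect_trainables(unet, temporal_layers)
# optimizer = torch.optim.AdamW(params, lr=1e-4)
```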

Overall architecture of HARIVO

Overall architecture of HARIVO. We train the injected temporal layers while keeping the Text-to-Image model frozen, and we propose methods to leverage video-specific inductive biases. Specifically, we introduce a mapping network that modifies the input distribution and a frame-wise token generator that creates different text tokens for each frame. Additionally, we propose a temporal regularized self-attention loss that encourages the self-attention maps to be similar across frames, and a decoupled contrastive loss on the deepest feature map, the h-space.
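As an illustration of the temporal regularized self-attention loss, the hedged sketch below penalizes differences between the self-attention maps of adjacent frames. The tensor layout, the squared-error form, and the `lambda_attn` weight are assumptions, and the decoupled contrastive loss in the h-space is not reproduced here.

```python
# Plausible sketch of a temporal self-attention regularizer: attention maps
# collected from one spatial self-attention layer of the frozen UNet should
# change little from one frame to the next.
import torch


def temporal_self_attention_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (frames, heads, queries, keys) self-attention probabilities."""
    diff = attn_maps[1:] - attn_maps[:-1]   # adjacent-frame differences
    return diff.pow(2).mean()


# Usage sketch: add the regularizer to the standard diffusion noise-prediction loss.
# loss = mse_loss(noise_pred, noise) + lambda_attn * temporal_self_attention_loss(attn_maps)
```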

Ablation Study


A chihuahua in astronaut suit floating in space.


Effect of varying the guidance scale α in MG (mitigating gradient) sampling.



A cute funny rabbit dancing, realistic

Off-the-shelf Model Results

HARIVO completely freezes the image model, allowing the free use of off-the-shelf models such as ControlNet.

The condition can be any specific frame (e.g., the first and last frames of the video).

Similar to AnimateDiff, we can use any DreamBooth model or LoRA.
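The sketch below illustrates how such off-the-shelf components could be composed with a frozen Stable Diffusion backbone using the Hugging Face diffusers image-side API. It is not the released HARIVO pipeline; the model identifiers and paths are examples and placeholders.

```python
# Illustrative sketch: because the image backbone is frozen, off-the-shelf
# Stable Diffusion components (ControlNet, DreamBooth checkpoints, LoRAs)
# can be swapped in without retraining the temporal layers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Off-the-shelf ControlNet (e.g., Canny-edge conditioning).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)

# Any DreamBooth-finetuned Stable Diffusion checkpoint can serve as the frozen
# base T2I model; "path/to/dreambooth-checkpoint" is a placeholder.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/dreambooth-checkpoint", controlnet=controlnet, torch_dtype=torch.float16
)

# Optionally attach a LoRA on top of the frozen weights.
pipe.load_lora_weights("path/to/lora-weights")

# In HARIVO, the pretrained temporal layers would then be injected into this
# frozen image UNet to turn the image pipeline into a video generator.
```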

Text-to-Video Results

HARIVO completely freezes the image model, enabling the generation of videos in various styles such as illustration, pencil drawing, and sci-fi. Additionally, the video-specific inductive biases further enhance motion in the generated videos.