We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only the temporal layers. We advance this approach by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We successfully integrate video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen Stable Diffusion model, simplifies the training process and allows for seamless integration with off-the-shelf models such as ControlNet and DreamBooth.
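For concreteness, below is a minimal PyTorch sketch of this frozen-backbone setup: the pretrained T2I UNet is frozen and only the injected temporal layers receive gradients. This is not the official implementation; the module and function names (TemporalLayer, trainable_params) are illustrative.

```python
import torch
import torch.nn as nn


class TemporalLayer(nn.Module):
    """Illustrative temporal self-attention block applied along the frame axis."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * height * width, frames, channels) -- attend over frames.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


def trainable_params(t2i_unet: nn.Module, temporal_layers: nn.ModuleList):
    # Freeze every parameter of the pretrained T2I backbone.
    for p in t2i_unet.parameters():
        p.requires_grad_(False)
    # Only the injected temporal layers are optimized.
    return list(temporal_layers.parameters())


# Toy stand-in for the frozen Stable Diffusion UNet.
t2i_unet = nn.Sequential(nn.Conv2d(4, 320, 3, padding=1))
temporal_layers = nn.ModuleList([TemporalLayer(320)])
optimizer = torch.optim.AdamW(trainable_params(t2i_unet, temporal_layers), lr=1e-4)
```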
We use only the public WebVid-10M dataset.
Overall architecture of HARIVO. We train the injected temporal layers while keeping the Text-to-Image model frozen, and propose methods that leverage video-specific inductive biases. Specifically, we introduce a mapping network that modifies the input distribution and a frame-wise token generator that produces a different text token sequence for each frame. We also propose a temporally regularized self-attention loss, which encourages the self-attention maps to be similar across frames, and a decoupled contrastive loss on the deepest feature map, the h-space.
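The sketch below illustrates the intent of two of these components under assumed formulations that may differ from the paper: a temporal self-attention regularizer that penalizes frame-to-frame changes in the attention maps, and a frame-wise token generator that offsets a shared text token sequence with per-frame embeddings. The mapping network and the decoupled contrastive loss in the h-space are omitted for brevity; all function names are illustrative.

```python
import torch


def temporal_self_attention_reg(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (frames, heads, queries, keys) self-attention probabilities
    for one video. Penalizes frame-to-frame deviation of the attention maps."""
    diff = attn_maps[1:] - attn_maps[:-1]  # neighboring-frame differences
    return diff.pow(2).mean()


def framewise_tokens(base_tokens: torch.Tensor, frame_embed: torch.Tensor) -> torch.Tensor:
    """Illustrative frame-wise token generator: a shared text token sequence
    (seq_len, dim) is offset by a learned per-frame embedding (frames, dim),
    giving a distinct token sequence per frame, shape (frames, seq_len, dim)."""
    return base_tokens.unsqueeze(0) + frame_embed.unsqueeze(1)


# Toy shapes: 16 frames, 8 heads, 64x64 attention maps, 77 CLIP text tokens.
attn = torch.randn(16, 8, 64, 64).softmax(dim=-1)
loss_tsa = temporal_self_attention_reg(attn)
tokens = framewise_tokens(torch.randn(77, 768), torch.randn(16, 768))
```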
HARIVO completely freezes the image model, allowing the free use of off-the-shelf models such as ControlNet.
The condition can be any specific frame (e.g., the first and last frames of the video).
As with AnimateDiff, we can use any DreamBooth model or LoRA.
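As a hedged illustration of why the frozen backbone matters in practice, the snippet below assembles the frozen image-side components with the public diffusers library; the HARIVO temporal layers are not part of diffusers, so only the off-the-shelf pieces (ControlNet, DreamBooth checkpoints, LoRA) are shown, and the LoRA path is a placeholder.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A standard ControlNet plugs into the (frozen) Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
# A DreamBooth checkpoint can simply replace the base model ID here, since it
# only changes weights that HARIVO leaves frozen.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
# LoRA weights load on top of the frozen backbone (placeholder path).
pipe.load_lora_weights("path/to/lora")
```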
HARIVO completely freezes the image model, enabling the generation of videos in various styles such as illustration, pencil drawing, and sci-fi. In addition, the video-specific inductive biases further enhance motion in the generated videos.