We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only the temporal layers. We advance this approach by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We successfully integrate video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen Stable Diffusion model, simplifies the training process and allows for seamless integration with off-the-shelf models such as ControlNet and DreamBooth.
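For concreteness, below is a minimal PyTorch sketch of this frozen-backbone setup: the pretrained T2I UNet is frozen and only the injected temporal layers receive gradients. This is not the official implementation; the module and function names (TemporalLayer, trainable_params) are illustrative.

```python
import torch
import torch.nn as nn


class TemporalLayer(nn.Module):
    """Illustrative temporal self-attention block applied along the frame axis."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * height * width, frames, channels) -- attend over frames.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


def trainable_params(t2i_unet: nn.Module, temporal_layers: nn.ModuleList):
    # Freeze every parameter of the pretrained T2I backbone.
    for p in t2i_unet.parameters():
        p.requires_grad_(False)
    # Only the injected temporal layers are optimized.
    return list(temporal_layers.parameters())


# Toy stand-in for the frozen Stable Diffusion UNet.
t2i_unet = nn.Sequential(nn.Conv2d(4, 320, 3, padding=1))
temporal_layers = nn.ModuleList([TemporalLayer(320)])
optimizer = torch.optim.AdamW(trainable_params(t2i_unet, temporal_layers), lr=1e-4)
```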
We use only the public WebVid-10M dataset.
Overall architecture of HARIVO. We train the injected temporal layers while keeping the Text-to-Image model frozen, and propose methods that leverage video-specific inductive biases. Specifically, we introduce a mapping network that modifies the input distribution and a frame-wise token generator that produces a different text token sequence for each frame. We also propose a temporally regularized self-attention loss, which encourages the self-attention maps to be similar across frames, and a decoupled contrastive loss on the deepest feature map, the h-space.
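The sketch below illustrates the intent of two of these components under assumed formulations that may differ from the paper: a temporal self-attention regularizer that penalizes frame-to-frame changes in the attention maps, and a frame-wise token generator that offsets a shared text token sequence with per-frame embeddings. The mapping network and the decoupled contrastive loss in the h-space are omitted for brevity; all function names are illustrative.

```python
import torch


def temporal_self_attention_reg(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (frames, heads, queries, keys) self-attention probabilities
    for one video. Penalizes frame-to-frame deviation of the attention maps."""
    diff = attn_maps[1:] - attn_maps[:-1]  # neighboring-frame differences
    return diff.pow(2).mean()


def framewise_tokens(base_tokens: torch.Tensor, frame_embed: torch.Tensor) -> torch.Tensor:
    """Illustrative frame-wise token generator: a shared text token sequence
    (seq_len, dim) is offset by a learned per-frame embedding (frames, dim),
    giving a distinct token sequence per frame, shape (frames, seq_len, dim)."""
    return base_tokens.unsqueeze(0) + frame_embed.unsqueeze(1)


# Toy shapes: 16 frames, 8 heads, 64x64 attention maps, 77 CLIP text tokens.
attn = torch.randn(16, 8, 64, 64).softmax(dim=-1)
loss_tsa = temporal_self_attention_reg(attn)
tokens = framewise_tokens(torch.randn(77, 768), torch.randn(16, 768))
```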
HARIVO completely freezes the image model, allowing the free use of off-the-shelf models such as ControlNet.
The condition can be any specific frame (e.g., the first and last frames of the video).
As with AnimateDiff, we can use any DreamBooth model or LoRA.
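As a hedged illustration of why the frozen backbone matters in practice, the snippet below assembles the frozen image-side components with the public diffusers library; the HARIVO temporal layers are not part of diffusers, so only the off-the-shelf pieces (ControlNet, DreamBooth checkpoints, LoRA) are shown, and the LoRA path is a placeholder.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A standard ControlNet plugs into the (frozen) Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
# A DreamBooth checkpoint can simply replace the base model ID here, since it
# only changes weights that HARIVO leaves frozen.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
# LoRA weights load on top of the frozen backbone (placeholder path).
pipe.load_lora_weights("path/to/lora")
```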
HARIVO completely freezes the image model, enabling the generation of videos in various styles such as illustration, pencil drawing, and sci-fi. In addition, the video-specific inductive biases further enhance motion in the generated videos.