ActAnywhere
Subject-Aware Video Background Generation

Boxiao Pan1,2 Zhan Xu2 Chun-Hao Paul Huang2 Krishna Kumar Singh2
Yang Zhou2 Leonidas J. Guibas1 Jimei Yang2

1Stanford University 2Adobe Research

Subject segmentation sequence + Image of a background → Subject-aware video background

Abstract

Generating video backgrounds tailored to foreground subject motion is an important problem for the movie industry and the visual effects community. This task involves synthesizing a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image that describes the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations show that our model significantly outperforms the baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects.

Method

Our 3D U-Net takes a sequence of foreground subject segmentations along with the corresponding masks as input, and is conditioned on a frame describing the background. During training, we condition the denoising process on a frame randomly sampled from the training video. At test time, the condition can be either a composited frame of the subject with a novel background, or a background-only image.
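The training-time input assembly described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_training_inputs`, the array layouts, and the choice to zero out background pixels are all assumptions made for the example, and the encoding of these inputs and the U-Net itself are omitted.

```python
import numpy as np

def build_training_inputs(video, masks, rng):
    """Hypothetical helper: assemble one training example.

    video: (T, H, W, 3) float array of frames (assumed layout).
    masks: (T, H, W) array, 1 on the subject and 0 elsewhere (assumed layout).
    """
    if video.shape[:3] != masks.shape:
        raise ValueError("video and masks must share (T, H, W)")
    # Foreground segmentation: keep subject pixels, zero out the background
    # (an assumption about how the segmentation is represented).
    segmentation = video * masks[..., None]
    # Condition on a frame sampled uniformly at random from the same video.
    cond_index = int(rng.integers(video.shape[0]))
    condition = video[cond_index]
    return segmentation, masks, condition, cond_index

# Toy example with random data standing in for a real clip.
rng = np.random.default_rng(0)
video = rng.random((8, 4, 4, 3))
masks = (rng.random((8, 4, 4)) > 0.5).astype(video.dtype)
seg, m, cond, idx = build_training_inputs(video, masks, rng)
```

At test time, under the same assumptions, `condition` would instead be supplied by the user, either as a composited frame of the subject on a novel background or as a background-only image.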

Results

We used Adobe Firefly to generate the composited frames shown here. Where a text prompt is listed with an example, it was either produced by ChatGPT 4 or written manually.

Video background generation with composited frame conditioning
Original video
(not used as model input)
Segmentation
Mallard wandering around a firepit.
Condition
Output
Original video
(not used as model input)
Segmentation
A man folding bed sheets.
Condition
Output
Original video
(not used as model input)
Segmentation
Purple tie-dye jogger runs in serene park, mist over lake.
Condition
Output
Original video
(not used as model input)
Segmentation
A woman is water-skiing.
Condition
Output
Original video
(not used as model input)
Segmentation
A woman riding a horse.
Condition
Output
Original video
(not used as model input)
Segmentation
A dog plays beside an old man.
Condition
Output
Video background generation with background-only frame conditioning
Original video
(not used as model input)
Segmentation
Condition
Output
Original video
(not used as model input)
Segmentation
Condition
Output
Original video
(not used as model input)
Segmentation
Condition
Output
Diverse generated camera motion
Segmentation
Lost in thought, figure strolls through foggy cityscape in winter attire.
Condition
Seed 1
Seed 2
Seed 3
Seed 4
Segmentation
A woman riding a motorcycle in a city.
Condition
Seed 1
Seed 2
Seed 3
Seed 4
Segmentation
Infant in blue onesie explores a toy-filled nursery.
Condition
Seed 1
Seed 2
Seed 3
Seed 4
Segmentation
Child in blue jacket joyfully picks a pumpkin in autumn patch.
Condition
Seed 1
Seed 2
Seed 3
Segmentation
Traveler, backpack in tow, seeks secrets in desolate landscape's vastness.
Condition
Seed 1
Seed 2
Seed 3
Segmentation
Immersed gamer moves intensely in high-tech room, exploring virtual reality.
Condition
Seed 1
Seed 2
Seed 3
Different backgrounds with the same foreground
Woman in red faces vast grey, reflecting an inner journey
Original video
Segmentation
Condition 1
Output 1
Condition 2
Output 2
Condition 3
Output 3
Condition 4
Output 4
Condition 5
Output 5
Condition 6
Output 6
Condition 7
Output 7
Woman poised backstage, ready for defining theater spotlight moment.
Original video
Segmentation
Condition 1
Output 1
Condition 2
Output 2
Condition 3
Output 3
Condition 4
Output 4
Determined athlete runs through cool, overcast weather, undeterred in the morning.
Original video
Segmentation
Condition 1
Output 1
Condition 2
Output 2
Condition 3
Output 3
A determined athlete trains in diverse landscapes for marathon endurance.
Original video
Segmentation
Condition 1
Output 1
Condition 2
Output 2
Woman confidently at outdoor, engaging at sunset.
Original video
Segmentation
Condition 1
Output 1
Condition 2
Output 2
Condition 3
Output 3
Condition 4
Output 4
Diverse generated contents
Segmentation
Traveler, backpack in tow, seeks secrets in desolate landscape's vastness.
Condition
Seed 1
Seed 2
Seed 3
Seed 4
Segmentation
A child creating shimmering soap bubbles at a grassland.
Condition
Seed 1
Seed 2
Seed 3
Seed 4
Segmentation
Child in beach attire joyfully runs shore, bucket in hand, playing.
Condition
Seed 1
Seed 2
Condition frame of a different subject
Original video
(not used as model input)
Segmentation
A man is holding a balloon, and floating up by the balloon.
Condition
Output
Original video
(not used as model input)
Segmentation
Cyclist pauses, admires scenic overlook with open road and tranquil landscape.
Condition
Output
Comparison with baselines

Here we show the video version of Fig. 4 in the paper.

A car drifting on a snowy mountain road
Original video
Segmentation
Condition
Ours
Gen1 [9]
Text2LIVE [3]
TokenFlow [12]
Control-A-Video [7]
AnimateDiff [13]
VideoCrafter1 [6]
A woman performing motorcycle stunts
Original video
Segmentation
Condition
Ours
Gen1 [9]
Text2LIVE [3]
TokenFlow [12]
Control-A-Video [7]
AnimateDiff [13]
VideoCrafter1 [6]