Associating Objects and Their Effects in Video
Through Coordination Games
1 Google Research | 2 Shanghai Jiao Tong University | 3 Weizmann Institute of Science | 4 VGG, University of Oxford
| Paper | Video | Code |
Overview of our self-supervised training pipeline for decomposing videos into object-centric layers. The input is a short RGB video clip I and object masks M_i; the target frame I_t is withheld from the network's input. The transformer decoder predicts a single output layer from features extracted from the RGB input and the corresponding object mask. The output layers L_i^t are composited over the background L_0 to form the predicted frame I′_t, which is compared to the target frame I_t. No direct supervision is provided for L_i^t.
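The compositing step in the figure is standard back-to-front "over" compositing. Below is a minimal sketch of that operation (our own illustration, not the released code; the function name and tensor shapes are assumptions):

```python
import torch

def composite_layers(layers_rgb, layers_alpha, background):
    """Back-to-front "over" compositing of the predicted object layers.

    layers_rgb:   (N, 3, H, W) RGB colors of the object layers L_i^t.
    layers_alpha: (N, 1, H, W) alpha mattes in [0, 1].
    background:   (3, H, W)    background layer L_0.
    Returns the composited frame I'_t of shape (3, H, W).
    """
    out = background
    for rgb, alpha in zip(layers_rgb, layers_alpha):
        out = alpha * rgb + (1.0 - alpha) * out  # standard "over" operator
    return out
```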
Abstract
We explore a feed-forward approach for decomposing a video into layers, where each layer contains an object of interest along with its associated shadows, reflections, and other visual effects. This problem is challenging since associated effects vary widely with the 3D geometry and lighting conditions in the scene, and ground-truth labels for visual effects are difficult (and in some cases impractical) to collect.
We take a self-supervised approach and train a neural network to produce a foreground image and alpha matte from a rough object segmentation mask under a reconstruction and sparsity loss. Under a reconstruction loss alone, the layer decomposition problem is underdetermined: many combinations of layers may reconstruct the input video.
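As a simplified illustration of such an objective, one could combine an L1 reconstruction term with a mean-alpha sparsity term as below; the specific terms and the weight w_sparsity are our assumptions, not necessarily the paper's exact losses:

```python
import torch.nn.functional as F

def decomposition_loss(pred_frame, target_frame, layers_alpha, w_sparsity=0.1):
    """Reconstruction term plus an alpha-sparsity regularizer (illustrative).

    pred_frame, target_frame: (3, H, W) composited and ground-truth frames.
    layers_alpha:             (N, 1, H, W) predicted alpha mattes in [0, 1].
    """
    recon = F.l1_loss(pred_frame, target_frame)
    # Penalize total matte coverage so each layer only claims the pixels it
    # needs, discouraging degenerate decompositions that still reconstruct
    # the input video.
    sparsity = layers_alpha.mean()
    return recon + w_sparsity * sparsity
```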
Inspired by the game theory concept of focal points—or Schelling points—we pose the problem as a coordination game, where each player (network) predicts the effects for a single object without knowledge of the other players' choices. The players learn to converge on the "natural" layer decomposition in order to maximize the likelihood of their choices aligning with the other players'. We train the network to play this game with itself, and show how to design the rules of this game so that the focal point lies at the correct layer decomposition. We demonstrate feed-forward results on a challenging synthetic dataset, then show that pretraining on this dataset significantly reduces optimization time for real videos.
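To make the self-play setup concrete, the sketch below shows one hypothetical training step, reusing the composite_layers and decomposition_loss helpers sketched above. The net(clip, mask) interface and all names here are illustrative assumptions, not the authors' implementation:

```python
import torch

def coordination_game_step(net, clip, masks, background, target_frame, optimizer):
    """One self-play step of the coordination game (illustrative sketch).

    Each "player" is the same network `net`, queried once per object with only
    that object's mask; it never sees the other players' predictions. The
    target frame is withheld from `clip`, and agreement is enforced only
    through the composite, which must reconstruct the target frame.
    """
    rgbs, alphas = [], []
    for mask in masks:                   # independent prediction per object
        rgb, alpha = net(clip, mask)     # assumed interface: (3, H, W), (1, H, W)
        rgbs.append(rgb)
        alphas.append(alpha)
    rgbs, alphas = torch.stack(rgbs), torch.stack(alphas)
    pred_frame = composite_layers(rgbs, alphas, background)
    loss = decomposition_loss(pred_frame, target_frame, alphas)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```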
Paper
Associating Objects and Their Effects in Video Through Coordination Games
Results
Synthetic data pretraining. We train on 2-object videos and show results on 2-object and 4-object videos:
Finetuning on real data. After pretraining our model on synthetic data, we finetune it on real videos, achieving results on par with Lu et al. [1] while requiring one-tenth of the training time:
In addition to being faster, our method is more robust to random initialization than that of Lu et al. [1]:
Supplementary Material
Code
[code]
Related Work
Omnimatte: Associating Objects and Their Effects in Video
Layered Neural Rendering for Retiming People in Video
References
[1] E. Lu, F. Cole, T. Dekel, A. Zisserman, W. T. Freeman, M. Rubinstein. "Omnimatte: Associating Objects and Their Effects in Video." CVPR 2021.