Frame Interpolation with Consecutive Brownian Bridge Diffusion

1University of Utah, 2Center for Research in Computer Vision, University of Central Florida, 3University of Birmingham

Abstract

Recent work in Video Frame Interpolation (VFI) formulates VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given random noise and the neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed as the conditional generation model: an autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latents. This formulation poses a crucial challenge: VFI expects the output to be deterministically equal to the ground truth intermediate frame, but LDMs generate a diverse set of different images when the model is run multiple times. The reason for this diversity is that the cumulative variance (the variance accumulated at each generation step) of the latent representations generated by LDMs is large, which makes the sampling trajectory random and yields diverse rather than deterministic generations. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose a consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of the generated latent representations. Our experiments suggest that our method improves together with improvements to the autoencoder and achieves state-of-the-art performance in VFI, leaving strong potential for further enhancement.

Overview

Recent diffusion-based methods in Video Frame Interpolation (VFI) use conditional generation, but the variance at each sampling step accumulates. VFI requires low-variance generation because the ground truth is deterministic. Our consecutive Brownian Bridge transitions among three points: the previous frame I0, the current frame In, and the next frame I1, and achieves a much lower accumulated variance than conditional generation (about 2 for ours versus more than 11 for conditional generation). For efficiency, images are encoded into a latent space, and the encoder features of the neighboring frames are passed to the decoder. We also take advantage of optical flow estimation to warp the encoder features of the neighboring frames and fuse them with the decoder features of the intermediate frame.
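The key property behind the low accumulated variance is that a Brownian bridge is pinned to deterministic values at both endpoints, so its variance shrinks to zero as sampling approaches either end. The following is a minimal numerical sketch of this property (not our full diffusion model; the scalar values, `max_var` scale, and sample counts are illustrative assumptions):

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, rng, max_var=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The mean interpolates linearly between the endpoints, and the
    variance max_var * t * (1 - t) vanishes at t=0 and t=1, so the
    process is deterministic exactly where it is pinned.
    """
    mean = (1.0 - t) * x0 + t * x1
    std = np.sqrt(max_var * t * (1.0 - t))
    return mean + std * rng.standard_normal(np.shape(x0))

rng = np.random.default_rng(0)
x0, x1 = 0.0, 2.0

# Variance peaks at the midpoint (t(1-t) = 0.25 for t = 0.5) ...
mid = np.array([brownian_bridge_sample(x0, x1, 0.5, rng) for _ in range(20000)])
# ... and is exactly zero at the pinned endpoint t = 1.
end = np.array([brownian_bridge_sample(x0, x1, 1.0, rng) for _ in range(100)])

print(mid.var())   # close to 0.25
print(end.var())   # exactly 0.0
```

In contrast, a standard conditional diffusion starts from pure noise, so its variance never collapses at the target end, which is why its accumulated variance is much larger.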


Overview of our method. (a) Autoencoder. The encoder features of the neighboring frames are passed to the decoder to provide detailed information. (b) Ground truth estimation with diffusion. Images are encoded into latent representations to efficiently implement diffusion models. (c) Inference. During inference, the sampled latent representations are decoded with the assistance of features from the neighboring frames.

Architecture of the Autoencoder

Our autoencoder takes advantage of optical flow estimation to warp the encoder features of the neighboring frames and fuses them with the decoder features via cross-attention.
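The warp-then-fuse step above can be sketched as follows. This is a simplified stand-in for the actual architecture: the nearest-neighbor warp (real models use differentiable bilinear sampling), the single-head attention, the feature shapes, and the additive fusion are all illustrative assumptions:

```python
import numpy as np

def warp(feat, flow):
    """Backward-warp a feature map (C, H, W) by a flow field (2, H, W)
    using nearest-neighbor lookup with border clamping."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    return feat[:, src_y, src_x]

def cross_attention(query, context):
    """Fuse decoder features (query, N x C) with warped encoder
    features (context, M x C) via scaled dot-product attention."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
enc_feat = rng.standard_normal((4, 8, 8))   # encoder features of a neighboring frame
flow = np.zeros((2, 8, 8))                  # estimated flow toward that frame
warped = warp(enc_feat, flow)

dec_feat = rng.standard_normal((64, 4))     # decoder features, flattened to N x C
fused = dec_feat + cross_attention(dec_feat, warped.reshape(4, -1).T)
```

The decoder features act as queries and the warped encoder features as keys/values, so each decoder location pulls in detail from the aligned neighboring-frame features.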


Architecture of the autoencoder. The encoder is in the green dashed boxes, and the decoder comprises all remaining parts. The output of the consecutive Brownian Bridge diffusion is fed to the VQ layer. The features of I0 and I1 at different down-sampling rates are sent to the cross-attention modules in the Up Sample Blocks of the decoder.

Results

Multi-frame Interpolation

We interpolate 7 frames between two frames and visualize the generated video for our method and LDMVFI. Multi-frame interpolation is achieved via a bisection-like method.
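The bisection-like scheme can be sketched as follows: interpolate the midpoint of a frame pair, then recurse on each half until the desired density is reached; `depth=3` yields the 7-frame setting. The averaging lambda is a toy stand-in for the actual interpolation model:

```python
def interpolate_bisection(frame_a, frame_b, interpolate_mid, depth):
    """Recursively fill 2**depth - 1 intermediate frames between
    frame_a and frame_b, in temporal order, by repeated bisection."""
    if depth == 0:
        return []
    mid = interpolate_mid(frame_a, frame_b)
    left = interpolate_bisection(frame_a, mid, interpolate_mid, depth - 1)
    right = interpolate_bisection(mid, frame_b, interpolate_mid, depth - 1)
    return left + [mid] + right

# Toy "model" that averages its inputs; frames at t = 1/8, 2/8, ..., 7/8.
frames = interpolate_bisection(0.0, 1.0, lambda a, b: (a + b) / 2, 3)
print(frames)  # [0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
```

Each recursion level only ever interpolates the midpoint of two existing frames, so a single-midpoint model suffices to generate arbitrarily dense in-between frames.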

More multi-frame interpolation visualizations of our method. We interpolate 7 frames between two frames.