TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Center for Research in Computer Vision, University of Central Florida
ICCV 2025

Abstract

Video Frame Interpolation (VFI) aims to predict the intermediate frame In (we use n to denote time in videos to avoid notation overload with the timestep t in diffusion models) from two consecutive neighboring frames I0 and I1. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models cannot extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate these issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent state of the art among image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance with 3x fewer parameters, which translates into a 2.3x speedup. By incorporating optical flow guidance, our method requires 9000x less training data and has over 20x fewer parameters than video-based diffusion models.

Overview

Image diffusion-based methods for Video Frame Interpolation (VFI) achieve strong performance but fail to incorporate temporal information. Video diffusion-based models are able to extract temporal information but require significantly larger datasets. We incorporate temporal information extraction and optical flow guidance into our proposed temporal-aware Brownian Bridge Diffusion to improve the temporal consistency of generated frames and reduce computation costs.


Overview of our method. (a) Training the autoencoder. The autoencoder is trained with the video clip V = [I0, In, I1] and aims to reconstruct In. It contains an image encoder (shared across all frames) and an image decoder, where multi-level encoder features from I0 and I1 are passed to the decoder. Temporal blocks extract temporal information in the latent space and aggregate video features into a single image feature for the image decoder, while the 3D wavelet extracts temporal information in the pixel space. (b) Training the denoising UNet. The video clip V is encoded to x0 by the encoder E (spatial + temporal). Since In is unknown, we replace it with zeros and obtain another video clip Ṽ = [I0, 0, I1], which is encoded to xT. With the Brownian Bridge diffusion process, xt is computed and sent to the denoising UNet to predict xt − x0. (c) Inference. During inference, we encode Ṽ to xT and sample with the Brownian Bridge sampling process to get x̂0, which is decoded to the output frame În.
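For concreteness, below is a minimal PyTorch-style sketch of one training step of the denoising stage in (b). It follows the standard Brownian Bridge diffusion formulation (interpolating between x0 and xT with a bridge variance schedule); the exact schedule, conditioning inputs, and UNet signature used in TLB-VFI are assumptions here, not the released implementation.

```python
import torch
import torch.nn.functional as F

def brownian_bridge_training_step(unet, x0, xT, T=1000, s=1.0):
    """One Brownian Bridge diffusion training step (sketch).

    x0: latent of the full clip [I0, In, I1]; xT: latent of [I0, 0, I1].
    The UNet is trained to predict x_t - x_0, i.e. the bridge displacement
    toward xT plus the injected noise.
    """
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)       # random timestep per sample
    m_t = (t.float() / T).view(b, 1, 1, 1)                     # bridge interpolation weight
    var_t = 2.0 * s * m_t * (1.0 - m_t)                        # assumed bridge variance schedule
    noise = torch.randn_like(x0)
    x_t = (1.0 - m_t) * x0 + m_t * xT + var_t.sqrt() * noise   # Brownian Bridge forward sample
    target = x_t - x0                                          # prediction target stated in (b)
    pred = unet(x_t, t)                                        # hypothetical UNet call signature
    return F.mse_loss(pred, target)
```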

Architecture of The Autoencoder

Our autoencoder takes advantage of temporal information and optical flow to guide reconstruction.


(a) Model pipeline. The Image Encoder is shared across all frames, and temporal blocks extract temporal information in the latent space. (b) Multi-level feature sharing. The Image Encoder and Decoder consist of several resolution levels due to downsampling/upsampling of latent features. At the ith level, encoder features from I0 and I1 are warped and concatenated with their original copies (when the downsampling rate exceeds 8, warped features are excluded). The concatenated features serve as keys and values in cross-attention, where the decoder feature at the same level is the query. (c) Encoder/Decoder temporal block. Each temporal block consists of two sets of 3D convolution + attention. In the decoder, the second attention is cross-attention between the intermediate frame (query) and all frames (keys and values) to aggregate the video feature into a single feature map. (d) 3D-wavelet feature gating. Wavelet information is extracted from the input video clip and encoded by CNNs. A sigmoid activation is applied, and the result is element-wise multiplied with the output of the Image Encoder through a skip connection.
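The sketch below illustrates the gating pattern described in (d), assuming a temporal Haar wavelet over a 3-frame clip and a small 2D CNN encoder; the actual wavelet sub-bands, network depth, and channel sizes in TLB-VFI may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletGate3D(nn.Module):
    """Hypothetical sketch of the 3D-wavelet feature gating in panel (d).

    A Haar wavelet along the temporal axis extracts pixel-space motion detail
    from the clip; a small CNN encodes it, a sigmoid turns it into a gate, and
    the gate modulates the Image Encoder output through a skip connection.
    For a 3-frame RGB clip, in_ch = 2 * (3 - 1) * 3 = 12.
    """

    def __init__(self, in_ch: int, feat_ch: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, video: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W); enc_feat: (B, feat_ch, h, w) from the Image Encoder
        lo = 0.5 * (video[:, :-1] + video[:, 1:])   # temporal Haar low-pass
        hi = 0.5 * (video[:, :-1] - video[:, 1:])   # temporal Haar high-pass (motion detail)
        wav = torch.cat([lo, hi], dim=1)            # stack sub-bands along the time axis
        b, t, c, h, w = wav.shape
        wav = wav.reshape(b, t * c, h, w)           # fold time into channels for 2D convs
        wav = F.interpolate(wav, size=enc_feat.shape[-2:],
                            mode="bilinear", align_corners=False)  # match latent resolution
        gate = torch.sigmoid(self.encode(wav))      # sigmoid activation -> gate in (0, 1)
        return enc_feat + gate * enc_feat           # gated features with skip connection
```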

Results

Multi-frame Interpolation

We interpolate 7 frames between two input frames and visualize the generated videos from our method and PerVFI.

More multi-frame interpolation visualizations of our method, with 7 frames interpolated between two input frames.
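One common way to obtain 7 in-between frames from a midpoint interpolator is recursive bisection (2^3 - 1 frames over three levels). The sketch below assumes `model(I0, I1)` returns the midpoint frame; it is an illustration of that scheme, not necessarily how our released multi-frame pipeline is implemented.

```python
def interpolate_7(model, I0, I1):
    """Produce 7 intermediate frames by recursive midpoint interpolation (sketch)."""
    def bisect(a, b, depth):
        if depth == 0:
            return []
        mid = model(a, b)                                  # assumed midpoint-interpolation call
        return bisect(a, mid, depth - 1) + [mid] + bisect(mid, b, depth - 1)

    return bisect(I0, I1, depth=3)                         # 2**3 - 1 = 7 frames, in temporal order
```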