TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Center for Research in Computer Vision, University of Central Florida
ICCV 2025

Abstract

Video Frame Interpolation (VFI) aims to predict the intermediate frame In (we use n to denote time in videos to avoid notation overload with the timestep t in diffusion models) from two consecutive neighboring frames I0 and I1. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models cannot extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate these issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent state of the art among image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance with 3x fewer parameters, which translates into a 2.3x speedup. By incorporating optical flow guidance, our method requires 9000x less training data and has over 20x fewer parameters than video-based diffusion models.

Overview

Image diffusion-based methods for Video Frame Interpolation (VFI) achieve strong performance but fail to incorporate temporal information. Video diffusion-based models are able to extract temporal information but require significantly larger datasets. We incorporate temporal information extraction and optical flow guidance into our proposed temporal-aware Brownian Bridge Diffusion to improve the temporal consistency of generated frames and reduce computation costs.


Overview of our method. (a) Training the autoencoder. The autoencoder is trained with the video clip V = [I0, In, I1] and aims to reconstruct In. It contains an image encoder (shared across all frames) and an image decoder, where multi-level encoder features from I0 and I1 are passed to the decoder. Temporal blocks extract temporal information in the latent space and aggregate video features into a single image feature for the image decoder, while the 3D wavelet extracts temporal information in the pixel space. (b) Training the denoising UNet. The video clip V is encoded to x0 by the encoder E (spatial + temporal). Since In is unknown, we replace it with zeros and obtain another video clip Ṽ = [I0, 0, I1], which is encoded to xT. With the Brownian Bridge diffusion process, xt is computed and sent to the denoising UNet to predict xt − x0. (c) Inference. During inference, we encode Ṽ to xT and sample with the Brownian Bridge sampling process to get x̂0, which is decoded to the output frame În.
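For concreteness, below is a minimal PyTorch-style sketch of one training step of the denoising stage in (b). It follows the standard Brownian Bridge diffusion formulation (interpolating between x0 and xT with a bridge variance schedule); the exact schedule, conditioning inputs, and UNet signature used in TLB-VFI are assumptions here, not the released implementation.

```python
import torch
import torch.nn.functional as F

def brownian_bridge_training_step(unet, x0, xT, T=1000, s=1.0):
    """One Brownian Bridge diffusion training step (sketch).

    x0: latent of the full clip [I0, In, I1]; xT: latent of [I0, 0, I1].
    The UNet is trained to predict x_t - x_0, i.e. the bridge displacement
    toward xT plus the injected noise.
    """
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)       # random timestep per sample
    m_t = (t.float() / T).view(b, 1, 1, 1)                     # bridge interpolation weight
    var_t = 2.0 * s * m_t * (1.0 - m_t)                        # assumed bridge variance schedule
    noise = torch.randn_like(x0)
    x_t = (1.0 - m_t) * x0 + m_t * xT + var_t.sqrt() * noise   # Brownian Bridge forward sample
    target = x_t - x0                                          # prediction target stated in (b)
    pred = unet(x_t, t)                                        # hypothetical UNet call signature
    return F.mse_loss(pred, target)
```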

Architecture of The Autoencoder

Our autoencoder takes advantage of temporal information and optical flow to guide reconstruction.


(a) Model pipeline. The Image Encoder is shared across all frames, and temporal blocks extract temporal information in the latent space. (b) Multi-level feature sharing. The Image Encoder and Decoder consist of several resolution levels due to downsampling/upsampling of latent features. At the ith level, encoder features from I0 and I1 are warped and concatenated with their original copies (when the downsampling rate exceeds 8, warped features are excluded). The concatenated features serve as keys and values in cross-attention, where the decoder feature at the same level is the query. (c) Encoder/Decoder temporal block. Each temporal block consists of two sets of 3D convolution + attention. In the decoder, the second attention is cross-attention between the intermediate frame (query) and all frames (keys and values) to aggregate the video feature into a single feature map. (d) 3D-wavelet feature gating. Wavelet information is extracted from the input video clip and encoded by CNNs. A sigmoid activation is applied, and the result is element-wise multiplied with the output of the Image Encoder through a skip connection.
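The sketch below illustrates the gating pattern described in (d), assuming a temporal Haar wavelet over a 3-frame clip and a small 2D CNN encoder; the actual wavelet sub-bands, network depth, and channel sizes in TLB-VFI may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletGate3D(nn.Module):
    """Hypothetical sketch of the 3D-wavelet feature gating in panel (d).

    A Haar wavelet along the temporal axis extracts pixel-space motion detail
    from the clip; a small CNN encodes it, a sigmoid turns it into a gate, and
    the gate modulates the Image Encoder output through a skip connection.
    For a 3-frame RGB clip, in_ch = 2 * (3 - 1) * 3 = 12.
    """

    def __init__(self, in_ch: int, feat_ch: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, video: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W); enc_feat: (B, feat_ch, h, w) from the Image Encoder
        lo = 0.5 * (video[:, :-1] + video[:, 1:])   # temporal Haar low-pass
        hi = 0.5 * (video[:, :-1] - video[:, 1:])   # temporal Haar high-pass (motion detail)
        wav = torch.cat([lo, hi], dim=1)            # stack sub-bands along the time axis
        b, t, c, h, w = wav.shape
        wav = wav.reshape(b, t * c, h, w)           # fold time into channels for 2D convs
        wav = F.interpolate(wav, size=enc_feat.shape[-2:],
                            mode="bilinear", align_corners=False)  # match latent resolution
        gate = torch.sigmoid(self.encode(wav))      # sigmoid activation -> gate in (0, 1)
        return enc_feat + gate * enc_feat           # gated features with skip connection
```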

Results

Multi-frame Interpolation

We interpolate 7 frames between two input frames and visualize the generated videos from our method and PerVFI.

More multi-frame interpolation visualizations of our method, with 7 frames interpolated between two input frames.
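One common way to obtain 7 in-between frames from a midpoint interpolator is recursive bisection (2^3 - 1 frames over three levels). The sketch below assumes `model(I0, I1)` returns the midpoint frame; it is an illustration of that scheme, not necessarily how our released multi-frame pipeline is implemented.

```python
def interpolate_7(model, I0, I1):
    """Produce 7 intermediate frames by recursive midpoint interpolation (sketch)."""
    def bisect(a, b, depth):
        if depth == 0:
            return []
        mid = model(a, b)                                  # assumed midpoint-interpolation call
        return bisect(a, mid, depth - 1) + [mid] + bisect(mid, b, depth - 1)

    return bisect(I0, I1, depth=3)                         # 2**3 - 1 = 7 frames, in temporal order
```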