CPO: Condition Preference Optimization for Controllable Image Generation

Institute of Artificial Intelligence, University of Central Florida
NeurIPS 2025

Abstract

To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., t < 200) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images (I^w) over less controllable ones (I^l). However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, c^w and c^l, and train the model to prefer c^w. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over 10% error rate reduction in segmentation, 70-80% in human pose, and consistent 2%-5% reductions in edge and depth maps. Here, the error rate is defined as the difference between the evaluated controllability and the oracle result.

Overview

Recent methods such as ControlNet++ improve the controllability of ControlNet, but the ControlNet++ training objective cannot be applied to all diffusion timesteps. Diffusion-DPO is a natural alternative that works at all timesteps: it improves controllability by preferring I^w (better controllability) over I^l (weaker controllability). However, some types of conditions, such as edges, are hard to observe in raw images, which prevents the model from learning a clear preference signal. Moreover, it is also challenging to ensure that other factors, such as image quality, are better in I^w than in I^l (or at least comparable), which injects noise into the preference signal. Therefore, we propose CPO: Condition Preference Optimization to improve the controllability of image generation models. Instead of contrasting images, our method contrasts conditions directly. As a result, the model sees a clean contrastive signal from the conditions and learns to improve controllability.

[Figure: overview]

(a) DPO learns to prefer I^w over I^l. (b) CPO learns to prefer c^w over c^l.
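As a concrete sketch of this difference (assuming the standard Diffusion-DPO noise-prediction objective; the exact weighting and derivation follow the paper), the two losses contrast different quantities: DPO swaps the noised winning and losing images under a fixed condition, whereas CPO keeps the image, noise, and timestep fixed and swaps only the condition fed to the model:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\,\log\sigma\Big(-\beta\big[\|\epsilon^w-\epsilon_\theta(x^w_t,t,c)\|^2-\|\epsilon^w-\epsilon_{\mathrm{ref}}(x^w_t,t,c)\|^2-\|\epsilon^l-\epsilon_\theta(x^l_t,t,c)\|^2+\|\epsilon^l-\epsilon_{\mathrm{ref}}(x^l_t,t,c)\|^2\big]\Big)$$

$$\mathcal{L}_{\mathrm{CPO}} = -\,\mathbb{E}\,\log\sigma\Big(-\beta\big[\|\epsilon-\epsilon_\theta(x_t,t,c^w)\|^2-\|\epsilon-\epsilon_{\mathrm{ref}}(x_t,t,c^w)\|^2-\|\epsilon-\epsilon_\theta(x_t,t,c^l)\|^2+\|\epsilon-\epsilon_{\mathrm{ref}}(x_t,t,c^l)\|^2\big]\Big)$$

Since everything except the condition is shared between the winning and losing branches of the CPO contrast, confounding factors such as image quality cancel out, which is where the lower loss variance comes from.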

Key Difference with DPO

To construct a DPO dataset, we need to generate a group of candidates (20 in our setup) and select the most controllable image (I^w) and the least controllable one (I^l). We then use the ImageReward score to make sure that I^w also has a clearly higher reward. In contrast, CPO only needs to generate a single sample I; the condition detected from I serves as c^l, and the ground-truth condition (or the condition detected from the ground-truth image) serves as c^w. Importantly, in DPO, the conditions may not be easy to observe from the images, making it hard for the model to learn the preference.

[Figure: dataset generation]

(a) Dataset generation process of DPO. (b) Dataset generation process of CPO. (c) Even with ImageReward filtering, the DPO dataset still cannot ensure that I^w has better quality than I^l (pose example, artifact in red circle), resulting in noisy preferences. Moreover, conditions such as edges are hardly observable in raw images (Canny example), which can confuse the model. CPO resolves both issues by directly contrasting conditions.
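For illustration, the sketch below contrasts the two curation pipelines. All helper names (generate_images, detect_condition, controllability_error, image_reward) are hypothetical placeholders rather than our released code, and reusing the ground-truth image as the shared training target in the CPO pair is an assumption of this sketch; see the paper for the exact procedure.

```python
# Hypothetical sketch of the two dataset-curation pipelines described above.

def build_dpo_pair(prompt, gt_condition, group_size=20):
    """DPO: sample a group of images, keep the most/least controllable ones,
    and filter with ImageReward so the winner is not worse in quality."""
    images = generate_images(prompt, gt_condition, n=group_size)
    # Controllability error: distance between the condition detected from a
    # generated image and the ground-truth condition (lower is better).
    errors = [controllability_error(detect_condition(img), gt_condition)
              for img in images]
    i_w = images[min(range(group_size), key=lambda i: errors[i])]
    i_l = images[max(range(group_size), key=lambda i: errors[i])]
    # Discard noisy pairs where the "winner" is not clearly better in reward.
    if image_reward(prompt, i_w) <= image_reward(prompt, i_l):
        return None
    return {"image_w": i_w, "image_l": i_l, "condition": gt_condition}


def build_cpo_pair(prompt, gt_image, gt_condition):
    """CPO: a single generated sample suffices; the contrast lives in the conditions."""
    image = generate_images(prompt, gt_condition, n=1)[0]
    c_l = detect_condition(image)  # condition extracted from the generated sample
    c_w = gt_condition             # ground truth, or detected from the ground-truth image
    # Assumption: the ground-truth image is reused as the shared training target.
    return {"image": gt_image, "condition_w": c_w, "condition_l": c_l}
```

The practical difference in cost is visible here: DPO generates a group of samples per prompt and applies reward filtering, whereas CPO generates a single sample per prompt, which is why it requires less computation and storage for dataset curation.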

Quantitative Results

Our method achieves state-of-the-art controllability without sacrificing image quality or text-to-image alignment. Recent work (ControlAR) shows that DINO-v2 can improve both controllability and generation quality in controllable generation tasks; we observe the same trend for diffusion-based methods.

Qualitative Results