CPO: Condition Preference Optimization for Controllable Image Generation

Institute of Artificial Intelligence, University of Central Florida
NeurIPS 2025

Abstract

To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., t < 200) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images (I^w) over less controllable ones (I^l). However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, c^w and c^l, and train the model to prefer c^w. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over 10% error rate reduction in segmentation, 70-80% in human pose, and consistent 2%-5% reductions in edge and depth maps. Here, the error rate is defined as the difference between the evaluated controllability and the oracle result.

Overview

Recent methods such as ControlNet++ improve the controllability of ControlNet, but the ControlNet++ training objective cannot be applied to all diffusion timesteps. Diffusion-DPO is a natural alternative that works at all timesteps: it improves controllability by preferring I^w (better controllability) over I^l (weaker controllability). However, some types of conditions, such as edges, are hard to observe in raw images, which prevents the model from learning a clear preference signal. Moreover, it is also challenging to ensure that other factors, such as image quality, are better in I^w than in I^l (or at least comparable), which injects noise into the preference signal. Therefore, we propose CPO: Condition Preference Optimization to improve the controllability of image generation models. Instead of contrasting images, our method contrasts conditions directly. As a result, the model sees a clean contrastive signal from the conditions and learns to improve controllability.

[Figure: overview]

(a) DPO learns to prefer I^w over I^l. (b) CPO learns to prefer c^w over c^l.
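As a concrete sketch of this difference (assuming the standard Diffusion-DPO noise-prediction objective; the exact weighting and derivation follow the paper), the two losses contrast different quantities: DPO swaps the noised winning and losing images under a fixed condition, whereas CPO keeps the image, noise, and timestep fixed and swaps only the condition fed to the model:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\,\log\sigma\Big(-\beta\big[\|\epsilon^w-\epsilon_\theta(x^w_t,t,c)\|^2-\|\epsilon^w-\epsilon_{\mathrm{ref}}(x^w_t,t,c)\|^2-\|\epsilon^l-\epsilon_\theta(x^l_t,t,c)\|^2+\|\epsilon^l-\epsilon_{\mathrm{ref}}(x^l_t,t,c)\|^2\big]\Big)$$

$$\mathcal{L}_{\mathrm{CPO}} = -\,\mathbb{E}\,\log\sigma\Big(-\beta\big[\|\epsilon-\epsilon_\theta(x_t,t,c^w)\|^2-\|\epsilon-\epsilon_{\mathrm{ref}}(x_t,t,c^w)\|^2-\|\epsilon-\epsilon_\theta(x_t,t,c^l)\|^2+\|\epsilon-\epsilon_{\mathrm{ref}}(x_t,t,c^l)\|^2\big]\Big)$$

Since everything except the condition is shared between the winning and losing branches of the CPO contrast, confounding factors such as image quality cancel out, which is where the lower loss variance comes from.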

Key Difference with DPO

To construct a DPO dataset, we need to generate a group of candidates (20 in our setup) and select the most controllable image (I^w) and the least controllable one (I^l). We then use the ImageReward score to make sure that I^w also has a clearly higher reward. In contrast, CPO only needs to generate a single sample I; the condition detected from I serves as c^l, and the ground-truth condition (or the condition detected from the ground-truth image) serves as c^w. Importantly, in DPO, the conditions may not be easy to observe from the images, making it hard for the model to learn the preference.

[Figure: dataset generation]

(a) Dataset generation process of DPO. (b) Dataset generation process of CPO. (c) Even with ImageReward filtering, the DPO dataset still cannot ensure that I^w has better quality than I^l (pose example, artifact in red circle), resulting in noisy preferences. Moreover, conditions such as edges are hardly observable in raw images (Canny example), which can confuse the model. CPO resolves both issues by directly contrasting conditions.
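For illustration, the sketch below contrasts the two curation pipelines. All helper names (generate_images, detect_condition, controllability_error, image_reward) are hypothetical placeholders rather than our released code, and reusing the ground-truth image as the shared training target in the CPO pair is an assumption of this sketch; see the paper for the exact procedure.

```python
# Hypothetical sketch of the two dataset-curation pipelines described above.

def build_dpo_pair(prompt, gt_condition, group_size=20):
    """DPO: sample a group of images, keep the most/least controllable ones,
    and filter with ImageReward so the winner is not worse in quality."""
    images = generate_images(prompt, gt_condition, n=group_size)
    # Controllability error: distance between the condition detected from a
    # generated image and the ground-truth condition (lower is better).
    errors = [controllability_error(detect_condition(img), gt_condition)
              for img in images]
    i_w = images[min(range(group_size), key=lambda i: errors[i])]
    i_l = images[max(range(group_size), key=lambda i: errors[i])]
    # Discard noisy pairs where the "winner" is not clearly better in reward.
    if image_reward(prompt, i_w) <= image_reward(prompt, i_l):
        return None
    return {"image_w": i_w, "image_l": i_l, "condition": gt_condition}


def build_cpo_pair(prompt, gt_image, gt_condition):
    """CPO: a single generated sample suffices; the contrast lives in the conditions."""
    image = generate_images(prompt, gt_condition, n=1)[0]
    c_l = detect_condition(image)  # condition extracted from the generated sample
    c_w = gt_condition             # ground truth, or detected from the ground-truth image
    # Assumption: the ground-truth image is reused as the shared training target.
    return {"image": gt_image, "condition_w": c_w, "condition_l": c_l}
```

The practical difference in cost is visible here: DPO generates a group of samples per prompt and applies reward filtering, whereas CPO generates a single sample per prompt, which is why it requires less computation and storage for dataset curation.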

Quantitative Results

Our method achieves state-of-the-art controllability without sacrificing image quality or text-to-image alignment. Recent work (ControlAR) shows that DINO-v2 can improve both controllability and generation quality in controllable generation tasks; we observe the same trend for diffusion-based methods.

Qualitative Results