Picasso: Exploring a Parameter Analysis Approach from the Temporal-Spatial Perspective for Text-To-Image Diffusion Models

Haoyu Cai1     Chao Wang1,2,3    Xuehai Zhou1,3    Wenqi Lou3

1 School of Computer Science, USTC   2 Suzhou Institute for Advanced Research, USTC   3 School of Software Engineering, USTC

Abstract

Visual Concept Implantation (VCI) is essential to text-to-image generation. While VCI methods in diffusion models (VCI in DM) have matured, Visual Concept Disentanglement (VCD) remains unexplored. VCD analyzes Prompt Spaces trained in VCI to produce disentangled SubPrompts or SubCones. However, challenges arise from the complexity of Prompt Space design and the difficulty of extracting feature information. We propose Picasso, a unified framework for VCD in DM.

Our contributions include:

  • for performance evaluation: transforming VCD in DM into a regular clustering task through Visualization based on SubCones (VbSC);
  • for unified framework design: processing diverse Prompt Spaces with Picasso, using spatially clusterable features from the Scones Set;
  • for prompt space exploration: introducing a temporal-spatial SOTA Prompt design subset based on temporal features.

Our method provides a feasible mechanism for VCD in DM. Through functional and interpretability validation, we comprehensively evaluate the effectiveness of the proposed method in visual concept implantation tasks and verify the correctness of the parameter space design principles.


Fig.1 Temporal-Spatial Parameter Analysis Approach. (a) Visual Concept Disentanglement Mechanism in Diffusion Models based on Prompt Space Analysis; (b) The Input Extension Formats in Prompt Space Analysis; (c) Evaluation Method bridged by VbSC for Visual Concept Disentanglement.

Motivation

In Fig.1(a), we introduce the VCD Mechanism in DM based on Prompt Space analysis. The significance of this research lies in providing a feasible mechanism for VCD in DM tasks. The VCD in DM problem is nontrivial because of the following complexities:
(1) Prompt Space Design: The design of the Prompt Space significantly affects the complexity of parameter analysis. In contrast to general neural network models, one of the main challenges in analyzing the Prompt Space of diffusion models is the temporal inference computation flow. This flow time-multiplexes the inner parameters of the diffusion model, so the same parameter may have vastly different effects at different time steps, which makes the analysis results less interpretable. Introducing temporal elements into the Prompt Space design of diffusion models is therefore necessary. However, temporal elements complicate the analysis process, so high-quality parameter analysis methods must be prioritized.
(2) Parameter Analysis Methods: Another challenge in analyzing the Prompt Space of diffusion models is feature information extraction. The network structures of conditional generative models such as GANs are typically carefully designed, and feature information related to model parameters is often easy to extract or even possesses intrinsic structural features. To reduce convergence costs, designers of diffusion models often use network structures with low interpretability. For example, Stable Diffusion uses UNet as its backbone. Although the CrossAttention layers in diffusion model networks contain critical feature information, existing research indicates that such features are often difficult to analyze directly. Enhancing the interpretability of feature information through preprocessing is therefore crucial. The search cost of an excessively large Prompt Space must also be considered.
To address these challenges and limitations, we propose Picasso, the first unified and efficient framework that supports the VCD Mechanism in DM based on Prompt Space analysis, and we quantify the framework’s effectiveness. This lays the foundation for exploring a general solution for VCD. We analyze the performance of various implementation strategies for VCD in DM and its derivative tasks, and we summarize empirical rules of Prompt design that improve the performance of VCI in diffusion models and enhance the feasibility of VCI methods.

In Fig.1(c), by leveraging the SubCones (SC), we generate visual results and then embed them into a low-dimensional space using an encoder. The resulting embeddings form distinct groups, which we refer to as Visual Clusters. In the experimental section, we treat these Visual Clusters as the outcome of a regular clustering task and evaluate them accordingly. This approach allows us to assess how effectively the proposed method captures and represents different visual attributes within the neural network’s parameter space.

Clustering Results - Spatial Perspective


Fig.3 Visual results of FCM clustering.

In practice, we found that too many spatial features were collected during the diffusion model sampling process, making it difficult for the clustering algorithm to converge. According to Theorem 1, we introduce a high-weight feature selection strategy based on a simple high-pass filter with τ as the threshold. Fig.3 demonstrates the visual results of FCM clustering for various values of τ, illustrating that the TSCones and their corresponding spatial activation information encapsulate critical insights of the diffusion model about specific concepts. To mitigate the impact of channel dimensionality, we first reduce all spatial activations Fs to the same dimensionality using PCA before performing clustering.
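
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each spatial feature carries a scalar weight and that the filter keeps features whose weight is at least τ. The helper name `high_pass_select` and the toy data are hypothetical.

```python
def high_pass_select(features, weights, tau):
    """Keep only the features whose weight is at least tau (hypothetical helper)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return [f for f, w in zip(features, weights) if w >= tau]


# Toy data (illustrative only): four feature vectors with made-up weights.
features = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5], [0.7, 0.6]]
weights = [0.05, 0.90, 0.30, 0.75]

# A larger tau keeps fewer features, which shrinks the clustering input.
selected = high_pass_select(features, weights, tau=0.5)
print(selected)  # the two features whose weights are 0.90 and 0.75
```

Raising τ reduces the number of features passed to PCA and clustering, which is the trade-off examined in Fig.4.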


Fig.4 The relationship between τ, running time, and FPC.

Fig.4 shows that as τ increases, the clustering results become more coherent and reasonable, while the time cost of the clustering algorithm is reduced. Therefore, we believe that concept neurons play a pivotal role in the diffusion model’s understanding of specific concepts.
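
FPC is not expanded in this section. Assuming it denotes the fuzzy partition coefficient commonly reported for fuzzy c-means, it can be computed from the membership matrix as the mean of the squared memberships. The sketch below uses a hypothetical function name and toy membership matrices.

```python
def fuzzy_partition_coefficient(memberships):
    """Mean of squared memberships over all samples.

    memberships[i][k] is the degree to which sample i belongs to cluster k;
    each row is assumed to sum to 1. The value ranges from 1/c (maximally
    fuzzy, c clusters) to 1 (crisp assignment).
    """
    n = len(memberships)
    if n == 0:
        raise ValueError("memberships must not be empty")
    return sum(u * u for row in memberships for u in row) / n


# Toy membership matrices (illustrative only).
crisp = [[1.0, 0.0], [0.0, 1.0]]   # every sample belongs to exactly one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # every sample is split evenly

fpc_crisp = fuzzy_partition_coefficient(crisp)
fpc_fuzzy = fuzzy_partition_coefficient(fuzzy)
print(fpc_crisp, fpc_fuzzy)  # 1.0 0.5
```

Under this assumption, a higher FPC indicates a cleaner partition.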

Visualization Results - Temporal Perspective


Fig.5 Visual results. (a) Visual result between different time steps. We observe red diagonal lines with a certain width. These diagonal lines suggest that adjacent or closely related diffusion (denoising) steps are responsible for similar functions, and thus their corresponding neural regions are quite similar. (b) Visual result between different sub-concepts. Visualizations of two different scenarios showcase Picasso’s neuromorphic mechanism.

Fig.5(a) confirms our earlier emphasis on temporal locality. However, when two time steps are far apart, their similarity becomes quite low, indicating that the diffusion model does not exhibit temporal globality across the denoising sequence. Furthermore, all the diagonal lines exhibit a “broad at the front and narrow at the end” characteristic. This suggests that in the early stages of denoising, neurons corresponding to different denoising steps change their functions very frequently, while in the later stages this frequency gradually decreases. This is consistent with previous work: early denoising steps are responsible for handling image layout and content, intermediate steps for handling image textures and colors, and final steps for handling image details. It implies that early denoising steps play a more significant role.
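
A step-by-step similarity map of this kind can be sketched as follows. The section does not state which similarity measure is used, so cosine similarity is an assumption here, and the function names and toy activation vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def step_similarity_matrix(step_features):
    """Pairwise similarity between per-time-step feature vectors."""
    return [[cosine_similarity(a, b) for b in step_features]
            for a in step_features]


# Toy activations (illustrative only): neighbouring steps share active
# neurons, distant steps do not.
steps = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
sim = step_similarity_matrix(steps)
for row in sim:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this toy example the large values cluster around the diagonal and fall to zero for distant steps, which is the kind of banded pattern that temporal locality would produce.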

Key Parameter Ablation

We studied many different parameters and report the two most important ones here: Temporal Extension Dimension and Cluster Num. The impact of the Temporal Extension Dimension on VCD in DM provides a reference for designing a parameter space suitable for the VCI task. The value of Cluster Num reveals how many independent visual attributes the diffusion model can usually disentangle when recognizing visual concepts, which is important for understanding how the diffusion model interprets visual concepts.


Fig.6 Performance evaluation results of VCD in DM under the Picasso framework. (a) Calinski-Harabasz Index of TSCones with different Temporal Extension Dimensions performing the VCD in DM task; (b) Davies-Bouldin Index of TSCones with different Temporal Extension Dimensions performing the VCD in DM task; (c) Silhouette Coefficient of TSCones with different Temporal Extension Dimensions performing the VCD in DM task. Curves marked in gray indicate that the metric is unreasonable in the current scenario. (d) Calinski-Harabasz Index of TSCones with different cluster numbers performing the VCD in DM task.

In Fig.6(a) and (b), the Calinski-Harabasz (ch) and Davies-Bouldin (db) metrics consistently improve as the Temporal Extension Dimension is extended from 1 to 3, with the best performance achieved at a dimensionality of 3. However, for the Silhouette Coefficient (sc) in Fig.6(c), the optimal values differ across the Avg, Max, and Min scenarios, implying that the data patterns exhibited by the errors differ from those of the averages. From this perspective, sc does not appear to be a suitable metric for the VCD in DM task. We speculate that this is because sc is not applicable to evaluating clustering quality on non-convex clusters, whereas the clusters generated by the VCD task are complex manifolds.
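
For reference, the two metrics that behave consistently here can be computed from their standard definitions. The sketch below is a plain-Python illustration on toy 2-D points, not the evaluation code used in the paper; the helper names are hypothetical. A higher Calinski-Harabasz value and a lower Davies-Bouldin value both indicate better-separated clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion (higher is better)."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    if not 2 <= k < n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    overall = centroid(points)
    between = within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * math.dist(c, overall) ** 2
        within += sum(math.dist(p, c) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Mean worst-case ratio of cluster scatter to centroid distance (lower is better)."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("need at least 2 clusters")
    cents = {label: centroid(m) for label, m in clusters.items()}
    scatter = {label: sum(math.dist(p, cents[label]) for p in m) / len(m)
               for label, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


# Toy embeddings (illustrative only): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

print(calinski_harabasz(points, good_labels), davies_bouldin(points, good_labels))
print(calinski_harabasz(points, bad_labels), davies_bouldin(points, bad_labels))
```
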
By applying the elbow method to the relationship between ch and Cluster Num in Fig.6(d), we find that the elbow points occur when Cluster Num is 3 or 4. This implies that the neural network’s intrinsic recognition of images comprises 3-4 visual concepts. Neurobiology also suggests that the human brain decomposes a single visual concept into several sub-visual concepts for cognitive processing, which aligns with our findings.
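
The elbow is read off visually in this work. One common way to locate it automatically, shown here only as an assumption-laden sketch, is to pick the point with the largest absolute second difference of the score curve. The function name is hypothetical, the scores are made-up values rather than results from the paper, and the cluster numbers are assumed to be evenly spaced.

```python
def elbow_index(scores):
    """Index of the sharpest bend, measured by the absolute second difference."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to locate an elbow")
    bends = [abs(scores[i - 1] - 2 * scores[i] + scores[i + 1])
             for i in range(1, len(scores) - 1)]
    # bends[0] belongs to scores[1], hence the offset of 1.
    return 1 + bends.index(max(bends))


# Made-up scores for illustration only; NOT measurements from the paper.
cluster_nums = [2, 3, 4, 5, 6, 7]
ch_scores = [50.0, 120.0, 190.0, 200.0, 206.0, 209.0]

elbow_k = cluster_nums[elbow_index(ch_scores)]
print(elbow_k)  # 4 for these made-up scores
```
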

Single-topic VCI Generation


Fig.7 Comparisons with SOTA personalization methods including Textual Inversion (TI), DreamBooth, and XTI.

In Fig.7, we show the results for “A tilting cat wearing sunglasses” and “A teddy in Times Square”. XTI and Perfusion are the most recently published methods, and their models have not been released yet. The resulting images of XTI and Perfusion are borrowed from their papers, so their results of adding concepts are not shown. Our method faithfully conveys the appearance and material of the reference image while offering better controllability and diversity.