CLiC: Concept Learning in Context

Mehdi Safaee ¹

Aryan Mikaeili ¹

Or Patashnik ²

Daniel Cohen-Or ²

Ali Mahdavi-Amiri ¹

¹ Simon Fraser University

² Tel Aviv University

CVPR 2024 (Highlight)

TL;DR: We focus on learning a specific pattern from an image (like the ornaments on the red chair) through a unique text token. This token can then be used to transfer the learned pattern onto various other objects (Right) or to create new objects featuring that pattern (Left).

Abstract

This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g., an ornament) from a source image and subsequently applying it to an object (e.g., a chair) in a target image. Our key idea is to perform in-context concept learning, acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image, showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images, employing cross-attention mechanisms, and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments, along with comparisons against baseline techniques.

Transfer Results

Generation Results

Method

In-Context Concept Learning: Uses an image (I_s) and a binary mask (M_s) to learn a concept (e.g., an ornament) in the context of a base object (like a chair).
- ℓ_con: Soft-masked diffusion loss for learning the pattern in context.
- ℓ_att: Restricts attention maps of the learned token (v^*) to the pattern region defined by M_s.
- ℓ_ROI: Uses a text prompt for v^* to enhance concept reconstruction, focusing on a local region by masking I_s.
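
To make the first two loss terms above concrete, here is a minimal NumPy sketch. The page does not give the exact formulas, so everything in it is an assumption for illustration: the soft mask is taken to be 1 inside M_s and a constant `outside_weight` elsewhere, ℓ_con is taken to be a mask-weighted mean squared error on the predicted noise, and ℓ_att is taken to be a mean squared error between the normalized attention map of v^* and M_s. The function names are hypothetical and are not the authors' code. ℓ_ROI is not sketched.

```python
import numpy as np

def soft_mask(mask, outside_weight=0.5):
    """Binary (H, W) mask -> soft mask: 1 inside, outside_weight elsewhere (assumed form)."""
    mask = mask.astype(np.float64)
    return mask + outside_weight * (1.0 - mask)

def soft_masked_diffusion_loss(noise_pred, noise, mask, outside_weight=0.5):
    """Assumed form of l_con: MSE on the noise, weighted per pixel by the soft mask.

    noise_pred and noise have shape (C, H, W); mask has shape (H, W).
    """
    w = soft_mask(mask, outside_weight)[None, :, :]  # broadcast over channels
    return float(np.mean(w * (noise_pred - noise) ** 2))

def attention_loss(attn_map, mask):
    """Assumed form of l_att: MSE between the normalized attention map of v^* and the mask."""
    attn = attn_map / (attn_map.max() + 1e-8)  # scale to [0, 1]
    return float(np.mean((attn - mask.astype(np.float64)) ** 2))
```

With this weighting, errors inside the pattern region count fully while errors in the surrounding object still contribute at a reduced weight, which is one way to learn the pattern together with its context.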
Concept Transfer: Involves an image (I_tg), a mask (M_tg) defining the edit area, and a user-defined text prompt containing the optimized token v^*.
- Noise Addition & Denoising: Adds noise to the latent of I_tg and denoises it with the fine-tuned diffusion model.
- Blending & Preservation: Blends the model's output with the masked input to preserve out-of-mask regions.
- Cross-Attention Guidance: Enhances the pattern's presence in the final output.
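
The noise addition and blending steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: noise is added with the standard DDPM forward formula, `denoise_step` is a hypothetical stand-in for one step of the fine-tuned diffusion model, and blending is assumed to happen after every step against a copy of the original latent noised to the matching level. The page does not say at which steps blending is applied. Cross-attention guidance is not sketched.

```python
import numpy as np

def add_noise(latent, alpha_bar_t, rng):
    """Standard DDPM forward step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(latent.shape)
    noisy = np.sqrt(alpha_bar_t) * latent + np.sqrt(1.0 - alpha_bar_t) * eps
    return noisy, eps

def blend(edited, original, mask):
    """Keep the edited latent inside the mask and the original latent outside it."""
    m = mask.astype(np.float64)[None, :, :]  # broadcast over channels
    return m * edited + (1.0 - m) * original

def transfer(latent_tg, mask_tg, denoise_step, alpha_bars, rng):
    """Noise the target latent, denoise it, and blend after every step.

    alpha_bars: cumulative signal levels ordered from noisiest to cleanest.
    denoise_step(x, i): hypothetical callable taking x from level i to level i + 1.
    """
    x, _ = add_noise(latent_tg, alpha_bars[0], rng)
    for i in range(len(alpha_bars)):
        x = denoise_step(x, i)
        if i + 1 < len(alpha_bars):
            # Noise the original to the level x is at now, so both match.
            background, _ = add_noise(latent_tg, alpha_bars[i + 1], rng)
        else:
            background = latent_tg  # last step: use the clean original
        x = blend(x, background, mask_tg)
    return x
```

Because the last blend uses the clean target latent, everything outside M_tg is returned unchanged, which is the preservation property the blending step is meant to provide.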

Additional Results

Presentation

Citation

@article{safaee2023clic,
    title={CLiC: Concept Learning in Context},
    author={Mehdi Safaee and Aryan Mikaeili and Or Patashnik and Daniel Cohen-Or and Ali Mahdavi-Amiri},
    journal={CVPR},
    year={2024}
}