Tuesday, October 28, 2025 - 14:30
The School of Computer Science is pleased to present…
CLIP-Enhanced CIR with Schemas
MSc Thesis Proposal by: Tina Aminian
Date: Tuesday, October 28, 2025
Time: 2:30 PM
Location: Dillon Hall, Room 354
Abstract:
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database given a reference image and a textual modification that captures the user’s intent regarding how the reference image should be changed. It has various applications, such as e-commerce and general image search engines. In recent years, numerous CIR models have been proposed that demonstrate strong performance, particularly when textual modifications are expressed as simple, attribute-based phrases. However, CIR remains challenging, especially in domains such as fashion retail, where the vocabulary used to express retrieval requirements is rich and fine-grained.
We propose an attribute-based approach to CIR, built on top of the CLIP model, in which the attribute schema is data-dependent. We constructed the training set by pairing and automatically annotating images from the Deep Fashion Multi-Modal dataset. This annotation process followed a specific schema and leveraged the capabilities of a Multimodal Large Language Model (Qwen2.5). For validation, we used a Large Language Model (LLM) to interpret the modification texts and extract their semantic meanings.
Based on this structured data, we propose a more precise methodology for filtering and fine-tuning. The initial experimental results indicate that this approach achieves improved performance over several existing CIR and zero-shot CIR baselines.
Thesis Committee:
Internal Reader: Dr. Jianguo Lu
External Reader: Dr. Muhammad Asaduzzaman
Advisor: Dr. Jessica Chen