Open-Vocabulary Image Segmentation Algorithms with Deep Neural Networks
Thesis Defense Details
Date & Time
Thursday, November 6, 2025
12:45 PM (Athens, GMT+2)
Location
Computer Science Department
Room B-108
Defense Committee
Nikos Komodakis
Thesis Supervisor
University of Crete
Yannis Pantazis
Committee Member
University of Crete
Yannis Stylianou
Committee Member
University of Crete
Results
"A dinosaur near plants and a building."
"John is walking Rex with a leash."
"Einstein shaking hands with Turing."
"Pandas near a wooden structure"
"Person is drinking from a cup and holding a mug."
"a plate with bread, fruits, a fork and juice"
SPADE is a zero-shot, training-free approach to open-vocabulary semantic segmentation: given only a text description, it finds and outlines anything you ask for in an image. Instead of being limited to a fixed list of objects, it taps the deep visual knowledge already embedded inside text-to-image generative models like Stable Diffusion, so it works instantly on any concept.
The examples above show how SPADE translates free-form text into precise pixel-level outlines. Each colored region corresponds directly to a word in the prompt, demonstrating that our method achieves both semantic richness (understanding diverse concepts) and spatial precision (locating them accurately), a major goal in this field.
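SPADE's full pipeline is described in the thesis itself; purely as an illustration of the underlying mechanism, the sketch below shows one common way to read cross-attention maps out of Stable Diffusion's UNet with the diffusers library. The StoreCrossAttnProcessor class and the maps list are names invented for this example, the processor mirrors diffusers' default AttnProcessor (whose internals vary between library versions), and the noising or inversion step needed to feed a real input image through the UNet is omitted.

```python
import torch
from diffusers import StableDiffusionPipeline

class StoreCrossAttnProcessor:
    """Records cross-attention probabilities from a UNet attention layer.

    Hypothetical sketch mirroring diffusers' default AttnProcessor;
    the Attention internals used here vary across diffusers versions.
    """

    def __init__(self, store):
        self.store = store  # shared list collecting (batch*heads, pixels, tokens) maps

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        # Softmax attention: rows are latent positions, columns are prompt tokens.
        probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            self.store.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
maps = []
# Install the recording processor on every attention layer of the UNet.
pipe.unet.set_attn_processor(
    {name: StoreCrossAttnProcessor(maps) for name in pipe.unet.attn_processors}
)
```

After a denoising pass, maps holds one tensor per cross-attention layer and step; averaged over heads, layers, and steps, each token column becomes a coarse heatmap of where that word lands in the image.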
Abstract
A central goal of computer vision is to grant machines a human-like understanding of the visual world. In the realm of semantic image segmentation, this translates to a particularly challenging task: outlining every object in an image with pixel-level precision while also assigning each region a meaningful, open-ended semantic label. This requires a delicate fusion of two core skills: precise spatial localization (knowing exactly where something is) and deep semantic comprehension (knowing what something is).
For the last decade, deep learning models, from CNNs to Transformers, have made remarkable progress in segmentation. However, their potential has been constrained by the expensive and time-consuming nature of manual image annotation. As a result, most state-of-the-art systems are trained in a fully supervised manner on datasets with a restricted, predefined "closed set" of categories. This fundamental limitation hinders their application in the real world, as they fail to recognize any concept outside their fixed vocabulary.
To address this limitation, a recent paradigm shift towards open-vocabulary systems has emerged. Fueled by powerful large-scale foundation models, these new approaches allow image segmentation based on arbitrary free-form text prompts. Yet, this shift has revealed a new challenge: a trade-off between semantic richness and spatial accuracy. The most powerful models often excel at one of these skills at the expense of the other, creating a bottleneck for truly comprehensive visual understanding.
This thesis argues that a promising solution to this bottleneck lies in generative AI, and introduces our approach, SPADE (Semantic Parsing with Attention and Diffusion Embeddings). Text-to-image diffusion models like Stable Diffusion were engineered to create images, not to analyze them. However, we explore how their internal representations, learned through the process of image synthesis, contain a surprisingly dense and granular fusion of semantic and spatial knowledge. This work provides a comprehensive overview of the techniques that repurpose these generative models, framing them as a powerful tool to bridge the divide between semantic understanding and spatial precision in open-vocabulary segmentation.
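As a concrete, minimal illustration of the abstract's central claim, the hedged sketch below turns cross-attention maps (for example, those captured by the earlier sketch, averaged over layers, heads, and denoising steps) into a per-pixel class map: each class's token columns are averaged into a heatmap, upsampled from the latent grid to image resolution, normalized, and combined with an argmax. The function name, the min-max normalization, and the toy token indices are assumptions for illustration, not the SPADE pipeline itself.

```python
import torch
import torch.nn.functional as F

def masks_from_cross_attention(attn_maps, class_token_ids, image_size):
    """Aggregate diffusion cross-attention into a per-pixel class map.

    attn_maps:       (H*W, T) cross-attention averaged over layers,
                     heads, and denoising steps; rows are latent
                     positions, columns are prompt tokens.
    class_token_ids: {class name: list of token indices in the prompt}.
    image_size:      (height, width) of the output segmentation.
    """
    hw, _ = attn_maps.shape
    side = int(hw ** 0.5)  # cross-attention lives on a square latent grid
    heatmaps = []
    for token_ids in class_token_ids.values():
        # Mean attention paid by each latent position to this class's tokens.
        m = attn_maps[:, token_ids].mean(dim=1).reshape(1, 1, side, side)
        # Upsample from latent resolution to full image resolution.
        m = F.interpolate(m, size=image_size, mode="bilinear", align_corners=False)
        heatmaps.append(m[0, 0])
    stacked = torch.stack(heatmaps)  # (num_classes, H, W)
    # Min-max normalize each class map so their scales are comparable.
    lo = stacked.amin(dim=(1, 2), keepdim=True)
    hi = stacked.amax(dim=(1, 2), keepdim=True)
    stacked = (stacked - lo) / (hi - lo + 1e-8)
    return stacked.argmax(dim=0)  # (H, W) per-pixel class index

# Toy usage: 16x16 latent grid, 77-token prompt, two illustrative classes.
attn = torch.rand(16 * 16, 77)
seg = masks_from_cross_attention(attn, {"dog": [2], "leash": [5]}, (512, 512))
```

On a prompt like "John is walking Rex with a leash" from the gallery above, each named concept would get its own heatmap, and the argmax assigns every pixel to the concept that attends to it most strongly.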