Open-Vocabulary Image Segmentation Algorithms with Deep Neural Networks
Thesis Defense Details
Date & Time
Thursday, November 6, 2025
12:45 PM (Athens, GMT+2)
Location
Computer Science Department
Room B-108
Defense Committee
Nikos Komodakis
Thesis Supervisor
University of Crete
Yannis Pantazis
Committee Member
University of Crete
Yannis Stylianou
Committee Member
University of Crete
Results
"A dinosaur near plants and a building."
"John is walking Rex with a leash."
"Einstein shaking hands with Turing."
"Pandas near a wooden structure"
"Person is drinking from a cup and holding a mug."
"a plate with bread, fruits, a fork and juice"
SPADE is a zero-shot, training-free approach to open-vocabulary semantic segmentation: given only a text description, it finds and outlines anything you ask for in an image. Instead of being limited to a fixed list of objects, it taps the deep visual knowledge already embedded inside text-to-image generative models like Stable Diffusion, so it works instantly on any concept.
The examples above show how SPADE translates free-form text into precise pixel-level outlines. Each colored region corresponds directly to a word in the prompt, demonstrating that our method achieves both semantic richness (understanding diverse concepts) and spatial precision (locating them accurately), a major goal in this field.
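SPADE's full pipeline is described in the thesis itself; purely as an illustration of the underlying mechanism, the sketch below shows one common way to read cross-attention maps out of Stable Diffusion's UNet with the diffusers library. The StoreCrossAttnProcessor class and the maps list are names invented for this example, the processor mirrors diffusers' default AttnProcessor (whose internals vary between library versions), and the noising or inversion step needed to feed a real input image through the UNet is omitted.

```python
import torch
from diffusers import StableDiffusionPipeline

class StoreCrossAttnProcessor:
    """Records cross-attention probabilities from a UNet attention layer.

    Hypothetical sketch mirroring diffusers' default AttnProcessor;
    the Attention internals used here vary across diffusers versions.
    """

    def __init__(self, store):
        self.store = store  # shared list collecting (batch*heads, pixels, tokens) maps

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        # Softmax attention: rows are latent positions, columns are prompt tokens.
        probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            self.store.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
maps = []
# Install the recording processor on every attention layer of the UNet.
pipe.unet.set_attn_processor(
    {name: StoreCrossAttnProcessor(maps) for name in pipe.unet.attn_processors}
)
```

After a denoising pass, maps holds one tensor per cross-attention layer and step; averaged over heads, layers, and steps, each token column becomes a coarse heatmap of where that word lands in the image.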
Abstract
A central goal of computer vision is to grant machines a human-like understanding of the visual world. In the realm of semantic image segmentation, this translates to a particularly challenging task: outlining every object in an image with pixel-level precision while also assigning each region a meaningful, open-ended semantic label. This requires a delicate fusion of two core skills: precise spatial localization (knowing exactly where something is) and deep semantic comprehension (knowing what something is).
For the last decade, deep learning models, from CNNs to Transformers, have made remarkable progress in segmentation. However, their potential has been constrained by the expensive and time-consuming nature of manual image annotation. As a result, most state-of-the-art systems are trained in a fully supervised manner on datasets with a restricted, predefined "closed set" of categories. This fundamental limitation hinders their application in the real world, as they fail to recognize any concept outside their fixed vocabulary.
To address this limitation, a recent paradigm shift towards open-vocabulary systems has emerged. Fueled by powerful large-scale foundation models, these new approaches allow image segmentation based on arbitrary free-form text prompts. Yet, this shift has revealed a new challenge: a trade-off between semantic richness and spatial accuracy. The most powerful models often excel at one of these skills at the expense of the other, creating a bottleneck for truly comprehensive visual understanding.
This thesis argues that a promising solution to this bottleneck lies in generative AI, and introduces our approach, SPADE (Semantic Parsing with Attention and Diffusion Embeddings). Text-to-image diffusion models like Stable Diffusion were engineered to create images, not to analyze them. However, we explore how their internal representations, learned through the process of image synthesis, contain a surprisingly dense and granular fusion of semantic and spatial knowledge. This work provides a comprehensive overview of the techniques that repurpose these generative models, framing them as a powerful tool to bridge the divide between semantic understanding and spatial precision in open-vocabulary segmentation.
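As a concrete, minimal illustration of the abstract's central claim, the hedged sketch below turns cross-attention maps (for example, those captured by the earlier sketch, averaged over layers, heads, and denoising steps) into a per-pixel class map: each class's token columns are averaged into a heatmap, upsampled from the latent grid to image resolution, normalized, and combined with an argmax. The function name, the min-max normalization, and the toy token indices are assumptions for illustration, not the SPADE pipeline itself.

```python
import torch
import torch.nn.functional as F

def masks_from_cross_attention(attn_maps, class_token_ids, image_size):
    """Aggregate diffusion cross-attention into a per-pixel class map.

    attn_maps:       (H*W, T) cross-attention averaged over layers,
                     heads, and denoising steps; rows are latent
                     positions, columns are prompt tokens.
    class_token_ids: {class name: list of token indices in the prompt}.
    image_size:      (height, width) of the output segmentation.
    """
    hw, _ = attn_maps.shape
    side = int(hw ** 0.5)  # cross-attention lives on a square latent grid
    heatmaps = []
    for token_ids in class_token_ids.values():
        # Mean attention paid by each latent position to this class's tokens.
        m = attn_maps[:, token_ids].mean(dim=1).reshape(1, 1, side, side)
        # Upsample from latent resolution to full image resolution.
        m = F.interpolate(m, size=image_size, mode="bilinear", align_corners=False)
        heatmaps.append(m[0, 0])
    stacked = torch.stack(heatmaps)  # (num_classes, H, W)
    # Min-max normalize each class map so their scales are comparable.
    lo = stacked.amin(dim=(1, 2), keepdim=True)
    hi = stacked.amax(dim=(1, 2), keepdim=True)
    stacked = (stacked - lo) / (hi - lo + 1e-8)
    return stacked.argmax(dim=0)  # (H, W) per-pixel class index

# Toy usage: 16x16 latent grid, 77-token prompt, two illustrative classes.
attn = torch.rand(16 * 16, 77)
seg = masks_from_cross_attention(attn, {"dog": [2], "leash": [5]}, (512, 512))
```

On a prompt like "John is walking Rex with a leash" from the gallery above, each named concept would get its own heatmap, and the argmax assigns every pixel to the concept that attends to it most strongly.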