The term "multimodal embedding spaces" comes from artificial intelligence and also appears in discussions of big data and digital transformation. It refers to representing information from various sources, such as text, images, speech, or even video, as numerical vectors (embeddings) in one shared digital space, the embedding space, so that related items end up close to one another.
Imagine this embedding space as a huge cupboard in which different types of information are sorted according to one unified system. This allows an AI to recognise, for example, that a dog in a photo and the word "dog" in a text belong together. It can therefore relate images, words, and even sounds to one another far better than if each data type were handled separately.
A concrete example: in modern product search for online shops, a customer can upload a photo of a trainer and be shown matching products, just as if they had typed a detailed search query. Multimodal embedding spaces let an AI system evaluate different data types together and thus offer smarter services, whether in online shopping, image search, or everyday assistance systems.
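
The product search described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the three-dimensional vectors are written by hand, whereas a real system would obtain embeddings with hundreds of dimensions from a trained multimodal model. The names `catalog`, `photo_embedding`, and `search` are hypothetical and chosen for this example. The sketch uses cosine similarity, one common way to measure how close two embeddings are.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of product descriptions (text side).
# In practice a trained model would produce these; here they are made up.
catalog = {
    "white running trainer": [0.9, 0.1, 0.2],
    "leather office shoe": [0.3, 0.8, 0.1],
    "garden hose": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of the customer's uploaded photo (image side).
# Because text and images share one space, it can be compared directly.
photo_embedding = [0.85, 0.15, 0.25]

def search(query_vec, items, top_k=2):
    """Rank catalog entries by similarity to the query and return the best names."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

results = search(photo_embedding, catalog)
print(results)
```

With these made-up numbers the trainer ranks first, because its vector points in almost the same direction as the photo's vector. The point of the sketch is that no text query is needed: the photo itself acts as the query.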















