With neural networks, an interesting trend is the use of multimodal models that can embed data across multiple modalities (text in multiple languages, images, speech, videos, routes, etc.) into a shared semantic space. Once we can do this, we can enable interesting use cases such as cross-modal retrieval (e.g. show me all parts of a video that are related to this text query). Or, as Google mentions, multimodal models allow new kinds of queries: “show me a route with beautiful mountain views” or “show me the part in the video where the lion roars at sunset”.
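To make the cross-modal retrieval idea concrete, here is a minimal sketch using the open-source CLIP checkpoint exposed through the `sentence-transformers` library, which encodes images and text into one shared space. The frame paths and the query string are illustrative placeholders, not anything from a specific system.

```python
# Minimal sketch of cross-modal retrieval over a shared embedding space,
# assuming the CLIP checkpoint available via sentence-transformers.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # joint image/text encoder

# Embed sampled video frames (images) into the shared space.
frame_paths = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]  # placeholders
frame_embeddings = model.encode([Image.open(p) for p in frame_paths])

# Embed a free-text query into the same space.
query_embedding = model.encode("a lion roars at sunset")

# Rank frames by cosine similarity to the query -- this is cross-modal retrieval.
scores = util.cos_sim(query_embedding, frame_embeddings)[0]
for path, score in sorted(zip(frame_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

Because both modalities live in the same space, the same query embedding could just as easily be matched against text passages, product images, or route descriptions.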
Google talked about one implementation - the Multitask Unified Model (MUM) - based on a transformer architecture but 1,000x more powerful than BERT. MUM can acquire deep knowledge of the world, generate language, is trained on 75+ languages, and understands multiple modalities.
I really believe explorations in this sort of multimodal representation learning will become the standard for building intelligent recommendation systems. In theory, it lets you learn about the world by modelling joint spaces (e.g. two brands might be equivalent, two parks might be comparable, some things are changeable/moveable while others aren’t, etc.), and it enables you to use the learned representations for new use cases (e.g. conversational retrieval across multimodal datasets), as sketched below.
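As a rough sketch of how learned representations could back a recommender: assume items (parks, brands, routes, each possibly described by images and text) have already been embedded into one joint space by some multimodal encoder; nearest-neighbour search over that space then gives cross-modal recommendations. The item IDs and the randomly generated embedding matrix below are placeholders standing in for real encoder output.

```python
# Rough sketch: item-to-item recommendation over a precomputed joint embedding
# space. `item_ids` and `item_embeddings` are hypothetical placeholders for
# whatever a multimodal encoder would actually produce.
import numpy as np

rng = np.random.default_rng(0)
item_ids = ["park_A", "park_B", "brand_X", "brand_Y", "route_1"]  # illustrative
item_embeddings = rng.normal(size=(len(item_ids), 512)).astype(np.float32)

# L2-normalise so that a dot product equals cosine similarity.
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def recommend(query_id: str, k: int = 3) -> list[str]:
    """Return the k items closest to `query_id` in the shared space."""
    q = item_embeddings[item_ids.index(query_id)]
    scores = item_embeddings @ q
    ranked = np.argsort(-scores)
    return [item_ids[i] for i in ranked if item_ids[i] != query_id][:k]

print(recommend("park_A"))
```

The appeal is that the same index can serve item-to-item similarity, text-to-item search, and (with a dialogue model producing the query embeddings) conversational retrieval, all without retraining per use case.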