I think Deep Learning finds its strength in its ability to model different types of data at once, and to do so efficiently. Building models from multimodal datasets has become fairly routine nowadays. The concept is not new, and it was not impossible before the advent of DL, but feature processing and modeling were far more complex and performance was much lower!
One key aspect of this success is the concept of an embedding: a lower-dimensional representation of the data. This makes it possible to perform efficient computations while mitigating the curse of dimensionality and yielding representations that are more robust to overfitting. In practice, an embedding is just a vector living in a “latent” or “semantic” space.
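To make the "just a vector" idea concrete, here is a minimal sketch of an embedding table in plain Python. The vocabulary, the dimension, and the helper names (`build_embedding_table`, `cosine_similarity`) are made up for illustration, and the vectors are random stand-ins for values that would normally be learned during training.

```python
import math
import random

def build_embedding_table(vocab, dim, seed=0):
    """Give each token a dense vector of length `dim`.
    The values are random here (assumption); in practice they are learned."""
    rng = random.Random(seed)
    return {token: [rng.gauss(0.0, 1.0) for _ in range(dim)] for token in vocab}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors in the latent space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car", "truck", "apple"]
table = build_embedding_table(vocab, dim=3)

# A one-hot encoding needs one dimension per token in the vocabulary,
# while the embedding only needs `dim` dimensions.
one_hot_cat = [1.0 if token == "cat" else 0.0 for token in vocab]
embedded_cat = table["cat"]
print(len(one_hot_cat), len(embedded_cat))
print(cosine_similarity(table["cat"], table["dog"]))
```

With a real vocabulary of hundreds of thousands of tokens, the gap between the one-hot size and the embedding size is what makes downstream computations tractable.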
The first great success of embeddings for word encoding was Word2Vec in 2013 (https://lnkd.in/gC62AchR), followed by GloVe in 2014 (https://lnkd.in/gA8bnnX2). Since AlexNet in 2012 (https://lnkd.in/gi27CxPF), many convolutional network architectures (VGG16 (2014), Inception (2014), ResNet (2015), …) have been used as feature extractors for images. Since 2018, starting with BERT, Transformer architectures have been widely used to extract semantic representations from sentences.
One domain where embeddings changed everything is recommender engines. It all started with latent matrix factorization methods, made popular during the Netflix Prize competition, which ran from 2006 to 2009. The idea is to learn a vector representation for each user and each product and use those vectors as base features. In fact, any sparse feature can be encoded as an embedding vector, and modern recommender engines typically use hundreds of embedding matrices for different categorical variables.
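As a rough illustration of the user and product vectors described above, here is a minimal matrix factorization sketch trained with stochastic gradient descent. The toy ratings, the hyperparameters, and the function names (`factorize`, `predict`) are assumptions for the example, not the method used by any particular Netflix Prize entry.

```python
import random

def factorize(ratings, n_users, n_items, dim=2, lr=0.05, reg=0.01, epochs=1000, seed=0):
    """Learn one vector per user and per item so that their dot product
    approximates the observed rating (hyperparameters are illustrative)."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(users[u], items[i]))
            err = r - pred
            for k in range(dim):
                pu, qi = users[u][k], items[i][k]
                # Gradient step on squared error with L2 regularization.
                users[u][k] += lr * (err * qi - reg * pu)
                items[i][k] += lr * (err * pu - reg * qi)
    return users, items

def predict(users, items, u, i):
    """Predicted rating: dot product of the user and item embeddings."""
    return sum(a * b for a, b in zip(users[u], items[i]))

# Toy (user, item, rating) triples; the unobserved pairs are what we want to predict.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
users, items = factorize(ratings, n_users=3, n_items=3)
print(predict(users, items, 0, 0))  # observed pair, should be close to 5.0
print(predict(users, items, 0, 2))  # unobserved pair, filled in by the model
```

The learned rows of `users` and `items` are exactly the embeddings that can then be fed as base features into a larger model.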
Dimensionality reduction is by no means a new concept in Unsupervised Learning! PCA, for example, dates back to 1901, the concept of the autoencoder was introduced in 1986, and variational autoencoders (VAE) were introduced in 2013 (https://lnkd.in/dT3RjUTA). A VAE is, for example, a key component of Stable Diffusion. A typical difficulty in Machine Learning is obtaining labeled data. Self-supervised learning techniques like Word2Vec, autoencoders, and generative language models allow us to build powerful latent representations of the data at low cost. Meta recently came out with Data2Vec 2.0 to learn latent representations of any data modality using self-supervised learning (https://lnkd.in/dT3RjUTA).
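For a feel of what unsupervised dimensionality reduction does, here is a small PCA sketch that finds the first principal component with power iteration and projects 2-D points onto it. The data points and helper names are invented for the example, and real code would normally rely on a linear algebra library instead of this hand-rolled loop.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the column means and the top eigenvector of the covariance
    matrix, estimated by power iteration (a simple stand-in for full PCA)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(row, means, component):
    """One-dimensional code for a point: its coordinate along the component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Toy points lying close to the line y = 2x, so one dimension captures most of the variance.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, component = first_principal_component(points)
codes = [project(p, means, component) for p in points]
print(component)
print(codes)
```

No labels are involved anywhere: the latent coordinate in `codes` is learned from the structure of the data alone, which is the same appeal self-supervised methods have at much larger scale.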
Besides learning latent representations, a lot of work is being done to learn aligned representations across different modalities. For example, CLIP (https://lnkd.in/eGNMirji) is a recent contrastive learning method that learns semantically aligned representations of text and image data.
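To give an idea of what "contrastive" means here, below is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired text and image vectors. The toy vectors, the fixed `temperature` value, and the function names are assumptions for illustration; the actual model learns its encoders and temperature, and this is not its implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric loss: each text should match its own image and vice versa."""
    text = [normalize(v) for v in text_vecs]
    image = [normalize(v) for v in image_vecs]
    logits = [[sum(a * b for a, b in zip(t, im)) / temperature for im in image]
              for t in text]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: text i and image i are meant to describe the same thing.
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
aligned_image = [[0.9, 0.1], [0.1, 0.9]]
shuffled_image = [aligned_image[1], aligned_image[0]]

print(contrastive_loss(aligned_text, aligned_image))   # low: pairs already agree
print(contrastive_loss(aligned_text, shuffled_image))  # high: pairs are mismatched
```

Minimizing this loss pulls matching text and image vectors together and pushes mismatched ones apart, which is what produces a shared space for both modalities.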