The growing importance of the field of Explainable Artificial Intelligence (XAI) has resulted in several proposed methods for producing visual heatmaps of the classification decisions of deep learning models. However, visual explanations alone are not enough, since different end-users have different backgrounds and preferences. Natural language explanations (NLEs) are inherently understandable by humans and can thus complement visual explanations. In the literature, the problem of generating NLEs is usually framed as traditional supervised image captioning, where the model learns to reproduce human-collected ground-truth explanations. In this talk, the audience is invited to navigate the state of the art in image captioning and NLE generation, from the very first LSTM-based approaches to more recent Transformer-based architectures. The last part of the talk will cover the speaker’s ongoing research on the topic, focusing in particular on the distinction between image captioning and NLE generation and on how we can go from one to the other without requiring human-collected NLEs for training.