<aside> <img src="/icons/list_gray.svg" alt="/icons/list_gray.svg" width="40px" /> Contents
</aside>
In this task, you will explore the multimodal features of models that support image recognition.
<aside> <img src="/icons/help-alternate_yellow.svg" alt="/icons/help-alternate_yellow.svg" width="40px" /> What is multimodal? Multimodal Large Language Models can process inputs directly in other modalities such as images or voice. Not all models are multimodal and not all multimodal models support all modalities in both directions. For instance, a model may be able to directly recognise speech or an image but not output it. In that case, it will use a separate model to generate these which takes extra time.
Image input is the most common additional modality that LLMs support. At present, no available model supports all modalities in both directions, but OpenAI has announced GPT-4o, which is intended to do this in the future. (A short sketch after this note shows what image input looks like via an API.)
</aside>
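To make the idea of image input concrete, here is a minimal sketch of sending an image to a multimodal model through the OpenAI Python API. It is an illustration rather than part of the task: the model name, the image URL, and the API key setup are assumptions you would replace with your own.

```python
# Minimal sketch: image input to a multimodal model via the OpenAI API.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model name and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # a multimodal model that accepts images
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

# The reply is plain text: the model processes the image directly,
# without a separate captioning model in between.
print(response.choices[0].message.content)
```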
<aside> <img src="/icons/notification_green.svg" alt="/icons/notification_green.svg" width="40px" /> Which models support image description? At present only some models support this. 1. GPT-4, 2. Claude and 3. Gemini. Only Claude and Gemini offer this in the free version. ChatGPT only offers this to a very limited extent.
Note: The free models are usually less powerful and do not show the full potential.
</aside>
<aside> <img src="/icons/cursor-click_gray.svg" alt="/icons/cursor-click_gray.svg" width="40px" /> Click on the triangle next to each step to see more details and/or get resources necessary for the task.
Read through all steps first.
</aside>