<aside> <img src="/icons/list_gray.svg" alt="/icons/list_gray.svg" width="40px" /> Contents
</aside>
In this task, you will explore the multimodal features of models that support image recognition.
<aside> <img src="/icons/help-alternate_yellow.svg" alt="/icons/help-alternate_yellow.svg" width="40px" /> What is multimodal? Multimodal Large Language Models can process inputs directly in other modalities such as images or voice. Not all models are multimodal and not all multimodal models support all modalities in both directions. For instance, a model may be able to directly recognise speech or an image but not output it. In that case, it will use a separate model to generate these which takes extra time.
Image input is the most common additional modality that LLMs support. At present, no available model supports all modalities in both directions, but OpenAI has announced GPT-4o, which is intended to do this in the future. (A short sketch after this note shows what image input looks like via an API.)
</aside>
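To make the idea of image input concrete, here is a minimal sketch of sending an image to a multimodal model through the OpenAI Python API. It is an illustration rather than part of the task: the model name, the image URL, and the API key setup are assumptions you would replace with your own.

```python
# Minimal sketch: image input to a multimodal model via the OpenAI API.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model name and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # a multimodal model that accepts images
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

# The reply is plain text: the model processes the image directly,
# without a separate captioning model in between.
print(response.choices[0].message.content)
```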
<aside> <img src="/icons/notification_green.svg" alt="/icons/notification_green.svg" width="40px" /> Which models support image description? At present only some models support this. 1. GPT-4, 2. Claude and 3. Gemini. Only Claude and Gemini offer this in the free version. ChatGPT only offers this to a very limited extent.
Note: The free models are usually less powerful and do not show the full potential.
</aside>
<aside> <img src="/icons/cursor-click_gray.svg" alt="/icons/cursor-click_gray.svg" width="40px" /> Click on the triangle next to each step to see more details and/or get resources necessary for the task.
Read through all steps first.
</aside>