Multimodal Input – Text + Image

In this lesson, you'll learn how to send multimodal input (text and an image together) to OpenAI's Responses API with a GPT-4.1 model to analyze visual content.

Prerequisites

Ensure you have the OpenAI SDK installed:

pip install openai
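
The client reads your API key from the OPENAI_API_KEY environment variable, so export it before running anything below. A minimal sanity check, assuming you configure the key via the environment rather than passing it to the client explicitly:

import os

# OpenAI() looks for OPENAI_API_KEY by default; fail fast if it's unset.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")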

The Code

from openai import OpenAI

# The client picks up OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            # The question to ask about the image.
            {"type": "input_text", "text": "what's in this image?"},
            # A publicly accessible image URL for the model to analyze.
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
        ],
    }],
)

# The aggregated text output of the response.
print(response.output_text)
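
If the image lives on your machine rather than at a public URL, the same input_image part also accepts a base64-encoded data URL. A minimal sketch; boardwalk.jpg is a placeholder filename:

import base64

from openai import OpenAI

client = OpenAI()

# Read a local file and encode it as a base64 data URL
# (boardwalk.jpg is a placeholder; use your own image).
with open("boardwalk.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}"},
        ],
    }],
)

print(response.output_text)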

Explanation

  • input_text: The question or instruction for the model.
  • input_image: The image (here, a URL) the model should analyze.
  • The model processes both parts together and returns a context-aware description of the image.
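
Because content is a list, a single request can also carry several images, and the model reasons over all of them together. A minimal sketch of a two-image comparison; both URLs are placeholders to replace with real, publicly reachable images:

from openai import OpenAI

client = OpenAI()

# Both URLs below are placeholders; point them at real images
# before running.
response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What differs between these two images?"},
            {"type": "input_image", "image_url": "https://example.com/photo_a.jpg"},
            {"type": "input_image", "image_url": "https://example.com/photo_b.jpg"},
        ],
    }],
)

print(response.output_text)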

Use Case

Perfect for:

  • Image captioning
  • Visual question answering (VQA)
  • AI-powered content description for accessibility