Multimodal Input – Text + Image

In this lesson, you'll learn how to send multimodal input (text and an image together) to OpenAI's Responses API with a GPT-4.1 model to analyze visual content.

Prerequisites

Ensure you have the OpenAI SDK installed:

pip install openai
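
The client reads your API key from the OPENAI_API_KEY environment variable, so export it before running anything below. A minimal sanity check, assuming you configure the key via the environment rather than passing it to the client explicitly:

import os

# OpenAI() looks for OPENAI_API_KEY by default; fail fast if it's unset.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")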

The Code

from openai import OpenAI

# The client picks up OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            # The question to ask about the image.
            {"type": "input_text", "text": "what's in this image?"},
            # A publicly accessible image URL for the model to analyze.
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
        ],
    }],
)

# The aggregated text output of the response.
print(response.output_text)
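
If the image lives on your machine rather than at a public URL, the same input_image part also accepts a base64-encoded data URL. A minimal sketch; boardwalk.jpg is a placeholder filename:

import base64

from openai import OpenAI

client = OpenAI()

# Read a local file and encode it as a base64 data URL
# (boardwalk.jpg is a placeholder; use your own image).
with open("boardwalk.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}"},
        ],
    }],
)

print(response.output_text)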

Explanation

  • input_text: The question or instruction for the model.
  • input_image: The image (here, a URL) the model should analyze.
  • The model processes both parts together and returns a context-aware description of the image.
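
Because content is a list, a single request can also carry several images, and the model reasons over all of them together. A minimal sketch of a two-image comparison; both URLs are placeholders to replace with real, publicly reachable images:

from openai import OpenAI

client = OpenAI()

# Both URLs below are placeholders; point them at real images
# before running.
response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What differs between these two images?"},
            {"type": "input_image", "image_url": "https://example.com/photo_a.jpg"},
            {"type": "input_image", "image_url": "https://example.com/photo_b.jpg"},
        ],
    }],
)

print(response.output_text)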

Use Case

Perfect for:

  • Image captioning
  • Visual question answering (VQA)
  • AI-powered content description for accessibility