Multimodal Input – Text + Image
In this lesson, you'll learn how to send multimodal input (text and an image together) to OpenAI's GPT-4.1 models through the Responses API to analyze visual content.
Prerequisites
Ensure you have the OpenAI Python SDK installed:
pip install openai
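The client reads your API key from the OPENAI_API_KEY environment variable by default. If you prefer to pass it explicitly, here is a minimal sketch, assuming the key is already set in your environment:

import os
from openai import OpenAI

# Equivalent to the default behavior: the SDK also picks up
# OPENAI_API_KEY automatically when no argument is given.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])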
The Code
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            # The text part: the question you want answered about the image.
            {"type": "input_text", "text": "what's in this image?"},
            # The image part: a publicly accessible URL the model can fetch.
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
        ],
    }],
)
print(response.output_text)
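The example above points the model at a public URL. If your image lives on disk instead, the Responses API also accepts base64-encoded data URLs. A minimal sketch, assuming a local JPEG named photo.jpg (a placeholder filename):

import base64
from openai import OpenAI

client = OpenAI()

# Read the local file and encode it as a base64 data URL.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}"},
        ],
    }],
)
print(response.output_text)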
Explanation
- input_text: poses the question or instruction.
- input_image: supplies the image URL for the model to analyze (an optional detail field is shown in the sketch below).
- The model processes both parts together and returns a context-aware visual analysis.
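If you want to trade image fidelity for cost, the input_image part also accepts an optional detail field; the accepted values are "low", "high", and "auto". A sketch reusing the earlier request with "low" to reduce token usage:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe this image briefly."},
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                # "low" uses fewer tokens but loses fine detail;
                # "high" and "auto" are the other accepted values.
                "detail": "low",
            },
        ],
    }],
)
print(response.output_text)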
Use Cases
Perfect for:
- Image captioning
- Visual question answering (VQA)
- AI-powered content description for accessibility (see the sketch below)
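As an example of the accessibility use case, a small helper could wrap the same call to produce alt text; generate_alt_text is a hypothetical name for this lesson, not part of the SDK:

from openai import OpenAI

client = OpenAI()

def generate_alt_text(image_url: str) -> str:
    """Hypothetical helper: returns one sentence of alt text for an image URL."""
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text",
                 "text": "Write one concise sentence of alt text for this image."},
                {"type": "input_image", "image_url": image_url},
            ],
        }],
    )
    return response.output_text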