Vision Input with GPT-4o

In this lesson, you'll learn how to use OpenAI’s GPT-4o model to interpret and respond to image inputs. This is useful for building AI tools that understand visual content.

Prerequisites

Ensure you have:

  • Python 3.8 or later
  • The OpenAI Python SDK installed:
pip install openai
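Before running any code, the SDK also needs an API key. A minimal setup sketch, assuming your key is stored in the OPENAI_API_KEY environment variable (the client reads this variable by default, so the explicit api_key argument below is optional):

import os

from openai import OpenAI

# OpenAI() reads OPENAI_API_KEY from the environment by default.
# Passing api_key explicitly is equivalent; never hard-code real keys in source.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])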

The Code

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": [
                # The text prompt and the image travel together as
                # parts of a single user message.
                {"type": "input_text", "text": "What's in the image?"},
                {
                    "type": "input_image",
                    "image_url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
                },
            ],
        }
    ],
)

print(response.output_text)

Explanation

  • input=[...]: A list of messages; here, a single user message whose content combines a text part and an image part.
  • "type": "input_text": The text prompt asking about the image.
  • "type": "input_image": Tells the API this content part is an image.
  • "image_url": A publicly accessible HTTPS URL of the image to analyze.
  • response.output_text: The model's text answer describing the image.
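The content list is not limited to a single image. As a sketch (both URLs below are placeholders), you can include several input_image parts alongside one input_text part to ask about multiple images in a single request:

from openai import OpenAI

client = OpenAI()

# Hedged sketch: comparing two images in one request.
# Both URLs are placeholders; substitute any HTTPS-accessible images.
response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What do these two images have in common?"},
                {"type": "input_image", "image_url": "https://example.com/photo-1.jpg"},
                {"type": "input_image", "image_url": "https://example.com/photo-2.jpg"},
            ],
        }
    ],
)

print(response.output_text)

Because all the parts arrive in the same message, the model can compare or relate the images in one answer.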

Note

  • The image must be reachable over a public HTTPS URL, or sent inline as a base64 data URL (see the sketch below).
  • Vision input is only supported by vision-capable models such as gpt-4o and gpt-4o-mini, not legacy GPT-3.5/GPT-4 text models.
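If an image lives on your disk rather than on the web, it can be sent inline as a base64-encoded data URL instead of a hosted link. A sketch under that assumption; cat.jpg is a placeholder filename:

import base64

from openai import OpenAI

client = OpenAI()

# Hedged sketch: send a local file inline as a base64 data URL.
# "cat.jpg" is a placeholder path; adjust the MIME type to match your file.
with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What's in the image?"},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}"},
            ],
        }
    ],
)

print(response.output_text)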

Use Case Ideas

  • Visual content captioning (see the sketch after this list)
  • Product cataloging
  • Accessibility for the visually impaired
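To make the first idea concrete, here is a minimal captioning sketch. The helper name caption, the prompt wording, and the URLs are illustrative assumptions, not part of the API:

from openai import OpenAI

client = OpenAI()

# Placeholder URLs; any HTTPS-accessible images work here.
image_urls = [
    "https://example.com/product-1.jpg",
    "https://example.com/product-2.jpg",
]

def caption(url: str) -> str:
    """Ask the model for a one-sentence caption of the image at `url`."""
    response = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Describe this image in one sentence."},
                    {"type": "input_image", "image_url": url},
                ],
            }
        ],
    )
    return response.output_text

for url in image_urls:
    print(f"{url} -> {caption(url)}")

The same loop structure works for product cataloging or alt-text generation; only the prompt changes.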