Handle vision input¶
Pass images=[...] to chat() (stateful) or generate() (stateless one-shot) on a vision-capable model. AIMU normalises every input to OpenAI content blocks internally and adapts them per-provider.
Basic usage¶
import aimu
client = aimu.client("openai:gpt-4o-mini")
client.chat("What's in this image?", images=["./cat.jpg"])
# Multiple images
client.chat("Compare these two photos.", images=["a.png", "b.png"])
Stateless single-turn (generate)¶
Use generate(images=...) for a one-shot vision call that does not touch conversation history, ideal for "look once and answer" tasks (captioning, scoring, classification) where you don't want the image and reply accumulating in self.messages:
client = aimu.client("openai:gpt-4o-mini")
# Each call is independent: no history kept, no reset() needed between calls.
caption = client.generate("Caption this image in one sentence.", images=["./cat.jpg"])
score = client.generate("Rate this image's quality 1-10.", images=["./cat.jpg"])
assert client.messages == [] # generate() never mutates message state
chat(images=...) is the stateful path: the turn persists so you can ask follow-ups. generate(images=...) is the stateless path: each call stands alone. Both accept the same image forms and raise ValueError for a non-vision model. (Before, one-shot vision required a client.reset() + chat() dance to avoid polluting history; generate(images=...) removes that.)
Accepted image forms¶
Each item in images=[...] may be any of:
- File path as
strorpathlib.Path: read from disk and base64-encoded - Raw
bytes: encoded directly http(s)://URL: passed through to providers that support itdata:image/...;base64,...URL: passed through
from pathlib import Path
client.chat("describe", images=[
"./local.jpg", # path string
Path("./other.png"), # Path
b"\x89PNG\r\n\x1a\n...", # bytes
"https://example.com/cat.jpg", # URL
"data:image/png;base64,iVBORw0KGgoAAAANS...", # data URL
])
Vision-capable models¶
Each Model enum exposes a VISION_MODELS classproperty listing members where supports_vision=True. Passing images= to a non-vision model raises ValueError up front.
Common vision-capable models:
- OpenAI: GPT-4o, GPT-4.1, o3, o4-mini
- Anthropic: Claude 4.x (Sonnet, Opus, Haiku)
- Google Gemini: 1.5 Pro/Flash, 2.0 Flash, 2.5 Pro/Flash
- Ollama: Gemma 3, Gemma 4
- HuggingFace: Gemma 3, Gemma 4, Qwen 3.5/3.6 VL variants
See the model matrix for the full list.
Per-provider adaptation¶
AIMU keeps self.messages in OpenAI format always (image_url content blocks) and adapts at request time:
| Provider | Adaptation |
|---|---|
| OpenAI / Gemini / OpenAI-compatible | Pass-through (the SDK speaks image_url natively) |
| Anthropic | Rewrites image_url → Anthropic image blocks (base64 source for data URLs, url source for http(s)) |
| Ollama (native API) | Extracts to Ollama's message-level images=[<bare base64>] field. Only inline base64 supported; http(s) URLs raise ValueError |
| HuggingFace (with processor) | Decodes to PIL images and passes via AutoProcessor; image_url blocks are rewritten to {"type": "image"} placeholders for the chat template |
| llama-cpp | Pass-through; requires loading an mmproj projector via the chat_handler= constructor kwarg |
Vision + agents¶
images= is forwarded through every agent and workflow. Only the initial turn carries images; continuation prompts (after a tool call) are text-only.
Agent.run("task", images=[...]): attaches to the initial turnChain.run("task", images=[...]): forwards to step 0Router.run("task", images=[...]): forwards to the dispatched handlerParallel.run("task", images=[...]): forwards to every workerEvaluatorOptimizer.run("task", images=[...]): forwards only to the initial generator turn
See also¶
- Notebook 04 - Vision
- Model matrix:
supports_visioncolumn