Handle vision input¶
Pass images=[...] to chat() on a vision-capable model. AIMU normalises every input to OpenAI content blocks internally and adapts them per-provider.
Basic usage¶
import aimu
client = aimu.client("openai:gpt-4o-mini")
client.chat("What's in this image?", images=["./cat.jpg"])
# Multiple images
client.chat("Compare these two photos.", images=["a.png", "b.png"])
Accepted image forms¶
Each item in images=[...] may be any of:
- File path as
strorpathlib.Path— read from disk and base64-encoded - Raw
bytes— encoded directly http(s)://URL — passed through to providers that support itdata:image/...;base64,...URL — passed through
from pathlib import Path
client.chat("describe", images=[
"./local.jpg", # path string
Path("./other.png"), # Path
b"\x89PNG\r\n\x1a\n...", # bytes
"https://example.com/cat.jpg", # URL
"data:image/png;base64,iVBORw0KGgoAAAANS...", # data URL
])
Vision-capable models¶
Each Model enum exposes a VISION_MODELS classproperty listing members where supports_vision=True. Passing images= to a non-vision model raises ValueError up front.
Common vision-capable models:
- OpenAI: GPT-4o, GPT-4.1, o3, o4-mini
- Anthropic: Claude 4.x (Sonnet, Opus, Haiku)
- Google Gemini: 1.5 Pro/Flash, 2.0 Flash, 2.5 Pro/Flash
- Ollama: Gemma 3, Gemma 4
- HuggingFace: Gemma 3, Gemma 4, Qwen 3.5/3.6 VL variants
See the model matrix for the full list.
Per-provider adaptation¶
AIMU keeps self.messages in OpenAI format always (image_url content blocks) and adapts at request time:
| Provider | Adaptation |
|---|---|
| OpenAI / Gemini / OpenAI-compatible | Pass-through (the SDK speaks image_url natively) |
| Anthropic | Rewrites image_url → Anthropic image blocks (base64 source for data URLs, url source for http(s)) |
| Ollama (native API) | Extracts to Ollama's message-level images=[<bare base64>] field. Only inline base64 supported; http(s) URLs raise ValueError |
| HuggingFace (with processor) | Decodes to PIL images and passes via AutoProcessor; image_url blocks are rewritten to {"type": "image"} placeholders for the chat template |
| llama-cpp | Pass-through; requires loading an mmproj projector via the chat_handler= constructor kwarg |
Vision + agents¶
images= is forwarded through every agent and workflow. Only the initial turn carries images; continuation prompts (after a tool call) are text-only.
Agent.run("task", images=[...])— attaches to the initial turnChain.run("task", images=[...])— forwards to step 0Router.run("task", images=[...])— forwards to the dispatched handlerParallel.run("task", images=[...])— forwards to every workerEvaluatorOptimizer.run("task", images=[...])— forwards only to the initial generator turn
See also¶
- Notebook 02 - Vision
- Model matrix —
supports_visioncolumn