Generate images¶
AIMU's image-generation surface parallels the text chat client. The shape mirrors text: a base ABC (BaseImageClient), a factory class (ImageClient), per-provider concrete clients, and one-line entry points (aimu.image_client() / aimu.generate_image()). Two providers ship today:
- HuggingFace diffusers for local generation (Stable Diffusion 1.5 / XL / 3.5, FLUX 1 dev / schnell, FLUX 2 Klein 4B / 9B)
- Google Nano Banana (gemini-2.5-flash-image) via the cloud Gemini API
Both providers support image-to-image generation via reference_image= on generate(); see Image-to-image below.
Install¶
Pick the providers you want. Both are opt-in extras:
pip install -e '.[hf]' # HuggingFace diffusers: local, GPU-friendly
pip install -e '.[google]' # Google Nano Banana: cloud, fast first call
pip install -e '.[hf,google]' # both
The [hf] extra covers both text and image: it already ships torch and transformers for the HuggingFace text client, and adds diffusers, safetensors, and Pillow for images. [google] adds google-genai and Pillow.
Set GOOGLE_API_KEY (env or .env) to use Nano Banana.
One-shot¶
import aimu
# Local diffusers (downloads weights on first call)
path = aimu.generate_image(
    "a watercolor of a fox in a snowy forest",
    model="hf:runwayml/stable-diffusion-v1-5",
    format="path",
)
# Cloud Nano Banana
img = aimu.generate_image(
    "a watercolor of a fox in a snowy forest",
    model="gemini:nano-banana",
    aspect_ratio="16:9",
)
format= selects how the image is returned: "pil" (default, PIL.Image), "path" (saves PNG, returns path string), "bytes" (raw PNG), or "data_url" (base64-encoded for inline embedding).
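To make the "bytes" and "data_url" formats concrete, here is a minimal stdlib-only sketch of how raw PNG bytes map to a data URL. The helper name png_bytes_to_data_url is hypothetical and is not part of AIMU; the library's own encoding may differ in detail.

```python
import base64


def png_bytes_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data: URL for inline embedding."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# The 8-byte PNG file signature stands in for real image bytes here.
png_signature = b"\x89PNG\r\n\x1a\n"
url = png_bytes_to_data_url(png_signature)

# The payload after the comma decodes back to the original bytes.
payload = url.split(",", 1)[1]
round_tripped = base64.b64decode(payload)
```

A string in this form can be dropped directly into an HTML img src attribute or a Markdown image link.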
Streaming progress¶
Pass stream=True to get an iterator of IMAGE_GENERATING chunks, one per denoising step (HF) or one start/done pair per image (Gemini, which has no per-step API):
import aimu
client = aimu.image_client("hf:runwayml/stable-diffusion-v1-5")
for chunk in client.generate("a fox", stream=True, num_inference_steps=25):
    c = chunk.content
    if c["final"]:
        print(f"\nDone, saved to {c['result']}" if c.get("result") else "\nDone")
    else:
        print(f"Step {c['step']}/{c['total_steps']}", end="\r")
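For testing a progress UI without loading a model, the consumer loop above can be driven by a stand-in generator. This is a sketch under assumptions: fake_progress_stream is a hypothetical helper, it yields plain dicts rather than AIMU chunk objects, and it assumes the final chunk arrives separately after the last step. Only the key names (step, total_steps, final, result) come from the example above.

```python
from typing import Iterator, Optional


def fake_progress_stream(total_steps: int, result: Optional[str] = None) -> Iterator[dict]:
    """Yield one progress dict per step, then a final dict (assumed shape)."""
    for step in range(1, total_steps + 1):
        yield {"step": step, "total_steps": total_steps, "final": False}
    # Assumption: the final chunk carries the result and repeats the last step.
    yield {"step": total_steps, "total_steps": total_steps, "final": True, "result": result}


chunks = list(fake_progress_stream(3, result="fox.png"))
for c in chunks:
    if c["final"]:
        print(f"Done, saved to {c['result']}" if c.get("result") else "Done")
    else:
        print(f"Step {c['step']}/{c['total_steps']}")
```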
Intermediate-image previews: preview_every=N¶
Decoding latents to a PIL image at every step adds ~50–200 ms per call on GPU, so previews are opt-in. Pass preview_every=N to decode every Nth step (and always the final):
for chunk in client.generate(
    "a fox", stream=True, num_inference_steps=25, preview_every=5
):
    c = chunk.content
    if c["image"] is not None:
        c["image"].save(f"preview_step_{c['step']}.png")
Gemini Nano Banana ignores preview_every; its cloud API has no intermediate latents.
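The "every Nth step, and always the final" rule can be sketched as a small function. This is an illustration only, assuming steps are numbered from 1; preview_steps is a hypothetical name, not an AIMU function.

```python
def preview_steps(total_steps: int, preview_every: int) -> list[int]:
    """Return the step numbers that would carry a decoded preview image."""
    if preview_every < 1:
        raise ValueError("preview_every must be a positive integer")
    steps = [s for s in range(1, total_steps + 1) if s % preview_every == 0]
    # The final step always gets an image, even when it is not a multiple of N.
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    return steps


schedule = preview_steps(25, 5)
```

With 25 steps and preview_every=5, only five of the 25 steps pay the decode cost.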
Streaming the built-in tool through an agent¶
The built-in generate_image tool is itself a generator. When dispatched through agent.run(stream=True), its progress chunks flow through the agent's own stream, with no side channel:
from aimu.agents import Agent
from aimu.tools import builtin
agent = Agent(text_client, tools=[builtin.generate_image])
for chunk in agent.run("draw me a fox", stream=True):
    if chunk.is_image_progress():
        c = chunk.content
        print(f" {c['step']}/{c['total_steps']}")
    elif chunk.is_tool_call():
        print(f" Done: {chunk.content['response']}")
    elif chunk.is_text():
        print(chunk.content, end="")
To opt into intermediate previews from the agent path, build a bound tool with make_image_tool(client, preview_every=N) and pass that to the agent instead of builtin.generate_image.
Direct client: reuse weights / API client¶
A fresh client per call reloads weights for HuggingFace; for Nano Banana it just rebuilds the API client. For repeated generation, build once and reuse:
from aimu import image_client
client = image_client("hf:stabilityai/stable-diffusion-xl-base-1.0")
for prompt in ["a fox", "a deer", "an owl"]:
    img = client.generate(prompt, width=1024, height=1024)
    img.save(f"{prompt}.png")
You can also pass an enum member for IDE autocomplete when browsing available models:
from aimu import image_client
from aimu.models import HuggingFaceImageModel, GeminiImageModel
client = image_client(HuggingFaceImageModel.SDXL_BASE)
client = image_client(GeminiImageModel.NANO_BANANA)
Per-provider knobs differ:
- HuggingFace: negative_prompt, width, height, num_inference_steps, guidance_scale, seed, reference_image, strength. Defaults come from the HuggingFaceImageSpec (e.g. SD 1.5 → 25 steps, 512×512, guidance 7.5; FLUX schnell / FLUX 2 Klein → 4 steps, 1024×1024).
- Gemini Nano Banana: aspect_ratio (e.g. "1:1", "16:9"), image_size, reference_image. No seed; the API doesn't expose one.
Both share: num_images=N, format=, output_dir=, reference_image= (see Image-to-image).
Prompt length¶
CLIP-based models (SD 1.5, SDXL) cap prompts at 77 tokens and silently truncate the rest. Models with a T5 text encoder accept far more: SD 3.5 ≈ 256 tokens, FLUX ≈ 512. Each spec records this as max_prompt_tokens, and the client exposes it:
image_client(HuggingFaceImageModel.SD_3_5_MEDIUM).max_prompt_tokens # 256
image_client(HuggingFaceImageModel.SDXL_BASE).max_prompt_tokens # 77
image_client(GeminiImageModel.NANO_BANANA).max_prompt_tokens # None (uncapped cloud)
Use it to size prompts to the model, e.g. a summarization step that condenses a long description to fit the budget (see the hotdog scripts in examples/image-refinement/).
Note: ad-hoc "hf:<repo>" strings get the conservative 77-token default (the catalog isn't consulted). Pass the enum member to pick up the model's real budget.
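As a rough illustration of sizing a prompt to max_prompt_tokens, the sketch below trims by whitespace-separated words. This is a loudly labeled approximation: real CLIP and T5 tokenizers split text differently and usually produce more tokens than words, so treat the word count as an optimistic proxy. fit_prompt is a hypothetical helper, not part of AIMU.

```python
from typing import Optional


def fit_prompt(prompt: str, max_prompt_tokens: Optional[int]) -> str:
    """Trim a prompt to a budget, approximating one token per word."""
    if max_prompt_tokens is None:
        # None means no cap, as reported for the cloud model above.
        return prompt
    words = prompt.split()
    if len(words) <= max_prompt_tokens:
        return prompt
    # Keep the leading words; anything past the budget would be dropped anyway.
    return " ".join(words[:max_prompt_tokens])


long_prompt = " ".join(f"word{i}" for i in range(100))
trimmed = fit_prompt(long_prompt, 77)
```

Hard truncation keeps only the start of the description; a summarization step, as mentioned above, preserves more of the meaning.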
GPU placement (HuggingFace)¶
Placement is automatic and memory-aware. On first generation the client measures the loaded pipeline's size and the free memory on each visible GPU (so it accounts for other processes already using the cards, e.g. a local LLM server), then picks the cheapest strategy that fits:
- Pin to the freest GPU when the whole pipeline fits there (fastest).
- Model CPU offload when it doesn't fit one GPU but the largest single component does (components stream to GPU as needed).
- Sequential CPU offload when memory is very tight (per-layer streaming; slowest, fits almost anything).
This means large models like SD 3.5 or FLUX load without OOM even when a GPU is partly occupied. The decision is logged at INFO. CPU offload requires accelerate (pip install accelerate); without it, an oversized model is pinned to the freest GPU with a warning.
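The decision order described above can be sketched as a pure function. This is not the client's actual implementation: the function name, the returned strings, and the exact comparisons are assumptions for illustration, and a real planner would likely reserve some headroom rather than compare raw sizes.

```python
from typing import Sequence


def choose_placement(
    pipeline_bytes: int,
    largest_component_bytes: int,
    free_bytes_per_gpu: Sequence[int],
    has_accelerate: bool = True,
) -> str:
    """Pick the cheapest placement strategy that fits (illustrative only)."""
    if not free_bytes_per_gpu:
        raise ValueError("no visible GPUs")
    # Index of the GPU with the most free memory.
    freest = max(range(len(free_bytes_per_gpu)), key=lambda i: free_bytes_per_gpu[i])
    free = free_bytes_per_gpu[freest]
    # Whole pipeline fits: pin it. Without accelerate, pinning is the only
    # option, so an oversized model is pinned too (the client warns in that case).
    if pipeline_bytes <= free or not has_accelerate:
        return f"pin:cuda:{freest}"
    # Pipeline is too big, but each component fits on its own.
    if largest_component_bytes <= free:
        return "model_cpu_offload"
    # Very tight memory: stream layer by layer.
    return "sequential_cpu_offload"


GiB = 1024**3
plan = choose_placement(6 * GiB, 4 * GiB, [8 * GiB, 20 * GiB])
```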
Override the automatic plan through model_kwargs:
# Pin the whole pipeline to a specific GPU
client = image_client(HuggingFaceImageModel.SDXL_BASE, model_kwargs={"device": "cuda:1"})
# Hand placement to diffusers/accelerate (e.g. shard across GPUs)
client = image_client(HuggingFaceImageModel.FLUX_1_DEV, model_kwargs={"device_map": "balanced"})
Image-to-image¶
Pass reference_image= to any generate() call to use an existing image as the starting point for generation. Accepted input forms match vision input: a file path string, pathlib.Path, raw bytes, data:image/... URL, http(s):// URL, or a PIL Image.
from aimu import image_client
client = image_client("hf:stabilityai/stable-diffusion-xl-base-1.0")
# Generate a variation of an existing image
img = client.generate("a cyberpunk city, neon lights", reference_image="./photo.jpg")
# strength= controls how much the output deviates from the reference
# 0.0 = nearly identical, 1.0 = ignore it entirely; default 0.75
img = client.generate("a watercolor painting", reference_image="./photo.jpg", strength=0.5)
The img2img pipeline is loaded lazily on the first call via from_pipe(), which derives it from the already-loaded txt2img pipeline, sharing all weights (UNet, VAE, encoders) at no extra VRAM cost. width and height are ignored in img2img mode; output size comes from the reference image.
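One practical consequence of strength is worth a sketch: diffusers img2img pipelines for Stable Diffusion typically skip the early part of the noise schedule, so only a fraction of num_inference_steps actually runs. The function below mirrors that arithmetic as I understand it from StableDiffusionImg2ImgPipeline; other pipelines may compute it differently, and effective_steps is a hypothetical helper name.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps actually run in img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # The reference image is noised to roughly `strength` of the schedule,
    # and only the remaining steps are denoised.
    return min(int(num_inference_steps * strength), num_inference_steps)


default_run = effective_steps(25, 0.75)
```

Under this assumption, lowering strength makes img2img both closer to the reference and faster, since fewer steps run.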
For Gemini, reference_image triggers image editing: the reference and the prompt are sent together as a multipart request, and Nano Banana returns an edited version of the image.
# Gemini image editing
client = image_client("gemini:nano-banana")
img = client.generate("make it snowy", reference_image="./summer_scene.jpg")
FLUX.2 Klein: native img2img¶
FLUX.2 Klein uses a unified pipeline (Flux2KleinPipeline) that handles both txt2img and img2img in the same call. It does not use a strength parameter; it conditions on the reference image directly:
from aimu import image_client
client = image_client("hf:black-forest-labs/FLUX.2-klein-4B") # 4-step distilled
# txt2img
img = client.generate("a cat in a sunlit garden")
# img2img: same client, same loaded weights
img = client.generate("add snow to the scene", reference_image="./cat.jpg")
FLUX.2 Klein generates in 4 steps (like FLUX.1 Schnell) with improved text rendering, better hand/face quality, and native img2img without a separate pipeline load.
As an agent tool¶
The built-in generate_image tool lets any chat LLM call image generation when the user asks. The LLM decides when to call it; the tool saves a PNG and returns the path so it appears in the conversation history.
import os
os.environ["AIMU_IMAGE_MODEL"] = "gemini:nano-banana" # or any "hf:..." string
from aimu import client
from aimu.agents import Agent
from aimu.tools import builtin
text_client = client("anthropic:claude-sonnet-4-6")
agent = Agent(
    text_client,
    system_message=(
        "You can generate images with the generate_image tool. When the user "
        "asks for an image, write a vivid prompt and call the tool. Tell the "
        "user where the image was saved."
    ),
    tools=[builtin.generate_image],
)
response = agent.run("Please make me a watercolor of a fox in a snowy forest.")
The tool uses a lazy module-level singleton: the first call constructs an image client based on AIMU_IMAGE_MODEL (default: SDXL base), and subsequent calls reuse it.
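The lazy-singleton pattern described here can be sketched in a few lines. This is a generic illustration, not AIMU's source: get_image_client, the injected factory, and the exact default model string are assumptions (the default is spelled with the SDXL base repo id used elsewhere on this page).

```python
import os
from typing import Callable, Optional

# Assumed spelling of the "SDXL base" default.
DEFAULT_MODEL = "hf:stabilityai/stable-diffusion-xl-base-1.0"

_client: Optional[object] = None  # module-level cache, empty until first use


def get_image_client(factory: Callable[[str], object]) -> object:
    """Build the client on first call, then return the cached instance."""
    global _client
    if _client is None:
        model = os.environ.get("AIMU_IMAGE_MODEL", DEFAULT_MODEL)
        _client = factory(model)
    return _client


built = []


def fake_factory(model: str) -> object:
    """Stand-in for an expensive client constructor; records each build."""
    built.append(model)
    return object()


first = get_image_client(fake_factory)
second = get_image_client(fake_factory)
```

Because the environment variable is read on the first call only, it needs to be set before the tool is first used, which is why the example above sets it at the top.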
Per-agent override: make_image_tool¶
When you want a different model from the global singleton (or several agents in one process shouldn't share a pipeline), bind a tool to a specific client:
from aimu import image_client
from aimu.agents import Agent
from aimu.tools.builtin import make_image_tool
fast = image_client("hf:black-forest-labs/FLUX.1-schnell") # 4-step model
fast_tool = make_image_tool(fast, preview_every=2, num_inference_steps=20)
agent = Agent(text_client, tools=[fast_tool])
make_image_tool() returns a fresh @tool-decorated callable bound to the supplied client; the singleton stays untouched. Pass preview_every=N to opt into intermediate latent previews and num_inference_steps=N to override the model's default denoising step count (HuggingFace diffusers only; ignored by Gemini).
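The binding idea can be illustrated with a plain closure. This is a simplified sketch, not make_image_tool itself: it omits the @tool decoration and the streaming behavior, and make_bound_tool and FakeClient are hypothetical names used only for the demonstration.

```python
from typing import Any, Callable


def make_bound_tool(client: Any, **defaults: Any) -> Callable[[str], Any]:
    """Return a fresh callable that always uses the supplied client."""

    def generate_image(prompt: str) -> Any:
        # The client and defaults are captured by the closure, so no
        # module-level state is read or modified.
        return client.generate(prompt, **defaults)

    return generate_image


class FakeClient:
    """Stand-in client that echoes what it was asked to generate."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str, **kwargs: Any) -> tuple:
        return (self.name, prompt, kwargs)


fast_tool = make_bound_tool(FakeClient("fast"), num_inference_steps=4)
slow_tool = make_bound_tool(FakeClient("slow"))
fast_result = fast_tool("a fox")
slow_result = slow_tool("a fox")
```

Each call to the factory produces an independent callable, so two agents can hold tools bound to different clients in the same process.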
Skill integration (deeper, optional)¶
For SkillAgent users, drop a SKILL.md under .agents/skills/image-generation/ with prompt-engineering guidance. Skills are filesystem-discovered (not shipped with AIMU); see notebooks/18 - Image Generation.ipynb for a copyable example.
Async¶
The async surface mirrors the sync surface one-for-one. Because image providers either load weights in-process (HuggingFace) or hold an SDK client (Gemini), the factory follows the wrap pattern established for HF / LlamaCpp text clients: build a sync client first and pass it to aio.image_client():
import asyncio
from aimu import aio, image_client
sync = image_client("hf:runwayml/stable-diffusion-v1-5")
async_client = aio.image_client(sync)
async def make_two():
    a, b = await asyncio.gather(
        async_client.generate("a watercolor of a fox", width=512, height=512),
        async_client.generate("a watercolor of a deer", width=512, height=512),
    )
    return a, b
Async generate() routes through asyncio.to_thread so the event loop stays free during inference. On a single GPU, sibling gather-ed HF calls don't truly overlap on the device (CUDA streams + the GIL serialise), but the event loop stays free for other coroutines.
For Gemini, the cloud SDK could be called natively async; for now it goes through the same thread wrapper to keep the shape consistent with the in-process clients, and native async is left as a follow-up.
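The thread-wrap pattern itself needs only the standard library. The sketch below uses a stand-in blocking function in place of a real image client; blocking_generate and generate_async are hypothetical names, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> str:
    """Stand-in for a synchronous, slow generate() call."""
    time.sleep(0.05)
    return f"image for {prompt!r}"


async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_generate, prompt)


async def make_two() -> list:
    # gather preserves argument order in its results.
    return await asyncio.gather(
        generate_async("a fox"),
        generate_async("a deer"),
    )


results = asyncio.run(make_two())
```

With a sleep-based stand-in the two calls overlap; with a real GPU-bound pipeline they would largely serialise on the device, as noted above.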
See also¶
- Reference: provider matrix: image vs text provider tables.
- Reference: env vars: GOOGLE_API_KEY, AIMU_IMAGE_MODEL.
- Notebook 15: Image Generation: runnable end-to-end demo.
- Add a custom tool: build your own image-generation tool variant.
- Generate audio: the audio surface, which mirrors this one.