Generate images¶

AIMU has a parallel image-generation surface to the text chat client. The shape mirrors text: a base ABC (BaseImageClient), a factory class (ImageClient), per-provider concrete clients, and one-line entry points (aimu.image_client() / aimu.generate_image()). Two providers ship today:

HuggingFace diffusers for local generation (Stable Diffusion 1.5 / XL / 3.5, FLUX 1 dev / schnell, FLUX 2 Klein 4B / 9B)
Google Nano Banana (gemini-2.5-flash-image) via the cloud Gemini API

Both providers support image-to-image generation via reference_image= on generate(); see Image-to-image below.

Install¶

Pick the providers you want. Both are opt-in extras:

pip install -e '.[hf]'         # HuggingFace diffusers: local, GPU-friendly
pip install -e '.[google]'     # Google Nano Banana: cloud, fast first call
pip install -e '.[hf,google]'  # both

The [hf] extra brings text + image (it already ships torch and transformers for the HuggingFace text client, plus diffusers, safetensors, and Pillow for image). [google] adds google-genai and Pillow.

Set GOOGLE_API_KEY (env or .env) to use Nano Banana.

One-shot¶

import aimu

# Local diffusers (downloads weights on first call)
path = aimu.generate_image(
    "a watercolor of a fox in a snowy forest",
    model="hf:runwayml/stable-diffusion-v1-5",
    format="path",
)

# Cloud Nano Banana
img = aimu.generate_image(
    "a watercolor of a fox in a snowy forest",
    model="gemini:nano-banana",
    aspect_ratio="16:9",
)

format= selects how the image is returned: "pil" (default, PIL.Image), "path" (saves PNG, returns path string), "bytes" (raw PNG), or "data_url" (base64-encoded for inline embedding).

Streaming progress¶

Pass stream=True to get an iterator of IMAGE_GENERATING chunks, one per denoising step (HF) or one start/done pair per image (Gemini, which has no per-step API):

import aimu

client = aimu.image_client("hf:runwayml/stable-diffusion-v1-5")

for chunk in client.generate("a fox", stream=True, num_inference_steps=25):
    c = chunk.content
    if c["final"]:
        print(f"\nDone, saved to {c['result']}" if c.get("result") else "\nDone")
    else:
        print(f"Step {c['step']}/{c['total_steps']}", end="\r")

Intermediate-image previews: `preview_every=N`¶

Decoding latents to a PIL image at every step adds ~50–200 ms per call on GPU, so previews are opt-in. Pass preview_every=N to decode every Nth step (and always the final):

for chunk in client.generate(
    "a fox", stream=True, num_inference_steps=25, preview_every=5
):
    c = chunk.content
    if c["image"] is not None:
        c["image"].save(f"preview_step_{c['step']}.png")

Gemini Nano Banana ignores preview_every; its cloud API has no intermediate latents.

Streaming the built-in tool through an agent¶

The built-in generate_image tool is itself a generator. When dispatched through agent.run(stream=True), its progress chunks flow through the agent's own stream, with no side channel:

from aimu.agents import Agent
from aimu.tools import builtin

agent = Agent(text_client, tools=[builtin.generate_image])

for chunk in agent.run("draw me a fox", stream=True):
    if chunk.is_image_progress():
        c = chunk.content
        print(f"  {c['step']}/{c['total_steps']}")
    elif chunk.is_tool_call():
        print(f"  Done: {chunk.content['response']}")
    elif chunk.is_text():
        print(chunk.content, end="")

To opt into intermediate previews from the agent path, build a bound tool with make_image_tool(client, preview_every=N) and pass that to the agent instead of builtin.generate_image.

Direct client: reuse weights / API client¶

A fresh client per call reloads weights for HuggingFace; for Nano Banana it just rebuilds the API client. For repeated generation, build once and reuse:

from aimu import image_client

client = image_client("hf:stabilityai/stable-diffusion-xl-base-1.0")
for prompt in ["a fox", "a deer", "an owl"]:
    img = client.generate(prompt, width=1024, height=1024)
    img.save(f"{prompt}.png")

You can also pass an enum member for IDE autocomplete when browsing available models:

from aimu import image_client
from aimu.models import HuggingFaceImageModel, GeminiImageModel

client = image_client(HuggingFaceImageModel.SDXL_BASE)
client = image_client(GeminiImageModel.NANO_BANANA)

Per-provider knobs differ:

HuggingFace: negative_prompt, width, height, num_inference_steps, guidance_scale, seed, reference_image, strength. Defaults come from the HuggingFaceImageSpec (e.g. SD 1.5 → 25 steps, 512×512, guidance 7.5; FLUX schnell / FLUX 2 Klein → 4 steps, 1024×1024).
Gemini Nano Banana: aspect_ratio (e.g. "1:1", "16:9"), image_size, reference_image. No seed; the API doesn't expose one.

Both share: num_images=N, format=, output_dir=, reference_image= (see Image-to-image).

Prompt length¶

CLIP-based models (SD 1.5, SDXL) cap prompts at 77 tokens and silently truncate the rest. Models with a T5 text encoder accept far more: SD 3.5 ≈ 256 tokens, FLUX ≈ 512. Each spec records this as max_prompt_tokens, and the client exposes it:

image_client(HuggingFaceImageModel.SD_3_5_MEDIUM).max_prompt_tokens  # 256
image_client(HuggingFaceImageModel.SDXL_BASE).max_prompt_tokens      # 77
image_client(GeminiImageModel.NANO_BANANA).max_prompt_tokens         # None (uncapped cloud)

Use it to size prompts to the model, e.g. a summarization step that condenses a long description to fit the budget (see the hotdog scripts in examples/image-refinement/).

Note: ad-hoc "hf:<repo>" strings get the conservative 77-token default (the catalog isn't consulted). Pass the enum member to pick up the model's real budget.

GPU placement (HuggingFace)¶

Placement is automatic and memory-aware. On first generation the client measures the loaded pipeline's size and the free memory on each visible GPU (so it accounts for other processes already using the cards, e.g. a local LLM server) then picks the cheapest strategy that fits:

Pin to the freest GPU when the whole pipeline fits there (fastest).
Model CPU offload when it doesn't fit one GPU but the largest single component does (components stream to GPU as needed).
Sequential CPU offload when memory is very tight (per-layer streaming; slowest, fits almost anything).

This means large models like SD 3.5 or FLUX load without OOM even when a GPU is partly occupied. The decision is logged at INFO. CPU offload requires accelerate (pip install accelerate); without it, an oversized model is pinned to the freest GPU with a warning.

Override the automatic plan through model_kwargs:

# Pin the whole pipeline to a specific GPU
client = image_client(HuggingFaceImageModel.SDXL_BASE, model_kwargs={"device": "cuda:1"})

# Hand placement to diffusers/accelerate (e.g. shard across GPUs)
client = image_client(HuggingFaceImageModel.FLUX_1_DEV, model_kwargs={"device_map": "balanced"})

Image-to-image¶

Pass reference_image= to any generate() call to use an existing image as the starting point for generation. Accepted input forms match vision input: a file path string, pathlib.Path, raw bytes, data:image/... URL, http(s):// URL, or a PIL Image.

from aimu import image_client

client = image_client("hf:stabilityai/stable-diffusion-xl-base-1.0")

# Generate a variation of an existing image
img = client.generate("a cyberpunk city, neon lights", reference_image="./photo.jpg")

# strength= controls how much the output deviates from the reference
# 0.0 = nearly identical, 1.0 = ignore it entirely; default 0.75
img = client.generate("a watercolor painting", reference_image="./photo.jpg", strength=0.5)

The img2img pipeline is loaded lazily on the first call via from_pipe(), which derives it from the already-loaded txt2img pipeline, sharing all weights (UNet, VAE, encoders) at no extra VRAM cost. width and height are ignored in img2img mode; output size comes from the reference image.

For Gemini, reference_image triggers image editing: the reference and the prompt are sent together as a multipart request, and Nano Banana returns an edited version of the image.

# Gemini image editing
client = image_client("gemini:nano-banana")
img = client.generate("make it snowy", reference_image="./summer_scene.jpg")

FLUX.2 Klein: native img2img¶

FLUX.2 Klein uses a unified pipeline (Flux2KleinPipeline) that handles both txt2img and img2img in the same call. It does not use a strength parameter; it conditions on the reference image directly:

from aimu import image_client

client = image_client("hf:black-forest-labs/FLUX.2-klein-4B")  # 4-step distilled

# txt2img
img = client.generate("a cat in a sunlit garden")

# img2img: same client, same loaded weights
img = client.generate("add snow to the scene", reference_image="./cat.jpg")

FLUX.2 Klein generates in 4 steps (like FLUX.1 Schnell) with improved text rendering, better hand/face quality, and native img2img without a separate pipeline load.

As an agent tool¶

The built-in generate_image tool lets any chat LLM call image generation when the user asks. The LLM decides when to call it; the tool saves a PNG and returns the path so it appears in the conversation history.

import os
os.environ["AIMU_IMAGE_MODEL"] = "gemini:nano-banana"  # or any "hf:..." string

from aimu import client
from aimu.agents import Agent
from aimu.tools import builtin

text_client = client("anthropic:claude-sonnet-4-6")
agent = Agent(
    text_client,
    system_message=(
        "You can generate images with the generate_image tool. When the user "
        "asks for an image, write a vivid prompt and call the tool. Tell the "
        "user where the image was saved."
    ),
    tools=[builtin.generate_image],
)

response = agent.run("Please make me a watercolor of a fox in a snowy forest.")

The tool uses a lazy module-level singleton: the first call constructs an image client based on AIMU_IMAGE_MODEL (default: SDXL base), and subsequent calls reuse it.

Per-agent override: `make_image_tool`¶

When you want a different model from the global singleton (or several agents in one process shouldn't share a pipeline), bind a tool to a specific client:

from aimu import image_client
from aimu.tools.builtin import make_image_tool

fast = image_client("hf:black-forest-labs/FLUX.1-schnell")  # 4-step model
fast_tool = make_image_tool(fast, preview_every=2, num_inference_steps=20)

agent = Agent(text_client, tools=[fast_tool])

make_image_tool() returns a fresh @tool-decorated callable bound to the supplied client; the singleton stays untouched. Pass preview_every=N to opt into intermediate latent previews and num_inference_steps=N to override the model's default denoising step count (HuggingFace diffusers only; ignored by Gemini).

Skill integration (deeper, optional)¶

For SkillAgent users, drop a SKILL.md under .agents/skills/image-generation/ with prompt-engineering guidance. Skills are filesystem-discovered (not shipped with AIMU); see notebooks/18-image-generation.qmd for a copyable example.

Async¶

The async surface mirrors the sync surface one-for-one. Because image providers either load weights in-process (HuggingFace) or hold an SDK client (Gemini), the factory follows the wrap pattern established for HF / LlamaCpp text clients: build a sync client first and pass it to aio.image_client():

import asyncio
from aimu import aio, image_client

sync = image_client("hf:runwayml/stable-diffusion-v1-5")
async_client = aio.image_client(sync)

async def make_two():
    a, b = await asyncio.gather(
        async_client.generate("a watercolor of a fox", width=512, height=512),
        async_client.generate("a watercolor of a deer", width=512, height=512),
    )
    return a, b

Async generate() routes through asyncio.to_thread so the event loop stays free during inference. On a single GPU, sibling gather-ed HF calls don't truly overlap on the device (CUDA streams + the GIL serialise), but the event loop stays free for other coroutines.

For Gemini, the cloud SDK could be called natively async; that is kept as a follow-up for shape consistency with the in-process wrap pattern.