Switch providers¶

Every AIMU client implements the same BaseModelClient interface, so swapping backends is a one-line change. ModelClient is the factory; it accepts either a Model enum member or a "provider:model_id" string.

Use a model string¶

import aimu

client = aimu.client("anthropic:claude-sonnet-4-6")
client = aimu.client("ollama:qwen3.5:9b")
client = aimu.client("openai:gpt-4o-mini")
client = aimu.client("gemini:gemini-2.5-flash")

Model strings have the form "provider:model_id". Colons inside the model id (e.g. Ollama's qwen3.5:9b) are preserved.
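The splitting rule can be sketched as follows. This is a minimal illustration, assuming the provider key is everything before the first colon; split_model_string is a hypothetical helper, not part of AIMU.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split "provider:model_id" on the first colon only (hypothetical helper)."""
    provider, sep, model_id = model.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model_id', got {model!r}")
    return provider, model_id


# Later colons stay inside the model id.
print(split_model_string("ollama:qwen3.5:9b"))            # ('ollama', 'qwen3.5:9b')
print(split_model_string("anthropic:claude-sonnet-4-6"))  # ('anthropic', 'claude-sonnet-4-6')
```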

Point at a remote or custom OpenAI-compatible server¶

The OpenAI-compatible local-server providers default to localhost (see the provider keys table). To reach one on another host or port, append @<base_url> to the model string:

client = aimu.client("llamaserver:qwen3-8b.gguf@http://gpu-box:8080/v1")
client = aimu.client("vllm:Qwen/Qwen3-8B@http://gpu-box:8000/v1")

This is the string-form equivalent of the base_url= kwarg (see Pass provider-specific kwargs); use whichever fits how the model is configured.

Run a model not in the catalog¶

AIMU normally requires the id to name a model it ships a spec for, so it never runs with guessed capabilities. For an OpenAI-compatible server you can declare the capabilities inline instead by appending ;<flags>, a comma-separated list drawn from tools, thinking, vision, audio, and structured. The id then resolves to an ad-hoc model with exactly those capabilities:

# a custom fine-tune served by llama-server on another host
client = aimu.client("llamaserver:my-finetune.gguf@http://gpu-box:8080/v1;tools,thinking")

# any OpenAI-compatible server, via the generic prefix (an endpoint is required)
client = aimu.client("openai-compat:my-model@http://gpu-box:9000/v1;tools")

Notes:

@<base_url> is accepted only by the OpenAI-compatible local-server providers (llamaserver, lmstudio, vllm, hf-openai, sglang, ollama-openai) and the generic openai-compat prefix. Passing it to a cloud or in-process provider raises ValueError.
The generic openai-compat prefix always requires @<base_url> (it has no default) and always takes an ad-hoc id.
Flags apply only to ad-hoc ids, and an omitted flag defaults to False, so declare every capability your model actually has (;tools and so on). Passing flags with a catalog id raises ValueError, because the catalog already declares its capabilities.
No authentication is added; the endpoint is assumed unauthenticated (api_key is unset).
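The extended grammar described in this section, provider:model_id[@base_url][;flags], can be sketched as a small parser. This is an illustration under stated assumptions, not AIMU's implementation: parse_model_string and ParsedModel are hypothetical names, rejecting unknown flag names is an assumption, and the catalog check (flags on a known id) is left out because it needs the model catalog.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Providers that accept @<base_url>, as listed in the notes above.
BASE_URL_PROVIDERS = {
    "llamaserver", "lmstudio", "vllm", "hf-openai",
    "sglang", "ollama-openai", "openai-compat",
}
KNOWN_FLAGS = ("tools", "thinking", "vision", "audio", "structured")


@dataclass
class ParsedModel:  # hypothetical container for the parsed parts
    provider: str
    model_id: str
    base_url: Optional[str]
    capabilities: Dict[str, bool]


def parse_model_string(model: str) -> ParsedModel:
    # The provider key ends at the first colon; later colons belong to the id or URL.
    provider, sep, rest = model.partition(":")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider:model_id', got {model!r}")

    # Flags, if any, follow the first ';'.
    rest, _, flag_text = rest.partition(";")
    # The base URL, if any, follows the first '@'.
    model_id, at, base_url = rest.partition("@")

    if at and provider not in BASE_URL_PROVIDERS:
        raise ValueError(f"provider {provider!r} does not accept an @base_url")
    if provider == "openai-compat" and not base_url:
        raise ValueError("openai-compat has no default endpoint; append @<base_url>")

    flags = [f.strip() for f in flag_text.split(",") if f.strip()]
    unknown = sorted(set(flags) - set(KNOWN_FLAGS))
    if unknown:  # assumption: unknown flag names are an error
        raise ValueError(f"unknown capability flags: {unknown}")

    # An omitted flag defaults to False.
    capabilities = {name: name in flags for name in KNOWN_FLAGS}
    return ParsedModel(provider, model_id, base_url or None, capabilities)


parsed = parse_model_string(
    "llamaserver:my-finetune.gguf@http://gpu-box:8080/v1;tools,thinking"
)
print(parsed.model_id, parsed.base_url, parsed.capabilities)
```

Splitting on the first colon, then the first semicolon, then the first at-sign keeps the colons inside both the model id and the URL intact.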

Use an enum member¶

For IDE autocomplete and type checking:

from aimu.models import ModelClient, OllamaModel, AnthropicModel

client = ModelClient(OllamaModel.QWEN_3_5_9B)
client = ModelClient(AnthropicModel.CLAUDE_SONNET_4_6)

Resolve a name, string, or enum uniformly¶

When you accept a model from a CLI flag or config and don't know which form it'll arrive in, resolve_model_enum (text) and resolve_image_model_enum (image) normalise all three to an enum member:

import aimu
from aimu.models import AnthropicModel

aimu.resolve_model_enum(AnthropicModel.CLAUDE_SONNET_4_6)   # enum member → returned unchanged
aimu.resolve_model_enum("anthropic:claude-sonnet-4-6")      # "provider:model_id" string
aimu.resolve_model_enum("CLAUDE_SONNET_4_6")                # bare enum-member name

aimu.resolve_image_model_enum("FLUX_2_KLEIN_4B")           # same, for image models
aimu.resolve_image_model_enum("gemini:nano-banana")

Pass the result straight to aimu.client(...) / aimu.image_client(...) (both also accept an enum member):

client = aimu.client(aimu.resolve_model_enum(args.model))

A bare text name is often ambiguous because the same model ships under several providers (e.g. QWEN_3_8B is an Ollama, HuggingFace, and OpenAI-compat model). resolve_model_enum breaks the tie by preferring a provider where the model is already available locally (running Ollama → cached HuggingFace → reachable local server, tool-capable first) and logs the choice at WARNING. If the model isn't available anywhere, it raises ValueError listing the "provider:model_id" options, so pin the provider with the string form when you need a specific one. (resolve_image_model_enum has no notion of local availability and raises on the rare ambiguous name.)
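The tie-break can be pictured with a small standalone sketch. Everything here is hypothetical (Candidate, pick_candidate, the tier labels), and it assumes that the availability tier is compared first and tool capability only breaks ties within a tier; AIMU's actual ordering may differ.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tier labels mirroring the order given in the text.
TIER_ORDER = ("ollama", "hf", "local-server")


@dataclass(frozen=True)
class Candidate:
    provider: str
    model_id: str
    tier: str          # one of TIER_ORDER
    available: bool    # already usable locally
    tool_capable: bool


def pick_candidate(name: str, candidates: List[Candidate]) -> Candidate:
    usable = [c for c in candidates if c.available]
    if not usable:
        options = [f"{c.provider}:{c.model_id}" for c in candidates]
        raise ValueError(f"{name!r} is ambiguous and not available locally; pin one of {options}")
    # Assumption: lower tier index wins, then tool-capable models within the tier.
    return min(usable, key=lambda c: (TIER_ORDER.index(c.tier), not c.tool_capable))


candidates = [
    Candidate("ollama", "qwen3:8b", "ollama", available=False, tool_capable=True),
    Candidate("hf", "Qwen/Qwen3-8B", "hf", available=True, tool_capable=True),
    Candidate("llamaserver", "qwen3-8b.gguf", "local-server", available=True, tool_capable=True),
]
print(pick_candidate("QWEN_3_8B", candidates).provider)  # hf
```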

Use whatever model is already available locally¶

Omit model= entirely and AIMU resolves a default: the AIMU_LANGUAGE_MODEL env var if set, otherwise the first already-available local model (running Ollama → cached HuggingFace → reachable local OpenAI-compat server, tool-capable preferred). A cloud provider is never auto-selected and weights are never downloaded.

client = aimu.client()                          # auto-resolved default (logs the pick at WARNING)

export AIMU_LANGUAGE_MODEL="ollama:qwen3.5:9b"  # pin the default deterministically

To inspect or choose among what's available instead of taking the auto-pick:

aimu.available_text_models()            # list[Model]: locally loadable models, provider-priority order
aimu.resolve_default_text_model_enum()  # the single auto-pick, as an enum member
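The lookup order for the default (environment variable first, then the first locally available model) can be sketched without AIMU. resolve_default is a hypothetical helper; in real code the list would come from aimu.available_text_models(), which returns enum members rather than the strings used here, and raising LookupError when nothing is available is an assumption.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_default(available: Sequence[str],
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the pinned model if set, else the first available local model."""
    env = os.environ if env is None else env
    pinned = env.get("AIMU_LANGUAGE_MODEL")
    if pinned:
        return pinned
    if available:
        return available[0]  # list is assumed to be in provider-priority order
    # Assumption: the behavior when nothing is available is not specified here.
    raise LookupError("no local model available and AIMU_LANGUAGE_MODEL is not set")


local_models = ["ollama:qwen3.5:9b", "hf:Qwen/Qwen3-8B"]
print(resolve_default(local_models, env={}))
print(resolve_default(local_models, env={"AIMU_LANGUAGE_MODEL": "vllm:Qwen/Qwen3-8B"}))
```

Passing env explicitly keeps the helper easy to test; production code would rely on the real environment.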

Provider keys¶

Provider key	Install extra	API key env var (notes)
`ollama`	`aimu[ollama]`	None
`hf`	`aimu[hf]`	None
`llamacpp`	`aimu[llamacpp]`	None (`model_path=` required)
`anthropic`	`aimu[anthropic]`	`ANTHROPIC_API_KEY`
`openai`	`aimu[openai_compat]`	`OPENAI_API_KEY`
`gemini`	`aimu[openai_compat]`	`GOOGLE_API_KEY`
`lmstudio`	`aimu[openai_compat]`	None (localhost:1234)
`ollama-openai`	`aimu[openai_compat]`	None (localhost:11434)
`hf-openai`	`aimu[openai_compat]`	None (localhost:8000)
`vllm`	`aimu[openai_compat]`	None (localhost:8000)
`llamaserver`	`aimu[openai_compat]`	None (localhost:8080)
`sglang`	`aimu[openai_compat]`	None (localhost:30000)

See the provider matrix for full details.
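The table above lends itself to a preflight check before constructing a client. The sketch below copies the env var names and default endpoints from the table; missing_api_key and default_endpoint are hypothetical helpers, not AIMU functions.

```python
import os
from typing import Mapping, Optional

# Copied from the provider keys table.
API_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}
DEFAULT_ENDPOINT = {
    "lmstudio": "localhost:1234",
    "ollama-openai": "localhost:11434",
    "hf-openai": "localhost:8000",
    "vllm": "localhost:8000",
    "llamaserver": "localhost:8080",
    "sglang": "localhost:30000",
}


def missing_api_key(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset API key variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    provider = model.partition(":")[0]
    var = API_KEY_ENV.get(provider)
    if var is not None and not env.get(var):
        return var
    return None


def default_endpoint(provider: str) -> Optional[str]:
    """Return the default host:port for a local-server provider, else None."""
    return DEFAULT_ENDPOINT.get(provider)


print(missing_api_key("anthropic:claude-sonnet-4-6", env={}))  # ANTHROPIC_API_KEY
print(missing_api_key("ollama:qwen3.5:9b", env={}))            # None
print(default_endpoint("llamaserver"))                         # localhost:8080
```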

Check what the provider supports¶

client = aimu.client("ollama:qwen3.5:9b")
client.is_tool_using_model    # True
client.is_thinking_model      # True
client.is_vision_model        # False
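Because the capability attributes are plain booleans, code that accepts any client can branch on them. The sketch below uses a stand-in object carrying the values shown above so that it runs without a model; supported_features is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import List

# Maps a short feature name to the client attribute that reports it.
CAPABILITY_ATTRS = {
    "tools": "is_tool_using_model",
    "thinking": "is_thinking_model",
    "vision": "is_vision_model",
}


def supported_features(client) -> List[str]:
    """List the features a client reports; a missing attribute counts as False."""
    return [name for name, attr in CAPABILITY_ATTRS.items() if getattr(client, attr, False)]


# Stand-in with the same attribute values as the example above (not a real client).
stand_in = SimpleNamespace(
    is_tool_using_model=True, is_thinking_model=True, is_vision_model=False
)
print(supported_features(stand_in))  # ['tools', 'thinking']
```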

Pass provider-specific kwargs¶

Extra kwargs are forwarded to the underlying client constructor:

# Ollama: keep the model warm for 5 minutes
aimu.client("ollama:qwen3.5:9b", model_keep_alive_seconds=300)

# llama.cpp: load a local GGUF file
aimu.client("llamacpp:qwen3-8b", model_path="/path/to/qwen3-8b.gguf")

# LM Studio: point at a non-default host
aimu.client("lmstudio:qwen3.5-9b", base_url="http://myserver:1234/v1")

Each provider client's constructor signature is in the API reference.

Timeouts and retries¶

Networked clients take timeout (in seconds) and max_retries. Both are forwarded unchanged to the provider SDK, which applies its own request timeout and its own bounded retry on transient failures. AIMU adds no retry logic of its own.

import aimu

# Cloud and local OpenAI-compat servers: both supported
client = aimu.client("anthropic:claude-sonnet-4-6", timeout=30, max_retries=5)
client = aimu.client("openai:gpt-4o", timeout=30, max_retries=5)
client = aimu.client("vllm:Qwen/Qwen3-8B", timeout=30, max_retries=5)

These are SDK-native (the anthropic and openai SDKs implement timeout + retry), so the names and behavior match those SDKs exactly. Unset values fall back to each SDK's defaults.

Two caveats:

Ollama (native provider) supports timeout but has no request-retry; passing max_retries raises ValueError. Use the ollama-openai provider (which routes through the OpenAI SDK) if you need retries against Ollama.
In-process providers (hf:, llamacpp:) run locally with no HTTP request, so they don't accept timeout / max_retries (passing them raises TypeError).
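When one configuration drives several providers, the two caveats mean the same kwargs cannot be passed everywhere. A minimal sketch of a filter, assuming you would rather drop unsupported options than let the constructor raise; network_kwargs is a hypothetical helper.

```python
from typing import Any, Dict, Optional

IN_PROCESS_PROVIDERS = {"hf", "llamacpp"}   # no HTTP request, so no timeout or retries
NO_RETRY_PROVIDERS = {"ollama"}             # native Ollama: timeout only


def network_kwargs(model: str,
                   timeout: Optional[float] = None,
                   max_retries: Optional[int] = None) -> Dict[str, Any]:
    """Keep only the timeout/retry kwargs the model's provider accepts."""
    provider = model.partition(":")[0]
    kwargs: Dict[str, Any] = {}
    if provider in IN_PROCESS_PROVIDERS:
        return kwargs
    if timeout is not None:
        kwargs["timeout"] = timeout
    if max_retries is not None and provider not in NO_RETRY_PROVIDERS:
        kwargs["max_retries"] = max_retries
    return kwargs


print(network_kwargs("anthropic:claude-sonnet-4-6", timeout=30, max_retries=5))
print(network_kwargs("ollama:qwen3.5:9b", timeout=30, max_retries=5))
print(network_kwargs("llamacpp:qwen3-8b", timeout=30, max_retries=5))
```

The result would then be splatted into the factory call, for example aimu.client(model, **network_kwargs(model, timeout=30, max_retries=5)). Dropping options silently is a design choice; raising early is equally reasonable if a missing retry policy should be treated as a configuration error.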

Provider failover¶

Per-client max_retries retries the same provider. To fall back to a different provider when one is down, wrap an ordered list of clients in a FallbackClient:

import aimu
from aimu.models import FallbackClient

client = FallbackClient([
    aimu.client("anthropic:claude-sonnet-4-6", timeout=30, max_retries=2),  # preferred
    aimu.client("openai:gpt-4o", timeout=30, max_retries=2),                # fallback
    aimu.client("ollama:qwen3.5:9b"),                                       # last resort, local
])

print(client.chat("Hello"))   # tries Anthropic; on error falls over to OpenAI, then Ollama

The first client that answers wins. A client that raises hands off to the next, carrying the same conversation history, so a multi-turn chat survives a mid-conversation failover. When every client fails, FallbackExhaustedError is raised with the last error chained as its cause.

FallbackClient is itself a BaseModelClient, so it composes everywhere a plain client does:

from aimu.agents import Agent

agent = Agent(client, "You are a helpful assistant.", tools=[...])   # resilient agent

Notes:

What triggers failover. By default any Exception. Narrow it with retry_on= (e.g. FallbackClient([...], retry_on=(TimeoutError, ConnectionError))) so permanent errors surface immediately instead of being masked.
Capabilities (is_thinking_model, model, etc.) reflect the first client, so use capability-compatible clients in one set.
Streaming fails over only before the first chunk is emitted; an error mid-stream propagates rather than replaying.
Async: aio.AsyncFallbackClient is the one-for-one async twin.

Combine the two layers: max_retries/timeout give in-SDK retry against one provider, FallbackClient gives cross-provider failover on top.
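The failover semantics described in this section (ordered attempts, retry_on filtering, last error chained as the cause) can be illustrated with plain callables standing in for clients. This is a sketch of the behavior only, not FallbackClient's implementation: chat_with_fallback and AllClientsFailed are hypothetical names, and the history carry-over and streaming rules are omitted.

```python
from typing import Callable, Sequence, Tuple, Type


class AllClientsFailed(Exception):
    """Stand-in for the error raised when every client fails."""


def chat_with_fallback(
    clients: Sequence[Callable[[str], str]],
    prompt: str,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    last_error = None
    for client in clients:
        try:
            return client(prompt)      # the first client that answers wins
        except retry_on as exc:        # anything outside retry_on propagates at once
            last_error = exc
    raise AllClientsFailed("every client failed") from last_error


def down(prompt: str) -> str:
    raise TimeoutError("provider unreachable")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


print(chat_with_fallback([down, healthy], "Hello"))  # echo: Hello
```

Narrowing retry_on to transient errors such as TimeoutError and ConnectionError means a permanent error (a bad request, for instance) surfaces from the first client instead of being retried against every provider in turn.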

Anthropic prompt caching¶

An agent resends the same large prefix to the model every turn: its system prompt and tool schemas. On Anthropic you can cache that prefix (cheaper, faster) by opting in with cache_prompt=True:

import aimu
from aimu.agents import Agent

client = aimu.client("anthropic:claude-sonnet-4-6", cache_prompt=True)
agent = Agent(client, "You are a helpful assistant with a long, detailed system prompt...", tools=[...])

agent.run("First question")    # writes the cache (system prompt + tools)
agent.run("Second question")   # reads it back — most of the prefix is cached

cache_prompt=True marks the system prompt and the tool definitions with Anthropic's cache_control ephemeral breakpoints at request time. Notes:

Anthropic only. It's a provider-specific kwarg; passing it to another provider raises TypeError.
Safe to leave on. Below Anthropic's minimum cacheable size (1024 tokens; 2048 for Haiku) the API simply doesn't cache — no error.
Observe it. When the response reports caching, client.last_usage includes cache_creation_input_tokens (first call, writing the cache) and cache_read_input_tokens (later calls, reading it).
Conversation messages aren't cached (they change every turn); the win is the stable system-prompt + tools prefix.
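For orientation, this is roughly what cache breakpoints look like in an Anthropic Messages API request: the system prompt is sent as a text block carrying cache_control, and a breakpoint on the last tool covers the tool definitions before it. The helper below is a hypothetical sketch that only builds the request fields; exactly where AIMU places its breakpoints is an assumption.

```python
import copy
from typing import Any, Dict, List


def with_cache_breakpoints(system_prompt: str, tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the system and tools request fields with ephemeral cache breakpoints."""
    system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    cached_tools = copy.deepcopy(tools)  # leave the caller's tool list untouched
    if cached_tools:
        # Assumption: one breakpoint on the last tool, which caches the tools prefix.
        cached_tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {"system": system, "tools": cached_tools}


tools = [
    {"name": "get_weather", "description": "Look up the weather.",
     "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}}},
    {"name": "get_time", "description": "Look up the local time.",
     "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}}},
]
fields = with_cache_breakpoints("You are a helpful assistant.", tools)
print(fields["system"][0]["cache_control"])
print([("cache_control" in t) for t in fields["tools"]])
```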

Failure modes¶

aimu.client raises ValueError listing the available providers if the provider prefix is unknown, and ValueError listing the available model ids if the prefix is valid but the id isn't:

>>> aimu.client("foo:bar")
ValueError: Unknown provider 'foo'. Available providers (with installed deps): ['anthropic', 'ollama', ...]

>>> aimu.client("anthropic:claude-nonsense")
ValueError: Provider 'anthropic' has no model id 'claude-nonsense'. Available: ['claude-fable-5', 'claude-haiku-4-5', 'claude-opus-4-6', 'claude-opus-4-7', 'claude-opus-4-8', 'claude-sonnet-4-6']

The extended grammar has two more guardrails. An @<base_url> on a provider that doesn't take one, and ;<flags> on a catalog id, both raise ValueError:

>>> aimu.client("anthropic:claude-sonnet-4-6@http://proxy/v1")
ValueError: Provider 'anthropic' does not accept an @base_url. Supported: ['hf-openai', 'llamaserver', 'lmstudio', 'ollama-openai', 'openai-compat', 'sglang', 'vllm'].

>>> aimu.client("llamaserver:qwen3-8b.gguf@http://gpu-box:8080/v1;tools")
ValueError: Capability flags are not allowed with the known model id 'qwen3-8b.gguf'; it already declares its capabilities. Use a different id to define an ad-hoc model.