Model matrix
Every model enum member shipped with AIMU, with capability flags. Maintained by hand and kept in sync with the enums in aimu/models/.
Legend: ✅ = supported, ✗ = not supported.
Anthropic (AnthropicModel)
| Enum member | Model id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `CLAUDE_FABLE_5` | `claude-fable-5` | ✅ | ✅ (adaptive) | ✅ |
| `CLAUDE_OPUS_4_8` | `claude-opus-4-8` | ✅ | ✅ (adaptive) | ✅ |
| `CLAUDE_OPUS_4_7` | `claude-opus-4-7` | ✅ | ✅ (adaptive) | ✅ |
| `CLAUDE_OPUS_4_6` | `claude-opus-4-6` | ✅ | ✅ (budget) | ✅ |
| `CLAUDE_SONNET_4_6` | `claude-sonnet-4-6` | ✅ | ✅ (budget) | ✅ |
| `CLAUDE_HAIKU_4_5` | `claude-haiku-4-5` | ✅ | ✅ (budget) | ✅ |
AIMU requests Anthropic reasoning in one of two shapes, fixed per model by a ThinkingStyle on each AnthropicModel member (an Anthropic-specific enum, analogous to HuggingFace's ToolCallFormat):
- budget: `thinking={"type": "enabled", "budget_tokens": N}`; the model always thinks, using up to `N` tokens. Used by Opus 4.6, Sonnet 4.6, and Haiku 4.5.
- adaptive: `thinking={"type": "adaptive", "display": "summarized"}`; the model decides per request whether and how much to think (it may not think at all on simple prompts), and `temperature`/`top_p`/`top_k` are not sent. Required by Opus 4.7+ and Fable 5, which reject the budget form with a 400.
Both styles surface reasoning as THINKING stream chunks and populate last_thinking. The thinking= column reflects the universal supports_thinking flag; the style only changes how the request is built, handled entirely inside AnthropicClient.
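The two request shapes can be sketched as follows. This is an illustration only, not AnthropicClient's code: the `ThinkingStyle` member names, the `build_request_kwargs` helper, and the default budget are assumptions made for the example. Only the two `thinking=` payloads and the dropped sampling parameters come from the description above.

```python
from enum import Enum
from typing import Any, Dict, Optional


class ThinkingStyle(Enum):
    # Member names are assumed; the doc only names the two styles.
    BUDGET = "budget"
    ADAPTIVE = "adaptive"


# Sampling parameters that are not sent with the adaptive form.
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def build_request_kwargs(
    style: ThinkingStyle,
    kwargs: Optional[Dict[str, Any]] = None,
    budget_tokens: int = 1024,  # hypothetical default
) -> Dict[str, Any]:
    """Return a copy of kwargs with the thinking payload for the given style."""
    out = dict(kwargs or {})
    if style is ThinkingStyle.BUDGET:
        # Budget form: caller-supplied kwargs are left untouched (assumption).
        out["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    else:
        # Adaptive form: strip sampling parameters before sending.
        for key in SAMPLING_KEYS:
            out.pop(key, None)
        out["thinking"] = {"type": "adaptive", "display": "summarized"}
    return out
```

Keeping the branch in one helper means callers never need to know which style a model uses, which matches the note that the style is handled inside the client.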
OpenAI (OpenAIModel)
| Enum member | Model id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `GPT_4O_MINI` | `gpt-4o-mini` | ✅ | ✗ | ✅ |
| `GPT_4O` | `gpt-4o` | ✅ | ✗ | ✅ |
| `GPT_4_1` | `gpt-4.1` | ✅ | ✗ | ✅ |
| `GPT_4_1_MINI` | `gpt-4.1-mini` | ✅ | ✗ | ✅ |
| `GPT_4_1_NANO` | `gpt-4.1-nano` | ✅ | ✗ | ✅ |
| `O4_MINI` | `o4-mini` | ✅ | ✗ | ✅ |
| `O3` | `o3` | ✅ | ✗ | ✅ |
| `O3_MINI` | `o3-mini` | ✅ | ✗ | ✗ |
o-series models emit reasoning tokens that aren't exposed via the API, so thinking=False even though they reason internally. Pass reasoning_effort via generate_kwargs if needed.
Google Gemini (GeminiModel)
| Enum member | Model id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `GEMINI_2_0_FLASH` | `gemini-2.0-flash` | ✅ | ✗ | ✅ |
| `GEMINI_2_0_FLASH_LITE` | `gemini-2.0-flash-lite` | ✅ | ✗ | ✅ |
| `GEMINI_1_5_PRO` | `gemini-1.5-pro` | ✅ | ✗ | ✅ |
| `GEMINI_1_5_FLASH` | `gemini-1.5-flash` | ✅ | ✗ | ✅ |
| `GEMINI_2_5_PRO` | `gemini-2.5-pro` | ✅ | ✅ | ✅ |
| `GEMINI_2_5_FLASH` | `gemini-2.5-flash` | ✅ | ✅ | ✅ |
Gemini 2.5 thinking models emit <think> tags on Google's OpenAI-compatible endpoint.
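A minimal sketch of separating `<think>` content from the answer in a complete (non-streamed) response. How AIMU actually parses these tags is not shown here; `split_thinking` is a hypothetical helper, and it assumes the tags are well-formed and not nested.

```python
import re
from typing import Tuple

# Non-greedy match so multiple <think> blocks are captured separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (thinking, answer) from a response that may contain <think> tags."""
    thinking = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return thinking, answer


thinking, answer = split_thinking("<think>2 + 2 is 4</think>The answer is 4.")
```

A streaming client would need to handle tags split across chunks, which this sketch does not attempt.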
Ollama native (OllamaModel)
| Enum member | Model id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `QWEN_3_6_35B` | `qwen3.6:35b` | ✅ | ✅ | ✗ |
| `QWEN_3_6_27B` | `qwen3.6:27b` | ✅ | ✅ | ✗ |
| `QWEN_3_5_9B` | `qwen3.5:9b` | ✅ | ✅ | ✗ |
| `QWEN_3_32B` | `qwen3:32b` | ✅ | ✅ | ✗ |
| `QWEN_3_8B` | `qwen3:8b` | ✅ | ✅ | ✗ |
| `GEMMA_4_E4B` | `gemma4:e4b` | ✅ | ✅ | ✅ |
| `GEMMA_4_26B` | `gemma4:26b` | ✅ | ✅ | ✅ |
| `GEMMA_4_31B` | `gemma4:31b` | ✅ | ✅ | ✅ |
| `GEMMA_3_12B` | `gemma3:12b` | ✗ | ✗ | ✅ |
| `NEMOTRON_CASCADE_2_30B` | `nemotron-cascade-2:30b` | ✅ | ✅ | ✗ |
| `NEMOTRON_3_NANO_30B` | `nemotron-3-nano:30b` | ✅ | ✅ | ✗ |
| `GLM_4_7_FLASH_31B_Q4` | `glm-4.7-flash:q4_K_M` | ✗ | ✅ | ✗ |
| `GPT_OSS_20B` | `gpt-oss:20b` | ✅ | ✅ | ✗ |
| `MAGISTRAL_SMALL_24B` | `magistral:24b` | ✅ | ✅ | ✗ |
| `MINISTRAL_3_14B` | `ministral-3:14b` | ✅ | ✗ | ✗ |
| `PHI_4_MINI_3_8B` | `phi4-mini:3.8b` | ✗ | ✗ | ✗ |
| `PHI_4_14B` | `phi4:14b` | ✗ | ✗ | ✗ |
| `DEEPSEEK_R1_8B` | `deepseek-r1:8b` | ✗ | ✅ | ✗ |
| `SMOLLM2_1_7B` | `smollm2:1.7b` | ✗ | ✗ | ✗ |
| `LLAMA_3_2_3B` | `llama3.2:3b` | ✗ | ✗ | ✗ |
| `LLAMA_3_1_8B` | `llama3.1:8b` | ✗ | ✗ | ✗ |
Some Ollama models can technically be asked for tools but produce unreliable tool calls; those are marked tools=False and documented in the enum source.
HuggingFace (HuggingFaceModel)
| Enum member | Repo id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `QWEN_3_6_27B` | `Qwen/Qwen3.6-27B-FP8` | ✅ | ✅ | ✅ |
| `QWEN_3_5_9B` | `Qwen/Qwen3.5-9B` | ✅ | ✅ | ✅ |
| `QWEN_3_8B` | `Qwen/Qwen3-8B` | ✅ | ✅ | ✗ |
| `GEMMA_4_E4B` | `google/gemma-4-E4B-it` | ✅ | ✗ | ✅ |
| `GEMMA_3_12B` | `google/gemma-3-12b-it` | ✗ | ✗ | ✅ |
| `GPT_OSS_20B` | `openai/gpt-oss-20b` | ✅ | ✅ | ✗ |
| `MAGISTRAL_SMALL` | `mistralai/Magistral-Small-2509` | ✅ | ✗ | ✗ |
| `MISTRAL_NEMO_12B` | `mistralai/Mistral-Nemo-Instruct-2407` | ✅ | ✗ | ✗ |
| `MISTRAL_7B` | `mistralai/Mistral-7B-Instruct-v0.3` | ✅ | ✗ | ✗ |
| `PHI_4_MINI_3_8B` | `microsoft/Phi-4-mini-instruct` | ✗ | ✗ | ✗ |
| `PHI_4_14B` | `microsoft/phi-4` | ✗ | ✗ | ✗ |
| `DEEPSEEK_R1_8B` | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | ✗ | ✅ | ✗ |
| `SMOLLM3_3B` | `HuggingFaceTB/SmolLM3-3B` | ✅ | ✅ | ✗ |
| `LLAMA_3_2_3B` | `unsloth/Llama-3.2-3B-Instruct` | ✅ | ✗ | ✗ |
| `LLAMA_3_1_8B` | `meta-llama/Meta-Llama-3.1-8B-Instruct` | ✅ | ✗ | ✗ |
_VL suffix variants load with AutoModelForImageTextToText for the vision encoder.
llama-cpp (LlamaCppModel)
| Enum member | Hint id | Tools | Thinking | Vision |
|---|---|---|---|---|
| `LLAMA_3_1_8B` | `llama-3.1-8b` | ✗ | ✗ | ✗ |
| `LLAMA_3_2_3B` | `llama-3.2-3b` | ✗ | ✗ | ✗ |
| `MISTRAL_7B` | `mistral-7b` | ✅ | ✗ | ✗ |
| `QWEN_3_4B` | `qwen3-4b` | ✅ | ✅ | ✗ |
| `QWEN_3_8B` | `qwen3-8b` | ✅ | ✅ | ✗ |
| `DEEPSEEK_R1_7B` | `deepseek-r1-7b` | ✗ | ✅ | ✗ |
| `PHI_4_MINI` | `phi-4-mini` | ✅ | ✗ | ✗ |
llama-cpp model ids are hints; the actual model is loaded from model_path= regardless. Capability flags are honoured by the client.
OpenAI-compatible local servers
LMStudioOpenAIModel, OllamaOpenAIModel, HFOpenAIModel, VLLMOpenAIModel, LlamaServerOpenAIModel, and SGLangOpenAIModel all enumerate the same set of common open models (Llama 3.x, Mistral 7B, Phi-4 Mini, Qwen 3.x, DeepSeek R1, Gemma 3). The model id format differs per server (LM Studio uses loaded model keys, Ollama uses name:tag, vLLM/SGLang/HF Serve use HuggingFace repo paths, llama-server uses GGUF filenames). See the enum source for each.
Image generation
Image clients use a different spec class than text (HuggingFaceImageSpec / GeminiImageSpec). The capability flags don't apply, so the matrix shows model-specific defaults instead.
HuggingFace diffusers (HuggingFaceImageModel)
| Enum member | Repo id | Pipeline class | Default steps | Default size | img2img |
|---|---|---|---|---|---|
| `SD_1_5` | `runwayml/stable-diffusion-v1-5` | `StableDiffusionPipeline` | 25 | 512×512 | ✓ (strength=) |
| `SDXL_BASE` | `stabilityai/stable-diffusion-xl-base-1.0` | `StableDiffusionXLPipeline` | 30 | 1024×1024 | ✓ (strength=) |
| `SD_3_5_MEDIUM` | `stabilityai/stable-diffusion-3.5-medium` | `StableDiffusion3Pipeline` | 28 | 1024×1024 | ✓ (strength=) |
| `FLUX_1_DEV` | `black-forest-labs/FLUX.1-dev` | `FluxPipeline` | 28 | 1024×1024 | ✓ (strength=) |
| `FLUX_1_SCHNELL` | `black-forest-labs/FLUX.1-schnell` | `FluxPipeline` | 4 | 1024×1024 | ✓ (strength=) |
| `FLUX_2_KLEIN_4B` | `black-forest-labs/FLUX.2-klein-4B` | `Flux2KleinPipeline` | 4 | 1024×1024 | ✓ (unified) |
| `FLUX_2_KLEIN_9B` | `black-forest-labs/FLUX.2-klein-9B` | `Flux2KleinPipeline` | 4 | 1024×1024 | ✓ (unified) |
The img2img column indicates reference_image= support. strength= models derive output from a noisy version of the reference (0 = identical, 1 = ignore it; default 0.75). "unified" models (Flux2KleinPipeline) condition on the reference directly, with no strength parameter; width/height are derived from the reference.
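For strength-based models, `strength` also determines how much denoising work is done. The sketch below mirrors the scheduling used by the classic Stable Diffusion img2img pipelines in diffusers, where roughly `steps × strength` denoising steps actually run. Whether every pipeline in the table (or AIMU's wrapper) rounds the same way is an assumption, and `effective_img2img_steps` is a hypothetical helper.

```python
def effective_img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Noise is added up to this timestep, then removed step by step.
    return min(int(num_inference_steps * strength), num_inference_steps)


# SD_1_5 defaults from the table: 25 steps at the default strength of 0.75.
steps = effective_img2img_steps(25, 0.75)
```

One practical consequence: with a 4-step model and the default strength of 0.75, only about 3 steps run, so low step counts leave little room to lower `strength` further.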
Spec defaults are starting points: pass num_inference_steps=, guidance_scale=, width=, height=, seed= to override per call. Power users can bypass the enum with a "hf:<repo_id>" string for any HuggingFace diffusers model (defaults to DiffusionPipeline auto-detect loader, img2img_pipeline_class=None).
Google Gemini (GeminiImageModel)
| Enum member | Model id | Notes |
|---|---|---|
| `NANO_BANANA` | `gemini-2.5-flash-image` | GA channel. Aspect ratio via aspect_ratio= (e.g. "1:1", "16:9"). |
| `NANO_BANANA_PREVIEW` | `gemini-2.5-flash-image-preview` | Preview channel; kept for users who pinned it. |
Short-name aliases like "gemini:nano-banana" resolve to the full model id at construction. Nano Banana's generate_content API returns one image per call; num_images > 1 issues N requests.
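The one-image-per-call behaviour implies a simple fan-out when `num_images > 1`. A minimal sketch of that loop, assuming a hypothetical `generate_one` callable that wraps a single API request and returns image bytes (neither the name nor the signature is AIMU's):

```python
from typing import Callable, List


def generate_images(
    generate_one: Callable[[str], bytes],  # hypothetical single-request wrapper
    prompt: str,
    num_images: int = 1,
) -> List[bytes]:
    """Issue one request per image, since each call returns a single image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return [generate_one(prompt) for _ in range(num_images)]
```

Cost and latency therefore scale linearly with `num_images` for this model.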
Audio generation
Audio clients use HuggingFaceAudioSpec, distinct from the image and text spec classes. The matrix shows generation defaults rather than capability flags.
HuggingFace (HuggingFaceAudioModel)
| Enum member | Repo id | Pipeline type | Default duration | Default steps |
|---|---|---|---|---|
| `MUSICGEN_SMALL` | `facebook/musicgen-small` | `musicgen` | 10 s | N/A |
| `MUSICGEN_MEDIUM` | `facebook/musicgen-medium` | `musicgen` | 10 s | N/A |
| `MUSICGEN_LARGE` | `facebook/musicgen-large` | `musicgen` | 10 s | N/A |
| `AUDIOLDM2` | `cvssp/audioldm2` | `audioldm2` | 10 s | 200 |
| `STABLE_AUDIO_OPEN` | `stabilityai/stable-audio-open-1.0` | `stable_audio` | 10 s | 200 |
Pipeline types:
- musicgen: token-autoregressive generation via HuggingFace transformers. Duration maps to token count (~50 tokens/s at 32 kHz); num_inference_steps does not apply. Single final AUDIO_GENERATING chunk when streaming.
- audioldm2 / stable_audio: latent diffusion via HuggingFace diffusers. Accepts num_inference_steps; emits one progress chunk per step plus a final chunk when streaming.
Override per call with duration_s=, num_inference_steps=, seed=, num_audio=. Power users can bypass the enum with "hf:<repo_id>" for any compatible model (pipeline type inferred from known repo prefixes, defaulting to musicgen).
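The musicgen duration mapping above is simple arithmetic. A sketch, taking the ~50 tokens/s figure from the pipeline description; the helper name and the rounding are assumptions, and the result is the kind of value that would be passed as `max_new_tokens` to a transformers `generate()` call:

```python
# Approximate MusicGen token rate from the pipeline notes above.
TOKENS_PER_SECOND = 50


def musicgen_max_new_tokens(duration_s: float) -> int:
    """Convert a requested duration in seconds to a token budget."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return int(round(duration_s * TOKENS_PER_SECOND))


# The 10 s default from the table corresponds to 500 tokens.
default_tokens = musicgen_max_new_tokens(10)
```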
Speech generation
Speech clients use SpeechSpec subclasses, distinct from image and audio spec classes. Speech is text-to-speech (TTS) only; speech-to-text will use a separate BaseTranscriptionClient surface.
HuggingFace (HuggingFaceSpeechModel)
| Enum member | Repo id | Pipeline type | Sample rate | Default voice |
|---|---|---|---|---|
| `MMS_TTS_ENG` | `facebook/mms-tts-eng` | `tts_pipeline` | 16 kHz | N/A |
| `SPEECHT5` | `microsoft/speecht5_tts` | `speecht5` | 16 kHz | CMU Arctic xvectors idx 7306 |
| `BARK` | `suno/bark` | `bark` | 24 kHz | v2/en_speaker_6 |
Pipeline types:
- tts_pipeline: HuggingFace pipeline("text-to-speech"). Any compatible TTS pipeline model.
- speecht5: SpeechT5ForTextToSpeech + SpeechT5HifiGan vocoder + x-vector speaker embeddings. Default embedding loads from Matthijs/cmu-arctic-xvectors (index 7306) on first call. Pass voice="N" to use a different dataset index (0–7930, the range that includes the default 7306); the dataset is cached on the client after the first lookup.
- bark: zero-shot voice cloning. Pass voice= a Bark voice code ("v2/en_speaker_6", "v2/en_speaker_9", etc.).
Power users can bypass the enum with "hf:<repo_id>" for any compatible model (pipeline type inferred from known repo prefixes, defaulting to tts_pipeline).
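Prefix-based inference for `"hf:<repo_id>"` strings might look like the sketch below. The prefix table is hypothetical, derived only from the repo ids in the table above; AIMU's real list of known prefixes lives in its source and may differ.

```python
from typing import Tuple

# Hypothetical prefix table, based on the enum rows above.
KNOWN_PREFIXES = (
    ("microsoft/speecht5", "speecht5"),
    ("suno/bark", "bark"),
)


def infer_speech_pipeline(model_ref: str) -> Tuple[str, str]:
    """Split an "hf:<repo_id>" string and guess its pipeline type."""
    if not model_ref.startswith("hf:"):
        raise ValueError('expected a string of the form "hf:<repo_id>"')
    repo_id = model_ref[len("hf:"):]
    for prefix, pipeline_type in KNOWN_PREFIXES:
        if repo_id.startswith(prefix):
            return repo_id, pipeline_type
    # Anything unrecognised falls back to the generic TTS pipeline.
    return repo_id, "tts_pipeline"
```

The fallback means an unrecognised repo is still attempted rather than rejected, so an incompatible model surfaces as a load-time error from the pipeline.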
OpenAI (OpenAISpeechModel)
Requires OPENAI_API_KEY.
| Enum member | Model id | Notes |
|---|---|---|
| `TTS_1` | `tts-1` | Fast, standard quality. Recommended for live narration. |
| `TTS_1_HD` | `tts-1-hd` | Slower, higher quality. |
Available voices: alloy (default), echo, fable, onyx, nova, shimmer. Pass as voice= to generate(). OpenAI returns raw 24 kHz 16-bit PCM; encode_audio() handles WAV conversion.
Override per call with voice=, speed=, num_audio=. Override per-agent with make_speech_tool(client, voice=..., speed=...).
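For readers handling the raw PCM themselves, wrapping it in a WAV container needs only the standard library. This is a sketch of the general technique, not the body of encode_audio(); the `pcm16_to_wav` name is hypothetical and mono output is an assumption.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)   # mono assumed
        wav_file.setsampwidth(2)          # 16-bit samples = 2 bytes
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buf.getvalue()


# One second of silence at 24 kHz, 16-bit mono.
wav_bytes = pcm16_to_wav(b"\x00\x00" * 24000)
```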
See also
- Provider matrix: provider × extra × API key
- How-to: add a new model: extending these enums
- How-to: generate images: using the image surface
- How-to: generate audio: using the audio surface
- How-to: generate speech: using the speech surface