Model matrix

Every model enum member shipped with AIMU, with its capability flags. This page is maintained by hand and kept in sync with the enums in aimu/models/.

Legend: ✅ = supported, ✗ = not supported.

Anthropic (AnthropicModel)

Enum member Model id Tools Thinking Vision
CLAUDE_FABLE_5 claude-fable-5 ✅ (adaptive)
CLAUDE_OPUS_4_8 claude-opus-4-8 ✅ (adaptive)
CLAUDE_OPUS_4_7 claude-opus-4-7 ✅ (adaptive)
CLAUDE_OPUS_4_6 claude-opus-4-6 ✅ (budget)
CLAUDE_SONNET_4_6 claude-sonnet-4-6 ✅ (budget)
CLAUDE_HAIKU_4_5 claude-haiku-4-5 ✅ (budget)

AIMU requests Anthropic reasoning in one of two shapes, fixed per model by a ThinkingStyle on each AnthropicModel member (an Anthropic-specific enum, analogous to HuggingFace's ToolCallFormat):

  • budget: thinking={"type": "enabled", "budget_tokens": N}; the model always thinks up to the budget. Used by Opus 4.6, Sonnet 4.6, and Haiku 4.5.
  • adaptive: thinking={"type": "adaptive", "display": "summarized"}; the model decides per request whether and how much to think (it may not think at all on simple prompts), and temperature/top_p/top_k are not sent. Required by Opus 4.7+ and Fable 5, which reject the budget form with a 400.

Both styles surface reasoning as THINKING stream chunks and populate last_thinking. The Thinking column reflects the universal supports_thinking flag; the style only changes how the request is built, which is handled entirely inside AnthropicClient.
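
The two request shapes can be sketched as follows. This is a minimal illustration, not AIMU's AnthropicClient code: ThinkingStyle is named in the text above, but its member names, the helper build_request_kwargs, and the default budget are assumptions made for the example.

```python
from enum import Enum


class ThinkingStyle(Enum):
    # Stand-in for the enum described above; member names and values are assumed.
    BUDGET = "budget"
    ADAPTIVE = "adaptive"


def build_request_kwargs(style, generate_kwargs, budget_tokens=2048):
    """Return a copy of generate_kwargs with the thinking block added (hypothetical helper)."""
    kwargs = dict(generate_kwargs)
    if style is ThinkingStyle.ADAPTIVE:
        # Adaptive form: sampling parameters are not sent.
        for key in ("temperature", "top_p", "top_k"):
            kwargs.pop(key, None)
        kwargs["thinking"] = {"type": "adaptive", "display": "summarized"}
    else:
        # Budget form: the text only specifies the thinking block itself,
        # so this sketch leaves every other kwarg untouched.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return kwargs
```

Keeping the branch in one place mirrors the point above: callers see the same capability flag either way, and only the request construction differs.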

OpenAI (OpenAIModel)

Enum member Model id Tools Thinking Vision
GPT_4O_MINI gpt-4o-mini
GPT_4O gpt-4o
GPT_4_1 gpt-4.1
GPT_4_1_MINI gpt-4.1-mini
GPT_4_1_NANO gpt-4.1-nano
O4_MINI o4-mini
O3 o3
O3_MINI o3-mini

o-series models emit reasoning tokens that aren't exposed via the API, so thinking=False even though they reason internally. Pass reasoning_effort via generate_kwargs if needed.

Google Gemini (GeminiModel)

Enum member Model id Tools Thinking Vision
GEMINI_2_0_FLASH gemini-2.0-flash
GEMINI_2_0_FLASH_LITE gemini-2.0-flash-lite
GEMINI_1_5_PRO gemini-1.5-pro
GEMINI_1_5_FLASH gemini-1.5-flash
GEMINI_2_5_PRO gemini-2.5-pro
GEMINI_2_5_FLASH gemini-2.5-flash

Gemini 2.5 thinking models emit <think> tags on Google's OpenAI-compatible endpoint.
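
As a rough sketch of what handling those tags can look like, the helper below separates <think> blocks from the visible answer in a complete response string. The function name split_thinking is hypothetical, it is not AIMU's parser, and it does not handle tags that arrive split across streamed chunks.

```python
import re

# Non-greedy match so several <think> blocks in one response are found separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (thinking, answer) for text that may contain <think> blocks."""
    thinking = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return thinking, answer


thinking, answer = split_thinking("<think>add 2 and 2</think>The answer is 4.")
```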

Ollama native (OllamaModel)

Enum member Model id Tools Thinking Vision
QWEN_3_6_35B qwen3.6:35b
QWEN_3_6_27B qwen3.6:27b
QWEN_3_5_9B qwen3.5:9b
QWEN_3_32B qwen3:32b
QWEN_3_8B qwen3:8b
GEMMA_4_E4B gemma4:e4b
GEMMA_4_26B gemma4:26b
GEMMA_4_31B gemma4:31b
GEMMA_3_12B gemma3:12b
NEMOTRON_CASCADE_2_30B nemotron-cascade-2:30b
NEMOTRON_3_NANO_30B nemotron-3-nano:30b
GLM_4_7_FLASH_31B_Q4 glm-4.7-flash:q4_K_M
GPT_OSS_20B gpt-oss:20b
MAGISTRAL_SMALL_24B magistral:24b
MINISTRAL_3_14B ministral-3:14b
PHI_4_MINI_3_8B phi4-mini:3.8b
PHI_4_14B phi4:14b
DEEPSEEK_R1_8B deepseek-r1:8b
SMOLLM2_1_7B smollm2:1.7b
LLAMA_3_2_3B llama3.2:3b
LLAMA_3_1_8B llama3.1:8b

Some Ollama models can technically be asked for tools but produce unreliable tool calls; those are marked tools=False and documented in the enum source.

HuggingFace (HuggingFaceModel)

Enum member Repo id Tools Thinking Vision
QWEN_3_6_27B Qwen/Qwen3.6-27B-FP8
QWEN_3_5_9B Qwen/Qwen3.5-9B
QWEN_3_8B Qwen/Qwen3-8B
GEMMA_4_E4B google/gemma-4-E4B-it
GEMMA_3_12B google/gemma-3-12b-it
GPT_OSS_20B openai/gpt-oss-20b
MAGISTRAL_SMALL mistralai/Magistral-Small-2509
MISTRAL_NEMO_12B mistralai/Mistral-Nemo-Instruct-2407
MISTRAL_7B mistralai/Mistral-7B-Instruct-v0.3
PHI_4_MINI_3_8B microsoft/Phi-4-mini-instruct
PHI_4_14B microsoft/phi-4
DEEPSEEK_R1_8B deepseek-ai/DeepSeek-R1-Distill-Llama-8B
SMOLLM3_3B HuggingFaceTB/SmolLM3-3B
LLAMA_3_2_3B unsloth/Llama-3.2-3B-Instruct
LLAMA_3_1_8B meta-llama/Meta-Llama-3.1-8B-Instruct

_VL suffix variants load with AutoModelForImageTextToText for the vision encoder.

llama-cpp (LlamaCppModel)

Enum member Hint id Tools Thinking Vision
LLAMA_3_1_8B llama-3.1-8b
LLAMA_3_2_3B llama-3.2-3b
MISTRAL_7B mistral-7b
QWEN_3_4B qwen3-4b
QWEN_3_8B qwen3-8b
DEEPSEEK_R1_7B deepseek-r1-7b
PHI_4_MINI phi-4-mini

llama-cpp model ids are hints; the actual model is loaded from model_path= regardless. Capability flags are honoured by the client.

OpenAI-compatible local servers

LMStudioOpenAIModel, OllamaOpenAIModel, HFOpenAIModel, VLLMOpenAIModel, LlamaServerOpenAIModel, and SGLangOpenAIModel all enumerate the same set of common open models (Llama 3.x, Mistral 7B, Phi-4 Mini, Qwen 3.x, DeepSeek R1, Gemma 3). The model id format differs per server (LM Studio uses loaded model keys, Ollama uses name:tag, vLLM/SGLang/HF Serve use HuggingFace repo paths, llama-server uses GGUF filenames). See the enum source for each.

Image generation

Image clients use a different spec class than text (HuggingFaceImageSpec / GeminiImageSpec). The capability flags don't apply, so the matrix shows model-specific defaults instead.

HuggingFace diffusers (HuggingFaceImageModel)

Enum member | Repo id | Pipeline class | Default steps | Default size | img2img
SD_1_5 | runwayml/stable-diffusion-v1-5 | StableDiffusionPipeline | 25 | 512×512 | ✓ (strength=)
SDXL_BASE | stabilityai/stable-diffusion-xl-base-1.0 | StableDiffusionXLPipeline | 30 | 1024×1024 | ✓ (strength=)
SD_3_5_MEDIUM | stabilityai/stable-diffusion-3.5-medium | StableDiffusion3Pipeline | 28 | 1024×1024 | ✓ (strength=)
FLUX_1_DEV | black-forest-labs/FLUX.1-dev | FluxPipeline | 28 | 1024×1024 | ✓ (strength=)
FLUX_1_SCHNELL | black-forest-labs/FLUX.1-schnell | FluxPipeline | 4 | 1024×1024 | ✓ (strength=)
FLUX_2_KLEIN_4B | black-forest-labs/FLUX.2-klein-4B | Flux2KleinPipeline | 4 | 1024×1024 | ✓ (unified)
FLUX_2_KLEIN_9B | black-forest-labs/FLUX.2-klein-9B | Flux2KleinPipeline | 4 | 1024×1024 | ✓ (unified)

The img2img column indicates reference_image= support. strength= models derive output from a noisy version of the reference (0 = identical, 1 = ignore it; default 0.75). "unified" models (Flux2KleinPipeline) condition on the reference directly, with no strength parameter; width/height are derived from the reference.
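
To build intuition for strength=, note that in typical diffusers img2img pipelines the value also scales how many of the scheduled denoising steps actually run. That is an assumption about the underlying library rather than something stated here about AIMU, and the exact rounding differs between pipeline classes. The helper name effective_steps is hypothetical.

```python
def effective_steps(num_inference_steps, strength=0.75):
    """Approximate denoising steps an img2img run performs for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Low strength starts from a lightly noised reference, so few steps remain.
    return min(int(num_inference_steps * strength), num_inference_steps)


# SD_1_5 defaults from the table: 25 steps at the default strength of 0.75.
print(effective_steps(25))        # 18
print(effective_steps(25, 1.0))   # 25
```

Under this assumption, a 4-step model such as FLUX_1_SCHNELL leaves very little room at low strength, which is worth keeping in mind when tuning.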

Spec defaults are starting points: pass num_inference_steps=, guidance_scale=, width=, height=, seed= to override per call. Power users can bypass the enum with a "hf:<repo_id>" string for any HuggingFace diffusers model (defaults to DiffusionPipeline auto-detect loader, img2img_pipeline_class=None).
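
The "defaults plus per-call overrides" pattern can be sketched like this. ImageDefaults and resolve are hypothetical names for illustration, not the HuggingFaceImageSpec API; the SDXL_BASE values are taken from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ImageDefaults:
    # Illustrative subset of what a spec might carry.
    num_inference_steps: int
    width: int
    height: int


SDXL_BASE = ImageDefaults(num_inference_steps=30, width=1024, height=1024)


def resolve(defaults, **overrides):
    """Apply per-call overrides; None means "not passed", so the default wins."""
    given = {key: value for key, value in overrides.items() if value is not None}
    return replace(defaults, **given)


settings = resolve(SDXL_BASE, num_inference_steps=20, width=None)
print(settings)  # ImageDefaults(num_inference_steps=20, width=1024, height=1024)
```

Making the defaults immutable means one call's overrides cannot leak into the next.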

Google Gemini (GeminiImageModel)

Enum member | Model id | Notes
NANO_BANANA | gemini-2.5-flash-image | GA channel. Aspect ratio via aspect_ratio= (e.g. "1:1", "16:9").
NANO_BANANA_PREVIEW | gemini-2.5-flash-image-preview | Preview channel; kept for users who pinned it.

Short-name aliases like "gemini:nano-banana" resolve to the full model id at construction. Nano Banana's generate_content API returns one image per call; num_images > 1 issues N requests.
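
The fan-out for num_images > 1 amounts to a loop over single-image calls. The sketch below uses a fake generator in place of the real API call; generate_images and fake_generate_one are hypothetical names, and nothing here reflects how AIMU batches or handles errors.

```python
def generate_images(generate_one, prompt, num_images=1):
    """Call a one-image-per-request generator num_images times."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return [generate_one(prompt) for _ in range(num_images)]


calls = []


def fake_generate_one(prompt):
    # Stand-in for a request that returns exactly one image.
    calls.append(prompt)
    return f"image-{len(calls)}"


images = generate_images(fake_generate_one, "a red bicycle", num_images=3)
print(images)  # ['image-1', 'image-2', 'image-3']
```

Because each image is a separate request, cost and latency grow linearly with num_images.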

Audio generation

Audio clients use HuggingFaceAudioSpec, distinct from the image and text spec classes. The matrix shows generation defaults rather than capability flags.

HuggingFace (HuggingFaceAudioModel)

Enum member | Repo id | Pipeline type | Default duration | Default steps
MUSICGEN_SMALL | facebook/musicgen-small | musicgen | 10 s | N/A
MUSICGEN_MEDIUM | facebook/musicgen-medium | musicgen | 10 s | N/A
MUSICGEN_LARGE | facebook/musicgen-large | musicgen | 10 s | N/A
AUDIOLDM2 | cvssp/audioldm2 | audioldm2 | 10 s | 200
STABLE_AUDIO_OPEN | stabilityai/stable-audio-open-1.0 | stable_audio | 10 s | 200

Pipeline types:

  • musicgen: token-autoregressive generation via HuggingFace transformers. Duration maps to token count (~50 tokens/s at 32 kHz); num_inference_steps does not apply. Single final AUDIO_GENERATING chunk when streaming.
  • audioldm2 / stable_audio: latent diffusion via HuggingFace diffusers. Accepts num_inference_steps; emits one progress chunk per step plus a final chunk when streaming.
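
The duration-to-token mapping for musicgen is simple arithmetic. A minimal sketch, using the approximate rate of 50 tokens per second quoted above; the helper name duration_to_tokens is hypothetical and the real client may round or clamp differently.

```python
MUSICGEN_TOKENS_PER_SECOND = 50  # approximate rate quoted above


def duration_to_tokens(duration_s):
    """Convert a requested duration in seconds to a token budget."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return round(duration_s * MUSICGEN_TOKENS_PER_SECOND)


# The default 10 s duration corresponds to about 500 tokens.
print(duration_to_tokens(10))  # 500
```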

Override per call with duration_s=, num_inference_steps=, seed=, num_audio=. Power users can bypass the enum with "hf:<repo_id>" for any compatible model (pipeline type inferred from known repo prefixes, defaulting to musicgen).
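
Prefix-based inference with a fallback might look like the sketch below. The prefix table is an assumption derived from the repo ids in the table above, and infer_pipeline_type is a hypothetical name; the actual list of known prefixes lives in the AIMU source.

```python
# Assumed prefixes, taken from the repo ids listed in the table above.
_KNOWN_PREFIXES = {
    "cvssp/audioldm2": "audioldm2",
    "stabilityai/stable-audio": "stable_audio",
    "facebook/musicgen": "musicgen",
}


def infer_pipeline_type(model_ref):
    """Map an "hf:<repo_id>" string to a pipeline type, defaulting to musicgen."""
    repo_id = model_ref.removeprefix("hf:")  # requires Python 3.9+
    for prefix, pipeline_type in _KNOWN_PREFIXES.items():
        if repo_id.startswith(prefix):
            return pipeline_type
    return "musicgen"


print(infer_pipeline_type("hf:stabilityai/stable-audio-open-1.0"))  # stable_audio
print(infer_pipeline_type("hf:someone/unknown-model"))              # musicgen
```

The silent fallback is convenient, but it also means an unrecognised diffusion repo would be loaded as musicgen, so it is worth checking the inferred type when bypassing the enum.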

Speech generation

Speech clients use SpeechSpec subclasses, distinct from image and audio spec classes. Speech is text-to-speech (TTS) only; speech-to-text will use a separate BaseTranscriptionClient surface.

HuggingFace (HuggingFaceSpeechModel)

Enum member | Repo id | Pipeline type | Sample rate | Default voice
MMS_TTS_ENG | facebook/mms-tts-eng | tts_pipeline | 16 kHz | N/A
SPEECHT5 | microsoft/speecht5_tts | speecht5 | 16 kHz | CMU Arctic xvectors idx 7306
BARK | suno/bark | bark | 24 kHz | v2/en_speaker_6

Pipeline types:

  • tts_pipeline: HuggingFace pipeline("text-to-speech"). Any compatible TTS pipeline model.
  • speecht5: SpeechT5ForTextToSpeech + SpeechT5HifiGan vocoder + x-vector speaker embeddings. Default embedding loads from Matthijs/cmu-arctic-xvectors (index 7306) on first call. Pass voice="N" to use a different dataset index (0–1132); the dataset is cached on the client after the first lookup.
  • bark: zero-shot voice cloning. Pass voice= a Bark voice code ("v2/en_speaker_6", "v2/en_speaker_9", etc.).

Power users can bypass the enum with "hf:<repo_id>" for any compatible model (pipeline type inferred from known repo prefixes, defaulting to tts_pipeline).

OpenAI (OpenAISpeechModel)

Requires OPENAI_API_KEY.

Enum member | Model id | Notes
TTS_1 | tts-1 | Fast, standard quality. Recommended for live narration.
TTS_1_HD | tts-1-hd | Slower, higher quality.

Available voices: alloy (default), echo, fable, onyx, nova, shimmer. Pass as voice= to generate(). OpenAI returns raw 24 kHz 16-bit PCM; encode_audio() handles WAV conversion.
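
For readers curious what the WAV conversion involves, wrapping raw 16-bit PCM in a WAV container needs only the standard library. This is a sketch, not AIMU's encode_audio(): the function name pcm_to_wav is hypothetical, and mono output is an assumption, since the channel count is not stated above.

```python
import io
import wave


def pcm_to_wav(pcm, sample_rate=24000, channels=1):
    """Wrap raw 16-bit little-endian PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)   # mono is assumed here
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at 24 kHz, 16-bit mono.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
```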

Override per call with voice=, speed=, num_audio=. Override per-agent with make_speech_tool(client, voice=..., speed=...).

See also