Handle audio input

Pass audio=[...] to chat() (stateful) or generate() (stateless one-shot) on an audio-capable model. AIMU normalises every input to OpenAI input_audio content blocks internally and adapts them per-provider.

Basic usage

import aimu

client = aimu.client("openai:gpt-4o")
client.chat("Transcribe this recording.", audio=["./interview.wav"])

# Multiple clips
client.chat("Compare the speakers in these two clips.", audio=["./a.wav", "./b.wav"])

Stateless single-turn (`generate`)

Use generate(audio=...) for a one-shot audio call that does not touch conversation history:

client = aimu.client("openai:gpt-4o")

# Each call is independent: no history kept, no reset() needed between calls.
lang = client.generate("What language is spoken here?", audio=["./clip.mp3"])
tone = client.generate("Describe the speaker's tone.", audio=["./clip.mp3"])

assert client.messages == []   # generate() never mutates message state

chat(audio=...) is the stateful path: the turn persists so you can ask follow-ups. generate(audio=...) is the stateless path: each call stands alone.

Accepted audio forms

Each item in audio=[...] may be any of:

File path as str or pathlib.Path: read from disk and base64-encoded
Raw bytes: encoded directly (defaults to wav format)
https:// URL: fetched eagerly (the input_audio block has no remote-URL field)
data:audio/...;base64,... URL: passed through; format extracted from MIME type

Supported format strings (inferred from extension or MIME type): wav, mp3, ogg, flac, m4a, webm.

from pathlib import Path

import aimu

client = aimu.client("openai:gpt-4o")

client.chat("Analyse these clips", audio=[
    "./local.wav",                                     # path string
    Path("./other.mp3"),                               # Path
    b"RIFF...",                                        # raw bytes (wav assumed)
    "https://example.com/clip.wav",                    # URL (fetched)
    "data:audio/flac;base64,AAAA...",                  # data URL
])
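
The normalisation described above can be pictured with a minimal sketch. It is not AIMU's implementation: `to_input_audio` and `KNOWN_FORMATS` are hypothetical names, the block layout follows OpenAI's `input_audio` content part, and two behaviours are assumptions made to keep the example short: unknown formats raise `ValueError`, and the https:// fetch is left out.

```python
import base64
from pathlib import Path

# Format strings listed in this section.
KNOWN_FORMATS = {"wav", "mp3", "ogg", "flac", "m4a", "webm"}


def to_input_audio(item) -> dict:
    """Turn a path, raw bytes, or data URL into an OpenAI-style input_audio block."""
    if isinstance(item, bytes):
        # Raw bytes carry no name or MIME type, so wav is assumed.
        data, fmt = base64.b64encode(item).decode("ascii"), "wav"
    elif isinstance(item, str) and item.startswith("data:audio/"):
        # "data:audio/flac;base64,AAAA" -> fmt "flac", data "AAAA".
        # Subtypes such as "mpeg" would need a mapping table; omitted here.
        header, _, data = item.partition(",")
        fmt = header[len("data:audio/"):].split(";")[0]
    elif isinstance(item, str) and item.startswith("https://"):
        raise NotImplementedError("remote fetch is left out of this sketch")
    else:
        # str or pathlib.Path: read from disk, infer the format from the suffix.
        path = Path(item)
        fmt = path.suffix.lstrip(".").lower()
        data = base64.b64encode(path.read_bytes()).decode("ascii")
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

Because every input form ends up as the same block shape, the rest of the pipeline only ever has to deal with base64 data plus a format string.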

Audio-capable models

Each Model enum exposes an AUDIO_MODELS classproperty listing members where supports_audio=True. Passing audio= to a non-audio model raises ValueError up front.

from aimu.models.providers.openai.text import OpenAIClient
print(OpenAIClient.AUDIO_MODELS)
# [<OpenAIModel.GPT_4O_MINI: ...>, <OpenAIModel.GPT_4O: ...>, ...]

Supported audio-capable text models:

Provider	Models
OpenAI	GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
Google Gemini	2.0 Flash, 2.0 Flash Lite, 2.5 Pro, 2.5 Flash
HuggingFace	Gemma 4 E4B, Gemma 4 12B, Nemotron-H-8B
Ollama	None yet (Gemma 4 weights support audio; add `audio=True` once Ollama API exposes audio input)
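
To illustrate the up-front capability check, here is a small self-contained sketch. `DemoModel`, `audio_models`, and `require_audio` are hypothetical stand-ins, not AIMU classes; only the `supports_audio` flag and the `ValueError` come from the description above, and the error wording is made up.

```python
from enum import Enum


class DemoModel(Enum):
    # Values are (model id, supports_audio); both members are made up.
    TEXT_ONLY = ("demo-text", False)
    AUDIO_OK = ("demo-audio", True)

    @property
    def supports_audio(self) -> bool:
        return self.value[1]


def audio_models(enum_cls) -> list:
    """Stand-in for AUDIO_MODELS: members whose supports_audio flag is set."""
    return [m for m in enum_cls if m.supports_audio]


def require_audio(model, audio) -> None:
    """Fail before any request is built if audio is passed to a text-only model."""
    if audio and not model.supports_audio:
        raise ValueError(f"{model.name} does not support audio input")
```

Checking before the request is built means a misconfigured model fails locally and cheaply instead of after an upload.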

Mutual exclusivity with `images=`

audio= and images= are mutually exclusive per turn. Passing both raises ValueError:

# Raises ValueError: "Pass either images= or audio= per call, not both."
client.chat("describe", images=["photo.jpg"], audio=["clip.wav"])
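
The rule amounts to a guard of roughly the following shape. `check_modalities` is a hypothetical helper written for illustration; treating an empty list the same as an omitted argument is an assumption, and only the error message is taken from the text above.

```python
def check_modalities(images=None, audio=None) -> None:
    """Reject a turn that carries both image and audio attachments."""
    if images and audio:
        raise ValueError("Pass either images= or audio= per call, not both.")


try:
    check_modalities(images=["photo.jpg"], audio=["clip.wav"])
except ValueError as exc:
    message = str(exc)  # the documented error text
```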

Multi-turn conversations

Audio blocks persist in self.messages as input_audio content, so the model can refer back to earlier turns:

client = aimu.client("openai:gpt-4o")
client.chat("What is the main topic of this recording?", audio=["./meeting.wav"])
client.chat("Summarise the action items mentioned.")  # follow-up, no audio needed

Async surface

audio= is available on aio.chat() and aio.generate() with the same signature:

from aimu import aio
import asyncio

async def main():
    client = aio.client("openai:gpt-4o")
    result = await client.generate("Transcribe this.", audio=["./clip.wav"])
    print(result)

asyncio.run(main())

Per-provider adaptation

AIMU always keeps self.messages in OpenAI format (input_audio content blocks) and adapts the blocks at request time:

Provider	Adaptation
OpenAI / Gemini / OpenAI-compatible	Pass-through (`input_audio` is already OpenAI format)
Anthropic	Rewrites to Anthropic `{"type": "audio", "source": {"type": "base64", ...}}` blocks
HuggingFace	Decodes to float32 numpy arrays via `soundfile`; passes as `audio=`/`sampling_rate=16000` to the processor
Ollama	Raises `ValueError` (API does not yet support audio input)
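
The decode step in the HuggingFace row can be approximated with the standard library alone. The sketch below swaps `soundfile` and numpy for the `wave` and `array` modules, so it handles only 16-bit PCM WAV, returns a plain list of floats rather than a float32 array, and does no resampling. `decode_input_audio` is a hypothetical name, not AIMU's function.

```python
import array
import base64
import io
import sys
import wave


def decode_input_audio(block: dict) -> tuple[list[float], int]:
    """Decode a 16-bit PCM WAV input_audio block into floats in [-1.0, 1.0)."""
    payload = block["input_audio"]
    if payload["format"] != "wav":
        raise ValueError("this sketch only handles wav")
    raw = base64.b64decode(payload["data"])
    with wave.open(io.BytesIO(raw), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Scale to floats; multi-channel audio stays interleaved in this sketch.
    return [s / 32768.0 for s in samples], rate
```

The function returns the samples together with the file's own sample rate, which a caller would compare against the rate the processor expects before passing the audio on.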