Transcribe audio¶

Pass an audio clip to transcription_client().transcribe() or the one-shot aimu.transcribe() to get the spoken text as a string.

Basic usage¶

import aimu

# One-shot
text = aimu.transcribe("./clip.wav", model="openai:whisper-1")

# Reusable client (better for many files)
client = aimu.transcription_client("openai:whisper-1")
text = client.transcribe("./clip.wav")

Accepted audio forms¶

Each audio argument accepts the same forms as audio= on chat():

File path as str or pathlib.Path -- read from disk
Raw bytes -- WAV format assumed
https:// URL -- fetched eagerly
data:audio/...;base64,... URL -- passed through

Supported format strings: wav, mp3, ogg, flac, m4a, webm.

Structured output with timestamps¶

Pass response_format="verbose_json" to get segments with start/end timestamps:

result = client.transcribe("./clip.wav", response_format="verbose_json")
# {
#   "text": "Hello world.",
#   "language": "en",
#   "duration": 2.5,
#   "segments": [{"start": 0.0, "end": 2.5, "text": "Hello world."}]
# }

response_format="verbose_json" requires a model with supports_timestamps=True. All models in the catalog support it. response_format="json" returns {"text": "..."}.

Language hint¶

text = client.transcribe("./clip.wav", language="fr")  # BCP-47 code

None (default) triggers auto-detection.

Available models¶

Provider	Enum	Model ID
OpenAI	`OpenAITranscriptionModel.WHISPER_1`	`whisper-1`
OpenAI	`OpenAITranscriptionModel.GPT_4O_TRANSCRIBE`	`gpt-4o-transcribe`
OpenAI	`OpenAITranscriptionModel.GPT_4O_MINI_TRANSCRIBE`	`gpt-4o-mini-transcribe`
HuggingFace	`HuggingFaceTranscriptionModel.WHISPER_TINY`	`openai/whisper-tiny`
HuggingFace	`HuggingFaceTranscriptionModel.WHISPER_LARGE_V3`	`openai/whisper-large-v3`
HuggingFace	`HuggingFaceTranscriptionModel.DISTIL_WHISPER_LARGE_V3`	`distil-whisper/distil-large-v3`

Requires OPENAI_API_KEY for OpenAI models. HuggingFace models run locally (need the [hf] extra).

Default model via env var¶

export AIMU_TRANSCRIPTION_MODEL="openai:whisper-1"

Then aimu.transcription_client() and aimu.transcribe() resolve the model without an explicit argument.

Async surface¶

from aimu import aio
import asyncio

async def main():
    sync_client = aimu.transcription_client("openai:whisper-1")
    async_client = aio.transcription_client(sync_client)
    text = await async_client.transcribe("./clip.wav")
    print(text)

asyncio.run(main())

As an agent tool¶

from aimu.tools.builtin import transcribe_audio
from aimu.agents import Agent

agent = Agent(text_client, tools=[transcribe_audio])
# agent can now call transcribe_audio(audio_path="./clip.wav")

For a custom client: make_transcription_tool(client) returns a bound version.