Benchmark models¶
Benchmark runs one prompt over a dataset across multiple BaseModelClient instances and aggregates per-row scores into a comparison DataFrame. Rows are driven through chat() with client.reset() between rows, so the harness handles plain ModelClient and Agent.as_model_client() views uniformly.
Basic comparison¶
import aimu
import pandas as pd
from aimu.evals import Benchmark
from aimu.prompts.tuners.scorers import LLMJudgeScorer
df = pd.DataFrame({
"content": [
"Summarise the water cycle.",
"Explain photosynthesis briefly.",
"What is the Doppler effect?",
]
})
prompt = "Answer in one sentence: {content}"
judge = aimu.client("ollama:qwen3.5:9b")
scorer = LLMJudgeScorer(judge, criteria="Concise, accurate, plain English.")
bench = Benchmark(prompt=prompt, data=df, scorer=scorer, pass_threshold=0.7)
results = bench.run({
"qwen3.5-9b": aimu.client("ollama:qwen3.5:9b"),
"gemma4-e4b": aimu.client("ollama:gemma4:e4b"),
})
print(results.metrics)
# score pass_rate
# qwen3.5-9b 0.82 1.00
# gemma4-e4b 0.74 0.67
Compare with an agentic client¶
Agent.as_model_client() lets an agentic loop slot into the benchmark next to plain clients:
from aimu.agents import Agent
results = bench.run({
"plain": aimu.client("ollama:qwen3.5:9b"),
"agentic": Agent(aimu.client("ollama:qwen3.5:9b")).as_model_client(),
})
Reference column for scorers that use it¶
If your dataset has a reference column, it's forwarded to the scorer (and surfaces as expected_output in DeepEval metrics):
df = pd.DataFrame({
"content": ["What is the capital of France?", "What is 2+2?"],
"reference": ["Paris", "4"],
})
Export results¶
results.to_csv("bench.csv")
results.to_json("bench.json")
# Persist per-client metrics into a PromptCatalog (auto-versioned)
from aimu.prompts import PromptCatalog
with PromptCatalog("bench.db") as catalog:
results.to_catalog(catalog, prompt_name="summary-bench")
Each client becomes one Prompt(name=prompt_name, model_id=client_name, ...) row. Re-running the benchmark with the same prompt_name appends new versions.
Use DeepEval metrics as the scorer¶
Swap LLMJudgeScorer for DeepEvalScorer([metric, ...]). See Integrate DeepEval for the full setup.
How it works internally¶
For each client, the harness:
- Snapshots
client.messagesso the run is reversible. - For every row: calls
client.reset()(preservessystem_message, unlocks the lock), thenclient.chat(prompt.format(content=row.content)). - Scores the output via
scorer.score(row_with_output). - Restores the original
messagesin atry/finallyso a failure mid-client doesn't leak state.
Rows are sequential within a client; clients are sequential across runs. Concurrent execution is deferred.