`aimu.evals`¶

Benchmark harness and DeepEval adapters.

Benchmark¶

aimu.evals.Benchmark ¶

Benchmark(prompt: str, data: DataFrame, scorer: Scorer, pass_threshold: float = 0.7, generate_kwargs: Optional[dict] = None)

Run the same prompt and dataset across multiple model clients.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	Prompt template containing a `{content}` placeholder.	required
`data`	`DataFrame`	DataFrame with at least a `content` column. An optional `reference` column is forwarded to scorers that consume it (e.g. :class:`aimu.evals.DeepEvalScorer`).	required
`scorer`	`Scorer`	Per-row :class:`Scorer` used to evaluate model outputs.	required
`pass_threshold`	`float`	Minimum normalised score (0-1) for a row to count toward `pass_rate`. Default `0.7`.	`0.7`
`generate_kwargs`	`Optional[dict]`	Optional kwargs forwarded to `client.chat(...)`.	`None`

run ¶

run(clients: dict[str, BaseModelClient]) -> BenchmarkResults

Evaluate every client on the dataset and return aggregate metrics.

Each row is run as a fresh chat (client.reset()) so an agentic view (from agent.as_model_client()) engages its agent loop. The client's original messages are restored after its run completes.

Parameters:

Name	Type	Description	Default
`clients`	`dict[str, BaseModelClient]`	Mapping of display name to client. The display name is used as the row label in the result DataFrame and as the `model_id` when persisting to a catalog.	required

Returns:

Type	Description
`BenchmarkResults`	class:`BenchmarkResults` with one row per client.

aimu.evals.BenchmarkResults `dataclass` ¶

BenchmarkResults(metrics: DataFrame, prompt: str)

Aggregate results from a :class:Benchmark run.

Attributes:

Name	Type	Description
`metrics`	`DataFrame`	DataFrame indexed by client name with one column per metric (currently `score` and `pass_rate`).
`prompt`	`str`	The prompt template used for the run, kept so that :meth:`to_catalog` can persist it alongside the metrics.

to_catalog ¶

to_catalog(catalog: PromptCatalog, prompt_name: str) -> None

Persist one row per client to catalog keyed by (prompt_name, client_name).

Catalog auto-versions on each store, so re-running a benchmark with the same prompt_name appends new versions rather than overwriting.

DeepEval integration¶

Requires aimu[evals].

aimu.evals.DeepEvalModel ¶

DeepEvalModel(model_client: BaseModelClient)

Bases: DeepEvalBaseLLM

Wraps any AIMU BaseModelClient for use as a DeepEval judge model.

aimu.evals.DeepEvalScorer ¶

DeepEvalScorer(metrics: list[BaseMetric], input_field: str = 'content', output_field: str = 'output')

Bases: Scorer

Score each row using one or more DeepEval metrics, averaged.

For simple free-text-criteria scoring without DeepEval, use :class:aimu.prompts.tuners.scorers.LLMJudgeScorer instead.

Parameters:

Name	Type	Description	Default
`metrics`	`list[BaseMetric]`	One or more DeepEval `BaseMetric` instances. Each metric should already be configured with its own judge model (typically wrapped via :class:`aimu.evals.DeepEvalModel`).	required
`input_field`	`str`	Row attribute used as `LLMTestCase.input`. Default `"content"` matches the column convention used by :class:`aimu.prompts.JudgedPromptTuner`.	`'content'`
`output_field`	`str`	Row attribute used as `LLMTestCase.actual_output`. Default `"output"` matches the column produced by :meth:`JudgedPromptTuner.generate_responses`.	`'output'`

aimu.evals¶

Benchmark¶

aimu.evals.Benchmark ¶

run ¶

aimu.evals.BenchmarkResults dataclass ¶

to_catalog ¶

DeepEval integration¶

aimu.evals.DeepEvalModel ¶

aimu.evals.DeepEvalScorer ¶

`aimu.evals`¶

aimu.evals.BenchmarkResults `dataclass` ¶