Skip to content

aimu.evals

Benchmark harness and DeepEval adapters.

Benchmark

aimu.evals.Benchmark

Benchmark(prompt: str, data: DataFrame, scorer: Scorer, pass_threshold: float = 0.7, generate_kwargs: Optional[dict] = None)

Run the same prompt and dataset across multiple model clients.

Parameters:

Name Type Description Default
prompt str

Prompt template containing a {content} placeholder.

required
data DataFrame

DataFrame with at least a content column. An optional reference column is forwarded to scorers that consume it (e.g. :class:aimu.evals.DeepEvalScorer).

required
scorer Scorer

Per-row :class:Scorer used to evaluate model outputs.

required
pass_threshold float

Minimum normalised score (0-1) for a row to count toward pass_rate. Default 0.7.

0.7
generate_kwargs Optional[dict]

Optional kwargs forwarded to client.chat(...).

None

run

run(clients: dict[str, BaseModelClient]) -> BenchmarkResults

Evaluate every client on the dataset and return aggregate metrics.

Each row is run as a fresh chat (client.reset()) so an agentic view (from agent.as_model_client()) engages its agent loop. The client's original messages are restored after its run completes.

Parameters:

Name Type Description Default
clients dict[str, BaseModelClient]

Mapping of display name to client. The display name is used as the row label in the result DataFrame and as the model_id when persisting to a catalog.

required

Returns:

Type Description
BenchmarkResults

class:BenchmarkResults with one row per client.

aimu.evals.BenchmarkResults dataclass

BenchmarkResults(metrics: DataFrame, prompt: str)

Aggregate results from a :class:Benchmark run.

Attributes:

Name Type Description
metrics DataFrame

DataFrame indexed by client name with one column per metric (currently score and pass_rate).

prompt str

The prompt template used for the run, kept so that :meth:to_catalog can persist it alongside the metrics.

to_catalog

to_catalog(catalog: PromptCatalog, prompt_name: str) -> None

Persist one row per client to catalog keyed by (prompt_name, client_name).

Catalog auto-versions on each store, so re-running a benchmark with the same prompt_name appends new versions rather than overwriting.

DeepEval integration

Requires aimu[deepeval].

aimu.evals.DeepEvalModel

DeepEvalModel(model_client: BaseModelClient)

Bases: DeepEvalBaseLLM

Wraps any AIMU BaseModelClient for use as a DeepEval judge model.

aimu.evals.DeepEvalScorer

DeepEvalScorer(metrics: list[BaseMetric], input_field: str = 'content', output_field: str = 'output')

Bases: Scorer

Score each row using one or more DeepEval metrics, averaged.

Parameters:

Name Type Description Default
metrics list[BaseMetric]

One or more DeepEval BaseMetric instances. Each metric should already be configured with its own judge model (typically wrapped via :class:aimu.evals.DeepEvalModel).

required
input_field str

Row attribute used as LLMTestCase.input. Default "content" matches the column convention used by :class:aimu.prompts.JudgedPromptTuner.

'content'
output_field str

Row attribute used as LLMTestCase.actual_output. Default "output" matches the column produced by :meth:JudgedPromptTuner.generate_responses.

'output'