aimu.evals¶
Benchmark harness and DeepEval adapters.
Benchmark¶
aimu.evals.Benchmark ¶
Benchmark(prompt: str, data: DataFrame, scorer: Scorer, pass_threshold: float = 0.7, generate_kwargs: Optional[dict] = None)
Run the same prompt and dataset across multiple model clients.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt
|
str
|
Prompt template containing a |
required |
data
|
DataFrame
|
DataFrame with at least a |
required |
scorer
|
Scorer
|
Per-row :class: |
required |
pass_threshold
|
float
|
Minimum normalised score (0-1) for a row to count
toward |
0.7
|
generate_kwargs
|
Optional[dict]
|
Optional kwargs forwarded to |
None
|
run ¶
Evaluate every client on the dataset and return aggregate metrics.
Each row is run as a fresh chat (client.reset()) so an agentic view
(from agent.as_model_client()) engages its agent loop. The client's
original messages are restored after its run completes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
clients
|
dict[str, BaseModelClient]
|
Mapping of display name to client. The display name is
used as the row label in the result DataFrame and as the
|
required |
Returns:
| Type | Description |
|---|---|
BenchmarkResults
|
class: |
aimu.evals.BenchmarkResults
dataclass
¶
Aggregate results from a :class:Benchmark run.
Attributes:
| Name | Type | Description |
|---|---|---|
metrics |
DataFrame
|
DataFrame indexed by client name with one column per metric
(currently |
prompt |
str
|
The prompt template used for the run, kept so that
:meth: |
to_catalog ¶
Persist one row per client to catalog keyed by (prompt_name, client_name).
Catalog auto-versions on each store, so re-running a benchmark with the same prompt_name appends new versions rather than overwriting.
DeepEval integration¶
Requires aimu[deepeval].
aimu.evals.DeepEvalModel ¶
Bases: DeepEvalBaseLLM
Wraps any AIMU BaseModelClient for use as a DeepEval judge model.
aimu.evals.DeepEvalScorer ¶
DeepEvalScorer(metrics: list[BaseMetric], input_field: str = 'content', output_field: str = 'output')
Bases: Scorer
Score each row using one or more DeepEval metrics, averaged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
list[BaseMetric]
|
One or more DeepEval |
required |
input_field
|
str
|
Row attribute used as |
'content'
|
output_field
|
str
|
Row attribute used as |
'output'
|