aimu.prompts

Versioned prompt storage and hill-climbing prompt optimisation.

Catalog

aimu.prompts.Prompt

Bases: Base

aimu.prompts.PromptCatalog

PromptCatalog(db_path: str)

store_prompt

store_prompt(prompt: Prompt) -> None

Store a prompt, auto-assigning version and created_at if not set.

retrieve_last

retrieve_last(name: str, model_id: str) -> Prompt | None

Return the highest-versioned prompt for the given name and model.

retrieve_all

retrieve_all(name: str, model_id: str) -> list[Prompt]

Return all prompt versions for the given name and model, newest first.

delete_all

delete_all(name: str, model_id: str) -> int

Delete all versions for the given name and model. Returns row count.

retrieve_model_ids

retrieve_model_ids() -> list[str]

Return a deduplicated list of all stored model IDs.

retrieve_names

retrieve_names() -> list[str]

Return a deduplicated list of all stored prompt names.
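
The versioning behaviour described above can be illustrated with a small stand-in built on the standard library's sqlite3 module. This is a sketch of the semantics only, not aimu's implementation: the `TinyCatalog` class, its table layout, and the `text` column are assumptions made for illustration, and unlike `store_prompt` it always assigns the version itself.

```python
import sqlite3
from typing import Optional


class TinyCatalog:
    """Hypothetical stand-in mimicking the documented PromptCatalog semantics."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS prompts ("
            "name TEXT, model_id TEXT, version INTEGER, text TEXT)"
        )

    def store_prompt(self, name: str, model_id: str, text: str) -> int:
        # Auto-assign the next version for this (name, model_id) pair.
        row = self.conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM prompts "
            "WHERE name = ? AND model_id = ?",
            (name, model_id),
        ).fetchone()
        version = row[0] + 1
        self.conn.execute(
            "INSERT INTO prompts VALUES (?, ?, ?, ?)",
            (name, model_id, version, text),
        )
        return version

    def retrieve_all(self, name: str, model_id: str) -> list:
        # Newest (highest version) first.
        return self.conn.execute(
            "SELECT version, text FROM prompts "
            "WHERE name = ? AND model_id = ? ORDER BY version DESC",
            (name, model_id),
        ).fetchall()

    def retrieve_last(self, name: str, model_id: str) -> Optional[tuple]:
        rows = self.retrieve_all(name, model_id)
        return rows[0] if rows else None

    def delete_all(self, name: str, model_id: str) -> int:
        # Returns the number of deleted rows.
        cursor = self.conn.execute(
            "DELETE FROM prompts WHERE name = ? AND model_id = ?",
            (name, model_id),
        )
        return cursor.rowcount


catalog = TinyCatalog()
catalog.store_prompt("spam", "model-a", "v1 text")
catalog.store_prompt("spam", "model-a", "v2 text")
latest = catalog.retrieve_last("spam", "model-a")
deleted = catalog.delete_all("spam", "model-a")
```

Keying versions on the (name, model_id) pair mirrors the signatures above, where every lookup takes both arguments.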

Tuner base

aimu.prompts.PromptTuner

PromptTuner(model_client: 'BaseModelClient')

Bases: ABC

apply_prompt abstractmethod

apply_prompt(prompt: str, data: DataFrame) -> pd.DataFrame

Apply prompt to data and return the annotated dataset.

The returned object must contain a boolean column named _correct (True when the model's output matched ground truth, False otherwise). Implementations may mutate data in-place or return a copy.

evaluate abstractmethod

evaluate(data: DataFrame) -> dict

Score the annotated dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Annotated dataset as returned by apply_prompt. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Metrics dict. Must include an 'accuracy' key (float 0–1) so the loop can decide whether to keep or revert a mutation. |

mutation_prompt abstractmethod

mutation_prompt(current_prompt: str, items: list) -> str

Build a prompt that asks the LLM to improve current_prompt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| current_prompt | str | The prompt version that failed on items. | required |
| items | list | List of incorrectly handled rows (pandas-like row objects with attribute access). | required |

Returns:

| Type | Description |
| --- | --- |
| str | A prompt string. The LLM's response is expected to contain the improved prompt wrapped in <prompt>...</prompt> tags. |
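
As a rough illustration of what a mutation_prompt implementation might build, the sketch below formats the failed rows into a rewrite request. The helper name `build_mutation_prompt` and the row attributes `content` and `actual_class` are assumptions for this example; the wording of the prompt is likewise illustrative, not the text aimu uses.

```python
from types import SimpleNamespace


def build_mutation_prompt(current_prompt: str, items: list) -> str:
    """Hypothetical helper: ask the LLM to rewrite a prompt given its failures."""
    # Each item is a row-like object with attribute access; the attribute
    # names used here (content, actual_class) are assumed for illustration.
    failures = "\n".join(
        f"- input: {item.content!r} (expected: {item.actual_class})"
        for item in items
    )
    return (
        "The following prompt produced wrong answers on the examples below.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Failed examples:\n{failures}\n\n"
        "Rewrite the prompt so these examples are handled correctly. "
        "Return the improved prompt wrapped in <prompt>...</prompt> tags."
    )


rows = [SimpleNamespace(content="Win a free cruise!", actual_class=True)]
mutation = build_mutation_prompt("Is this spam? {content}", rows)
```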

score

score(metrics: dict) -> float

Scalar to maximise during hill-climbing. Default: metrics['accuracy'].

Override to optimise a different metric (e.g. metrics['score'] for a judge-based tuner, or metrics['macro_f1'] for multi-class tasks).
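
The override pattern can be sketched with a stand-in base class. `BaseTunerSketch` and `MacroF1TunerSketch` are hypothetical names; only the default (metrics['accuracy']) and the idea of returning a different key come from the description above.

```python
class BaseTunerSketch:
    """Hypothetical stand-in for the documented default score() behaviour."""

    def score(self, metrics: dict) -> float:
        # Default: maximise accuracy.
        return metrics["accuracy"]


class MacroF1TunerSketch(BaseTunerSketch):
    """Optimises macro F1 instead of accuracy by overriding score()."""

    def score(self, metrics: dict) -> float:
        return metrics["macro_f1"]


metrics = {"accuracy": 0.9, "macro_f1": 0.6}
default_score = BaseTunerSketch().score(metrics)
overridden_score = MacroF1TunerSketch().score(metrics)
```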

extract_mutated_prompt

extract_mutated_prompt(result: str) -> str

Extract the improved prompt text from a mutation LLM response.

Default: parses content between <prompt> and </prompt> tags. Override for a different extraction format (JSON, bare text, etc.).
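
A minimal sketch of tag-based extraction, using only the standard library. The function name, the whitespace stripping, and the ValueError on a missing tag pair are assumptions; the source does not say how the default implementation handles a response without tags.

```python
import re


def extract_between_prompt_tags(result: str) -> str:
    """Return the text between the first <prompt> and </prompt> pair."""
    # DOTALL lets the match span multiple lines.
    match = re.search(r"<prompt>(.*?)</prompt>", result, flags=re.DOTALL)
    if match is None:
        # Assumed behaviour for this sketch only.
        raise ValueError("no <prompt>...</prompt> block found")
    return match.group(1).strip()


response = "Sure, here is a better version:\n<prompt>\nClassify: {content}\n</prompt>"
improved = extract_between_prompt_tags(response)
```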

tune

tune(training_data: DataFrame, initial_prompt: str, max_iterations: int = 20, max_examples: int = 5, catalog: PromptCatalog | None = None, prompt_name: str | None = None) -> str

Hill-climbing loop: mutate the prompt until 100% accuracy is reached or max_iterations is exhausted.

Each iteration applies the current prompt, evaluates results, and, if accuracy improved, picks up to max_examples incorrect items and asks the LLM to produce a better prompt. If a mutation makes accuracy worse, the previous best prompt is restored immediately without re-evaluating.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| training_data | DataFrame | Dataset passed to apply_prompt and evaluate. | required |
| initial_prompt | str | Starting prompt text. | required |
| max_iterations | int | Hard stop on the number of evaluation rounds. Prevents infinite loops when 100% is unreachable. | 20 |
| max_examples | int | Maximum number of incorrect examples passed to mutation_prompt per round (sampled randomly). | 5 |
| catalog | PromptCatalog \| None | Optional PromptCatalog. When provided (together with prompt_name), each improvement is persisted. | None |
| prompt_name | str \| None | Catalog key for saving prompt versions. | None |

Returns:

| Type | Description |
| --- | --- |
| str | The best prompt found. |
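
The loop described above can be sketched generically. This is a simplified hill climb under stated assumptions, not aimu's code: `hill_climb`, `evaluate`, and `mutate` are hypothetical callables standing in for apply_prompt/evaluate and the LLM mutation step, and the exact ordering of steps inside tune() may differ.

```python
import random
from typing import Callable


def hill_climb(
    initial_prompt: str,
    evaluate: Callable[[str], tuple],
    mutate: Callable[[str, list], str],
    max_iterations: int = 20,
    max_examples: int = 5,
) -> str:
    """evaluate(prompt) -> (score in [0, 1], failed items);
    mutate(prompt, failures) -> candidate prompt."""
    best_prompt, best_score, best_failures = initial_prompt, -1.0, []
    candidate = initial_prompt
    for _ in range(max_iterations):
        score, failures = evaluate(candidate)
        if score > best_score:
            # Keep the improvement.
            best_prompt, best_score, best_failures = candidate, score, failures
        # Otherwise the candidate is dropped and the previous best is kept
        # as-is, without evaluating it again.
        if best_score >= 1.0:
            break
        # Sample a bounded number of failures to show the mutation step.
        sample = random.sample(best_failures, min(max_examples, len(best_failures)))
        candidate = mutate(best_prompt, sample)
    return best_prompt


# Toy problem: a prompt "passes" once it mentions every required keyword.
REQUIRED = ["concise", "polite", "sourced"]


def toy_evaluate(prompt: str) -> tuple:
    missing = [kw for kw in REQUIRED if kw not in prompt]
    return 1 - len(missing) / len(REQUIRED), missing


def toy_mutate(prompt: str, failures: list) -> str:
    return f"{prompt} Requirement: {failures[0]}." if failures else prompt


best = hill_climb("Answer the question.", toy_evaluate, toy_mutate)
```

Mutating from the best-known prompt rather than from the rejected candidate is what makes a bad mutation harmless: it costs one iteration but never replaces a better prompt.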

Concrete tuners

aimu.prompts.ClassificationPromptTuner

ClassificationPromptTuner(model_client: 'BaseModelClient')

Bases: PromptTuner

Prompt tuner for binary (YES/NO) classification.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for classification tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |

classify_data

classify_data(classification_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with a predicted_class boolean column added.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| classification_prompt | str | Prompt template with a {content} placeholder. | required |
| data | DataFrame | DataFrame with a content column. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with added predicted_class column. |

evaluate_results

evaluate_results(data: DataFrame) -> dict

Compute accuracy, precision, and recall from actual_class / predicted_class columns.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | DataFrame with boolean actual_class and predicted_class columns. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Dict with accuracy, precision, and recall keys. |
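
For reference, the three metrics can be computed from boolean labels as in the sketch below. It works on plain lists rather than a DataFrame to stay dependency-free; the function name and the choice to return 0.0 when a denominator is zero are assumptions, not documented behaviour.

```python
def binary_metrics(actual: list, predicted: list) -> dict:
    """Accuracy, precision, and recall for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        # Zero-denominator handling is an assumption of this sketch.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


result = binary_metrics(
    actual=[True, True, False, False],
    predicted=[True, False, True, False],
)
```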

aimu.prompts.MultiClassPromptTuner

MultiClassPromptTuner(model_client: 'BaseModelClient', classes: list[str])

Bases: PromptTuner

Prompt tuner for multi-class (N-way) classification.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for multi-class tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |
| classes | list[str] | Ordered list of valid class name strings. The model is expected to output [ClassName] for one of these. | required |

classify_data

classify_data(classification_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with a predicted_class column added.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| classification_prompt | str | Prompt template with a {content} placeholder. | required |
| data | DataFrame | DataFrame with a content column. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with added predicted_class column. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model output does not contain any known [ClassName]. |
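
One way to picture the [ClassName] convention and its failure mode is the sketch below. The function `parse_class` is hypothetical, and matching the first bracketed token that appears in `classes` is an assumption; the source only states that a ValueError is raised when no known class is found.

```python
import re


def parse_class(output: str, classes: list) -> str:
    """Return the first bracketed token in `output` that is a known class."""
    for candidate in re.findall(r"\[([^\]]+)\]", output):
        if candidate in classes:
            return candidate
    raise ValueError(f"no known class in model output: {output!r}")


label = parse_class("This looks like [Billing] to me.", ["Billing", "Shipping"])
```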

evaluate_results

evaluate_results(data: DataFrame) -> dict

Compute accuracy and per-class precision, recall, and F1.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | DataFrame with actual_class, predicted_class columns. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Dict with accuracy, macro_f1, and per-class per_class_{name}_precision, per_class_{name}_recall, per_class_{name}_f1 keys. |
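
The key layout above can be reproduced with a short one-vs-rest computation. This sketch uses plain lists and a hypothetical function name; treating undefined precision, recall, or F1 as 0.0 is an assumption.

```python
def multiclass_metrics(actual: list, predicted: list, classes: list) -> dict:
    """Accuracy, macro F1, and per-class precision/recall/F1."""
    pairs = list(zip(actual, predicted))
    metrics = {"accuracy": sum(1 for a, p in pairs if a == p) / len(pairs)}
    f1_scores = []
    for name in classes:
        # One-vs-rest counts for this class.
        tp = sum(1 for a, p in pairs if a == name and p == name)
        fp = sum(1 for a, p in pairs if a != name and p == name)
        fn = sum(1 for a, p in pairs if a == name and p != name)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[f"per_class_{name}_precision"] = precision
        metrics[f"per_class_{name}_recall"] = recall
        metrics[f"per_class_{name}_f1"] = f1
        f1_scores.append(f1)
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    metrics["macro_f1"] = sum(f1_scores) / len(f1_scores)
    return metrics


report = multiclass_metrics(
    actual=["a", "a", "b", "b"],
    predicted=["a", "b", "b", "b"],
    classes=["a", "b"],
)
```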

aimu.prompts.ExtractionPromptTuner

ExtractionPromptTuner(model_client: 'BaseModelClient', fields: list[str])

Bases: PromptTuner

Prompt tuner for structured field extraction.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for extraction tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |
| fields | list[str] | List of field names to extract and compare. Only these fields are considered when computing _correct and metrics. | required |

extract_data

extract_data(extraction_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with an extracted column (dict) added.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extraction_prompt | str | Prompt template with a {content} placeholder. | required |
| data | DataFrame | DataFrame with a content column. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with added extracted column. |

evaluate_results

evaluate_results(data: DataFrame) -> dict

Compute row-level accuracy and per-field match rates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | DataFrame with expected (dict) and extracted (dict) columns, plus a _correct boolean column. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Dict with accuracy (fraction of fully-correct rows) and field_{name}_accuracy for each field in self.fields. |
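
Row-level accuracy and per-field match rates can be computed as in this sketch. It takes parallel lists of dicts instead of a DataFrame; the function name and the use of exact equality to decide a field match are assumptions, since the source does not specify the comparison rule.

```python
def extraction_metrics(expected: list, extracted: list, fields: list) -> dict:
    """Row-level accuracy plus a match rate for each field."""
    pairs = list(zip(expected, extracted))
    metrics = {}
    for name in fields:
        # A field matches when both dicts hold the same value (assumed rule).
        hits = sum(1 for exp, got in pairs if exp.get(name) == got.get(name))
        metrics[f"field_{name}_accuracy"] = hits / len(pairs)
    # A row is fully correct only if every listed field matches.
    fully_correct = sum(
        1 for exp, got in pairs if all(exp.get(f) == got.get(f) for f in fields)
    )
    metrics["accuracy"] = fully_correct / len(pairs)
    return metrics


scores = extraction_metrics(
    expected=[{"name": "Ada", "year": "1843"}, {"name": "Alan", "year": "1936"}],
    extracted=[{"name": "Ada", "year": "1843"}, {"name": "Alan", "year": "1950"}],
    fields=["name", "year"],
)
```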

aimu.prompts.JudgedPromptTuner

JudgedPromptTuner(model_client: 'BaseModelClient', scorer: Scorer, pass_threshold: float = 0.7)

Bases: PromptTuner

Prompt tuner for open-ended generation tasks evaluated by a Scorer.

Overrides score() to optimise mean judge score rather than accuracy, demonstrating the PromptTuner base-class extension point.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_client | 'BaseModelClient' | ModelClient that generates responses. | required |
| scorer | Scorer | Per-row Scorer used for evaluation. Use aimu.prompts.tuners.scorers.LLMJudgeScorer for LLM-as-judge or aimu.evals.DeepEvalScorer for DeepEval-backed metrics. | required |
| pass_threshold | float | Minimum normalised score (0-1) for a row to be _correct. Default 0.7. | 0.7 |

score

score(metrics: dict) -> float

Optimise mean judge score rather than binary accuracy.

generate_responses

generate_responses(prompt: str, data: DataFrame) -> pd.DataFrame

Apply prompt to every row's content and store results in an output column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | Prompt template with a {content} placeholder. | required |
| data | DataFrame | DataFrame with a content column. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with added output column. |

judge_responses

judge_responses(data: DataFrame) -> pd.DataFrame

Run the configured Scorer on every row and store results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | DataFrame with content and output columns. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with added judge_score (float 0-1) and judge_feedback columns. |

evaluate_results

evaluate_results(data: DataFrame) -> dict

Compute mean judge score and pass rate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | DataFrame with judge_score (float) and _correct (bool) columns. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | Dict with score (mean judge score, 0-1) and pass_rate (fraction passing). |
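
The two returned values relate to pass_threshold as in the sketch below. The function name is hypothetical, and treating a score exactly equal to the threshold as passing is an assumption drawn from the word "minimum" in the pass_threshold description.

```python
def judged_metrics(judge_scores: list, pass_threshold: float = 0.7) -> dict:
    """Mean judge score and the fraction of rows at or above the threshold."""
    # Inclusive comparison is an assumption of this sketch.
    passed = [s >= pass_threshold for s in judge_scores]
    return {
        "score": sum(judge_scores) / len(judge_scores),
        "pass_rate": sum(passed) / len(passed),
    }


summary = judged_metrics([1.0, 0.5, 0.75, 0.25])
```

Because score() returns the mean rather than the pass rate, a mutation that lifts a row from 0.25 to 0.5 still counts as progress even though that row continues to fail the threshold.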

Scorers

aimu.prompts.Scorer

Bases: ABC

Base class for per-row scorers used by JudgedPromptTuner.

Implementations receive a row that has at least content and output attributes (pandas Series). They return (score, feedback) where score is a float in [0, 1] and feedback is a string passed through to the mutation prompt to help the model improve.
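
The (score, feedback) contract can be sketched with a rule-based scorer. The class below does not subclass the real Scorer, and the method name `score_row` is an assumption because the abstract method's name is not shown here; only the row attributes and the return shape follow the description above.

```python
from types import SimpleNamespace


class MaxLengthScorerSketch:
    """Hypothetical scorer: penalise outputs longer than max_words."""

    def __init__(self, max_words: int) -> None:
        self.max_words = max_words

    def score_row(self, row) -> tuple:  # method name is an assumption
        words = len(row.output.split())
        if words <= self.max_words:
            return 1.0, "Length is within the limit."
        # Score shrinks towards 0 as the output grows past the limit.
        feedback = f"Output has {words} words; keep it to {self.max_words} or fewer."
        return self.max_words / words, feedback


row = SimpleNamespace(content="Summarise the report.", output="one two three four")
score, feedback = MaxLengthScorerSketch(max_words=2).score_row(row)
```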

aimu.prompts.LLMJudgeScorer

LLMJudgeScorer(judge_client: 'BaseModelClient', criteria: str, prompt_template: str | None = None, generate_kwargs: dict | None = None)

Bases: Scorer

Score each row by asking a judge model for a 1-10 rating.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| judge_client | 'BaseModelClient' | ModelClient used to evaluate outputs. | required |
| criteria | str | Plain-text description of what a good response looks like. | required |
| prompt_template | str \| None | Override the default judge prompt. Must contain {criteria}, {content}, {output}, and {reference_line} placeholders. | None |
| generate_kwargs | dict \| None | Optional kwargs passed through to judge_client.generate. | None |
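
A custom prompt_template might look like the sketch below, which fills the four required placeholders with str.format. The template wording is illustrative, and the meaning of {reference_line} is an assumption: it is treated here as an optional line carrying a reference answer, left empty when none is available.

```python
# Illustrative template; only the four placeholder names come from the docs.
template = (
    "You are grading a model response.\n"
    "Criteria: {criteria}\n"
    "Input: {content}\n"
    "Response: {output}\n"
    "{reference_line}"
    "Rate the response from 1 to 10."
)

judge_prompt = template.format(
    criteria="Accurate and under 50 words.",
    content="What is the capital of France?",
    output="Paris.",
    reference_line="",  # assumed to be empty when there is no reference answer
)
```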