`aimu.prompts`¶

Versioned prompt storage and hill-climbing prompt optimisation.

Catalog¶

aimu.prompts.Prompt ¶

Bases: Base

aimu.prompts.PromptCatalog ¶

PromptCatalog(db_path: str)

store_prompt ¶

store_prompt(prompt: Prompt) -> None

Store a prompt, auto-assigning version and created_at if not set.

retrieve_last ¶

retrieve_last(name: str, model_id: str) -> Prompt | None

Return the highest-versioned prompt for the given name and model.

retrieve_all ¶

retrieve_all(name: str, model_id: str) -> list[Prompt]

Return all prompt versions for the given name and model, newest first.

delete_all ¶

delete_all(name: str, model_id: str) -> int

Delete all versions for the given name and model. Returns row count.

retrieve_model_ids ¶

retrieve_model_ids() -> list[str]

Return a deduplicated list of all stored model IDs.

retrieve_names ¶

retrieve_names() -> list[str]

Return a deduplicated list of all stored prompt names.
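
The versioning behaviour described for the catalog (auto-assigned version, highest version returned by `retrieve_last`, newest first from `retrieve_all`, row count from `delete_all`) can be illustrated with a small standalone sketch. It does not use aimu: `MiniCatalog`, its table layout, the `text` column, and the choice to start versions at 1 are all hypothetical stand-ins, and the real `Prompt` model may carry different fields.

```python
import sqlite3
from typing import Optional


class MiniCatalog:
    """Hypothetical stand-in that mimics the documented versioning semantics."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS prompts ("
            "name TEXT, model_id TEXT, version INTEGER, text TEXT)"
        )

    def store(self, name: str, model_id: str, text: str) -> int:
        # Next version = highest stored version for (name, model_id) plus one.
        # Assumption: the first version is 1.
        row = self.conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM prompts "
            "WHERE name = ? AND model_id = ?",
            (name, model_id),
        ).fetchone()
        version = row[0] + 1
        self.conn.execute(
            "INSERT INTO prompts VALUES (?, ?, ?, ?)",
            (name, model_id, version, text),
        )
        self.conn.commit()
        return version

    def retrieve_all(self, name: str, model_id: str) -> list:
        # Newest (highest version) first.
        return self.conn.execute(
            "SELECT version, text FROM prompts "
            "WHERE name = ? AND model_id = ? ORDER BY version DESC",
            (name, model_id),
        ).fetchall()

    def retrieve_last(self, name: str, model_id: str) -> Optional[tuple]:
        rows = self.retrieve_all(name, model_id)
        return rows[0] if rows else None

    def delete_all(self, name: str, model_id: str) -> int:
        cursor = self.conn.execute(
            "DELETE FROM prompts WHERE name = ? AND model_id = ?",
            (name, model_id),
        )
        self.conn.commit()
        return cursor.rowcount  # number of deleted rows


catalog = MiniCatalog()
catalog.store("sentiment", "model-a", "first draft")
catalog.store("sentiment", "model-a", "second draft")
latest = catalog.retrieve_last("sentiment", "model-a")
```

Keying versions on the `(name, model_id)` pair means the same prompt name can evolve independently for each model.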

Tuner base¶

aimu.prompts.PromptTuner ¶

PromptTuner(model_client: 'BaseModelClient')

Bases: ABC

apply_prompt `abstractmethod` ¶

apply_prompt(prompt: str, data: DataFrame) -> pd.DataFrame

Apply prompt to data and return the annotated dataset.

The returned object must contain a boolean column named _correct (True when the model's output matched ground truth, False otherwise). Implementations may mutate data in-place or return a copy.

evaluate `abstractmethod` ¶

evaluate(data: DataFrame) -> dict

Score the annotated dataset.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Annotated dataset as returned by apply_prompt.	required

Returns:

Type	Description
`dict`	Metrics dict. Must include an `'accuracy'` key (float 0–1) so the loop can decide whether to keep or revert a mutation.

mutation_prompt `abstractmethod` ¶

mutation_prompt(current_prompt: str, items: list) -> str

Build a prompt that asks the LLM to improve current_prompt.

Parameters:

Name	Type	Description	Default
`current_prompt`	`str`	The prompt version that failed on items.	required
`items`	`list`	List of incorrectly handled rows (pandas-like row objects with attribute access).	required

Returns:

Type	Description
`str`	A prompt string. The LLM's response is expected to contain the improved prompt wrapped in `<prompt>...</prompt>` tags.

score ¶

score(metrics: dict) -> float

Scalar to maximise during hill-climbing. Default: metrics['accuracy'].

Override to optimise a different metric (e.g. metrics['score'] for a judge-based tuner, or metrics['macro_f1'] for multi-class tasks).

extract_mutated_prompt ¶

extract_mutated_prompt(result: str) -> str

Extract the improved prompt text from a mutation LLM response.

Default: parses content between <prompt> and </prompt> tags. Override for a different extraction format (JSON, bare text, etc.).
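
A minimal sketch of tag-based extraction, assuming a regular-expression approach. The function name `extract_between_prompt_tags` and the decision to raise `ValueError` when no tag pair is found are assumptions for illustration; the library's default may handle a missing tag differently.

```python
import re


def extract_between_prompt_tags(result: str) -> str:
    """Return the text between the first <prompt> and </prompt> pair."""
    # DOTALL lets the prompt span several lines; the non-greedy group stops
    # at the first closing tag.
    match = re.search(r"<prompt>(.*?)</prompt>", result, flags=re.DOTALL)
    if match is None:
        # Assumption: a missing tag pair is treated as an error in this sketch.
        raise ValueError("no <prompt>...</prompt> block in response")
    return match.group(1).strip()


response = "Here is a revision.\n<prompt>\nAnswer YES or NO: {content}\n</prompt>"
improved = extract_between_prompt_tags(response)
```

An override for a JSON or bare-text format would replace only this parsing step.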

tune ¶

tune(training_data: DataFrame, initial_prompt: str, max_iterations: int = 20, max_examples: int = 5, catalog: PromptCatalog | None = None, prompt_name: str | None = None) -> str

Hill-climbing loop: mutate the prompt until 100% accuracy is reached or max_iterations is exhausted.

Each iteration applies the current prompt, evaluates results, and, if accuracy improved, picks up to max_examples incorrect items and asks the LLM to produce a better prompt. If a mutation makes accuracy worse, the previous best prompt is restored immediately without re-evaluating.

Parameters:

Name	Type	Description	Default
`training_data`	`DataFrame`	Dataset passed to apply_prompt and evaluate.	required
`initial_prompt`	`str`	Starting prompt text.	required
`max_iterations`	`int`	Hard stop on the number of evaluation rounds. Prevents infinite loops when 100% is unreachable.	`20`
`max_examples`	`int`	Maximum number of incorrect examples passed to mutation_prompt per round (sampled randomly).	`5`
`catalog`	`PromptCatalog \| None`	Optional PromptCatalog. When provided (together with prompt_name), each improvement is persisted.	`None`
`prompt_name`	`str \| None`	Catalog key for saving prompt versions.	`None`

Returns:

Type	Description
`str`	The best prompt found.
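
The loop described above can be sketched without aimu or pandas. This is one reading of the description, not the library's code: `hill_climb`, `toy_apply`, and `toy_mutate` are hypothetical names, rows are plain dicts rather than a DataFrame, a tie is treated the same as a regression, and catalog persistence is left out.

```python
import random
from typing import Callable


def hill_climb(
    initial_prompt: str,
    rows: list,
    apply_prompt: Callable[[str, list], list],
    mutate: Callable[[str, list], str],
    max_iterations: int = 20,
    max_examples: int = 5,
    seed: int = 0,
) -> str:
    rng = random.Random(seed)
    best_prompt, best_score, best_failures = initial_prompt, -1.0, []
    candidate = initial_prompt
    for _ in range(max_iterations):
        annotated = apply_prompt(candidate, rows)
        score = sum(r["_correct"] for r in annotated) / len(annotated)
        if score > best_score:
            # Improvement: keep the candidate and remember what it got wrong.
            best_prompt, best_score = candidate, score
            best_failures = [r for r in annotated if not r["_correct"]]
        # Otherwise the candidate is dropped and the best prompt stays in place.
        if best_score >= 1.0:
            break
        # Show the mutator a random sample of the best prompt's failures.
        sample_size = min(max_examples, len(best_failures))
        examples = rng.sample(best_failures, sample_size)
        candidate = mutate(best_prompt, examples)
    return best_prompt


def toy_apply(prompt: str, rows: list) -> list:
    # Toy "model": a row counts as correct when the prompt mentions its content.
    return [{**r, "_correct": r["content"] in prompt} for r in rows]


def toy_mutate(prompt: str, failures: list) -> str:
    # Toy "LLM": patch the prompt with every failing example it was shown.
    return prompt + " " + " ".join(f["content"] for f in failures)


rows = [{"content": "cat"}, {"content": "dog"}, {"content": "owl"}]
best = hill_climb("classify:", rows, toy_apply, toy_mutate, max_examples=1)
```

With `max_examples=1` the toy mutator fixes one failure per round, so the run needs four evaluation rounds to reach full accuracy.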

Concrete tuners¶

aimu.prompts.ClassificationPromptTuner ¶

ClassificationPromptTuner(model_client: 'BaseModelClient')

Bases: PromptTuner

Prompt tuner for binary (YES/NO) classification.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for classification tasks.

Parameters:

Name	Type	Description	Default
`model_client`	`'BaseModelClient'`	Any ModelClient instance.	required

classify_data ¶

classify_data(classification_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with a predicted_class boolean column added.

Parameters:

Name	Type	Description	Default
`classification_prompt`	`str`	Prompt template with a `{content}` placeholder.	required
`data`	`DataFrame`	DataFrame with a `content` column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with added `predicted_class` column.

evaluate_results ¶

evaluate_results(data: DataFrame) -> dict

Compute accuracy, precision, and recall from actual_class / predicted_class columns.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with boolean `actual_class` and `predicted_class` columns.	required

Returns:

Type	Description
`dict`	Dict with `accuracy`, `precision`, and `recall` keys.
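
For reference, the three metrics follow the usual definitions over true positives, false positives, and false negatives. The sketch below computes them from two boolean lists; `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption, not documented behaviour.

```python
def binary_metrics(actual: list, predicted: list) -> dict:
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # predicted YES, actually YES
    fp = sum((not a) and p for a, p in pairs)    # predicted YES, actually NO
    fn = sum(a and (not p) for a, p in pairs)    # predicted NO, actually YES
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Assumption: report 0.0 when a denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


metrics = binary_metrics(
    actual=[True, True, False, False],
    predicted=[True, False, True, False],
)
```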

aimu.prompts.MultiClassPromptTuner ¶

MultiClassPromptTuner(model_client: 'BaseModelClient', classes: list[str])

Bases: PromptTuner

Prompt tuner for multi-class (N-way) classification.

Constructor: MultiClassPromptTuner(model_client, classes). The classes argument (the list of valid category names) is required, unlike the tuners that take only model_client.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for multi-class tasks.

Parameters:

Name	Type	Description	Default
`model_client`	`'BaseModelClient'`	Any ModelClient instance.	required
`classes`	`list[str]`	Ordered list of valid class name strings. The model is expected to output `[ClassName]` for one of these.	required

classify_data ¶

classify_data(classification_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with a predicted_class column added.

Parameters:

Name	Type	Description	Default
`classification_prompt`	`str`	Prompt template with a `{content}` placeholder.	required
`data`	`DataFrame`	DataFrame with a `content` column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with added `predicted_class` column.

Raises:

Type	Description
`ValueError`	If the model output does not contain any known `[ClassName]`.
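
A sketch of how a `[ClassName]` marker might be pulled out of free-form model output. `parse_class` is a hypothetical helper; taking the first known bracketed token and matching case-sensitively are assumptions.

```python
import re


def parse_class(output: str, classes: list) -> str:
    # Inspect every bracketed token and return the first one that is a known class.
    for token in re.findall(r"\[([^\[\]]+)\]", output):
        name = token.strip()
        if name in classes:
            return name
    raise ValueError(f"no known class in model output: {output!r}")


label = parse_class(
    "The message is about an invoice. Final answer: [Billing]",
    ["Billing", "Shipping", "Other"],
)
```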

evaluate_results ¶

evaluate_results(data: DataFrame) -> dict

Compute accuracy and per-class precision, recall, and F1.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with `actual_class`, `predicted_class` columns.	required

Returns:

Type	Description
`dict`	Dict with `accuracy`, `macro_f1`, and per-class `per_class_{name}_precision`, `per_class_{name}_recall`, `per_class_{name}_f1` keys.
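
Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below builds a dict with the key names listed above from two lists of labels; `multiclass_metrics` is a hypothetical helper, and scoring an undefined precision or recall as 0.0 is an assumption.

```python
def multiclass_metrics(actual: list, predicted: list, classes: list) -> dict:
    pairs = list(zip(actual, predicted))
    metrics = {"accuracy": sum(a == p for a, p in pairs) / len(pairs)}
    f1_scores = []
    for name in classes:
        # One-vs-rest counts for this class.
        tp = sum(a == name and p == name for a, p in pairs)
        fp = sum(a != name and p == name for a, p in pairs)
        fn = sum(a == name and p != name for a, p in pairs)
        # Assumption: undefined ratios are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[f"per_class_{name}_precision"] = precision
        metrics[f"per_class_{name}_recall"] = recall
        metrics[f"per_class_{name}_f1"] = f1
        f1_scores.append(f1)
    # Macro average: every class counts equally, regardless of its frequency.
    metrics["macro_f1"] = sum(f1_scores) / len(f1_scores)
    return metrics


metrics = multiclass_metrics(
    actual=["a", "a", "b", "b"],
    predicted=["a", "b", "b", "b"],
    classes=["a", "b"],
)
```

Because each class contributes equally, `macro_f1` penalises poor performance on rare classes more than `accuracy` does, which is why it is a natural target for a `score` override.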

aimu.prompts.ExtractionPromptTuner ¶

ExtractionPromptTuner(model_client: 'BaseModelClient', fields: list[str])

Bases: PromptTuner

Prompt tuner for structured field extraction.

Constructor: ExtractionPromptTuner(model_client, fields). The fields argument (the field names to extract and compare) is required, unlike the tuners that take only model_client.

Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for extraction tasks.

Parameters:

Name	Type	Description	Default
`model_client`	`'BaseModelClient'`	Any ModelClient instance.	required
`fields`	`list[str]`	List of field names to extract and compare. Only these fields are considered when computing `_correct` and metrics.	required

extract_data ¶

extract_data(extraction_prompt: str, data: DataFrame) -> pd.DataFrame

Run the prompt on every row of data and return it with an extracted column (dict) added.

Parameters:

Name	Type	Description	Default
`extraction_prompt`	`str`	Prompt template with a `{content}` placeholder.	required
`data`	`DataFrame`	DataFrame with a `content` column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with added `extracted` column.

evaluate_results ¶

evaluate_results(data: DataFrame) -> dict

Compute row-level accuracy and per-field match rates.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with `expected` (dict) and `extracted` (dict) columns, plus a `_correct` boolean column.	required

Returns:

Type	Description
`dict`	Dict with `accuracy` (fraction of fully-correct rows) and `field_{name}_accuracy` for each field in self.fields.
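
The difference between row-level accuracy and per-field match rates is easiest to see on a small example. In this sketch `extraction_metrics` is a hypothetical helper that works on plain dicts, compares values with exact equality, and recomputes row correctness itself instead of reading a `_correct` column; all of these are assumptions.

```python
def extraction_metrics(rows: list, fields: list) -> dict:
    metrics = {}
    for name in fields:
        # Share of rows where this one field matches the expected value.
        hits = sum(r["extracted"].get(name) == r["expected"].get(name) for r in rows)
        metrics[f"field_{name}_accuracy"] = hits / len(rows)
    # A row is fully correct only when every listed field matches.
    fully_correct = sum(
        all(r["extracted"].get(f) == r["expected"].get(f) for f in fields)
        for r in rows
    )
    metrics["accuracy"] = fully_correct / len(rows)
    return metrics


rows = [
    {"expected": {"name": "Ada", "year": 1843}, "extracted": {"name": "Ada", "year": 1843}},
    {"expected": {"name": "Alan", "year": 1936}, "extracted": {"name": "Alan", "year": 1950}},
]
metrics = extraction_metrics(rows, ["name", "year"])
```

Here the `name` field is always right while `year` is right half the time, so only half of the rows count as fully correct.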

aimu.prompts.JudgedPromptTuner ¶

JudgedPromptTuner(model_client: 'BaseModelClient', scorer: Scorer, pass_threshold: float = 0.7)

Bases: PromptTuner

Prompt tuner for open-ended generation tasks evaluated by a `Scorer`.

Overrides score() to optimise mean judge score rather than accuracy, demonstrating the PromptTuner base-class extension point.

Parameters:

Name	Type	Description	Default
`model_client`	`'BaseModelClient'`	ModelClient that generates responses.	required
`scorer`	`Scorer`	Per-row `Scorer` used for evaluation. Use `aimu.prompts.tuners.scorers.LLMJudgeScorer` for LLM-as-judge or `aimu.evals.DeepEvalScorer` for DeepEval-backed metrics.	required
`pass_threshold`	`float`	Minimum normalised score (0-1) for a row to be `_correct`. Default `0.7`.	`0.7`

score ¶

score(metrics: dict) -> float

Optimise mean judge score rather than binary accuracy.

generate_responses ¶

generate_responses(prompt: str, data: DataFrame) -> pd.DataFrame

Apply prompt to every row's content and store results in an output column.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	Prompt template with a `{content}` placeholder.	required
`data`	`DataFrame`	DataFrame with a `content` column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with added `output` column.

judge_responses ¶

judge_responses(data: DataFrame) -> pd.DataFrame

Run the configured `Scorer` on every row and store results.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with `content` and `output` columns.	required

Returns:

Type	Description
`DataFrame`	DataFrame with added `judge_score` (float 0-1) and `judge_feedback` columns.

evaluate_results ¶

evaluate_results(data: DataFrame) -> dict

Compute mean judge score and pass rate.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	DataFrame with `judge_score` (float) and `_correct` (bool) columns.	required

Returns:

Type	Description
`dict`	Dict with `score` (mean judge score, 0-1) and `pass_rate` (fraction passing).
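
A sketch of the two aggregates, assuming a row passes when its score is greater than or equal to `pass_threshold` (the parameter is described as a minimum, so an inclusive comparison seems likely but is not confirmed). `judged_metrics` is a hypothetical helper working on a plain list of scores.

```python
def judged_metrics(judge_scores: list, pass_threshold: float = 0.7) -> dict:
    # Assumption: a score equal to the threshold counts as a pass.
    passed = [score >= pass_threshold for score in judge_scores]
    return {
        "score": sum(judge_scores) / len(judge_scores),
        "pass_rate": sum(passed) / len(passed),
    }


metrics = judged_metrics([0.5, 0.75, 1.0])
```

The mean keeps rewarding partial improvements on rows that still fail the threshold, which gives the hill-climbing loop a smoother signal than the pass rate alone.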

Scorers¶

aimu.prompts.Scorer ¶

Bases: ABC

Base class for per-row scorers used by `JudgedPromptTuner`.

Implementations receive a row that has at least content and output attributes (pandas Series). They return (score, feedback) where score is a float in [0, 1] and feedback is a string passed through to the mutation prompt to help the model improve.

Built-in implementations: `LLMJudgeScorer` (a single judge model rating 1-10, configured with criteria) and `aimu.evals.deepeval_scorer.DeepEvalScorer` (one or more pre-configured DeepEval metrics, averaged).
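
To make the `(score, feedback)` contract concrete, here is a rule-based scoring function that needs no judge model. It is written as a plain function because the abstract method name is not shown here; `length_scorer` and `max_words` are hypothetical, and `SimpleNamespace` stands in for a pandas row with `content` and `output` attributes.

```python
from types import SimpleNamespace


def length_scorer(row, max_words: int = 20) -> tuple:
    """Score 1.0 for outputs within the word limit, less as they overrun it."""
    words = len(row.output.split())
    if words <= max_words:
        return 1.0, "Length is within the limit."
    # Over the limit: scale the score down proportionally, staying in [0, 1].
    score = max_words / words
    return score, f"Output has {words} words; keep it under {max_words}."


row = SimpleNamespace(content="Summarise the report.", output="word " * 40)
score, feedback = length_scorer(row)
```

The feedback string matters as much as the number, since it is what the mutation prompt can use to steer the next revision.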

aimu.prompts.LLMJudgeScorer ¶

LLMJudgeScorer(judge_client: 'BaseModelClient', criteria: str, prompt_template: str | None = None, generate_kwargs: dict | None = None)

Bases: Scorer

Score each row by asking a judge model for a 1-10 rating.

Configured with free-text criteria. For metric-based scoring (relevancy, faithfulness, etc.) use `aimu.evals.deepeval_scorer.DeepEvalScorer` instead.

Parameters:

Name	Type	Description	Default
`judge_client`	`'BaseModelClient'`	ModelClient used to evaluate outputs.	required
`criteria`	`str`	Plain-text description of what a good response looks like.	required
`prompt_template`	`str \| None`	Override the default judge prompt. Must contain `{criteria}`, `{content}`, `{output}`, and `{reference_line}` placeholders.	`None`
`generate_kwargs`	`dict \| None`	Optional kwargs passed through to `judge_client.generate`.	`None`
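
A custom `prompt_template` has to carry the four placeholders listed above. How the library fills the template, what `{reference_line}` expands to, and how the 1-10 rating is mapped onto 0-1 are not stated here, so the sketch below rests on assumptions: `str.format` substitution, an optional reference-answer line, a `Rating: N` reply format, and division by 10. All names in it are hypothetical.

```python
import re

TEMPLATE = (
    "Criteria: {criteria}\n"
    "Input: {content}\n"
    "Response: {output}\n"
    "{reference_line}"
    "Rate the response from 1 to 10 in the form 'Rating: N'."
)


def build_judge_prompt(criteria: str, content: str, output: str, reference=None) -> str:
    # Assumption: the reference line is empty when no reference answer exists.
    reference_line = f"Reference: {reference}\n" if reference else ""
    return TEMPLATE.format(
        criteria=criteria,
        content=content,
        output=output,
        reference_line=reference_line,
    )


def parse_rating(reply: str) -> float:
    match = re.search(r"Rating:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"no rating found in judge reply: {reply!r}")
    # Clamp to the 1-10 scale, then normalise (assumed mapping: rating / 10).
    rating = min(max(int(match.group(1)), 1), 10)
    return rating / 10


judge_prompt = build_judge_prompt(
    "Concise and polite", "Refund request", "Sure, your refund has been issued."
)
normalised = parse_rating("The answer is clear. Rating: 8")
```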
