aimu.prompts¶
Versioned prompt storage and hill-climbing prompt optimisation.
Catalog¶
aimu.prompts.Prompt ¶
Bases: Base
aimu.prompts.PromptCatalog ¶
store_prompt ¶
Store a prompt, auto-assigning version and created_at if not set.
retrieve_last ¶
Return the highest-versioned prompt for the given name and model.
retrieve_all ¶
Return all prompt versions for the given name and model, newest first.
delete_all ¶
Delete all versions for the given name and model. Returns the number of rows deleted.
retrieve_model_ids ¶
Return a deduplicated list of all stored model IDs.
retrieve_names ¶
Return a deduplicated list of all stored prompt names.
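The versioning behaviour described above can be illustrated with a small in-memory stand-in. This is a sketch only, not aimu's implementation: `InMemoryCatalog`, the plain-dict rows, and the field names `name`, `model_id`, `prompt`, `version`, and `created_at` are assumptions, since the real PromptCatalog constructor and the Prompt model's fields are not shown in this reference. It also assumes that "newest first" means highest version first and that a missing prompt yields None.

```python
from datetime import datetime, timezone


class InMemoryCatalog:
    """Hypothetical stand-in that mimics the documented PromptCatalog semantics."""

    def __init__(self):
        self._rows = []  # plain dicts, not aimu Prompt objects

    def store_prompt(self, row):
        # Auto-assign version and created_at only when they are not already set.
        if row.get("version") is None:
            existing = self.retrieve_all(row["name"], row["model_id"])
            row["version"] = existing[0]["version"] + 1 if existing else 1
        if row.get("created_at") is None:
            row["created_at"] = datetime.now(timezone.utc)
        self._rows.append(row)
        return row

    def retrieve_all(self, name, model_id):
        # All versions for (name, model_id), highest version first.
        matches = [
            r for r in self._rows
            if r["name"] == name and r["model_id"] == model_id
        ]
        return sorted(matches, key=lambda r: r["version"], reverse=True)

    def retrieve_last(self, name, model_id):
        versions = self.retrieve_all(name, model_id)
        return versions[0] if versions else None

    def delete_all(self, name, model_id):
        # Returns the number of rows removed.
        before = len(self._rows)
        self._rows = [
            r for r in self._rows
            if not (r["name"] == name and r["model_id"] == model_id)
        ]
        return before - len(self._rows)

    def retrieve_names(self):
        return sorted({r["name"] for r in self._rows})


catalog = InMemoryCatalog()
catalog.store_prompt({"name": "spam", "model_id": "model-a", "prompt": "v1 text"})
catalog.store_prompt({"name": "spam", "model_id": "model-a", "prompt": "v2 text"})
latest = catalog.retrieve_last("spam", "model-a")
```

Storing the same name and model twice yields versions 1 and 2, and `retrieve_last` returns the second.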
Tuner base¶
aimu.prompts.PromptTuner ¶
Bases: ABC
apply_prompt abstractmethod ¶
Apply prompt to data and return the annotated dataset.
The returned object must contain a boolean column named _correct
(True when the model's output matched ground truth, False otherwise).
Implementations may mutate data in-place or return a copy.
evaluate abstractmethod ¶
Score the annotated dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Annotated dataset as returned by apply_prompt. | required |
Returns:
| Type | Description |
|---|---|
| dict | Metrics dict. Must include an [...] loop can decide whether to keep or revert a mutation. |
mutation_prompt abstractmethod ¶
Build a prompt that asks the LLM to improve current_prompt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| current_prompt | str | The prompt version that failed on items. | required |
| items | list | List of incorrectly handled rows (pandas-like row objects with attribute access). | required |
Returns:
| Type | Description |
|---|---|
| str | A prompt string. The LLM's response is expected to contain the improved prompt wrapped in [...] |
score ¶
Scalar to maximise during hill-climbing. Default: metrics['accuracy'].
Override to optimise a different metric (e.g. metrics['score'] for
a judge-based tuner, or metrics['macro_f1'] for multi-class tasks).
extract_mutated_prompt ¶
Extract the improved prompt text from a mutation LLM response.
Default: parses content between <prompt> and </prompt> tags.
Override for a different extraction format (JSON, bare text, etc.).
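A minimal sketch of the default tag-based extraction described above, using only the standard library. The function name `extract_between_prompt_tags` is hypothetical, and stripping whitespace and returning None when no tags are found are assumptions rather than documented behaviour.

```python
import re
from typing import Optional


def extract_between_prompt_tags(response: str) -> Optional[str]:
    """Return the text inside the first <prompt>...</prompt> pair, or None."""
    # DOTALL lets the prompt body span multiple lines; .*? stops at the
    # first closing tag instead of the last one.
    match = re.search(r"<prompt>(.*?)</prompt>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


reply = "Here is a revision.\n<prompt>\nAnswer YES or NO only.\n</prompt>\nDone."
improved = extract_between_prompt_tags(reply)
```

An override for a different format (for example JSON) would replace the regular expression with the matching parser.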
tune ¶
tune(training_data: DataFrame, initial_prompt: str, max_iterations: int = 20, max_examples: int = 5, catalog: PromptCatalog | None = None, prompt_name: str | None = None) -> str
Hill-climbing loop: mutate the prompt until 100% accuracy is reached or max_iterations is exhausted.
Each iteration applies the current prompt and evaluates the results. While accuracy is below 100%, it picks up to max_examples incorrect items and asks the LLM to produce a better prompt. If a mutation makes accuracy worse, the previous best prompt is restored immediately without re-evaluating.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| training_data | DataFrame | Dataset passed to apply_prompt and evaluate. | required |
| initial_prompt | str | Starting prompt text. | required |
| max_iterations | int | Hard stop on the number of evaluation rounds. Prevents infinite loops when 100% is unreachable. | 20 |
| max_examples | int | Maximum number of incorrect examples passed to mutation_prompt per round (sampled randomly). | 5 |
| catalog | PromptCatalog \| None | Optional PromptCatalog. When provided (together with prompt_name), each improvement is persisted. | None |
| prompt_name | str \| None | Catalog key for saving prompt versions. | None |
Returns:
| Type | Description |
|---|---|
| str | The best prompt found. |
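The keep-or-revert logic of a hill-climbing loop can be sketched in a few lines. This is a simplified illustration, not the body of tune(): `hill_climb`, `toy_score`, and `toy_mutate` are hypothetical names, the toy mutation ignores failing examples, and treating an equal score as "not an improvement" is an assumption.

```python
def hill_climb(initial_prompt, score_fn, mutate_fn, max_iterations=20):
    """Keep a mutated prompt only when it scores higher than the best so far."""
    best_prompt, best_score = initial_prompt, score_fn(initial_prompt)
    history = [best_score]
    for _ in range(max_iterations):
        if best_score >= 1.0:  # perfect score reached, stop early
            break
        candidate = mutate_fn(best_prompt)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
        # Otherwise the candidate is dropped, which restores the previous best.
        history.append(best_score)
    return best_prompt, history


# Toy scoring: fraction of required words that appear in the prompt.
REQUIRED = ["YES", "NO", "only"]


def toy_score(prompt):
    return sum(word in prompt for word in REQUIRED) / len(REQUIRED)


# Toy mutation: a fixed sequence of suggestions standing in for LLM replies.
suggestions = iter([
    "Answer YES.",
    "Be brief.",  # scores worse than the current best and is discarded
    "Answer YES or NO.",
    "Answer YES or NO only.",
])


def toy_mutate(prompt):
    return next(suggestions)


best, history = hill_climb("Classify the text.", toy_score, toy_mutate, max_iterations=10)
```

The second suggestion scores lower than the current best, so the best score stays flat for that round before climbing again.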
Concrete tuners¶
aimu.prompts.ClassificationPromptTuner ¶
Bases: PromptTuner
Prompt tuner for binary (YES/NO) classification.
Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for classification tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |
classify_data ¶
Run the prompt on every row of data and return it with a
predicted_class boolean column added.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| classification_prompt | str | Prompt template with a [...] | required |
| data | DataFrame | DataFrame with a [...] | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with added [...] |
evaluate_results ¶
Compute accuracy, precision, and recall from actual_class /
predicted_class columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame with boolean [...] | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with [...] |
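As a rough illustration of the metrics named above, the following sketch computes accuracy, precision, and recall from boolean actual_class / predicted_class values. Plain dicts stand in for DataFrame rows so the example runs without pandas. The function name `binary_metrics`, the result keys, and returning 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(rows):
    """rows: list of dicts with boolean 'actual_class' and 'predicted_class'."""
    tp = sum(r["actual_class"] and r["predicted_class"] for r in rows)
    tn = sum(not r["actual_class"] and not r["predicted_class"] for r in rows)
    fp = sum(not r["actual_class"] and r["predicted_class"] for r in rows)
    fn = sum(r["actual_class"] and not r["predicted_class"] for r in rows)
    return {
        "accuracy": (tp + tn) / len(rows),
        # Of everything predicted positive, how much was right.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything actually positive, how much was found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


rows = [
    {"actual_class": True, "predicted_class": True},
    {"actual_class": True, "predicted_class": True},
    {"actual_class": True, "predicted_class": False},
    {"actual_class": False, "predicted_class": False},
    {"actual_class": False, "predicted_class": True},
]
metrics = binary_metrics(rows)
```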
aimu.prompts.MultiClassPromptTuner ¶
Bases: PromptTuner
Prompt tuner for multi-class (N-way) classification.
Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for multi-class tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |
| classes | list[str] | Ordered list of valid class name strings. The model is expected to output [...] | required |
classify_data ¶
Run the prompt on every row of data and return it with a
predicted_class column added.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| classification_prompt | str | Prompt template with a [...] | required |
| data | DataFrame | DataFrame with a [...] | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with added [...] |
Raises:
| Type | Description |
|---|---|
| ValueError | If the model output does not contain any known [...] |
evaluate_results ¶
Compute accuracy and per-class precision, recall, and F1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame with [...] | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with [...] |
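Per-class precision, recall, and F1 can be computed one class at a time by treating that class as the positive label. The sketch below uses plain dicts in place of DataFrame rows. The function name `multiclass_metrics` and the `per_class` key are hypothetical, and the unweighted mean used for `macro_f1` is the standard macro average rather than a confirmed detail of this library.

```python
def multiclass_metrics(rows, classes):
    """rows: list of dicts with string 'actual_class' and 'predicted_class'."""
    accuracy = sum(r["actual_class"] == r["predicted_class"] for r in rows) / len(rows)
    per_class = {}
    for c in classes:
        tp = sum(r["actual_class"] == c and r["predicted_class"] == c for r in rows)
        predicted = sum(r["predicted_class"] == c for r in rows)
        actual = sum(r["actual_class"] == c for r in rows)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro F1: unweighted mean of the per-class F1 scores.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(classes)
    return {"accuracy": accuracy, "per_class": per_class, "macro_f1": macro_f1}


rows = [
    {"actual_class": "cat", "predicted_class": "cat"},
    {"actual_class": "cat", "predicted_class": "dog"},
    {"actual_class": "dog", "predicted_class": "dog"},
    {"actual_class": "bird", "predicted_class": "bird"},
]
metrics = multiclass_metrics(rows, ["cat", "dog", "bird"])
```

Macro F1 weights every class equally, so a rare class that is handled badly lowers the score as much as a common one.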
aimu.prompts.ExtractionPromptTuner ¶
Bases: PromptTuner
Prompt tuner for structured field extraction.
Inherits the hill-climbing loop from PromptTuner.tune() and implements apply_prompt, evaluate, and mutation_prompt for extraction tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_client | 'BaseModelClient' | Any ModelClient instance. | required |
| fields | list[str] | List of field names to extract and compare. Only these fields are considered when computing [...] | required |
extract_data ¶
Run the prompt on every row of data and return it with an
extracted column (dict) added.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| extraction_prompt | str | Prompt template with a [...] | required |
| data | DataFrame | DataFrame with a [...] | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with added [...] |
evaluate_results ¶
Compute row-level accuracy and per-field match rates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame with [...] | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with [...] |
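One plausible reading of "row-level accuracy and per-field match rates" is sketched below: a row counts as correct only when every listed field matches, while each field also gets its own match rate. Plain dicts stand in for DataFrame rows. The `expected` key, the `field_accuracy` result key, the function name, and the use of exact equality are all assumptions.

```python
def extraction_metrics(rows, fields):
    """rows: list of dicts holding an 'expected' dict and an 'extracted' dict."""
    field_hits = {f: 0 for f in fields}
    rows_correct = 0
    for r in rows:
        # Only the listed fields are compared; extra extracted keys are ignored.
        matches = {f: r["extracted"].get(f) == r["expected"].get(f) for f in fields}
        for f, ok in matches.items():
            field_hits[f] += ok
        rows_correct += all(matches.values())
    n = len(rows)
    return {
        "accuracy": rows_correct / n,
        "field_accuracy": {f: field_hits[f] / n for f in fields},
    }


rows = [
    {
        "expected": {"name": "Ada", "year": "1843"},
        "extracted": {"name": "Ada", "year": "1843", "city": "London"},
    },
    {
        "expected": {"name": "Alan", "year": "1936"},
        "extracted": {"name": "Alan", "year": "1937"},
    },
]
metrics = extraction_metrics(rows, ["name", "year"])
```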
aimu.prompts.JudgedPromptTuner ¶
Bases: PromptTuner
Prompt tuner for open-ended generation tasks evaluated by a Scorer.
Overrides score() to optimise mean judge score rather than accuracy,
demonstrating the PromptTuner base-class extension point.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_client | 'BaseModelClient' | ModelClient that generates responses. | required |
| scorer | Scorer | Per-row [...] | required |
| pass_threshold | float | Minimum normalised score (0-1) for a row to be [...] | 0.7 |
generate_responses ¶
Apply prompt to every row's content and store results in an output column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Prompt template with a [...] | required |
| data | DataFrame | DataFrame with a [...] | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with added [...] |
judge_responses ¶
Run the configured Scorer on every row and store results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame with [...] | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with added [...] |
evaluate_results ¶
Compute mean judge score and pass rate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame with [...] | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with [...] |
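Mean judge score and pass rate can be sketched as follows, with plain dicts in place of DataFrame rows. The row key `judge_score`, the result key `pass_rate`, the function name, and treating the threshold as inclusive are assumptions. Only the `score` metric key and the 0.7 default threshold appear elsewhere in this reference.

```python
def judged_metrics(rows, pass_threshold=0.7):
    """rows: list of dicts with a normalised 'judge_score' in [0, 1]."""
    scores = [r["judge_score"] for r in rows]
    # A row passes when its score reaches the threshold (inclusive here).
    passed = [s >= pass_threshold for s in scores]
    return {
        "score": sum(scores) / len(scores),
        "pass_rate": sum(passed) / len(passed),
    }


rows = [
    {"judge_score": 0.9},
    {"judge_score": 0.75},
    {"judge_score": 0.4},
    {"judge_score": 0.55},
]
metrics = judged_metrics(rows)
```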
Scorers¶
aimu.prompts.Scorer ¶
Bases: ABC
Base class for per-row scorers used by JudgedPromptTuner.
Implementations receive a row that has at least content and
output attributes (pandas Series). They return (score, feedback)
where score is a float in [0, 1] and feedback is a string passed
through to the mutation prompt to help the model improve.
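The (score, feedback) contract can be illustrated with a rule-based scorer that needs no judge model. This sketch is written as a plain function because the name of the abstract method on Scorer is not shown in this reference. `length_scorer` and `max_words` are hypothetical, and SimpleNamespace stands in for a pandas Series with attribute access.

```python
from types import SimpleNamespace


def length_scorer(row, max_words=20):
    """Hypothetical scorer: full marks for outputs of at most max_words words."""
    n_words = len(row.output.split())
    if n_words <= max_words:
        return 1.0, "Length is within the limit."
    # Longer outputs are penalised proportionally, so the score stays in (0, 1).
    score = max_words / n_words
    return score, f"Output has {n_words} words; the limit is {max_words}."


short_row = SimpleNamespace(content="Summarise the memo.", output="Budget approved.")
long_row = SimpleNamespace(content="Summarise the memo.", output="word " * 40)

short_score, short_feedback = length_scorer(short_row)
long_score, long_feedback = length_scorer(long_row)
```

The feedback string is what gives a mutation prompt something concrete to act on, so it helps to state the failure in terms the model can fix.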
aimu.prompts.LLMJudgeScorer ¶
LLMJudgeScorer(judge_client: 'BaseModelClient', criteria: str, prompt_template: str | None = None, generate_kwargs: dict | None = None)
Bases: Scorer
Score each row by asking a judge model for a 1-10 rating.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| judge_client | 'BaseModelClient' | ModelClient used to evaluate outputs. | required |
| criteria | str | Plain-text description of what a good response looks like. | required |
| prompt_template | str \| None | Override the default judge prompt. Must contain [...] | None |
| generate_kwargs | dict \| None | Optional kwargs passed through to [...] | None |
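Turning a judge's free-text reply into the (score, feedback) pair a Scorer returns might look like the sketch below. It is not the library's parser: `parse_rating` is a hypothetical helper, dividing the rating by 10 is one possible normalisation (it maps 1-10 onto 0.1-1.0), and returning 0.0 when no rating is found is an assumption.

```python
import re


def parse_rating(judge_reply):
    """Return (score in [0, 1], feedback) from a reply such as 'Rating: 8. Too long.'"""
    # Match the first standalone integer from 1 to 10; numbers such as 12 are skipped.
    match = re.search(r"\b(10|[1-9])\b", judge_reply)
    if match is None:
        return 0.0, "Judge reply contained no 1-10 rating."
    rating = int(match.group(1))
    return rating / 10, judge_reply.strip()


score, feedback = parse_rating("Rating: 8. Too long.")
```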