Integrate DeepEval

The aimu[evals] extra ships two adapters:

DeepEvalModel: wraps any AIMU BaseModelClient as a DeepEval judge (DeepEvalBaseLLM). Pass it as model= to any DeepEval metric.
DeepEvalScorer: wraps a list of DeepEval metrics as an AIMU Scorer for JudgedPromptTuner and Benchmark.

Install

pip install "aimu[evals]"

Use any AIMU client as a DeepEval judge

import aimu
from aimu.evals import DeepEvalModel
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

judge = DeepEvalModel(aimu.client("ollama:qwen3.5:9b"))

metric = GEval(
    name="Correctness",
    criteria="Is the actual output factually correct and directly responsive to the question?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge,
    threshold=0.5,
)

case = LLMTestCase(input="What is the capital of France?", actual_output="Paris.")
metric.measure(case)
print(metric.score, metric.reason)

Swap in a stronger cloud judge for harder evaluations:

judge = DeepEvalModel(aimu.client("anthropic:claude-sonnet-4-6"))

DeepEvalModel works with any DeepEval metric (GEval, AnswerRelevancyMetric, FaithfulnessMetric, etc.) and any AIMU client (Ollama, HuggingFace, OpenAI, Anthropic, Gemini, OpenAI-compatible servers).

Drive prompt tuning with DeepEval metrics

DeepEvalScorer adapts one or more DeepEval metrics to AIMU's Scorer protocol so they plug into JudgedPromptTuner:

import aimu
from aimu.evals import DeepEvalModel, DeepEvalScorer
from aimu.prompts import JudgedPromptTuner
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

writer = aimu.client("ollama:qwen3.5:9b")
judge  = aimu.client("anthropic:claude-sonnet-4-6")

geval = GEval(
    name="quality",
    criteria="Concise, accurate, plain English.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=DeepEvalModel(judge),
)

# df is your tuning dataset, defined elsewhere
tuner = JudgedPromptTuner(model_client=writer, scorer=DeepEvalScorer([geval]))
best = tuner.tune(df, initial_prompt="Answer in one sentence: {content}")

For multiple metrics, pass them all; DeepEvalScorer averages the per-metric scores:

from deepeval.metrics import AnswerRelevancyMetric
scorer = DeepEvalScorer([geval, AnswerRelevancyMetric(model=DeepEvalModel(judge))])
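To make the averaging concrete, here is a minimal standalone sketch of combining per-metric scores with an unweighted mean. It assumes each metric reports a score in [0, 1]; the function name `average_metric_scores` and the sample scores are hypothetical, and this is not AIMU's actual implementation.

```python
from statistics import fmean


def average_metric_scores(scores: dict[str, float]) -> float:
    """Combine per-metric scores into one number using an unweighted mean."""
    if not scores:
        raise ValueError("at least one metric score is required")
    return fmean(scores.values())


# Hypothetical scores for one test case, keyed by metric name.
combined = average_metric_scores({"quality": 0.8, "answer_relevancy": 0.6})
print(round(combined, 2))
```

Because the mean is unweighted, a low score on one metric pulls the combined score down as much as a low score on any other.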

Use DeepEval metrics in benchmarks

The same DeepEvalScorer slots into Benchmark:

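import aimu
from aimu.evals import Benchmark, DeepEvalScorer

# prompt and df are defined elsewhere; geval is the metric from the tuning example above
bench = Benchmark(prompt=prompt, data=df, scorer=DeepEvalScorer([geval]))
results = bench.run({"qwen": aimu.client("ollama:qwen3.5:9b")})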

Structured output

DeepEvalModel.generate(prompt, schema=PydanticModel) parses the response as JSON and validates it against the Pydantic schema. To locate the JSON, AIMU tries the raw response first, then fenced code blocks, then {...} substrings:

import aimu
from aimu.evals import DeepEvalModel
from pydantic import BaseModel

class Answer(BaseModel):
    response: str
    confidence: float

judge = DeepEvalModel(aimu.client("anthropic:claude-sonnet-4-6"))
parsed: Answer = judge.generate("Reply as JSON with response and confidence.", schema=Answer)

DeepEval calls this transparently when a metric requires structured output.
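The three-step fallback described in this section (raw JSON, then fenced code blocks, then {...} substrings) can be sketched with the standard library alone. This is an illustrative approximation, not AIMU's parser: the helper `extract_json` and its exact matching rules are assumptions, and schema validation (for example with Pydantic's `model_validate`) would follow as a separate step.

```python
import json
import re

# Built from single backticks so this example does not contain a literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Return the first JSON object found, trying three increasingly lenient strategies."""
    candidates = [text.strip()]  # 1. the whole response as raw JSON
    candidates += [m.strip() for m in FENCED_BLOCK.findall(text)]  # 2. fenced blocks
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])  # 3. outermost {...} substring
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in model response")


payload = '{"response": "Paris", "confidence": 0.9}'
replies = [
    payload,  # raw JSON
    "Here you go:\n" + FENCE + "json\n" + payload + "\n" + FENCE,  # fenced block
    "Sure! " + payload + " Hope that helps.",  # JSON embedded in prose
]
results = [extract_json(reply) for reply in replies]
print(results[0] == results[1] == results[2])
```

Ordering the strategies from strictest to most lenient means a well-behaved model response is parsed exactly as written, and the looser substring match only applies when the stricter attempts fail.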

How `reference` columns are used

For benchmark data with a reference column, DeepEvalScorer populates LLMTestCase.expected_output from that column. Metrics that use the expected output (e.g. GEval with LLMTestCaseParams.EXPECTED_OUTPUT in evaluation_params) receive it automatically.
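As a rough picture of that mapping, the sketch below builds the keyword arguments for an LLMTestCase from one data row. The helper `build_case_kwargs` is hypothetical, and the `content` input column is an assumption borrowed from the `{content}` placeholder used earlier; AIMU's real logic may differ.

```python
def build_case_kwargs(row: dict, actual_output: str) -> dict:
    """Map one benchmark row plus the model's answer to LLMTestCase keyword arguments."""
    # Assumption: the input text lives in a "content" column.
    kwargs = {"input": row["content"], "actual_output": actual_output}
    reference = row.get("reference")
    if reference is not None:
        # Only set expected_output when the row actually carries a reference.
        kwargs["expected_output"] = str(reference)
    return kwargs


with_reference = build_case_kwargs(
    {"content": "What is the capital of France?", "reference": "Paris"},
    actual_output="Paris.",
)
without_reference = build_case_kwargs(
    {"content": "What is the capital of France?"},
    actual_output="Paris.",
)
print(sorted(with_reference), sorted(without_reference))
```

With DeepEval installed, the resulting dictionary could be unpacked as `LLMTestCase(**with_reference)`.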