Model Evaluation
The evaluation module helps you find the most cost-effective model that meets your validation requirements. It tests models from most expensive to least expensive, finding the cheapest one that achieves your desired pass rate.
Quick Start
lamia eval <your_script.lm>
Search Strategies
| Strategy | Description |
| --- | --- |
| binary_search (default) | Efficiently finds cheapest working model |
| step_back | Tests from cheapest up to most expensive model |
With both strategies, the eval extracts all LLM prompts from the script and evaluates them one by one.
Evaluation stops at the first prompt for which no model in the range succeeds. If at least one model passes for a prompt, evaluation moves on to the next prompt.
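The search described above can be pictured with a minimal sketch. This is not Lamia's actual implementation: the function names, the `passes_for` callback, and the placeholder model names are all hypothetical, and the binary search assumes that if a model passes, every more expensive model passes too.

```python
from typing import Callable, Dict, Optional, Sequence


def cheapest_passing_model(
    models_by_cost: Sequence[str],
    passes: Callable[[str], bool],
) -> Optional[str]:
    """Binary-search a cheapest-first model list for the cheapest passing model.

    Assumption: passing is monotonic in cost (a passing model implies that
    all more expensive models pass as well). Returns None if nothing passes.
    """
    lo, hi = 0, len(models_by_cost) - 1
    best: Optional[str] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(models_by_cost[mid]):
            best = models_by_cost[mid]
            hi = mid - 1  # a cheaper model might also pass
        else:
            lo = mid + 1  # only more expensive models can pass
    return best


def evaluate_prompts(
    prompts: Sequence[str],
    models_by_cost: Sequence[str],
    passes_for: Callable[[str, str], bool],
) -> Dict[str, Optional[str]]:
    """Evaluate prompts one by one, stopping at the first with no passing model."""
    results: Dict[str, Optional[str]] = {}
    for prompt in prompts:
        model = cheapest_passing_model(
            models_by_cost, lambda m: passes_for(prompt, m)
        )
        results[prompt] = model
        if model is None:
            break  # no model in the range succeeds, so stop here
    return results


# Hypothetical usage with placeholder model names, cheapest first.
models = ["cheap-model", "mid-model", "premium-model"]
working = {"mid-model", "premium-model"}
choice = cheapest_passing_model(models, lambda m: m in working)
```

A linear scan from the cheapest model upward would find the same answer without the monotonicity assumption, at the cost of more model calls when only expensive models pass.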
Advanced usage through Python with pass rates
| Rate | Use Case |
| --- | --- |
| 100.0% (default) | Cheapest model that always works |
| 90.0% | Allow some failures, use with retry strategies |
| 75.0% | Creative tasks with more variation tolerance |
result = await evaluator.evaluate_prompt(
    prompt="My prompt",
    return_type=Markdown[MyRepresentationType],  # MyRepresentationType is a pydantic model
    max_model="openai:gpt-4o",
    required_pass_rate_percent=90.0,
)
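To make the pass-rate threshold concrete, here is a small sketch of how a pass rate could be checked against `required_pass_rate_percent`, and why a 90% rate pairs well with retries. The helper names are hypothetical, the number of trials is an assumption (the evaluator's actual trial count is not stated here), and the retry estimate assumes attempts fail independently.

```python
from typing import Sequence


def pass_rate_percent(outcomes: Sequence[bool]) -> float:
    """Percentage of validation attempts that passed."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def meets_requirement(
    outcomes: Sequence[bool], required_pass_rate_percent: float = 100.0
) -> bool:
    """True if the observed pass rate reaches the required threshold."""
    return pass_rate_percent(outcomes) >= required_pass_rate_percent


def success_within_attempts(pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries passes.

    Assumes each attempt passes independently with probability `pass_rate`.
    """
    return 1.0 - (1.0 - pass_rate) ** attempts


# Hypothetical run: 9 of 10 validation attempts passed.
outcomes = [True] * 9 + [False]
rate = pass_rate_percent(outcomes)
ok_at_90 = meets_requirement(outcomes, 90.0)
ok_at_100 = meets_requirement(outcomes)
with_one_retry = success_within_attempts(0.9, 2)  # roughly 0.99
```

Under these assumptions, a model that passes 90% of the time succeeds about 99% of the time when each call is allowed one retry, which is why a lower threshold can still be practical.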
Troubleshooting
| Issue | Solution |
| --- | --- |
| "No models available" | Check API keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, or ensure Ollama is running |
| "No pricing provider found" | Normal; evaluation works without pricing data |
| "Validation failed for all models" | Simplify the prompt, allow a higher max_model, or check return_type. Adding more hints to the return_type can also help; see the validation documentation for how to do this |
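For the "No models available" case, a quick way to see which provider keys are visible to your process is a check like the one below. It is a standalone sketch, not part of Lamia: the `configured_keys` helper is hypothetical, and it only inspects the two environment variables named in the table (it does not check whether Ollama is running).

```python
import os
from typing import List, Mapping, Optional

# Environment variables named in the troubleshooting table above.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the provider key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name)]


if __name__ == "__main__":
    found = configured_keys()
    if found:
        print("API keys found:", ", ".join(found))
    else:
        print("No API keys set; export one or make sure Ollama is running.")
```

Run it in the same shell where you invoke `lamia eval`, since keys exported in another session will not be visible.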