LLM Testing SDK that helps you write and run tests to monitor your LLM app in production
This repository has moved. The latest version is now maintained at https://github.com/athina-ai/athina-evals
Athina is an LLM output testing SDK + observability platform that helps you write tests and monitor your app in production.
Reliability of output is one of the biggest challenges for people trying to use LLM apps in production.
Since LLM outputs are non-deterministic, it’s very hard to measure how good the output is.
Eyeballing the responses from an LLM can work during development, but it isn't a scalable solution.
In production, it's virtually impossible to eyeball thousands of responses, which means you have very little visibility into how well your LLM app is performing.
If these sound like problems to you (today or in the future), please reach out to us at [email protected]. We’d love to hear more!
```bash
pip install magik
```
See https://docs.magiklabs.app for instructions on how to write and run tests.
Who is this product meant for?
Test-driven development can speed up your development considerably and helps you engineer prompts that are more robust.
For example, suppose your prompt looks like this:
```
Create some marketing copy for a tweet of less than 280 characters for my app {app_name}.
My app helps people generate sales emails using AI.
Make sure the marketing copy contains a complete and valid link to my app.
Here is the link to my app: https://magiklabs.app.
```
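Each test below supplies `prompt_vars` that are substituted into the template's placeholders (here, `{app_name}`). As a rough illustration using plain Python string formatting (the SDK's actual templating may differ, and `PROMPT_TEMPLATE` is just a name we chose for this sketch):

```python
# Plain-Python illustration of placeholder substitution; magik fills these
# in for you from each test's "prompt_vars". PROMPT_TEMPLATE is our own name,
# not part of the SDK.
PROMPT_TEMPLATE = (
    "Create some marketing copy for a tweet of less than 280 characters "
    "for my app {app_name}.\n"
    "My app helps people generate sales emails using AI.\n"
    "Make sure the marketing copy contains a complete and valid link to my app.\n"
    "Here is the link to my app: https://magiklabs.app."
)

prompt = PROMPT_TEMPLATE.format(app_name="Uber")
```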
You can write tests like this:
from magik.evaluators import (
contains_none,
contains_link,
contains_valid_link,
is_positive_sentiment,
length_less_than,
)
# Local context - this is used as the "ground truth" data that you can compare against in your tests
test_context = {}
# Define tests here
def define_tests(context: dict):
return [
{
"description": "output contains a link",
"eval": contains_link(),
"prompt_vars": {
"app_name": "Uber",
},
"failure_labels": ["bad_response_format"],
},
{
"description": "output contains a valid link",
"eval": contains_valid_link(),
"prompt_vars": {
"app_name": "Magik",
},
"failure_labels": ["bad_response_format"],
},
{
"description": "output sentiment is positive",
"eval": is_positive_sentiment(),
"prompt_vars": {
"app_name": "Lyft",
},
"failure_labels": ["negative_sentiment"],
},
{
"description": "output length is less than 280 characters",
"eval": length_less_than(280),
"prompt_vars": {
"app_name": "Facebook",
},
"failure_labels": ["negative_sentiment", "critical"],
},
{
"description": "output does not contain hashtags",
"eval": contains_none(['#']),
"prompt_vars": {
"app_name": "Datadog",
},
"failure_labels": ["bad_response_format"],
},
]
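To make the test structure concrete, here is a minimal sketch of how a runner might consume the output of `define_tests`. It assumes each evaluator exposes an `evaluate(output)` method returning a result with a `passed` attribute; that interface, along with `generate_fn`, is a hypothetical assumption for illustration only, not magik's documented API (see the docs for the actual commands):

```python
# Hypothetical runner sketch -- NOT magik's actual implementation.
# Assumes each evaluator exposes .evaluate(output) -> result with .passed,
# and that generate_fn(prompt) calls your LLM app and returns its output.
def run_tests(tests: list, generate_fn, prompt_template: str) -> list:
    failures = []
    for test in tests:
        # Fill the prompt template with this test's variables
        prompt = prompt_template.format(**test["prompt_vars"])
        # Call your LLM app to produce an output for this prompt
        output = generate_fn(prompt)
        # Run the evaluator against the output (assumed interface)
        result = test["eval"].evaluate(output)
        if not result.passed:
            failures.append((test["description"], test["failure_labels"]))
    return failures


# Usage (with the definitions above):
# failures = run_tests(define_tests(test_context), my_llm_call, PROMPT_TEMPLATE)
```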
You can use our evaluation & monitoring platform to:

- Observe the prompt/response pairs in production, and analyze response times, cost, token usage, etc. for different prompts and date ranges.
- Evaluate your production responses against your own tests to get a quantifiable understanding of how well your LLM app is performing.
- Filter by failure labels, severity, prompt, etc. to identify the different types of errors occurring in your LLM outputs.
See https://magiklabs.app for more details, or contact us at [email protected]
Soon, you will also be able to:

- Fail bad outputs before they get to your users.
- Set up alerts to notify you about critical errors in production.
Contact us at [email protected] to get access to our LLM observability platform where you can run the tests you've defined here against your LLM responses in production.