The LLM Evaluation Framework
For deepeval's latest release v0.21.15, we release:
- the `-c` flag for caching test results: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- the `-r` flag for repeating test runs: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
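Conceptually, caching skips test cases whose results were already computed, while repeats re-runs each test case several times. A minimal sketch of that interaction, using hypothetical names (`run_with_cache_and_repeats`, `score_fn`) rather than deepeval's actual internals:

```python
import hashlib
import json

def _cache_key(test_case: dict) -> str:
    """Hash a test case's contents to use as a cache key (illustrative only)."""
    return hashlib.sha256(json.dumps(test_case, sort_keys=True).encode()).hexdigest()

def run_with_cache_and_repeats(test_cases, score_fn, cache, repeats=1):
    """Sketch of what `-c` (cache) and `-r` (repeats) do conceptually:
    reuse cached results where possible, and score the rest N times."""
    results = {}
    for tc in test_cases:
        key = _cache_key(tc)
        if key in cache:  # -c: reuse the cached result instead of re-scoring
            results[key] = cache[key]
            continue
        scores = [score_fn(tc) for _ in range(repeats)]  # -r: repeat each run
        cache[key] = results[key] = sum(scores) / len(scores)
    return results
```

On a second run with the same cache, no test case is scored again.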
In deepeval v0.20.85:
- the `evaluate()` function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest
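The idea behind evaluating without pytest is a plain function call that applies metrics to test cases and returns results you can inspect programmatically. A hedged sketch of that shape, with hypothetical names (`evaluate_sketch`, `TestCaseResult`) standing in for deepeval's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TestCaseResult:
    """Illustrative container for one test case's metric outcomes."""
    input: str
    scores: dict = field(default_factory=dict)
    passed: bool = True

def evaluate_sketch(test_cases, metrics):
    """Sketch of an `evaluate()`-style entry point: apply every metric to
    every test case and collect pass/fail results, no pytest runner involved.
    `metrics` maps a metric name to a (measure_fn, threshold) pair."""
    results = []
    for tc in test_cases:
        result = TestCaseResult(input=tc["input"])
        for name, (measure, threshold) in metrics.items():
            score = measure(tc)
            result.scores[name] = score
            result.passed = result.passed and score >= threshold
        results.append(result)
    return results
```

Because it is just a function, you can call it from a notebook, a script, or a CI step without any test-framework scaffolding.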
For the newest release, deepeval is now stable for production use:
- `LLMTestCase` now has `execution_time` and `cost`, useful for those looking to evaluate on these parameters
- `minimum_score` is now `threshold` instead, meaning you can now create custom metrics that have either a "minimum" or "maximum" threshold
- `LLMEvalMetric` is now `GEval`
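The point of the `minimum_score` → `threshold` rename is that a single `threshold` value can act as either a floor (higher-is-better metrics like relevancy) or a ceiling (lower-is-better metrics like toxicity). A minimal sketch of that idea, with illustrative names rather than deepeval's actual base classes:

```python
class CustomMetric:
    """Sketch of a custom metric built around `threshold` instead of
    `minimum_score`. Depending on the metric's direction, the same field
    serves as a "minimum" or a "maximum" threshold. Names here are
    illustrative, not deepeval's API."""

    def __init__(self, threshold: float, higher_is_better: bool = True):
        self.threshold = threshold
        self.higher_is_better = higher_is_better

    def is_successful(self, score: float) -> bool:
        if self.higher_is_better:
            return score >= self.threshold  # "minimum" threshold
        return score <= self.threshold      # "maximum" threshold
```

A relevancy-style metric would pass scores at or above its threshold, while a toxicity-style metric (constructed with `higher_is_better=False`) would pass scores at or below it.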
In this release:
- Removed `transformers`, `sentence_transformers`, and `pandas` as dependencies to reduce package size

Lots of new features this release:
- `JudgementalGPT` now allows for different languages - useful for our APAC and European friends
- `RAGAS` metrics now support all OpenAI models - useful for those running into context length issues
- `LLMEvalMetric` now returns a reason for its score
- `deepeval test run` now has hooks that are called on test run completion
- `evaluate()` now displays `retrieval_context` for RAG evaluation
- The `RAGAS` metric now displays a metric breakdown for all its distinct metrics
- Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
  - log evaluation results and analyze metric passes / fails
  - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
  - debug evaluation results via LLM traces
  - manage evaluation test cases / datasets in one place
  - track events to identify live LLM responses in production
  - add production events to existing evaluation datasets to strengthen evals over time
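Comparing hyperparameters based on evaluation results boils down to ranking each configuration (prompt template, chunk size, model, etc.) by its aggregate metric scores across a test run. A hedged sketch of that comparison, with hypothetical names (`best_hyperparameters`, `runs`) rather than Confident AI's actual implementation:

```python
def best_hyperparameters(runs):
    """Illustrative sketch of picking the optimal hyperparameter set by
    average evaluation score. `runs` maps a hyperparameter description
    (e.g. "model + chunk size") to that configuration's per-test-case
    scores from an evaluation run."""
    averages = {
        name: sum(scores) / len(scores)  # aggregate each run's scores
        for name, scores in runs.items()
    }
    # The configuration with the highest average score wins.
    return max(averages, key=averages.get), averages
```

In practice you would log one evaluation run per configuration and let the platform do this comparison for you, but the underlying decision is this simple ranking.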