The LLM Evaluation Framework
For deepeval's latest release v0.21.15, we release:
- the `-c` flag for caching test results: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- the `-r` flag for repeating test runs: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
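Conceptually, caching skips test cases whose results were already computed, while repeats re-runs each test case several times. A minimal sketch of that interaction, using hypothetical names (`run_with_cache_and_repeats`, `score_fn`) rather than deepeval's actual internals:

```python
import hashlib
import json

def _cache_key(test_case: dict) -> str:
    """Hash a test case's contents to use as a cache key (illustrative only)."""
    return hashlib.sha256(json.dumps(test_case, sort_keys=True).encode()).hexdigest()

def run_with_cache_and_repeats(test_cases, score_fn, cache, repeats=1):
    """Sketch of what `-c` (cache) and `-r` (repeats) do conceptually:
    reuse cached results where possible, and score the rest N times."""
    results = {}
    for tc in test_cases:
        key = _cache_key(tc)
        if key in cache:  # -c: reuse the cached result instead of re-scoring
            results[key] = cache[key]
            continue
        scores = [score_fn(tc) for _ in range(repeats)]  # -r: repeat each run
        cache[key] = results[key] = sum(scores) / len(scores)
    return results
```

On a second run with the same cache, no test case is scored again.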
In deepeval v0.20.85:
- the `evaluate()` function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest
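The idea behind evaluating without pytest is a plain function call that applies metrics to test cases and returns results you can inspect programmatically. A hedged sketch of that shape, with hypothetical names (`evaluate_sketch`, `TestCaseResult`) standing in for deepeval's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TestCaseResult:
    """Illustrative container for one test case's metric outcomes."""
    input: str
    scores: dict = field(default_factory=dict)
    passed: bool = True

def evaluate_sketch(test_cases, metrics):
    """Sketch of an `evaluate()`-style entry point: apply every metric to
    every test case and collect pass/fail results, no pytest runner involved.
    `metrics` maps a metric name to a (measure_fn, threshold) pair."""
    results = []
    for tc in test_cases:
        result = TestCaseResult(input=tc["input"])
        for name, (measure, threshold) in metrics.items():
            score = measure(tc)
            result.scores[name] = score
            result.passed = result.passed and score >= threshold
        results.append(result)
    return results
```

Because it is just a function, you can call it from a notebook, a script, or a CI step without any test-framework scaffolding.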
For the newest release, deepeval is now stable for production use:
- `LLMTestCase` now has `execution_time` and `cost`, useful for those looking to evaluate on these parameters
- `minimum_score` is now `threshold` instead, meaning you can now create custom metrics that have either a "minimum" or "maximum" threshold
- `LLMEvalMetric` is now `GEval`
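The point of the `minimum_score` → `threshold` rename is that a single `threshold` value can act as either a floor (higher-is-better metrics like relevancy) or a ceiling (lower-is-better metrics like toxicity). A minimal sketch of that idea, with illustrative names rather than deepeval's actual base classes:

```python
class CustomMetric:
    """Sketch of a custom metric built around `threshold` instead of
    `minimum_score`. Depending on the metric's direction, the same field
    serves as a "minimum" or a "maximum" threshold. Names here are
    illustrative, not deepeval's API."""

    def __init__(self, threshold: float, higher_is_better: bool = True):
        self.threshold = threshold
        self.higher_is_better = higher_is_better

    def is_successful(self, score: float) -> bool:
        if self.higher_is_better:
            return score >= self.threshold  # "minimum" threshold
        return score <= self.threshold      # "maximum" threshold
```

A relevancy-style metric would pass scores at or above its threshold, while a toxicity-style metric (constructed with `higher_is_better=False`) would pass scores at or below it.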
In this release:
- Removed `transformers`, `sentence_transformers`, and `pandas` as dependencies to reduce package size

Lots of new features this release:
- `JudgementalGPT` now allows for different languages - useful for our APAC and European friends
- `RAGAS` metrics now support all OpenAI models - useful for those running into context length issues
- `LLMEvalMetric` now returns a reason for its score
- `deepeval test run` now has hooks that are called on test run completion
- `evaluate()` now displays `retrieval_context` for RAG evaluation
- The `RAGAS` metric now displays a metric breakdown for all its distinct metrics
- Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
  - log evaluation results and analyze metric passes / fails
  - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
  - debug evaluation results via LLM traces
  - manage evaluation test cases / datasets in one place
  - track events to identify live LLM responses in production
  - add production events to existing evaluation datasets to strengthen evals over time
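Comparing hyperparameters based on evaluation results boils down to ranking each configuration (prompt template, chunk size, model, etc.) by its aggregate metric scores across a test run. A hedged sketch of that comparison, with hypothetical names (`best_hyperparameters`, `runs`) rather than Confident AI's actual implementation:

```python
def best_hyperparameters(runs):
    """Illustrative sketch of picking the optimal hyperparameter set by
    average evaluation score. `runs` maps a hyperparameter description
    (e.g. "model + chunk size") to that configuration's per-test-case
    scores from an evaluation run."""
    averages = {
        name: sum(scores) / len(scores)  # aggregate each run's scores
        for name, scores in runs.items()
    }
    # The configuration with the highest average score wins.
    return max(averages, key=averages.get), averages
```

In practice you would log one evaluation run per configuration and let the platform do this comparison for you, but the underlying decision is this simple ranking.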