Cleanlab Versions Save

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

v2.6.3

1 month ago

This release is non-breaking when upgrading from v2.6.2.

What's Changed

Updated image_key documentation by @sanjanag in https://github.com/cleanlab/cleanlab/pull/1048
Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in https://github.com/cleanlab/cleanlab/pull/1056
Add warning message about TensorFlow compatibility to docs by @elisno in https://github.com/cleanlab/cleanlab/pull/1057

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.2...v2.6.3

v2.6.2

1 month ago

This release is non-breaking when upgrading from v2.6.1.

What's Changed

Convert DataFrame features to numpy arrays in null value check by @elisno in https://github.com/cleanlab/cleanlab/pull/1045

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.1...v2.6.2

v2.6.1

1 month ago

This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:

The label quality score in the cleanlab.regression module is improved to be more human-readable.
- This only involves rescaling the scores to display a more human-interpretable range of scores, without affecting how your data points are ranked within a dataset according to these scores.
Better address some edge-cases in Datalab.get_issues().

What's Changed

Readme updates by @jwmueller in #1030, #1031, #1039; @elisno in #1040
Adjust the range of regression label quality scores by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/1032
Misc fixes of get_issues method by @elisno in #1025, #1026, #1028
Support features as input for data valuation check in Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/1023
Fix/clarify docs by @mturk24 in #1029; @elisno in #1024, #1037
CI/CD changes by @elisno in https://github.com/cleanlab/cleanlab/pull/1036

New Contributors

@mturk24 made their first contribution in https://github.com/cleanlab/cleanlab/pull/1029

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.0...v2.6.1

v2.6.0

2 months ago

This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements. However, this release drops support for Python 3.7 while adding support for Python 3.11.

Enhancements to Datalab

In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:

Identify null values in your dataset.
Detect class_imbalance.
Highlight an underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others. See our FAQ for more information on how to provide pre-defined groups for this issue type.

Additionally, Datalab can now optionally:

Assess the value of data points in your dataset using KNN-Shapley scores as a measure of data_valuation.

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Expanded Datalab Support for New ML Tasks

With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board. This release introduces the task parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.

from cleanlab import Datalab

lab = Datalab(..., task="regression")

The tasks currently supported are:

classification (default): Includes all previously supported issue-checking capabilities based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
regression (new):
- Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
- Find other issues utilizing features or a knn_graph.
multilabel (new):
- Detect label errors in multilabel classification datasets using pred_probs exclusively. Explore the updated capabilities in our multilabel tutorial.
- Find various other types of issues based on features or a knn_graph.

Improved Object Detection Dataset Exploration

New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection. Learn how to leverage some of these functions in our object detection tutorial.

Other Major Improvements

Rescaled Near Duplicate and Outlier Scores:
- Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
Consistency in counting label issues:
- cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues().
Improved handling of non-iid issues:
- The non-iid issue check in Datalab now handles pred_probs as input.
Better reporting in Datalab:
- Simplified Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
Enhanced Handling of Binary Classification Tasks:
- Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving the handling of binary classification tasks.
Experimental Functionality:
- cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.

New Contributors

We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:

@smttsp made their first contribution in https://github.com/cleanlab/cleanlab/pull/867
@abhijitpal1247 made their first contribution in https://github.com/cleanlab/cleanlab/pull/856
@01PrathamS made their first contribution in https://github.com/cleanlab/cleanlab/pull/893
@mglowacki100 made their first contribution in https://github.com/cleanlab/cleanlab/pull/796
@gibsonliketheguitar made their first contribution in https://github.com/cleanlab/cleanlab/pull/831
@kylegallatin made their first contribution in https://github.com/cleanlab/cleanlab/pull/885
@ryansingman made their first contribution in https://github.com/cleanlab/cleanlab/pull/919
@R-Peleg made their first contribution in https://github.com/cleanlab/cleanlab/pull/948

Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.

Change Log

Significant changes in this release include:

Update FAQ section in docs by @tataganesh in https://github.com/cleanlab/cleanlab/pull/869; @elisno in https://github.com/cleanlab/cleanlab/pull/913
Improve Object Detection module by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/840, https://github.com/cleanlab/cleanlab/pull/877; @aditya1503 in https://github.com/cleanlab/cleanlab/pull/883, https://github.com/cleanlab/cleanlab/pull/969, https://github.com/cleanlab/cleanlab/pull/968
Clearer documentation/tutorials/readme by @jwmueller in https://github.com/cleanlab/cleanlab/pull/851, https://github.com/cleanlab/cleanlab/pull/931, https://github.com/cleanlab/cleanlab/pull/981, https://github.com/cleanlab/cleanlab/pull/983, https://github.com/cleanlab/cleanlab/pull/1001, https://github.com/cleanlab/cleanlab/pull/978, https://github.com/cleanlab/cleanlab/pull/994, https://github.com/cleanlab/cleanlab/pull/1010; @01PrathamS in https://github.com/cleanlab/cleanlab/pull/893; @elisno in https://github.com/cleanlab/cleanlab/pull/878, https://github.com/cleanlab/cleanlab/pull/1007, https://github.com/cleanlab/cleanlab/pull/992, https://github.com/cleanlab/cleanlab/pull/1015, https://github.com/cleanlab/cleanlab/pull/1016; @huiwengoh in https://github.com/cleanlab/cleanlab/pull/984; @sanjanag in https://github.com/cleanlab/cleanlab/pull/936; @tataganesh in https://github.com/cleanlab/cleanlab/pull/916; @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/954;
CI updates by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/864; @elisno in https://github.com/cleanlab/cleanlab/pull/879, https://github.com/cleanlab/cleanlab/pull/961, https://github.com/cleanlab/cleanlab/pull/963, https://github.com/cleanlab/cleanlab/pull/965, https://github.com/cleanlab/cleanlab/pull/1008, https://github.com/cleanlab/cleanlab/pull/975, https://github.com/cleanlab/cleanlab/pull/1011, https://github.com/cleanlab/cleanlab/pull/1012, https://github.com/cleanlab/cleanlab/pull/1013, https://github.com/cleanlab/cleanlab/pull/1014; @jwmueller in https://github.com/cleanlab/cleanlab/pull/852, https://github.com/cleanlab/cleanlab/pull/865; @tataganesh in https://github.com/cleanlab/cleanlab/pull/900; @anishathalye in https://github.com/cleanlab/cleanlab/pull/956; @sanjanag in https://github.com/cleanlab/cleanlab/pull/1009
Docs system updates by @elisno in https://github.com/cleanlab/cleanlab/pull/880, https://github.com/cleanlab/cleanlab/pull/881, https://github.com/cleanlab/cleanlab/pull/958, https://github.com/cleanlab/cleanlab/pull/959, https://github.com/cleanlab/cleanlab/pull/960, https://github.com/cleanlab/cleanlab/pull/964
Add Null Issue Manager by @abhijitpal1247 in https://github.com/cleanlab/cleanlab/pull/856; @tataganesh in https://github.com/cleanlab/cleanlab/pull/927, https://github.com/cleanlab/cleanlab/pull/917
Add Data Valuation Issue Manager by @coding-famer in https://github.com/cleanlab/cleanlab/pull/850, https://github.com/cleanlab/cleanlab/pull/925
Extend non-iid issue check to run if only pred_probs are provided by @abhijitpal1247 in https://github.com/cleanlab/cleanlab/pull/857; @tataganesh in https://github.com/cleanlab/cleanlab/pull/896, https://github.com/cleanlab/cleanlab/pull/897
Add Underperforming Group Issue Manager by @tataganesh in https://github.com/cleanlab/cleanlab/pull/838, https://github.com/cleanlab/cleanlab/pull/907; @elisno in https://github.com/cleanlab/cleanlab/pull/990
Add Class Imbalance issue type to Datalab defaults by @tataganesh in https://github.com/cleanlab/cleanlab/pull/912, https://github.com/cleanlab/cleanlab/pull/933; @jwmueller in https://github.com/cleanlab/cleanlab/pull/924, https://github.com/cleanlab/cleanlab/pull/934; @elisno in https://github.com/cleanlab/cleanlab/pull/940
Add regression task to Datalab by @mglowacki100 in https://github.com/cleanlab/cleanlab/pull/796; @elisno in https://github.com/cleanlab/cleanlab/pull/902
Add multilabel task to Datalab by @tataganesh in https://github.com/cleanlab/cleanlab/pull/929
702 - Shorten Refs of classes and functions in Docs by @gibsonliketheguitar in https://github.com/cleanlab/cleanlab/pull/831
Update near duplicate issues and sets by @ryansingman in https://github.com/cleanlab/cleanlab/pull/919; @elisno in https://github.com/cleanlab/cleanlab/pull/895
Rescale near duplicate scores by @elisno in https://github.com/cleanlab/cleanlab/pull/943
Rescale outlier scores by @elisno in https://github.com/cleanlab/cleanlab/pull/953
List comprehension to numpy ops for efficiency by @tataganesh in https://github.com/cleanlab/cleanlab/pull/844
Reduce memory usage of filter.find_label_issues() by @kylegallatin in https://github.com/cleanlab/cleanlab/pull/885
Updates to tests by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/945; @elisno in https://github.com/cleanlab/cleanlab/pull/985, https://github.com/cleanlab/cleanlab/pull/998
Refactor Datalab functionality by @elisno in https://github.com/cleanlab/cleanlab/pull/971, https://github.com/cleanlab/cleanlab/pull/1006
Minor fixes for Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/997, https://github.com/cleanlab/cleanlab/pull/999, https://github.com/cleanlab/cleanlab/pull/1000, https://github.com/cleanlab/cleanlab/pull/1003, https://github.com/cleanlab/cleanlab/pull/1005, https://github.com/cleanlab/cleanlab/pull/979
Drop Python 3.7 support and add Python 3.11 support by @elisno in https://github.com/cleanlab/cleanlab/pull/980
Add a show_all_issues optional argument to Datalab.report() by @elisno in https://github.com/cleanlab/cleanlab/pull/970
Single Class Span Classification Support by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/982
ensure near-predicted labels are not flagged as label issues by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/950
PR template added and gitignore improved by @smttsp in https://github.com/cleanlab/cleanlab/pull/867
Update label issue count in dataset.health_summary() by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/875
Update segmentation.ipynb by @R-Peleg in https://github.com/cleanlab/cleanlab/pull/948
Refactor batching logic in cleanlab.segmentation.filter.find_label_issues by @elisno in https://github.com/cleanlab/cleanlab/pull/918

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.

v2.5.0

7 months ago

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:

regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
image segmentation: see cleanlab.segmentation and the "Semantic Segmentation tutorial.

Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor, check out the list in the README or learn more at: https://cleanlab.ai/research/ https://cleanlab.ai/blog/

Improvements to Datalab

Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
Detect label issues even without pred_probs from a ML model (you can instead just provide features).
Flag rare classes in imbalanced classification datasets.
Audit unlabeled datasets.

Other major improvements

50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
Out-of-Distribution detection based on pred_probs via the GEN algorithm which is particularly effective for datasets with tons of classes.
Many of the methods across the package to find label issues now support a low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our github or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

@gordon-lim made their first contribution in https://github.com/cleanlab/cleanlab/pull/746
@tataganesh made their first contribution in https://github.com/cleanlab/cleanlab/pull/751
@vdlad made their first contribution in https://github.com/cleanlab/cleanlab/pull/677
@axl1313 made their first contribution in https://github.com/cleanlab/cleanlab/pull/798
@coding-famer made their first contribution in https://github.com/cleanlab/cleanlab/pull/800

Change Log

New feature: Label error detection in regression datasets by @krmayankb in https://github.com/cleanlab/cleanlab/pull/572; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/830
New feature: ObjectLab for detecting mislabeled images in objection detection datasets by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/676, https://github.com/cleanlab/cleanlab/pull/739, https://github.com/cleanlab/cleanlab/pull/745, https://github.com/cleanlab/cleanlab/pull/770, https://github.com/cleanlab/cleanlab/pull/779, https://github.com/cleanlab/cleanlab/pull/807, https://github.com/cleanlab/cleanlab/pull/833; by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/750, https://github.com/cleanlab/cleanlab/pull/804
New feature: Label error detection in segmentation datasets by @vdlad in https://github.com/cleanlab/cleanlab/pull/677; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/754, https://github.com/cleanlab/cleanlab/pull/756, https://github.com/cleanlab/cleanlab/pull/759, https://github.com/cleanlab/cleanlab/pull/772; by @elisno in https://github.com/cleanlab/cleanlab/pull/775
New feature: CleanVision to detect low-quality images by @sanjanag in https://github.com/cleanlab/cleanlab/pull/679, https://github.com/cleanlab/cleanlab/pull/797
New image quickstart tutorial that uses Datalab by @sanjanag in https://github.com/cleanlab/cleanlab/pull/795
Datalab code refactoring by @elisno in https://github.com/cleanlab/cleanlab/pull/803, https://github.com/cleanlab/cleanlab/pull/783, https://github.com/cleanlab/cleanlab/pull/793, https://github.com/cleanlab/cleanlab/pull/729
Make labels optional in Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/730
Update near-duplicate sets in Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/781
Include non-IID detection in set of default Datalab issue types by @elisno in https://github.com/cleanlab/cleanlab/pull/723
Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/760
Add imbalance issue type to Datalab by @tataganesh in https://github.com/cleanlab/cleanlab/pull/758, https://github.com/cleanlab/cleanlab/pull/828
Catch specific exception for knn in Datalab issue managers by @tataganesh in https://github.com/cleanlab/cleanlab/pull/825
Make plots smaller for datalab tutorials by @tataganesh in https://github.com/cleanlab/cleanlab/pull/751
50x speedup and other improvements in multiannotator module by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/821, https://github.com/cleanlab/cleanlab/pull/784; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/827
ENH: make clipping unnecessary for entropy by @DerWeh in https://github.com/cleanlab/cleanlab/pull/703
Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/749
CleanLearning code improvements by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/724; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/744
Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/761
Expose low memory option for finding label issues by @tataganesh in https://github.com/cleanlab/cleanlab/pull/791, https://github.com/cleanlab/cleanlab/pull/822
Add GEN OOD-detection algorithm by @coding-famer in https://github.com/cleanlab/cleanlab/pull/800
Unify softmax implementations throughout package by @elisno in https://github.com/cleanlab/cleanlab/pull/826
Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in https://github.com/cleanlab/cleanlab/pull/746
Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/725; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/726, https://github.com/cleanlab/cleanlab/pull/734, https://github.com/cleanlab/cleanlab/pull/741, https://github.com/cleanlab/cleanlab/pull/743, https://github.com/cleanlab/cleanlab/pull/766, https://github.com/cleanlab/cleanlab/pull/832, https://github.com/cleanlab/cleanlab/pull/799, https://github.com/cleanlab/cleanlab/pull/752, https://github.com/cleanlab/cleanlab/pull/841, https://github.com/cleanlab/cleanlab/pull/816, https://github.com/cleanlab/cleanlab/pull/755, https://github.com/cleanlab/cleanlab/pull/731, https://github.com/cleanlab/cleanlab/pull/753, https://github.com/cleanlab/cleanlab/pull/845, https://github.com/cleanlab/cleanlab/pull/835, https://github.com/cleanlab/cleanlab/pull/847
CI and documentation system updates by @anishathalye in https://github.com/cleanlab/cleanlab/pull/742, https://github.com/cleanlab/cleanlab/pull/768, https://github.com/cleanlab/cleanlab/pull/769; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/837; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/788, https://github.com/cleanlab/cleanlab/pull/757, https://github.com/cleanlab/cleanlab/pull/738, https://github.com/cleanlab/cleanlab/pull/794; by @sanjanag in https://github.com/cleanlab/cleanlab/pull/843; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/777; by @elisno in https://github.com/cleanlab/cleanlab/pull/802; by @axl1313 in https://github.com/cleanlab/cleanlab/pull/798
Improved tests by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/778, https://github.com/cleanlab/cleanlab/pull/763

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.4.0...v2.5.0

v2.4.0

11 months ago

Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.

Introducing Datalab

Now we've added a unified platform called Datalab for you to apply many of these capabilities in a single line of code! To audit any classification dataset for issues, first use any trained ML model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc:

from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset

Follow our blog to better understand how this works internally, many articles will be published there shortly! A detailed description of each type of issue Datalab can detect is provided in this guide, but we recommend first starting with the tutorials which show you how easy it is to run on your own dataset.

Datalab can be used to do things like find label issues with string class labels (whereas the prior find_label_issues() method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab is also using these internally to detect data issues.

Our goal is for Datalab to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some Datalab APIs may change in subsequent package versions -- as noted in the documentation. You can easily run the issue checks in Datalab together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into Datalab. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.

Revamped Tutorials

We've updated some of our existing tutorials with more interesting datasets and ML models. Regarding the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created an analogous versions to detect issues in these same datasets with Datalab instead (see Datalab Tutorials). This should help existing users quickly ramp up on using Datalab to see how much more powerful this comprehensive data audit can be.

Improvements for Multi-label Classification

To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the cleanlab.multilabel_classification module. So please start there rather than specifying the multi_label=True flag in certain methods outside of this module, as that option will be deprecated in the future.

Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset module.

While moving methods to the cleanlab.multilabel_classification module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.

Backwards incompatible changes

Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v.2.3.1 code was probably producing bad results anyway based on some bugs that have been fixed). Here are changes you must make in your code for it to work with newer cleanlab versions:

cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True) → cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)

The multi_label=False/True argument will be removed in the future from the former method.

cleanlab.dataset.find_overlapping_classes(..., multi_label=True) → cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)

The multi_label=False/True argument will be removed in the future from the former method. The returned DataFrame is slightly different, please refer to the new method's documentation.

cleanlab.dataset.overall_label_health_score(...multi_label=True) → cleanlab.multilabel_classification.dataset.overall_label_health_score(...)

The multi_label=False/True argument will be removed in the future from the former method.

cleanlab.dataset.health_summary(...multi_label=True) → cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)

The multi_label=False/True argument will be removed in the future from the former method.

There are no other backwards incompatible changes in the package with this release.

Deprecated workflows

We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:

cleanlab.filter.find_label_issues(..., multi_label=True) → cleanlab.multilabel_classification.filter.find_label_issues(...)

The multi_label=False/True argument will be removed in the future from the former method.

from cleanlab.multilabel_classification import get_label_quality_scores → from cleanlab.multilabel_classification.rank import get_label_quality_scores

Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification module.

Change Log

readme updates by @jwmueller in https://github.com/cleanlab/cleanlab/pull/659, https://github.com/cleanlab/cleanlab/pull/660, https://github.com/cleanlab/cleanlab/pull/713
CI updates (by @sanjanag in https://github.com/cleanlab/cleanlab/pull/701; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/671; by @elisno in https://github.com/cleanlab/cleanlab/pull/695, https://github.com/cleanlab/cleanlab/pull/706)
Documentation updates (by @jwmueller in https://github.com/cleanlab/cleanlab/pull/669, https://github.com/cleanlab/cleanlab/pull/710, https://github.com/cleanlab/cleanlab/pull/711, https://github.com/cleanlab/cleanlab/pull/716, https://github.com/cleanlab/cleanlab/pull/719, https://github.com/cleanlab/cleanlab/pull/720; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/714, https://github.com/cleanlab/cleanlab/pull/717; by @elisno in https://github.com/cleanlab/cleanlab/pull/678, https://github.com/cleanlab/cleanlab/pull/684)
Documentation: use default rules for shorter, more readable links by @DerWeh in https://github.com/cleanlab/cleanlab/pull/700
Added installation instructions for package extras by @sanjanag in https://github.com/cleanlab/cleanlab/pull/697
Pass confident joint computed in CleanLearning to filter.find_label_issues by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/661
Add Example codeblock to the docstrings of important functions in the dataset module by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/662, https://github.com/cleanlab/cleanlab/pull/663, https://github.com/cleanlab/cleanlab/pull/668
Remove batch size check in label_issues_batched by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/665
adding multilabel dataset issue summaries by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/657
move int2onehot, onehot2int to top of multilabel tutorial by @jwmueller in https://github.com/cleanlab/cleanlab/pull/666
Update softmax to more stable variant by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/667
Revamp text and tabular tutorial by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/673, https://github.com/cleanlab/cleanlab/pull/693
allow for kwargs in token find_label_issues by @jwmueller in https://github.com/cleanlab/cleanlab/pull/686
Update numpy.typing import and annotations by @elisno in https://github.com/cleanlab/cleanlab/pull/688
Standardize documentation and simplify code for outliers by @DerWeh in https://github.com/cleanlab/cleanlab/pull/689
Extract function for computing OOD scores from distances by @elisno in https://github.com/cleanlab/cleanlab/pull/664
Introduce Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/614
Introduce NonIID issue type by @jecummin in https://github.com/cleanlab/cleanlab/pull/614
Further Datalab updates by @elisno in https://github.com/cleanlab/cleanlab/pull/680, https://github.com/cleanlab/cleanlab/pull/683, https://github.com/cleanlab/cleanlab/pull/687, https://github.com/cleanlab/cleanlab/pull/690, https://github.com/cleanlab/cleanlab/pull/691, https://github.com/cleanlab/cleanlab/pull/699, https://github.com/cleanlab/cleanlab/pull/705, https://github.com/cleanlab/cleanlab/pull/709, https://github.com/cleanlab/cleanlab/pull/712
Add descriptions of issues that Datalab can detect by @elisno in https://github.com/cleanlab/cleanlab/pull/682
Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by @jwmueller in https://github.com/cleanlab/cleanlab/pull/692
Improve NonIID issue checks by @elisno in https://github.com/cleanlab/cleanlab/pull/694, https://github.com/cleanlab/cleanlab/pull/707

New Contributors

@Steven-Yiran made their first contribution in https://github.com/cleanlab/cleanlab/pull/662
@DerWeh made their first contribution in https://github.com/cleanlab/cleanlab/pull/689
@jecummin made their first contribution in https://github.com/cleanlab/cleanlab/pull/614

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.3.1...v2.4.0

v2.3.1

1 year ago

This minor release primarily just improves the user experience when encountering various edge-cases in:

find_label_issues method
find_overlapping_issues method
cleanlab.multiannotator module

This release is non-breaking when upgrading from v2.3.0. Two noteworthy updates in the cleanlab.multiannotator module include a:

better tie-breaking algorithm inside of get_majority_vote_label() to avoid diminishing the frequency of rarer classes (this only plays a role when pred_probs are not provided).
better user-experience for get_active_learning_scores() to support scoring only unlabeled data or only labeled data. More of the arguments can now be None.

What's Changed

Readme updates by @jwmueller in https://github.com/cleanlab/cleanlab/pull/645, https://github.com/cleanlab/cleanlab/pull/650, https://github.com/cleanlab/cleanlab/pull/656
describe activelab in the documentation by @jwmueller in https://github.com/cleanlab/cleanlab/pull/648
Added clipping to address issue #639 by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/647
Fix for not specifying labels in find_overlapping_issues by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/652
Bug fixes + improvements to multiannotator module by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/654
FAQ question/answer on handling label errors in train vs test data by @jwmueller in https://github.com/cleanlab/cleanlab/pull/655

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.3.0...v2.3.1

v2.3.0

1 year ago

Cleanlab was originally open-sourced as code to accompany a research paper on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks such as: entity recognition, image/document tagging, data labeled by multiple annotators, etc. While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.

With the newest release, cleanlab v2.3 can now automatically:

As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc). We have user-friendly 5min tutorials to get started with any of the above objectives and easily improve your data!

We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in scientific papers. Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.

Highlights of what’s new in 2.3.0:

We have added new functionality for active learning and easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in cleanlab.experimental that have been moved).

Active Learning with ActiveLab

For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. ActiveLab is a new algorithm invented by our team that automatically answers the question: which new data should I label or which of my current labels should be checked again? ActiveLab is highly practical — it runs quickly and works with: any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).

Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).

from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
        multiannotator_labels, pred_probs, pred_probs_unlabeled
    )

The batch of examples with the lowest scores are those that are most informative to collect an additional label for (scores between labeled vs unlabeled pool are directly comparable). You can either have a new annotator label the batch of examples with lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for: standard active learning where you collect at most one label per example (no re-labeling), as well as active label cleaning (with no unlabeled pool) where you only want to re-label examples to ensure 100% correct consensus labels (with the least amount of re-labeling).

Get started running ActiveLab with our tutorial notebook from our repo that has many other examples.

KerasWrapper

We've introduced one-line wrappers for TensorFlow/Keras models that enable you to use TensorFlow models within scikit-learn workflows with features like Pipeline, GridSearch and more. Just change one line of code to make your existing Tensorflow/Keras model compatible with scikit-learn’s rich ecosystem! All you have to do is swap out: keras.Model → KerasWrapperModel, or keras.Sequential → KerasSequentialWrapper. Imported from cleanlab.models.keras, the wrapper objects have all the same methods of their keras counterparts, plus you can use them with tons of handy scikit-learn methods.

Resources to get started include:

Blogpost and Jupyter notebook demonstrating how to make a HuggingFace Transformer (BERT model) sklearn-compatible.
Jupyter notebook showing how to fit these sklearn-compatible models to a Tensorflow Dataset.
Revamped tutorial on label errors in text classification data, which has been updated to use this new wrapper.

Computational improvements for detecting label issues

Through extensive optimization of our multiprocessing code (thanks to @clu0), find_label_issues has been made ~10x faster on Linux machines that have many CPU cores.

For massive datasets, find_label_issues may require too much memory to run our your machine. We've added new methods in cleanlab.experimental.label_issues_batched that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:

from cleanlab.experimental.label_issues_batched import find_label_issues_batched

labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)

By choosing sufficiently small batch_size, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of the option: cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence") This and filter_by="low_normalized_margin" are new find_label_issues() options added in v2.3, which require less computation and still output accurate estimates of the label errors.

Other changes to be aware of

Like all major ML frameworks, we have dropped support for Python 3.6.
We have moved some particularly useful models (fasttext, keras) from cleanlab.experimental -> cleanlab.models.

Change Log

Shorten tutorial titles in docs for readability by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/553
Swap CI workflow to actions by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/560
Remove .pylintrc by @elisno in https://github.com/cleanlab/cleanlab/pull/564
Tutorial fixes by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/565
Fix typo in CONTRIBUTING.md by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/566
Multiannotator Active Learning Support by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/538
multiannotator explanation improvements by @jwmueller in https://github.com/cleanlab/cleanlab/pull/570
Specify Sphinx to order functions by source code order by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/571
Fix example in ema docstring by @elisno in https://github.com/cleanlab/cleanlab/pull/563, https://github.com/cleanlab/cleanlab/pull/573
update paper list and applications beyond label error detection in readme by @jwmueller in https://github.com/cleanlab/cleanlab/pull/574, https://github.com/cleanlab/cleanlab/pull/580
Drop Python 3.6 support (by @jwmueller in https://github.com/cleanlab/cleanlab/pull/558, https://github.com/cleanlab/cleanlab/pull/577; by @anishathalye in https://github.com/cleanlab/cleanlab/pull/562; by @krmayankb in https://github.com/cleanlab/cleanlab/pull/578; by @sanjanag in https://github.com/cleanlab/cleanlab/pull/579)
add maximum line length by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/583
Update github actions by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/589
Revamp text tutorial by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/584
clarify thresholding in issues_from_scores by @jwmueller in https://github.com/cleanlab/cleanlab/pull/582
Remove temp scaling from single annotator case by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/590
Update docs dependencies by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/593
Use euclidean distance for identifying outliers for lower dimensional features by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/581
changing copyright year 2017-2022 to 2017-2023 by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/594
Handle missing type parameters for generic type "ndarray" by @elisno in https://github.com/cleanlab/cleanlab/pull/587
Remove temp scaling for single-label case in ensemble method by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/597
Adding type hints for mypy strict compatibility by @unna97 in https://github.com/cleanlab/cleanlab/pull/585
fix typo in outliers.ipynb by @eltociear in https://github.com/cleanlab/cleanlab/pull/603
10x speedup in find_label_issues on linux via better multiprocessing by @clu0 in https://github.com/cleanlab/cleanlab/pull/596
Update tabular tutorial with better language by @cmauck10 in https://github.com/cleanlab/cleanlab/pull/609
Improve num_label_issues() to reflect most accurate num issues by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/610
Removed duplicate classifier from setup.py by @sanjanag in https://github.com/cleanlab/cleanlab/pull/612
Add two methods to filter.find_label_issues by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/595
Fix dictionary type annotation for OutOfDistribution object by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/616
Fix format compatibility with latest black==23. release by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/620
Create new cleanlab.models module by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/601
upgrade torch in docs by @jwmueller in https://github.com/cleanlab/cleanlab/pull/607
fix bug: confidences -> confidence by @jwmueller in https://github.com/cleanlab/cleanlab/pull/623
Fixed duplicate issue removal in find_label_issues by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/624
Method to estimate label issues with limited memory via mini-batches by @jwmueller in https://github.com/cleanlab/cleanlab/pull/615, https://github.com/cleanlab/cleanlab/pull/629, https://github.com/cleanlab/cleanlab/pull/632, https://github.com/cleanlab/cleanlab/pull/635
Fix KerasWrapper summary method by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/631
Clarify rank.py not for multi-label classification by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/626
Removed $ from shell commands to avoid it being copied by @sanjanag in https://github.com/cleanlab/cleanlab/pull/625
label_issues_batched multiprocessing by @clu0 in https://github.com/cleanlab/cleanlab/pull/630, https://github.com/cleanlab/cleanlab/pull/634
Switch to typing.Self by @anishathalye in https://github.com/cleanlab/cleanlab/pull/489
Documentation improvements by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/643
add 2.3.0 to release versions by @jwmueller in https://github.com/cleanlab/cleanlab/pull/644

New Contributors

@krmayankb made their first contribution in https://github.com/cleanlab/cleanlab/pull/578
@sanjanag made their first contribution in https://github.com/cleanlab/cleanlab/pull/579
@unna97 made their first contribution in https://github.com/cleanlab/cleanlab/pull/585
@eltociear made their first contribution in https://github.com/cleanlab/cleanlab/pull/603
@clu0 made their first contribution in https://github.com/cleanlab/cleanlab/pull/596

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.2.0...v2.3.0

v2.2.0

1 year ago

You asked, we listened! cleanlab v2.2.0 addresses two of the biggest pain points we often heard from our users:

Lack of clarity around how cleanlab works for multi-label datasets and how to best utilize it.
Not being usable for datasets with omitted classes (eg. rare classes dropped in a data split).

This release is non-breaking when upgrading from v2.1.0, but you will now get more accurate results (in all the datasets we tested) when finding label issues in multi-label classification datasets.

This release also adds a new satisfyingly accurate algorithms for finding label errors in multi-label data for improved multi-label classification tasks like text/image tagging.

Highlights of what’s new in 2.2.0:

Multi-label support for applications like image/document/text tagging

The newest version of cleanlab features a complete overhaul of cleanlab’s multi-label classification functionality:

We invented new algorithms for detecting label errors in multi-label datasets that are significantly more effective. These methods are formally described and extensively benchmarked in our research paper.
We added cleanlab.multilabel_classification module for label quality scoring.
We now offer an easy-to-follow quickstart tutorial for learning how to apply cleanlab to multi-label datasets.
We’ve created example notebooks on using cleanlab to clean up image tagging datasets, and how to train a state-of-the-art Pytorch neural network for multi-label classification with any image dataset.
All of this multi-label functionality is now robustly tested via a comprehensive suite of unit tests to ensure it remains performant.

cleanlab now works when your labels have some classes missing relative to your predicted probabilities

The package now works for datasets in which some classes happen to not be present (but are present say in the pred_probs output by a model). This is useful when you:

Want to use a pretrained model that was fit with additional classes
Have rare classes and happen to split the data in an unlucky way
Are doing active learning or other dynamic modeling with data that are iteratively changing
Are analyzing multi-annotator datasets with cleanlab.multiannotator and some annotators occasionally select a really rare class.

Other major improvements

(in addition to too many bugfixes to name):

Accuracy improvements to the algorithm used to estimate the number of label errors in a dataset via count.num_label_issues(). — @ulya-tkch
Introduction of flake8 code linter to ensure the highest standards for our code. — @ilnarkz, @mohitsaxenaknoldus
More comprehensive mypy type annotations for cleanlab functions to make our code safer and more understandable. — @elisno, @ChinoCodeDemon, @anishathalye, @jwmueller, @huiwengoh, @ulya-tkch

Special thanks to Po-He Tseng for helping with early tests of our improved multi-label algorithms and the research behind developing them.

Workflows of interest in cleanlab v2.2:

Finding label issues in multi-label classification is done using the same code and inputs as before (and the same object is returned as before):

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)

Where for a 3-class multi-label dataset with 4 examples, we might have say:

labels = [[0], [0, 1], [0, 2], [1]]

pred_probs = np.array(
    [[0.9, 0.1, 0.1],
     [0.9, 0.1, 0.8],
     [0.9, 0.1, 0.6],
     [0.2, 0.8, 0.3]]
)

The following code (in which class 1 is missing from the dataset) did not previously work but now runs without problem in cleanlab v2.2.0:

from cleanlab.filter import find_label_issues
import numpy as np

labels = [0, 0, 2, 0, 2]
pred_probs = np.array(
    [[0.8, 0.1, 0.1],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6],
     [0.5, 0.2, 0.3],
     [0.1, 0.1, 0.8]]
)

label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
)

Looking forward

The next major release of this package will introduce a paradigm shift in the way people check their datasets. Today this involves significant manual labor, but software should be able to help! Our research has developed algorithms that can automatically detect many types of common issues that plague real-world ML datasets. The next version of cleanlab will offer an easy-to-use line of code that runs all of our appropriate algorithms to help ensure a given dataset is issue-free and well-suited for supervised learning.

Change Log

updated label_quality_utils.py and rebuilt the doc by @ethanotran in https://github.com/cleanlab/cleanlab/pull/475
Add workflow for skipping notebooks by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/472
Fix return type in token classification get_label_quality_scores by @jwmueller in https://github.com/cleanlab/cleanlab/pull/477
Adding pylint CI checks by @mohitsaxenaknoldus in https://github.com/cleanlab/cleanlab/pull/465
CI: Build check cleanlab works without optional dependencies by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/470
Outlier tutorial: move uninteresting code to hidden cell by @jwmueller in https://github.com/cleanlab/cleanlab/pull/492
Update DEVELOPMENT.md with howto add new modules by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/494
Minor asthetic fix for tutorials.ipynb by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/493
Update init.py to include major files by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/490
Make type checking pass with mypy 0.981 by @anishathalye in https://github.com/cleanlab/cleanlab/pull/488
Update # issues returned by num_label_issues by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/485
Mypy typechecking fix for count.py by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/500
Add basic utilities for handling quality scores for multilabel data by @elisno in https://github.com/cleanlab/cleanlab/pull/499
reinvented algorithms for multilabel find_label_issues by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/483
Trying to fix typings by @ChinoCodeDemon in https://github.com/cleanlab/cleanlab/pull/502
Add internal function to properly format labels by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/504
Mention internal format label function in multiannotator docs by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/506
Multilabel code restructuring with aggregation/scorer functions by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/509
Separate word_coloring from token_replacement in color_sentence by @elisno in https://github.com/cleanlab/cleanlab/pull/514
Add support for missing classes by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/511
Better missing class support for label quality scoring by @jwmueller in https://github.com/cleanlab/cleanlab/pull/518
moving multilabel functions by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/515
restrict typecheck to python v3.10 by @jwmueller in https://github.com/cleanlab/cleanlab/pull/521
support missing classes in multiannotator functions by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/519
fix mypy typing by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/524
Add studio banner by @cmauck10 in https://github.com/cleanlab/cleanlab/pull/525
added missing classes test for multilabel by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/523
Improve tutorials language/formatting by @jwmueller in https://github.com/cleanlab/cleanlab/pull/526
Validate forgetting factor in EMA by @elisno in https://github.com/cleanlab/cleanlab/pull/527
Remove strong worded requirement for out-of-sample pred probs by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/520
Ensure type checks pass with new mypy v0.990 by @jwmueller in https://github.com/cleanlab/cleanlab/pull/530
replace pylint --> flake8 by @ilnarkz in https://github.com/cleanlab/cleanlab/pull/531
Tutorial for multi-label classification by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/517
Fix multilabel_py dimensionality by @elisno in https://github.com/cleanlab/cleanlab/pull/535
cleanlab install on colab for multilabel tutorial by @jwmueller in https://github.com/cleanlab/cleanlab/pull/537
Refactor MultilabelScorer helper methods and tests by @elisno in https://github.com/cleanlab/cleanlab/pull/540
Make a public method for multilabel quality scores by @jwmueller in https://github.com/cleanlab/cleanlab/pull/542
Improve and standardize documentation in label error detection methods for classification datasets by @jwmueller in https://github.com/cleanlab/cleanlab/pull/543
Move mypy configuration to config file by @anishathalye in https://github.com/cleanlab/cleanlab/pull/545
Fix types to work with latest pandas-stubs by @anishathalye in https://github.com/cleanlab/cleanlab/pull/546
Fix passing of kwargs to get_label_quality_scores by @anishathalye in https://github.com/cleanlab/cleanlab/pull/547
Switch CI cron schedule by @anishathalye in https://github.com/cleanlab/cleanlab/pull/548
Remove unnecessary type: ignore annotations by @anishathalye in https://github.com/cleanlab/cleanlab/pull/549
update readme for v2.2 by @jwmueller in https://github.com/cleanlab/cleanlab/pull/551

New Contributors

@ethanotran made their first contribution in https://github.com/cleanlab/cleanlab/pull/475
@mohitsaxenaknoldus made their first contribution in https://github.com/cleanlab/cleanlab/pull/465
@aditya1503 made their first contribution in https://github.com/cleanlab/cleanlab/pull/483
@ChinoCodeDemon made their first contribution in https://github.com/cleanlab/cleanlab/pull/502
@cmauck10 made their first contribution in https://github.com/cleanlab/cleanlab/pull/525
@ilnarkz made their first contribution in https://github.com/cleanlab/cleanlab/pull/531
Po-He Tseng helped run some early tests of our new multi-label algorithms

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.1.0...v2.2.0

v2.1.0

1 year ago

v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.

Highlights of what’s new in 2.1.0:

Major new functionalities:

CROWDLAB algorithms for analysis of data labeled by multiple annotators — @huiwengoh, @ulya-tkch, @jwmueller
- Accurately infer the best consensus label for each example
- Estimate the quality of each consensus label (how likely is it correct)
- Estimate the overall quality of each annotator (how trustworthy are their suggested labels)
Out of Distribution Detection based on either:
- feature values/embeddings — @ulya-tkch, @jwmueller, @JohnsonKuan
- predicted class probabilities — @ulya-tkch
Label error detection for Token Classification tasks (NLP / text data) — @ericwang1997, @elisno
CleanLearning can now:
- Run on non-array data types including: pandas Dataframe, pytorch/tensorflow Dataset objects, and many other types of data formats. — @jwmueller
- Allow base model’s fit() to utilize validation data in each fold during cross-validation (eg. for early-stopping or hyperparameter-optimization purposes). — @huiwengoh
- Train with custom sample weights for datapoints. — @rushic24, @jwmueller
- Utilize any Keras model (supporting both sequential or functional APIs) via cleanlab’s KerasWrapperModel , which makes these models compatible with sklearn and tensorflow Datasets. — @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

Reduced dependencies: scipy is no longer needed — @anishathalye
Clearer error/warning messages throughout package when data/inputs are strangely formatted — @cgnorthcutt, @jwmueller, @huiwengoh
FAQ section in tutorials with advice for commonly encountered issues — @huiwengoh, @ulya-tkch, @jwmueller, @cgnorthcutt
Many additional tutorial and example notebooks at: docs.cleanlab.ai and https://github.com/cleanlab/examples — @ulya-tkch, @huiwengoh, @jwmueller, @ericwang1997
Static type annotations to ensure robust code — @anishathalye, @elisno

Examples of new workflows available in 2.1:

Out of Distribution and Outlier Detection

Detect out of distribution examples in a dataset based on its numeric feature embeddings

from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)

Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier

from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)

Multi-annotator -- support data with multiple labels

For data labeled by multiple annotators (stored as matrix multiannotator_labels whose rows correspond to examples, columns to each annotator’s chosen labels), cleanlab v2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier

from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)

Support Token Classification tasks

Cleanlab v2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (eg. entity recognition). This relies on three inputs:

tokens: List of tokenized sentences whose ith element is a list of strings corresponding to tokens of the ith sentence in dataset. Example: [..., ["I", "love", "cleanlab"], ...]
labels: List whose ith element is a list of integers corresponding to class labels of each token in the ith sentence. Example: [..., [0, 0, 1], ...]
pred_probs: List whose ith element is a np.ndarray of shape (N_i, K) corresponding to predicted class probabilities for each token in the ith sentence (assuming this sentence contains N_i tokens and dataset has K possible classes). These should be out-of-sample pred_probs obtained from a token classification model via cross-validation. Example: [..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]

Using these, you can easily find and display mislabeled tokens in your data

from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)

Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.

CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch Datasets and use arbitrary Keras models:

import numpy as np
import tensorflow as tf
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays 
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data

Change Log

Fix edgecase divide-by-0 in entropy-score by @jwmueller in https://github.com/cleanlab/cleanlab/pull/241
Fix some typos. by @Yulv-git in https://github.com/cleanlab/cleanlab/pull/242
Updated project urls in setup.py by @calebchiam in https://github.com/cleanlab/cleanlab/pull/249
FeatureReq #33: Added custom sample_weight by @rushic24 in https://github.com/cleanlab/cleanlab/pull/248
Allow users to pass custom weights for ensemble label quality scoring by @JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/255
Fix line index of CleanLearning(), some text of links, etc. by @Yulv-git in https://github.com/cleanlab/cleanlab/pull/260
Copy the docs build artifacts to the "stable" folder by @weijinglok in https://github.com/cleanlab/cleanlab/pull/231
Add Negative Log Loss Weighting Scheme for Ensemble Label Quality Score by @JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/267
Developed class that allow the use of cleanlab with tensorflow and huggingface models by @MattiaSangermano in https://github.com/cleanlab/cleanlab/pull/247
Add KNN distance OOD scoring function and unit tests by @JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/268
Dataset documentation clarifications by @jwmueller in https://github.com/cleanlab/cleanlab/pull/270
Add issue templates by @anishathalye in https://github.com/cleanlab/cleanlab/pull/278
Fix bug. get thresholds broken for multi_label by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/264
Clarify labels format by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/282
Drop dependency on SciPy by @anishathalye in https://github.com/cleanlab/cleanlab/pull/286
Make CleanLearning work with pandas and other non-numpy feature objects X by @jwmueller in https://github.com/cleanlab/cleanlab/pull/285
Allow CleanLearning to use validation data in each fold by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/295
Created FAQ Page in the Cleanlab documentation by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/294
Proper validation of labels values/format across package by @jwmueller in https://github.com/cleanlab/cleanlab/pull/301
Add static type checking by @anishathalye in https://github.com/cleanlab/cleanlab/pull/306
error for missing classes, consistency on determining num_classes by @jwmueller in https://github.com/cleanlab/cleanlab/pull/308
Added support to build KNN graph for OOD detection with only training data by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/305
Standardize naming on K, num_classes and N, num_examples by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/312
Added outlier detection tutorial into docs by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/310
Updating tutorials hyperlink to 2.0.0 release by @aravindputrevu in https://github.com/cleanlab/cleanlab/pull/318
Allow KNN object to be returned by get_outlier_scores, Improved OOD tutorial by @jwmueller in https://github.com/cleanlab/cleanlab/pull/319
Some FAQ tips on how to improve CleanLearning by @jwmueller in https://github.com/cleanlab/cleanlab/pull/324
Updated tutorials to include quickstart by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/323
Add y argument as alternative to labels in CleanLearning.fit() by @elisno in https://github.com/cleanlab/cleanlab/pull/322
validation.py: Annotate function args and return values by @elisno in https://github.com/cleanlab/cleanlab/pull/317
Fixed package version issues for audio tutorial by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/325
Add compatibility for tensorflow and pytorch Dataset objects by @jwmueller in https://github.com/cleanlab/cleanlab/pull/311
Re-order find_label_issues args for better clarity by @jwmueller in https://github.com/cleanlab/cleanlab/pull/329
Comment on missing/rare classes in FAQ by @jwmueller in https://github.com/cleanlab/cleanlab/pull/332
update sphinx to v5 by @jwmueller in https://github.com/cleanlab/cleanlab/pull/327
Allow missing classes in get_label_quality_scores by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/334
Allow missing classes in assert_valid_class_labels by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/335
Changed all docstring instances of np.array to np.ndarray by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/336
Update Contributing.md with Projects link and getting started instructions by @jwmueller in https://github.com/cleanlab/cleanlab/pull/349
Switch docs links from latest release to stable by @elisno in https://github.com/cleanlab/cleanlab/pull/379
Extending cleanlab to find label errors in token classification datasets by @ericwang1997 in https://github.com/cleanlab/cleanlab/pull/347
Cleanlab functionality for multiannotator data by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/333
Cleanup token classification code by @elisno in https://github.com/cleanlab/cleanlab/pull/390
Fix typing for find_label_issues by @elisno in https://github.com/cleanlab/cleanlab/pull/391
Match token/s in color_sentence by @elisno in https://github.com/cleanlab/cleanlab/pull/397
Escape special regex characters by @elisno in https://github.com/cleanlab/cleanlab/pull/404
Add FAQ question on how to get predicted labels by @jwmueller in https://github.com/cleanlab/cleanlab/pull/402
Implementing get_ood_scores function by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/338
Add termcolor dependency by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/415
Add token classification tutorial notebook to docs.cleanlab.ai by @elisno in https://github.com/cleanlab/cleanlab/pull/411
Update examples links by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/421
Polish multiannotator docs by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/422
Text tutorial improvements by @jwmueller in https://github.com/cleanlab/cleanlab/pull/429
suppress tensorflow warning logs in tutorials if not properly installed by @jwmueller in https://github.com/cleanlab/cleanlab/pull/432
Add autodoc-typehints extension for sphinx by @elisno in https://github.com/cleanlab/cleanlab/pull/412
Strip input prompts when copying code snippets by @elisno in https://github.com/cleanlab/cleanlab/pull/439
Extend KerasWrapper to Functional API by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/434
Deploy documentation for token classification module by @elisno in https://github.com/cleanlab/cleanlab/pull/438
Updated labels to allow array_like by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/426
Add keras wrapper to docs by @jwmueller in https://github.com/cleanlab/cleanlab/pull/443
Format all return docstrings and add typing by @jwmueller in https://github.com/cleanlab/cleanlab/pull/437
make num_label_issues = cj calibrated offdiag sum by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/445
fix bug in hard-coded test. generalize the test by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/448
Change output of display_issues by @elisno in https://github.com/cleanlab/cleanlab/pull/450
More improvements to token classification code and documentation by @jwmueller in https://github.com/cleanlab/cleanlab/pull/452
Fix details disclosure elements in docs by @anishathalye in https://github.com/cleanlab/cleanlab/pull/456
Add missing backticks and language annotation by @anishathalye in https://github.com/cleanlab/cleanlab/pull/461
Error handling for rare classes in multiannotator data by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/455
Fix docs build in CI by @anishathalye in https://github.com/cleanlab/cleanlab/pull/462
Added support for returning ranked issue idxs by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/459
update readme for v2.1 by @jwmueller in https://github.com/cleanlab/cleanlab/pull/457
Clearer code examples on docs main page by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/430

New Contributors

@Yulv-git made their first contribution in https://github.com/cleanlab/cleanlab/pull/242
@rushic24 made their first contribution in https://github.com/cleanlab/cleanlab/pull/248
@MattiaSangermano made their first contribution in https://github.com/cleanlab/cleanlab/pull/247
@ulya-tkch made their first contribution in https://github.com/cleanlab/cleanlab/pull/293
@huiwengoh made their first contribution in https://github.com/cleanlab/cleanlab/pull/295
@aravindputrevu made their first contribution in https://github.com/cleanlab/cleanlab/pull/318
@elisno made their first contribution in https://github.com/cleanlab/cleanlab/pull/322
@ericwang1997 made their first contribution in https://github.com/cleanlab/cleanlab/pull/340

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0