The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
This release is non-breaking when upgrading from v2.6.2.
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.2...v2.6.3
This release is non-breaking when upgrading from v2.6.1.
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.1...v2.6.2
This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:
The cleanlab.regression module is improved to be more human-readable.
Improvements to Datalab.get_issues().
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.6.0...v2.6.1
This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements. However, this release drops support for Python 3.7 while adding support for Python 3.11.
In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:
Detect null values in your dataset.
Detect class_imbalance.
Detect underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others. See our FAQ for more information on how to provide pre-defined groups for this issue type.
Additionally, Datalab can now optionally:
Compute data_valuation scores.
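For example, a rough sketch of opting into the optional data_valuation check on top of the default checks (here dataset, feature_embeddings, and pred_probs are hypothetical placeholders for your own data and model outputs):
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="label")  # "label" is a hypothetical column name
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)  # runs the default checks listed above
# opt into the optional data_valuation check (passing issue_types restricts this call to the listed checks)
lab.find_issues(features=feature_embeddings, issue_types={"data_valuation": {}})
lab.report()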
If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!
With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the task
parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.
from cleanlab import Datalab
lab = Datalab(..., task="regression")
The tasks currently supported are:
classification (the default task): issue checks can be based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
regression: issue checks can be based on features or a knn_graph.
multilabel: label issue checks are based on pred_probs exclusively (explore the updated capabilities in our multilabel tutorial); other issue checks are based on features or a knn_graph.
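For instance, a rough sketch of auditing a regression dataset this way (dataset here is assumed to have a numeric label column named "y", with feature_embeddings as its numeric feature representation):
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="y", task="regression")  # "y" is a hypothetical numeric label column
lab.find_issues(features=feature_embeddings)  # issue checks for the regression task run from these features
lab.report()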
New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection. Learn how to leverage some of these functions in our object detection tutorial.
cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues() when given pred_probs as input.
Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:
Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.
Significant changes in this release include:
Added show_all_issues optional argument to Datalab.report() by @elisno in https://github.com/cleanlab/cleanlab/pull/970
For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.
This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental
that have been moved, especially utility methods related to Datalab).
Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:
Regression: see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
Object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
Semantic segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.
Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).
If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!
Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor; check out the list in the README or learn more at: https://cleanlab.ai/research/ and https://cleanlab.ai/blog/
Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.
This release introduces major improvements and new functionalities in Datalab that include the ability to:
Detect label issues without pred_probs from a ML model (you can instead just provide features).
Detect out-of-distribution examples based on pred_probs via the GEN algorithm, which is particularly effective for datasets with tons of classes.
Find label issues with a new low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our github or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:
New feature: Label error detection in regression datasets by @krmayankb in https://github.com/cleanlab/cleanlab/pull/572; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/830
New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/676, https://github.com/cleanlab/cleanlab/pull/739, https://github.com/cleanlab/cleanlab/pull/745, https://github.com/cleanlab/cleanlab/pull/770, https://github.com/cleanlab/cleanlab/pull/779, https://github.com/cleanlab/cleanlab/pull/807, https://github.com/cleanlab/cleanlab/pull/833; by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/750, https://github.com/cleanlab/cleanlab/pull/804
New feature: Label error detection in segmentation datasets by @vdlad in https://github.com/cleanlab/cleanlab/pull/677; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/754, https://github.com/cleanlab/cleanlab/pull/756, https://github.com/cleanlab/cleanlab/pull/759, https://github.com/cleanlab/cleanlab/pull/772; by @elisno in https://github.com/cleanlab/cleanlab/pull/775
New feature: CleanVision to detect low-quality images by @sanjanag in https://github.com/cleanlab/cleanlab/pull/679, https://github.com/cleanlab/cleanlab/pull/797
New image quickstart tutorial that uses Datalab by @sanjanag in https://github.com/cleanlab/cleanlab/pull/795
Datalab code refactoring by @elisno in https://github.com/cleanlab/cleanlab/pull/803, https://github.com/cleanlab/cleanlab/pull/783, https://github.com/cleanlab/cleanlab/pull/793, https://github.com/cleanlab/cleanlab/pull/729
Make labels optional in Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/730
Update near-duplicate sets in Datalab by @elisno in https://github.com/cleanlab/cleanlab/pull/781
Include non-IID detection in set of default Datalab issue types by @elisno in https://github.com/cleanlab/cleanlab/pull/723
Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/760
Add imbalance issue type to Datalab by @tataganesh in https://github.com/cleanlab/cleanlab/pull/758, https://github.com/cleanlab/cleanlab/pull/828
Catch specific exception for knn in Datalab issue managers by @tataganesh in https://github.com/cleanlab/cleanlab/pull/825
Make plots smaller for datalab tutorials by @tataganesh in https://github.com/cleanlab/cleanlab/pull/751
50x speedup and other improvements in multiannotator module by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/821, https://github.com/cleanlab/cleanlab/pull/784; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/827
ENH: make clipping unnecessary for entropy by @DerWeh in https://github.com/cleanlab/cleanlab/pull/703
Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/749
CleanLearning code improvements by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/724; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/744
Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/761
Expose low memory option for finding label issues by @tataganesh in https://github.com/cleanlab/cleanlab/pull/791, https://github.com/cleanlab/cleanlab/pull/822
Add GEN OOD-detection algorithm by @coding-famer in https://github.com/cleanlab/cleanlab/pull/800
Unify softmax implementations throughout package by @elisno in https://github.com/cleanlab/cleanlab/pull/826
Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in https://github.com/cleanlab/cleanlab/pull/746
Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/725; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/726, https://github.com/cleanlab/cleanlab/pull/734, https://github.com/cleanlab/cleanlab/pull/741, https://github.com/cleanlab/cleanlab/pull/743, https://github.com/cleanlab/cleanlab/pull/766, https://github.com/cleanlab/cleanlab/pull/832, https://github.com/cleanlab/cleanlab/pull/799, https://github.com/cleanlab/cleanlab/pull/752, https://github.com/cleanlab/cleanlab/pull/841, https://github.com/cleanlab/cleanlab/pull/816, https://github.com/cleanlab/cleanlab/pull/755, https://github.com/cleanlab/cleanlab/pull/731, https://github.com/cleanlab/cleanlab/pull/753, https://github.com/cleanlab/cleanlab/pull/845, https://github.com/cleanlab/cleanlab/pull/835, https://github.com/cleanlab/cleanlab/pull/847
CI and documentation system updates by @anishathalye in https://github.com/cleanlab/cleanlab/pull/742, https://github.com/cleanlab/cleanlab/pull/768, https://github.com/cleanlab/cleanlab/pull/769; by @jwmueller in https://github.com/cleanlab/cleanlab/pull/837; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/788, https://github.com/cleanlab/cleanlab/pull/757, https://github.com/cleanlab/cleanlab/pull/738, https://github.com/cleanlab/cleanlab/pull/794; by @sanjanag in https://github.com/cleanlab/cleanlab/pull/843; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/777; by @elisno in https://github.com/cleanlab/cleanlab/pull/802; by @axl1313 in https://github.com/cleanlab/cleanlab/pull/798
Improved tests by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/778, https://github.com/cleanlab/cleanlab/pull/763
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.4.0...v2.5.0
Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.
Now we've added a unified platform called Datalab
for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce pred_probs
(predicted class probabilities) and/or feature_embeddings
(numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc:
from cleanlab import Datalab
lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report() # summarize the issues found, how severe they are, and other useful info about the dataset
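After the report, you can also pull per-example results for any issue type; a rough sketch (assuming the label issue check ran above):
issues_df = lab.get_issues("label")  # DataFrame with an is_label_issue flag and label_score per example
worst = issues_df.sort_values("label_score").head(10)  # e.g., inspect the 10 most severe label issues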
Follow our blog to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue Datalab
can detect is provided in this guide, but we recommend first starting with the tutorials which show you how easy it is to run on your own dataset.
Datalab
can be used to do things like find label issues with string class labels (whereas the prior find_label_issues()
method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab
is also using these internally to detect data issues.
Our goal is for Datalab
to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some Datalab
APIs may change in subsequent package versions -- as noted in the documentation.
You can easily run the issue checks in Datalab
together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into Datalab
. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.
We've updated some of our existing tutorials with more interesting datasets and ML models. Regarding the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions to detect issues in these same datasets with Datalab
instead (see Datalab Tutorials
). This should help existing users quickly ramp up on using Datalab
to see how much more powerful this comprehensive data audit can be.
To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the cleanlab.multilabel_classification
module. So please start there rather than specifying the multi_label=True
flag in certain methods outside of this module, as that option will be deprecated in the future.
Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset
module.
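For example, a rough sketch of these summaries (assuming labels is a list of lists of class indices per example and pred_probs is an array of per-class probabilities from your classifier):
from cleanlab.multilabel_classification.dataset import (
    common_multilabel_issues,
    overall_label_health_score,
)

issues_per_class = common_multilabel_issues(labels=labels, pred_probs=pred_probs)  # summary of issues per class
health_score = overall_label_health_score(labels=labels, pred_probs=pred_probs)    # one score for the whole dataset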
While moving methods to the cleanlab.multilabel_classification
module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification
module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.
Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway based on some bugs that have been fixed). Here are changes you must make in your code for it to work with newer cleanlab versions:
cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)
The multi_label=False/True
argument will be removed in the future from the former method.
cleanlab.dataset.find_overlapping_classes(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)
The multi_label=False/True
argument will be removed in the future from the former method. The returned DataFrame is slightly different, please refer to the new method's documentation.
cleanlab.dataset.overall_label_health_score(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.overall_label_health_score(...)
The multi_label=False/True
argument will be removed in the future from the former method.
cleanlab.dataset.health_summary(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)
The multi_label=False/True
argument will be removed in the future from the former method.
There are no other backwards incompatible changes in the package with this release.
We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:
cleanlab.filter.find_label_issues(..., multi_label=True)
→
cleanlab.multilabel_classification.filter.find_label_issues(...)
The multi_label=False/True
argument will be removed in the future from the former method.
from cleanlab.multilabel_classification import get_label_quality_scores
→
from cleanlab.multilabel_classification.rank import get_label_quality_scores
Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification
module.
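For example, a rough sketch of the recommended new-style calls (labels and pred_probs formatted as in the multi-label example further below):
from cleanlab.multilabel_classification.filter import find_label_issues
from cleanlab.multilabel_classification.rank import get_label_quality_scores

ranked_issue_indices = find_label_issues(
    labels=labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
label_quality_scores = get_label_quality_scores(labels=labels, pred_probs=pred_probs)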
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.3.1...v2.4.0
This minor release primarily just improves the user experience when encountering various edge-cases in:
This release is non-breaking when upgrading from v2.3.0. Two noteworthy updates in the cleanlab.multiannotator
module include:
A change in get_majority_vote_label() to avoid diminishing the frequency of rarer classes (this only plays a role when pred_probs are not provided).
An update to get_active_learning_scores() to support scoring only unlabeled data or only labeled data. More of the arguments can now be None.
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.3.0...v2.3.1
Cleanlab was originally open-sourced as code to accompany a research paper on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks such as: entity recognition, image/document tagging, data labeled by multiple annotators, etc. While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.
With the newest release, cleanlab v2.3 can now automatically:
As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc). We have user-friendly 5min tutorials to get started with any of the above objectives and easily improve your data!
We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in scientific papers. Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.
We have added new functionality for active learning and easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in cleanlab.experimental
that have been moved).
For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. ActiveLab is a new algorithm invented by our team that automatically answers the question: which new data should I label or which of my current labels should be checked again? ActiveLab is highly practical — it runs quickly and works with: any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).
Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).
from cleanlab.multiannotator import get_active_learning_scores
scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
multiannotator_labels, pred_probs, pred_probs_unlabeled
)
The batch of examples with the lowest scores are those that are most informative to collect an additional label for (scores between labeled vs unlabeled pool are directly comparable). You can either have a new annotator label the batch of examples with lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for: standard active learning where you collect at most one label per example (no re-labeling), as well as active label cleaning (with no unlabeled pool) where you only want to re-label examples to ensure 100% correct consensus labels (with the least amount of re-labeling).
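For instance, a rough sketch of picking the next batch to (re)label from these scores (batch_size is a hypothetical labeling budget):
import numpy as np

batch_size = 100  # hypothetical number of (re)labelings you can collect this round
scores = np.concatenate([scores_labeled_pool, scores_unlabeled_pool])  # scores are comparable across both pools
to_label_next = np.argsort(scores)[:batch_size]  # indices of the most informative examples (labeled pool first, then unlabeled)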
Get started running ActiveLab with our tutorial notebook from our repo that has many other examples.
We've introduced one-line wrappers for TensorFlow/Keras models that enable you to use TensorFlow models within scikit-learn workflows with features like Pipeline
, GridSearch
and more. Just change one line of code to make your existing Tensorflow/Keras model compatible with scikit-learn’s rich ecosystem! All you have to do is swap out: keras.Model
→ KerasWrapperModel
, or keras.Sequential
→ KerasWrapperSequential
. Imported from cleanlab.models.keras
, the wrapper objects have all the same methods of their keras counterparts, plus you can use them with tons of handy scikit-learn methods.
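For example, a rough sketch of dropping a wrapped Keras model into a scikit-learn Pipeline (X, y, and num_classes here are hypothetical placeholders for your features, labels, and number of classes):
import tensorflow as tf
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from cleanlab.models.keras import KerasWrapperModel

def make_model(num_features, num_classes):
    # any Keras architecture works; this single dense layer is just a placeholder
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

model = KerasWrapperModel(make_model, model_kwargs={"num_features": X.shape[1], "num_classes": num_classes})
pipeline = Pipeline([("scale", StandardScaler()), ("net", model)])  # the wrapped model slots into sklearn workflows
pipeline.fit(X, y)
pred_probs = pipeline.predict_proba(X)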
Resources to get started include:
Through extensive optimization of our multiprocessing code (thanks to @clu0), find_label_issues
has been made ~10x faster on Linux machines that have many CPU cores.
For massive datasets, find_label_issues
may require too much memory to run on your machine. We've added new methods in cleanlab.experimental.label_issues_batched that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:
import zarr
from cleanlab.experimental.label_issues_batched import find_label_issues_batched
labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)
By choosing a sufficiently small batch_size
, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of the option: cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")
This and filter_by="low_normalized_margin"
are new find_label_issues()
options added in v2.3, which require less computation and still output accurate estimates of the label errors.
Certain methods have been moved from cleanlab.experimental -> cleanlab.models.
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.2.0...v2.3.0
You asked, we listened! cleanlab v2.2.0 addresses two of the biggest pain points we often heard from our users:
This release is non-breaking when upgrading from v2.1.0, but you will now get more accurate results (in all the datasets we tested) when finding label issues in multi-label classification datasets.
This release also adds new, satisfyingly accurate algorithms for finding label errors in multi-label classification data (tasks like text/image tagging).
The newest version of cleanlab features a complete overhaul of cleanlab’s multi-label classification functionality:
A dedicated cleanlab.multilabel_classification module for label quality scoring.
The package now works for datasets in which some classes happen to not be present (but are present say in the pred_probs output by a model). This is useful when you, for example, are using cleanlab.multiannotator and some annotators occasionally select a really rare class.
Other improvements (in addition to too many bugfixes to name):
Updates to count.num_label_issues() — @ulya-tkch
Special thanks to Po-He Tseng for helping with early tests of our improved multi-label algorithms and the research behind developing them.
Finding label issues in multi-label classification is done using the same code and inputs as before (and the same object is returned as before):
from cleanlab.filter import find_label_issues
ranked_label_issues = find_label_issues(
labels=labels,
pred_probs=pred_probs,
multi_label=True,
return_indices_ranked_by="self_confidence",
)
Where for a 3-class multi-label dataset with 4 examples, we might have say:
labels = [[0], [0, 1], [0, 2], [1]]
pred_probs = np.array(
[[0.9, 0.1, 0.1],
[0.9, 0.1, 0.8],
[0.9, 0.1, 0.6],
[0.2, 0.8, 0.3]]
)
The following code (in which class 1 is missing from the dataset) did not previously work but now runs without problem in cleanlab v2.2.0:
from cleanlab.filter import find_label_issues
import numpy as np
labels = [0, 0, 2, 0, 2]
pred_probs = np.array(
[[0.8, 0.1, 0.1],
[0.7, 0.1, 0.2],
[0.3, 0.1, 0.6],
[0.5, 0.2, 0.3],
[0.1, 0.1, 0.8]]
)
label_issues = find_label_issues(
labels=labels,
pred_probs=pred_probs,
)
The next major release of this package will introduce a paradigm shift in the way people check their datasets. Today this involves significant manual labor, but software should be able to help! Our research has developed algorithms that can automatically detect many types of common issues that plague real-world ML datasets. The next version of cleanlab will offer an easy-to-use line of code that runs all of our appropriate algorithms to help ensure a given dataset is issue-free and well-suited for supervised learning.
Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our github or you can jump into the discussions on Slack.
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.1.0...v2.2.0
v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.
Major new functionalities:
KerasWrapperModel, which makes Keras models compatible with sklearn and tensorflow Datasets. — @huiwengoh, @jwmueller
Major improvements (in addition to too many bugfixes to name):
scipy is no longer needed — @anishathalye
Detect out-of-distribution examples in a dataset based on its numeric feature embeddings:
from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)
# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
Detect out-of-distribution examples in a dataset based on predicted class probabilities from a trained classifier:
from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)
# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
For data labeled by multiple annotators (given a matrix multiannotator_labels whose rows correspond to examples, columns to each annotator's chosen labels), cleanlab v2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier:
from cleanlab.multiannotator import get_label_quality_multiannotator
get_label_quality_multiannotator(multiannotator_labels, pred_probs)
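A rough sketch of unpacking the returned results (the dictionary keys shown are assumptions based on the cleanlab.multiannotator documentation):
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs)
consensus_quality = results["label_quality"]   # consensus label and quality score for each example
annotator_stats = results["annotator_stats"]   # overall quality score for each annotator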
Cleanlab v2.1 can also detect label errors in token classification (text) data, given the following inputs:
tokens: List of tokenized sentences whose i-th element is a list of strings corresponding to tokens of the i-th sentence in the dataset. Example: [..., ["I", "love", "cleanlab"], ...]
labels: List whose i-th element is a list of integers corresponding to class labels of each token in the i-th sentence. Example: [..., [0, 0, 1], ...]
pred_probs: List whose i-th element is a np.ndarray of shape (N_i, K) corresponding to predicted class probabilities for each token in the i-th sentence (assuming this sentence contains N_i tokens and the dataset has K possible classes). These should be out-of-sample pred_probs obtained from a token classification model via cross-validation. Example: [..., np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]), ...]
Using these, you can easily find and display mislabeled tokens in your data:
from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues
issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)
CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch Datasets and use arbitrary Keras models:
import numpy as np
import tensorflow as tf
from cleanlab.experimental.keras import KerasWrapperModel
from cleanlab.classification import CleanLearning
dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array)) # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)
def make_model(num_features, num_classes):
inputs = tf.keras.Input(shape=(num_features,))
outputs = tf.keras.layers.Dense(num_classes)(inputs)
return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")
model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array) # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset) # equivalent to model.predict() after training on cleaner data
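A rough sketch of retrieving the label issues that CleanLearning identified during its internal cross-validation:
label_issues_df = cl.get_label_issues()  # per-example flags and quality scores for likely label errors found during fit()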
Improved handling of labels values/format across package by @jwmueller in https://github.com/cleanlab/cleanlab/pull/301
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0