The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.
Major new functionalities:

- `KerasWrapperModel`, which makes arbitrary Keras models compatible with sklearn and tensorflow Datasets — @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

- `scipy` is no longer a required dependency — @anishathalye

To detect out-of-distribution examples based on a feature matrix:

from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)
# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
To detect out-of-distribution examples based on predicted class probabilities instead:

from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)
# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
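In both snippets, the predicted class probabilities should come from a trained classifier and, for training data, be out-of-sample. One common way to obtain such probabilities is cross-validation; here is a minimal sketch using scikit-learn on toy data (the variable names and dataset are illustrative, not part of cleanlab's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy feature matrix
labels = rng.integers(0, 3, size=100)  # toy class labels (3 classes)

# Out-of-sample predicted probabilities: each row is predicted by a model
# that never saw that example during training.
train_pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)
```

The resulting `(num_examples, num_classes)` array can then be passed as `pred_probs` above.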
For data labeled by multiple annotators (stored as a matrix `multiannotator_labels` whose rows correspond to examples, columns to each annotator's chosen labels), cleanlab v2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities `pred_probs` from any trained classifier:

from cleanlab.multiannotator import get_label_quality_multiannotator
get_label_quality_multiannotator(multiannotator_labels, pred_probs)
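As a concrete illustration of the expected input format (the annotator column names and label values below are made up): `multiannotator_labels` has one row per example and one column per annotator, with missing entries where an annotator skipped an example, and `pred_probs` covers the same examples.

```python
import numpy as np
import pandas as pd

# Rows = examples, columns = annotators; NaN marks examples an annotator skipped.
multiannotator_labels = pd.DataFrame({
    "annotator_1": [0, 1, np.nan, 2],
    "annotator_2": [0, np.nan, 1, 2],
    "annotator_3": [np.nan, 1, 1, 0],
})

# pred_probs: out-of-sample predicted class probabilities for the same 4 examples
# from any trained classifier (3 classes here).
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.1, 0.7],
])
assert len(multiannotator_labels) == pred_probs.shape[0]
```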
cleanlab v2.1 also supports label error detection in token classification (text) data, given:

- `tokens`: List of tokenized sentences whose i-th element is a list of strings corresponding to the tokens of the i-th sentence in the dataset.
  Example: `[..., ["I", "love", "cleanlab"], ...]`
- `labels`: List whose i-th element is a list of integers corresponding to the class label of each token in the i-th sentence.
  Example: `[..., [0, 0, 1], ...]`
- `pred_probs`: List whose i-th element is a np.ndarray of shape `(N_i, K)` corresponding to the predicted class probabilities of each token in the i-th sentence (assuming this sentence contains `N_i` tokens and the dataset has `K` possible classes). These should be out-of-sample `pred_probs` obtained from a token classification model via cross-validation.
  Example: `[..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]`
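A tiny concrete instance of these three inputs together (all values invented for illustration):

```python
import numpy as np

tokens = [["I", "love", "cleanlab"], ["A", "second", "sentence", "here"]]
labels = [[0, 0, 1], [0, 1, 0, 0]]  # one integer class label per token

# One (N_i, K) array of predicted class probabilities per sentence (K=2 classes).
pred_probs = [
    np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]),
    np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]),
]

# Per sentence, all three inputs must have one entry per token.
for t, l, p in zip(tokens, labels, pred_probs):
    assert len(t) == len(l) == p.shape[0]
```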
Using these, you can easily find and display mislabeled tokens in your data:
from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues
issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
class_names=optional_list_of_ordered_class_names)
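To build intuition for what gets flagged: a token is suspicious when its given label receives low predicted probability. The simplified stand-in below (not cleanlab's actual algorithm, which uses confident-learning thresholds rather than a fixed cutoff) flags tokens whose given label's probability falls below 0.5:

```python
import numpy as np

labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7]]),
    np.array([[0.9, 0.1], [0.6, 0.4]]),
]

def naive_token_issues(labels, pred_probs, threshold=0.5):
    """Return (sentence_index, token_index) pairs where the given label's
    predicted probability (its self-confidence) falls below threshold."""
    issues = []
    for i, (sent_labels, probs) in enumerate(zip(labels, pred_probs)):
        for j, label in enumerate(sent_labels):
            if probs[j, label] < threshold:
                issues.append((i, j))
    return issues

print(naive_token_issues(labels, pred_probs))  # [(0, 1), (1, 1)]
```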
`CleanLearning` can now operate directly on non-array dataset formats like tensorflow/pytorch `Datasets` and use arbitrary Keras models:

import numpy as np
import tensorflow as tf
from cleanlab.experimental.keras import KerasWrapperModel
from cleanlab.classification import CleanLearning
dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array)) # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)
def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")
model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array) # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset) # equivalent to model.predict() after training on cleaner data
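Conceptually, `CleanLearning.fit()` combines the pieces above: obtain out-of-sample `pred_probs` via cross-validation, flag likely label issues, then retrain on the remaining data. A simplified sketch of that loop in plain scikit-learn (illustrative only — the thresholding here is a crude stand-in for cleanlab's confident-learning filtering):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)                       # clean labels
y_noisy = y.copy()
flip = rng.choice(len(y), size=20, replace=False)   # inject 10% label noise
y_noisy[flip] = 1 - y_noisy[flip]

# 1. Out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)
# 2. Flag examples whose given label has low predicted probability.
self_confidence = pred_probs[np.arange(len(y_noisy)), y_noisy]
keep = self_confidence > 0.5
# 3. Retrain on the (hopefully cleaner) remaining data.
model = LogisticRegression(max_iter=1000).fit(X[keep], y_noisy[keep])
```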
- `labels` values/format made consistent across the package by @jwmueller in https://github.com/cleanlab/cleanlab/pull/301
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0
cleanlab 2.0 adds powerful new workflows and algorithms for data-centric AI, dataset curation, auto-fixing label issues in data, learning with noisy labels, and more. Nearly every module, method, parameter, and docstring has been touched by this release.
If you're coming from 1.0, here's a migration guide.
For an in-depth overview of what cleanlab 2.0 can do, check out this tutorial.
This list is non-exhaustive! Assume every aspect of the API has changed.
Module name changes:

- `classification.LearningWithNoisyLabels` class --> `classification.CleanLearning` class
- `pruning.py` --> `filter.py`
- `latent_estimation.py` --> `count.py`
- `cifar_cnn.py` --> `experimental/cifar_cnn.py`
- `coteaching.py` --> `experimental/coteaching.py`
- `fasttext.py` --> `experimental/fasttext.py`
- `mnist_pytorch.py` --> `experimental/fmnist_pytorch.py`
- `noise_generation.py` --> `benchmarking/noise_generation.py`
- `util.py` --> `internal/util.py`
- `latent_algebra.py` --> `internal/latent_algebra.py`
Removed:

- `polyplex.py`

New modules:

- `rank.py` (ranking/ordering functions from `pruning.py`/`filter.py` moved to here)
- `dataset.py`
- `benchmarking.py` (`noise_generation.py` moved here)

Method renames:

- `pruning.get_noise_indices()` --> `filter.find_label_issues()`
- `count.num_label_errors()` --> `count.num_label_issues()`
New functionality:

- `rank.py` adds:
  - `get_self_confidence_for_each_label()`
  - `get_normalized_margin_for_each_label()`
- `filter.py` adds `filter.find_label_issues()` (select a method using the `filter_by` parameter). New options include `confident_learning`, which has been shown to work very well and may become the default in the future, and `predicted_neq_given`, which is useful for benchmarking a simple baseline approach but underperforms relative to the other `filter_by` methods.
- `classification.py` adds:
  - `CleanLearning.get_label_issues()` (e.g. `CleanLearning().fit(X, y).get_label_issues()`)
  - `CleanLearning.find_label_issues()`
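The two new label-quality scores in `rank.py` can be understood in a few lines of numpy. This is a sketch of the underlying formulas, which may differ in details from cleanlab's exact implementation:

```python
import numpy as np

pred_probs = np.array([
    [0.9, 0.1],   # example 0: given label looks correct
    [0.4, 0.6],   # example 1: given label looks doubtful
])
labels = np.array([0, 0])

# Self-confidence: predicted probability of the given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Normalized margin: self-confidence minus the largest other-class probability,
# rescaled from [-1, 1] to [0, 1]. Lower scores indicate likelier label issues.
masked = pred_probs.copy()
masked[np.arange(len(labels)), labels] = -np.inf
margin = self_confidence - masked.max(axis=1)
normalized_margin = (margin + 1) / 2
```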
Parameter renames:

- `s` --> `labels`
- `psx` --> `pred_probs`
- `label_errors` --> `label_issues`
- `noise_mask` --> `label_issues_mask`
- `label_errors_bool` --> `label_issues_mask`
- `prune_method` --> `filter_by`
- `prob_given_label` --> `self_confidence`
- `pruning` --> `filtering`
- Changed (`labels`, `pred_probs`) parameters to be consistent (in that order) in all methods.
- `frac_noise` in `filter.find_label_issues()`
- In `order_label_issues()`: `sorted_index_method` --> `rank_by`
- In `find_label_issues()`: `sorted_index_method` --> `return_indices_ranked_by`, `prune_method` --> `filter_by`
- In `filter.py`: `MIN_NUM_PER_CLASS = 5` --> `MIN_NUM_PER_CLASS = 1`
Full Changelog: https://github.com/cleanlab/cleanlab/compare/v1.0.1...v2.0.0
The cleanlab community has grown over the years. Today, we are excited to release cleanlab 1.0 as the standard package for machine learning with noisy labels and finding errors in datasets.
If you're coming from the research side (e.g. the confident learning or label errors papers) -- use this version of cleanlab.
cleanlab 1.0 supports the most common versions of python (2.7, 3.4, 3.5, 3.6, 3.7, 3.8) and operating systems (linux, macOS, Windows). It works with any deep learning or machine learning library by operating directly on model outputs, regardless of where they come from. cleanlab also now has built-in support for new research from scientists outside of our group at MIT (e.g. Co-Teaching).
Alpha release of cleanlab.