Cleanlab Versions

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

v2.1.0

1 year ago

v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.

Highlights of what’s new in 2.1.0:

Major new functionalities:

  • CROWDLAB algorithms for analysis of data labeled by multiple annotators — @huiwengoh, @ulya-tkch, @jwmueller
    • Accurately infer the best consensus label for each example
    • Estimate the quality of each consensus label (how likely is it correct)
    • Estimate the overall quality of each annotator (how trustworthy are their suggested labels)
  • Out of Distribution Detection based on either:
    • feature values/embeddings — @ulya-tkch, @jwmueller, @JohnsonKuan
    • predicted class probabilities — @ulya-tkch
  • Label error detection for Token Classification tasks (NLP / text data) — @ericwang1997, @elisno
  • CleanLearning can now:
    • Run on non-array data types including pandas DataFrames, PyTorch/TensorFlow Dataset objects, and many other data formats. — @jwmueller
    • Allow the base model’s fit() to utilize validation data in each fold during cross-validation (e.g. for early-stopping or hyperparameter-optimization purposes). — @huiwengoh
    • Train with custom sample weights for data points. — @rushic24, @jwmueller
    • Utilize any Keras model (supporting both sequential and functional APIs) via cleanlab’s KerasWrapperModel, which makes these models compatible with sklearn and TensorFlow Datasets. — @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

  • Reduced dependencies: scipy is no longer needed — @anishathalye
  • Clearer error/warning messages throughout package when data/inputs are strangely formatted — @cgnorthcutt, @jwmueller, @huiwengoh
  • FAQ section in tutorials with advice for commonly encountered issues — @huiwengoh, @ulya-tkch, @jwmueller, @cgnorthcutt
  • Many additional tutorial and example notebooks at: docs.cleanlab.ai and https://github.com/cleanlab/examples — @ulya-tkch, @huiwengoh, @jwmueller, @ericwang1997
  • Static type annotations to ensure robust code — @anishathalye, @elisno

Examples of new workflows available in 2.1:

Out of Distribution and Outlier Detection

  1. Detect out of distribution examples in a dataset based on its numeric feature embeddings
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
  2. Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs) 

Multi-annotator -- support for data labeled by multiple annotators

  1. For data labeled by multiple annotators (stored as a matrix multiannotator_labels whose rows correspond to examples and columns to each annotator’s chosen labels), cleanlab v2.1 can find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier:
from cleanlab.multiannotator import get_label_quality_multiannotator

results = get_label_quality_multiannotator(multiannotator_labels, pred_probs)
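
The returned object bundles all three estimates listed above. A minimal sketch of inspecting it, assuming (per the v2.1 docs) the function returns a dict of pandas DataFrames keyed by "label_quality" and "annotator_stats":

# Keys and column contents below are assumptions based on the v2.1 documentation:
consensus = results["label_quality"]           # per-example consensus labels and their quality scores
annotator_stats = results["annotator_stats"]   # per-annotator overall quality estimates
print(consensus.head())
print(annotator_stats.head())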

Support Token Classification tasks

  1. Cleanlab v2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:
  • tokens: List of tokenized sentences whose ith element is a list of strings corresponding to tokens of the ith sentence in dataset. Example: [..., ["I", "love", "cleanlab"], ...]
  • labels: List whose ith element is a list of integers corresponding to class labels of each token in the ith sentence. Example: [..., [0, 0, 1], ...]
  • pred_probs: List whose ith element is a np.ndarray of shape (N_i, K) corresponding to predicted class probabilities for each token in the ith sentence (assuming this sentence contains N_i tokens and dataset has K possible classes). These should be out-of-sample pred_probs obtained from a token classification model via cross-validation. Example: [..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]

Using these, you can easily find and display mislabeled tokens in your data:

from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)

Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.

  1. CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch Datasets and use arbitrary Keras models:
import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays 
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0

v2.0.0

2 years ago

If you liked cleanlab v1.0.1, v2.0.0 will blow your mind! 💥🧠

cleanlab 2.0 adds powerful new workflows and algorithms for data-centric AI, dataset curation, auto-fixing label issues in data, learning with noisy labels, and more. Nearly every module, method, parameter, and docstring has been touched by this release.

If you're coming from 1.0, here's a migration guide.

A few highlights of new functionalities in cleanlab 2.0:

  1. rank every data point by label quality
  2. find label issues in any dataset
  3. train any classifier on any dataset with label issues
  4. find overlapping classes to merge and/or delete at the dataset level
  5. compute an overall dataset health score

For an in-depth overview of what cleanlab 2.0 can do, check out this tutorial.
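
To make these highlights concrete, here is a minimal sketch of the corresponding 2.0 calls (labels, pred_probs, X, and your_classifier are placeholders for your class labels, out-of-sample predicted class probabilities, features, and any sklearn-compatible model):

from cleanlab.rank import get_label_quality_scores
from cleanlab.filter import find_label_issues
from cleanlab.dataset import health_summary
from cleanlab.classification import CleanLearning

quality_scores = get_label_quality_scores(labels, pred_probs)  # 1. rank every data point by label quality
issue_mask = find_label_issues(labels, pred_probs)             # 2. find label issues (boolean mask)
cl = CleanLearning(your_classifier).fit(X, labels)             # 3. train any classifier despite label issues
health_summary(labels, pred_probs)                             # 4-5. overlapping classes + overall dataset health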

To help you get started with 2.0, we've added new tutorials and documentation.

Change Log

This list is non-exhaustive! Assume every aspect of the API has changed.

Module name changes or moves:

  • classification.LearningWithNoisyLabels class --> classification.CleanLearning class
  • pruning.py --> filter.py
  • latent_estimation.py --> count.py
  • cifar_cnn.py --> experimental/cifar_cnn.py
  • coteaching.py --> experimental/coteaching.py
  • fasttext.py --> experimental/fasttext.py
  • mnist_pytorch.py --> experimental/mnist_pytorch.py
  • noise_generation.py --> benchmarking/noise_generation.py
  • util.py --> internal/util.py
  • latent_algebra.py --> internal/latent_algebra.py
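
For example, the most visible rename only requires updating an import. A hypothetical before/after sketch (base_model stands in for any sklearn-compatible classifier):

# cleanlab 1.x
# from cleanlab.classification import LearningWithNoisyLabels
# clf = LearningWithNoisyLabels(clf=base_model)

# cleanlab 2.0
from cleanlab.classification import CleanLearning
clf = CleanLearning(clf=base_model)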

Module Deletions:

  • removed polyplex.py
  • removed models/ (moved its contents to experimental/)

New modules created:

  • rank.py
    • moved all ranking and ordering functions from pruning.py/filter.py to here
  • dataset.py
    • brand new module supporting methods for dealing with data-level issues
  • benchmarking/
    • Future benchmarking modules go here. Moved noise_generation.py here.

Method name changes:

  • pruning.get_noise_indices() --> filter.find_label_issues()
  • count.num_label_errors() --> count.num_label_issues()

Methods added:

  • rank.py adds
    • two ranking functions to rank data by label quality across the entire dataset (not just examples with label issues):
      • get_self_confidence_for_each_label()
      • get_normalized_margin_for_each_label()
  • filter.py adds
    • two more methods for filter.find_label_issues() (select one via the filter_by parameter):
      • confident_learning, which has been shown to work very well and may become the default in the future, and
      • predicted_neq_given, which is useful for benchmarking a simple baseline approach, but underperforms the other filter_by methods
  • classification.py adds
    • CleanLearning.get_label_issues()
      • canonical one-liner: CleanLearning().fit(X, y).get_label_issues()
      • no need to compute predicted probabilities in advance
    • CleanLearning.find_label_issues()
      • returns a DataFrame with label issues (instead of just a mask)
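
A minimal sketch of the new ranking functions on toy inputs (values are illustrative only):

import numpy as np
from cleanlab.rank import get_self_confidence_for_each_label, get_normalized_margin_for_each_label

labels = np.array([0, 1, 1])
pred_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])

# Lower scores indicate examples more likely to be mislabeled
# (here the third example, labeled 1 but predicted 0, scores lowest):
self_confidence = get_self_confidence_for_each_label(labels, pred_probs)
normalized_margin = get_normalized_margin_for_each_label(labels, pred_probs)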

Naming conventions changed in method names, comments, parameters, etc.

  • s --> labels
  • psx --> pred_probs
  • label_errors --> label_issues
  • noise_mask --> label_issues_mask
  • label_errors_bool --> label_issues_mask
  • prune_method --> filter_by
  • prob_given_label --> self_confidence
  • pruning --> filtering

Parameter re-ordering:

  • re-ordered (labels, pred_probs) parameters to be consistent (in that order) in all methods.
  • re-ordered parameters (e.g. frac_noise) in filter.find_label_issues()

Parameter changes:

  • in order_label_issues()
    • param: sorted_index_method --> rank_by
  • in find_label_issues()
    • param: sorted_index_method --> return_indices_ranked_by
    • param: prune_method --> filter_by
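
A hypothetical before/after call illustrating the renamed parameters (argument values are just examples):

# cleanlab 1.x
# ordered_issues = get_noise_indices(s, psx, sorted_index_method="normalized_margin", prune_method="prune_by_noise_rate")

# cleanlab 2.0
from cleanlab.filter import find_label_issues
ordered_issues = find_label_issues(labels, pred_probs, return_indices_ranked_by="normalized_margin", filter_by="prune_by_noise_rate")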

Global variables changed:

  • filter.py
    • MIN_NUM_PER_CLASS = 5 --> MIN_NUM_PER_CLASS = 1
    • now only 1 example must remain in each class, enabling cleanlab to work on toy-sized datasets

Dependencies added

  • pandas>=1.0.0

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v1.0.1...v2.0.0

v1.0.1

2 years ago
  • The primary purpose of this release is to preserve the functionality of cleanlab (all versions up to 1.0.1) in the new docs prior to the launch of cleanlab 2.0, which significantly changes the API.
  • Launched in preparation for cleanlab 2.0.
  • The changes are mostly superficial.

For users (+ sometimes developers):

  • Releases the new Sphinx docs for cleanlab 1.0 (in preparation for cleanlab 2.0)
  • Several superficial bug fixes (reduced error printing, fixed broken URLs, clarified links)
  • Extensive docs/README updates
  • Added support for conda installation
  • Moved to the AGPL-3 license
  • Added tutorials and a learning section for cleanlab

For developers:

  • Moved to GitHub Actions CI
  • Significantly shrunk the clone size from 100+ MB to a few MB

v1.0

3 years ago

The cleanlab community has grown over the years. Today, we are excited to release cleanlab 1.0 as the standard package for machine learning with noisy labels and finding errors in datasets.

If you're coming from the research side (e.g. the confident learning or label errors paper) -- use this version of cleanlab.

cleanlab 1.0

cleanlab 1.0 supports the most common versions of Python (2.7, 3.4, 3.5, 3.6, 3.7, 3.8) and operating systems (Linux, macOS, Windows). It works with any deep learning or machine learning library by operating on model outputs, regardless of where they come from. cleanlab also now has built-in support for new research from scientists outside our group at MIT (e.g. Co-Teaching).

More details about new features of cleanlab 1.0 below:

  • Added Amazon Reviews NLP to cleanlab/examples
  • cleanlab now supports Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8.
  • Users have run cleanlab with Python 3.9 (use at your own risk!)
  • Added more testing. All tests pass on Windows/Linux/macOS.
  • Updated to the GNU GPL-3+ license.
  • Added documentation: https://cleanlab.readthedocs.io/
  • The cleanlab "confident learning" paper is published in the Journal of AI Research: https://jair.org/index.php/jair/article/view/12125
  • Added funding, community and contributing guidelines
  • Fixed several errors in cleanlab/examples
  • cleanlab now supports Windows, macOS, Linux, and Unix systems
  • Many examples added to the README and docs
  • cleanlab now natively supports Co-Teaching for learning with noisy labels (requires Python 3 and PyTorch 1.4)
  • Built-in support for handwritten-digit datasets (besides MNIST)
  • Built-in support for the CIFAR dataset
  • Multiprocessing fixed for Windows systems
  • All core modules adhere to PEP-8 styling.
  • cleanlab is now installable via conda (besides pip).
  • Extensive benchmarking of cleanlab methods published.
  • Planned future features are listed in cleanlab/version.py
  • Added confidentlearning-reproduce as a separate repo to reproduce state-of-the-art results.

v0.1.0

4 years ago

Alpha release of cleanlab.