Cleanlab Versions

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

v2.1.0

1 year ago

v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.

Highlights of what’s new in 2.1.0:

Major new functionalities:

  • CROWDLAB algorithms for analysis of data labeled by multiple annotators — @huiwengoh, @ulya-tkch, @jwmueller
    • Accurately infer the best consensus label for each example
    • Estimate the quality of each consensus label (how likely is it correct)
    • Estimate the overall quality of each annotator (how trustworthy are their suggested labels)
  • Out of Distribution Detection based on either:
    • feature values/embeddings — @ulya-tkch, @jwmueller, @JohnsonKuan
    • predicted class probabilities — @ulya-tkch
  • Label error detection for Token Classification tasks (NLP / text data) — @ericwang1997, @elisno
  • CleanLearning can now:
    • Run on non-array data types including pandas DataFrames, PyTorch/TensorFlow Dataset objects, and many other data formats. — @jwmueller
    • Allow the base model’s fit() to utilize validation data in each fold during cross-validation (e.g. for early-stopping or hyperparameter-optimization purposes). — @huiwengoh
    • Train with custom sample weights for data points. — @rushic24, @jwmueller
    • Utilize any Keras model (supporting both sequential and functional APIs) via cleanlab’s KerasWrapperModel, which makes these models compatible with sklearn and TensorFlow Datasets. — @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

  • Reduced dependencies: scipy is no longer needed — @anishathalye
  • Clearer error/warning messages throughout package when data/inputs are strangely formatted — @cgnorthcutt, @jwmueller, @huiwengoh
  • FAQ section in tutorials with advice for commonly encountered issues — @huiwengoh, @ulya-tkch, @jwmueller, @cgnorthcutt
  • Many additional tutorial and example notebooks at: docs.cleanlab.ai and https://github.com/cleanlab/examples — @ulya-tkch, @huiwengoh, @jwmueller, @ericwang1997
  • Static type annotations to ensure robust code — @anishathalye, @elisno

Examples of new workflows available in 2.1:

Out of Distribution and Outlier Detection

  1. Detect out of distribution examples in a dataset based on its numeric feature embeddings
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
  2. Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs) 

Multi-annotator -- support for data labeled by multiple annotators

  1. For data labeled by multiple annotators (stored as a matrix multiannotator_labels whose rows correspond to examples and columns to each annotator’s chosen labels), cleanlab v2.1 can find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier:
from cleanlab.multiannotator import get_label_quality_multiannotator

results = get_label_quality_multiannotator(multiannotator_labels, pred_probs)
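
The returned object bundles all three estimates listed above. A minimal sketch of inspecting it, assuming (per the v2.1 docs) the function returns a dict of pandas DataFrames keyed by "label_quality" and "annotator_stats":

# Keys and column contents below are assumptions based on the v2.1 documentation:
consensus = results["label_quality"]           # per-example consensus labels and their quality scores
annotator_stats = results["annotator_stats"]   # per-annotator overall quality estimates
print(consensus.head())
print(annotator_stats.head())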

Support Token Classification tasks

  1. Cleanlab v2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:
  • tokens: List of tokenized sentences whose ith element is a list of strings corresponding to tokens of the ith sentence in dataset. Example: [..., ["I", "love", "cleanlab"], ...]
  • labels: List whose ith element is a list of integers corresponding to class labels of each token in the ith sentence. Example: [..., [0, 0, 1], ...]
  • pred_probs: List whose ith element is a np.ndarray of shape (N_i, K) corresponding to predicted class probabilities for each token in the ith sentence (assuming this sentence contains N_i tokens and dataset has K possible classes). These should be out-of-sample pred_probs obtained from a token classification model via cross-validation. Example: [..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]

Using these, you can easily find and display mislabeled tokens in your data:

from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)

Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.

  1. CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch Datasets and use arbitrary Keras models:
import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays 
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0

v2.0.0

2 years ago

If you liked cleanlab v1.0.1, v2.0.0 will blow your mind! 💥🧠

cleanlab 2.0 adds powerful new workflows and algorithms for data-centric AI, dataset curation, auto-fixing label issues in data, learning with noisy labels, and more. Nearly every module, method, parameter, and docstring has been touched by this release.

If you're coming from 1.0, here's a migration guide.

A few highlights of new functionalities in cleanlab 2.0:

  1. rank every data point by label quality
  2. find label issues in any dataset
  3. train any classifier on any dataset with label issues
  4. find overlapping classes to merge and/or delete at the dataset level
  5. compute an overall dataset health score

For an in-depth overview of what cleanlab 2.0 can do, check out this tutorial.
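
To make these highlights concrete, here is a minimal sketch of the corresponding 2.0 calls (labels, pred_probs, X, and your_classifier are placeholders for your class labels, out-of-sample predicted class probabilities, features, and any sklearn-compatible model):

from cleanlab.rank import get_label_quality_scores
from cleanlab.filter import find_label_issues
from cleanlab.dataset import health_summary
from cleanlab.classification import CleanLearning

quality_scores = get_label_quality_scores(labels, pred_probs)  # 1. rank every data point by label quality
issue_mask = find_label_issues(labels, pred_probs)             # 2. find label issues (boolean mask)
cl = CleanLearning(your_classifier).fit(X, labels)             # 3. train any classifier despite label issues
health_summary(labels, pred_probs)                             # 4-5. overlapping classes + overall dataset health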

To help you get started with 2.0, we've added new tutorials and documentation.

Change Log

This list is non-exhaustive! Assume every aspect of the API has changed.

Module name changes or moves:

  • classification.LearningWithNoisyLabels class --> classification.CleanLearning class
  • pruning.py --> filter.py
  • latent_estimation.py --> count.py
  • cifar_cnn.py --> experimental/cifar_cnn.py
  • coteaching.py --> experimental/coteaching.py
  • fasttext.py --> experimental/fasttext.py
  • mnist_pytorch.py --> experimental/mnist_pytorch.py
  • noise_generation.py --> benchmarking/noise_generation.py
  • util.py --> internal/util.py
  • latent_algebra.py --> internal/latent_algebra.py
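
For example, the most visible rename only requires updating an import. A hypothetical before/after sketch (base_model stands in for any sklearn-compatible classifier):

# cleanlab 1.x
# from cleanlab.classification import LearningWithNoisyLabels
# clf = LearningWithNoisyLabels(clf=base_model)

# cleanlab 2.0
from cleanlab.classification import CleanLearning
clf = CleanLearning(clf=base_model)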

Module Deletions:

  • removed polyplex.py
  • removed models/ (moved its contents to experimental/)

New modules created:

  • rank.py
    • moved all ranking and ordering functions from pruning.py/filter.py to here
  • dataset.py
    • brand new module supporting methods for dealing with data-level issues
  • benchmarking/
    • Future benchmarking modules go here. Moved noise_generation.py here.

Method name changes:

  • pruning.get_noise_indices() --> filter.find_label_issues()
  • count.num_label_errors() --> count.num_label_issues()

Methods added:

  • rank.py adds
    • two ranking functions to rank data by label quality across the entire dataset (not just examples with label issues):
      • get_self_confidence_for_each_label()
      • get_normalized_margin_for_each_label()
  • filter.py adds
    • two more methods for filter.find_label_issues() (select one via the filter_by parameter):
      • confident_learning, which has been shown to work very well and may become the default in the future, and
      • predicted_neq_given, which is useful for benchmarking a simple baseline approach, but underperforms the other filter_by methods
  • classification.py adds
    • CleanLearning.get_label_issues()
      • canonical one-liner: CleanLearning().fit(X, y).get_label_issues()
      • no need to compute predicted probabilities in advance
    • CleanLearning.find_label_issues()
      • returns a DataFrame with label issues (instead of just a mask)
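
A minimal sketch of the new ranking functions on toy inputs (values are illustrative only):

import numpy as np
from cleanlab.rank import get_self_confidence_for_each_label, get_normalized_margin_for_each_label

labels = np.array([0, 1, 1])
pred_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])

# Lower scores indicate examples more likely to be mislabeled
# (here the third example, labeled 1 but predicted 0, scores lowest):
self_confidence = get_self_confidence_for_each_label(labels, pred_probs)
normalized_margin = get_normalized_margin_for_each_label(labels, pred_probs)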

Naming conventions changed in method names, comments, parameters, etc.

  • s --> labels
  • psx --> pred_probs
  • label_errors --> label_issues
  • noise_mask --> label_issues_mask
  • label_errors_bool --> label_issues_mask
  • prune_method --> filter_by
  • prob_given_label --> self_confidence
  • pruning --> filtering

Parameter re-ordering:

  • re-ordered (labels, pred_probs) parameters to be consistent (in that order) in all methods.
  • re-ordered parameters (e.g. frac_noise) in filter.find_label_issues()

Parameter changes:

  • in order_label_issues()
    • param: sorted_index_method --> rank_by
  • in find_label_issues()
    • param: sorted_index_method --> return_indices_ranked_by
    • param: prune_method --> filter_by
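
A hypothetical before/after call illustrating the renamed parameters (argument values are just examples):

# cleanlab 1.x
# ordered_issues = get_noise_indices(s, psx, sorted_index_method="normalized_margin", prune_method="prune_by_noise_rate")

# cleanlab 2.0
from cleanlab.filter import find_label_issues
ordered_issues = find_label_issues(labels, pred_probs, return_indices_ranked_by="normalized_margin", filter_by="prune_by_noise_rate")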

Global variables changed:

  • filter.py
    • MIN_NUM_PER_CLASS = 5 --> MIN_NUM_PER_CLASS = 1
    • now only 1 example must remain in each class, enabling cleanlab to work on toy-sized datasets

Dependencies added

  • pandas>=1.0.0

Full Changelog: https://github.com/cleanlab/cleanlab/compare/v1.0.1...v2.0.0

v1.0.1

2 years ago
  • The primary purpose of this release is to preserve the functionality of cleanlab (all versions up to 1.0.1) in the new docs prior to the launch of cleanlab 2.0, which significantly changes the API.
  • Launched in preparation for cleanlab 2.0.
  • The changes are mostly superficial.

For users (+ sometimes developers):

  • Releases the new Sphinx docs for cleanlab 1.0 (in preparation for cleanlab 2.0)
  • Several superficial bug fixes (reduced error printing, fixed broken URLs, clarified links)
  • Extensive docs/README updates
  • Added support for conda installation
  • Moved to the AGPL-3 license
  • Added tutorials and a learning section for cleanlab

For developers:

  • Moved to GitHub Actions CI
  • Significantly shrunk the clone size from 100+ MB to a few MB

v1.0

3 years ago

The cleanlab community has grown over the years. Today, we are excited to release cleanlab 1.0 as the standard package for machine learning with noisy labels and finding errors in datasets.

If you're coming from the research side (e.g. the confident learning or label errors paper) -- use this version of cleanlab.

cleanlab 1.0

cleanlab 1.0 supports the most common versions of Python (2.7, 3.4, 3.5, 3.6, 3.7, 3.8) and operating systems (Linux, macOS, Windows). It works with any deep learning or machine learning library by operating on model outputs, regardless of where they come from. cleanlab also now has built-in support for new research from scientists outside our group at MIT (e.g. Co-Teaching).

More details about new features of cleanlab 1.0 below:

  • Added Amazon Reviews NLP to cleanlab/examples
  • cleanlab now supports Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8.
  • Users have run cleanlab with Python 3.9 (use at your own risk!)
  • Added more testing. All tests pass on Windows/Linux/macOS.
  • Updated to the GNU GPL-3+ license.
  • Added documentation: https://cleanlab.readthedocs.io/
  • The cleanlab "confident learning" paper is published in the Journal of AI Research: https://jair.org/index.php/jair/article/view/12125
  • Added funding, community and contributing guidelines
  • Fixed several errors in cleanlab/examples
  • cleanlab now supports Windows, macOS, Linux, and Unix systems
  • Many examples added to the README and docs
  • cleanlab now natively supports Co-Teaching for learning with noisy labels (requires Python 3 and PyTorch 1.4)
  • Built-in support for handwritten-digit datasets (besides MNIST)
  • Built-in support for the CIFAR dataset
  • Multiprocessing fixed for Windows systems
  • All core modules adhere to PEP-8 styling.
  • cleanlab is now installable via conda (besides pip).
  • Extensive benchmarking of cleanlab methods published.
  • Planned future features are listed in cleanlab/version.py
  • Added confidentlearning-reproduce as a separate repo to reproduce state-of-the-art results.

v0.1.0

4 years ago

Alpha release of cleanlab.