Linfa Versions

A Rust machine learning framework.

0.6.0

1 year ago

Linfa's 0.6.0 release removes the mandatory dependency on external BLAS libraries (such as intel-mkl) by using a pure-Rust linear algebra library. It also adds the Multinomial Naive Bayes and Follow The Regularized Leader algorithms. Additionally, the AsTargets trait has been separated into AsSingleTargets and AsMultiTargets.

No more BLAS

With older versions of Linfa, algorithm crates that used advanced linear algebra routines needed to be linked against an external BLAS library such as Intel-MKL. This was done by adding feature flags like linfa/intel-mkl-static to the build, and it increased compile times significantly. Version 0.6.0 replaces the BLAS library with a pure-Rust implementation of all the required routines, which Linfa uses by default. This means all Linfa crates now build properly and quickly without any extra feature flags. It is still possible for the affected algorithm crates to link against an external BLAS library. Doing so requires enabling the crate's blas feature, along with the feature flag for the external BLAS library. The affected crates are as follows:

  • linfa-ica
  • linfa-reduction
  • linfa-clustering
  • linfa-preprocessing
  • linfa-pls
  • linfa-linear
  • linfa-elasticnet

New algorithms

Multinomial Naive Bayes is a member of the Naive Bayes family of classifiers, which assume independence between features given the class. Its advantage is a linear fitting time, with maximum-likelihood training in closed form. The algorithm has been added to linfa-bayes and an example can be found at linfa-bayes/examples/winequality_multinomial.rs.
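
As a minimal sketch (assuming the new estimator follows the same params()/fit() builder pattern as the existing GaussianNb, and using the winequality data from the referenced example), fitting and predicting looks roughly like this:

use linfa::prelude::*;
use linfa_bayes::MultinomialNb;

// illustrative setup: binarize the winequality targets and hold out 10% for validation
let (train, valid) = linfa_datasets::winequality()
    .map_targets(|x| if *x > 6 { "good" } else { "bad" })
    .split_with_ratio(0.9);

// fit the Multinomial Naive Bayes model and predict on the validation set
let model = MultinomialNb::params().fit(&train)?;
let prediction = model.predict(&valid);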

Follow The Regularized Leader (FTRL) is a linear model for CTR prediction in online learning settings. It is a special type of linear model with a sigmoid function that uses both L1 and L2 regularization. The algorithm is contained in the newly-added linfa-ftrl crate, and an example can be found at linfa-ftrl/examples/winequality.rs.
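
Because FTRL targets online learning, fitting is expected to go through linfa's incremental FitWith interface rather than a one-shot fit. A minimal sketch under that assumption (default hyperparameters, binarized winequality targets as in the crate's example):

use linfa::prelude::*;
use linfa_ftrl::Ftrl;

// illustrative binary-classification split
let (train, valid) = linfa_datasets::winequality()
    .map_targets(|x| *x > 6)
    .split_with_ratio(0.9);

// start from no prior model (None) and fit incrementally with default parameters
let model = Ftrl::params().fit_with(None, &train)?;
let probabilities = model.predict(&valid);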

Distinguish between single and multi-target

Version 0.6.0 introduces a major change to the AsTargets trait, which is now split into AsSingleTargets and AsMultiTargets. Additionally, the Dataset* types are parametrized by target dimensionality, instead of always using a 2D array. Furthermore, algorithms that work on single-target data will no longer accept multi-target datasets as input. This change may cause build errors in existing code that calls the affected algorithms. The fix is as simple as adding Ix1 to the end of the type parameters of the dataset being passed in, which forces the dataset to be single-target.
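
For example, a minimal sketch of pinning a dataset to a single target (the records and targets here are made up for illustration):

use linfa::Dataset;
use ndarray::{array, Ix1};

// a 1-D target array yields a single-target dataset; spelling out the trailing
// Ix1 type parameter makes the single-target requirement explicit
let records = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let targets = array![0usize, 1, 0];
let dataset: Dataset<f64, usize, Ix1> = Dataset::new(records, targets);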

Improvements

  • Remove SeedableRng trait bound from KMeans and GaussianMixture.
  • Replace uses of Isaac RNG with Xoshiro RNG.
  • cross_validate changed to cross_validate_single, which is for single-target data; cross_validate_multi changed to cross_validate, which is for both single and multi-target datasets.
  • The probability type Pr has been constrained to the range 0.0 <= prob <= 1.0. Also, the simple Pr(x) constructor has been replaced by Pr::new(x), Pr::new_unchecked(x), and Pr::try_from(x), which ensure that the invariant for Pr is met (see the sketch below).
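
A minimal sketch of the new Pr constructors, assuming Pr is re-exported through linfa's prelude:

use linfa::prelude::*;

// constructor that checks the 0.0 <= prob <= 1.0 invariant
let p = Pr::new(0.5);
// skips the check when the caller already guarantees the invariant
let q = Pr::new_unchecked(0.5);
// fallible variant that returns a Result for out-of-range values
let maybe_p = Pr::try_from(1.5);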

0.5.1

2 years ago

Linfa's 0.5.1 release fixes errors and bugs in the previous release and removes useless trait bounds on the Dataset type. Note that the commits for this release are located in the 0-5-1 branch of the GitHub repo.

Improvements

  • remove Float trait bound from many Dataset impls, making non-float datasets usable
  • fix build errors in 0.5.0 caused by breaking minor releases from dependencies
  • fix bug in k-means where the termination condition of the algorithm was calculated incorrectly
  • fix build failure when building linfa alone, caused by incorrect feature selection for ndarray

0.5.0

2 years ago

Linfa's 0.5.0 release adds initial support for the OPTICS algorithm, multinomial logistic regression, and the family of nearest neighbor algorithms. Furthermore, we have improved documentation and introduced hyperparameter checking for all algorithms.

New algorithms

OPTICS is an algorithm for finding density-based clusters. It can produce reachability plots and a hierarchical structure of clusters, and it is commonly used to analyse data without assuming any prior distribution. The algorithm has been added to linfa-clustering and an example can be found at linfa-clustering/examples/optics.rs.

Extending logistic regression to the multinomial distribution generalizes it to multiclass problems. This release adds support for multinomial logistic regression to linfa-logistic; you can experiment with the example at linfa-logistic/examples/winequality_multi.rs.
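
A minimal sketch, assuming the multinomial variant exposes the same builder style as the binary LogisticRegression (the dataset split is only illustrative):

use linfa::prelude::*;
use linfa_logistic::MultiLogisticRegression;

// keep the original multi-class winequality labels and hold out 10% for validation
let (train, valid) = linfa_datasets::winequality().split_with_ratio(0.9);

// fit a multinomial logistic regression model and predict class labels
let model = MultiLogisticRegression::default()
    .max_iterations(50)
    .fit(&train)?;
let prediction = model.predict(&valid);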

Nearest neighbor search finds the points in a dataset that are closest to a given sample. It appears in numerous fields of application, for example as a distance and neighborhood provider for clustering algorithms. This release adds a family of nearest neighbor algorithms, namely Ball tree, K-d tree and naive linear search. You can find an example in the next section.

Improvements

  • use least-square solver from ndarray-linalg in linfa-linear
  • make clustering algorithms generic over distance metrics
  • bump ndarray to 0.15
  • introduce ParamGuard trait for explicit and implicit parameter checking (read more in the CONTRIBUTE.md)
  • improve documentation in various places

Nearest Neighbors

You can now choose from a growing list of NN implementations. The family provides efficient distance metrics and neighborhood queries to algorithms such as KMeans and DBSCAN. The example shows how to use a K-d tree nearest-neighbor index to find all the points in a set of observations that are within a certain range of a candidate point.

You can query nearest points explicitly:

use linfa_nn::{distance::L2Dist, CommonNearestNeighbour, NearestNeighbour};

// create a KdTree index consisting of all the points in the observations, using Euclidean (L2) distance
let kdtree = CommonNearestNeighbour::KdTree.from_batch(&observations, L2Dist)?;

// find all points within `range` of the chosen candidate point
let candidate = observations.row(2);
let points = kdtree.within_range(candidate.view(), range)?;

Or use one of the distance metrics implicitly, here demonstrated for KMeans:

use linfa_nn::distance::LInfDist;

let model = KMeans::params_with(3, rng, LInfDist)
    .max_n_iterations(200)
    .tolerance(1e-5)
    .fit(&dataset)?;

0.4.0

3 years ago

Linfa's 0.4.0 release introduces four new algorithms, improves the documentation of the ICA and K-means implementations, adds more benchmarks to K-means, and updates to ndarray version 0.14.

New algorithms

The Partial Least Squares regression model family is added in this release (thanks to @relf). It projects both the observed and the predicted variables into a latent space and maximizes the correlation between them. For problems with a large number of targets or collinear predictors it performs better than standard regression. For more information, look into the documentation of linfa-pls.
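
A minimal sketch, assuming PlsRegression is parametrized by the number of latent components and fitted on a multi-target dataset such as linnerud:

use linfa::prelude::*;
use linfa_pls::PlsRegression;

// illustrative multi-target dataset (exercise vs. physiological measurements)
let dataset = linfa_datasets::linnerud();

// project onto two latent components, fit, and predict the targets
let pls = PlsRegression::params(2).fit(&dataset)?;
let prediction = pls.predict(&dataset);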

A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization and projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the low-dimensional target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining the performance. Kudos to github/frjnn for providing an implementation!

A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements a count vectorizer and TF-IDF normalization for text preprocessing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves an 84% F1 score for predicting the categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.

Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for SV classification when probabilities are required. Furthermore, a multi-class model, which combines multiple binary models (e.g. calibrated SVM models) into a single multi-class model, is also added. These composing models have been moved to the linfa/src/composing/ subfolder.

Improvements

Numerous improvements are added to the KMeans implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model has been added, and KMeans++/KMeans|| initialization provides good initial cluster means for medium and large datasets.

We also moved to ndarray version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit has changed such that it always returns a Result, and error handling has been added to the linfa-logistic and linfa-reduction subcrates.
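
As a small illustration of F::cast, here is a hypothetical helper that is generic over the float type; the literal is converted into whichever concrete type F resolves to:

use linfa::Float;

// hypothetical helper: works for both f32 and f64 thanks to F::cast
fn halve<F: Float>(x: F) -> F {
    x * F::cast(0.5)
}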

You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added which takes the number of folds, the model parameters and a closure for the evaluation metric. It automatically performs k-folding and averages the metric over the folds. To compare different L1 ratios of an elasticnet model, you can use it in the following way:

// L1 ratios to compare
let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];

// create a model for each parameter
let models = ratios
    .iter()
    .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
    .collect::<Vec<_>>();

// get the mean r2 validation score across 5 folds for each model
let r2_values =
    dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;

// show the mean r2 score for each parameter choice
for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
    println!("L1 ratio: {}, r2 score: {}", ratio, r2);
}

Other changes

  • fix for border points in the DBSCAN implementation
  • improved documentation of the ICA subcrate
  • prevent overflowing code example in website

0.3.1

3 years ago

In this release of Linfa the documentation is extended, new examples are added and the functionality of datasets is improved. No new algorithms were added.

The meta-issue #82 gives a good overview of the necessary documentation improvements; testing, documentation and examples were considerably extended in this release.

Furthermore, new functionality was added to datasets and multi-target datasets were introduced. Bootstrapping is now possible for features and samples, and you can cross-validate your model with k-folding. We polished various bits of the kernel machines and simplified the interface there.
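
A minimal sketch of the new bootstrapping support, assuming a bootstrap_samples method on datasets that yields resampled datasets from a seeded RNG:

use rand::SeedableRng;
use rand_isaac::Isaac64Rng;

// draw three bootstrap resamples of 100 samples each from the dataset
let mut rng = Isaac64Rng::seed_from_u64(42);
let resamples = dataset
    .bootstrap_samples(100, &mut rng)
    .take(3)
    .collect::<Vec<_>>();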

The trait structure of the regression metrics was simplified and the silhouette score was introduced for easier testing of K-Means and other algorithms.

Changes

  • improve documentation in all algorithms, various commits
  • add a website to the infrastructure (c8acc785b)
  • add k-folding with and without copying (b0af80546f8)
  • add feature naming and pearson's cross correlation (71989627f)
  • improve ergonomics when handling kernels (1a7982b973)
  • improve TikZ generator in linfa-trees (9d71f603bbe)
  • introduce multi-target datasets (b231118629)
  • simplify regression metrics and add cluster metrics (d0363a1fa8ef)

Example

You can now perform cross-validation with k-folding. @Sauro98 actually implemented two versions, one which copies the dataset into k folds and one which avoids excessive memory operations by copying only the validation dataset around. For example, to test a model with 8 folds:

// perform cross-validation with the F1 score
let f1_runs = dataset
    .iter_fold(8, |v| params.fit(&v).unwrap())
    .map(|(model, valid)| {
        let cm = model
            .predict(&valid)
            .mapv(|x| x > Pr::even())
            .confusion_matrix(&valid).unwrap();

        cm.f1_score()
    })
    .collect::<Array1<_>>();

// calculate mean and standard deviation
println!("F1 score: {}±{}",
    f1_runs.mean().unwrap(),
    f1_runs.std_axis(Axis(0), 0.0),
);

0.3.0

3 years ago

New algorithms

  • Approximated DBSCAN has been added to linfa-clustering by [@Sauro98]
  • Gaussian Naive Bayes has been added to linfa-bayes by [@VasanthakumarV]
  • Elastic Net linear regression has been added to linfa-elasticnet by [@paulkoerbitz] and [@bytesnake]

Changes

  • Added benchmark to gaussian mixture models (a3eede55)
  • Fixed bugs in linear decision trees, added generator for TikZ trees (bfa5aebe7)
  • Implemented serde for all crates behind feature flag (4f0b63bb)
  • Implemented new backend features (7296c9ec4)
  • Introduced linfa-datasets for easier testing (3cec12b4f)
  • Rename Dataset to DatasetBase and introduce Dataset and DatasetView (21dd579cf)
  • Improve kernel tests and documentation (8e81a6d)

Example

The following section shows a small example of how datasets interact with the training and testing of a linear decision tree.

You can load a dataset, shuffle it and then split it into training and validation sets:

// initialize pseudo random number generator with seed 42
let mut rng = Isaac64Rng::seed_from_u64(42);
// load the Iris dataset, shuffle and split with ratio 0.8
let (train, test) = linfa_datasets::iris()
    .shuffle(&mut rng)
    .split_with_ratio(0.8);

With the training dataset a linear decision tree model can be trained. Entropy is used as a metric for the optimal split here:

let entropy_model = DecisionTree::params()
    .split_quality(SplitQuality::Entropy)
    .max_depth(Some(100))
    .min_weight_split(10.0)
    .min_weight_leaf(10.0)
    .fit(&train);

The validation dataset is now used to estimate the error. For this, the true labels are predicted and a confusion matrix gives insight into the types of errors:

let cm = entropy_model
    .predict(test.records().view())
    .confusion_matrix(&test);

println!("{:?}", cm);

println!(
    "Test accuracy with Entropy criterion: {:.2}%",
    100.0 * cm.accuracy()
);

Finally, you can analyze which features were used in the decision and export the whole tree to a TeX file. It will contain a TikZ tree with information on the splitting decision and impurity improvement:

let feats = entropy_model.features();
println!("Features trained in this tree {:?}", feats);

// export the fitted entropy-based tree to a TikZ file
let mut tikz = File::create("decision_tree_example.tex").unwrap();
tikz.write(entropy_model.export_to_tikz().to_string().as_bytes())
    .unwrap();

The whole example can be found in linfa-trees/examples/decision_tree.rs.

0.2.1

3 years ago

Changes

0.2.0

3 years ago

New algorithms

  • Ordinary Linear Regression has been added to linfa-linear by [@Nimpruda] and [@paulkoerbitz]
  • Generalized Linear Models have been added to linfa-linear by [@VasanthakumarV]
  • Linear decision trees were added to linfa-trees by [@mossbanay]
  • Fast independent component analysis (ICA) has been added to linfa-ica by [@VasanthakumarV]
  • Principal Component Analysis and Diffusion Maps have been added to linfa-reduction by [@bytesnake]
  • Support Vector Machines have been added to linfa-svm by [@bytesnake]
  • Logistic regression has been added to linfa-logistic by [@paulkoerbitz]
  • Hierarchical agglomerative clustering has been added to linfa-hierarchical by [@bytesnake]
  • Gaussian Mixture Models have been added to linfa-clustering by [@relf]

Changes

  • Common metrics for classification and regression have been added
  • A new dataset interface simplifies the work with targets and labels
  • New traits for Transformer, Fit and IncrementalFit standardize the interface
  • Switched to Github Actions for better integration