R package for dimensionality reduction of small datasets
An R package for small-scale dimensionality reduction using neighborhood-preservation dimensionality reduction methods, including t-Distributed Stochastic Neighbor Embedding, LargeVis and UMAP.
LargeVis and UMAP are of particular interest because they seem to give visualizations which are very competitive with t-SNE, but can use stochastic gradient descent to give faster run times and/or better scaling with dataset size than the typical Barnes-Hut t-SNE implementation.
This package is designed to make it easier to experiment with and compare these methods, by removing differences in implementation details.
One way it does this is by abandoning the more advanced nearest-neighbor methods, distance approximations, sampling, and multi-threaded stochastic gradient descent techniques. The price paid for this simplification is that the algorithms are back to being O(N^2) in storage and computation costs (and being in pure R). Unlike UMAP, the official implementation of LargeVis, and the Barnes-Hut implementation of t-SNE, this package is therefore not suitable for large scale visualization. Hence the name smallvis.
smallvis
uses the vizier
package to plot the coordinates during optimization. It's not on CRAN, and
therefore requires a fairly new version of
devtools (1.9 or greater) to
install this as a dependency from github.
There is also an optional dependency on the
RSpectra package, which is used
only if you want to initialize from a spectral method (set Y_init = "laplacian"
or Y_init = "normlaplacian"
to do this and see the paper by
Linderman and Steinerberger for details on
why you might want to). If not present, then the standard R function eigen
is
used, but this is much slower (because we only need the first few eigenvectors,
and eigen
calculates all of them). On my Sandy Bridge-era laptop running R
3.4.2 on Windows 10, using Rspectra::eigs
to fetch the top 3 eigenvectors from
a 6,000 x 6,000 affinity matrix takes about 6 seconds; using eigen
takes
around 25 minutes.
install.packages("devtools")
devtools::install_github("jlmelville/smallvis", subdir = "smallvis")
library(smallvis)
# By default, we use all numeric columns found in a data frame, so you don't need to filter out factor or strings
# set verbose = TRUE to log progress to the console
# Automatically plots the results during optimization
tsne_iris <- smallvis(iris, perplexity = 25, verbose = TRUE)
# Using a custom epoch_callback
uniq_spec <- unique(iris$Species)
colors <- rainbow(length(uniq_spec))
names(colors) <- uniq_spec
iris_plot <- function(x) {
plot(x, col = colors[iris$Species])
}
tsne_iris <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, verbose = TRUE)
# Default method is t-SNE, use largevis cost function instead
# Also needs a gamma value specified, and not as easy to optimize as t-SNE:
# reduce learning rate (eta) and increase maximum iterations
largevis_iris <- smallvis(iris, method = "largevis", perplexity = 25, epoch_callback = iris_plot,
eta = 0.1, max_iter = 5000, verbose = TRUE)
# For extra control over method-specific parameters pass a list as the "method" parameter:
# In largevis, gamma controls the balance of repulsion vs attraction
# The smallvis man page lists the method-specific parameters which can be controlled in this way
largevis_iris <- smallvis(iris, method = list("largevis", gamma = 1), perplexity = 25, epoch_callback = iris_plot,
eta = 0.1, max_iter = 5000, verbose = TRUE)
# UMAP: see https://github.com/lmcinnes/umap
# UMAP also has extra parameters, but we use the defaults here
umap_iris <- smallvis(iris, method = "umap", perplexity = 25, eta = 0.01)
# use (scaled) PCA initialization so embedding is repeatable
tsne_iris_spca <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "spca")
# or initialize from Laplacian Eigenmap (similar to UMAP initialization)
tsne_iris_lap <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "lap")
# or initialize from normalized Laplacian eigenvectors (even closer to UMAP initialization)
tsne_iris_nlap <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "normlap")
# return extra information in a list, like with Rtsne
tsne_iris_extra <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, ret_extra = TRUE)
# more (potentially large and expensive to calculate) return values, but you have to ask for them specifically
tsne_iris_extra_extra <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot,
ret_extra = c("P", "Q", "DX", "DY", "X"))
# Repeat embedding 10 times and keep the one with the best cost
tsne_iris_best <- smallvis_rep(nrep = 10, X = iris, perplexity = 25, ret_extra = TRUE)
iris_plot(tsne_iris_best$Y)
# Let smallvis pick a perplexity for you, using the Intrinsic Dimensionality Perplexity
tsne_iris_idp <- smallvis(iris, epoch_callback = ecb, perplexity = "idp", Y_init = "spca",
exaggeration_factor = 4)
# Classical momentum optimization instead of delta-bar-delta
umap_iris_mom <- smallvis(iris, scale = FALSE, opt = list("mom", eta = 1e-2, mu = 0.8),
method = "umap", Y_init = "spca")
# L-BFGS optimization via the mize package
umap_iris_lbfgs <- smallvis(iris, scale = FALSE, opt = list("l-bfgs", c1 = 1e-4, c2 = 0.9),
method = "umap", Y_init = "spca", max_iter = 300)
# Early Exaggeration
tsne_iris_ex <- smallvis(iris, eta = 100, exaggeration_factor = 4, stop_lying_iter = 100)
# and Late Exaggeration as suggested by Linderman and co-workers
tsne_iris_lex <- smallvis(iris, eta = 100, exaggeration_factor = 4, stop_lying_iter = 100,
late_exaggeration_factor = 1.5, start_late_lying_iter = 900)
smallvis
currently relies
on my examination of the UMAP source code, with some much-appreciated clarification from UMAP creator
Leland McInnes. Expect some bugs, and any horrifically
bogus results should be double-checked with the output of the
official UMAP implementation before casting calumnies
on the quality of UMAP itself.gamma
parameter, which weights the contribution
of attractive and repulsive contributions to the cost function. In the real LargeVis,
it is recommended to set this value to 7
, but this relies on the specifics of
the stochastic gradient descent method. In smallvis
, this value is very
dataset dependent: the more data you have, the smaller gamma should be to avoid
over-emphasising repulsive interactions.smallvis
, each interaction is a combination of both.smallvis
uses the same delta-bar-delta optimization method used in t-SNE. It works well
in my experience, but may require some tuning and more iterations compared to
optimizing t-SNE.smallvis
. It can be controlled by the lveps
parameter.smallvis
exists mainly to satisfy my urge to answer the various, minor,
stamp-collecting questions
that have occurred to me as I have read the dimensionality reduction literature.
Those that I have cobbled together into something that demonstrates the use of
smallvis
can be found at the
documentation page.
For large scale visualization in R see:
Also of relevance are:
Much of the code here is based on my fork of Justin Donaldson's R package for t-SNE.
GPLv2 or later. Any LargeVis-specific code (e.g. cost and gradient calculation) can also be considered Apache 2.0. Similarly, UMAP-related code is also licensed as BSD 3-clause.