Uwot Versions

An R package implementing the UMAP dimensionality reduction method.

v0.2.2

1 month ago

uwot 0.2.2

Bug fixes and minor improvements

  • RSpectra is now a required dependency (again). It was a required dependency up until version 0.1.12, when it became optional (irlba was used in its place). However, problems with how the current version of irlba interacts with an ABI change in the Matrix package mean that it's hard for downstream packages and users to build uwot without re-installing Matrix and irlba from source, which may not be an option for some people. It was also causing a CRAN check error. I have changed some tests, examples and vignettes to use RSpectra explicitly, and to only test irlba code-paths where necessary. See https://github.com/jlmelville/uwot/issues/115 and links therein for more details.

v0.2.1

1 month ago

uwot 0.2.1

New features

  • The HNSW approximate nearest neighbor search algorithm is now supported via the RcppHNSW package. Set nn_method = "hnsw" to use it. The behavior of the method can be controlled by the new nn_args parameter, a list which may contain M, ef_construction and ef. See the hnswlib library's ALGO_PARAMS documentation for details on these parameters. Although typically faster than Annoy (for a given accuracy), be aware that the only supported metric values are "euclidean", "cosine" and "correlation". Finally, RcppHNSW is only a suggested package, not a requirement, so you need to install it yourself (e.g. via install.packages("RcppHNSW")). Also see the article on HNSW in uwot in the documentation.
  • The nearest neighbor descent approximate nearest neighbor search algorithm is now supported via the rnndescent package. Set nn_method = "nndescent" to use it. The behavior of the method can be controlled by the new nn_args parameter. There are many supported metrics and possible parameters that can be set in nn_args, so please see the article on nearest neighbor descent in uwot in the documentation, and also the rnndescent package's documentation for details. rnndescent is only a suggested package, not a requirement, so you need to install it yourself (e.g. via install.packages("rnndescent")).
  • New function: umap2, which acts like umap but with modified defaults, reflecting my experience with UMAP and correcting some small mistakes. See the umap2 article for more details.
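As a rough illustration of the HNSW option above, the sketch below runs umap on the built-in iris data with nn_method = "hnsw". The M, ef_construction and ef values are arbitrary placeholders rather than recommendations, and the guard makes the snippet a no-op unless uwot 0.2.1 or later and RcppHNSW are both installed.

```r
# Sketch only: does nothing unless uwot >= 0.2.1 and RcppHNSW are installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  requireNamespace("RcppHNSW", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.2.1"

if (ok) {
  X <- as.matrix(iris[, 1:4])  # 150 observations, 4 numeric columns

  set.seed(42)
  emb_hnsw <- uwot::umap(
    X,
    nn_method = "hnsw",
    metric = "euclidean",  # HNSW supports euclidean, cosine and correlation only
    # Placeholder values: tune these for your own data.
    nn_args = list(M = 16, ef_construction = 200, ef = 50)
  )
  print(dim(emb_hnsw))
}
```

Swapping in nn_method = "nndescent" follows the same pattern, with the rnndescent package installed and its own options passed through nn_args.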

Bug fixes and minor improvements

  • init_sdev = "range" caused an error with a user-supplied init matrix.
  • Transforming new data with the correlation metric was actually using the cosine metric if you saved and reloaded the model. Thank you Holly Hall for the report and helpful detective work (https://github.com/jlmelville/uwot/issues/117).
  • umap_transform could fail if the new data to be transformed had the scaled:center and scaled:scale attributes set (e.g. from applying the scale function).
  • If you asked umap_transform to return the fuzzy graph (ret_extra = c("fgraph")), it was transposed when batch = TRUE, n_epochs = 0. Thank you PedroMilanezAlmeida for reporting (https://github.com/jlmelville/uwot/issues/118).
  • Setting n_sgd_threads = "auto" with umap_transform caused a crash.
  • A warning was being emitted because the code was not specific enough about which dist class was meant. This may have particularly affected Seurat users. Thank you AndiMunteanu for reporting (and suggesting a solution) (https://github.com/jlmelville/uwot/issues/121).

v0.1.16

11 months ago

uwot 0.1.16

Bug fixes and minor improvements

v0.1.15

11 months ago

uwot 0.1.15

New features

  • New function: optimize_graph_layout. Use this to produce optimized output coordinates that reflect an input similarity graph (such as that produced by the similarity_graph function). similarity_graph followed by optimize_graph_layout is the same as running umap, so the purpose of these functions is to allow for more flexibility and decoupling between generating the nearest neighbor graph and optimizing the low-dimensional approximation to it. Based on a request by user Chengwei94 (https://github.com/jlmelville/uwot/issues/98).
  • New functions: simplicial_set_union and simplicial_set_intersect. These allow for the combination of different fuzzy graph representations of a dataset into a single fuzzy graph using the UMAP simplicial set operations. Based on a request in the Python UMAP issues tracker by user Dhar xion.
  • New parameter for umap_transform: ret_extra. This works like the equivalent parameter for umap, and should be a character vector specifying the extra information you would like returned in addition to the embedding, in which case a list will be returned with an embedding member containing the optimized coordinates. Supported values are "fgraph", "nn", "sigma" and "localr". Based on a request by user PedroMilanezAlmeida (https://github.com/jlmelville/uwot/issues/104).
  • New parameter for umap, tumap and umap_transform: seed. This will do the equivalent of calling set.seed internally, and hence will help with reproducibility. The chosen seed is exported if ret_model = TRUE and umap_transform will use that seed if present, so you only need to specify it in umap_transform if you want to change the seed. The default behavior remains to not modify the random number state. Based on a request by SuhasSrinivasan (https://github.com/jlmelville/uwot/issues/110).
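A minimal sketch of the decoupled workflow and the seed parameter described above, using the built-in iris data as a stand-in for real input. X is passed to optimize_graph_layout as well, on the assumption that the initialization may need to fall back on the input data. The guard makes the snippet a no-op unless uwot 0.1.15 or later is installed.

```r
# Sketch only: does nothing unless uwot >= 0.1.15 is installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.1.15"

if (ok) {
  X <- as.matrix(iris[, 1:4])

  # Step 1: build the fuzzy similarity graph only (no embedding yet).
  set.seed(42)
  graph <- uwot::similarity_graph(X, n_neighbors = 15)

  # Step 2: optimize low-dimensional coordinates for that graph.
  coords <- uwot::optimize_graph_layout(graph, X = X)

  # The seed parameter replaces an explicit set.seed() call before umap.
  emb_seeded <- uwot::umap(X, n_neighbors = 15, seed = 42)

  print(dim(graph))
  print(dim(coords))
}
```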

Bug fixes and minor improvements

  • A new setting for init_sdev: set init_sdev = "range" and initial coordinates will be range-scaled so each column takes values between 0 and 10. This pre-processing was added to the Python UMAP package at some point after uwot began development and so should probably always be used with the default init = "spectral" setting. However, it is not set by default to maintain backwards compatibility with older versions of uwot.
  • ret_extra = c("sigma") is now supported by lvish. The Gaussian bandwidths are returned in a sigma vector. In addition, a vector of intrinsic dimensionalities estimated for each point using an analytical expression of the finite difference method given by Lee and co-workers is returned in the dint vector.
  • The min_dist and spread parameters are now returned in the model when umap is run with ret_model = TRUE. This is just for documentation purposes: these values are not used directly by the model in umap_transform. If the parameters a and b are set directly when invoking umap, then both min_dist and spread will be set to NULL in the returned model. This feature was added in response to a question from kjiang18 (https://github.com/jlmelville/uwot/issues/95).
  • Some new checks for NA values in input data have been added. Also a warning will be emitted if n_components seems to have been set too high.
  • If n_components was greater than n_neighbors then umap_transform would crash the R session. Thank you to ChVav for reporting this (https://github.com/jlmelville/uwot/issues/102).
  • Using umap_transform with a model where dens_scale was set could cause a segmentation fault, destroying the session. Even if it didn't it could give an entirely artifactual "ring" structure. Thank you FemkeSmit for reporting this and providing assistance in diagnosing the underlying cause (https://github.com/jlmelville/uwot/issues/103).
  • If you set binary_edge_weights = TRUE, this setting was not exported when ret_model = TRUE, and was therefore not respected by umap_transform. This has now been fixed, but you will need to regenerate any models that used binary edge weights.
  • The rdoc for the init param said that if there were multiple disconnected components, a spectral initialization would attempt to merge multiple sub-graphs. Not true: actually, spectral initialization is abandoned in favor of PCA. The documentation has been updated to reflect the true state of affairs. No idea what I was thinking of there.
  • load_uwot and save_uwot didn't work on Windows 7 due to how the version of tar there handles drive letters. Thank you mytarmail for the report (https://github.com/jlmelville/uwot/issues/109).
  • Warn if the initial coordinates have a very large scale (a standard deviation > 10.0), because this can lead to small gradients and poor optimization. Thank you SuhasSrinivasan for the report (https://github.com/jlmelville/uwot/issues/110).
  • A change to accommodate a forthcoming version of RcppAnnoy. Thank you Dirk Eddelbuettel for the PR (https://github.com/jlmelville/uwot/issues/111).
  • A test was failing on Arm architectures. Problem has been "solved" by removing the test, but it was testing a floating point value resulting from a failure due to numerical issues, so it's a bit of a corner case. Thank you Lucas Kanashiro for reporting (https://github.com/jlmelville/uwot/issues/100).
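To make the init_sdev = "range" item above concrete, here is a base R sketch of range-scaling each column of an initial coordinate matrix to the interval 0 to 10. The helper name range_scale is hypothetical and this is not uwot's internal code, just the transformation as described.

```r
# Hypothetical helper: rescale every column of `coords` to span [lo, hi].
range_scale <- function(coords, lo = 0, hi = 10) {
  apply(coords, 2, function(col) {
    rng <- range(col)
    lo + (hi - lo) * (col - rng[1]) / (rng[2] - rng[1])
  })
}

set.seed(1)
# Made-up initial coordinates with a deliberately large spread.
init <- matrix(rnorm(20, sd = 100), ncol = 2)

scaled <- range_scale(init)
print(apply(scaled, 2, range))  # per-column minimum and maximum
```

Scaling like this also keeps the standard deviation of the initial coordinates well below the large-scale regime that the new warning in this release is meant to catch.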

v0.1.14

1 year ago

uwot 0.1.14

New features

  • New function: similarity_graph. If you are more interested in the high-dimensional graph/fuzzy simplicial set representation of your input data, and don't care about the low dimensional approximation, the similarity_graph function offers a similar API to umap, but neither the initialization nor optimization of low-dimensional coordinates will be performed. The return value is the same as that which would be returned in the results list as the fgraph member if you had provided ret_extra = c("fgraph"). Compared to getting the same result via running umap, this function is a bit more convenient to use, makes your intention clearer if you would be discarding the embedding, and saves a small amount of time. A t-SNE/LargeVis similarity graph can be returned by setting method = "largevis".

Bug fixes and minor improvements

v0.1.13

1 year ago

uwot 0.1.13

  • This is a resubmission of 0.1.12 but with an internal function (fuzzy_simplicial_set) refactored to behave more like that of previous versions. This change was breaking the behavior of the CRAN package bbknnR.

It would be pointless to release 0.1.12 as well as 0.1.13 as they are so similar. So here are the release notes for 0.1.12:

uwot 0.1.12

New features

  • New parameter: dens_scale. If set to a value between 0 and 1, an attempt is made to include the relative local densities of the input data in the output coordinates. This is an approximation to the densMAP method. A large value of dens_scale will use a larger range of output densities to reflect the input data. If the data is too spread out, reduce the value of dens_scale. For more information see the documentation at the uwot repo.
  • New parameter: binary_edge_weights. If set to TRUE, instead of smoothed knn distances, non-zero edge weights all have a value of 1. This is how PaCMAP works and there are practical and theoretical reasons to believe this won't have a big effect on UMAP, but you can try it yourself.
  • New options for ret_extra:
    • "sigma": the return value will contain a sigma entry, a vector of the smooth knn distance scaling normalization factors, one for each observation in the input data. A small value indicates a high density of points in the local neighborhood of that observation. For lvish the equivalent bandwidths calculated for the input perplexity are returned.
    • Also, a vector rho will be exported, which is the distance to the nearest neighbor after the number of neighbors specified by the local_connectivity parameter. Only applies for umap and tumap.
    • "localr": exports a vector of the local radii, the sum of sigma and rho, which is used to scale the output coordinates when dens_scale is set. Even if not using dens_scale, visualizing the output coordinates using a color scale based on the value of localr can reveal regions of the input data with different densities.
  • For functions umap and tumap only: new data type for precomputed nearest neighbor data passed as the nn_method parameter: you may use a sparse distance matrix of format dgCMatrix with dimensions N x N where N is the number of observations in the input data. Distances should be arranged by column, i.e. a non-zero entry in row j of the ith column indicates that the jth observation in the input data is a nearest neighbor of the ith observation with the distance given by the value of that element. Note that this is a different format to the sparse distance matrix that can be passed as input to X: notably, the matrix is not assumed to be symmetric. Unlike other input formats, you may have a different number of neighbors for each observation (but there must be at least one neighbor defined per observation).
  • umap_transform can also take a sparse distance matrix as its nn_method parameter if precomputed nearest neighbor data is used to generate an initial model. The format is the same as for the nn_method with umap. Because distances are arranged by columns, the expected dimensions of the sparse matrix are N_model x N_new where N_model is the number of observations in the original data and N_new is the number of observations in the data to be transformed.
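The column-wise layout of the sparse neighbor matrix described above can be sketched with the Matrix package. The neighbor indices and distances below are made up for four observations with two neighbors each, and whether a point should list itself as a neighbor is not addressed here.

```r
library(Matrix)

# Made-up k-nearest-neighbor output: row i lists the neighbors of observation i.
nn_idx <- matrix(c(2, 3,
                   1, 3,
                   2, 1,
                   3, 2), ncol = 2, byrow = TRUE)
nn_dist <- matrix(c(0.1, 0.4,
                    0.1, 0.3,
                    0.3, 0.4,
                    0.5, 0.6), ncol = 2, byrow = TRUE)

n <- nrow(nn_idx)
k <- ncol(nn_idx)

# Column i holds the neighbors of observation i: a non-zero entry in row j
# means observation j is a neighbor of i, at the stored distance.
nn_sparse <- sparseMatrix(
  i = as.vector(t(nn_idx)),       # row index = neighbor
  j = rep(seq_len(n), each = k),  # column index = observation
  x = as.vector(t(nn_dist)),
  dims = c(n, n)
)

print(class(nn_sparse))
print(nn_sparse)
```

The result is not symmetric: observation 4 has neighbors but is nobody's neighbor, so row 4 is empty. A matrix like this is what would be passed as nn_method to umap or tumap.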

Bug fixes and minor improvements

  • Models couldn't be re-saved after loading. Thank you to ilyakorsunsky for reporting this (https://github.com/jlmelville/uwot/issues/88).
  • RSpectra is now a 'Suggests', rather than an 'Imports'. If you have RSpectra installed, it will be used automatically where previous versions required it (for spectral initialization). Otherwise, irlba will be used. For two-dimensional output, you are unlikely to notice much difference in speed or accuracy with real-world data. For highly-structured simulation datasets (e.g. spectral initialization of a 1D line) then RSpectra will give much better, faster initializations, but these are not the typical use cases envisaged for this package. For embedding into higher dimensions (e.g. n_components = 100 or higher), RSpectra is recommended and will likely out-perform irlba even if you have installed a good linear algebra library.
  • init = "laplacian" returned the wrong coordinates because of a slightly subtle issue around how to order the eigenvectors when using the random walk transition matrix rather than normalized graph laplacians.
  • The init_sdev parameter was ignored when the init parameter was a user-supplied matrix. Now the input will be scaled.
  • Matrix input was being converted to and from a data frame during pre-processing, causing R to allocate memory that it was disinclined to ever give up even after the function exited. This unnecessary manipulation is now avoided.
  • The behavior of the bandwidth parameter has been changed to give results more like the current version (0.5.2) of the Python UMAP implementation. This is likely to be a breaking change for non-default settings of bandwidth, but this is not a parameter which is actually exposed by the Python UMAP public API any more, so is on the road to deprecation in uwot too and I don't recommend you change this.
  • Transforming data with multiple blocks would give an error if the number of rows of the new data did not equal the number of rows in the original data.

v0.1.11

2 years ago

uwot 0.1.11

New features

  • New parameter: batch. If TRUE, then results are reproducible when n_sgd_threads > 1 (as long as you use set.seed). The price to be paid is that the optimization is slightly less efficient (because coordinates are not updated as quickly and hence gradients are staler for longer), so it is highly recommended to set n_epochs = 500 or higher. Thank you to Aaron Lun who not only came up with a way to implement this feature, but also wrote an entire C++ implementation of UMAP which does it (https://github.com/jlmelville/uwot/issues/83).
  • New parameter: opt_args. The default optimization method when batch = TRUE is Adam. You can control its parameters by passing them in the opt_args list. As Adam is a momentum-based method it requires extra storage of previous gradient data. To avoid the extra memory overhead you can also use opt_args = list(method = "sgd") to use a stochastic gradient descent method like that used when batch = FALSE.
  • New parameter: epoch_callback. You may now pass a function which will be invoked at the end of each epoch. Mainly useful for producing an image of the state of the embedding at different points during the optimization. This is another feature taken from umappp.
  • New parameter: pca_method, used when the pca parameter is supplied to reduce the initial dimensionality of the data. This controls which method is used to carry out the PCA and can be set to one of:
    • "irlba" which uses irlba::irlba to calculate a truncated SVD. If this routine deems that you are trying to extract 50% or more of the singular vectors, you will see a warning to that effect logged to the console.
    • "rsvd", which uses irlba::svdr for truncated SVD. This method uses a small number of iterations which should give an accuracy/speed up trade-off similar to that of the scikit-learn TruncatedSVD method. This can be much faster than using "irlba" but potentially at a cost in accuracy. However, for the purposes of dimensionality reduction as input to nearest neighbor search, this doesn't seem to matter much.
    • "bigstatsr", which uses the bigstatsr package. Note that this is not a dependency of uwot: if you want to use bigstatsr, you must install it yourself. On platforms without easy access to fast linear algebra libraries (e.g. Windows), using bigstatsr may give a speed up to PCA calculations.
    • "svd", which uses base::svd. Warning: this is likely to be very slow for most datasets and exists as a fallback for small datasets where the "irlba" method would print a warning.
    • "auto" (the default) which uses "irlba" to calculate a truncated SVD, unless you are attempting to extract 50% or more of the singular vectors, in which case "svd" is used.
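A sketch of how the batch, opt_args and epoch_callback parameters above might be combined, using the iris data. The callback signature (epoch, n_epochs, coords) is taken from my reading of the package documentation and should be treated as an assumption. The guard makes the snippet a no-op unless uwot 0.1.11 or later is installed.

```r
# Sketch only: does nothing unless uwot >= 0.1.11 is installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.1.11"

if (ok) {
  X <- as.matrix(iris[, 1:4])

  # Callback invoked at the end of each epoch. Here it only records the epoch
  # number, but it could just as well plot `coords`.
  epochs_seen <- integer(0)
  record_epoch <- function(epoch, n_epochs, coords) {
    epochs_seen <<- c(epochs_seen, epoch)
  }

  set.seed(42)  # batch = TRUE is only reproducible together with a fixed seed
  emb_batch <- uwot::umap(
    X,
    batch = TRUE,
    n_epochs = 500,                   # more epochs, as recommended for batch mode
    opt_args = list(method = "sgd"),  # avoid Adam's extra gradient storage
    epoch_callback = record_epoch
  )
  print(length(epochs_seen))
}
```

Dropping opt_args leaves the batch optimizer at its Adam default.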

Bug fixes and minor improvements

  • If row names are provided in the input data (or nearest neighbor data, or initialization data if it's a matrix), this will be used to name the rows of the output embedding (https://github.com/jlmelville/uwot/issues/81), and also the nearest neighbor data if you set ret_nn = TRUE. If the names exist in more than one of the input data parameters listed above, but are inconsistent, no guarantees are made about which names will be used. Thank you jwijffels for reporting this.
  • In umap_transform, the learning rate is now down-scaled by a factor of 4, consistent with the Python implementation of UMAP. If you need the old behavior back, use the (newly added) learning_rate parameter in umap_transform to set it explicitly. If you used the default value in umap when creating the model, the correct setting in umap_transform is learning_rate = 1.0.
  • Setting nn_method = "annoy" and verbose = TRUE would lead to an error with datasets with fewer than 50 items in them.
  • Using multiple pre-computed nearest neighbors blocks is now supported with umap_transform (this was incorrectly documented to work).
  • Documentation around pre-calculated nearest neighbor data for umap_transform was wrong in other ways: it has now been corrected to indicate that there should be neighbor data for each item in the test data, but the neighbors and distances should refer to items in training data (i.e. the data used to build the model).
  • The n_neighbors parameter is now correctly ignored in model generation if pre-calculated nearest neighbor data is provided.
  • Documentation incorrectly said grain_size didn't do anything.

v0.1.10

3 years ago

uwot 0.1.10

This release is mainly to allow for some internal changes to keep compatibility with RcppAnnoy, used for the nearest neighbor calculations.

Bug fixes and minor improvements

  • Passing in data with missing values will now raise an error early. Missing data in factor columns intended for supervised UMAP is still ok. Thank you David McGaughey for tweeting about this issue.
  • The documentation for the return value of umap and tumap now notes that the contents of the model list are subject to change and not intended to be part of the uwot public API. I recommend not relying on the structure of the model, especially if your package is intended to appear on CRAN or Bioconductor, as any breakages will delay future releases of uwot to CRAN.

v0.1.9

3 years ago

uwot 0.1.9

New features

  • New metric: metric = "correlation" a distance based on the Pearson correlation (https://github.com/jlmelville/uwot/issues/22). Supporting this required a change to the internals of how nearest neighbor data is stored. Backwards compatibility with models generated by previous versions using ret_model = TRUE should have been preserved.
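The relationship between the correlation and cosine metrics can be checked in base R. The sketch assumes the common convention of defining each distance as one minus the corresponding similarity, which may differ in detail from what uwot computes internally. Under that convention, the correlation distance between two vectors equals the cosine distance between their mean-centered versions.

```r
# Assumed convention: distance = 1 - similarity.
cosine_dist <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
correlation_dist <- function(a, b) 1 - cor(a, b)

set.seed(1)
x <- rnorm(10)
y <- rnorm(10)

d_cor <- correlation_dist(x, y)
d_cos_raw <- cosine_dist(x, y)
d_cos_centered <- cosine_dist(x - mean(x), y - mean(y))

# Centering makes the cosine distance agree with the correlation distance;
# without centering the two generally differ.
print(c(correlation = d_cor, cosine_centered = d_cos_centered, cosine_raw = d_cos_raw))
```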

Bug fixes and minor improvements

  • New parameter, nn_method, for umap_transform: pass a list containing pre-computed nearest neighbor data (identical to that used in the umap function). You should not pass anything to the X parameter in this case. This extends the functionality for transforming new points to the case where nearest neighbor data between the original data and new data can be calculated external to uwot. Thanks to Yuhan Hao for contributing the PR (https://github.com/jlmelville/uwot/issues/63 and https://github.com/jlmelville/uwot/issues/64).
  • New parameter, init, for umap_transform: provides a variety of options for initializing the output coordinates, analogously to the same parameter in the umap function (but without as many options currently). This is intended to replace init_weighted, which should be considered deprecated, but won't be removed until uwot 1.0 (whenever that is). Instead of init_weighted = TRUE, use init = "weighted"; replace init_weighted = FALSE with init = "average". Additionally, you can pass a matrix to init to act as the initial coordinates.
  • Also in umap_transform: previously, setting n_epochs = 0 was ignored: at least one iteration of optimization was applied. Now, n_epochs = 0 is respected, and will return the initialized coordinates without any further optimization.
  • Minor performance improvement for single-threaded nearest neighbor search when verbose = TRUE: the progress bar calculations were taking up a detectable amount of time, and this has now been fixed. With very small data sets (< 50 items) the progress bar will no longer appear when building the index.
  • Passing a sparse distance matrix as input now supports upper/lower triangular matrix storage rather than wasting storage using an explicitly symmetric sparse matrix.
  • Minor license change: uwot used to be licensed under GPL-3 only; now it is GPL-3 or later.
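A short sketch of the new umap_transform options above, holding out every tenth row of iris as "new" data. The split is arbitrary and only for illustration. The guard makes the snippet a no-op unless uwot 0.1.9 or later is installed.

```r
# Sketch only: does nothing unless uwot >= 0.1.9 is installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.1.9"

if (ok) {
  X <- as.matrix(iris[, 1:4])
  test_idx <- seq(1, nrow(X), by = 10)  # hold out 15 rows as new data

  set.seed(42)
  model <- uwot::umap(X[-test_idx, ], ret_model = TRUE)

  # init = "average" replaces init_weighted = FALSE. With n_epochs = 0 the
  # initialized coordinates are returned without any optimization.
  init_only <- uwot::umap_transform(X[test_idx, ], model,
                                    init = "average", n_epochs = 0)

  # init = "weighted" replaces init_weighted = TRUE; optimization runs as usual.
  refined <- uwot::umap_transform(X[test_idx, ], model, init = "weighted")

  print(dim(init_only))
}
```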

v0.1.8

3 years ago

uwot 0.1.8

Better late than never, here are the release notes for CRAN release 0.1.8. It's a bumper selection due to my failure to get 0.1.6 and 0.1.7 accepted.

New features

  • New parameter, ret_extra, a vector which can contain any combination of: "model" (same as ret_model = TRUE), "nn" (same as ret_nn = TRUE) and "fgraph" (see below).
  • New return value data: If the ret_extra vector contains "fgraph", the returned list will contain an fgraph item representing the fuzzy simplicial input graph as a sparse N x N matrix. For lvish, use "P" instead of "fgraph" (https://github.com/jlmelville/uwot/issues/47). Note that there is a further sparsifying step where edges with a very low membership are removed if there is no prospect of the edge being sampled during optimization. This is controlled by n_epochs: the smaller the value, the more sparsifying will occur. If you are only interested in the fuzzy graph and not the embedded coordinates, set n_epochs = 0.
  • New function: unload_uwot, to unload the Annoy nearest neighbor indices in a model. This prevents the model from being used in umap_transform, but allows for the temporary working directory created by both save_uwot and load_uwot to be deleted. Previously, both load_uwot and save_uwot were attempting to delete the temporary working directories they used, but would always silently fail because Annoy is making use of files in those directories.
  • An attempt has been made to reduce the variability of results due to different compiler and C++ library versions on different machines. Visually results are unchanged in most cases, but this is a breaking change in terms of numerical output. The best chance of obtaining floating point determinism across machines is to use init = "spca", fixed values of a and b (rather than allowing them to be calculated through setting min_dist and spread) and approx_pow = TRUE. Using the tumap method with init = "spca" is probably the most robust approach.
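For the fuzzy graph use case above, a sketch along these lines returns the graph together with the unoptimized initial coordinates. It uses iris as placeholder data, and the guard makes the snippet a no-op unless uwot 0.1.8 or later is installed.

```r
# Sketch only: does nothing unless uwot >= 0.1.8 is installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.1.8"

if (ok) {
  X <- as.matrix(iris[, 1:4])

  set.seed(42)
  # n_epochs = 0 skips the optimization, so no low-membership edges are pruned.
  res <- uwot::umap(X, n_epochs = 0, ret_extra = c("fgraph"))

  print(dim(res$fgraph))     # sparse N x N fuzzy graph
  print(dim(res$embedding))  # initialization coordinates
}
```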

Bug fixes and minor improvements

  • The default for n_threads is now NULL to provide a bit more protection from changing dependencies.
  • uwot should no longer trigger undefined behavior in sanitizers, due to replacement of RcppParallel with the standard C++11 implementation of threading (and some code "borrowed" from RcppParallel) (https://github.com/jlmelville/uwot/issues/52).
  • Further sanitizer improvements in the nearest neighbor search code due to the upstream efforts of erikbern and eddelbuettel (https://github.com/jlmelville/uwot/issues/50).
  • New behavior when n_epochs = 0. This used to behave like n_epochs = NULL and gave a default number of epochs (dependent on the number of vertices in the dataset). Now it more usefully carries out all calculations except optimization, so the returned coordinates are those specified by the init parameter. This is an easy way to access e.g. the spectral or PCA initialization coordinates. If you want the input fuzzy graph (ret_extra vector contains "fgraph"), this will also prevent edges with very low membership from being removed from the graph. You still get the old default epochs behavior by setting n_epochs = NULL or to a negative value.
  • save_uwot and load_uwot have been updated with a verbose parameter so it's easier to see what temporary files are being created.
  • save_uwot has a new parameter, unload, which if set to TRUE will delete the working directory for you, at the cost of unloading the model, i.e. it can't be used with umap_transform until you reload it with load_uwot.
  • save_uwot now returns the saved model with an extra field, mod_dir, which points to the location of the temporary working directory, so you should now assign the result of calling save_uwot to the model you saved, e.g. model <- save_uwot(model, "my_model_file"). This field is intended for use with unload_uwot.
  • load_uwot also returns the model with a mod_dir item for use with unload_uwot.
  • save_uwot and load_uwot were not correctly handling relative paths.
  • A previous bug fix to load_uwot in uwot 0.1.4 to work with newer versions of RcppAnnoy (https://github.com/jlmelville/uwot/issues/31) failed in the typical case of a single metric for the nearest neighbor search using all available columns, giving an error message along the lines of: Error: index size <size> is not a multiple of vector size <size>. This has now been fixed, but required changes to both save_uwot and load_uwot, so existing saved models must be regenerated. Thank you to reporter OuNao.
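The save, unload and reload cycle described above might look like the following sketch. The file name comes from tempfile and iris is placeholder data. The guard makes the snippet a no-op unless uwot 0.1.8 or later is installed.

```r
# Sketch only: does nothing unless uwot >= 0.1.8 is installed.
ok <- requireNamespace("uwot", quietly = TRUE) &&
  utils::packageVersion("uwot") >= "0.1.8"

if (ok) {
  X <- as.matrix(iris[, 1:4])

  set.seed(42)
  model <- uwot::umap(X, ret_model = TRUE)

  model_file <- tempfile("uwot_model_")
  # Keep the returned model: it carries the mod_dir field used by unload_uwot.
  model <- uwot::save_uwot(model, file = model_file)

  # Release the Annoy indices so the temporary working directory can be
  # deleted. The model cannot be used with umap_transform after this.
  uwot::unload_uwot(model)

  # Reload from disk to get a usable model again.
  model <- uwot::load_uwot(model_file)
  new_coords <- uwot::umap_transform(X[1:5, ], model)
  print(dim(new_coords))
}
```

Passing unload = TRUE to save_uwot should have the same effect as the explicit unload_uwot call here.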