(Python, R, C) Collective (multi-view/multi-way) matrix factorization, including cold-start functionality (recommender systems, imputation, dimensionality reduction)
Implementation of collective matrix factorization, based on "Relational learning via collective matrix factorization", with some enhancements and alternative models for cold-start recommendations as described in "Cold-start recommendations in Collective Matrix Factorization", and adding implicit-feedback variants as described in "Collaborative filtering for implicit feedback datasets".
This is a hybrid collaborative filtering model for recommender systems that takes as input either explicit item ratings or implicit-feedback data, and side information about users and/or items (although it can also fit pure collaborative-filtering and pure content-based models). The overall idea was extended here to also be able to do cold-start recommendations (for users and items that were not in the training data but which have side information available).
Although the package was developed with recommender systems in mind, it can also be used in other domains (e.g. topic modeling, dimensionality reduction, missing value imputation) - just take any mention of users as rows in the main matrix, any mention of items as columns, and use the "explicit" models.
For more information about the implementation here, or if you would like to cite this in your research, see "Cold-start recommendations in Collective Matrix Factorization".
For a similar package with Poisson distributions see ctpfrec.
Written in C with Python and R interfaces. An additional Ruby interface can be found here.
For an introduction to the library and methods, see:
For the full benchmark, code, and details see benchmarks.
Comparing the classical matrix factorization model for explicit feedback without side information in different software libraries (50 factors, 15 iterations, regularization of 0.05, `float64` when supported) on the MovieLens10M dataset:
Library | Method | Biases | Time (s) | RMSE | Additional |
---|---|---|---|---|---|
cmfrec | ALS-CG | Yes | 13.64 | 0.788233 | |
cmfrec | ALS-Chol | Yes | 35.35 | 0.782414 | Implicit features |
LibMF | SGD | No | 1.79 | 0.785585 | float32 |
Spark | ALS-Chol | No | 81 | 0.791316 | Manual center |
cornac | SGD | Yes | 13.9 | 0.816548 | |
Surprise | SGD | Yes | 178 | 1.060049 | |
spotlight | ADAM | No | 12141 | 1.054698 | See details |
LensKit | ALS-CD | Static | 26.8 | 0.796050 | Manual thread control |
PyRecLab | SGD | Yes | 90 | 0.812566 | Reads from disk |
rsparse | ALS-Chol | Yes | 30.13 | 0.786935 | |
softImpute | ALS-Chol | Static | 88.93 | 0.810450 | Unscaled lambda |
softImpute | ALS-SVD | Static | 195.73 | 0.808293 | Unscaled lambda |
Vowpal Wabbit | SGD | Yes | 293 | 1.054546 | See details |
Comparing the implicit variant without side information in different software libraries (50 factors, 15 iterations, regularization of 5, `float64` when supported) on the LastFM-360K dataset:
Library | Method | Time (s) | P@10 | MAP | Additional |
---|---|---|---|---|---|
cmfrec | ALS-CG | 29.52 | 0.16969 | 0.12135 | |
cmfrec | ALS-Chol | 51.28 | 0.1701 | 0.121761 | |
implicit | ALS-CG | 29.0 | 0.17007 | 0.120986 | |
implicit | ALS-Chol | 98 | 0.17031 | 0.121167 | |
LensKit | ALS-CG | 68 | 0.17069 | 0.121846 | |
LensKit | ALS-Chol | 84 | 0.16941 | 0.122121 | |
cornac | ADAM | 13338 | 0.00889 | 0.006288 | float32 |
Spark | ALS-Chol | oom | oom | oom | See details |
rsparse | ALS-CG | 39.18 | 0.16998 | 0.121242 | |
rsparse | ALS-Chol | 69.75 | 0.16941 | 0.121353 | |
LibMF | ALS-CD | 143.67 | 0.14307 | 0.093755 | float32 |
qmf | ALS-Chol | 102 | 0.17019 | 0.122017 | |
(See introductory notebook above for more details)
The model consists in predicting the rating (or weighted confidence, for the implicit-feedback case) that a user would give to an item by performing a low-rank factorization of an interactions matrix `X` of size `users` x `items` (e.g. ratings):

`X ~ A * B.T`

(where `A` and `B` are the fitted model matrices)
But it does so using side information about the items (such as movie tags) and/or users (such as their demographic info), by also factorizing the item side info matrix and/or the user side info matrix:

`U ~ A * C.T`, `I ~ B * D.T`

sharing the same item/user-factor matrices that are used to factorize the ratings, or sharing only some of the latent factors.
This also has the side effect of allowing recommendations for users and items for which there is side information but no ratings, although such predictions might not be of the same quality.
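To make the shared-factor idea concrete, here is a rough toy sketch in plain NumPy (not this library's API or its actual solver, which handles missing entries, biases, implicit feedback, and more): a naive alternating-least-squares loop on fully-observed matrices, factorizing `X`, `U`, and `I` with shared `A` and `B`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q, k, lam = 50, 40, 6, 5, 4, 0.1

# Synthetic fully-observed data generated from a shared low-rank structure
A0 = rng.normal(size=(m, k)); B0 = rng.normal(size=(n, k))
X = A0 @ B0.T + 0.01 * rng.normal(size=(m, n))   # interactions
U = A0 @ rng.normal(size=(p, k)).T               # user side info (shares A0)
I = B0 @ rng.normal(size=(q, k)).T               # item side info (shares B0)

# Random initialization of the four factor matrices
A = rng.normal(size=(m, k)); B = rng.normal(size=(n, k))
C = rng.normal(size=(p, k)); D = rng.normal(size=(q, k))

def loss():
    return (np.sum((X - A @ B.T) ** 2)
            + np.sum((U - A @ C.T) ** 2)
            + np.sum((I - B @ D.T) ** 2))

err_start = loss()
Ik = lam * np.eye(k)
for _ in range(30):
    # Each row of A regresses [x_u, u_u] against the stacked features [B; C]
    A = np.linalg.solve(B.T @ B + C.T @ C + Ik, (X @ B + U @ C).T).T
    B = np.linalg.solve(A.T @ A + D.T @ D + Ik, (X.T @ A + I @ D).T).T
    C = np.linalg.solve(A.T @ A + Ik, (U.T @ A).T).T
    D = np.linalg.solve(B.T @ B + Ik, (I.T @ B).T).T

err_end = loss()
print(err_start, err_end)   # the joint reconstruction error decreases
```

Note how the `A` update pools information from both `X` and `U` (and likewise `B` from `X` and `I`) - that pooling is what lets side information influence the learned factors.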
Alternatively, it can produce factorizations in which the factor matrices are determined from the attributes directly (e.g. `A = U * C`), with or without a free offset.
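As a tiny NumPy illustration of that formulation (all matrices below are hypothetical stand-ins, not this library's API): once the attribute-to-factor map `C` has been learned, factors for a brand-new user follow directly from its attributes, with no ratings needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_attr, k = 6, 3, 4

U = rng.normal(size=(n_users, n_attr))   # user attributes
C = rng.normal(size=(n_attr, k))         # attribute-to-factor map (would be learned)
offset = rng.normal(size=(n_users, k))   # optional free per-user offset (would be learned)

A = U @ C + offset   # user factors determined by attributes (+ offset)

# A brand-new user with known attributes gets factors immediately (cold-start)
u_new = rng.normal(size=(1, n_attr))
a_new = u_new @ C
print(a_new.shape)   # (1, 4)
```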
While the method was initially devised for recommender systems, it can also be used as a general technique for dimensionality reduction by taking the `A` matrix as low-dimensional factors, which can be calculated for new data too.
Alternatively, it might also produce good results when used as an imputer for missing values in tabular data. The Python version is scikit-learn compatible and has a separate class aimed at being used for imputation in scikit-learn pipelines. Example here.
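As a rough illustration of the general idea behind factorization-based imputation (a plain NumPy hard-imputation sketch via iterated truncated SVD - not this package's imputer class or its actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 3

# Low-rank matrix with some entries missing
full = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
X = full.copy()
mask = rng.random(size=(m, n)) < 0.2     # ~20% of entries missing
X[mask] = np.nan

# Iterative low-rank imputation: fill, truncate to rank k, repeat
filled = np.where(mask, 0.0, X)          # start from zeros at the missing spots
for _ in range(50):
    Uk, s, Vk = np.linalg.svd(filled, full_matrices=False)
    approx = (Uk[:, :k] * s[:k]) @ Vk[:k]
    filled = np.where(mask, approx, X)   # keep the observed values fixed

err = np.sqrt(np.mean((filled[mask] - full[mask]) ** 2))
print(err)   # reconstruction error on the held-out (missing) entries
```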
The package has been rewritten in C with Python wrappers. If you've used earlier versions of this package, which relied on TensorFlow for the calculations, note that the optimal hyperparameters will now be very different, as some details of the loss function have changed (e.g. some terms are no longer divided by the number of entries). The new version is faster, multi-threaded, and has some new functionality; if for some reason you still need the old one, it can be found under the git branch "tensorflow".
Both `float32` and `float64` precisions are supported - the Python package comes with both versions (with dimensions limited to `INT_MAX`).

Note: requires a C compiler configured for Python. See this guide for instructions.
pip install cmfrec
or if that fails:
pip install --no-use-pep517 cmfrec
(See performance tips below)
Note for macOS users: on macOS, the Python version of this package might compile without multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
brew install libomp
And then reinstall this package: `pip install --upgrade --no-deps --force-reinstall cmfrec`.
IMPORTANT: the setup script will try to add the compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU on which it is being installed (e.g. by using AVX instructions if available), but the result might not be usable on other computers. If building a binary wheel of this package, or putting it into a docker image which will be used on different machines, this can be overridden either by (a) defining an environment variable `DONT_SET_MARCH=1`, or (b) manually supplying compilation `CFLAGS` as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:
export DONT_SET_MARCH=1
pip install cmfrec
or, by specifying some compilation flag for architecture:
export CFLAGS="-march=x86-64"
pip install cmfrec
Note: This package greatly benefits from extra optimizations that aren't enabled by default for R packages. See this guide for instructions on how to enable them.
install.packages("cmfrec")
(See performance tips below)
See external repository.
Package can be built as a shared library (requires BLAS and LAPACK) - see the CMake build file for options:
git clone https://www.github.com/david-cortes/cmfrec.git
cd cmfrec
mkdir build
cd build
cmake -DUSE_MARCH_NATIVE=1 ..
cmake --build .
## For a system-wide install
sudo make install
sudo ldconfig
Linkage is then done with `-lcmfrec`.
By default, it compiles for types `double` and `int`, but this can be changed in the CMake script to `float` (passing option `-DUSE_FLOAT=1`) and/or `int64_t` (passing option `-DUSE_INT64=1` - but note that, if not using MKL, this will require manually setting linkage to a BLAS library with int64 support - see the CMakeLists.txt file). Additionally, if the LAPACK library used does not do fully IEEE754-compliant propagation of NAs (e.g. some very old versions of OpenBLAS), one can pass option `-DNO_NAN_PROPAGATION=1`.
Be aware that the snippet above includes option `-DUSE_MARCH_NATIVE=1`, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produce objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.
This package relies heavily on BLAS and LAPACK functions for its calculations. It's recommended to use it with MKL, OpenBLAS, or BLIS. Additionally, if using it from R, the package will benefit from enabling optimizations which are not CRAN-compliant (see this guide for detailed instructions).
Different backends for BLAS can make a large difference in speed - for example, on an AMD Ryzen 2700, MKL2021 makes models take 4x longer to fit than MKL2020, and using OpenBLAS-pthreads takes around 1.3x longer to fit models compared to OpenBLAS-openmp.
This library calls BLAS routines from parallel OpenMP blocks, which can cause issues with some BLAS backends - for example, if using MKL and compiling this package with GCC on Linux, there could be conflicts between the two OpenMP runtimes, which can be solved by setting the environment variable `MKL_THREADING_LAYER=GNU`. For the most part, the library tries to disable BLAS multi-threading inside OpenMP blocks, but the mechanism might not work with all BLAS libraries, or if swapping BLAS libraries after having compiled `cmfrec`.
In computers with a NUMA (non-uniform memory access) architecture, one might additionally need to play with OpenMP settings externally (such as process affinity and similar) for better performance, for example by setting environment variables with general settings that might be more appropriate than GNU's defaults for a given CPU.
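For example, with GNU OpenMP one might experiment with the standard OpenMP environment controls (the values below are illustrative, not recommendations for any specific machine):

```shell
# Pin threads to cores and spread them across NUMA nodes / sockets
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
# Optionally cap the thread count to the number of physical cores
export OMP_NUM_THREADS=16   # adjust to your machine
```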
Hints:

- In Anaconda Python, MKL can be swapped out by installing the `nomkl` package. If not using Anaconda, `pip` installs of NumPy + SciPy are likely to bundle OpenBLAS (pthreads version).
- On Debian-based Linux systems, an OpenMP-enabled OpenBLAS for R can be obtained by installing `libopenblas-openmp-dev` before installing R, or by using the alternatives system.
- If using MKL together with GCC-compiled OpenMP code (as this package produces), set the environment variable `MKL_THREADING_LAYER=GNU`. In Linux and macOS, this can be done by adding `export MKL_THREADING_LAYER=GNU` in `~/.bashrc` or `~/.profile`, while in Windows it can be set through the control panel.

For optimal performance in R, it's recommended to set a custom Makevars file with extra compiler optimizations, and then install the package from source. On Linux, simply create a text file `~/.R/Makevars` containing this line: `CFLAGS += -O3 -march=native -fno-math-errno -fno-trapping-math` (plus an empty line at the end). Then install `cmfrec` with `install.packages("cmfrec")`.
Alternatively, one can also install this package from source after editing the `Makevars` file under `src`, uncommenting the lines that are commented out, which triggers better compiler optimizations that are not CRAN-compliant (GCC only). For alternative ways of doing this, see the "Performance tips" section in the docs. This basically amounts to adding the compilation options `-std=c99 -O3 -march=native -fno-math-errno -fno-trapping-math`, which are typically not the defaults in R.
In modern CPUs, this can make some optimization routines in `cmfrec` roughly 25% faster.
Earlier Python versions of `cmfrec` used the package `findblas` to link to BLAS's CBLAS interface, while newer versions take the BLAS from SciPy and build a CBLAS wrapper around it, which can make it run slightly slower. To use `findblas`, define an environment variable `USE_FINDBLAS=1` before installing:
export USE_FINDBLAS=1
pip install cmfrec
(Can also define `USE_OPENBLAS=1` to forcibly use `-lopenblas`)
import numpy as np, pandas as pd
from cmfrec import CMF
### Generate random data
n_users = 4
n_items = 5
n_ratings = 10
n_user_attr = 4
n_item_attr = 5
k = 3
np.random.seed(1)
### 'X' matrix (can also pass it as SciPy COO)
ratings = pd.DataFrame({
"UserId" : np.random.randint(n_users, size=n_ratings),
"ItemId" : np.random.randint(n_items, size=n_ratings),
"Rating" : np.random.normal(loc=3., size=n_ratings)
})
### 'U' matrix (can also pass it as NumPy if X is a SciPy COO)
user_info = pd.DataFrame(np.random.normal(size = (n_users, n_user_attr)))
user_info["UserId"] = np.arange(n_users)
### 'I' matrix (can also pass it as NumPy if X is a SciPy COO)
item_info = pd.DataFrame(np.random.normal(size = (n_items, n_item_attr)))
item_info["ItemId"] = np.arange(n_items)
### Fit the model
model = CMF(method="als", k=k)
model.fit(X=ratings, U=user_info, I=item_info)
### Predict rating that user 3 would give to items 2 and 4
model.predict(user=[3, 3], item=[2, 4])
### Top-5 highest predicted for user 3
model.topN(user=3, n=5)
### Top-5 highest predicted for user 3, if it were a new user
model.topN_warm(X_col=ratings.ItemId.loc[ratings.UserId == 3],
X_val=ratings.Rating.loc[ratings.UserId == 3],
U=user_info.iloc[[3]],
n=5)
### Top-5 highest predicted for user 3, based on side information
model.topN_cold(U=user_info.iloc[[3]], n=5)
### Calculating the latent factors
model.factors_warm(X_col=ratings.ItemId.loc[ratings.UserId == 3],
X_val=ratings.Rating.loc[ratings.UserId == 3],
U=user_info.iloc[[3]])
Users and items can be reindexed internally (if passing data frames, but not when passing sparse or dense matrices), so you can use strings or non-consecutive numbers as IDs when passing data to the object's methods.
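The internal reindexing is conceptually similar to pandas' factorization - here is a sketch of the idea (not this library's internals):

```python
import pandas as pd

ratings = pd.DataFrame({
    "UserId": ["anna", "bob", "anna", "carl"],   # arbitrary string IDs
    "ItemId": [105, 9000, 9000, 105],            # non-consecutive numeric IDs
    "Rating": [4.0, 3.5, 5.0, 2.0],
})

# Map arbitrary IDs to consecutive integer positions (what reindexing does)
user_codes, user_index = pd.factorize(ratings["UserId"])
item_codes, item_index = pd.factorize(ratings["ItemId"])
print(user_codes)          # [0 1 0 2]
print(list(user_index))    # ['anna', 'bob', 'carl']

# Going back from internal positions to the original IDs
print(user_index[user_codes[2]])   # anna
```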
(See also example using it for imputing missing values)
(See `?fit_models` for a better and longer example with real data)
library(cmfrec)
n_users <- 4
n_items <- 5
n_ratings <- 10
n_user_attr <- 4
n_item_attr <- 5
k <- 3
set.seed(1)
### 'X' matrix (can also pass it as TsparseMatrix, matrix.coo, etc.)
ratings <- data.frame(
UserId = sample(n_users, n_ratings, replace=TRUE),
ItemId = sample(n_items, n_ratings, replace=TRUE),
Rating = rnorm(n_ratings, mean=3)
)
### 'U' matrix (can also pass it as TsparseMatrix, DF, etc.)
user_info <- matrix(rnorm(n_users*n_user_attr), nrow=n_users)
rownames(user_info) <- 1:n_users ## These denote which ID is each row
### 'I' matrix (can also pass it as TsparseMatrix, DF, etc.)
item_info <- matrix(rnorm(n_items*n_item_attr), nrow=n_items)
rownames(item_info) <- 1:n_items ## These denote which ID is each row
### Fit the model
model <- CMF(X=ratings, U=user_info, I=item_info,
method="als", k=k)
### Predict rating that user 3 would give to items 2 and 4
predict(model, user=c(3, 3), item=c(2, 4))
### Top-5 highest predicted for user 3
topN(model, user=3, n=5)
### Top-5 highest predicted for user 3, if it were a new user
topN_new(model, n=5,
X_col=ratings$ItemId[ratings$UserId == 3],
X_val=ratings$Rating[ratings$UserId == 3],
U=user_info[3,])
### Top-5 highest predicted for user 3, based on side information
topN_new(model, U=user_info[3,], n=5)
### Calculating the latent factors
factors_single(model,
X_col=ratings$ItemId[ratings$UserId == 3],
X_val=ratings$Rating[ratings$UserId == 3],
U=user_info[3,])
Users and items can be reindexed internally (if passing data frames, but not when passing sparse or dense matrices), so you can use strings or non-consecutive numbers as IDs when passing data to the object's methods.
See external repository.
See file example/c_example.c
For more details see the online documentation.
For a longer example with real data see the Python notebook MovieLens Recommender with Side Information and the R vignette.
Documentation is available at ReadTheDocs: http://cmfrec.readthedocs.io/en/latest/.
Documentation is available inside the package (e.g. `?CMF`) and at CRAN.
See external repository for syntax and Python docs for details about each parameter.
Documentation is available in the public header that is generated through the build script - see include/cmfrec.h.in.
Metrics for implicit-feedback model quality for this package can be calculated using the recometrics library.
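For reference, precision-at-k for a single user (the P@10 reported in the benchmarks above) can be computed as follows - a plain NumPy sketch, not recometrics' API:

```python
import numpy as np

def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k recommended items that appear in the test set."""
    topk = ranked_items[:k]
    hits = np.isin(topk, list(relevant_items)).sum()
    return hits / k

# Toy example: 10 recommendations, 4 of which are in the user's test set
ranked = np.array([7, 3, 9, 1, 5, 2, 8, 6, 4, 0])
relevant = {3, 5, 8, 0}
print(precision_at_k(ranked, relevant, k=10))   # 0.4
```

Averaging this quantity over users (and the analogous average-precision for MAP) gives the dataset-level numbers in the tables above.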
This kind of model requires a lot more hyperparameter tuning than regular low-rank matrix factorization, and fitting a model with badly-tuned hyperparameters might result in worse recommendations than discarding the side information.
If your dataset is larger than the MovieLens ratings, adding product side information is unlikely to add more predictive power, but good user side information might still be valuable.
For any installation problems or errors encountered with this software, please open an issue in this GitHub page with a reproducible example, the error message that you see, and description of your setup (e.g. Python version, NumPy version, operating system).