RankFM Versions

Factorization Machines for Recommendation and Ranking Problems with Implicit Feedback Data

v0.2.5

3 years ago

Added

  • working PyPI and GitHub pip installs on both OSX and Linux
  • wrapped the external Mersenne Twister C library to generate better random numbers for BPR/WARP training
  • added a MANIFEST.in to include all C source and headers in the sdist archive (see the sketch below)
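
A MANIFEST.in along these lines would accomplish this; the exact package paths are assumptions rather than copied from the repo:

```
# hypothetical MANIFEST.in entries - actual paths may differ
recursive-include rankfm *.pyx *.c *.h
```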

Changed

  • changed the logic in setup.py to favor building extensions from the generated C source rather than re-cythonizing the .pyx files, which is best practice according to the Cython docs (see the sketch after this list)
  • removed Cython as a formal dependency, as the generated C code will be included in the package sdist from now on
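
A minimal sketch of that setup.py pattern, assuming the extension module lives at rankfm/_rankfm (the actual module name and paths may differ):

```python
import os
from setuptools import setup, Extension

# prefer the pre-generated C source shipped in the sdist; fall back to
# re-cythonizing the .pyx only when the C file is absent (e.g. a raw checkout)
if os.path.exists("rankfm/_rankfm.c"):
    extensions = [Extension("rankfm._rankfm", sources=["rankfm/_rankfm.c"])]
else:
    from Cython.Build import cythonize
    extensions = cythonize(
        [Extension("rankfm._rankfm", sources=["rankfm/_rankfm.pyx"])]
    )

setup(name="rankfm", ext_modules=extensions)
```

This keeps Cython out of the install requirements for end users while still letting developers rebuild the C source from a git checkout.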

v0.2.3

3 years ago

Changed

  • needed to instruct setuptools to compile the generated .c file instead of the .pyx file, as the latter doesn't get added to the sdist
  • build tested and working now on both OSX and Linux

v0.2.2

3 years ago

no changes, just syncing things up.

v0.2.0

3 years ago

Added

  • Cython back-end for _fit(), _predict(), _recommend() - the Cython _fit() function is 5X-10X faster than the original Numba version, while _predict()/_recommend() run at about the same speed

Changed

  • split regularization into two parameters: alpha controls the L2 regularization for the user/item indicators, and beta controls the regularization for the user-features/item-features. In testing, the user-feature/item-feature weights tended to suffer exploding gradients and overwhelm the utility scores unless regularized more strongly, especially with fairly dense side features. beta should typically be set fairly high (e.g. 0.1) to avoid numerical instability - see the example below.
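
For illustration, a minimal sketch assuming a constructor with these argument names (only alpha and beta come from the notes above; the remaining arguments and the side-feature dataframe layout are assumptions):

```python
import pandas as pd
from rankfm.rankfm import RankFM

# toy data: one row per observed (user_id, item_id) interaction
interactions = pd.DataFrame({"user_id": [1, 1, 2], "item_id": ["a", "b", "a"]})
# item side features: first column is the item_id, the rest are features
item_features = pd.DataFrame({"item_id": ["a", "b"], "genre_x": [1, 0], "genre_y": [0, 1]})

# beta penalizes the side-feature weights more strongly than alpha
# penalizes the user/item indicator weights, per the guidance above
model = RankFM(factors=10, alpha=0.01, beta=0.1)
model.fit(interactions, item_features=item_features, epochs=5)
```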

v0.1.3

3 years ago

Changed

  • pulled the string loss parameter out of the private Numba internals and into the public fit() function
  • changed _init_interactions to extend rather than replace the user_items dictionary item sets
  • added conditional logic to skip the expensive user-feature/item-feature dot products if user and/or item features were not provided in the call to fit(). This reduces training time by over 50% when using just the base interaction matrix (no additional user/item features) - see the sketch below.
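
A hypothetical sketch of that guard (illustrative only, not rankfm's actual internals): the feature dot products are computed only when the side-feature matrices are non-empty.

```python
import numpy as np

def pointwise_utility(u, i, item_bias, v_user, v_item, x_uf, x_if, v_uf, v_if):
    # base interaction term - always computed
    score = item_bias[i] + np.dot(v_user[u], v_item[i])
    # feature terms - skipped entirely when fit() received no side features,
    # saving two expensive dot products per sampled (user, item) pair
    if x_uf.shape[1] > 0:
        score += np.dot(x_uf[u], v_uf @ v_item[i])
    if x_if.shape[1] > 0:
        score += np.dot(x_if[i], v_if @ v_user[u])
    return score
```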

Fixed

  • bug where similar_users() and similar_items() were validating the zero-based internal index (wrong) instead of the original ID value (correct) - this was causing a bunch of bogus assertion errors saying the item_id wasn't in the training set (see the sketch below)
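
In other words, the corrected check validates the raw ID before mapping it to an internal index - a hypothetical sketch with illustrative names:

```python
def lookup_item(item_id, item_to_index):
    # validate the original ID value, not the zero-based internal index
    assert item_id in item_to_index, f"item_id {item_id} not in training data"
    return item_to_index[item_id]  # map to the internal index only after validating
```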

v0.1.2

3 years ago

Added

  • WARP loss - while slower to train, it yields slightly better performance than BPR on dense interaction data and much better performance on highly sparse interaction data
  • new hyperparameters: loss and max_samples (see the example after this list)
  • re-wrote the Numba _fit() function to elegantly (IMHO) handle both BPR and WARP loss
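
Illustrative usage of the two new hyperparameters (the surrounding arguments are assumptions):

```python
from rankfm.rankfm import RankFM

# WARP draws up to max_samples candidate negative items per observed pair,
# stopping early once it finds one the model currently mis-ranks
model = RankFM(factors=10, loss="warp", max_samples=20)
```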

v0.1.1

3 years ago

Added

  • added support for sample weights - you can now pass importance weights in addition to interactions (see the example below)
  • automatically determine the input data class (np.ndarray vs. pd.DataFrame/pd.Series)
  • assert that all model weights are finite after each training epoch to fail fast on exploding weights
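
For example, assuming fit() takes a sample_weight argument aligned row-for-row with the interactions (the argument name is a guess):

```python
import numpy as np
import pandas as pd
from rankfm.rankfm import RankFM

interactions = pd.DataFrame({"user_id": [1, 1, 2], "item_id": ["a", "b", "a"]})
weights = np.array([1.0, 0.5, 2.0])  # one importance weight per interaction row

model = RankFM(factors=10)
model.fit(interactions, sample_weight=weights, epochs=5)
```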

Fixed

  • bug where pd.DataFrame interactions with columns not named [user_id, item_id] were not getting loaded/indexed correctly - fixed by using the new input-class determination utility

Changed

  • more efficient loops for updating item feature and user/item feature factor weights - this cuts training time by around 30% with no auxiliary features, and by 50%+ in the presence of auxiliary features

v0.1.0

3 years ago

Added

  • core package functionality
  • example notebook: quickstart.ipynb
  • source distribution and package wheel
  • basic test suite
  • CircleCI build, lint, test CI workflows