This package implement classic machine learning algorithms. Motivations for this project includes:
battery-includedcode and data for ML prototyping.
Following algorithms and models will be implemented in this package.
For some of them,
NumPy support is necessary.
Ensure to configure
NumPy with advanced linear algebra implementations(
The default implementation is extremely slow.
The following models and algorithms will be added into this package soon:
Naive Bayes(NB) classifier for both
Multi-variate Bernoulli and
Multinomial event model.
python nb.py to train and test NB on
20 Newsgroups dataset.
By default, train and test data are generated by randomly split the benchmark dataset with an approximate ratio of
Multi-variate Bernoulli event model:
Multinomial event model:
NB is a generative model, different event model leads to different assumption on the distribution of p(x|y).
For example, in spam classification, after we decided to generate a spam or non-spam text according to the class prior p(y), we need to further decide which words to be included in the text, we can:
In the first case, each word is drawn from a Bernoulli distribution,
and the text is generated by the
Multi-variate Bernoulli event model.
In the second case, all words are drawn from a
TODO: Parallel NB classifier.
ID3 learning algorithm for training a decision tree(DT).
python dt.py to train and test ID3 on a subset of the
20 Newsgroups dataset,
which contains only 2 news groups:
ID3 on this small dataset is
infomation gain as the split criteria, and
chi-square test for early stoping.
It is easy to convert a DT to a rule-set. Check
data/dt.rule_set for the rule-set learned by DT, like:
IF (mac) THEN IF (ide) THEN PREDICT LABEL IS comp.sys.ibm.pc.hardware.d [branch-size : 8] ELSE IF (controller) THEN PREDICT LABEL IS comp.sys.mac.hardware.e [branch-size : 5] ELSE IF (difference) THEN IF (people) THEN PREDICT LABEL IS comp.sys.ibm.pc.hardware.d [branch-size : 2] ELSE PREDICT LABEL IS comp.sys.mac.hardware.e [branch-size : 13] ELSE PREDICT LABEL IS comp.sys.mac.hardware.e [branch-size : 156]
Perceptron Learning Algorithm(PLA) and its variant
python perceptron.py to train and test PLA and
Accuracy for Perceptron:
PLA is one of the simplest machine learning algorithm: whenever a mistake happens, if the model misclassify a positive sample as negative, PLA adds the x to the w; otherwise, substract it from w.
Linear regression model, in particular,
Ridge regression for regression problem.
Ridge regression is one variant of the traditional regression model.
By adding a regularization term to the loss function, the model increase its generalization ability, thus more robust to over-fitting.
From the statistics point of view, this transforms the original Maximum Likelihood Estimation(MLE) problem to the Maximum A Posterior(MAP) problem, and the L2 regularization term is equivalent to the
Gaussian prior distribution.
Another important variant of linear regression model, which is not implemented in this package yet, is
Lasso differs from ridge regression in the regularization term, by using L1 norm instead of L2, the model being learned is
This property is very important in large-scale regression problem, and can be used as a way of doing feature selection.
But L1 norm is hard to optimized, so general optimization tools can't be used directly.
The sub-gradient methods can be helpful.
L1 norm is equivalent to the
Laplacian prior distribution,
and there are models that consider L1 and L2 norm simultaneously: the Elastic net model.
For more details, please refer to Sam Roweis's lecture notes: http://www.cs.nyu.edu/~roweis/csc2515/
For optimization, simple gradient descent algorithm works fairly well. But for large-scale machine learning problems, the bottleneck lays in the computation of the loss function and gradient. In this situation, efficient optimization algorithm is necessary for reducing the total amount of iterations.
Here, the Conjugate Gradient methods is used in learning the regression model. All the advanced optimization technics will be mentioned in Optimization section.
python regression.py to train and test Ridge regression model on the
housing prediction problems.
RMSE on the test set :
TODO: Lasso/LARS, Elastic net
##Gaussian Discriminant Analysis
Gaussian Discriminant Analysis(GDA) model for binary classification.
GDA is a generative model, it makes the following assumptions:
Training a GDA model is simple, just do the Maximum Likelihood Estimation(MLE).
To make a prediction in the binary classification, compute s=p(x|y=1)p(y=1)-p(x|y=0)p(y=0), when s>=0, predict y to be 1, otherwise, y is 0.
python gda.py to train and test GDA model on
Training accuracy :
Test accuracy :
Logistic Regression(LR) model for binary classification.
Unlike NB and GDA, LR belongs to another family of statistical model: the discriminant model, which tries to model p(y|x) directly.
LR is a very widely-used classification model. It is simple, efficient, easy to train and interpretaion.
Here the CG routine is used again to train the LR model.
python lr.py to train and test LR on
Training accuracy : 79.723502% (173/217)
Test accuracy : 90.566038% (48/53)
When dealing with a multi-classification problem, we can either adopt the
by transforming it into a set of binary classification problem,
or just training the
Softmax Regression model.
Softmax Regression model is naturely born for the multi-classification problem, and can be viewed as an extension of the LR model. Training the Softmax model is much the same as LR.
python softmax.py to train and test LR on the
mnist hand-written digit recognization dataset.
##Support Vector Machines
Sequential Minimization Optimization(SMO) and
Primal estimated sub-gradient solver for SVMs(pegasos) algorithms
Support Vector Machines(SVMs).
SMO is an instance of
Coordinate Descent algorithm. It works in the dual space, so kernel trick is availabel.
The SMO algorithm reduces the computational complexity by dividing the overall problem into small sub-problems,
and this pair-wise subproblems can by solved analyticaly.
The fully implemented algorithm is repositoried in another GitHub project https://github.com/mazefeng/svm, which is written in C++, and the input data format is the same as LibSVM/LibLinear.
This SMO implementation simplify the original one in the following 3 ways:
While pegasos is an gradient-based algorithm, it runs very fast, and can be easily parallelized. Pegasos works in primal space, so only linear kernel is available.
python svm.py to train and test SVMs on
Training accuracy :
Test accuracy :
Training accuracy : 84.792627%
Test accuracy : 86.792453%
rbf kernel SMO,
Training accuracy : 87.557604%
Test accuracy : 90.566038%`
Adaptive Boosting(AdaBoost) algorithm for classification.
The most famous application of AdaBoost is face recognition: by combining AdaBoost with
Haar-like features, the face recognition algorithm achieves state-of-the-art accuracy and speed.
AdaBoost boost the classifier accuracy by maintaining a weight distribution on the samples. On each round, a weak classifier is trained on the weighted samples. And the weights of samples which are misclassified by the previous weak classifier are increased, the others are decreased. It can be proofed that, after enough round, the loss of AdaBoost go exactly to 0 (without outliar). Of course, this often leads to overfitting.
When making a prediction, all the weak classifier make a weighted majority vote.
In face recognition problem, a
decision stump is adopted as the weak classifier, which is a decision tree with only 1 level. But AdaBoost itself make no restriction on the weak classifier being used. It only require the weak classifier can be trained on the weighted samples, and many classifier can be modified to handle this situation.
Here we use a weighted version of the
python adaboost.py to train and test AdaBoost on
Accuracy for AdaBoost :
cf.py implements the following three
Collaborative Filtering(CF) algorithms for recommendation:
ItemCF assumes that people will be interested in items that are similar to those they were interested in before.
UserCF assumes that people who were interested in same items will still have same similar taste.
SlopeOneCF is an upgrade version of ItemCF. It fits a simple linear model between each pair of items, trying to capture the rating between item A and item B by
rating(A) = rating(B) + b, and that is how the name Slope-One comes from.
To speed up when making a recommendation, it is necessary to store the pair-wise similarity between either users or items,
and this requires a memory cost of
O(n^2) and will become unacceptable when n is large.
python cf.py to evaluate CF algorithms on the
100K version of
RMSE for ItemCF on the test data:
RMSE for UserCF on the test data:
RMSE for SlopeOneCF on the test data:
matrix factorization(MF) algorithm for recommendation.
python mf.py to train and test MF on the
1 million version of
RMSE on the test data is
The latent factor model is learned by
Stochastic Gradient Descent(SGD) algorithm.
There was a mistakte in the previous version in mf.py. SGD can not be implemented in mini-batch since in each iteration the corresponding user and item latent factor need to updated, and if you do mini-batch, the model get updated very rarely. So I fix the problem and use multi-thread to accelerate the process.
##Hidden Markov Model
Hidden Markov Models(HMMs) for solving
sequence labeling problem.
python hmm.py to train and test POS-tagger using trigram HMMs on a small corpus.
HMMs is trained by
Maximum Likelihood Estimated(MLE), and smooth technics is necessary for good performance.
Using HMMs for labeling sequence is a little difficult. The brute force solution is not sufficient enough,
Dynamic Programming(DP) algorithm is used here: the famous
For comparasion, a baseline tagger is also implemented by ignore the state transition probability.
Accuracy for baseline tagger:
P--V---D------J----------N-------W---V-----V-----I----D---N--------N------,--R---D---J----N---. | | | | | | | | | | | | | | | | | | It is the right-wing guerrillas who are aligned with the drug traffickers , not the left wing .
TODO: Parallel HMMs training,
Max Entropy Markov Models(MEMM),
Conditional Random Field(CRF)
nn.py implements a three layer
Neural Network model(NN) for classification.
NN is a powerful model motivated by how brain works. Training a NN is a little complicated. In the feed-forward phrase, training data go through the input layer to the hidden layer, and finally the output layer. The error is computed between the output of the model and the ground-truth label. In the back-propogation phrase, the error is back-propogate reversely, from the output layer to the hidden layer, then to the input layer. When this is done, the gradient with respect to each individual weight is obtained and feed to the optimization routine.
NN show its great power when we carefully tune its parameters. It is powerful, but not "off-the-shelf".For example, due the gradient vanish and explose phenomenon, we have to carefully initialize the weight and normalize the input, just like what we do in the experiment, normalize each dimension to follow a normal distribution, and randomly initialize each weight to a small random number to do a symmetry-breaking.
Most of the numerical optimization routines can be summarize as following steps:
Algorithms including step-2 are categorized as second-order methods (such as Newton's method, Quasi-Newton's method), the others are first-order methods.
The second-order methods converge faster, but they have to store and maintance a Hessian matrix, which is O(n^2) complexity.
And there are methods trying to approximate the Hessian matrix using recent gradients, they belong to the Quasi-Newton's methods, such as Limited-memroy BFGS.