Text classification with Convolutional Neural Networks on Yelp, IMDB & sentence polarity dataset v1.0
This project demonstrates how to classify text documents / sentences with CNNs. You can find a great introduction to a similar approach in blog entries by Denny Britz and Keras. My approach is quite similar to Denny's and to the original paper by Yoon Kim [1]. You can find Yoon Kim's implementation on GitHub as well.
In this update I fixed some typos and improved the Jupyter notebook. You can execute the notebook without any additional requirements. The required data will be downloaded automatically.
ALPHABET variable
I’ve updated the code to TensorFlow 2. Besides, I made some changes in the Jupyter notebook:
Model:
Notebook:
Using characters in addition to words yields no improvement, but it can be a good starting point for further research. I kept the model as simple as possible and reused the existing methods for character input. As described in the paper co-authored by Yann LeCun [3], stacking several convolutional layers on top of each other could improve performance.
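To make the character input more concrete, here is a minimal sketch of how a text could be turned into a fixed-length sequence of character indices. The alphabet shown here, the index scheme (0 for padding and unknown characters) and the helper name encode_chars are assumptions for illustration, not the values used in this repository.

```python
# Assumed alphabet for illustration only; the repository's ALPHABET may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"

# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len):
    """Lowercase, truncate to max_len, map characters to ids, pad with 0."""
    ids = [CHAR_INDEX.get(ch, 0) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


encoded = encode_chars("Great food!", 16)
print(encoded)
```

A sequence like this can be fed into an embedding layer in the same way as word indices, which is why the existing methods can be reused.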
Besides, I made some changes in the evaluation notebook. It seems that cleaning the text by removing stopwords, numerical values and punctuation also removes important features. Therefore I no longer use these preprocessing steps. As optimizer I switched from Adadelta to Adam because it converges to an optimum faster.
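A toy example shows how this kind of cleaning can erase the signal. The stopword list and the helper strip_noise below are hypothetical and much smaller than a real stopword list, although some common English lists (NLTK's, for example) do contain "not".

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "is", "not", "the", "this", "was"}


def strip_noise(text):
    """Lowercase, drop digits and punctuation, then drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


positive = strip_noise("This movie is good, 10/10!")
negative = strip_noise("This movie is not good, 1/10.")

# Both reviews collapse to the same string, so the sentiment is gone.
print(positive == negative)
```

After cleaning, the negation and the rating are both removed, and the two reviews can no longer be told apart.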
These are just small changes, but they yield a significant improvement, as you can see below.
For the Yelp dataset I increased the number of training samples from 200000 to 600000 and the number of test samples from 50000 to 200000.
Dataset | Old (loss / acc) | New (loss / acc) |
---|---|---|
Polarity | 0.4688 / 0.7974 | 0.4058 / 0.8135 |
IMDB | 0.2994 / 0.8896 | 0.2509 / 0.9007 |
Yelp | 0.1793 / 0.9393 | 0.0997 / 0.9631 |
Yelp - Multi | 0.9356 / 0.6051 | 0.8076 / 0.6487 |
For evaluation I used different datasets that are freely available. They differ in the number of documents and in content length. What they all have in common is that they have two classes to predict (positive / negative). I would like to show how a CNN performs on ~10000 up to ~800000 documents when only a few parameters are modified.
I used the following sets for evaluation:
The implemented model has multiple convolutional layers in parallel to obtain several features from one text. Because each convolutional layer has a different kernel size, the window size varies and the text is read with an n-gram approach. The default is 3 convolutional layers with kernel sizes of 3, 4 and 5.
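The idea of parallel convolutions can be sketched in plain Python without any deep learning framework. This is only an illustration of the mechanism, not the model code of this project: the unpadded ("valid") convolution, the ReLU, the max-over-time pooling (as in Yoon Kim's paper [1]), the toy dimensions and all function names are assumptions.

```python
import random


def conv1d_valid(sequence, kernel):
    """Slide a kernel of len(kernel) tokens over the sequence without padding."""
    size = len(kernel)
    outputs = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]  # one n-gram of token vectors
        activation = sum(
            x * w
            for token_vec, kernel_vec in zip(window, kernel)
            for x, w in zip(token_vec, kernel_vec)
        )
        outputs.append(max(0.0, activation))  # ReLU
    return outputs


def parallel_conv_features(sequence, kernels_by_size):
    """Every kernel reads the same input; keep the maximum of each feature map."""
    features = []
    for size in sorted(kernels_by_size):
        for kernel in kernels_by_size[size]:
            features.append(max(conv1d_valid(sequence, kernel)))
    return features


rng = random.Random(0)
seq_len, emb_dim, maps_per_size = 35, 8, 4  # toy sizes, not the real settings

# A random "embedded sentence": seq_len token vectors of emb_dim values each.
sequence = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(seq_len)]

# maps_per_size random kernels for each of the kernel sizes 3, 4 and 5.
kernels_by_size = {
    size: [
        [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(size)]
        for _ in range(maps_per_size)
    ]
    for size in (3, 4, 5)
}

features = parallel_conv_features(sequence, kernels_by_size)
window_counts = {
    size: len(conv1d_valid(sequence, kernels_by_size[size][0]))
    for size in (3, 4, 5)
}
print(len(features), window_counts)
```

Each kernel size looks at a different n-gram width (3, 4 or 5 tokens), and the pooled outputs of all branches are concatenated into one feature vector that a dense layer can classify.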
I also used pre-trained GloVe embeddings with 300-dimensional vectors (trained on 6B tokens) to show that unsupervised learning of word representations can have a positive effect on neural nets.
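As a rough sketch, pre-trained vectors can be turned into an embedding matrix like this. It relies on the GloVe text format (one word per line, followed by its space-separated vector values). The helper names, the toy 3-dimensional vectors and the choice to leave unknown words at zero are assumptions and not necessarily what this project does.

```python
import io


def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is not None and len(vector) == dim:
            matrix[idx] = vector  # words without a vector stay at zero (assumption)
    return matrix


# Toy stand-in for a GloVe file with 3 dimensions instead of 300.
toy_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
word_index = {"good": 1, "bad": 2, "yelp": 3}

matrix = build_embedding_matrix(word_index, load_glove(toy_glove), dim=3)
print(matrix)
```

With the real data you would pass an open file handle (for example to glove.6B.300d.txt) instead of the toy buffer and use the resulting matrix as the initial weights of the embedding layer.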
For all runs I used filter sizes of [3,4,5], Adam as optimizer, a batch size of 100 and 10 epochs. As already described, I used 5 runs, each with a random state, to get the final mean of loss / accuracy.
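Averaging over several runs can be sketched as follows. The function train_and_evaluate is a hypothetical stand-in that only returns placeholder numbers derived from the seed. In a real setup it would train the model with that random state and return the measured loss and accuracy.

```python
import random
from statistics import mean


def train_and_evaluate(seed):
    """Hypothetical stand-in for one training run; returns (loss, accuracy)."""
    rng = random.Random(seed)
    # Placeholder values only, not real results.
    return rng.uniform(0.2, 0.5), rng.uniform(0.8, 0.9)


def mean_over_runs(seeds):
    """Run once per seed and average loss and accuracy over all runs."""
    results = [train_and_evaluate(seed) for seed in seeds]
    losses, accuracies = zip(*results)
    return mean(losses), mean(accuracies)


mean_loss, mean_acc = mean_over_runs(range(5))
print(f"loss {mean_loss:.4f} / acc {mean_acc:.4f}")
```

Reporting the mean over several random states makes the numbers less dependent on a lucky or unlucky initialization or data split.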
Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) |
---|---|---|---|---|---|---|
[100,100,100] | GloVe 300 | 15000 / 35 | 64 | 0.4 | 0.3134 / 0.8642 | 0.4058 / 0.8135 |
[100,100,100] | 300 | 15000 / 35 | 64 | 0.4 | 0.4741 / 0.7753 | 0.4563 / 0.7807 |
Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
---|---|---|---|---|---|---|---|
[200,200,200] | GloVe 300 | 15000 / 500 | 200 | 0.4 | 0.1735 / 0.9332 | 0.2417 / 0.9064 | 0.2509 / 0.9007 |
[200,200,200] | 300 | 15000 / 500 | 200 | 0.4 | 0.2425 / 0.9037 | 0.2554 / 0.8964 | 0.2632 / 0.8920 |
Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
---|---|---|---|---|---|---|---|
[200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.1066 / 0.9602 | 0.1146 / 0.9567 | 0.1130 / 0.9574 |
[200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.1029 / 0.9617 | 0.1243 / 0.9533 | 0.1219 / 0.9547 |
ML-Model | - | - | - | - | - | - / 0.9398 | - / 0.9398 |
Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
---|---|---|---|---|---|---|---|
[200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.0793 / 0.9707 | 0.0958 / 0.9644 | 0.0997 / 0.9631 |
[200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.0820 / 0.9701 | 0.1012 / 0.9623 | 0.1045 / 0.9615 |
All previous evaluations are typical binary classification tasks. The Yelp dataset comes with reviews which can be classified into five classes (one to five stars). For the evaluations above I merged one- and two-star reviews into the negative class. Reviews with four and five stars are labeled as positive. Neutral reviews with three stars are not considered. In this evaluation I trained the model on all five classes. The baseline to beat is 20% accuracy (random guessing), because all classes are balanced to the same number of samples. In a first evaluation I reached 64% accuracy. This sounds a little low, but keep in mind that binary classification has a baseline of 50% accuracy, which is more than twice as high! Furthermore, there is a lot of subjectivity in the reviews. Take a look at the confusion matrix:
If you look carefully you can see that it’s hard to separate a class from the neighbouring classes on either side. If you write a negative review, when does it deserve two stars rather than one or three? Sometimes it’s clear, but sometimes it’s not!
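The two label schemes described above (merged star ratings for the binary task, all five stars for the multi-class task) can be sketched like this. The function names and the example reviews are made up for illustration.

```python
def stars_to_binary(stars):
    """Map a 1-5 star rating to 0 (negative), 1 (positive) or None (dropped)."""
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    return None  # three-star reviews are not used in the binary task


def stars_to_multiclass(stars):
    """Map a 1-5 star rating to a class index from 0 to 4."""
    return stars - 1


# Made-up example reviews as (text, stars) pairs.
reviews = [("Terrible service.", 1), ("It was okay.", 3), ("Loved it!", 5)]

binary = [
    (text, stars_to_binary(stars))
    for text, stars in reviews
    if stars_to_binary(stars) is not None
]
multi = [(text, stars_to_multiclass(stars)) for text, stars in reviews]

# With five balanced classes, random guessing is right one time in five.
chance_level = 1 / 5
print(binary, multi, chance_level)
```

In the binary scheme the ambiguous middle of the scale is simply left out, while the multi-class model has to separate neighbouring ratings, which is where most of the confusion comes from.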
Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
---|---|---|---|---|---|---|---|
[200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.7676 / 0.6658 | 0.7983 / 0.6531 | 0.8076 / 0.6487 |
[200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.7932 / 0.6556 | 0.8103 / 0.6470 | 0.8169 / 0.6443 |
In conclusion, CNNs are a great approach for text classification. However, a lot of data is needed to train a good model. It would be interesting to compare these results with a typical machine learning approach. I expect that a typical ML approach would achieve similar results on all datasets except Yelp. If you evaluate your own architecture (neural network), I recommend using IMDB or Yelp because of their amount of data.
Using pre-trained embeddings like GloVe improved accuracy by about 1-2%. In addition, pre-trained embeddings have a regularization effect on training. That makes sense because GloVe is trained on data that differs somewhat from Yelp and the other datasets. This means that the weights of the pre-trained embedding are updated during training. You can see the regularization effect in the following image:
If you are interested in CNNs and text classification, try out the Yelp dataset! Not only does it give the best accuracy, it also comes with a lot of metadata. Maybe I will use this dataset to get insights for my next trip :)
I'm sure that you can get better results by tuning some parameters:
clean_text
If you have any questions or hints for improvement, contact me through an issue. Thanks!
Feel free to use the model with your own dataset. As an example you can use this evaluation notebook.
[1] Convolutional Neural Networks for Sentence Classification
[2] Neural Document Embeddings for Intensive Care Patient Mortality Prediction
[3] Character-level Convolutional Networks for Text Classification
Christopher Masch