A CNN model in numpy for gesture recognition
About
The Project
The Process
Data Collection
Data Preprocessing
CNN Model
Implementation
Results
This project was built to understand the concept of a CNN and to learn about its various layers. It is implemented purely in numpy (no external modules, even for backpropagation).
The project recognizes one of the four gestures listed below.
OpenCV is used for image processing. Live video was captured via a webcam, and each frame was processed and saved as a 50x50 grayscale image.
The images can be found in the images_data folder
Fist
Hand
One
Peace
Here is the link for the data generation code
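As an illustration, here is a minimal sketch of such a capture loop; the ROI coordinates and file naming are hypothetical, so see the linked data generation code for the actual version.

```python
import cv2

cap = cv2.VideoCapture(0)
count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    roi = frame[100:400, 100:400]                # hypothetical region of interest
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (50, 50))           # final 50x50 grayscale image
    cv2.imwrite("images_data/frame_%d.png" % count, small)  # hypothetical naming
    count += 1
    cv2.imshow("capture", small)
    if cv2.waitKey(1) & 0xFF == ord("q"):        # press q to stop capturing
        break
cap.release()
cv2.destroyAllWindows()
```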
This step preprocesses the data, i.e., normalizes it and splits it into train, test, and validation sets in a 70%/20%/10% ratio respectively. The data is then pickled so the objects can be loaded directly (a minimal sketch of this step appears after the counts below).
The preprocessing code can be seen here
Total data points (approx)
train - 3000
validation - 500
test - 1000
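A minimal sketch of the normalize/split/pickle step, assuming the images and labels have already been loaded into NumPy arrays (the function and file names here are hypothetical; the linked preprocessing code is the actual version):

```python
import pickle
import numpy as np

def preprocess(images, labels, seed=0):
    """Normalize pixel values and split into train/test/validation (70/20/10)."""
    X = images.astype(np.float32) / 255.0        # scale pixels to [0, 1]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                # shuffle before splitting
    X, y = X[idx], labels[idx]
    n_train = int(0.7 * len(X))
    n_test = int(0.2 * len(X))
    splits = {
        "train": (X[:n_train], y[:n_train]),
        "test": (X[n_train:n_train + n_test], y[n_train:n_train + n_test]),
        "validation": (X[n_train + n_test:], y[n_train + n_test:]),
    }
    with open("data.pkl", "wb") as f:            # pickle for direct loading later
        pickle.dump(splits, f)
    return splits
```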
This section describes the architecture used; the next section explains the parameters in detail and how the model was trained on the cloud.
The model consists of two CONV BOXes followed by a fully connected network with two hidden layers.
One CONV BOX consists of three operations:
Convolution -> ReLU -> Max Pooling
This sequence was carried out twice.
The convolution operation was computed using NumPy's fast Fourier transform.
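This corresponds to the standard FFT-based convolution technique; a minimal sketch (not the repository's exact code):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """'Valid' 2-D convolution computed in the frequency domain.

    Multiplying the FFTs of the zero-padded inputs gives the full linear
    convolution, which is then cropped to the 'valid' region.
    """
    H, W = image.shape
    k, _ = kernel.shape
    fH, fW = H + k - 1, W + k - 1                # size of the full convolution
    F_img = np.fft.fft2(image, s=(fH, fW))       # zero-pads before transforming
    F_ker = np.fft.fft2(kernel, s=(fH, fW))
    full = np.real(np.fft.ifft2(F_img * F_ker))
    # Keep only positions where the kernel fully overlaps the image
    return full[k - 1:H, k - 1:W]
```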
Max pooling was implemented using loops and some fancy array slicing.
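For illustration, here is a compact reshape-based version of 2x2 max pooling (the repository's version uses loops and slicing; this sketch is just an equivalent way to express the operation):

```python
def max_pool2d(x, size=2):
    """Non-overlapping max pooling; assumes both dims are divisible by `size`."""
    H, W = x.shape
    # Group pixels into size x size blocks, then take the max of each block
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))
```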
After the CONV BOXes, a fully connected network with two hidden layers was attached.
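Putting the pieces together, a greatly simplified forward pass built from the sketches above (purely illustrative: a real second conv layer keeps a separate kernel per input channel, whereas here one kernel is shared and the responses summed; the parameter dict `p` with keys K1/K2/W1..b3 is hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def forward(image, p):
    """Simplified forward pass: two CONV BOXes, then two hidden FC layers."""
    # CONV BOX 1: one feature map per kernel
    maps = [max_pool2d(relu(fft_conv2d(image, k))) for k in p["K1"]]
    # CONV BOX 2: each kernel applied to every map, responses summed
    maps = [max_pool2d(relu(sum(fft_conv2d(m, k) for m in maps)))
            for k in p["K2"]]
    x = np.concatenate([m.ravel() for m in maps])    # flatten for the FC net
    h1 = relu(p["W1"] @ x + p["b1"])                 # hidden layer 1
    h2 = relu(p["W2"] @ h1 + p["b2"])                # hidden layer 2
    return p["W3"] @ h2 + p["b3"]                    # class scores
```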
The entry point to the class is the train function.
The CNN Model class can be viewed here
Links used to learn CNN
Backpropagation
This section provides some implementation notes.
The model was trained on the Google Cloud Platform.
Google Compute Engine Configurations
Once the model was trained, it was pickled, and this code is used to predict the gesture.
All the parameters of the trained model were written to a params file; the parameter values are read at runtime.
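A hedged sketch of what loading the params and the pickled model for prediction might look like (the file names, JSON keys, and the predict-style method are assumptions; the repo's params.json and prediction code are authoritative):

```python
import json
import pickle
import numpy as np

with open("params.json") as f:
    params = json.load(f)           # e.g. kernel counts, dimensions, hidden sizes

with open("model.pkl", "rb") as f:  # hypothetical file name for the pickled model
    model = pickle.load(f)

image = np.random.rand(50, 50)      # stand-in for a preprocessed 50x50 frame
gestures = ["Fist", "Hand", "One", "Peace"]
scores = model.predict(image)       # assumption: model exposes a predict method
print(gestures[int(np.argmax(scores))])
```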
Around 15 models were trained; the 4 models with the most significant differences are compared below.
No. | # Kernels (Conv 1) | Kernel Size (Conv 1) | # Kernels (Conv 2) | Kernel Size (Conv 2) | # Hidden Nodes (FC 1) | # Hidden Nodes (FC 2) | Optimization Method | Validation Accuracy (%)
---|---|---|---|---|---|---|---|---
1 | 3 | 2x2 | 4 | 4x4 | 300 | 150 | TNC | 16.04 |
2 | 16 | 9x9 | 32 | 7x7 | 800 | 400 | L-BFGS | 22.32 |
3 | 32 | 9x9 | 64 | 5x5 | 3000 | 1500 | BFGS | 60.67 |
4 | 32 | 3x3 | 64 | 5x5 | 1800 | 900 | TNC | 95.71 |
Note - training time for each model was 4+ hours.
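The optimization methods in the table (TNC, BFGS, L-BFGS) match the names of scipy.optimize.minimize methods; assuming that is what was used (an assumption - the numpy-only claim covers backpropagation, and an external optimizer can still consume the numpy gradients), a toy sketch of such a training call:

```python
import numpy as np
from scipy.optimize import minimize

# A toy quadratic stands in for the real CNN loss, which would return
# (loss, flattened gradient) in the same form.
def loss_and_grad(w):
    return 0.5 * np.sum(w ** 2), w

w0 = np.random.randn(10)                     # stand-in for flattened weights
result = minimize(loss_and_grad, w0, jac=True, method="TNC",
                  options={"maxiter": 100})  # best model: 100 iterations
trained_weights = result.x
```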
The 4th model in the table gave the best results; params.json currently contains the parameters of this model.
The model was trained for 100 iterations.
Below is a figure showing the loss vs. iterations curve for the best model.
Finally, the model was evaluated on the test set, achieving an accuracy of 95.06%.
The model was also tested on a live webcam feed, where it gave impressive results. One peculiar observation was that the model got confused between Hand and Fist. The model could also perform better if trained for more iterations (the current model was trained for 100).