Monocular vehicle detection using SVM and Deep Learning classifiers
Vehicle detection from a monocular RGB video input using two different approaches - Supervised Learning (Support Vector Machine) and Deep Learning.
The Deep Learning implementation is the more successful of the two since it is considerably faster (about 8 fps versus roughly 1.5 seconds per frame for the SVM on the same modest hardware).
Images are fed into the classifier via sliding windows and the output is processed via a heat map.
The initial data exploration and extraction is carried out in the notebook
The labeled data came from a combination of the GTI vehicle image database and the KITTI vision benchmark suite. Training data for the neural network, as well as additional negative examples for the SVM classification, were extracted from the Udacity labeled dataset.
data_exploration.ipynb also contains data extraction for the Udacity set.
The final pipeline is implemented in Python. Included in the project are:
To run detection with the deep learning or support vector machine method, use one of the following:
python vehicleDetect.py cnn video_fileName
python vehicleDetect.py svm video_fileName
For training, you have to set variables at the top of
vehicleDetect_classify.py and run:
python vehicleDetect_classify.py video_fileName
The Support Vector Machine used in this project is scikit-learn's LinearSVC. It runs on modest hardware at about 1.5 seconds/frame and has lots of room for additional performance gains. It was trained on subsets of the GTI and KITTI sets, as well as manually extracted negative examples from the Udacity set. The latter helped reduce false positives in areas with a lot of information (trees and complex shadows). It is trained in lines 90-143 of
vehicleDetect_classify.py and then pickled for reuse. It was trained on 8,792 car and 10,291 non-car images.
I am using HOG features from all three channels, as well as spatial and histogram binning of the YCrCb colorspace, as features. Initial data visualization pointed to YCrCb as a color space with useful and different information in all channels. After many trials and with pointers from the Q&A for spatial and histogram binning parameters, I arrived at the settings in
vehicleDetect_svmVar.py. For the HOG features, I am using 9 orientation bins, 8 pixels per cell and 2 cells per block. I started out with these fairly standard settings and none of my tests showed noticeably improved performance with different values. Adding a small amount of spatial color (16x16) and histogram features (16 bins) helped reduce false positives in particular.
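As a rough illustration of the color part of the feature vector, the sketch below computes 16x16 spatial binning and 16-bin per-channel histograms with NumPy only. The function names, the block-averaging downsample and the random stand-in patch are assumptions made for this example, not the code in vehicleDetectUtil.py, and HOG extraction is left out.

```python
import numpy as np


def spatial_bin(img, size=16):
    """Downsample an HxWx3 image to size x size by block averaging, then flatten."""
    h, w, c = img.shape
    # Assumes h and w are multiples of `size` (true for 64x64 training patches).
    blocks = img.reshape(size, h // size, size, w // size, c)
    return blocks.mean(axis=(1, 3)).ravel()


def color_hist(img, nbins=16, value_range=(0, 256)):
    """Concatenate one histogram per channel."""
    hists = [
        np.histogram(img[..., ch], bins=nbins, range=value_range)[0]
        for ch in range(img.shape[2])
    ]
    return np.concatenate(hists).astype(np.float64)


def color_features(img):
    """Spatial binning followed by histogram binning, as one flat vector."""
    return np.concatenate([spatial_bin(img), color_hist(img)])


# Random stand-in for a 64x64 patch already converted to YCrCb.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
features = color_features(patch)
print(features.shape)  # 16*16*3 + 16*3 values
```

With these settings the color features contribute 816 values. Three channels of HOG at 9 orientations, 8 pixels per cell and 2 cells per block on a 64x64 patch would add 3 x 1,764 = 5,292 more, which is consistent with the 6,108 total stated for the unreduced feature vector.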
Colorspace exploration (vehicle / non-vehicle)
HOG features look very useful across all three channels and are extracted in lines 140-150 in
vehicleDetectUtil.py. The entire feature extraction is managed in lines 155-222.
The feature vectors are normalized via scikit-learn's StandardScaler (lines 111-113 in
The final feature vector size is 3000. Its original size of 6108 was reduced via scikit-learn's principal component analysis (117-119 in
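The normalization, dimensionality reduction and classification steps described above can be chained as in the following minimal sketch. It uses small synthetic data and 30 components purely so it runs quickly; the pipeline wrapper, the data and all sizes are assumptions for illustration and not the project's training code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 200 samples with 60 features, two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 60)),
    rng.normal(2.0, 1.0, (100, 60)),
])
y = np.array([0] * 100 + [1] * 100)

# Scale each feature, project onto the leading principal components, then fit the SVM.
clf = make_pipeline(StandardScaler(), PCA(n_components=30), LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))
```

Wrapping the three steps in one pipeline means the scaler and PCA fitted on the training data are reapplied automatically at prediction time, which is convenient when the fitted object is pickled for reuse.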
I am using a fairly slim convolutional neural network that has previously performed well on CIFAR10. It runs at 8fps on modest hardware. Since all I need here is a binary decision on small images, I expected it to perform reasonably well. It is implemented in Keras in lines 224-252 of
I extended the training data used for the SVM with bounding box data extracted from the Udacity set. Total training set size is 118,493 64x64 images, evenly split between car and non-car samples. Data augmentation (lines 34-58 in
vehicleDetect_classify.py) further strengthens the training set - it includes mirroring, random translation and random brightness. The model is trained and saved in lines 203-217 of
The final model was trained for 100 epochs. With more epochs it started to overfit, resulting in many false positives in the video.
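The three augmentations mentioned above (mirroring, random translation and random brightness) could look roughly like this in NumPy. The function name, the shift and brightness ranges, and the wrap-around translation are assumptions chosen to keep the sketch short; they are not taken from vehicleDetect_classify.py.

```python
import numpy as np


def augment(img, rng, max_shift=4, brightness=0.3):
    """Return a randomly mirrored, shifted and brightness-scaled copy of an HxWx3 uint8 image."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal mirror
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps pixels around the border, a simplification of a true translation.
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    out = out * rng.uniform(1.0 - brightness, 1.0 + brightness)
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = augment(sample, rng)
print(augmented.shape, augmented.dtype)
```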
Both SVM and CNN methods use the same approach for finding and labeling vehicles. Sliding windows are used to feed images to the classifier. A heat map is generated over several frames, then gets filtered and labelled.
To start, I am limiting my search to the right half of the frame, compensating for the lack of data about the road to focus attention properly (my lane finding code was eventually merged with this vehicle detection, but they do not communicate yet). This covers detections in my video and allowed for faster iterations.
I perform a simple sliding window search to extract cropped images for classification. For efficiency, the window locations are computed once on the first frame and then reused (lines 184-199 in
vehicleDetect.py). I use different sizes of windows at different locations in the image. Per frame, the saved windows are used to extract cropped images via
vehicleDetectUtil.get_window_imgs() (line 200 in
vehicleDetect.py) and then sent to the classifier all at once.
The SVM method uses 457 windows (overlap: 0.8), while the CNN detection only uses 76 (overlap: 0.7).
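A minimal sketch of how such a window grid might be precomputed once and reused is shown below. The function name, the 1280x720 frame size and the size bands are hypothetical, so the resulting count will not match the 457 or 76 windows quoted above.

```python
def slide_windows(x_range, y_range, size, overlap):
    """Return (x1, y1, x2, y2) square boxes of side `size` that fit inside the given ranges."""
    step = max(1, int(size * (1.0 - overlap)))
    x_start, x_stop = x_range
    y_start, y_stop = y_range
    windows = []
    for y in range(y_start, y_stop - size + 1, step):
        for x in range(x_start, x_stop - size + 1, step):
            windows.append((x, y, x + size, y + size))
    return windows


# Hypothetical layout: right half of a 1280x720 frame, larger windows reaching lower.
bands = [
    (64, (400, 500)),
    (96, (400, 560)),
    (128, (400, 656)),
]
windows = []
for size, y_range in bands:
    windows += slide_windows((640, 1280), y_range, size, overlap=0.7)
print(len(windows))
```

Because the grid depends only on the frame geometry, it can be computed on the first frame and then used to crop every later frame before sending all crops to the classifier in one batch.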
Sliding windows (CNN). Larger window sizes closer to the bottom and 0.7 overlap. 76 total.
For each frame, every region that is detected as part of a vehicle by two or more overlapping windows adds "heat" to a map. The result is smoothed over eight frames, thresholded again and then fed into scipy's label function to get a single bounding box for each distinct "island" in the heat map (lines 206-221 in
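The heat map step can be sketched as follows, using scipy.ndimage.label to find the islands. The class name, the mean-based smoothing and the threshold value are assumptions; the sketch also reads "two or more" as a per-pixel count of overlapping positive windows, which may differ from the actual implementation.

```python
from collections import deque

import numpy as np
from scipy import ndimage


class HeatMapper:
    """Accumulate per-frame detections and return one box per heat island."""

    def __init__(self, shape, n_frames=8, min_overlap=2, threshold=1.0):
        self.shape = shape
        self.history = deque(maxlen=n_frames)  # most recent per-frame heat maps
        self.min_overlap = min_overlap
        self.threshold = threshold

    def update(self, detections):
        heat = np.zeros(self.shape, dtype=np.float32)
        for x1, y1, x2, y2 in detections:
            heat[y1:y2, x1:x2] += 1.0
        # Keep only pixels covered by at least `min_overlap` positive windows.
        heat[heat < self.min_overlap] = 0.0
        self.history.append(heat)
        # Average over the frames seen so far (at most n_frames), then threshold.
        smoothed = np.mean(np.stack(self.history), axis=0)
        labels, _ = ndimage.label(smoothed >= self.threshold)
        boxes = []
        for ys, xs in ndimage.find_objects(labels):
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
        return boxes


mapper = HeatMapper((100, 200))
boxes = mapper.update([(10, 10, 50, 50), (30, 30, 70, 70), (150, 10, 180, 40)])
print(boxes)
```

In this example the isolated third window is discarded because no second window confirms it, while the overlap of the first two survives as a single island.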
I am doing two things to prevent false positives:
I implemented a Car class that keeps track of positions over time (lines 51-87 in
vehicleDetect.py). Bounding box coordinates are smoothed over six frames. Additionally, I reuse the previous location if the classifier does not detect the car for up to two frames.
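The tracking behaviour described above (smoothing over six frames and tolerating up to two missed detections) can be sketched like this. The class and method names are hypothetical and the actual Car class in vehicleDetect.py may be organised differently.

```python
from collections import deque

import numpy as np


class TrackedCar:
    """Smooth a car's bounding box over recent frames and bridge short detection gaps."""

    def __init__(self, box, n_smooth=6, max_missed=2):
        self.boxes = deque([box], maxlen=n_smooth)  # last n_smooth detected boxes
        self.max_missed = max_missed
        self.missed = 0

    def update(self, box=None):
        """Feed this frame's detection (or None); return the smoothed box, or None once the car is dropped."""
        if box is not None:
            self.boxes.append(box)
            self.missed = 0
        else:
            self.missed += 1
            if self.missed > self.max_missed:
                return None
        mean_box = np.mean(np.array(list(self.boxes)), axis=0)
        return tuple(int(v) for v in mean_box.round())


car = TrackedCar((0, 0, 10, 10))
print(car.update((2, 2, 12, 12)))  # average of the two boxes seen so far
print(car.update(None))            # missed frame: previous smoothed location is reused
```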
Example frame with labeled image (top left), heat map (top right) and unfiltered sliding-windows detection (bottom right)