Real-time estimation of gender and age
This is a small demo project for trying out the OpenCV library and implementing on-the-fly face detection and age and gender estimation with pre-trained models.
This article can also be found on Medium.
What do you do when you want to try something new in deep learning? Of course you search for articles and open-source projects first!
Disclaimer: There are many more projects that are not listed here, but I believe I have covered the most popular ones, those that appear on the first pages of search results.
I have googled for:
gender age estimation
gender age opencv
gender age keras
gender age tensorflow
gender age caffemodel
gender age pytorch
I looked only at the first one or two pages of results. Then I excluded:
After that I dug into the source code to find details of the input image format, output format, model architecture, weight size, license, pre-trained model availability, etc.
Here is what I've found for the topic:
1. Age and Gender Classification using MobileNets by Kinar Ravishankar.
   - License: MIT
   - Framework: Keras/TensorFlow
   - Input: 224x224x3
   - Architecture: MobileNet_v1_224, followed by one Dense(1024->1024) layer plus two output Dense(1024->1) layers.
   - Size: approximately (4.24 MP + 1.05 MP) = 5.29 MP (MP = million parameters), which is about 21 MB for float32.
2. How to build an age and gender multi-task predictor with deep learning in TensorFlow by Cole Murray.
   - Framework: TensorFlow
   - Input: 224x224x3
   - Architecture: Conv(5x5, 3->32) -> MaxPool(2->1) -> Conv(5x5, 32->64) -> MaxPool(2->1) -> Conv(5x5, 64->128) -> MaxPool(2->1) -> Dense(28*28*128 -> 1024) -> Dense(1024 -> 101), Dense(1024 -> 2).
   - Size: 2400 + 51200 + 204800 + 102760448 + 103424 + 2048 = 103.1 MP, which is approximately 393 MB.
3. Predicting apparent Age and Gender from face picture: Keras + Tensorflow by Youness Mansar.
   - License: MIT
   - Framework: Keras/TensorFlow
   - Input: 224x224x3
   - Architecture: ResNet50 -> Dense(100) -> Dense(1).
   - Size: approximately 100 MB.
4. SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation by Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang.
   - License: Apache License 2.0
   - Framework: Keras/TensorFlow
   - Input: 64x64x3
5. Mxnet version implementation of SSR-Net for age and gender estimation by @wayen820.
   - Framework: MXNET
   - Input: 112x112x3
6. Age and Gender Classification Using Convolutional Neural Networks by Gil Levi and Tal Hassner.
   - License: as is
   - Framework: Caffe, but the models can be loaded with OpenCV.
   - Input: 256x256x3
7. Age and Gender Deep Learning with TensorFlow by Rude Carnie (? Daniel Pressel).
   - Framework: TensorFlow
   - Input: 256x256x3
8. Easy Real time gender age prediction from webcam video with Keras by Chengwei Zhang.
   - Framework: Keras/TensorFlow
   - Input: 64x64x3. Possibly, any size can be chosen.
9. Age and Gender Estimation by Yusuke Uchida.
   - License: MIT
   - Framework: Keras/TensorFlow
   - Input: 32x32x3
10. Age and gender estimation based on Convolutional Neural Network and TensorFlow by Boyuan Jiang.
   - License: MIT
   - Framework: TensorFlow
   - Input: 160x160x3
11. Apparent Age and Gender Prediction in Keras by Sefik Ilkin Serengil.
   - Framework: Keras/TensorFlow
   - Input: 224x224x3
12. Multi output neural network in Keras (Age, gender and race classification) by Sanjaya Subedi.
   - Framework: Keras/TensorFlow
   - Input: 198x198x3
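The parameter sum given above for Cole Murray's ConvNet can be re-checked with a few lines of arithmetic. This is only a sketch: it counts weights and ignores biases (as the sum above does), assumes 4 bytes per float32 parameter, and the helper names are made up for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def dense_params(n_in, n_out):
    """Weights of a fully connected layer (biases ignored)."""
    return n_in * n_out


layers = [
    conv_params(5, 3, 32),
    conv_params(5, 32, 64),
    conv_params(5, 64, 128),
    dense_params(28 * 28 * 128, 1024),
    dense_params(1024, 101),  # output with 101 units
    dense_params(1024, 2),    # output with 2 units
]
total = sum(layers)
size_mib = total * 4 / 2**20  # float32 weights, 2**20 bytes per unit

print(layers)
print('{:.1f} MP, {:.0f} MiB'.format(total / 1e6, size_mib))  # 103.1 MP, 393 MiB
```

The 393 figure comes out when one megabyte is taken as 2^20 bytes.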
No | Name | Article | Source | License | Framework | Input | Output | Size | Pretrained |
---|---|---|---|---|---|---|---|---|---|
1 | MobileNets by Kinar Ravishankar | link | link | MIT | Keras/TensorFlow | 224x224x3 | gender: 2 classes, age: 21 classes | ~21 MB | NO |
2 | ConvNet by Cole Murray | link | link | unspecified | TensorFlow | 224x224x3 | gender: 2 classes, age: 101 classes | ~393 MB | NO |
3 | ResNet50 by Youness Mansar | link | link | MIT | Keras/TensorFlow | 224x224x3 | gender: one number, age: 8 classes | ~100 MB | NO |
4 | SSR-Net (original) | link | link | Apache License 2.0 | Keras/TensorFlow | 64x64x3 | gender: one number, age: one number | 0.32 MB | YES |
5 | SSR-Net on MXNET | None | link | unspecified | MXNET | 112x112x3 | gender: one number, age: one number | 1.95 MB, 3.94 MB | YES |
6 | ConvNet by Gil Levi and Tal Hassner | link | link | as is | Caffe | 256x256x3 | gender: 2 classes, age: 8 classes | 43.5 MB, 43.5 MB | YES |
7 | Inception_v3 by Rude Carnie | None | link | unspecified | TensorFlow | 256x256x3 | gender: 2 classes, age: 8 classes | 166 MB, 166 MB | YES |
8 | ConvNet by Chengwei Zhang | link | link | unspecified | Keras/TensorFlow | 64x64x3 | gender: one number, age: 101 classes | 186 MB | YES |
9 | ConvNet by Yusuke Uchida | None | link | MIT | Keras/TensorFlow | 32x32x3 | gender: one number, age: 101 classes | 187 MB | YES |
10 | ConvNet by Boyuan Jiang | None | link | MIT | TensorFlow | 160x160x3 | gender: one number, age: one number | 246.5 MB | YES |
11 | ConvNet by Sefik Ilkin Serengil | link | link | unspecified | Keras/TensorFlow | 224x224x3 | gender: one number, age: 101 classes | 553 MB, 514 MB | YES |
12 | ConvNet by Sanjaya Subedi | link | link | unspecified | Keras/TensorFlow | 198x198x3 | gender: one number, age: one number, race: 5 classes | unknown | NO |
Note: I did not include the accuracy figures reported by the authors, because they are not comparable when different models are tested on different test datasets.
I decided to choose the two most lightweight networks, which are able to process video on the fly using only an average CPU.
My choice is:
No 4, SSR-Net, which has separate models for gender and age, each only 0.32 MB in size. They are very fast in comparison with the other models.
No 6, the models by Gil Levi and Tal Hassner. These are also two separate models for gender and age; they are about 43 MB each and are widely used by developers.
Of course I would like to have one neural net for both gender and age estimation. Maybe I will spend some time and train a model myself. In that case I would definitely use the staged training technique proposed by the SSR-Net authors.
This simple program randomly chooses a video file from the videos directory.
Then it reads the video frame by frame in a loop until the end of the file or until the user presses the ESC key.
For each frame:
Below you may find some more details.
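The read loop described above could look roughly like the sketch below. It is not the project's exact code: `run`, `should_stop` and `process_frame` are hypothetical names, the window title is arbitrary, and OpenCV is imported inside `run` only so that the small key-handling helper can be tried on its own.

```python
ESC_KEY = 27  # key code of the ESC key


def should_stop(key_code):
    """True if the code returned by cv.waitKey() corresponds to ESC.

    Only the lowest byte is compared, because some platforms set
    extra bits above it. cv.waitKey() returns -1 when no key is pressed.
    """
    return (key_code & 0xFF) == ESC_KEY


def run(source_path, process_frame):
    """Read frames until the video ends or the user presses ESC."""
    import cv2 as cv  # imported here so the helper above has no dependencies

    cap = cv.VideoCapture(source_path)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:  # end of file or read error
                break
            cv.imshow('demo', process_frame(frame))
            if should_stop(cv.waitKey(1)):
                break
    finally:
        cap.release()
        cv.destroyAllWindows()
```

Here `process_frame` stands for the per-frame work (detect faces, estimate age and gender, draw labels) and is expected to return the image to display.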
The face detector is initialized based on the face_detector_kind argument:
    # Initialize face detector
    if (face_detector_kind == 'haar'):
        #face_cascade = cv.CascadeClassifier('face_haar/lbpcascade_frontalface_improved.xml')
        face_cascade = cv.CascadeClassifier('face_haar/haarcascade_frontalface_alt.xml')
    else:
        face_net = cv.dnn.readNetFromTensorflow('face_net/opencv_face_detector_uint8.pb', 'face_net/opencv_face_detector.pbtxt')
The model that estimates age and gender is initialized based on the age_gender_kind argument:
    # Load age and gender models
    if (age_gender_kind == 'ssrnet'):
        # Setup global parameters
        face_size = 64
        face_padding_ratio = 0.10
        # Default parameters for SSR-Net
        stage_num = [3, 3, 3]
        lambda_local = 1
        lambda_d = 1
        # SSR_net and SSR_net_general are the SSR-Net model classes (their import is not shown here)
        # Initialize gender net
        gender_net = SSR_net_general(face_size, stage_num, lambda_local, lambda_d)()
        gender_net.load_weights('age_gender_ssrnet/ssrnet_gender_3_3_3_64_1.0_1.0.h5')
        # Initialize age net
        age_net = SSR_net(face_size, stage_num, lambda_local, lambda_d)()
        age_net.load_weights('age_gender_ssrnet/ssrnet_age_3_3_3_64_1.0_1.0.h5')
    else:
        # Setup global parameters
        face_size = 227
        face_padding_ratio = 0.0
        # Initialize gender detector
        gender_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_gender.prototxt', 'age_gender_net/gender_net.caffemodel')
        # Initialize age detector
        age_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_age.prototxt', 'age_gender_net/age_net.caffemodel')
        # Class labels predicted by gender_net and age_net
        Genders = ['Male', 'Female']
        Ages = ['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']
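Both switches can be exposed on the command line. The sketch below uses argparse; the flag names and defaults are assumptions made for illustration. Only the values 'haar' and 'ssrnet' appear in the code above, while 'dnn' and 'caffe' are hypothetical names for the two else branches.

```python
import argparse


def parse_args(argv=None):
    """Parse the two model-selection switches (names are assumptions)."""
    parser = argparse.ArgumentParser(description='Real-time gender and age estimation')
    parser.add_argument('--face_detector_kind', choices=['haar', 'dnn'], default='haar',
                        help="'haar' for the cascade classifier, 'dnn' for the ConvNet detector")
    parser.add_argument('--age_gender_kind', choices=['ssrnet', 'caffe'], default='ssrnet',
                        help="'ssrnet' for SSR-Net, 'caffe' for the Levi and Hassner models")
    return parser.parse_args(argv)


args = parse_args(['--face_detector_kind', 'dnn'])
print(args.face_detector_kind, args.age_gender_kind)  # dnn ssrnet
```

Restricting the values with `choices` makes a typo in the argument fail early instead of silently falling into the else branch.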
Currently the video stream is read from a random file in the videos directory.
    import os
    import cv2 as cv
    import numpy as np
    import time

    # Initialize numpy random generator
    np.random.seed(int(time.time()))

    # Set video to load
    videos = []
    for file_name in os.listdir('videos'):
        file_name = 'videos/' + file_name
        if os.path.isfile(file_name) and file_name.endswith('.mp4'):
            videos.append(file_name)
    source_path = videos[np.random.randint(len(videos))]

    # Create a video capture object to read videos
    cap = cv.VideoCapture(source_path)
Generally, there are two common ways to detect faces: Haar cascades and convolutional neural networks (CNNs).
A CNN model is usually more accurate, but it requires more computational resources and runs slower.
In this project I decided to implement both ways and choose one via the face_detector_kind argument.
Detecting faces with either HAAR or ConvNet is very easy:
    import math

    def findFaces(img, confidence_threshold=0.7):
        # Get original width and height
        height = img.shape[0]
        width = img.shape[1]

        face_boxes = []

        if (face_detector_kind == 'haar'):
            # Get grayscale image
            gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)

            # Detect faces
            detections = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

            for (x, y, w, h) in detections:
                padding_h = int(math.floor(0.5 + h * face_padding_ratio))
                padding_w = int(math.floor(0.5 + w * face_padding_ratio))
                x1, y1 = max(0, x - padding_w), max(0, y - padding_h)
                x2, y2 = min(x + w + padding_w, width - 1), min(y + h + padding_h, height - 1)
                face_boxes.append([x1, y1, x2, y2])
        else:
            # Convert input image to 3x300x300, as NN model expects only 300x300 RGB images
            blob = cv.dnn.blobFromImage(img, 1.0, (300, 300), mean=(104, 117, 123), swapRB=True, crop=False)

            # Pass blob through model and get detected faces
            face_net.setInput(blob)
            detections = face_net.forward()

            for i in range(detections.shape[2]):
                confidence = detections[0, 0, i, 2]
                if (confidence < confidence_threshold):
                    continue
                x1 = int(detections[0, 0, i, 3] * width)
                y1 = int(detections[0, 0, i, 4] * height)
                x2 = int(detections[0, 0, i, 5] * width)
                y2 = int(detections[0, 0, i, 6] * height)
                padding_h = int(math.floor(0.5 + (y2 - y1) * face_padding_ratio))
                padding_w = int(math.floor(0.5 + (x2 - x1) * face_padding_ratio))
                x1, y1 = max(0, x1 - padding_w), max(0, y1 - padding_h)
                x2, y2 = min(x2 + padding_w, width - 1), min(y2 + padding_h, height - 1)
                face_boxes.append([x1, y1, x2, y2])

        return face_boxes
Please note the global variable face_padding_ratio, which determines how much to enlarge the face box returned by the detector.
Its value depends on both the face detection algorithm and the age/gender estimation algorithm.
Ideally, you should choose its value so that the faces you get are very similar to those the model was trained on.
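To make the effect of face_padding_ratio concrete, here is the same padding arithmetic as a standalone function with a worked example. The function name `pad_box` and the sample numbers are made up for illustration.

```python
import math


def pad_box(x1, y1, x2, y2, ratio, width, height):
    """Enlarge a box by `ratio` of its size on every side, clipped to the image."""
    padding_w = int(math.floor(0.5 + (x2 - x1) * ratio))
    padding_h = int(math.floor(0.5 + (y2 - y1) * ratio))
    return (max(0, x1 - padding_w), max(0, y1 - padding_h),
            min(x2 + padding_w, width - 1), min(y2 + padding_h, height - 1))


# A 100x120 box in a 640x360 frame, padded by 10% (the SSR-Net setting above)
print(pad_box(200, 100, 300, 220, 0.10, 640, 360))  # (190, 88, 310, 232)
# The same box with ratio 0.0 (the Levi-Hassner setting) is left unchanged
print(pad_box(200, 100, 300, 220, 0.0, 640, 360))   # (200, 100, 300, 220)
```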
Extracting faces is done in two steps:
1. Convert box coordinates from the small frame to the big original frame: box_orig.
2. Extract the face patch from the original frame: face_bgr.

We could, of course, extract faces from the small frame. The reason for extracting patches from the big frame is that we want to keep as much quality as possible. But we should keep in mind that this may also require slightly more computation than extracting from the small frame.
    def collectFaces(frame, face_boxes):
        faces = []
        # Process faces
        for i, box in enumerate(face_boxes):
            # Convert box coordinates from resized frame_bgr back to original frame
            box_orig = [
                int(round(box[0] * width_orig / width)),
                int(round(box[1] * height_orig / height)),
                int(round(box[2] * width_orig / width)),
                int(round(box[3] * height_orig / height)),
            ]
            # Extract face box from original frame w.r.t. image boundary
            face_bgr = frame[
                max(0, box_orig[1]):min(box_orig[3] + 1, height_orig - 1),
                max(0, box_orig[0]):min(box_orig[2] + 1, width_orig - 1),
                :
            ]
            faces.append(face_bgr)
        return faces
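A quick numeric check of the coordinate conversion: assuming detection ran on a 640x360 copy of a 1920x1080 frame (sizes chosen only for illustration), every coordinate is multiplied by 3. The helper name `scale_box` is hypothetical.

```python
def scale_box(box, size_small, size_orig):
    """Map [x1, y1, x2, y2] from the resized frame to the original frame."""
    width, height = size_small
    width_orig, height_orig = size_orig
    return [
        int(round(box[0] * width_orig / width)),
        int(round(box[1] * height_orig / height)),
        int(round(box[2] * width_orig / width)),
        int(round(box[3] * height_orig / height)),
    ]


print(scale_box([190, 88, 310, 232], (640, 360), (1920, 1080)))
# [570, 264, 930, 696]
```

The 120x144 box found in the small frame becomes a 360x432 box in the original, which is why cropping from the original frame keeps more detail.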
Now the faces list contains face patches, all of different sizes.
In most cases neural networks are designed to work in batch mode, i.e. they can process many input samples at once. This is especially useful at training time, as batch-mode training usually helps models converge faster than stochastic training (one sample at a time).
But before we can feed all the faces into a model, we must convert them into the format that the model expects. At the very least we should make all faces the same size and normalize their values.
SSR-Net expects the input to be a tensor of size N x 64 x 64 x 3, where N is the number of faces, 64x64 is the height and width, and 3 is the number of color channels.
Individual values in the tensor should be normalized.
Please note the function call cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX), which does the required normalization by stretching the values of each face to the range [0, 255].
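As I understand it, min-max normalization maps the smallest input value to alpha and the largest to beta, with everything in between scaled linearly. The plain-Python sketch below illustrates that arithmetic on a flat list; it is not OpenCV's implementation, and `minmax_normalize` is a made-up name.

```python
def minmax_normalize(values, alpha=0.0, beta=255.0):
    """Linearly stretch values so that min -> alpha and max -> beta."""
    lo, hi = min(values), max(values)
    if hi == lo:  # flat input: avoid division by zero
        return [alpha for _ in values]
    return [(v - lo) * (beta - alpha) / (hi - lo) + alpha for v in values]


# A low-contrast patch with intensities between 50 and 150
print(minmax_normalize([50, 100, 150]))  # [0.0, 127.5, 255.0]
```

In effect, every face is stretched to full contrast regardless of how bright or dark the original patch was.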
The ConvNet by Gil Levi and Tal Hassner expects the input to be a tensor of size N x 3 x 227 x 227, where N is the number of faces, 3 is the number of color channels, and 227x227 is the height and width.
Each channel should be shifted to zero mean but should not be scaled.
Please note the parameters scalefactor=1.0 and mean=(78.4263377603, 87.7689143744, 114.895847746) in the function call cv.dnn.blobFromImages, which do exactly this.
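The two things that matter here, per-channel mean subtraction and the change of layout from height x width x channels to channels x height x width, can be illustrated with nested lists. This is a simplified sketch under stated assumptions: it skips resizing and the batch dimension, uses a rounded mean, and `to_blob` is a hypothetical helper, not an OpenCV function.

```python
def to_blob(image_hwc, mean, scalefactor=1.0):
    """Subtract the per-channel mean, scale, and reorder H x W x C -> C x H x W."""
    channels = len(mean)
    return [
        [[(pixel[c] - mean[c]) * scalefactor for pixel in row] for row in image_hwc]
        for c in range(channels)
    ]


# A 1x2 "image" with 3 channels and a rounded version of the mean used above
image = [[(80, 90, 115), (78, 88, 117)]]
print(to_blob(image, mean=(78, 88, 115)))
# [[[2.0, 0.0]], [[2.0, 0.0]], [[0.0, 2.0]]]
```

With scalefactor=1.0 the values keep their original 0 to 255 scale and are only shifted.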
As said, different models require different image preprocessing, so it is done as follows:
    def predictAgeGender(faces):
        if (age_gender_kind == 'ssrnet'):
            # Convert faces to N,64,64,3 blob
            blob = np.empty((len(faces), face_size, face_size, 3))
            for i, face_bgr in enumerate(faces):
                blob[i, :, :, :] = cv.resize(face_bgr, (64, 64))
                blob[i, :, :, :] = cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX)
            # Predict gender and age
            genders = gender_net.predict(blob)
            ages = age_net.predict(blob)
            # Construct labels
            labels = ['{},{}'.format('Male' if (gender >= 0.5) else 'Female', int(age)) for (gender, age) in zip(genders, ages)]
        else:
            # Convert faces to N,3,227,227 blob
            blob = cv.dnn.blobFromImages(faces, scalefactor=1.0, size=(227, 227),
                                         mean=(78.4263377603, 87.7689143744, 114.895847746), swapRB=False)
            # Predict gender
            gender_net.setInput(blob)
            genders = gender_net.forward()
            # Predict age
            age_net.setInput(blob)
            ages = age_net.forward()
            # Construct labels
            labels = ['{},{}'.format(Genders[gender.argmax()], Ages[age.argmax()]) for (gender, age) in zip(genders, ages)]
        return labels
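The two branches decode their outputs differently: SSR-Net returns one number per face for gender and one for age, while the Levi-Hassner models return class scores that are decoded with argmax. A dependency-free sketch of both decodings follows; the input numbers are made up and the function names are hypothetical.

```python
GENDERS = ['Male', 'Female']
AGES = ['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def label_from_regression(gender_score, age_value):
    """SSR-Net style: one number for gender, one number for age."""
    return '{},{}'.format('Male' if gender_score >= 0.5 else 'Female', int(age_value))


def label_from_classes(gender_scores, age_scores):
    """Levi-Hassner style: one score per class for gender and for age."""
    return '{},{}'.format(GENDERS[argmax(gender_scores)], AGES[argmax(age_scores)])


print(label_from_regression(0.83, 31.7))  # Male,31
print(label_from_classes([0.2, 0.8],
                         [0.01, 0.02, 0.05, 0.1, 0.6, 0.15, 0.05, 0.02]))  # Female,(25-32)
```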
That's it.
While implementing this project I analyzed different articles and models for estimating a person's gender and age from an image.
I discovered that there are a lot of good models with high accuracy that are nevertheless too big and slow to compute.
On the other hand, there are some small models with lower accuracy that can be used for real-time video processing.
I have successfully used two such models, SSR-Net and the models by Gil Levi and Tal Hassner, for real-time estimation of age and gender using only an average CPU.
The result is great. It was fun to do!
Gender estimates are stable, while age estimates fluctuate around the true value. All of this is done in real time!
Nowadays cameras are getting cheaper and are placed literally everywhere. But we can never have enough people to watch all those cameras.
I believe there is demand for small and accurate models that can estimate and describe the content of a video stream in real time, models that could run on a Raspberry Pi or other small platforms.
But today research is mostly focused on accuracy rather than on the applicability of models. Researchers benefit more if their model takes first place for accuracy in a Kaggle competition, but get no benefit if their model is the most efficient one, i.e. achieves decent results with significantly less computation. My thoughts are the same as in this article by Michał Marcinkiewicz: The Real World is not a Kaggle Competition
Of course, one may argue that analyzing content of a video is still a complex task. And complex tasks require tons of calculations anyway.
But I see at least several ways to achieve high efficiency:
Soft stagewise regression, as proposed by the authors of SSR-Net. I encourage you to read their article. It is a novel approach to NN training. I believe that if we reformulate their basic idea, it can be extended to other areas of deep learning: not only to regression but also to classification, feature extraction, etc.
Layer reuse, as proposed by Okan Köpüklü, Maryam Babaee, Stefan Hörmann and Gerhard Rigoll in their article Convolutional Neural Networks with Layer Reuse. Why use many layers, each with its own parameters, if we can apply the same filters multiple times?
Hidden unit reuse. I did not find any article or even a mention of this simple idea. Please tell me if you know of any. The idea is described below.
A typical content analyzing pipeline consists of several modules running in sequence or in parallel.
For instance, in this simple project we have:
1. Input frame -> ConvNet to detect faces -> faces
2. faces -> ConvNet to estimate gender -> genders
3. faces -> ConvNet to estimate age -> ages

Steps 2 and 3 may run in parallel.
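Because the gender and age steps only read the same list of faces, they can be submitted to a thread pool and collected together. The sketch below uses stand-in functions instead of real networks; whether threads give an actual speed-up depends on the framework doing the inference, which is an assumption not tested here.

```python
from concurrent.futures import ThreadPoolExecutor


def estimate_gender(faces):
    """Stand-in for the gender ConvNet (hypothetical)."""
    return ['Male' for _ in faces]


def estimate_age(faces):
    """Stand-in for the age ConvNet (hypothetical)."""
    return [30 for _ in faces]


def estimate_both(faces):
    """Run the gender and age steps concurrently on the same faces."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        genders = pool.submit(estimate_gender, faces)
        ages = pool.submit(estimate_age, faces)
        return list(zip(genders.result(), ages.result()))


print(estimate_both(['face_a', 'face_b']))  # [('Male', 30), ('Male', 30)]
```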
In more sophisticated projects we could also find:
- Input frame -> ConvNet to recognize common objects -> COCO names
- Input frame -> ConvNet for semantic segmentation -> segmented image mask
Note that each ConvNet typically consists of many sequential layers.
But I guess that the first convolution layers of different networks are very similar.
I believe that if you take two different networks trained for different tasks, you will find similar filter weights in the first layers of both networks, as they act like basic filters for edge detection.
It means that in complex projects similar filters process the same image several times.
I.e. first you apply these filters when you find faces in the image. Then you apply these (or similar) filters again when you detect the gender of a person, and then once more when you estimate the person's age.
We can save processing time if we get rid of the unnecessary calculations and reuse the hidden units, i.e. the results of applying the first layer's filters to the input image.
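The idea can be sketched with toy functions: the shared first layer is evaluated once and its output is passed to several task-specific heads. Everything below is a stand-in invented for illustration (the "filters" are simple differences, the heads are arbitrary), not a working network.

```python
calls = {'first_layer': 0}


def first_layer(image):
    """Stand-in for shared low-level filters (e.g. edge detectors)."""
    calls['first_layer'] += 1
    return [abs(b - a) for a, b in zip(image, image[1:])]  # toy "edge" features


def face_head(features):
    return sum(features) > 10             # toy "face present" decision


def gender_head(features):
    return max(features)                  # toy score


def age_head(features):
    return sum(features) / len(features)  # toy score


image = [10, 12, 30, 31, 5]
features = first_layer(image)  # computed once
results = (face_head(features), gender_head(features), age_head(features))  # reused three times
print(calls['first_layer'], results)  # 1 (True, 26, 11.75)
```

Without sharing, each of the three heads would run its own copy of the first layer, so the counter would end at 3 instead of 1.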
Of course, it's a little bit challenging as it requires:
That is it. Thank you for reading!