Building a Gaze Estimator For Full Face Images Using ResNet-50
1. Introduction
Gaze estimation is the task of predicting where a person is looking from an image of their face or eyes. In this article we use full-face images. Both the head pose angles and the eye rotation angles determine where a person is looking; here we focus on the eye angles and predict their pitch and yaw (please see the animation in the link).
2. Datasets Used
Cross-dataset evaluation is used to check how well the model generalizes: it is trained on one dataset and tested on a different one.
Training: GazeCapture dataset: a large-scale eye tracking dataset from over 1450 people, consisting of almost 2.5 million images.
Testing: MPIIGaze dataset: contains 213,659 images collected from 15 participants during natural everyday laptop use over 3 months.
3. Dataset Preprocessing
For each image, face detection and facial landmark localization are performed, and the data is then normalized to yield the processed image. The following pipeline is used:
4. Model
A residual neural network (ResNet-50) is used to perform the prediction. The architecture of the model is shown below.
5. Training/Evaluation Methodology
- Batch Processing: Each image is 128x128 pixels, so the full training set cannot be processed at once and training is done in batches. A fixed number of images (say 10) is chosen at random from each person until the total number of training images is reached. Training then runs through data loaders with a batch size of 128.
- Losses: The predicted and ground-truth pitch and yaw angles are each converted to vectors, and the cosine similarity between the two vectors is calculated to give the angular loss. The L1 loss is used to compute the gradients for backpropagation and to update the weights.
- Optimizers, Schedulers: The Adam optimizer is used to update the weights from the backpropagated gradients. The learning rate is set to 0.0003 and the beta values to 0.9 and 0.95.
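The three points above can be sketched together as follows, assuming PyTorch. The helper names (`sample_per_person`, `angles_to_vector`, `angular_error_deg`), the single pass over people when sampling, the pitch/yaw-to-vector sign convention, and the tiny linear stand-in model are all assumptions made for illustration. Only the image size, batch size, L1 loss, learning rate, and beta values come from the text.

```python
import random

import torch
import torch.nn as nn


def sample_per_person(images_by_person, per_person, total, seed=0):
    """Pick up to `per_person` random images from each person until `total` is reached."""
    rng = random.Random(seed)
    picked = []
    for person, images in images_by_person.items():
        k = min(per_person, len(images), total - len(picked))
        picked.extend((person, img) for img in rng.sample(images, k))
        if len(picked) >= total:
            break
    return picked


def angles_to_vector(angles):
    """Map (N, 2) [pitch, yaw] in radians to (N, 3) unit vectors (assumed convention)."""
    pitch, yaw = angles[:, 0], angles[:, 1]
    x = -torch.cos(pitch) * torch.sin(yaw)
    y = -torch.sin(pitch)
    z = -torch.cos(pitch) * torch.cos(yaw)
    return torch.stack([x, y, z], dim=1)


def angular_error_deg(pred, target):
    """Mean angle in degrees between predicted and ground-truth gaze vectors."""
    cos_sim = nn.functional.cosine_similarity(
        angles_to_vector(pred), angles_to_vector(target), dim=1
    )
    # Clamp so rounding error cannot push acos outside its domain.
    return torch.rad2deg(torch.acos(cos_sim.clamp(-1.0, 1.0))).mean()


# Hypothetical file lists: 3 people with 50 images each.
data = {f"p{i}": [f"p{i}_{j}.jpg" for j in range(50)] for i in range(3)}
subset = sample_per_person(data, per_person=10, total=25)

# Tiny linear model standing in for the ResNet-50, to keep the sketch light.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 2))
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

faces = torch.randn(128, 3, 128, 128)  # one batch of 128 dummy face crops
target = torch.zeros(128, 2)           # dummy [pitch, yaw] labels

optimizer.zero_grad()
pred = model(faces)
loss = criterion(pred, target)  # L1 loss supplies the gradients
loss.backward()
optimizer.step()

# The angular error is computed without gradients, as a reported metric only.
error = angular_error_deg(pred.detach(), target)
print(f"L1 loss {loss.item():.4f}, angular error {error.item():.2f} deg")
```

In this sketch the angular error is only monitored, while the L1 loss is what gets backpropagated, which matches the split described in the list above.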
6. Results
After training on a total of 20,000 images, the model reaches an angular error of 6 degrees. Here is a comparison of the estimator's performance when trained on different numbers of images and when trained with data augmentation.
All the required code and a pretrained model can be found on the GitHub project page!