Original article was published by Josip Matak on Deep Learning on Medium
Enhancing transfer learning for pose estimation using evolutionary computing
Face modeling is one of today’s most active areas of technology and software development; facial recognition, detection, verification, and alignment are among its most popular tasks. Demand for them has grown with the availability of devices for digital image analysis. Software that handles these tasks is now designed with two goals in mind: higher precision and lower computational complexity.
Such software relies heavily on deep learning research, which has skyrocketed over the last few years. While most effort goes into increasing the precision of new models, memory footprint has often been pushed into the background.
Many of these techniques come from closely related domains and share the same underlying task of face modeling. Rather than developing a model from scratch, one can use a current state-of-the-art model as a feature extractor and starting point for a different task. One example of such a task is estimating head pose.
Using a facial recognition model as a feature extractor for the head pose task, a result close to the state of the art can be reached with minimal additional computational and memory cost. Which features to extract and how to design the deep model architecture are discussed below.
1. Head pose estimation problem
Estimating head pose means regressing a 3D vector of yaw, pitch, and roll: the angles of rotation around the Z, Y, and X axes of a common coordinate system, respectively. These angles are commonly called Euler angles. They describe the orientation of the head’s local coordinate system relative to that common coordinate system. In other words, it is a problem of mapping 2D data (an image) to a 3D space (angles).
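To make the three angles concrete, here is a minimal sketch that turns them into a rotation matrix. It follows the axis assignment from the text (yaw around Z, pitch around Y, roll around X); the composition order `Rz @ Ry @ Rx` and the function names are assumptions for illustration, since datasets and libraries differ in their conventions.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    """Build R = Rz(yaw) @ Ry(pitch) @ Rx(roll) from angles in degrees.

    The multiplication order is an assumed convention, not taken from the article.
    """
    y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    cy, sy = math.cos(y), math.sin(y)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    rz = [[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]]  # yaw around Z
    ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]  # pitch around Y
    rx = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]  # roll around X
    return matmul(matmul(rz, ry), rx)
```

With all three angles at zero the result is the identity matrix; a yaw of 90 degrees alone sends the X axis onto the Y axis.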
Why should we bother with head pose estimation? It is a legitimate question. The first thing that comes to mind is the automotive industry, where interest in driver-assist systems has brought head pose tracking into focus. Head pose information can also be used in popular gesture-driven applications that bring virtual reality into everyday life.
2. Transfer learning
Let’s utilize sequential transfer learning, a subcategory of inductive transfer learning. This means that the target task domain is similar or closely related to the source. The assumption is that features from face recognition models, combined with some additional trainable parameters, should be sufficient to maintain precision.
One approach is to fine-tune the whole network for the new task. Another is to train only the newly added parameters while keeping the extracted features frozen. In this example, we opted for the second approach.
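The second approach can be illustrated with a toy sketch in plain Python. Everything here is a hypothetical stand-in: `FROZEN_W` plays the role of a pretrained face recognition backbone, the targets are synthetic, and the trainable part is a single linear head trained with gradient descent. The point is only that the backbone weights are never updated, so its features can be computed once.

```python
import copy
import math
import random

random.seed(0)

# Hypothetical frozen backbone: a fixed projection from 4 inputs to 3 features.
FROZEN_W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]


def frozen_features(x):
    """Feature extractor whose weights are never touched during training."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def predict(head, feats):
    """Linear head: three weights followed by one bias."""
    return sum(w * f for w, f in zip(head, feats)) + head[-1]


def mse(head, feats_list, targets):
    return sum((predict(head, f) - t) ** 2
               for f, t in zip(feats_list, targets)) / len(targets)


def train_head(feats_list, targets, lr=0.1, steps=200):
    """Full-batch gradient descent on the head parameters only."""
    head = [0.0, 0.0, 0.0, 0.0]
    n = len(targets)
    for _ in range(steps):
        grad = [0.0] * 4
        for f, t in zip(feats_list, targets):
            err = predict(head, f) - t
            for i, fi in enumerate(f + [1.0]):  # the trailing 1.0 feeds the bias
                grad[i] += 2.0 * err * fi / n
        head = [h - lr * g for h, g in zip(head, grad)]
    return head


inputs = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
feats_list = [frozen_features(x) for x in inputs]  # computed once, backbone is frozen
true_head = [10.0, -5.0, 3.0, 1.0]                 # synthetic "angle" generator
targets = [predict(true_head, f) for f in feats_list]

backbone_before = copy.deepcopy(FROZEN_W)
initial_loss = mse([0.0] * 4, feats_list, targets)
head = train_head(feats_list, targets)
final_loss = mse(head, feats_list, targets)
```

Because only four head parameters are trained, the additional memory and compute cost stays small, which is the motivation given above.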
3. Deep model architecture
Tackling a deep learning problem usually starts with choosing the right model architecture for the task. The obvious choice for a computer vision problem is a CNN (convolutional neural network). Within the CNN family, however, an enormous number of architectures have been developed over the years, so automating the search brings a large benefit to the engineer.
Let’s split this search into two categories: macro-search and micro-search. The first determines the width (number of channels) and depth (number of layers) hyper-parameters. Following the research of [Tan and Le (2019)], under a limited FLOPs (floating-point operations) budget the network can be scaled along either dimension, keeping in mind that FLOPs grow linearly with depth and quadratically with width, using the formula below. Factors 𝛼, 𝛽, and 𝛾 are used for compound dimension scaling with respect to N, a factor that scales the number of operations by 2ᴺ.
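As a reference point, the compound scaling rule from Tan and Le (2019) can be sketched in a few lines. This is the rule as published for EfficientNet, not necessarily the exact formula used in this work: there 𝛼 scales depth, 𝛽 scales width, 𝛾 scales input resolution, and the constants are chosen so that 𝛼·𝛽²·𝛾² ≈ 2. The constant values below are the ones reported in that paper; the values used for the head pose model are not given in the text.

```python
# Constants reported by Tan and Le (2019); assumed here for illustration only.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(n):
    """Return (depth, width, resolution, FLOPs) multipliers for scaling factor n."""
    depth_mult = ALPHA ** n
    width_mult = BETA ** n
    resolution_mult = GAMMA ** n
    # FLOPs grow linearly with depth, quadratically with width and resolution.
    flops_mult = depth_mult * width_mult ** 2 * resolution_mult ** 2
    return depth_mult, width_mult, resolution_mult, flops_mult
```

With these constants each unit increase of `n` multiplies the FLOPs by roughly 1.92, close to the factor of two implied by 2ᴺ.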
With the right scaling factors in hand, micro-search looks for the operations that lead the model to better precision. We apply tactics already used for this problem. First, let’s define a “master module”: a backbone into which different operations are fused. It consists of a sequence of different convolutional blocks. Upscale blocks double the number of channels and reduce the resolution.
All of the blocks used can be seen below. How to work with them is explained in the next section.
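The effect of stacking such blocks can be followed with a small shape-tracking sketch. It does not build a real network; it only records how the channel count and resolution change. The stage layout (a few shape-preserving blocks followed by one upscale block) and the halving of the resolution are assumptions, since the text only says that resolution is reduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    channels: int
    resolution: int  # height == width, in pixels


def conv_block(shape):
    """Stand-in for any searched operation; assumed to preserve the shape."""
    return shape


def upscale_block(shape):
    """Doubles the channels; halving the resolution is an assumption."""
    return Shape(shape.channels * 2, shape.resolution // 2)


def master_module(shape, stages, blocks_per_stage):
    """Apply `stages` groups of conv blocks, each followed by one upscale block."""
    for _ in range(stages):
        for _ in range(blocks_per_stage):
            shape = conv_block(shape)
        shape = upscale_block(shape)
    return shape
```

For example, `master_module(Shape(16, 64), stages=3, blocks_per_stage=2)` ends with 128 channels at a resolution of 8.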
4. Evolutionary enhanced neural architecture search
Now let’s encode the information above in a representation that maps unambiguously to a deep model architecture. Range mapping is one solution: a vector of floating-point numbers in the range 0–1, where each element is mapped to the index of the actual value it represents.
An example of how to pick a certain architecture parameter from a floating-point number can be seen above.
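In code, range mapping might look like the sketch below. The search space and its option names are hypothetical; the only idea taken from the text is that a number in the 0–1 range selects an index into a list of choices.

```python
def decode_gene(value, choices):
    """Map a float in [0, 1] to one element of `choices` via equal-width bins."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("gene value must lie in [0, 1]")
    # value == 1.0 would index one past the end, so clamp to the last bin.
    index = min(int(value * len(choices)), len(choices) - 1)
    return choices[index]


# Hypothetical search space; the real options are not listed in the text.
SEARCH_SPACE = {
    "width": [16, 32, 64, 128],
    "depth": [2, 3, 4],
    "block": ["conv3x3", "conv5x5", "depthwise", "residual"],
}


def decode_genome(genome):
    """Decode one float per architecture parameter, in dictionary order."""
    return {name: decode_gene(value, choices)
            for value, (name, choices) in zip(genome, SEARCH_SPACE.items())}
```

Every point of the continuous search space decodes to a valid architecture, which is what lets general-purpose continuous optimizers be applied to a discrete design problem.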
Once the representation is determined, meta-heuristic algorithms are employed to sample the search space. The goal is to minimize a multi-criterion fitness function designed to maximize precision and minimize latency. The first criterion is determined by training and testing the architecture to minimize the error between the real and predicted angle labels. The second criterion is measured by the number of floating-point operations used in the model’s forward pass. In the following function, m defines the model, T defines the maximum non-penalized number of FLOPs, and w is a factor that determines its importance.
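One plausible shape for such a function is sketched below. The exact formula is not reproduced in the text, so the multiplicative penalty is an assumption: the error is left untouched while the model stays within T FLOPs and is scaled up by (FLOPs / T)ʷ beyond that. The parameter names are illustrative.

```python
def fitness(angle_error, flops, max_free_flops, weight):
    """Lower is better: prediction error times an assumed soft FLOPs penalty.

    angle_error    -- error of model m on the angle labels (e.g. in degrees)
    flops          -- floating-point operations of m's forward pass
    max_free_flops -- T, the largest FLOPs count that is not penalized
    weight         -- w, how strongly exceeding T is punished
    """
    penalty = max(1.0, flops / max_free_flops) ** weight
    return angle_error * penalty
```

Under this assumed form, a model with an error of 5.0 that uses four times the allowed FLOPs with `weight=0.5` scores 10.0, while the same error within budget scores 5.0.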
To be more precise, the following algorithms were used in this work; how they work is not explained further here:
- CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
- Genetic algorithm
- Particle swarm algorithm
- BOBYQA (Powell’s derivative-free algorithm)
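For readers who want a feel for how such a search runs, here is a minimal genetic algorithm over 0–1 vectors. It is a generic textbook sketch, not the configuration used in this work: the population size, mutation rate, one-point crossover, and the toy objective are all assumptions, and in the real setting each fitness call would train and evaluate a model.

```python
import random


def toy_fitness(genome):
    """Placeholder objective (lower is better)."""
    return sum((g - 0.3) ** 2 for g in genome)


def evolve(fitness_fn, genome_len=6, pop_size=20, generations=30, seed=0):
    """Return the best genome and the best fitness seen at each generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness_fn)
        history.append(fitness_fn(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [population[0]]             # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Gaussian mutation on ~20% of genes, clipped back into [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.05)))
                     if rng.random() < 0.2 else g
                     for g in child]
            children.append(child)
        population = children
    best = min(population, key=fitness_fn)
    return best, history + [fitness_fn(best)]


best_genome, best_history = evolve(toy_fitness)
```

Because the best individual is always carried over unchanged, the recorded best fitness never gets worse from one generation to the next.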
Most importantly, each of the algorithms eventually converged, each yielding a different individual. Finally, those models were trained for a larger number of epochs with a decaying learning rate. The standard procedure for training head pose models was followed:
- Training: 300W_LP (70%) dataset
- Validation: 300W_LP (30%) dataset
- Test: AFLW2000/BIWI datasets
Finally, after testing on the standard datasets, the results were compared with current state-of-the-art methods. Mean absolute errors and memory footprint in megabytes can be seen below.
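For reference, the error metric itself is simple to compute. The sketch below assumes predictions and labels are given as (yaw, pitch, roll) tuples in degrees and reports the error per angle as well as their average; the function name and data layout are illustrative, and angle wrap-around is ignored.

```python
def mean_absolute_error(predicted, actual):
    """Return ([yaw_mae, pitch_mae, roll_mae], overall_mae) in degrees."""
    n = len(actual)
    per_angle = [sum(abs(p[i] - a[i]) for p, a in zip(predicted, actual)) / n
                 for i in range(3)]
    return per_angle, sum(per_angle) / 3


# Made-up numbers, purely to show the call.
example_pred = [(10.0, 0.0, 5.0), (20.0, -4.0, 1.0)]
example_true = [(12.0, 1.0, 5.0), (17.0, -2.0, 0.0)]
per_angle_mae, overall_mae = mean_absolute_error(example_pred, example_true)
```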
To break the monotony of purely numerical results, one visually pleasing result can be seen below.