Original article was published by Josip Matak on Deep Learning on Medium

# Enhancing transfer learning for pose estimation using evolutionary computing

Face modeling is a hot topic in today’s technology and software development. **Facial recognition**, detection, verification, and alignment are among the most popular tasks. Demand for them has grown with the availability of devices for digital image analysis. Nowadays, software designed for these tasks emphasizes both increasing precision and reducing computational complexity.

Designing such software today relies on **deep learning** research, an area whose output has skyrocketed over the last few years. While the emphasis is placed on increasing the precision of newly made models, memory footprint has often been pushed into the background.

Many of these techniques originate from similar domains: the baseline task of modeling the face is common to all of them. Rather than developing a model from scratch, one can use a current state-of-the-art model as a feature-extracting starting point for a different task. An example of such a task is **estimating head pose**.

By using a facial recognition model as a feature extractor for the head pose task, a result close to the state of the art can be achieved with minimal additional computational and memory cost. Which features should be extracted, and how the deep model architecture should be designed, is answered below.

# 1. Head pose estimation problem

Estimating head pose means regressing the **3D vector** that describes **yaw, pitch, and roll**: the angles defining rotation around the Z, Y, and X-axis of a common coordinate system, respectively. These angles are commonly called **Euler angles**. Together they give the relative orientation of the head with respect to the head’s local coordinate system. In other words, it is a problem of mapping 2D data (an image) to a 3D space (angles).
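
The relation between a head rotation and its yaw/pitch/roll angles can be sketched in a few lines of numpy. Note that the intrinsic Z-Y-X convention below is an assumption for illustration; the article does not state which convention its datasets use.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    # Compose a rotation matrix from Euler angles (radians),
    # rotating about Z (yaw), then Y (pitch), then X (roll).
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def rotation_to_euler(R):
    # Inverse mapping, valid away from the gimbal-lock case |pitch| = 90°.
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(-R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll
```

A pose-estimation model regresses exactly the three numbers that `rotation_to_euler` recovers from the ground-truth rotation.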

Why should we bother with head pose estimation? A legitimate question. The first thing that comes to mind might be the **automotive industry**: interest in developing driver-assist systems has brought head pose tracking into the main focus. Additionally, head pose information can be used in popular gesture-driven applications that bring **virtual reality** into your everyday life.

# 2. Transfer learning

Let’s utilize **sequential transfer learning**, a subcategory of the **inductive transfer learning** technique. This means that the target task domain is similar, or closely related, to the source. The assumption is that applying features from a face recognition model, with a small number of additional trainable parameters, should be sufficient to maintain precision.

One approach to this problem is fine-tuning the whole network to cope with the new task. Another is training only the additionally added parameters while keeping the pretrained features frozen. In this example, we opted for the second case.
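
The frozen-backbone idea can be sketched with plain numpy: a fixed random projection stands in for the pretrained face-recognition extractor, and only a small linear head is trained on top of it. The shapes and learning rate here are illustrative, not the article’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a stand-in for the pretrained feature
# extractor -- its weights are never updated.
W_frozen = rng.normal(size=(128, 16)) / np.sqrt(128)

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)   # fixed ReLU features

# Trainable head: the only new parameters, regressing the 3 angles.
W_head = np.zeros((16, 3))

x = rng.normal(size=(64, 128))             # dummy input embeddings
y = rng.normal(size=(64, 3))               # dummy yaw/pitch/roll labels

feats = extract_features(x)                # computed once: backbone is frozen
for _ in range(200):                       # plain gradient descent on the head
    grad = feats.T @ (feats @ W_head - y) / len(x)
    W_head -= 0.1 * grad
```

Because the backbone never changes, its features can be computed once and cached, which is exactly what makes this option so cheap compared to full fine-tuning.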

# 3. Deep model architecture

Tackling a deep learning problem usually starts by choosing the right model architecture for the task. The obvious choice for a computer vision problem is a **CNN** (convolutional neural network). Within the CNN family, an enormous number of architectures have been developed over the years. Automating this search brings a large benefit to the engineer.

Let’s split this search into two categories: **macro-search** and **micro-search**. The first is used to determine the **width** (number of channels) and **depth** (number of layers) hyper-parameters. Following the research of [Tan and Le (2019)], by limiting the **FLOPs** (floating-point operations) resources, the scaling can be done in either of the two dimensions, using the formula below: FLOPs grow linearly with depth but quadratically with width. Factors 𝛼, 𝛽, and 𝛾 are used for **compound dimension scaling** with respect to N, a factor that scales the number of operations by 2ᴺ.
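
The formula image from the original post does not survive this export. Following Tan and Le (2019), the compound-scaling rule it refers to can be reconstructed as a sketch, keeping only the depth and width dimensions discussed here:

```latex
d = \alpha^{N}, \qquad w = \beta^{N},
\qquad \text{s.t.} \;\; \alpha \cdot \beta^{2} \approx 2,
\quad \alpha \geq 1, \; \beta \geq 1
```

Since the FLOPs of a convolutional network grow roughly linearly with depth *d* and quadratically with width *w*, scaling by these factors multiplies the cost by (𝛼·𝛽²)ᴺ ≈ 2ᴺ, matching the 2ᴺ budget above.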

After gaining insight into the right scaling factors, micro-search refers to finding the operations that lead models to better precision. We apply already established tactics for this problem. First, let’s define the “**master module**”, a backbone module into which different operations are fused. It consists of a sequence of different convolutional blocks; upscale blocks are used to double the channel count and reduce the resolution.

All the different blocks used can be seen below. How to work with them is explained in the next section.

# 4. Evolutionary enhanced neural architecture search

Now let’s encode the information written above into a representation that maps it **unambiguously** to a deep model architecture. Range mapping is one solution: a floating-point vector limited to the 0–1 range, where each position maps to the index of its real representation.
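
A minimal sketch of such a range mapping is shown below. The choice lists (kernel sizes, block types) are hypothetical placeholders, not the article’s actual search space.

```python
def decode(gene, choices):
    # Map a float in [0, 1] to one element of a discrete choice list.
    # The min() guard keeps gene == 1.0 inside the valid index range.
    return choices[min(int(gene * len(choices)), len(choices) - 1)]

# Illustrative search space -- names are assumptions for the example.
KERNELS = [3, 5, 7]
BLOCKS = ["conv", "depthwise", "residual"]

# A 0-1 genotype vector decodes position by position into an architecture.
genotype = [0.10, 0.85, 0.40, 0.99]
arch = [decode(genotype[0], KERNELS),
        decode(genotype[1], BLOCKS),
        decode(genotype[2], KERNELS),
        decode(genotype[3], BLOCKS)]
```

Because every point of the continuous 0–1 hypercube decodes to exactly one architecture, any continuous black-box optimizer can search the discrete space.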

An example of how to pick a certain architecture parameter from a floating-point number can be seen above.

Once the representation is determined, meta-heuristic algorithms are employed to sample the search space. The goal is minimizing a complex **multi-criterion fitness function**, made to maximize **precision** and minimize **latency**. The first criterion is determined by training and testing an architecture to minimize the error between real and predicted angle labels. The second is measured by the number of floating-point operations used for the model’s forward pass. In the following function, *m* denotes the model, *T* the maximum non-penalized number of FLOPs, and *w* a factor that determines the penalty’s importance.
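
The fitness-function image is missing from this export; one common form of such a FLOPs-penalized objective (in the style of MnasNet-like multi-objective search, and matching the *m*, *T*, *w* description above) could look like this. The exact formula in the article may differ, and the default values here are placeholders.

```python
def fitness(error, flops, T=1e9, w=0.07):
    # Multi-criterion fitness to MINIMIZE: the angle error scaled by a
    # soft FLOPs penalty. Models under the budget T pay no penalty;
    # over-budget models are scaled up by (flops / T) ** w.
    penalty = max(flops / T, 1.0) ** w
    return error * penalty
```

With this shape, *w* trades off precision against latency: a larger *w* punishes over-budget architectures more harshly.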

To be more precise, the following algorithms are used in this work, without further explanation of how they work:

- **CMA** Evolution Strategy
- **Genetic algorithm**
- **Particle swarm algorithm**
- **BOBYQA** (Powell’s derivative-free algorithm)
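
To make the search loop concrete, here is a toy genetic algorithm over 0–1 genotype vectors: truncation selection, one-point crossover, and Gaussian point mutation. This is only a sketch of the idea; the article relies on established implementations of the algorithms listed above, not on this code, and the toy objective below stands in for the real train-and-evaluate fitness.

```python
import random

def evolve(fitness, dim=4, pop_size=20, generations=30, seed=0):
    # Minimal genetic algorithm minimizing `fitness` over [0, 1]^dim.
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)               # lower fitness is better
        parents = pop[: pop_size // 2]      # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dim)     # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(dim)          # Gaussian point mutation,
            child[i] = min(1.0, max(0.0,    # clipped back into [0, 1]
                                    child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# Toy objective: distance to the genotype (0.5, 0.5, 0.5, 0.5).
best = evolve(lambda g: sum((x - 0.5) ** 2 for x in g))
```

In the real setup, evaluating one genotype means decoding it to an architecture, training it briefly, and measuring the penalized error, which is why the population and generation counts must stay small.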

# 5. Results

This is what matters the most: each of the algorithms converged in the end, resulting in different individuals. Those models were then trained for a larger number of epochs with a decaying learning rate. The standard procedure for training head pose models was followed:

- Training: **300W_LP** (70%) dataset
- Validation: **300W_LP** (30%) dataset
- Test: **AFLW2000** / **BIWI** datasets

Finally, after testing on the standard datasets, the results were compared to the current state-of-the-art methods. Mean absolute errors and memory footprints in megabytes can be seen below.

To break the dullness of showing only numerical results, one visually pleasing result can be seen below.