Leveraging workforce analytics using deep learning

Source: Deep Learning on Medium

Workforce analytics is needed by almost every service-based business these days. It also goes by names such as workforce management, people analytics, and various other terms, but the expected outcome in all these cases is the same: forecasting the labor needed to run the business on a day-to-day basis. The primary objective of workforce analytics is to cut labor costs without compromising work quality. Predicting the daily workforce requirement depends on several internal and external factors in the organization, such as the holiday schedule, sick and vacation leave, weather conditions, and any other factors that affect the business directly or indirectly. We are focusing primarily on the forecasting aspect of workforce analytics based on the historical data we have, without going into other aspects like budgeting, individual performance analysis, productivity checks, etc.

It’s simple: we want to know how much labor is needed each day to carry out the daily business without hindrance. Since we are dealing with daily data, we take daily trends into consideration; for instance, business could be high on weekends and reach a low mid-week, on Tuesdays and Wednesdays. At a higher level, trends can also be seen in weekly data: the workforce requirement could shoot up during holiday weeks and fall during off-season weeks. Monthly trends are rarely seen in this kind of data, and even when present, they can be ignored unless we have a huge historical database.

Time series modeling

This kind of data is based on time: we see significant trends and patterns in it, all pertaining to time. The data has temporal dependence, meaning that the occurrence of data points in the future is related to the occurrence of data points in the past. This type of data cannot be modeled with regular machine learning models; it needs a specialized set of models that can account for this aspect of the data.

This article is a run-through of all the predictive time series models that I applied in my analysis to predict workforce requirement. This article is NOT about all the various analytical tasks we can perform on workforce-related data. It is only about the machine learning and deep learning models that I applied to forecast uni-variate data over time, for my use case and problem in particular. There could be many more models in the market that do better justice to different types of time series problems. These are a few that did best for the one I had.

GOAL: Forecast the number of hours of work needed to complete a task on a daily basis. (Based on this total number of hours required, the management plans the workforce needed to accomplish the task.)

DATA: Daily time series data. The variable of interest is “the total number of hours worked per day”. Additional variables were used as supporting variables in our analysis.

Below is a break-down of the classes of machine learning / deep learning models that I ran on this data.

Predictive models used

The classes of Models used:

  • Classical Methods
  • Machine learning models
  • Deep Learning models
  • Ensemble models

Data Exploration

Just like any other data, time series data also needs to be well understood before we go ahead in applying any method to forecast. We need to perform some data exploration in order to identify the underlying trends.

I used the statsmodels library (in Python) to perform EDA. Statsmodels is a module that provides classes and functions for:

  • Estimation of different statistical models
  • Conducting statistical tests
  • Conducting statistical data exploration

Using statsmodels, we can identify the seasonality and cyclic trends in the data. Using the Hodrick-Prescott filter, we can separate the trend and cyclic components. The filter minimizes a quadratic loss function to determine the individual components, using a smoothing parameter (lambda) that we can tune to the data at hand.

Visualize these components individually to check their distribution in the data, and keenly observe the ranges they lie in. Noise should be negligible compared to the seasonality and trend components.

In the above graphs, we can see that there is a slightly upward trend and that some seasonality is certainly present. There is noise (residual) throughout the dataset.

For detailed information on the methods used to perform EDA, refer to my article on Descriptive statistics in time series modeling: [ Link ]

Classical Models

1. ETS Models (Error-Trend-Seasonality)

Error-trend-seasonality (ETS) models are available in the statsmodels library. These models smooth each of the terms (error, trend, seasonality), i.e., add them, multiply them, or leave some of them out. Based on these key factors, we can create a model that best fits our data.

ETS decomposition is a specific use case of the ETS model; seasonal_decompose() from statsmodels is used for its implementation.

The trend component of the data needs to be checked before we decide what kind of ETS model to use:

  • Linear vs. Exponential
  • Upward vs. Downward

Also check the isolated seasonal and residual components of the data, to see whether there is less noise at the end of the dataset than at the beginning. But mostly, looking at the trend component should help you make a decision about model selection.

Based on the individual components mentioned above, we decide whether to go for an additive model or a multiplicative model.

Additive ETS model: we apply this when the trend is more linear and the seasonality and residual components are roughly constant over time.

Multiplicative ETS model: we apply this when the trend is exponential (increasing or decreasing at a non-linear rate).

Make sure you handle missing values (either impute them or remove them from the dataset) before you apply ETS decomposition, because it doesn’t work in their presence.

2. EWMA Models (Exponentially Weighted Moving Average)

The above plot shows a simple moving average with window sizes of 7 and 30 days (roughly 1 week and 1 month). The 7-day average shows a detailed trend, whereas the 30-day average shows a more general trend. The time frame you choose for the window determines the trend you see.

We can use the SMA (simple moving average) to check the general trends in the data; it can serve as a baseline model in our analysis. But we need more sophisticated models for understanding time series trends, and for this we can use EWMA models.

EWMA extends the SMA concept. The problem with SMA is that every observation in the window is weighted equally. EWMA, on the other hand, weights the observations unequally: more weight is given to more recent data, considering its greater importance for predicting future values.

EWMA also fixes other issues with SMA: it reduces the lag effect, since it puts more weight on recently occurring values. The weights assigned to the values depend on the parameters given to the model and the number of periods in the window. A decay term exponentially decreases the weights as we go from newer values to older ones.
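A minimal sketch of SMA versus EWMA with pandas, on a synthetic series; the window sizes and `span=7` are illustrative choices:

```python
import numpy as np
import pandas as pd

# Synthetic noisy daily series
rng = np.random.default_rng(2)
y = pd.Series(100 + rng.normal(0, 5, 120),
              index=pd.date_range("2023-01-01", periods=120, freq="D"))

sma_7 = y.rolling(window=7).mean()    # simple moving average, 1 week
sma_30 = y.rolling(window=30).mean()  # simple moving average, ~1 month
ewma = y.ewm(span=7, adjust=False).mean()  # recent points weighted more
```

Note that `sma_7` is undefined for the first 6 days (the window is not yet full), while the EWMA produces a value from the first observation onward.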

3. Holt- Winters method

We previously discussed EWMA, which is a simple exponential smoothing method with a single smoothing factor, alpha. This fails to account for contributing factors like trend and seasonality.

Holt-Winters extends exponential smoothing to these factors. The full seasonal method is a triple exponential smoothing method: it comprises one forecast equation and three smoothing equations, with smoothing parameters for level, trend, and seasonality (alpha, beta, and gamma respectively).

Again, the Holt-Winters model can be used as an additive or a multiplicative model. The additive model is used when seasonal variations are constant through the series, whereas the multiplicative model is used when the seasonal variations change in proportion to the level of the series.

Double exponential smoothing addresses the trend component, and triple exponential smoothing addresses the seasonality component in the data as well.

4. ARIMA (Auto Regressive Integrated Moving Average) models

ARIMA-based models are usually the most preferred models in time series modeling. It’s a very extensive topic, and I have written a separate article on its implementation: [ Link ]

Deep Learning Models

Deep neural networks have capabilities far more advanced than those of traditional machine learning models. They have powerful features that offer a lot of promise for time series forecasting problems, especially problems with multi-step forecasting, multivariate inputs, and complex non-linear dependencies. They also provide automatic feature learning and native support for sequence data (like time series data).

Modeling data with temporal dependence, like our time series data, needs specialized handling when fitting and evaluating models. This kind of temporal structure needs specialized models to identify the trend and seasonality patterns. We earlier applied classical linear models like ARIMA, because they are very effective for many problems and well understood. But linear methods like these suffer from a few limitations:

  • Missing data is generally not supported.
  • Linear methods assume linear relationships in the data, excluding more complex joint distributions.
  • Most linear models focus on uni-variate data (with a few exceptions), whereas most real-world problems have multiple inputs/variables.
  • We need more advanced models that can predict results over a wide time horizon.

But neural nets are, in general, “black boxes”: it is very difficult to interpret them beyond their performance metrics. For this reason, whenever you get a time series problem, try the ARIMA models first, and if they don’t give you good results, go for neural networks as a last resort. Coming up is a brief note on all the deep learning models I used. Again, deep learning is a vast topic, and my attempt in this article is confined to showcasing the basic utility of the models, not explaining the concepts in detail.

5. Multi layer Perceptron Model (MLP)

The multi-layer perceptron is one of the simplest feed-forward neural network models in deep learning. Before we go into complex neural networks, let’s understand what a simple artificial neuron is all about. An MLP takes a set of observations from previous time steps (lagged observations) to predict one or more future time steps.

Perceptron, or artificial neuron: artificial neural networks have a strong analogy with the biological neurons in our body. Just as a biological neuron receives input signals from various parts of the body and sends out an output signal, an artificial neuron receives various inputs that are weighted before passing through an activation function, which in turn decides the output going out from the neuron.

The perceptron applies a weight and bias to the input, and an activation function decides whether or not to activate the neuron (to fire it or not). Common activation functions are the sigmoid function, the hyperbolic tangent function, and the rectified linear unit (ReLU).

z = wx + b

Mathematically, the perceptron can be represented by the above equation, where ‘w’ is the weight applied to the input ‘x’ with a bias ‘b’.

Simplest representation of an artificial neuron
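The computation above can be sketched in a few lines of NumPy; the input, weight, and bias values are arbitrary illustrations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # linear combination: z = wx + b
output = sigmoid(z)              # activation decides the firing strength
```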

As you can see below, the above perceptron (or neuron) is the single basic unit in a multi-layer perceptron network. The yellow layer represents the input layer, the red layer represents the output layer, and the blue layers in between represent the hidden layers of the network. As you go forward through more layers, the level of abstraction increases. And if there are 3 or more hidden layers, the network is called a “deep network”.

Example of a Multilayer Perceptron model [2]
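As a minimal sketch of an MLP forecasting from lagged observations, scikit-learn’s MLPRegressor stands in here for any MLP implementation; the series is synthetic and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic series with a weekly cycle and slight trend
series = np.sin(np.arange(300) * 2 * np.pi / 7) + 0.01 * np.arange(300)

# Supervised framing: 7 lagged observations -> next value
X = np.array([series[i:i + 7] for i in range(len(series) - 7)])
y = series[7:]

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                   random_state=0).fit(X, y)
pred = mlp.predict(series[-7:].reshape(1, -1))  # one-step-ahead forecast
```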

6. Recurrent Neural Networks

This class of neural networks is specifically designed to work with sequential data, like our time series data. They are made of recurrent neurons, which are slightly different from normal neurons.

A recurrent neuron differs from a normal neuron in that it receives input from the previous time step as well as the current time step. The diagram below shows a recurrent neural network unrolled over time.

Un-rolling the recurrent neuron over time [1]

For example, the recurrent neuron at the third time step is a function of the inputs at t, t-1, and t-2. Because these neurons are a function of inputs from previous time steps, they are known as “memory cells”. In this way, a recurrent neuron forms a sort of memory state at any given time step from all the historical data supplied to it.

As for the format of inputs and outputs, RNNs are quite flexible. Sequence-to-sequence, sequence-to-vector, and vector-to-sequence are a few of the formats widely used in applications.

While implementing RNNs, make sure to scale your data, because the neurons can behave poorly on data with a huge gap between the minimum and maximum values. Also, feed the data to the model in batches; batch training is generally more stable and efficient.
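Scaling can be done, for example, with scikit-learn’s MinMaxScaler; the values here are arbitrary illustrations:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily "hours worked" values, as a column vector
series = np.array([120.0, 95.0, 160.0, 140.0, 80.0]).reshape(-1, 1)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(series)        # values mapped into [0, 1]
restored = scaler.inverse_transform(scaled)  # invert after forecasting
```

Remember to apply `inverse_transform` to the model’s predictions so the forecasts are back on the original scale.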

RNNs also require the input data to be in a certain format. If we are feeding sequences of data to the model, make sure each sequence has a corresponding label, for example [1,2,3,4] -> [5]: the value at the 5th time step becomes the label for the first 4 time steps. We select this window size based on the seasonality present in our data. If there is yearly seasonality, the input sequence (feature set) will contain 12 inputs (in the case of monthly data).
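The windowing just described can be sketched as follows; `make_windows` is a hypothetical helper for illustration, not a library function:

```python
import numpy as np

def make_windows(series, n_lags):
    """Turn a 1-D series into (samples, n_lags) inputs and next-step labels."""
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])   # window of lagged observations
        y.append(series[i + n_lags])     # the value that follows the window
    return np.array(X), np.array(y)

X, y = make_windows(np.array([1, 2, 3, 4, 5, 6]), n_lags=4)
# X[0] is [1, 2, 3, 4] and its label y[0] is 5
```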

You can add layers to the network as you wish. The hyperparameters to take care of while adding these layers are the number of neurons in each layer, the activation function, and the input shape. Each of these can be determined by trial and error first; if the model is not optimal, improvements can be made using grid search. Spend a substantial amount of time on this, because appropriate hyperparameter values play a vital role in model performance.

Also define a loss function and an optimizer, and set the number of epochs for which to run the model on your data.

After all of this is implemented, test the model on test/validation data, and once optimal performance is obtained, forecast into the future.

7. LSTM (Long Short Term Memory Units)

RNNs lose information from the initial time steps after a while; the weights from relatively recent time steps override it. We need a sort of “long-term memory” for our networks. LSTMs, a type of RNN, address this issue: they tackle the vanishing gradient problem and work towards tracking long-term dependencies in sequential data.

Un-rolling of an LSTM cell over time [3]

One unit of LSTM contains 4 main layers:

Layer 1, forget-gate layer: decides what information we are going to forget or throw away from the cell state. We pass h(t-1) and x(t) through a sigmoid function after performing a linear transformation with certain weights and bias terms.

Layer 2, part 1, sigmoid layer (input-gate layer): we again pass the two inputs h(t-1) and x(t) into a sigmoid function, after a linear transformation with weight and bias terms. This decides which values will be updated.

Part 2, hyperbolic tangent layer: the same inputs undergo a linear transformation, but instead of passing them into a sigmoid function, we pass them into a hyperbolic tangent function. This creates a vector of new candidate values.

Layer 3: we update the cell state by combining the two previous layers. The old cell state C(t-1) is scaled by the forget gate, and the input gate multiplied by the candidate values is added to it, giving the new cell state C(t).

Layer 4: the output h(t) is based on the cell state, but is a filtered version of it. In this layer too, the two inputs h(t-1) and x(t) undergo a linear transformation and are passed into a sigmoid function. This, in turn, is multiplied by the hyperbolic tangent of C(t), which gives the output h(t).
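The four layers above can be expressed as a single LSTM step in NumPy. This is an illustrative re-implementation of the standard equations with randomly initialized (untrained) parameters, not a library call:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked gate parameters
    in the order: forget, input, candidate, output."""
    Wf, Wi, Wc, Wo = W
    Uf, Ui, Uc, Uo = U
    bf, bi, bc, bo = b
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)        # layer 1: forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)        # layer 2 part 1: input gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # layer 2 part 2: candidates
    c_t = f * c_prev + i * c_tilde                  # layer 3: cell-state update
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)        # layer 4: output gate
    h_t = o * np.tanh(c_t)                          # filtered output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(4, n_hidden, n_in))
U = rng.normal(size=(4, n_hidden, n_hidden))
b = np.zeros((4, n_hidden))
h, c = lstm_step(rng.normal(size=n_in),
                 np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
```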

8. Gated Recurrent Unit (GRU)

This is a slight variation of the LSTM Cell.

Structure of a Gated Recurrent Unit

Unlike the LSTM, the GRU uses a reset gate and an update gate. The reset gate sits between the previous activation and the candidate activation, deciding how much of the previous state to forget, and the update gate decides how much of the candidate activation to use in updating the cell state.

In terms of exposing the cell state to the other units in the network, LSTMs control this exposure to a certain extent, while GRUs expose the entire state to the other units. The LSTM unit also has separate input and forget gates, while the GRU performs both of these operations together via its update gate.
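The gating just described can be sketched as one GRU step in NumPy; again this is an illustrative re-implementation with untrained parameters (bias terms omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U):
    """One GRU step. W, U hold the stacked parameters in the order:
    update gate, reset gate, candidate activation."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate activation
    return (1 - z) * h_prev + z * h_tilde            # blended new state

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(3, n_hidden, n_in))
U = rng.normal(size=(3, n_hidden, n_hidden))
h = gru_step(rng.normal(size=n_in), np.zeros(n_hidden), W, U)
```

Note there is no separate cell state here: the hidden state `h` plays both roles, which is the main structural difference from the LSTM.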

References

[1] [3] https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[2] https://www.neuraldesigner.com/blog/perceptron-the-main-component-of-neural-networks