Deep Learning Vs Deep Reinforcement Learning Algorithms in Retail Industry — II

Source: Deep Learning on Medium

LSTM, Transfer, Federated Learning, Reinforcement and Deep Reinforcement Learning


Continuing from my previous blog, which discussed the different use cases of machine learning algorithms in the retail industry, this blog highlights some recent advanced concepts, such as the role of IoT, federated learning, and reinforcement learning, in the context of the retail industry.

This blog is structured as follows:

  • The use of IoT in the retail domain, and how different ML algorithms and feature extraction strategies are used to handle missing geo-spatial and temporal data received from IoT devices.
  • The role of secure federated and transfer learning in the retail domain across enterprises.
  • Different use cases of reinforcement learning in the retail domain, with a special focus on its role in dynamic pricing.
  • The role of LSTM in predicting a customer’s preferred items/orders in the basket, along with distinguishing real customers from fraudulent ones.

IoT Data and Algorithms in Retail

In the retail sector, IoT-enabled devices have played a considerable role in controlling and streamlining supply chains, capturing real-time metrics to track product availability and sales, and deciding the best placement locations for different products. Popular deep learning algorithms used in the IoT retail space include LSTM for time series prediction and CNN for image analysis. However, a few common problems arise when collecting and analyzing IoT data from disparate sources:

First, sensor readings from retail stores and warehouses may be intermittently absent, randomly missing at consecutive timestamps, or lost at a certain timestamp for an entire store. Such missing data defeats traditional regression-based methods and non-negative matrix decomposition, because one or many columns and rows can be missing at the same time. Second, data generated by sensors deployed in different locations (e.g., with different latitudes, longitudes, and even altitudes) exhibits significant nonlinearities, which relate strongly to the time dimension and also depend heavily on spatial attributes (i.e., latitude, longitude, or altitude).

To deal with these challenges, methods for recovering missing sensor readings are used. Some are based on filtering, such as Median Filtering, Kriging, and Kalman Filtering; others are built on regression methods of varying complexity, including ARIMA (AutoRegressive Integrated Moving Average), SVR (Support Vector Regression), kNN (k-Nearest Neighbors), and stKNN (spatial and temporal k-Nearest Neighbors). These methods have limited capability to capture the data’s global dependencies because of their model structures, which only quantify local or regional data points in terms of time or spatial attributes. Even matrix completion methods for interpolating the missing values are limited to capturing one-dimensional spatial similarity.

The spatio-temporal multiview-based learning (ST-MVL) method is used to collectively fill missing readings in a collection of geo-sensory time series data, considering:

  1. Temporal correlation between readings at different timestamps in the same series, to generate a more accurate estimate.
  2. Spatial correlation between different time series.

ST-MVL integrates the advantages of global views (empirical models derived from data over a long period) and local views (data-driven algorithms concerned with recent readings) to achieve better accuracy. It can handle the block missing problem by combining the four views in a multi-view learning framework.
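To make the idea of blending a spatial view with a temporal view concrete, here is a minimal sketch. It is not ST-MVL itself: it uses only two views (inverse distance weighting across nearby sensors, and exponentially decaying weights along one sensor's own series) and blends them with a fixed, assumed weight, whereas the full method combines four views in a learned framework. All function names and parameters are hypothetical.

```python
import math

def idw_estimate(target_xy, neighbors, power=2.0):
    """Spatial view: inverse-distance-weighted mean of neighbor readings.

    neighbors is a list of ((x, y), value) pairs taken at the same timestamp.
    """
    num = den = 0.0
    for (x, y), value in neighbors:
        dist = math.hypot(x - target_xy[0], y - target_xy[1])
        if dist == 0:
            return value  # a co-located sensor gives the reading directly
        weight = dist ** -power
        num += weight * value
        den += weight
    return num / den if den > 0 else None

def smoothed_temporal_estimate(series, t, alpha=0.5):
    """Temporal view: readings closer in time to index t get larger weights.

    series is one sensor's readings, with None marking missing values.
    """
    num = den = 0.0
    for k, value in enumerate(series):
        if k == t or value is None:
            continue
        weight = alpha * (1.0 - alpha) ** (abs(k - t) - 1)
        num += weight * value
        den += weight
    return num / den if den > 0 else None

def blend_views(spatial, temporal, spatial_weight=0.5):
    """Combine the two views; the fixed weight is an assumption of this sketch."""
    if spatial is None:
        return temporal
    if temporal is None:
        return spatial
    return spatial_weight * spatial + (1.0 - spatial_weight) * temporal
```

When one view has nothing to work with (for example, an entire store lost at one timestamp leaves no spatial neighbors), the sketch simply falls back to the other view.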

Though ST-MVL achieves satisfactory performance in filling missing geo-tagged sensor readings, it ensembles five different models, each of which requires several parameters to be fine-tuned, which is labor intensive. Moreover, ST-MVL is still limited to capturing one-dimensional spatial and temporal information, and it fails to model high-dimensional spatial features (e.g., sensors with longitude, latitude, and altitude) and periodic patterns in the time dimension.

NN-based heuristic searching methods have evolved to map sensors with irregular geo-locations into a matrix, by iteratively searching for the spatially nearest neighbor of each sensor. This mapping is paired with:

  • A tensor completion based method that recovers the missing values by capturing the spatial and temporal information in a multi-dimensional way. It requires tuning only one key parameter and does not require non-missing training data.
  • An efficient t-SVD (Tensor Singular Value Decomposition) based optimization scheme that solves the tensor completion problem with a theoretical guarantee of convergence to the optimal solution.

The t-SVD method requires tuning only one key parameter, in an unsupervised manner, and is computationally efficient at representing high-dimensional sensor data. It can accurately model spatial and temporal dependencies and periodic patterns among sensors, enabling a high-performance model for missing sensor data recovery. The method was primarily designed for recovering noisy images or videos, which can naturally be seen as tensors. First, the NN-based heuristic searching method transforms the irregularly deployed sensors into an array, with adjacent sensors’ data placed close to each other. Next, this array is fed into the t-SVD system, which computes the missing values using the Fast Fourier Transform (FFT).

Many sensors irregularly deployed throughout a city generate huge amounts of time series data with two kinds of dimensions: time and space. These sensor readings are easily missed or lost, and t-SVD aims to recover them. It formulates the data as a 3-order tensor, with two spatial dimensions (i.e., longitude and latitude) plus one time dimension, or as a 4-order tensor, with two spatial dimensions and two time dimensions (e.g., hours × days).
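A rough sketch of FFT-based tensor completion on a 3-order tensor (two spatial dimensions plus time) may help here. It is a simplification rather than the optimization scheme described above: instead of solving a nuclear-norm problem, it repeatedly projects the current estimate onto tensors of low tubal rank (FFT along the time axis, truncated SVD of every frontal slice, inverse FFT) and then restores the observed readings. The single tuning parameter in this sketch is the assumed `rank`; the function names are hypothetical.

```python
import numpy as np

def tubal_rank_truncate(tensor, rank):
    """Keep the top `rank` singular components of every slice in the Fourier domain."""
    spectrum = np.fft.fft(tensor, axis=2)          # FFT along the time axis
    truncated = np.empty_like(spectrum)
    for k in range(tensor.shape[2]):
        u, s, vh = np.linalg.svd(spectrum[:, :, k], full_matrices=False)
        truncated[:, :, k] = (u[:, :rank] * s[:rank]) @ vh[:rank, :]
    # The input is real, so the imaginary part left after the inverse FFT is numerical noise.
    return np.real(np.fft.ifft(truncated, axis=2))

def complete_tensor(observed, mask, rank, n_iter=100):
    """Fill entries where mask is False; entries where mask is True are kept as observed."""
    estimate = np.where(mask, observed, 0.0)       # start by zero-filling the gaps
    for _ in range(n_iter):
        low_rank = tubal_rank_truncate(estimate, rank)
        estimate = np.where(mask, observed, low_rank)
    return estimate
```

Here `observed` would be the sensor array produced by the nearest-neighbor mapping step, and `mask` marks which readings actually arrived.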

Federated and Transfer Learning in Retail

The concept of Federated Learning, introduced by Google in 2017, initiated the idea of learning a task from daily activity and delegating it to the edge. The learnt behaviour is then refined, and the local knowledge is shared with the center and other edges.
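The edge/center exchange can be sketched with federated averaging on a toy linear model. This is a simplified illustration under stated assumptions: every edge holds the same features for different records, each edge runs a few gradient steps on its own data, and the center only ever sees model weights, never raw data. The function names, learning rate, and step counts are hypothetical.

```python
import numpy as np

def local_update(weights, features, targets, lr=0.1, steps=5):
    """One edge refines the shared weights on its private data (least squares)."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * features.T @ (features @ w - targets) / len(targets)
        w -= lr * grad
    return w

def federated_average(client_data, n_features, rounds=20):
    """The center averages edge updates, weighted by each edge's sample count."""
    global_w = np.zeros(n_features)
    total = sum(len(targets) for _, targets in client_data)
    for _ in range(rounds):
        updates = [local_update(global_w, x, y) for x, y in client_data]
        global_w = sum(len(y) / total * w
                       for (_, y), w in zip(client_data, updates))
    return global_w
```

Only `global_w` and the per-edge updates cross the network in this sketch; the `(features, targets)` pairs stay with their owners.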

Federated learning techniques are widely used in retail to provide customers with personalized services, including product recommendations and sales services, based on user purchasing power, personal preferences, and product characteristics.

The data features of a specific user are likely to be scattered among different departments or enterprises. For example, a user’s purchasing power can be inferred from her bank savings and her personal preferences can be analyzed from her social networks, while the characteristics of products are recorded by an e-shop. Two major problems arise that traditional machine learning methods have not been able to solve:

  • Maintaining data privacy and data security: data barriers between banks, social networking sites, and e-shopping sites make it hard to aggregate the data and train a model.
  • The data is heterogeneous, and traditional machine learning models cannot work directly on heterogeneous data.

Federated learning and transfer learning solve these problems by building a model for the three parties without exporting the enterprise data, which protects data privacy and data security. In addition to federated learning, transfer learning is used to address the data heterogeneity problem, helping to build a cross-enterprise, cross-data, and cross-domain ecosystem for big data and artificial intelligence in the retail space.

Reinforcement Learning in Retail

Reinforcement Learning (RL) in Artificial Intelligence includes algorithms that work in an environment, taking decisions to maximize the cumulative reward and improve learning efficiency. For example, RL could show slot machine (multi-armed bandit) players the best strategy for how much to invest in trying different machines and how much to bet on the most promising ones. The foremost feature of these algorithms is the search for the optimal balance between exploration of unknown situations and exploitation of the knowledge accumulated through trial and error. RL exists in Machine Learning (ML) alongside the better-known Supervised and Unsupervised Learning approaches.
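The slot machine example can be sketched with an epsilon-greedy strategy, one simple way (among several) to balance exploration and exploitation. The win probabilities, the value of epsilon, and the function name below are assumptions made for illustration.

```python
import random

def run_epsilon_greedy(win_probs, pulls=5000, epsilon=0.1, seed=0):
    """Play a multi-armed bandit; return pull counts and estimated win rates per arm."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)
    values = [0.0] * len(win_probs)   # running average reward of each arm
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: try a random machine.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: bet on the machine that looks best so far.
            arm = max(range(len(win_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values
```

With a small epsilon, most pulls go to the machine with the best estimated payout, while the occasional random pull keeps the estimates for the other machines from going stale.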

The first class of algorithms (supervised learning) involves humans tagging the correct result for several examples for the algorithm’s learning phase, followed by measurement of efficacy (i.e., validation and verification of results against original and alternate sources of information).

In the second group (unsupervised learning), the algorithms search for the best result on their own without external indications (such as grouping customers, based on their purchase data, into a few homogeneous classes).

Reinforcement learning in the retail sector starts when the agent interacts with the environment, receives feedback in the form of rewards, and works to maximize that reward. The rewards may be collected from consumer activity, from robots and automated agents installed in warehouses, or from states and actions reported by a class of IoT devices. The main objective is to maximize discounted rewards from a given action which can be formulated as:
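In standard RL notation (the symbols here are assumed rather than taken from the source), with a discount factor $\gamma \in [0, 1)$, state $s_t$, action $a_t$, and reward function $r$, the discounted sum of rewards is:

```latex
\sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t)
```

A smaller $\gamma$ makes the agent favor immediate rewards, while a value close to 1 makes later rewards count almost as much as immediate ones.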

The above image illustrates what a policy agent does: it maps a state to the best action, a=π(s). The Q function maps state-action pairs to expected rewards, combining the immediate reward with all the future rewards that might be harvested by later actions in the trajectory. Having assigned values to the expected rewards, the Q function is used to select the state-action pair with the highest so-called Q value (so termed because it represents the “quality” of a certain action in a given state).

Reinforcement learning starts with the neural network coefficients being initialized stochastically, or randomly. Using feedback from the environment, the neural net can use the difference between its expected reward and the ground-truth reward to adjust its weights and improve its interpretation of state-action pairs.

In retail chains, RL is widely used to optimize assortment, stock levels, and prices region by region or, even better, store by store, constantly adapting to the evolution of lifestyles and to the commercial communications of producers and local competitors.

When a new promotion is introduced and no data is available to understand the best correlations with the different types of customers and their last purchases, RL begins by taking decisions of an “exploratory” nature and improves the profit day after day.

RL algorithms in the retail sector encompass the following tasks:

  • Maintaining the entire supply chain to maximize efficiency and reduce resource consumption.
  • Training robots with deep reinforcement learning to learn and perform a new task, e.g., capturing video footage and memorizing the knowledge gained (success or failure) as part of the deep learning model governing the robot’s actions.
  • Optimizing space utilization in warehouses to reduce transit time for stocking and warehouse operations.
  • Devising a dynamic pricing strategy using Q-learning to increase profit.
  • Solving the Split Delivery Vehicle Routing Problem with a multi-agent system to reduce overall fleet cost and execution time while meeting all customer demands.
  • Enabling e-commerce merchants to learn and analyze customer behaviors, tailor products, and offer personalized recommendation services to suit customer interests.

Dynamic Pricing with Reinforcement Learning

Real-world e-commerce dynamic pricing problems are first modeled as a Markov Decision Process. As illustrated in the above and below figures, the agent periodically changes the prices of products as its action after observing the environment state. The new environment state is then observed and the reward received. Each pricing episode ends when the product is out of stock. The model is pre-trained on historical sales data and specialists’ previous pricing actions.

Dynamic pricing framework using DRL with demonstrations on E-commerce platform
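A toy version of this Markov Decision Process can be sketched with tabular Q-learning, used here as a simpler stand-in for the deep RL framework and without the pre-training step. The state is the remaining stock, the action is a price from a small menu, the reward is the revenue of a sale, and an episode ends when the product is out of stock. The price menu, the demand curve, and every hyperparameter below are assumptions made for illustration.

```python
import random

PRICES = [4.0, 6.0, 8.0]  # hypothetical price menu

def sale_probability(price):
    """Assumed demand curve: the higher the price, the lower the chance of a sale."""
    return max(0.0, 1.0 - price / 10.0)

def train_pricing_agent(max_stock=5, episodes=2000, alpha=0.1, gamma=0.9,
                        epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    # q[stock][action]; stock 0 is terminal, so its row stays at zero.
    q = [[0.0] * len(PRICES) for _ in range(max_stock + 1)]
    for _ in range(episodes):
        stock = max_stock
        for _ in range(max_steps):
            if stock == 0:
                break  # episode ends when the product is out of stock
            if rng.random() < epsilon:
                action = rng.randrange(len(PRICES))
            else:
                action = max(range(len(PRICES)), key=lambda a: q[stock][a])
            price = PRICES[action]
            sold = rng.random() < sale_probability(price)
            reward = price if sold else 0.0
            next_stock = stock - 1 if sold else stock
            # Q-learning update toward reward plus discounted best future value.
            target = reward + gamma * max(q[next_stock])
            q[stock][action] += alpha * (target - q[stock][action])
            stock = next_stock
    return q

def greedy_prices(q):
    """Price the learned table would pick at each non-empty stock level."""
    return {stock: PRICES[max(range(len(PRICES)), key=lambda a: q[stock][a])]
            for stock in range(1, len(q))}
```

Because future revenue is discounted, the agent has to trade off a high price that sells slowly against a lower price that clears the stock sooner.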

DRL (Deep Reinforcement Learning) based dynamic pricing approaches can be extended to :

  • Continuous pricing action space.
  • Solving the unknown demand function problem by designing different reward functions.
  • Addressing the cold-start problem by introducing pre-training and evaluation using the historical sales data.

In online retail sales, revenue conversion rate works well as a reward function in markdown pricing applications when there is a clear and accurate stock level determining the life-cycle of the pricing process. In addition, this reward function works when the majority of the markdown products are low-sales-volume luxuries, whose revenue conversion rates are low but sensitive to price.

Dynamic pricing is also applied to pricing a bid by a seller in a single-seller or multi-seller procurement situation.

Single Seller Model in Dynamic Pricing

The above figure depicts a single-seller model in which the seller wishes to maximize his revenue using a dynamic pricing methodology. The seller maintains a finite inventory of the product and follows a reorder policy: each time the inventory level drops to a given level, he orders a replenishment of a given size. The replenishment lead time (the time elapsed between placement of the replenishment order and the arrival of the items) is exponentially distributed. The seller uses both price dispersion, by changing the price of a unit of the product dynamically, and price discrimination based on volume discounts.

Impartial buyers/Captives (buyers whose buying patterns are not influenced by the availability or otherwise of volume discounts) wait in Queue1, and shoppers wait in Queue2. Captives get higher priority from the seller, such that even if the item is not available in inventory, the seller provides the incoming captive with a price quote and a lead time quote. If the quoted lead time and price are acceptable to him, he commits to purchasing the item (if the lead time is nonzero, he has to wait in Queue1); otherwise he leaves the system.

An incoming shopper does not get a quote regarding lead time or price. A shopper may choose to wait in Queue2 if he likes the present price, and he can balk from Queue2 after an exponentially distributed waiting time if he does not get service within that time. If a shopper is offered the item, he has to pay the price at that time (if the shopper does not like the price, he can also balk from the system).

In the revenue optimization problem of the seller in the above model, the seller has no information about the strategies of other sellers in the retail market or about the preferences of the buyers approaching the market. Stochastic approximation built on the principle of Q-learning can be used to describe the system’s transition from one state to another in a specific time, where the reward assigned to the system is determined by the amount of business done and the inventory holding cost.

The next scenario presented here is the two-seller model, illustrated in the figure below. Two competing sellers wish to maximize their respective revenues using RL-based adaptive behavior with two actor-critic learners. All assumptions about individual sellers and buyers are the same as in the single-seller model. We further assume that every captive is associated with a utility function that combines price and delay.

Two Seller Model in Dynamic Pricing

A captive buys from the seller where his utility is higher and positive. If the captive cannot find positive utility, he drops out of the system. Since an incoming shopper does not get a quotation regarding lead time or price, the shopper observes the prices at both sellers and joins the queue (Queue2 or Queue4) where the price is lower and within his price limits. He can balk from that queue according to an exponentially distributed waiting time if he does not get service within that time.

If a shopper is offered the item, he has to pay the price at that time. If the shopper does not like the price, he can leave the system, or balk to the other shop if he gets the item there immediately and at a lower price. Each seller is equipped with a mechanism to observe the queue status and inventory at the other seller. Since the system is simultaneously controlled by more than one decision maker, the problem is modeled as a stochastic game. Sellers choose actions simultaneously, where the possible actions at any state form a set from which the first and second sellers each choose their prices. The prices are changed whenever a customer enters (or leaves) the system and on arrival of an inventory lot at one of the sellers’ shops from the distributor.

The objective of dynamic price optimization for the 2-seller model is to model it as a Markovian game when the transition structure is not known, in which case simulation-based approximation methods based on reinforcement learning are used. Here, transition structure refers to the sequence of events, each denoting an instance when a buyer departs from or arrives at either of the two sellers, or when inventory arrives at either of the two sellers.

A Nash equilibrium exists, and if the policy of one seller is frozen, the problem becomes a Markov decision process for the other seller. This motivates the use of an actor-critic type of learning paradigm for stochastic games to learn such strategies. When both sellers try to learn their Nash equilibrium strategies following best response dynamics, it can be hoped that both will converge to a Nash equilibrium if seller 1 observes seller 2 as quasi-static and seller 2 observes seller 1 as playing an equilibrium strategy in their pursuit of mutual best responses.
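Best response dynamics can be illustrated on a much simpler problem than the stochastic game above: a one-shot pricing game between two sellers with known payoffs and no queues or inventory. In this sketch each seller in turn picks the price that maximizes his own profit while the rival's price is held fixed, and the process stops when neither wants to move. The demand function, cost, and price grid are assumptions chosen only to keep the example small.

```python
PRICE_GRID = [1, 2, 3, 4, 5, 6]   # hypothetical discrete prices
UNIT_COST = 1                     # assumed cost per unit

def profit(own_price, rival_price):
    """Assumed demand: falls with the seller's own price, rises with the rival's."""
    demand = max(0, 10 - 2 * own_price + rival_price)
    return (own_price - UNIT_COST) * demand

def best_response(rival_price):
    """Price that maximizes profit when the rival's price is frozen."""
    return max(PRICE_GRID, key=lambda p: profit(p, rival_price))

def best_response_dynamics(p1, p2, max_rounds=50):
    """Alternate best responses until neither seller changes price."""
    for _ in range(max_rounds):
        new_p1 = best_response(p2)
        new_p2 = best_response(new_p1)
        if (new_p1, new_p2) == (p1, p2):
            return p1, p2
        p1, p2 = new_p1, new_p2
    return None  # did not settle within max_rounds

def is_nash(p1, p2):
    """True when neither seller can gain by deviating unilaterally."""
    return (profit(p1, p2) >= max(profit(p, p2) for p in PRICE_GRID)
            and profit(p2, p1) >= max(profit(p, p1) for p in PRICE_GRID))
```

In the learning setting the payoffs are not known in advance, so each seller has to estimate his best response from experience, which is where the actor-critic learners come in.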

Sellers can randomize their prices differently, e.g., seller-1 randomizes his prices in a high price domain and seller-2 randomizes his prices in a low price domain. This could be because the shop of seller-1 is overcrowded, and he tries to discourage incoming buyers by displaying a high price. Seller-2, on the other hand, does not have customers, so to reduce inventory cost he displays a low price to attract more incoming customers.

The below figure on the left illustrates how a multi-agent DRL (Deep Reinforcement Learning) framework develops pricing algorithms in the retail market through agents/brokers that cluster consumers into different groups (using K-Means with Dynamic Time Warping (DTW)). Each agent employs a DQN to interact with the tariff market and gets its contribution value calculated from the reward function. The figure on the right represents a two-hidden-layer recurrent DQN, where the first hidden layer uses LSTM to extract features from the sequential state inputs.

Deep Reinforcement Learning in Retail Pricing Strategy, Source :
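Dynamic Time Warping, the distance used for clustering consumers above, can be sketched in a few lines. This is the textbook dynamic-programming form with an absolute-difference cost, which is an assumption here since the source does not give the exact variant; the function name is hypothetical.

```python
def dtw_distance(a, b):
    """DTW distance between two numeric sequences, allowing time stretching."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Two consumption curves with the same shape but shifted or stretched in time end up close under this distance, which is why it suits grouping consumers by behavior rather than by exact timing.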

Deep Learning (LSTM) to predict fraud/authentic user-order

A user’s browsing behavior, i.e., the sequence of the user’s clicks within a session, can be captured and modeled using a neural-network-based embedding, with sequences of such clicks modeled by a Recurrent Neural Network (RNN). Before applying RNN and LSTM based models, the main objective is to capture users’ behaviors in as much detail as possible.

Browsing history of fraudulent users : Source :

Statistically, the behaviors of fraudsters differ from those of legitimate users. Real users browse items following a certain pattern. They are likely to browse a lot of items similar to the one they eventually buy, as research. In contrast, fraudsters behave more uniformly (e.g. go directly to the items they want to buy, which are usually virtual items, such as #1 in the above figure), or randomly (e.g. browse unrelated items before buying, such as #2). Thus, it is important to capture the sequence of each user’s clicks, while automatically detecting abnormal behavior patterns.

The steps involved in detecting fraud with an LSTM network are:

  • Use Item2Vec, a technique similar to Word2Vec, to learn to embed the details of each click (e.g. the item being browsed) into a compact vector representation.
  • Use an RNN to capture the sequence of clicks, revealing the browsing behaviors in the time domain.
  • Use under-sampling techniques on legitimate user sessions to balance the imbalanced datasets that result from the limited number of fraudulent transactions.
  • As user browsing behaviors, both legitimate and fraudulent, change over time, handle the drift phenomenon by automatically fine-tuning the model with new data points incrementally.
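The under-sampling step in the list above can be sketched as follows, assuming sessions are labeled 1 for fraudulent and 0 for legitimate. Random under-sampling to a 1:1 ratio is only one possible choice; the function name and the ratio are assumptions.

```python
import random

def undersample_legitimate(sessions, labels, seed=0):
    """Keep every fraudulent session and an equally sized random sample of legitimate ones."""
    rng = random.Random(seed)
    fraud_idx = [i for i, label in enumerate(labels) if label == 1]
    legit_idx = [i for i, label in enumerate(labels) if label == 0]
    kept_legit = rng.sample(legit_idx, min(len(legit_idx), len(fraud_idx)))
    keep = sorted(fraud_idx + kept_legit)   # preserve the original session order
    return [sessions[i] for i in keep], [labels[i] for i in keep]
```

The discarded legitimate sessions are not lost for good: a different seed yields a different sample, which can be useful when the model is fine-tuned incrementally.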

The following figure illustrates a typical RNN used in fraud detection where each user-click is fed to the corresponding time slot of the RNN, with the RNN finally giving the risk score as output.

Clicks to detect fraud, Source :
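The flow from clicks to a risk score can be sketched as a forward pass through a small recurrent network. This is an untrained illustration only: the item embeddings are random stand-ins for learned Item2Vec vectors, a plain tanh RNN cell replaces the LSTM, and the class name, layer sizes, and weights are all assumptions.

```python
import numpy as np

class ClickRiskRNN:
    """Tiny vanilla RNN that turns a sequence of clicked item ids into a score in (0, 1)."""

    def __init__(self, n_items, embed_dim=8, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(0.0, 0.1, (n_items, embed_dim))    # stand-in for Item2Vec
        self.w_in = rng.normal(0.0, 0.1, (embed_dim, hidden_dim))
        self.w_rec = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.bias = np.zeros(hidden_dim)
        self.w_out = rng.normal(0.0, 0.1, hidden_dim)

    def risk_score(self, clicked_item_ids):
        hidden = np.zeros(self.w_rec.shape[0])
        for item_id in clicked_item_ids:
            # Each click updates the hidden state at its own time slot.
            hidden = np.tanh(self.embed[item_id] @ self.w_in
                             + hidden @ self.w_rec + self.bias)
        logit = hidden @ self.w_out
        return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid gives the risk score
```

In a real system the weights would be trained on labeled sessions, so that the score after the final click reflects how fraud-like the whole browsing sequence looks.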

Basket sequence generation process using the LSTM and Generator modules, Source

Other possible use cases with sequence data in the retail environment include:

  • Predict a customer’s likely purchase basket in Week_n+1 using transaction sequences from the period Week_0 … to Week_n, with a GAN (Generative Adversarial Network) using LSTM. The above figure illustrates a pipeline in which the product generator G, at the initial state, produces the individual products in the basket, which are then fed to the LSTM module to model the evolution of the customer’s state.
  • Generate realistic sequences of baskets by feeding the basket generated in the previous step to the LSTM, which contains the customer’s hidden state.
  • Predict likely basket items for a similar set of customers by generating products for them.
  • Generate a customer’s e-commerce order for a given product, given a product embedding. The order is a detailed summary of the product embedding, customer embedding, price, and date of purchase.
  • Predict whether or not a product is the last product in the basket.
  • Predict the category of the next product.
  • Predict the price of the next product.

The following figure illustrates the mechanism of embedding customers via multi-task learning with an LSTM. The input is the sequence of products a customer has purchased throughout their transactional history. After convergence, the hidden state of the LSTM characterizes the customer’s state.


In this article we have seen how a variety of retail problems can be solved using different deep learning techniques such as LSTM, federated learning, and transfer learning. In addition, we have seen how reinforcement learning with Q-learning plays a predominant role not only in dynamic pricing and order optimization in the supply chain, but also in delivering personalized recommendation services. In the next article, let’s look at the machine learning algorithms used in the retail blockchain industry.