Original article can be found here (source): Artificial Intelligence on Medium

# Reinforcement Learning based HVAC Optimization in Factories

*Abstract. Heating, Ventilation and Air Conditioning (HVAC) units are responsible for maintaining the temperature and humidity settings in a building. Studies have shown that HVAC accounts for almost 50% of the energy consumption in a building and 10% of global electricity usage. HVAC optimization thus has the potential to contribute significantly towards our sustainability goals by reducing energy consumption and CO2 emissions. In this work, we explore ways to optimize the HVAC controls in factories. This is a complex problem, as it requires computing an optimal state while considering multiple variable factors, e.g. the occupancy, manufacturing schedule, temperature requirements of operating machines, air flow dynamics within the building, external weather conditions, energy savings, etc. We present a Reinforcement Learning (RL) based energy optimization model that has been applied in our factories. We show that RL is a good fit as it is able to learn and adapt to multi-parameterized system dynamics in real-time. It provides around 25% energy savings on top of the previously used Proportional–Integral–Derivative (PID) controllers.*

# Introduction

Heating, Ventilation and Air Conditioning (HVAC) units are responsible for maintaining the temperature and humidity settings in a building. We specifically consider their usage in factories in this work, where the primary goal of the HVAC units is to keep the temperature and (relative) humidity within the prescribed manufacturing tolerance ranges. This needs to be balanced with energy savings and CO2 emission reductions to offset the environmental impact of running them.

Given their prevalence not only in factories, but also in homes and office buildings, any efficient control logic has the potential of making significant contributions with respect to their environmental impact. Unfortunately, given the complexity of HVAC units, designing an efficient control logic is a hard optimization problem. The control logic needs to consider multiple variable factors, e.g. the occupancy, manufacturing schedule, temperature requirements of operating machines, air flow dynamics within the building, external weather conditions, energy savings, etc., in order to decide how much to heat, cool, or humidify the zone.

The HVAC optimization literature can be broadly divided into two categories: (i) understanding the recurring patterns among the optimization parameters to better schedule the HVAC functioning, and (ii) building a simulation model of the HVAC unit and assessing different control strategies on the model, to determine the most efficient one.

Examples of the first category include [1, 2] which employ a building thermodynamics model to predict the buildings’ temperature evolution. Unfortunately, this is not really applicable for our factories where the manufacturing workload varies every day, and there is no schedule to be predicted. It is also worth mentioning that most such models only consider one optimization parameter at a time, i.e. control heating / cooling to regulate the temperature; whereas in our case, we need to regulate both the temperature and (relative) humidity simultaneously to maintain the optimal manufacturing conditions.

The second category of model-based approaches applies to both PID and RL controllers. PID controllers [3] use a control loop feedback mechanism to control process variables. Unfortunately, PID based controllers require extensive calibration with respect to the underlying HVAC unit to be able to control it effectively. [4] outlines one such calibration approach for PIDs based on a combination of simulation tools.

Reinforcement Learning (RL) [5] based approaches [6, 7] have recently been proposed to address such problems given their ability to learn and optimize multi-parameterized systems in real-time. An initial (offline) training phase is required for RL based approaches, as training an RL algorithm in live settings (online) can take time to converge, leading to potentially hazardous violations as the RL agent explores its state space. [6, 7] outline solutions to perform this offline training based on EnergyPlus based simulation models of the HVAC unit. EnergyPlus [8] is an open-source HVAC simulator from the US Department of Energy that can be used to model both energy consumption — for heating, cooling, ventilation, lighting and plug and process loads — and water use in buildings. Unfortunately, developing an accurate EnergyPlus based simulation model of an HVAC unit is a non-trivial, time-consuming and expensive process, and as such is a blocker for their use in offline training.

In this work, we propose an efficient RL based HVAC optimization algorithm that is able to learn and adapt to a live HVAC system in weeks. The algorithm can be deployed independently, or as a ‘training module’ to generate data that can be used to perform offline training of an RL model — to further optimize the HVAC control logic. This allows for a speedy and cost-effective deployment of the developed RL model. In addition, the model output in [6, 7] is the optimal temperature and humidity setpoints, which then rely on the HVAC control logic to ensure that the prescribed setpoint is achieved in an efficient manner. In this work, we propose a more granular RL model, whose output is the actual valve opening percentages of the Heating, Cooling, Re-heating and Humidifier units. This enables a more self-sufficient approach: the RL output bypasses any in-built HVAC control logic entirely, removing that dependency and allowing for a more vendor (platform) agnostic solution.

The rest of the paper is organized as follows: Section 2 introduces the basics, providing an RL formulation of the HVAC optimization problem. Section 3 outlines the RL logic that is initially deployed to generate the training data for offline training, leading to the (trained) RL model in Section 4 providing the recommended valve opening percentages in real-time. In Section 5, we provide some benchmarking results of the developed RL model that has been deployed in one of our factory zones. Initial studies show that we are able to achieve 25% energy efficiency over the previously existing PID controller logic. Section 6 concludes the paper and provides directions for future work.

# PROBLEM SCENARIO

## HVAC

Fig. 1 shows the energy balance of the HVAC unit. Put simply, the HVAC unit has to bring the mix of the Recirculation air and the Fresh air to the temperature and humidity needed to maintain the area temperature and (relative) humidity at the required level. The theoretical energy needed can be monitored by taking the difference between the energy of the outgoing air and the incoming air; comparing this amount with the energy actually consumed by the unit gives the energy efficiency of the HVAC unit.

The energy flows can be determined from the flow of each medium (air, hot water, cold water, steam) and the temperature difference between the supply and return of that medium, together with the consumed electrical energy.
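The bookkeeping above can be sketched in a few lines. This is an illustrative calculation only: the flow rates, temperatures, and consumed power below are hypothetical, and a real deployment would meter these values directly.

```python
# Illustrative sketch of the HVAC energy-balance bookkeeping.
# Constants are textbook approximations for air at room conditions.

AIR_DENSITY = 1.2   # kg/m^3
AIR_CP = 1.005      # kJ/(kg*K), specific heat of dry air

def air_energy_flow_kw(volume_flow_m3_s: float, t_supply_c: float, t_return_c: float) -> float:
    """Thermal power carried by an air stream (kW), from flow and supply/return delta-T."""
    mass_flow = volume_flow_m3_s * AIR_DENSITY            # kg/s
    return mass_flow * AIR_CP * (t_supply_c - t_return_c) # kJ/s == kW

def hvac_efficiency(theoretical_kw: float, consumed_kw: float) -> float:
    """Ratio of theoretical energy needed to energy actually consumed by the unit."""
    return theoretical_kw / consumed_kw

# Hypothetical example: 5 m^3/s of air heated from 18 C to 22 C,
# with the unit consuming 30 kW to deliver it.
needed = air_energy_flow_kw(5.0, 22.0, 18.0)
print(round(needed, 2), round(hvac_efficiency(needed, 30.0), 2))
```

The same flow-times-delta-T calculation applies to the water and steam circuits, with their respective densities and specific heats.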

## Reinforcement Learning (RL)

RL refers to a branch of Artificial Intelligence (AI), which is able to achieve complex goals by maximizing a reward function in real-time. The reward function works much like incentivizing a child with rewards and punishments: the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one — this is the reinforcement. For a detailed introduction to RL frameworks, the interested reader is referred to [5].

Let us take the analogy of a video game. At any point in the game, the player has a set of available actions, within the rules of the game. Each action contributes positively or negatively towards the player’s end goal of winning the game. For instance, with ref. to the game snapshot below (Fig. 2), the RL model might compute that running right will return +5 points, running left none, and jumping −10 (as it will lead to the player dying in the game).
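The analogy boils down to scoring each available action and picking the best one. A minimal sketch, with the hypothetical action values from Fig. 2:

```python
# Minimal sketch of the game analogy: the agent estimates a value for each
# available action and greedily picks the one with the highest return.
# The actions and values mirror the Fig. 2 example and are illustrative.

action_values = {"run_right": 5.0, "run_left": 0.0, "jump": -10.0}

def greedy_action(values: dict) -> str:
    """Return the action with the highest estimated value."""
    return max(values, key=values.get)

print(greedy_action(action_values))  # run_right
```

In practice an RL agent also explores (occasionally trying non-greedy actions) so that its value estimates keep improving; the greedy choice above is only the exploitation half of the picture.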

## RL Formulation

We now map the scenario to our HVAC setting. At any point in time, a factory zone is in a state characterized by the temperature and (relative) humidity values observed inside and outside the zone.

The game rules in this case correspond to the temperature and humidity tolerance levels, which basically mandate that the zone temperature and humidity values should be within the ranges 19–25 degrees and 45–55% respectively. The set of available actions in this case are the Cooling, Heating, Re-heating and Humidifier valve opening percentages (%).

To summarize, given the zone state in terms of the (inside and outside) temperature and humidity values, the RL model needs to decide by how much to open the Cooling, Heating, Re-heating and Humidifier valves. To take an informed decision in this scenario, the RL model needs to first understand the HVAC system behavior: for example, how much of a zone temperature drop can be expected by opening the Cooling valve to X%?
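The state and action spaces described above can be written down concretely. The field names below are illustrative; the tolerance bounds (19–25 degrees, 45–55% relative humidity) come from the formulation in the text.

```python
# Sketch of the RL state/action spaces for the HVAC setting.
from dataclasses import dataclass

@dataclass
class ZoneState:
    temp_in_c: float     # zone temperature
    rh_in_pct: float     # zone relative humidity
    temp_out_c: float    # outside temperature
    rh_out_pct: float    # outside relative humidity

@dataclass
class HvacAction:
    cooling_pct: float       # valve opening percentages, 0-100
    heating_pct: float
    reheating_pct: float
    humidifier_pct: float

def within_tolerance(s: ZoneState) -> bool:
    """The 'game rules': zone temperature and humidity must stay in range."""
    return 19.0 <= s.temp_in_c <= 25.0 and 45.0 <= s.rh_in_pct <= 55.0

print(within_tolerance(ZoneState(22.0, 50.0, 30.0, 70.0)))  # True
```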

Once the RL model understands the HVAC system behavior, the final step is to design the control strategy, or ‘Policy’ in RL terminology. For instance, the RL model now has to choose whether to open the Cooling valve to 25% when the zone temperature reaches 23 degrees, or wait till the zone temperature reaches 24 degrees before opening the Cooling valve to 40%. Note that the longer it waits before opening the valve, the more it contributes towards lowering the energy consumption; however, it then runs the risk of violating the temperature / humidity tolerance levels as the outside weather conditions are always unpredictable. As a result, it might actually have to open the Cooling valve to a higher percentage if it waits longer, consuming more energy. These trade-offs are quantified by a reward function (Equation 1) in RL terminology, which assigns a reward to each possible action based on the following three parameters:

Reward(a) = (weight1 × Setpoint closeness) − (weight2 × Energy cost) − (weight3 × Tolerance violation) (1)

The Energy cost is captured in terms of electricity consumption and CO2 emission. A control strategy is then to decide on the weightage of the three parameters. For instance, a ‘safe’ control strategy would assign a very high negative weightage (penalty) to Tolerance violations, ensuring that they never happen, albeit at a higher Energy cost. Similarly, an ‘energy optimal policy’ would prioritize energy savings over the other two parameters. Setpoint closeness encourages a “business friendly” policy where the RL model attempts to keep the zone temperature as close as possible to the temperature / humidity setpoints, implicitly reducing the risk of violations, but at a higher Energy cost. We opt for a “balanced” control policy which maximizes Energy savings and Setpoint closeness, while minimizing the risk of Tolerance violations.
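Equation (1) is simple enough to state directly in code. The weights are the policy knobs discussed above; the default values below are hypothetical placeholders, not the weights used in the deployed system.

```python
# Direct sketch of the reward in Equation (1).
# Default weights are illustrative; tuning them selects the control policy
# ('safe' = large w3, 'energy optimal' = large w2, 'balanced' = a compromise).

def reward(setpoint_closeness: float,
           energy_cost: float,
           tolerance_violation: float,
           w1: float = 1.0, w2: float = 1.0, w3: float = 10.0) -> float:
    """Reward(a) = w1*closeness - w2*energy - w3*violation."""
    return w1 * setpoint_closeness - w2 * energy_cost - w3 * tolerance_violation

# No violation: the reward trades setpoint closeness against energy cost.
print(reward(0.8, 0.3, 0.0))
# A 'safe' policy raises w3 so any violation dominates the reward.
print(reward(0.8, 0.3, 1.0, w3=100.0))
```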

The RL formulation described above is illustrated in Fig. 3.

# REINFORCEMENT LEARNING BASED HVAC ENERGY OPTIMIZATION

We outline an RL algorithm that outputs how much to open the Heating, Cooling, Humidifier and Re-heating valves at time *t*, based on the current Indoor Temperature and Humidity (at time *t*), and the previous Heating, Cooling, Humidifier, Re-heating valve opening percentage values and Indoor Temperature and Humidity values at time *t-1*. The *rl_hvac* function runs in real-time, computing the new valve opening values every minute.
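The control loop just described can be sketched as follows. This is a hypothetical skeleton: `model.predict`, `read_sensors` and `apply_valves` stand in for the deployed RL model and the building-management interface, and are not real APIs from the paper.

```python
# Hypothetical sketch of the rl_hvac real-time loop: every minute, map the
# current readings (time t) plus the previous readings and valve action
# (time t-1) to new valve opening percentages.
import time

def rl_hvac_step(model, obs_t, obs_prev, action_prev):
    """One control step: features are (current obs, previous obs, previous action)."""
    return model.predict((obs_t, obs_prev, action_prev))

def rl_hvac(model, read_sensors, apply_valves, interval_s: float = 60.0):
    """Run the controller indefinitely, recomputing valve openings every minute."""
    prev_obs, prev_action = read_sensors(), None
    while True:
        obs = read_sensors()                                  # indoor temp / humidity at t
        action = rl_hvac_step(model, obs, prev_obs, prev_action)
        apply_valves(action)                                  # valve opening percentages
        prev_obs, prev_action = obs, action
        time.sleep(interval_s)
```

Keeping the single-step logic in `rl_hvac_step` separates the decision from the scheduling, which makes the controller easy to test offline against logged sensor data.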