Reinforcement Learning based HVAC Optimization in Factories

Original article can be found here (source): Artificial Intelligence on Medium

Abstract. Heating, Ventilation and Air Conditioning (HVAC) units are responsible for maintaining the temperature and humidity settings in a building. Studies have shown that HVAC accounts for almost 50% of the energy consumption in a building and 10% of global electricity usage. HVAC optimization thus has the potential to contribute significantly towards our sustainability goals, reducing energy consumption and CO2 emissions. In this work, we explore ways to optimize the HVAC controls in factories. Unfortunately, this is a complex problem as it requires computing an optimal state considering multiple variable factors, e.g. the occupancy, manufacturing schedule, temperature requirements of operating machines, air flow dynamics within the building, external weather conditions, energy savings, etc. We present a Reinforcement Learning (RL) based energy optimization model that has been applied in our factories. We show that RL is a good fit as it is able to learn and adapt to multi-parameterized system dynamics in real-time. It provides around 25% energy savings on top of the previously used Proportional–Integral–Derivative (PID) controllers.


Heating, Ventilation and Air Conditioning (HVAC) units are responsible for maintaining the temperature and humidity settings in a building. We specifically consider their usage in factories in this work, where the primary goal of the HVAC units is to keep the temperature and (relative) humidity within the prescribed manufacturing tolerance ranges. This needs to be balanced with energy savings and CO2 emission reductions to offset the environmental impact of running them.

Given their prevalence not only in factories but also in homes and office buildings, any efficient control logic has the potential to significantly reduce their environmental impact. Unfortunately, given the complexity of HVAC units, designing an efficient control logic is a hard optimization problem. The control logic needs to consider multiple variable factors, e.g. the occupancy, manufacturing schedule, temperature requirements of operating machines, air flow dynamics within the building, external weather conditions, energy savings, etc., in order to decide how much to heat, cool, or humidify the zone.

The HVAC optimization literature can be broadly divided into two categories: (i) understand the recurring patterns among the optimization parameters to better schedule the HVAC functioning, and (ii) build a simulation model of the HVAC unit and assess different control strategies on the model — to determine the most efficient one.

Examples of the first category include [1, 2] which employ a building thermodynamics model to predict the buildings’ temperature evolution. Unfortunately, this is not really applicable for our factories where the manufacturing workload varies every day, and there is no schedule to be predicted. It is also worth mentioning that most such models only consider one optimization parameter at a time, i.e. control heating / cooling to regulate the temperature; whereas in our case, we need to regulate both the temperature and (relative) humidity simultaneously to maintain the optimal manufacturing conditions.

The second category of model-based approaches applies to both PID and RL controllers. PID (proportional integral derivative) controllers [3] use a control loop feedback mechanism to control process variables. Unfortunately, PID based controllers require extensive calibration with respect to the underlying HVAC unit, to be able to effectively control them. [4] outlines one such calibration approach for PIDs based on a combination of simulation tools.

Reinforcement Learning (RL) [5] based approaches [6, 7] have recently been proposed to address such problems given their ability to learn and optimize multi-parameterized systems in real-time. An initial (offline) training phase is required for RL based approaches, as training an RL algorithm in live settings (online) can take time to converge leading to potentially hazardous violations as the RL agent explores its state space. [6, 7] outline solutions to perform this offline training based on EnergyPlus based simulation models of the HVAC unit. EnergyPlus [8] is an open source HVAC simulator from the US Department of Energy that can be used to model both energy consumption — for heating, cooling, ventilation, lighting and plug and process loads — and water use in buildings. Unfortunately, developing an accurate EnergyPlus based simulation model of an HVAC unit is a non-trivial, time consuming and expensive process; and as such is a blocker for their use in offline training.

In this work, we propose an efficient RL based HVAC optimization algorithm that is able to learn and adapt to a live HVAC system in weeks. The algorithm can be deployed independently, or as a ‘training module’ to generate data that can be used to perform offline training of an RL model — to further optimize the HVAC control logic. This allows for a speedy and cost-effective deployment of the developed RL model. In addition, the model output in [6, 7] is the optimal temperature and humidity setpoints, which then rely on the HVAC control logic to ensure that the prescribed setpoint is achieved in an efficient manner. In this work, we propose a more granular RL model, whose output is the actual valve opening percentages of the Heating, Cooling, Re-heating and Humidifier units. This enables a more self-sufficient approach, with the RL output bypassing (removing any dependency and making redundant) any in-built HVAC control logic — allowing for a more vendor (platform) agnostic solution.

The rest of the paper is organized as follows: Section 2 introduces the basics, providing an RL formulation of the HVAC optimization problem. Section 3 outlines the RL logic that is initially deployed to generate the training data for offline training, leading to the (trained) RL model in Section 4 providing the recommended valve opening percentages in real-time. In Section 5, we provide some benchmarking results of the developed RL model that has been deployed in one of our factory zones. Initial studies show that we are able to achieve 25% energy savings over the previously existing PID controller logic. Section 6 concludes the paper and provides directions for future work.



Fig. 1 shows the energy balance of the HVAC unit. In simplified terms, the HVAC unit has to bring the mix of Recirculation air and Fresh air to the temperature and humidity needed to maintain the area temperature and (relative) humidity at the required level. The theoretical energy needed is easy to monitor: it is the difference between the energy of the outgoing air and that of the incoming air. Comparing this amount with the energy actually consumed by the unit gives the energy efficiency of the HVAC unit.

The energy flows can be determined from the flow of the media (air, hot water, cold water, steam), the temperature difference between the supply and return of each medium, and the consumed electrical energy.
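As a rough illustration of this energy balance, the thermal power carried by a medium can be estimated from its mass flow, its specific heat capacity and the supply/return temperature difference. The function names, the water heat capacity constant and the sample figures below are assumptions made for illustration; they are not measurements from our installation.

```python
# Specific heat capacity of liquid water in kJ/(kg*K); a standard textbook value.
CP_WATER_KJ_PER_KG_K = 4.186


def thermal_power_kw(mass_flow_kg_s: float, cp_kj_per_kg_k: float,
                     supply_temp: float, return_temp: float) -> float:
    """Thermal power (kW) exchanged by a medium: m_dot * cp * |T_supply - T_return|."""
    return mass_flow_kg_s * cp_kj_per_kg_k * abs(supply_temp - return_temp)


def hvac_efficiency(theoretical_kw: float, consumed_kw: float) -> float:
    """Ratio of the theoretically needed power to the power actually consumed."""
    if consumed_kw <= 0:
        raise ValueError("consumed power must be positive")
    return theoretical_kw / consumed_kw


# Hypothetical example: 0.5 kg/s of hot water, supplied at 60 and returned at 50 degrees.
heating_kw = thermal_power_kw(0.5, CP_WATER_KJ_PER_KG_K, 60.0, 50.0)
```

The same helper can be applied to each medium in turn, with the appropriate heat capacity, and the results summed before computing the efficiency ratio.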

Figure 1. HVAC Control

Reinforcement Learning (RL)

RL refers to a branch of Artificial Intelligence (AI), which is able to achieve complex goals by maximizing a reward function in real-time. The reward function works similarly to incentivizing a child with candy and spankings, such that the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one; this is reinforcement. For a detailed introduction to RL frameworks, the interested reader is referred to [5].

Let us take the analogy of a video game. At any point in the game, the player has a set of available actions, within the rules of the game. Each action contributes positively or negatively towards the player’s end goal of winning the game. For instance, with reference to the game snapshot below (Fig. 2), the RL model might compute that running right will return +5 points, running left none, and jumping -10 (as it will lead to the player dying in the game).

Figure 2. Reinforcement Learning basics

RL Formulation

We now map the scenario to our HVAC setting. At any point in time, a factory zone is in a state characterized by the temperature and (relative) humidity values observed inside and outside the zone.

The game rules in this case correspond to the temperature and humidity tolerance levels, which basically mandate that the zone temperature and humidity values should be within the range: 19–25 degrees and 45–55% respectively. The set of available actions in this case are the Cooling, Heating, Re-heating and Humidifier valve opening percentages (%).

To summarize, given the zone state in terms of the (inside and outside) temperature and humidity values, the RL model needs to decide by how much to open the Cooling, Heating, Re-heating and Humidifier valves. To take an informed decision in this scenario, the RL model first needs to understand the HVAC system behavior, e.g. how much of a zone temperature drop can be expected by opening the Cooling valve to X%.
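The state and action spaces described above could be represented as follows. This is only a sketch: the class and field names are hypothetical, and the tolerance ranges are the ones quoted in the text.

```python
from dataclasses import dataclass

# Tolerance ranges quoted in the text.
TEMP_RANGE = (19.0, 25.0)       # degrees
HUMIDITY_RANGE = (45.0, 55.0)   # percent relative humidity


@dataclass(frozen=True)
class ZoneState:
    """Observed conditions inside and outside the factory zone."""
    indoor_temp: float
    indoor_humidity: float
    outdoor_temp: float
    outdoor_humidity: float

    def within_tolerance(self) -> bool:
        """True if the indoor conditions respect both tolerance ranges."""
        return (TEMP_RANGE[0] <= self.indoor_temp <= TEMP_RANGE[1]
                and HUMIDITY_RANGE[0] <= self.indoor_humidity <= HUMIDITY_RANGE[1])


@dataclass(frozen=True)
class ValveAction:
    """Valve opening percentages chosen by the RL model."""
    cooling: float
    heating: float
    reheating: float
    humidifier: float

    def __post_init__(self) -> None:
        # Every opening is a percentage, so it must lie in [0, 100].
        for name, value in vars(self).items():
            if not 0.0 <= value <= 100.0:
                raise ValueError(f"{name} opening must be in [0, 100], got {value}")
```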

Once the RL model understands the HVAC system behavior, the final step is to design the control strategy, or ‘Policy’ in RL terminology. For instance, the RL model now has to choose whether to open the Cooling valve to 25% when the zone temperature reaches 23 degrees, or wait till the zone temperature reaches 24 degrees before opening the Cooling valve to 40%. Note that waiting longer before opening the valve contributes positively towards lowering the energy consumption; however, it also runs the risk of violating the temperature / humidity tolerance levels, as the outside weather conditions are always unpredictable. As a result, the model might actually have to open the Cooling valve to a higher percentage if it waits longer, consuming more energy. These trade-offs are quantified by a reward function (Equation 1) in RL terminology, which assigns a reward to each possible action based on the following three parameters:

Reward(a) = (weight1 × Setpoint closeness) − (weight2 × Energy cost) − (weight3 × Tolerance violation)     (1)

The Energy cost is captured in terms of electricity consumption and CO2 emission. A control strategy is then to decide on the weightage of the three parameters. For instance, a ‘safe’ control strategy would assign a very high negative weightage (penalty) to Tolerance violations, ensuring that they never happen, albeit at a higher Energy cost. Similarly, an ‘energy optimal policy’ would prioritize energy savings over the other two parameters. Setpoint closeness encourages a “business friendly” policy where the RL model attempts to keep the zone temperature as close as possible to the temperature / humidity setpoints, implicitly reducing the risk of violations, but at a higher Energy cost. We opt for a “balanced” control policy which maximizes Energy savings and Setpoint closeness, while minimizing the risk of Tolerance violations.
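Equation 1 and the weight-based control strategies can be sketched as below. The weight values and the normalized inputs are placeholders chosen for illustration; they are not the weights used in our deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Weights of the three reward terms in Equation 1."""
    setpoint_closeness: float
    energy_cost: float
    tolerance_violation: float


def reward(closeness: float, energy_cost: float, violation: float,
           w: RewardWeights) -> float:
    """Equation 1: reward closeness to the setpoint, penalize energy cost and violations."""
    return (w.setpoint_closeness * closeness
            - w.energy_cost * energy_cost
            - w.tolerance_violation * violation)


# Hypothetical weight profiles for the strategies named in the text.
SAFE = RewardWeights(setpoint_closeness=1.0, energy_cost=1.0, tolerance_violation=100.0)
ENERGY_OPTIMAL = RewardWeights(setpoint_closeness=1.0, energy_cost=10.0, tolerance_violation=1.0)

# The same action scored under both profiles (inputs are made-up, normalized values).
safe_score = reward(closeness=0.8, energy_cost=0.5, violation=1.0, w=SAFE)
energy_score = reward(closeness=0.8, energy_cost=0.5, violation=1.0, w=ENERGY_OPTIMAL)
```

Under the ‘safe’ profile, an action that causes a violation scores far lower than under the ‘energy optimal’ profile, which is what steers the policy away from violations at the expense of energy.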

The RL formulation described above is illustrated in Fig. 3.

Figure 3. HVAC Reinforcement Learning formulation


We outline an RL algorithm that outputs how much to open the Heating, Cooling, Humidifier and Re-heating valves at time t, based on the current Indoor Temperature and Humidity (at time t), and the previous Heating, Cooling, Humidifier, Re-heating valve opening percentage values, Indoor Temperature and Humidity values at time t-1. The rl_hvac function runs in real-time, computing the new valve opening values every 1 min.

The RL logic can be explained as follows: Recall that the temperature and (relative) humidity setpoints that we would like to maintain are 22 degrees and 50%, with allowed tolerance ranges of ±3 degrees and ±5% respectively. At every iteration (1 min), the rl_hvac function determines which valve(s) to open based on the below control logic:

Knowing which valve(s) to open, how much to open each valve depends on the reward value, computed as a measure of the ‘effectiveness’ of the previous (output) valve openings. For instance, let us assume that the indoor temperature is currently 20.5 degrees (below the temperature setpoint), which implies that the Heating valve needs to be opened. During the previous iteration, the indoor temperature was also below the setpoint, say 21.0 degrees, leading the rl_hvac function to recommend opening the Heating valve to, say, 15%. Given that the current indoor temperature is even lower (20.5 degrees), we infer that the previous Heating valve opening was not sufficiently effective. We therefore assign it a negative reward and heat more, by an amount proportional to the difference between the current and previous indoor temperature. The behavior of the other valves can be explained analogously. This is reinforcement, and it ensures that the valves are able to efficiently balance the ‘setpoint closeness’ and ‘energy cost’ parameters of the reward function (Equation 1).
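The heating example above can be sketched as a single update step. The function name, the default value of the step constant h_iW, and the choice to hold the opening when the temperature is already recovering are assumptions for illustration rather than the deployed rl_hvac logic.

```python
def next_heating_opening(prev_opening: float, prev_temp: float, curr_temp: float,
                         setpoint: float = 22.0, h_iW: float = 10.0) -> float:
    """One reactive update of the Heating valve opening (in percent)."""
    if curr_temp >= setpoint:
        # Assumption: no heating is needed at or above the setpoint.
        return 0.0
    # Reward of the previous opening: negative if the temperature dropped further.
    reward = curr_temp - prev_temp
    if reward < 0:
        # Not effective enough: open more, proportionally to the temperature drop.
        opening = prev_opening + h_iW * (-reward)
    else:
        # Assumption: hold the opening while the temperature is recovering.
        opening = prev_opening
    # Keep the result a valid percentage.
    return max(0.0, min(100.0, opening))


# The example from the text: 21.0 -> 20.5 degrees with the valve previously at 15%.
new_opening = next_heating_opening(prev_opening=15.0, prev_temp=21.0, curr_temp=20.5)
```

With the hypothetical step constant of 10, a further drop of 0.5 degrees raises the Heating valve opening from 15% to 20%.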

The remaining parameter of the reward function is the ‘tolerance violation’ where a penalty needs to be imposed if the indoor temperature / humidity violates the allowed tolerance ranges. A violation in our case implies that the respective valve(s) needs to react faster. This is accommodated by the step increment constants: h_iW, c_iW, u_iW, r_iW. We adjust them in an offline fashion such that their values are adapted if the number of tolerance violations exceeds a certain threshold during a given period.
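A minimal sketch of this offline adjustment is shown below. The growth factor, the cap and the helper names are hypothetical; only the idea of adapting a step constant once the violation count exceeds a threshold comes from the text.

```python
from typing import Sequence


def count_violations(readings: Sequence[float], setpoint: float, tolerance: float) -> int:
    """Number of readings that fall outside setpoint +/- tolerance."""
    return sum(1 for value in readings if abs(value - setpoint) > tolerance)


def adapt_step_constant(step: float, violations: int, threshold: int,
                        growth: float = 1.2, max_step: float = 50.0) -> float:
    """Increase a step increment constant (e.g. h_iW) if violations exceed the threshold."""
    if violations > threshold:
        # Too many violations: let the valve react faster, up to a cap.
        return min(step * growth, max_step)
    return step


# Hypothetical temperature log for one period, checked against 22 +/- 3 degrees.
temps = [21.0, 25.5, 18.0, 22.0]
n_violations = count_violations(temps, setpoint=22.0, tolerance=3.0)
new_h_iW = adapt_step_constant(step=10.0, violations=n_violations, threshold=1)
```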


In this section, we extend the RL model to accommodate ‘long term rewards’, referred to as the Q-value in RL terminology. The Q-value is defined as a weighted sum of the expected values of the rewards of all future steps starting from the current state. Recall that the reward function in the RL algorithm outlined in the previous section is stochastic, in the sense that it only depends on the last state values.
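To make the notion of a weighted sum of future rewards concrete, the snippet below computes the discounted return of one observed reward sequence; the Q-value is the expectation of this quantity over possible futures. The discount factor and the reward values are illustrative assumptions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of rewards weighted by gamma**k, where k is the number of steps ahead."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two negative rewards followed by a positive one.
episode_rewards = [-0.5, -0.2, 1.0]
long_term = discounted_return(episode_rewards, gamma=0.9)
short_term = discounted_return(episode_rewards, gamma=0.0)
```

With gamma set to 0 only the immediate reward counts, which corresponds to the single-step view of the previous section; a larger gamma lets later improvements offset an initially negative reward.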

To accommodate ‘long term rewards’, we extend our original problem to a continuous space setting. Each episode in this setting corresponds to the period from when the indoor temperature and (or) humidity starts moving away from their respective setpoints, to the time that the indoor conditions return to their setpoint values, as a result of opening the relevant valve(s).

Figure 4. Valve tipping point computation

Let us now focus on one such episode (in such a continuous space setting). Given that the stochastic RL algorithm (in Section 3) always starts opening the valves at 0.0%, the temperature and (or) humidity deviation from the setpoint keeps increasing, until the valve opening percentage reaches the tipping point, after which the deviation starts decreasing again until it becomes 0. This episodic behavior is illustrated in Fig. 4. For the sake of simplicity, we have only shown the Temperature — Cooling curve, however an analogous behavior can be anticipated for the other scenarios, including those involving Humidity. The energy cost in Fig. 4 corresponds to the shaded region. Given this behavior, it is easy to see that if we knew the Cooling tipping point at 22.3 degrees, we could have opened the Cooling valve earlier — leading to a lower energy cost (depicted by dashed shaded region). The caveat here is that the tipping point needs to be estimated properly for all the valves, otherwise opening a valve to more than the tipping point percentage might actually lead to a higher energy cost.

In the sequel, we show how the data generated by the RL algorithm in Section 3 can be used as training data to develop a model that predicts the ‘tipping point’ of the valves for each state (indoor temperature, humidity of the factory zone). The algorithm output can be considered as consisting of the following input and output values, for each time point t: (Indoor Temperature, Indoor Humidity, Heating valve opening%, Cooling valve opening%, Humidifier valve opening%, Re-heating valve opening%). We apply the below filtering criteria (only illustrated for humidity) on the output data to extract the training data:

The filter aims to identify ‘episodes’, focusing on the time points where the indoor humidity starts converging towards the setpoint, after a period of increased deviation from it. Needless to say, the valve opening percentages at these time points correspond to the ‘tipping points’ for those states (indoor temperature, humidity). We train 4 models: h_model, u_model, r_model, c_model based on this training (filtered) data, to predict the ‘tipping point’ values of the Heating, Humidifier, Re-heating, Cooling valves respectively. The trained models are then embedded in our RL algorithm as follows (we only show the ‘Heat and Humidify’ block for illustration):
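The filtering and training steps can be approximated as follows for the humidity case. The log format, the peak-detection rule and the nearest-neighbour predictor are assumptions standing in for the actual filter and for u_model, whose model type is not specified here.

```python
from typing import List, Sequence, Tuple

# One log entry per minute: (indoor_temp, indoor_humidity, humidifier_opening_pct).
LogEntry = Tuple[float, float, float]
Sample = Tuple[Tuple[float, float], float]


def extract_tipping_points(log: Sequence[LogEntry], setpoint: float = 50.0) -> List[Sample]:
    """Keep the time points where the humidity deviation peaks and starts shrinking."""
    samples: List[Sample] = []
    for prev, curr, nxt in zip(log, log[1:], log[2:]):
        d_prev = abs(prev[1] - setpoint)
        d_curr = abs(curr[1] - setpoint)
        d_next = abs(nxt[1] - setpoint)
        # Deviation was growing and now starts to converge: a tipping point.
        if d_prev < d_curr and d_next < d_curr:
            samples.append(((curr[0], curr[1]), curr[2]))
    return samples


def predict_tipping_point(samples: Sequence[Sample], temp: float, humidity: float) -> float:
    """Stand-in model: return the opening of the nearest recorded state."""
    nearest = min(samples,
                  key=lambda s: (s[0][0] - temp) ** 2 + (s[0][1] - humidity) ** 2)
    return nearest[1]


# Hypothetical log of one episode: humidity drifts down to 47.5% and then recovers.
log = [(22.0, 50.0, 0.0), (22.0, 49.0, 5.0), (22.0, 48.0, 10.0),
       (22.0, 47.5, 15.0), (22.0, 48.5, 15.0), (22.0, 50.0, 10.0)]
training_data = extract_tipping_points(log)
predicted = predict_tipping_point(training_data, temp=22.0, humidity=48.0)
```

Any regression model could replace the nearest-neighbour lookup; the important part is that each training sample pairs a state with the opening at which the deviation began to shrink.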

With this update (in italics), the RL algorithm is able to bootstrap the valve opening percentages, so that each episode will start with the respective ‘tipping point’ values (instead of starting from 0.0%) provided by the trained models — leading to a lower energy cost as depicted in Fig. 4.
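The bootstrapping step might look like the sketch below, in which a trained model supplies the starting opening of an episode. The function signature and the callable model interface are hypothetical and are not the interface of our h_model or u_model.

```python
from typing import Callable, Optional

# A tipping-point model maps (indoor_temp, indoor_humidity) to an opening percentage.
TippingModel = Callable[[float, float], float]


def initial_opening(prev_opening: float, indoor_temp: float, indoor_humidity: float,
                    tipping_model: Optional[TippingModel] = None) -> float:
    """Opening to start from: the predicted tipping point at the start of an episode."""
    if prev_opening > 0.0:
        # Episode already in progress: continue from the previous opening.
        return prev_opening
    if tipping_model is None:
        # No trained model available: fall back to starting from 0.0%.
        return 0.0
    predicted = tipping_model(indoor_temp, indoor_humidity)
    # Keep the prediction a valid percentage.
    return max(0.0, min(100.0, predicted))


# Hypothetical stand-in for a trained Heating model.
def toy_h_model(temp: float, humidity: float) -> float:
    return 12.5


start = initial_opening(0.0, indoor_temp=21.0, indoor_humidity=50.0,
                        tipping_model=toy_h_model)
```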


The developed RL model has been deployed in a zone of our factory in Romania. The designated zone has five (similar) HVAC units; the schematic of one such HVAC unit is illustrated in Fig. 5.

Figure 5. HVAC schematics

For this zone, we first present (Fig. 6) the indoor conditions and HVAC valve opening percentages of running the HVAC with a PID controller for a week (~10,000 readings, corresponding to a reading every minute).

Figure 6. PID based HVAC control readings

For the same zone, we then ran the HVAC units controlled by the RL model during the following week. We ensured that the manufacturing workload was similar for both weeks. The results are presented in Fig. 7.

Figure 7. RL based HVAC control readings

Comparing the average valve opening percentages (highlighted by the red bounding boxes in Fig. 6 and Fig. 7), we can see that all the RL based valve opening percentages are lower; ranging from 10% savings for the Heating valve to almost 45% savings for the Re-heating valve — leading to 25% savings on average.


In this work, we considered the problem of HVAC energy optimization in factories, which has the potential of making a significant environmental impact in terms of energy savings and reduction in CO2 emissions. To address the problem complexity, we outlined an RL based HVAC controller that is able to learn and adapt to real-life factory settings, without the need for any offline training. To the best of our knowledge, this is one of the first works to report on a live deployment of an RL-HVAC model in an actual factory. We provided benchmarking results that show the potential to save up to 25% in energy consumption.

For simplicity, we have considered energy savings as proportional to the valve opening percentages (the lower the better). In reality, the energy consumption and CO2 emissions of the different valves may not be proportional, i.e. depending on the underlying mechanism, opening the Heating and Cooling valves by the same percentage may not consume the same amount of energy. We leave as future work adapting the RL logic to different valve characteristics, in terms of their energy consumption and CO2 emissions.


[1] F. Oldewurtel et al. Energy efficient building climate control using stochastic model predictive control and weather predictions. ACC, 2010.

[2] Y. Ma et al. Model predictive control for the operation of building cooling systems. IEEE Transactions on Control Systems Technology, 20(3):796–803, 2012.

[3] F. Peacock. An Idiot’s Guide to the PID Algorithm.

[4] C. Blasco et al. Modelling and PID control of HVAC System according to Energy Efficiency and Comfort Criteria. In: Sustainability in Energy and Buildings. Smart Innovation, Systems and Technologies, vol 12, 2012.

[5] M.E. Harmon and S.S. Harmon. Reinforcement Learning: A Tutorial.

[6] T. Wei et al. Deep reinforcement learning for building HVAC control. In: Proceedings of the 54th Annual Design Automation Conference, p. 22, 2017.

[7] T. Moriyama et al. Reinforcement Learning Testbed for Power-Consumption Optimization. In: Proceedings of the 18th Asia Simulation Conference (AsiaSim), pp. 45–59, 2018.

[8] EnergyPlus™,