Can You Predict How the Coronavirus Spreads?

Original article can be found here (source): Artificial Intelligence on Medium

Data, AI and predictive modeling in pandemics

On 11 March 2020, the World Health Organization (WHO) officially declared the coronavirus outbreak a pandemic. People around the world are scrambling to contain the virus, and no one is really sure what the future holds. Numerous articles have speculated about how the disease will develop, with some scientists claiming that the worldwide peak will come next winter. Whether they are predicting the rate of growth or attempting pre-emptive detection, many are trying to predict the future using what we call models.

How reliable are these models? To answer that question, we need to understand how they are derived.

Creating a Model

A mathematical model essentially helps to illustrate a cause and effect relationship. There are two components to a model: the independent and dependent variables.

Here, let’s use a very simple mathematical model to illustrate the point. A linear regression model suffices, since we can easily plot it in Excel or even by hand. It needs no complicated software and is intuitive to understand. Below is a graph I have plotted:

Source: Yang Chun Wei, based on a self-generated graphical illustration (Graph 1)

Assume that I am the owner of a small fruit store who wants to evaluate how well apples are selling. As a hypothesis, I predict that my total apple sales (y-axis) will be proportional to the number of months my store has been in operation (x-axis). Mathematically, the line has the equation y = 10x.

Such perfect data is merely a fantasy. In reality, we usually get something like this:

Source: Yang Chun Wei, data points for Graph 2 below (Table 1)

This is then plotted against a linear regression line (in light grey) to estimate the future sales of apples.

Source: Yang Chun Wei, an illustration of regression based on Table 1 (Graph 2)

Here, we can see that the line of best fit closely follows what we hypothesized, with an equation of y = 10.2x - 0.6667. The R² value, also known as the coefficient of determination, is also extremely close to 1, which means that the recorded data fits the linear trend almost perfectly.
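For readers who would rather compute the line of best fit than read it off a chart, here is a minimal sketch of an ordinary least squares fit and its R² value in plain Python. The data points are hypothetical stand-ins and are not the values from Table 1, and the function names `fit_line` and `r_squared` are chosen only for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical monthly figures, NOT the data from Table 1.
months = [1, 2, 3, 4, 5, 6]
apples_sold = [9, 21, 29, 42, 50, 61]

slope, intercept = fit_line(months, apples_sold)
r2 = r_squared(months, apples_sold, slope, intercept)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.4f}")
```

An R² close to 1 only says the points sit near a straight line. It says nothing about whether the line will keep holding in months that have not been observed yet.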

In this context, the independent variable is the number of months the store has been in operation, which is essentially time. This independent variable (the cause) affects the dependent variable, the total number of apples sold (the effect). Whenever x changes, y changes as well, which is the cause-and-effect relationship here.

However, we must be wary of jumping to conclusions. The caveat is that such models usually show only correlation, not causation.

Shortfalls of Data

Correlation vs Causation

In statistics, there is a famous phrase: “correlation does not imply causation”. It refers to the inability to legitimately deduce a cause-and-effect relationship between two variables solely on the basis of an observed association or correlation between them.

Imagine data showing that whenever ice cream sales increase, more children are born. The correlation may raise some eyebrows, but it would be unfair to treat the two as cause and effect.
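One common way such a correlation arises is a hidden third factor that drives both quantities. The sketch below simulates that situation with made-up numbers: two series that never influence each other still end up strongly correlated because both depend on the same hidden driver. The driver, the coefficients and the noise levels are all assumptions chosen for illustration, not measurements.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden driver (a seasonal factor, say). Purely simulated.
hidden = [rng.uniform(0, 10) for _ in range(500)]

# Each series depends on the hidden driver plus its own independent noise.
# Neither series appears in the other's formula.
noise_a = [rng.gauss(0, 2) for _ in hidden]
noise_b = [rng.gauss(0, 4) for _ in hidden]
series_a = [3 * h + e for h, e in zip(hidden, noise_a)]
series_b = [5 * h + e for h, e in zip(hidden, noise_b)]

r_ab = pearson(series_a, series_b)
r_noise = pearson(noise_a, noise_b)
print(f"correlation between A and B: {r_ab:.2f}")
print(f"correlation once the hidden driver is removed: {r_noise:.2f}")
```

The two series are strongly correlated, yet once the shared driver is taken out, what remains is unrelated noise. A model fitted on A and B alone would have no way of telling the difference.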

As with any form of modeling, there is a danger of confirmation bias and the burden of imperfect information. This is why scientists and mathematicians are careful and deliberate in choosing the right data to ensure that their model is reliable.

Limitations With the Understanding of the Virus

With any model, certain assumptions always have to be made. As of this writing, we are still assuming that the chance of reinfection is close to zero. If this assumption does not hold, the implications for our models will be serious, and existing ones will become inaccurate.

Predicting the Pandemic

For such complex situations, there is definitely more than one independent variable. Beyond time and space, people might look to other variables such as population density or the probability of interaction. What we aim to predict also varies from person to person. Some might want to understand where the virus might spread, while others may simply be concerned about the rate at which it does.

Models Based off Outbreak Analysis

In the news, we most commonly see these topics being discussed:

“When will the number of cases peak?”

“When will it end?”

“How much does it spread over time?”

The mathematical model thus typically shows the total number of cases against time. In the early phases of the spread, we are usually presented with a curve that resembles exponential growth, as in the case of Italy:

Source: Yang Chun Wei, with data extracted from

However, true exponential growth is not possible in real life for a very practical reason. There is a theoretical endpoint: the case where every member of the population is infected. As more individuals get infected, the number of people they can still spread the virus to decreases, given the constraint of a fixed population. Hence, the rate at which the coronavirus spreads eventually decreases, and the curve departs from exponential growth.

Eventually, the total number of cases will peak and plateau, either when the number of people infected reaches the population limit or when the disease starts to be contained, as in the case of China:
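The effect of a fixed population can be sketched with a simple day-by-day simulation. In the version below, the "limited" curve scales each day's new cases by the fraction of the population that is not yet infected, which is a discrete form of logistic growth. That is a standard textbook simplification used here as an assumption, not a model taken from any of the sources above, and every parameter value is hypothetical.

```python
def simulate(population, initial_cases, growth_rate, days, limited):
    """Return the total case count for each day under a simple growth rule.

    limited=False: every existing case produces growth_rate new cases per day.
    limited=True:  new cases are also scaled by the share of the population
                   that is still uninfected (discrete logistic growth).
    """
    cases = float(initial_cases)
    history = [cases]
    for _ in range(days):
        new_cases = growth_rate * cases
        if limited:
            new_cases *= 1 - cases / population
        cases += new_cases
        history.append(cases)
    return history


# All parameters are hypothetical and chosen only for illustration.
POPULATION = 1_000_000
unlimited = simulate(POPULATION, 10, 0.25, 80, limited=False)
limited = simulate(POPULATION, 10, 0.25, 80, limited=True)

for day in (10, 40, 60, 80):
    print(f"day {day:2d}: unlimited={unlimited[day]:>14,.0f}  "
          f"limited={limited[day]:>10,.0f}")
```

Early on, the two curves are almost indistinguishable, which is why early data looks exponential. Later, the unlimited curve overshoots the whole population while the limited curve levels off below it.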

Source: Yang Chun Wei, with data extracted from

Ideally, we want to reach the inflection point as soon as possible. This is where the growth in the number of cases begins to slow down, easing the curve and “flattening” it.

Mathematical models that extrapolate from the number of present cases merely serve as an estimate of how serious the problem will become. This is far from the most accurate kind of model, and it’s unlikely that any accurate predictive model can be built on approximately two months’ worth of data.
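On a curve of total cases, the inflection point is the day on which the daily increase is largest; after it, each day adds fewer new cases than the day before. A minimal sketch of locating it is shown below. The case counts are invented for illustration, and the function names are hypothetical.

```python
def daily_new_cases(cumulative):
    """Day-over-day increases in a list of cumulative totals."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def inflection_day(cumulative):
    """Index of the day that closes the largest single-day increase."""
    increases = daily_new_cases(cumulative)
    peak = max(range(len(increases)), key=increases.__getitem__)
    return peak + 1  # increases[i] is the jump from day i to day i + 1


# Hypothetical cumulative case counts, not real data.
cumulative_cases = [1, 2, 4, 8, 15, 27, 45, 68, 90, 106, 116, 121, 123, 124]

print("daily new cases:", daily_new_cases(cumulative_cases))
print("inflection at day index:", inflection_day(cumulative_cases))
```

Real reporting data is far noisier than this, so in practice the daily increases would need smoothing (a rolling average, for example) before the peak means anything. The inflection point can also only be confirmed in hindsight, once several later days have come in lower.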

Models Based on Unstructured Data

Recently, the Canadian artificial intelligence firm BlueDot has been in the news for predicting the outbreak well before anyone else did. What makes it unique is that it takes in news stories in various languages, flight information and much more as data. At that scale, only AI can realistically take such unstructured information and make sense of it.

The power of AI lies in the power of synthesis. Computers are vastly superior to humans in both memory and the ability to process vast amounts of information.

While we can understand the conclusions AI provides, the process is generally a black box to people. This can be an undesirable side effect of AI, because it is essential to understand the reasoning that leads to a given conclusion. If we do not, we are simply blind followers, no different from people who let fortune-tellers decide their fate.

Good AI systems allow you to make your own judgment with regard to their predictions.