Sweeping towards better Coronavirus Forecasting

Original article was published on Deep Learning on Medium

Utilizing the latest advances in machine learning for COVID-19 forecasting

An image of one of our model’s projections for Bronx County in NYC starting May 9th. Actual cases are shown in orange and our model’s projections are shown in blue. You can see that our model accurately forecasts the general downward trajectory; however, it still does not capture all the daily noise.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

With COVID-19 sweeping across the world, and people (particularly Americans) anxious to get back to work, now more than ever we need models to effectively forecast the spread of COVID-19. However, at present many models have performed poorly at estimating the disease spread and the overall impacts of social distancing. In this article I will review why disease forecasting models are useful, problems/limitations with the approaches taken by current models, and the benefits and barriers of using deep learning approaches.

Why we need better forecasts

Before jumping into the problems of current COVID-19 forecasting models, let’s look at why they are useful:

  • (1) Informing policymakers about the impacts of social distancing
  • (2) Identifying specific counties/cities/towns with the highest risks
  • (3) Helping hospitals plan in terms of staffing, beds, and equipment
  • (4) Informing virologists and epidemiologists of what factors to study

(1) A good model should effectively inform elected officials and policymakers about coronavirus and its impact on their communities. Specifically, the model should offer not just forecasts, but also actionable insights about how different government policies will affect the number of new cases and deaths, and how the virus will burden hospitals. The model should also operate based on the observed values, in contrast to ideal scenarios. This is an area where a data-driven approach, as opposed to a purely mathematical one, makes sense. For instance, a model that took social distancing policies as inputs and then assumed a certain outcome based on citizens observing the policy probably wouldn’t perform well in reality. Instead, models need to learn the causal effect of government social distancing policies: first, how the policies relate to mobility data, and second, how they relate to actual virus transmission.

(2) Another issue revolves around forecasting for specific localities. While models predicting cases and deaths at the national level have some utility in enabling broad estimates of the number of PPE and ventilators required, they don’t have much further use. The US has many different states, each with diverse populations, resources, and climates. Even forecasting at the state level can be too broad in certain circumstances. For instance, how the virus spreads around NYC is likely very different than how it spreads in upstate counties. Similarly, how the virus spreads in Chicago would likely differ from how it spreads in Carbondale. Forecasts at the county or the city/town level are what provide the most value to local officials. These forecasts enable cities/towns to plan their response based on their risk alone. This could potentially enable more areas to open as well. For instance, if models predict a high risk in a city/town, but a low risk in another area on the opposite side of the state, then it makes little sense to shut the latter down but allow the former to remain open (of course you would want to also make sure there was not a lot of travel between these areas as well).

(3) Ideally, models could also forecast the number of hospital admissions for a specific county or even better for a specific city/town on a daily basis. This would enable hospitals to plan staffing, emergency response personnel, and acquire PPE in an optimal way. Particularly, if the model predicts a peak a week or two in advance, then the hospitals have plenty of time to prepare a response. Additionally, forecasting ICU bed utilization is critical for constructing field hospitals in a timely manner.

(4) Finally, good models can potentially highlight various transmission mechanics that prompt further study. For instance, hypothetically speaking, if a model found UV light to be a major factor in spread, that could prompt additional epidemiology research. Of course, here we must be very careful that the model does not merely learn noise. However, machine learning does have the potential to play a central role in finding new research directions in a variety of research areas, including virology.

Difficulties with the majority of current models

Nate Silver, in the article “Why It’s So Freaking Hard To Make A Good COVID-19 Model,” enumerates the many difficulties with modeling COVID-19. However, many of these difficulties stem from trying to compose exact mathematical/statistical formulas for forecasting COVID-19. Silver discusses using variables like the asymptomatic rate and infection rate. This is problematic because in the classical SEIR model we have to assume certain values for these parameters. Additionally, models like IHME’s primarily rely on curve fitting. IHME’s CurveFit tool requires a wide variety of variables to be supplied to the model. IHME first utilizes a spline to fit the COVID-19 curve before relying on a number of additional tweaks to fit the data. For those unaware, a spline is a piecewise polynomial function meant to draw a smooth curve. This method still requires a lot of manual parameters and features.

A Vox article from April noted some of the many problems related to the IHME model. Specifically, the article discusses changes to the model’s confidence intervals and problems with the updating forecasts. For instance, Vox notes that “The model assumes deaths will increase and then come back down.” While this may be true broadly speaking, when forecasting at the daily level there is a lot of noise and there are differences in reporting. In addition, there are many different factors that could stop the real data from having such a trajectory. Models have to be robust enough to handle this noise without relying on such general assumptions.
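To make the point about assumed parameters concrete, here is a minimal sketch of the classical SEIR model, integrated with simple Euler steps. The parameter values (`beta`, `sigma`, `gamma`) and the population size are illustrative assumptions, not estimates for COVID-19 and not values taken from any of the models discussed here.

```python
def simulate_seir(population, beta, sigma, gamma, days, initial_infected=1, dt=0.1):
    """Euler-integrate the SEIR equations; return one (S, E, I, R) tuple per day."""
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt  # S -> E
            new_infectious = sigma * e * dt               # E -> I
            new_recovered = gamma * i * dt                # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Two runs that differ only in the assumed transmission rate beta.
low = simulate_seir(1_000_000, beta=0.3, sigma=1 / 5, gamma=1 / 10, days=60)
high = simulate_seir(1_000_000, beta=0.5, sigma=1 / 5, gamma=1 / 10, days=60)
print(f"Infectious on day 60: beta=0.3 -> {low[-1][2]:.0f}, beta=0.5 -> {high[-1][2]:.0f}")
```

Changing a single assumed parameter shifts the projected number of infectious people substantially, which is why fixing these values by hand is risky.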

Other models, such as the Geneva, GA_Tech, MIT, MOBS, UCLA, UA, and UT models, assume in making their predictions that social distancing will continue in its current form. As restrictions ease, their numbers are likely to change drastically as a result. Finally, a large number of models only forecast at the state and national level. This is an area where deep learning has the most potential, given its ability to learn the many complex interactions between input features such as seasonality, social distancing, and geography. Moreover, with sufficient data augmentation techniques and transfer learning, the model should be able to deal with a change in distribution such as the end of social distancing. However, as we will discuss below, there remain many barriers to using deep learning for predicting COVID-19.

How our approach is different

We aim to incorporate the newest research from deep learning for time series forecasting to address these issues. Although a lot of people remain wary about deep learning in epidemiology, there are numerous advantages in using it. For instance, unlike pure statistical models, deep learning models learn from the actual data present. Deep learning models also have the potential to better integrate all the complex variables, such as weather and symptom surveys. However, many challenges still remain with deep learning methods. The most obvious initial challenge revolves around the lack of training data: right now we still only have 110 time steps, at the most, for many US counties. Moreover, for the majority of counties/states, we have fewer than 90 time steps. Another problem is the imbalance caused by the many time steps with zero new cases. For mobility, we have little data from before the pandemic began to give to the model. Therefore, utilizing transfer learning and data augmentation is essential to achieving good results.

Volunteering at CoronaWhy and forming a cross-disciplinary team

Our team serves as the sole group at CoronaWhy focused on time series forecasting. CoronaWhy is a global organization of more than 900 volunteers focused on applying machine learning to better understand Coronavirus. Most other teams focus on the White House Challenge and leveraging NLP. Our team was formed in early March to analyze temporal data related to Coronavirus. Since then we have onboarded more than 30 members from locations around the world. In addition to virus spread forecasting we have other initiatives such as forecasting patient prognosis in hospitals (unfortunately that is blocked due to lack of ICU data). Our team works closely with other CoronaWhy teams such as datasets and task-geo.

Since we are a global team of volunteers there is a difficulty with scheduling meetings as we have members in time zones around the world who are also busy with their daily jobs. For this reason we always record our meetings so those unable to attend can play them back. Additionally, we actively communicate across Slack throughout the day. While there are challenges to being dispersed, there are also some advantages. For instance, while some team members are asleep others can actively work and contribute to projects. We manage all issues on a Trello board and try to plan out our work on a weekly basis.

With a problem like COVID, having a true cross-disciplinary team is essential. That is why from the beginning, we aimed to onboard domain experts. So far we have both an epidemiologist and a bioinformatician who are regular contributors to this project. Having domain experts allows us to focus our research in ways that teams with only machine learning members cannot. For instance, these experts can help guide us on how the disease spreads at a biological level. Similarly, having a variety of machine learning experts can enlighten the discussion of models. We have individuals from more traditional statistics backgrounds, NLP researchers, computer vision specialists, and software engineers. As we will discuss in a second, many techniques from computer vision and NLP are directly applicable to time series forecasting as well.

Meta-learning/Transfer learning for time series forecasting

As mentioned above, limited data is one of the primary barriers to effectively leveraging machine learning in this context. Normally in these cases, we would turn towards proven few-shot learning techniques. However, despite the widespread success of transfer learning in NLP and computer vision, there is little literature examining its utility for time series forecasting. TimeNet examined using a pre-trained RNN and found it improved performance on clinical time series forecasting datasets. Yet, besides TimeNet, only a few papers have studied the problem. Laptev, Yu et al. describe using a reconstruction loss to aid in transfer learning. In addition to the works on transfer learning, several papers have explored other approaches. A paper titled “Learning from multiple cities: a meta-learning approach” looks at meta-learning taxi and bike patterns across various cities in the United States. There are many other techniques from NLP and computer vision that we could consider leveraging in the time series forecasting context. Adapting popular algorithms such as REPTILE and MAML from computer vision is another possible avenue that could lead to better few-shot time series forecasting approaches overall. In some of our initial model iterations we did see some positive transfer on select counties (for instance, NYC and Chicago seemed to benefit from pre-training); however, we have not rigorously evaluated these results. This leaves plenty of areas for future research.

Incorporating data vertically

Transfer learning can be done with a variety of time series forecasting datasets. However, the most similar data would likely result in the most positive transfer. Therefore, we are working to collect data from similar pandemics and viruses. Collecting data from outbreaks such as SARS, MERS, and Zika can help the model learn. We are also researching more general transfer learning techniques and aim to answer a few key questions. For example, with general embedding layers, could pre-training on wind/solar data improve performance? Are there general geo-spatial patterns that could be found from any temporal data? Our thought process is that even though such data is unrelated to COVID-19, it could serve as an effective source for transfer learning because it could help the initial layers extract temporal patterns effectively. Another, and perhaps more effective, transfer method is to train on each county, state, and country iteratively. These are just some examples of the types of data we incorporate vertically. Paradoxically, of course, the longer the pandemic goes on, the better our models should get. That is why we are working with our infrastructure and engineering teams to create a framework for continuous improvement/re-training of the models as more data becomes available.

Forecasted projections for Cook County with transfer learning for one ten day period in April. MSE 26515.275
Forecasted projections for the same period without the county transfer learning. Final MSE 30200.779 (lower MSE is better). While transfer appears to have helped in this case, we have not performed a rigorous ablation study, so the difference may be due to other causes. Additionally, it is worth noting that in certain cases spikes in the number of confirmed cases could be due to reporting delays rather than real changes in the number of cases. Therefore, the utility of directly fitting to the data is another topic for discussion.

Incorporating data horizontally

We are integrating a wide variety of data sources horizontally. Perhaps most importantly, we are looking at integrating mobility data from Google, Facebook, and other providers. We are also adding weather data, such as humidity, temperature, UV, and wind. We carefully monitor how each addition to the models affects the overall performance. For instance, we started with previous cases and day-of-the-week variables. Now we are incorporating mobility data and weather data. Next we will consider incorporating demographic data on the number of hospitals in a county, average distance to a hospital, average age, population density, total population, etc., to better forecast admissions and COVID spread. We are also looking at ways to add patient surveys from Facebook and other data sources. Additionally, more advanced geo-spatial data could potentially prove valuable.

Initially, when we added mobility data, we noticed that performance decreased compared with using just new cases from prior days. It was only when we expanded forecast_history that performance seemed to increase. This is likely due to the long incubation period of COVID-19.

Model trained on mobility data + new cases for Emilia, Italy. Here we noticed that models required a longer forecast history to accurately predict future cases. This goes hand in hand with our general understanding that the virus takes several days before patients develop symptoms. In particular, we found that a look-back window of ten or eleven days seemed to give us some of the best results with the mobility data. However, we may try to explore longer ranges as well (15+ days).
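To illustrate what a look-back window means in practice, the following sketch slices a multivariate daily series into (history, target) training pairs. It is a plain-Python illustration rather than our actual data loader: `forecast_history` mirrors the parameter name used above, while `forecast_length`, the function name, and the toy data are assumptions made for the example.

```python
def make_windows(series, forecast_history, forecast_length):
    """Split a list of per-day feature rows into (history, target) pairs."""
    samples = []
    last_start = len(series) - forecast_history - forecast_length
    for start in range(last_start + 1):
        split = start + forecast_history
        history = series[start:split]                   # what the model sees
        target = series[split:split + forecast_length]  # what it must predict
        samples.append((history, target))
    return samples


# Toy data: each row is (new_cases, mobility_index); the values are made up.
days = [(float(d), 100.0 - d) for d in range(30)]
samples = make_windows(days, forecast_history=10, forecast_length=3)
print(len(samples), "training samples from", len(days), "days")
```

Note the trade-off: with only around 90 to 110 time steps per county, every extra day of history removes one training sample per location.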

Data Augmentation

There are several ways to generate time series data. TSAug is a library that offers different methods to augment TS-data. An easy way of creating synthetic data in TSAug is by using cropping and drifting. Other libraries and techniques exist such as GANs for creating synthetic time series data as well. Another way we can potentially augment data is to create more geo-locations. For instance, we can sum together counties in close proximity. This would have the effect of providing more training data points for the model.
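As a rough sketch of two of these ideas, the snippet below sums neighbouring counties into a synthetic region and takes a random crop of the result. It deliberately uses plain Python rather than the TSAug API, and the county names and case counts are made up for illustration.

```python
import random


def sum_counties(series_by_county, group):
    """Create a synthetic region by summing daily new cases across counties."""
    return [sum(day) for day in zip(*(series_by_county[name] for name in group))]


def random_crop(series, size, rng):
    """Return a random contiguous slice of the given length."""
    start = rng.randrange(len(series) - size + 1)
    return series[start:start + size]


# Made-up daily new-case counts for three neighbouring counties.
cases = {
    "county_a": [0, 2, 5, 9, 14, 11, 8],
    "county_b": [1, 1, 3, 4, 6, 7, 5],
    "county_c": [0, 0, 1, 2, 2, 3, 1],
}
rng = random.Random(0)  # fixed seed so the crop is reproducible
region = sum_counties(cases, ["county_a", "county_b"])
crop = random_crop(region, size=5, rng=rng)
print("synthetic region:", region)
print("random crop:", crop)
```

One caveat worth keeping in mind: a summed region is strongly correlated with the counties it was built from, so synthetic regions should stay out of any evaluation split that contains those counties.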

Hybrid Models

There are some potential hybrid approaches worth exploring. For instance, we could integrate deep learning with the SEIR model or the curve fitting approaches mentioned above. Hybrid models are appealing; e.g., the winning model in the M4 forecasting competition used a combination of an RNN and exponential smoothing. Additionally, the Youyang Gu model (listed in the resources below) is a hybrid SEIR/ML approach.

Creating models for effective transfer

Part of the difficulty of using transfer learning on time series data is the lack of uniform dimensions. Different multivariate time series data might not contain all the same features. Zika data contains information on infections but does not have the same accompanying mobility data. SARS and MERS are the most similar viruses; however, officials have not tracked them as extensively. Therefore, our models need to handle a variable number of “feature” time series. One method for handling this problem is to actually swap out the “upper layers” (here meaning the layers closest to the input). In most transfer learning tasks we generally swap out the “lower layers” (the final layers closest to the output). However, here we need at least one swappable “upper layer” if we want to map multivariate time series data to a common embedding dimension.
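The idea can be sketched with a toy model in which each dataset gets its own input projection into a common embedding dimension, while the remaining layers are shared. This is a simplified, dependency-free illustration and not our actual architecture: the class names, the dataset keys, the random weights, and the mean-pooling trunk are all assumptions.

```python
import random


class InputProjection:
    """Dataset-specific linear layer mapping n_features -> d_model."""

    def __init__(self, n_features, d_model, rng):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                        for _ in range(d_model)]

    def __call__(self, row):
        return [sum(w * x for w, x in zip(weight_row, row))
                for weight_row in self.weights]


class SharedTrunk:
    """Stand-in for the layers that are reused across every dataset."""

    def __init__(self, d_model, rng):
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(d_model)]

    def __call__(self, embedded_sequence):
        # Average each embedding dimension over time, then map to one value.
        steps = len(embedded_sequence)
        pooled = [sum(step[k] for step in embedded_sequence) / steps
                  for k in range(len(self.weights))]
        return sum(w * p for w, p in zip(self.weights, pooled))


rng = random.Random(0)
d_model = 8
trunk = SharedTrunk(d_model, rng)
# One swappable input layer per dataset; only the feature count differs.
projections = {
    "zika": InputProjection(n_features=1, d_model=d_model, rng=rng),
    "covid": InputProjection(n_features=3, d_model=d_model, rng=rng),
}


def forecast(dataset, sequence):
    embedded = [projections[dataset](row) for row in sequence]
    return trunk(embedded)


zika_pred = forecast("zika", [[4.0], [6.0], [9.0]])
covid_pred = forecast("covid", [[4.0, 0.8, 21.0], [6.0, 0.7, 23.0]])
```

Because both projections emit vectors of length `d_model`, the trunk never needs to know how many features the original dataset had.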


Another “disadvantage” of our approach revolves around the sheer number of parameters. Therefore, we utilize parameter sweeps to help us search for the most effective combinations. We use the Wandb sweep algorithm for this and visualize the results/parameter importance:

A parameter sweep when using a transformer model; here we view how parameters such as the number of encoder layers and sequence_len affect the overall MSE.
A sample parameter importance chart for Antwerp. Green indicates larger values result in higher MSE loss (bad) while red indicates larger values result in lower MSE loss.
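For readers unfamiliar with sweeps, a sweep is defined by a dictionary listing the search method, the metric to optimize, and the candidate values for each parameter. The sketch below shows the general shape of such a definition. The search method, metric name, parameter names, and value lists are illustrative assumptions rather than our actual configuration.

```python
from itertools import product

# Sweep definition in the dictionary format used by W&B sweeps.
# "method" may be "grid", "random", or "bayes"; "random" is an assumption here.
sweep_config = {
    "method": "random",
    "metric": {"name": "test_mse", "goal": "minimize"},  # hypothetical metric name
    "parameters": {
        "number_encoder_layers": {"values": [1, 2, 4]},
        "sequence_len": {"values": [7, 10, 14]},
        "learning_rate": {"values": [0.0001, 0.001, 0.01]},
    },
}


def count_combinations(config):
    """Number of distinct settings a full grid over the listed values would need."""
    value_lists = [spec["values"] for spec in config["parameters"].values()]
    return len(list(product(*value_lists)))


print(count_combinations(sweep_config), "combinations in the full grid")

# With the wandb package installed and an account configured, the sweep would be
# launched roughly like this (project name and train function are placeholders):
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-forecasting-project")
#   wandb.agent(sweep_id, function=train, count=20)
```

Even three parameters with three values each already produce a sizeable grid, which is why automated search and parameter importance charts are helpful.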


Another barrier to using deep learning for COVID is related to interpreting and explaining the findings. Many of the statistical models have garnered negative press for their lack of transparency. This problem might be exacerbated by DL models, which have a negative reputation for acting as a black box. However, there are potential remedies to this issue, especially when using transformers and other variants. With transformers, for example, we can use heatmaps to easily view which features the model attends to. Similarly, various approaches, such as leveraging convolutional heatmaps or cross-dimensionally attending to input features, could help us to understand how the model is learning. Finally, utilizing methods such as Bayesian learning can help the model gauge its own uncertainty and generate intervals. We recently added confidence intervals to our model outputs and plan on including them in all future results.


Evaluating a model is complex even when there is a diverse range of data available. For COVID, unfortunately, we have very limited temporal data. Therefore evaluation is limited at best to 40 or so time steps. Originally, we evaluated models solely on a (held-out) one-week period in April; however, we quickly discovered that our hyperparameter searches overfit to that one-week test set. As a result, we recently expanded our evaluation to a two-week period in May. However, this still has the potential to overfit. We currently have plans to add code to automatically evaluate on every two-week permutation in the test set. While not perfect, this does provide more evaluation data points and lessens the chance of merely tuning the hyperparameters to a single two-week period. As more data becomes available over a wider range of seasons, stages in the pandemic, and locations, we hope to develop even more robust evaluation methods, including evaluating on a rolling basis.
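One way such an evaluation could look is sketched below, interpreting “every two-week permutation” as every contiguous 14-day window in the held-out period. The persistence forecaster is only a hypothetical stand-in for a trained model, and the series is made up.

```python
def rolling_windows(n_steps, window, stride=1):
    """Yield (start, end) index pairs for every contiguous evaluation window."""
    for start in range(0, n_steps - window + 1, stride):
        yield start, start + window


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def evaluate_all_windows(actual, predict_fn, window=14):
    """Return the mean MSE over all windows plus the per-window scores."""
    scores = [mse(actual[start:end], predict_fn(start, end))
              for start, end in rolling_windows(len(actual), window)]
    return sum(scores) / len(scores), scores


def persistence_forecast(series):
    """Hypothetical stand-in for a model: repeat the last value seen before the window."""
    def predict(start, end):
        last_known = series[start - 1] if start > 0 else series[0]
        return [last_known] * (end - start)
    return predict


# Made-up held-out series of 40 daily case counts.
held_out = [float(200 - 3 * d) for d in range(40)]
mean_mse, scores = evaluate_all_windows(held_out, persistence_forecast(held_out))
print(f"{len(scores)} windows, mean MSE {mean_mse:.1f}")
```

Overlapping windows are not independent, so this gives a smoother estimate of performance rather than 27 separate test sets.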

Choosing the proper evaluation metric is another difficult decision. In our case, for now we are using Mean Squared Error (MSE). However, MSE has the fundamental problem that it is scale-dependent. Therefore, while it is easy to compare models on a specific county, MSE values are not comparable across counties with very different case counts. This is where metrics such as normalized MSE or mean absolute percentage error (MAPE) could work well. Additionally, mean absolute scaled error (MASE) might be another alternative.
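The scale problem is easy to see with a small example. The sketch below implements the four metrics under stated assumptions: normalized MSE divides by the variance of the actual values (one common choice among several), MAPE skips days with zero actual cases (where it is undefined), and MASE scales by the in-sample error of a naive previous-day forecast. All numbers are made up.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def normalized_mse(actual, predicted):
    """MSE divided by the variance of the actual values."""
    mean = sum(actual) / len(actual)
    variance = sum((a - mean) ** 2 for a in actual) / len(actual)
    return mse(actual, predicted) / variance


def mape(actual, predicted):
    """Mean absolute percentage error, skipping days with zero actual cases."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def mase(actual, predicted, training):
    """MAE scaled by the in-sample MAE of a naive previous-day forecast."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_errors = [abs(training[t] - training[t - 1]) for t in range(1, len(training))]
    return mae / (sum(naive_errors) / len(naive_errors))


# A small county and a county with 100x the cases, forecast equally well in relative terms.
small_train, small_actual, small_pred = [4, 6, 8], [10, 12, 14, 16], [11, 11, 15, 15]
large_train = [100 * v for v in small_train]
large_actual = [100 * v for v in small_actual]
large_pred = [100 * v for v in small_pred]

for name, fn in [("MSE", mse), ("normalized MSE", normalized_mse), ("MAPE", mape)]:
    print(name, fn(small_actual, small_pred), fn(large_actual, large_pred))
print("MASE", mase(small_actual, small_pred, small_train),
      mase(large_actual, large_pred, large_train))
```

MSE differs by a factor of 10,000 between the two counties, while the scale-free metrics agree, which is the behaviour we want when comparing across counties.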


From a technical standpoint, how we train and evaluate models is deeply tied into Weights and Biases (W&B). First, we use W&B to track all of our experiments. Second, we use it to write up results on the performance of the various models and export them to LaTeX when publishing. We also use other technologies such as Colaboratory, AI Notebooks (when more computational power is needed), GCS, and Travis-CI. However, all of our crucial code is stored in a central repository and covered by unit tests. We only use notebooks to define the configuration files and import functions/models from the central repository. These configuration files are then logged to W&B for later analysis. Weights are automatically stashed onto GCS along with a UUID to identify their configuration file/results. Code for creating our combined dataset (i.e. mobility data, new cases, weather, symptom surveys, etc.) is stored in our repository. Currently, this code has to be run manually; however, we are working on setting up data pipelines with tools like Airflow to run on a daily basis. Another area we are actively working on with our datasets team is a better overall data architecture. Specifically, we are looking at how to best utilize our resources in terms of cost and enable our architecture to work in any cloud environment.

An example of one of our configuration files that we use for forecasting Coronavirus. You can see here that we include a wide variety of parameters. Some of these parameters are optimized by Weights and Biases in its sweep, while others are static. Excluded layers are layers not in the original configuration file.

For a more complete example please see this gist


Unlike other projects which have closed source models and/or do not explain their decisions, we are committed to being as transparent as possible. Most of our code is open source and on GitHub (and we plan on open-sourcing additional portions within the next few weeks). All of our meetings and discussions are posted to our GitHub page so anyone can watch them. With respect to data, the majority of data sources are publicly available and will be regularly uploaded (still finalizing this pipeline) to Dataverse. Unfortunately, due to patient privacy, some of our data has to remain private, however we do plan on thoroughly outlining how people can apply for access to those data sources. Additionally, all of our results are logged to public projects on Weights and Biases so anyone can analyze our full experiment results. Feedback from the outside community is always welcome, regardless of background or expertise.


Deep learning has the potential to help create better COVID forecasting models, which can in turn aid in public policy planning. However, many barriers still exist to using COVID data effectively. This is why we are currently ramping up efforts to leverage cutting-edge techniques in the machine learning space to overcome these challenges. If you are interested in volunteering on this task, please reach out to me or to CoronaWhy directly. We are especially interested in acquiring more domain experts and collaborators in virology and epidemiology. We would also like to acquire more public policy experts to think of ways our models could have the most positive impact. Finally, as always, we continue to look for experts in transfer learning, meta-learning, time series forecasting, and data engineering to round out our team.

More resources

In addition to the resources listed above there are a variety of additional places where you can find information both on our project and other efforts:


Talk on Attention for Time Series Forecasting

IHME Model

Analysis of Youyang Gu Model

Youyang Gu Model

Los Alamos Model

Nate Silver Blog Post on Where we are headed (Published May 1)

Official CDC Forecasting Website