Source: Deep Learning on Medium
Diminishing the Dengue Danger: Predicting future dengue outbreaks using Machine Learning on historical dengue and climate data in Singapore
Dengue Fever has been identified by the World Health Organisation (WHO) to be the most critical mosquito-borne disease globally. In the last few decades, the world has seen a 30-fold increase in global incidence of Dengue Fever. People living in tropical and subtropical climates are most vulnerable to Dengue, and that is half the world’s population at risk. There are four types of dengue virus serotypes (or ‘strains’), hence a person can be infected by dengue virus up to four times. With no specific treatment associated with Dengue, early detection to prevent the breeding of Aedes mosquitoes that carry the virus is the most effective method to reduce dengue outbreaks.
Ever since the dengue fever epidemic hit Singapore almost a decade ago, the ‘5-step Mozzie Wipe-Out’ campaign launched by the National Environment Agency (NEA) has been known to every Singapore resident. Indeed, dengue outbreaks and spikes in dengue cases have been publicised intermittently in Singapore, especially given our equatorial climate, which is conducive to mosquito breeding, and the passive monitoring of stagnant water, a potential breeding ground for these deadly pests. Dengue fever came into the spotlight in Singapore and became a topic of everyday discourse in 2005, when there was a total of more than 14,000 dengue cases. The influx of dengue patients created a shortage of hospital beds. Since then, the dengue virus has been here to stay in Singapore, with more than 16,000 reported cases in 2019.
The objective of our project is to predict the number of dengue cases based on the (1) rainfall and (2) temperature measurements of various locations in Singapore, (3) population growth, and (4) time series effects. Our models forecast the weekly number of dengue cases up to eight weeks into the future. We decided that predicting weekly rather than daily outcomes was preferable because it reduces variation in the outcome values, and because the dengue case counts we were given are also reported by week. Using Decision Trees and Neural Networks, we aim to produce a robust prediction model that forecasts the number of dengue cases eight weeks ahead.
We were able to leverage a substantial amount of quality data to bring this project to fruition, because Singapore’s respective ministries collect important and relevant data. In Singapore, it is mandatory for all medical practitioners and clinical laboratories to report to the Ministry of Health (MOH) all clinically suspected and laboratory-confirmed dengue cases within 24 hours of diagnosis (as per Section 7 of the Infectious Diseases Act).
Our full dataset consists of weekly panel data from 13 August 2011 to 23 November 2019 (433 weeks). We picked the following features as the variables that drive the forecast of dengue cases: weekly rainfall (50 locations) and temperature (15 locations) data from various locations in Singapore, and the yearly population of Singapore. The population data were interpolated to the individual weeks within each year so that they could be processed together with the other weekly data.
The dependent variable we want to predict is the number of dengue cases per week, up to 8 weeks into the future.
The rationale behind using these data is as follows:
- Weather data (i.e. rainfall and temperature): these are the factors that affect the breeding of Aedes aegypti and Aedes albopictus mosquitoes, and hence are related to the risk of someone contracting dengue fever. High temperatures provide a favourable environment for mosquitoes to breed and also affect their feeding behaviour. Likewise, high rainfall results in a higher likelihood of stagnant water accumulating unchecked, such as on the roofs of private residential homes, which is perfect for mosquitoes to breed.
- We also included the population trend of Singapore, to investigate whether changes in population size lead to significant changes in the number of dengue cases over the years. We hope to be able to account for the portion of the increase in dengue cases over the years that is due to an increase in population.
Data Pre-processing Methodology
Working with highly correlated data
Looking at the correlation plots in Figure 4 below, we observed that many of the rainfall and temperature stations are highly correlated to one another.
This makes sense, considering that Singapore’s small land area should render most parts of the island atmospherically homogeneous. The correlation results are also what we would expect from geography, as locations such as Dhoby Ghaut and Somerset (both located in the Southern area of Singapore) have very similar rainfall trends. It is worth noting that there was a substantial proportion of missing rainfall or temperature data in the original dataset (which dates back to the year 2000). This was why we narrowed down the range of data that we are using to eight years. We then replaced missing rainfall or temperature values based on the correlation method within this 8-year range.
Feature Extraction to create Dendrogram clusters
In view of the high correlation among temperature and rainfall features respectively, we want to reduce the number of temperature and rainfall features. However, we still want to retain our model’s ability to explain the variability in dengue cases (training labels). Thus we conducted dimensionality reduction through feature extraction to handle highly correlated features and to prevent overfitting.
We built two dendrograms, one each for rainfall and temperature. The dendrograms were based on the pairwise-correlation distance (1 minus correlation) between two locations, and from the results we derived 11 rainfall clusters and 4 temperature clusters to represent the different rainfall and temperature trends in different parts of Singapore.
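To make the clustering step concrete, here is a minimal sketch of grouping stations by correlation distance (1 minus correlation) with SciPy. The linkage method (`average`), the helper name `correlation_clusters`, and the synthetic station data are all assumptions for illustration; the original analysis may have used different settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def correlation_clusters(weekly_values, n_clusters):
    """Cluster stations (columns) by 1 - correlation of their weekly series."""
    corr = np.corrcoef(weekly_values, rowvar=False)
    dist = 1.0 - corr
    dist = (dist + dist.T) / 2.0        # remove tiny floating-point asymmetry
    np.fill_diagonal(dist, 0.0)         # a station is at distance 0 from itself
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")  # assumed linkage method
    # Cut the dendrogram so that at most n_clusters groups remain
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Synthetic example: stations 0 and 1 share one trend, stations 2 and 3 another
rng = np.random.default_rng(0)
trend_a = rng.normal(size=200)
trend_b = rng.normal(size=200)
stations = np.column_stack([
    trend_a + 0.1 * rng.normal(size=200),
    trend_a + 0.1 * rng.normal(size=200),
    trend_b + 0.1 * rng.normal(size=200),
    trend_b + 0.1 * rng.normal(size=200),
])
labels = correlation_clusters(stations, n_clusters=2)
print(labels)
```

Stations whose series move together end up with the same label, which is the property used to collapse many stations into a handful of cluster-level features.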
Dealing with missing data
There were patches of missing data in the dataset. For certain weather stations, rainfall and temperature data are unavailable for prolonged periods. Considering the localised nature of rainfall, we approximated missing rainfall levels using the average rainfall values of the nearest meteorological stations. We believed that doing so would give a more accurate representation of the rainfall/temperature trend than replacing the missing values with other statistics, such as the 8-year rainfall average of the respective station.
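As a rough illustration of the nearest-station idea, the sketch below fills each missing weekly value with the average of the closest stations that do have a reading for that week. The station names, coordinates, the choice of `k`, and the function name `impute_from_neighbours` are hypothetical and not taken from the original dataset.

```python
import math

def impute_from_neighbours(readings, coords, k=2):
    """readings: station -> list of weekly values (None = missing).
    coords: station -> (x, y). Returns a filled copy; the input is untouched."""
    filled = {station: list(values) for station, values in readings.items()}
    for station, values in readings.items():
        # Other stations, ordered from nearest to farthest
        others = sorted(
            (s for s in readings if s != station),
            key=lambda s: math.dist(coords[station], coords[s]),
        )
        for week, value in enumerate(values):
            if value is not None:
                continue
            # Take the k nearest stations that actually reported this week
            nearby = [readings[s][week] for s in others
                      if readings[s][week] is not None][:k]
            if nearby:
                filled[station][week] = sum(nearby) / len(nearby)
    return filled

# Hypothetical stations: A and B are close together, C is far away
readings = {"A": [10.0, None, 30.0], "B": [12.0, 20.0, 28.0], "C": [50.0, 60.0, 70.0]}
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 0.0)}
filled = impute_from_neighbours(readings, coords, k=1)
print(filled["A"])
```

With `k=1`, the gap at station A is filled from station B only, because B is the nearest station with a reading for that week.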
Feature Normalisation of rainfall, temperature, and other values
For each of the 11 rainfall clusters and 4 temperature clusters, we took the average of all the temperature or rainfall values within each cluster. This averaging is done for each week’s data across all stations within the cluster, producing cluster averages for both temperature and rainfall each week. To illustrate this, temperatures at Khatib and Sembawang meteorological stations form cluster 3. As such, cluster 3’s weekly temperatures will be the average weekly temperatures of these two stations.
We then split our dataset into training set (from 13 August 2011 until 17 March 2018) and test set. We then normalised the values within each temperature or rainfall cluster, within the training set. The median and range used to normalise the training set were used for normalisation of the test set values as well. The formula used is as mentioned below.
We normalised the cluster values to minimise the magnitude of the values used, which will speed up neural network learning and also prevent any errors when running the neural network due to large numbers.
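The normalisation can be sketched as follows. The exact formula is not reproduced in this text, so the form used here, (value - training median) / (training max - training min), is an assumption based on the description of a median and a range; the function names and temperatures are made up for illustration. The key point is that the statistics come from the training set only and are reused for the test set.

```python
from statistics import median

def fit_scaler(train_values):
    """Return the (median, range) of the training values."""
    centre = median(train_values)
    spread = max(train_values) - min(train_values)
    return centre, spread

def apply_scaler(values, centre, spread):
    """Assumed formula: (value - median) / range."""
    return [(v - centre) / spread for v in values]

# Hypothetical weekly temperatures for one cluster
train = [24.0, 26.0, 27.0, 28.0, 30.0]
test = [25.0, 31.0]

centre, spread = fit_scaler(train)            # statistics from training data only
train_scaled = apply_scaler(train, centre, spread)
test_scaled = apply_scaler(test, centre, spread)
print(train_scaled, test_scaled)
```

Because the test set is scaled with training statistics, test values can fall outside the range seen in training, as the 31.0 reading does here.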
We selected features using regression trees built with the XGBoost algorithm. Our initial set of variables comprised the dengue levels at T+8 (i.e. eight weeks into the future, the value to be predicted), the dengue levels at T+0 (i.e. the current week), the maximum and mean values of each rainfall cluster from periods T-3 (i.e. three weeks into the past) to T+0, and the minimum values of each temperature cluster from periods T-12 to T-4.
The idea behind using the mean of each rainfall cluster from periods T-3 to T+0 is that the rainfall of those weeks could predict a dengue outbreak eight weeks into the future. This range was selected after taking into account the life cycle of Aedes mosquitoes (which takes around three weeks from contact with water to being able to cause disease). Importantly, this initial spike in the Aedes population could lead to an exponential increase in eggs being laid, thereby predicting the spike in dengue cases eight weeks into the future.
As for temperature clusters, the minimum value of each temperature cluster from periods T-12 to T-4 was selected because we assumed that the lowest temperature of the past 12 weeks could anticipate an increase in temperature eight weeks later. We could not look too far back; otherwise, we would have a severe shortage of test data.
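The lag windows described above can be sketched for a single rainfall cluster and a single temperature cluster as follows. The function names, feature keys, and toy series are illustrative assumptions; the real pipeline would repeat this for every cluster.

```python
def window_stat(series, t, start_lag, end_lag, stat):
    """Apply stat to series[t - start_lag] .. series[t - end_lag], inclusive."""
    return stat(series[t - start_lag : t - end_lag + 1])

def mean(values):
    return sum(values) / len(values)

def build_rows(dengue, rain, temp, horizon=8, max_lag=12):
    """Pair the features known at week t with the dengue level at week t + horizon."""
    rows = []
    for t in range(max_lag, len(dengue) - horizon):
        features = {
            "dengue_t0": dengue[t],
            "rain_mean_t-3_t0": window_stat(rain, t, 3, 0, mean),
            "rain_max_t-3_t0": window_stat(rain, t, 3, 0, max),
            "temp_min_t-12_t-4": window_stat(temp, t, 12, 4, min),
        }
        rows.append((features, dengue[t + horizon]))
    return rows

# Toy series, 30 weeks long
dengue = list(range(30))
rain = list(range(30))
temp = list(range(100, 130))

rows = build_rows(dengue, rain, temp)
first_features, first_target = rows[0]
print(len(rows), first_features, first_target)
```

Note how the 12-week lookback and the 8-week horizon together consume 20 weeks of each series, which is why looking further back quickly shrinks the usable data.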
Initial findings using Regression Trees
We first ran the optimisation of regression tree parameters on this initial set of variables. The best set of parameters for our dataset was found to be as seen in the figure below. Using these optimised values, we then began reducing the number of clusters so as to prevent the curse of dimensionality when we move on to using neural networks. The reduction of clusters was performed as follows:
- Use the initial results (using all temperature and rainfall clusters) as the benchmark. The results to take note of are the mean-squared error (MSE), training predictions, test predictions, and the lagged correlations.
- Remove the variables belonging to the smallest rainfall clusters, which are situated at the extreme ends of Singapore, e.g. Clusters 1 and 2.
- Compare against the benchmark. If the model fares equally well or better without the removed variables, they stay out of the subsequent models to be tested. If it fares worse than the previous results, that cluster is added back into the model.
- Should a run not complete (i.e. the test loss line did not cross the training loss line), the number of rounds was increased.
- This was performed iteratively, and the same method was applied to the temperature clusters.
- Any variations of the final set of variables were added into the model for testing. As before, if the results were equivalent to or worse than those of the previous model tested, the variable was removed.
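The elimination procedure above can be sketched as a simple loop. This is a simplification: it compares a single loss value, whereas we also looked at predictions and lagged correlations. The `evaluate` function here is a made-up stand-in for training a regression tree and returning its test loss, and the cluster names are hypothetical.

```python
def backward_eliminate(groups, evaluate):
    """Try removing each feature group in turn.
    evaluate: callable(list of groups) -> loss, where lower is better."""
    kept = list(groups)
    best = evaluate(kept)                 # benchmark with every group included
    for group in groups:
        trial = [g for g in kept if g != group]
        loss = evaluate(trial)
        if loss <= best:                  # equal or better without it: drop it
            kept, best = trial, loss
        # otherwise the group stays in the model
    return kept, best

def toy_evaluate(groups):
    """Hypothetical loss: two groups help, every group adds a small cost."""
    loss = 1.0 + 0.01 * len(groups)
    if "rain_4" in groups:
        loss -= 0.3
    if "temp_2" in groups:
        loss -= 0.2
    return loss

kept, best = backward_eliminate(["rain_1", "rain_2", "rain_4", "temp_2"], toy_evaluate)
print(kept, round(best, 2))
```

In practice, `evaluate` would retrain the tree on the reduced variable set each time, which is why a fast model is convenient for this step.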
We used a regression tree for feature extraction simply because it is fast, so we could quickly reduce the number of variables before the deep learning portion, which takes considerably longer. The best parameters for the regression tree based on our initial set of variables were: [number of rounds = 20; maximum depth of trees = 1250; learning rate = 0.1; number of parallel trees = 10; subsampling = 0.1; column sampling by tree = 1]. We then found our best set of variables to be: [mean dengue levels of T-4 to T-1, T-8 to T-5 & T-12 to T-9; mean differenced dengue values of T-4 to T-1 & T-12 to T-9; population levels at T+0; the maximum and mean values of T-3 to T+0 for rainfall clusters 2, 4, 5, 6, 7, 8, 9, 10 and 11; and the mean and minimum values of T-7 to T-4 & T-12 to T-8 for temperature clusters 2, 3 and 4].
The results of our regression tree model (with a squared error of 0.12637) are as listed below:
Deep Learning Model Building
The metrics used for the deep learning model are similar to the metrics used in the regression tree model. However, due to the small number of data points in our dataset, we decided to go without a validation set. Instead, we rely solely on the Euclidean loss and the test predictions to determine whether our model is performing well. Also, the dengue levels in 2017–2018 were mostly flat. As such, even if we had set aside a validation set, it would have consisted mostly of those values. This could give us a false impression of how well our model is actually doing, since the dengue levels in 2019 began fluctuating and even peaked. Importantly, we might have optimised the model to fit a validation set which does not have a peak (which corresponds to an outbreak). In view of time, we decided to just go with the Euclidean loss and the test predictions. Of course, we took lag into account as well.
Our benchmark is persistence, i.e. assuming that the dengue level eight weeks ahead equals the current level. Its loss on our test dataset was calculated to be 0.015698. To calculate it, we placed the test set dengue values in one column in Excel, which serves as our T+0 values. In an adjacent column, we pasted the dengue values from T+8 onwards, which serves as our T+8 values. The protruding ends of the T+0 and T+8 columns were truncated. We then calculated the squared difference of the two columns in a new column. In a separate cell, we averaged all the squared differences and divided that average by two to obtain the persistence loss.
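The spreadsheet calculation above amounts to the following few lines. The function name and the sample values are illustrative only, and a horizon of 2 is used so that the toy series stays short; our benchmark uses a horizon of 8.

```python
def persistence_loss(values, horizon=8):
    """Half the mean squared difference between each value and the one horizon weeks later."""
    actual = values[horizon:]        # the T+8 column
    forecast = values[:-horizon]     # the T+0 column, carried forward as the forecast
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sum(squared) / len(squared) / 2

# Made-up normalised dengue levels
levels = [0.1, 0.2, 0.4, 0.3, 0.5]
loss = persistence_loss(levels, horizon=2)
print(loss)
```

Slicing off the first and last `horizon` entries is the code equivalent of truncating the protruding ends of the two columns.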
Deep Learning Methodology
We then shifted our focus to neural networks with the same features highlighted from the regression tree stage to observe if the ability of neural networks to generalise well to more complex patterns could allow us to make more accurate predictions.
Starting from the variables derived from the regression tree runs, we ran an initial test on a simple 3-layer neural network with just 3 configurations: 16, 32 or 64 nodes. This was in view of time. We ran 10,000 iterations per model, using the Adam optimiser. The whole process was repeated 5 times to ensure that the initial weights did not lead to overly biased or overly optimistic results. Both the training phase and the test phase were conducted with variables at T+0 and T+8 at the minimum. There was no averaging of the T+8 variable to be predicted during the training phase.
The best results from this round were:
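A rough sketch of this kind of search is shown below, using a small NumPy network rather than the framework we actually used. Everything here is an assumption for illustration: "3-layer" is read as input, one hidden layer and output; the tanh activation, learning rate, 500 iterations (instead of 10,000) and synthetic data are made up. As in our setup, models are compared on test loss because there is no validation set.

```python
import numpy as np

def euclidean_loss(pred, y):
    """Half the mean squared error."""
    return float(np.sum((pred - y) ** 2) / (2 * len(y)))

def predict(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    return (hidden @ params["W2"] + params["b2"]).ravel()

def train_mlp(X, y, n_hidden, iters=500, lr=0.01, seed=0):
    """One-hidden-layer network trained with Adam on the Euclidean loss."""
    rng = np.random.default_rng(seed)     # the seed controls the initial weights
    n, d = X.shape
    params = {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d), (d, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, 1)),
        "b2": np.zeros(1),
    }
    m1 = {k: np.zeros_like(p) for k, p in params.items()}   # Adam first moments
    m2 = {k: np.zeros_like(p) for k, p in params.items()}   # Adam second moments
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(1, iters + 1):
        # Forward pass
        hidden = np.tanh(X @ params["W1"] + params["b1"])
        err = (hidden @ params["W2"] + params["b2"]).ravel() - y
        # Backward pass for loss = sum(err^2) / (2n)
        g_out = (err / n)[:, None]
        g_hidden = (g_out @ params["W2"].T) * (1.0 - hidden ** 2)
        grads = {"W1": X.T @ g_hidden, "b1": g_hidden.sum(axis=0),
                 "W2": hidden.T @ g_out, "b2": g_out.sum(axis=0)}
        # Adam update with bias correction
        for k in params:
            m1[k] = beta1 * m1[k] + (1 - beta1) * grads[k]
            m2[k] = beta2 * m2[k] + (1 - beta2) * grads[k] ** 2
            m_hat = m1[k] / (1 - beta1 ** step)
            v_hat = m2[k] / (1 - beta2 ** step)
            params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

# Synthetic stand-in data: 4 inputs, 200 training rows, 100 test rows
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.tanh(X[:, 0]) - 0.5 * X[:, 1]
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

best = None
for n_hidden in (16, 32, 64):             # the three configurations
    for seed in range(5):                 # five repeats with different initial weights
        params = train_mlp(X_train, y_train, n_hidden, seed=seed)
        loss = euclidean_loss(predict(params, X_test), y_test)
        if best is None or loss < best[0]:
            best = (loss, n_hidden, seed)

# Naive reference: always predict the training mean
baseline = euclidean_loss(np.full_like(y_test, y_train.mean()), y_test)
print(best, baseline)
```

Repeating each configuration with several seeds makes it less likely that a single lucky or unlucky initialisation decides which architecture looks best.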
Test Loss: 0.0060406
Number of inputs: 36
Optimisation algorithm: Adam
Number of perceptrons in topmost layer: 128
Number of layers in neural network: 3
The results were good, but we wanted better. Also, there was a second spike in the predictions right after the real spike, which was a false spike in dengue levels. This would lead us to send out false information.
In our second round of optimisation, this time on neural networks, we experimented with larger neural networks which have 3, 4 or 5 layers. Along the way, we discovered that one more feature improved our results by quite a bit: the second-order difference of dengue levels from T-4 to T-1. As such, our final set of features is:
- Mean dengue levels of T-4 to T-1, T-8 to T-5 & T-12 to T-9
- Mean first-order differenced values of dengue levels at T-4 to T-1 & T-12 to T-9
- Mean second-order differenced values of dengue levels at T-4 to T-1
- Population levels at T+0
- Maximum and Mean values of T-3 to T+0 for rainfall clusters 2, 4, 5, 6, 7, 8, 9, 10 and 11
- Maximum and Mean values of T-7 to T-4 & T-12 to T-8 for temperature clusters 2, 3 and 4
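For readers unfamiliar with differenced features, here is a minimal sketch of how the mean first-order and second-order differences over the T-4 to T-1 window could be computed. The exact alignment we used is not spelled out above, so the definitions below (difference at week t is the level at t minus the level at t-1) and the toy series are assumptions.

```python
def first_diff(series, t):
    """Week-on-week change at week t."""
    return series[t] - series[t - 1]

def second_diff(series, t):
    """Change in the week-on-week change at week t."""
    return first_diff(series, t) - first_diff(series, t - 1)

def mean_over_window(fn, series, t, start_lag, end_lag):
    """Average fn over weeks t - start_lag .. t - end_lag, inclusive."""
    values = [fn(series, t - lag) for lag in range(end_lag, start_lag + 1)]
    return sum(values) / len(values)

# Toy dengue levels that accelerate steadily
dengue = [1, 2, 4, 7, 11, 16, 22, 29]
t = 7

mean_d1 = mean_over_window(first_diff, dengue, t, 4, 1)    # window T-4 to T-1
mean_d2 = mean_over_window(second_diff, dengue, t, 4, 1)
print(mean_d1, mean_d2)
```

The first-order feature captures how fast cases are rising, while the second-order feature captures whether that rise is speeding up or slowing down, which is plausibly why it helps around peaks.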
We also compared those results with results from using larger neural networks of 6, 7, and 8 layers. Note that the node sizes to be tested were the same i.e. 32, 64 and 128. We also tried using the SGD optimiser instead of the Adam optimiser. However, it seems that less complicated neural networks and the Adam optimiser still worked the best. We ran our final analysis with 20 repeats to get the best results as shown below:
Test Loss: 0.00453716
Number of inputs: 37
Optimisation algorithm: Adam
Number of perceptrons in topmost layer: 64
Number of layers in neural network: 4
The best results came from the model with 4 layers and 64 perceptrons in the first layer. It attained a loss of 0.00453716, which beats the persistence loss of 0.015698 by around 0.011, a reduction of roughly 71%.
This particular model also yielded a test prediction curve that fits reasonably well: the predicted values are not too far off from the actual values when dengue levels are low, and the peak in dengue cases was predicted with a lag of 0 (which is awesome). Although our prediction did not exceed the actual peak in dengue cases, which we thought would be favourable, it was able to predict the peak nevertheless. The second, smaller predicted peak that came after it should also be accounted for; it could reflect an actual second peak that did not happen due to intervention by the NEA, although this is just a theory.
Potential risks of using our model
If there is a caveat to our model results, it is that our model does not seem capable of over-predicting (which is something that we wanted). The organisation using this model should notify healthcare professionals about this tendency at its discretion, and perhaps ask them to anticipate more cases than predicted. Further, the predicted peak in the test set was lower than the actual peak. As such, we cannot be certain that our model will be able to accurately predict dengue levels in the event of a very extreme outbreak.
Conclusion and Review
Overall, we think we did pretty well, and it was not easy at all to hit the sweet spot of having a low test loss, an acceptable prediction pattern and an acceptable lag pattern.
Finding the relevant variables without negatively affecting all the previous results was tough as well. We attempted to incorporate Google Search trends on ‘dengue’ to see if it would be a significant predictor variable, but results were mixed. In addition, the correlation and causation between Google search trends and the outbreak itself have not been widely studied, so we decided to forgo that in the end.
We believe that this model can be further improved by tweaking the model components even more, and experimenting with variations of neural network architecture, as well as feeding the training phase more data. Nevertheless, this is the best that we can achieve!