Original article was published on AI Magazine.
From the Model V1 experiments, we learnt that our engineered features consistently gave lower test losses and higher profits than the benchmark. In this final round of experimentation, we therefore kept all our engineered features and experimented with more sophisticated network features such as input scaling, dropout and L2 regularisation. This time, the configuration of input scaling [without] clamping, starting with  nodes in Layer 1, [0.1] dropout probability, [0.001] L2 regularisation weight decay, [MAE] loss function and [Adam] for back-propagation (bolded below) yielded the lowest test loss of 0.4794.
- Network: 4 layers with ⅔ reduction, starting with 32 / 64 /  nodes in Layer 1.
- Input scaling: With / [Without] clamping
- Dropout probability: [0.1] / 0.2 / 0.25
- L2 regularisation weight decay: [0.001] / 0.0001 / 0.00001 / 0.000001
- Input features: All engineered features mentioned in Section 4.1 (313 columns)
- Loss function: RMSE / [MAE] / ASYM
- Back-propagation algorithm: [Adam] / SGD
- Min. test loss: 0.4794
- Profits / Max. possible profits: 0.5912
6.4 Other experiments / models:
Across the three main types of experiments listed above, we also tried the following modifications, none of which improved our models on our two metrics:
- Adding a 5th layer: Given our relatively large training dataset, we wanted to see if the model would improve with extra complexity, and so tested it with a 5th layer. However, the extra layer increased our test loss and worsened overfitting.
- Auto-encoder with bottleneck sizes of 1.5, 2, 4 and 8: As our neural networks take high-dimensional inputs (>300 columns), we tried reducing the dimensionality with an autoencoder. However, this made both our test loss and profits worse.
- Squared perceptrons: Squared perceptrons, used as a drop-in replacement for ordinary perceptrons, can help some models learn better, but in our case the models without squared perceptrons performed better.
- Momentum and force losses: Momentum and force losses are usually meant to help reduce lag, but our lag correlations already peak at 0. We experimented with these losses anyway and, as expected, the models with momentum and force losses ended up with worse test losses, because taking differences often results in greater noise.
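To make the last point concrete, here is one plausible formulation of momentum and force losses; this is an assumption for illustration, not the article's exact definition. The "momentum" term penalises errors in the first difference of the sequence and the "force" term errors in the second difference, which is also why these terms amplify noise:

```python
import torch

def momentum_force_loss(pred, target, w_m=0.5, w_f=0.25):
    """MAE on levels plus penalties on first ('momentum') and second
    ('force') differences of the sequence. w_m and w_f are illustrative
    weights. Differencing amplifies high-frequency noise, which is why
    these terms can hurt when lag correlation already peaks at 0."""
    base = torch.mean(torch.abs(pred - target))
    dp, dt = pred[1:] - pred[:-1], target[1:] - target[:-1]
    momentum = torch.mean(torch.abs(dp - dt))        # first difference
    d2p, d2t = dp[1:] - dp[:-1], dt[1:] - dt[:-1]
    force = torch.mean(torch.abs(d2p - d2t))         # second difference
    return base + w_m * momentum + w_f * force
```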
7. Limitations and Future Improvements
7.1 High test loss, gap between training and test loss remains large
The gap shows that the model is not generalising well on unseen data and could be a warning sign that the network is memorising the training data. Even with feature engineering, fine-tuning the windows of existing features, and adding dropout and weight regularisation in Model V2, these changes only slightly narrowed the gap between training and test loss. Two possible reasons for the relatively high test loss and large gap are: (1) the data itself is very noisy, and (2) the data provided is not truly predictive of wind energy production.
A potential way to tackle the problem of non-predictive data is to source more predictive data. For predicting wind energy production, we could include external data on air density, the efficiency factor of the wind farms and the length of the rotor blades (all of which are terms in the wind power generation equation). This may increase the amount of predictive data we have and thus improve the neural networks' ability to generalise.
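For reference, the wind power generation equation mentioned above relates exactly these quantities. A minimal sketch (the function name and example values are illustrative):

```python
import math

def wind_power_watts(rho, blade_len_m, wind_speed_ms, cp):
    """Theoretical power extracted by a turbine:
        P = 0.5 * rho * A * v^3 * Cp,  with swept area A = pi * r^2
    rho: air density (kg/m^3), blade_len_m: rotor blade length (m),
    wind_speed_ms: wind speed (m/s), cp: efficiency (power) coefficient,
    which cannot exceed the Betz limit of 16/27 (~0.593)."""
    area = math.pi * blade_len_m ** 2
    return 0.5 * rho * area * wind_speed_ms ** 3 * cp
```

The cubic dependence on wind speed is one reason small wind-forecast errors translate into large energy-prediction errors, and why density and blade-length data could add predictive signal.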
7.2 Consistent under-predictions
From the ‘Actual vs Test Predictions’ graphs, we see that a larger proportion of the plotted points fall below the 45-degree line, implying that our models consistently under-predict wind energy production. This is also visible in the ‘Test Predictions’ graphs, where more blue points appear in the upper region, indicating that the larger values are often under-predicted.
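The visual check against the 45-degree line can also be quantified. A small sketch of such a diagnostic (the function name is ours, not from the article):

```python
import numpy as np

def under_prediction_rate(y_true, y_pred):
    """Fraction of points falling below the 45-degree line on an
    'Actual vs Prediction' plot, i.e. where prediction < actual.
    A value well above 0.5 indicates systematic under-prediction."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_pred < y_true))
```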
One possible reason for the under-predictions could be the datasets we use to train the models. The original dataset contains wind speed and wind direction forecasts for all 8 locations, spanning 01 Jan 2017 to the present. While each location represents a major wind farm in the Ile-De-France region, Boissy-la-Rivière was only in operation from August 2017, and Angerville 1 and Angerville 2 (Les Pointes) only from 02 Jul 2019. This is likely to have affected the region's total energy production capacity for a given wind forecast (i.e. the same average wind speed would likely lead to higher wind energy production in the later part of 2019 than in the earlier part of 2017). This may introduce noise into the data: variations in wind energy production that are not due to variation in our model inputs.
Therefore, we tested our V2 model on 3 datasets:
- The original dataset spanning 01 Jan 2017 to the present
- A weighted dataset that multiplies energy actuals by 50% before 01 Aug 2017, by 80% between 01 Aug 2017 and 02 Jul 2019, and by 100% after 02 Jul 2019. The weights for the respective durations were estimated with this formula: % wind energy production capacity in operation = total estimated nominal power output of farms currently in operation ÷ total estimated nominal power output of the 8 major wind farms.
- A truncated dataset consisting only of data after 02 Jul 2019, based on the assumption that all 8 farms are in operation and wind energy production capacity does not change significantly from that point.
Here is a comparison of the results: