Source: Deep Learning on Medium
In most statistical modeling or machine learning prediction tasks, some cases can be predicted easily from their predictor values (signal), while for others the prediction is unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in exactly opposite ways: boosting up-weights them, and ProfWeight down-weights them.
In boosting, a model is repeatedly fit to data, and successive iterations in the process up-weight the cases that are hard to predict. Specifically, the algorithm is as follows:
1. Fit a model to the data.
2. Draw a new sample from the data (often via cross-validation or the bootstrap), giving erroneously predicted records a higher probability of selection.
3. Fit a model to the new sample.
4. Repeat steps 2–3 multiple times.
5. Final predictions are a weighted average of all the models; in AdaBoost, for example, the more accurate models receive the higher weights.
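The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a reference implementation: the base learner is a hypothetical one-split "stump" on a single predictor, labels are coded as +1/-1, the toy data are invented, and the selection probabilities and model weights follow the AdaBoost formulas, which is one common choice among several.

```python
import math
import random

def fit_stump(xs, ys):
    """Hypothetical weak learner: predict `sign` when x > t, otherwise -sign."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (sign if x > t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x > t else -sign

def boost(xs, ys, rounds=10, seed=0):
    """Boosting by resampling, with AdaBoost-style weights (an assumption)."""
    rng = random.Random(seed)
    n = len(xs)
    weights = [1.0 / n] * n        # selection probabilities, uniform at first
    sample_x, sample_y = xs, ys    # the first model is fit to the full data
    ensemble = []                  # list of (model_weight, model) pairs
    for _ in range(rounds):
        model = fit_stump(sample_x, sample_y)
        # Weighted error of this model, measured on the original records.
        miss = [model(x) != y for x, y in zip(xs, ys)]
        err = sum(w for w, m in zip(weights, miss) if m)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        # When alpha > 0, misclassified records gain selection probability.
        weights = [w * math.exp(alpha if m else -alpha)
                   for w, m in zip(weights, miss)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Draw the next training sample according to those probabilities.
        idx = rng.choices(range(n), weights=weights, k=n)
        sample_x = [xs[i] for i in idx]
        sample_y = [ys[i] for i in idx]
    return ensemble, weights

def predict(ensemble, x):
    """Weighted vote of all fitted models."""
    score = sum(alpha * model(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1

# Invented toy data: no single stump can classify every record correctly.
xs = list(range(1, 11))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble, final_weights = boost(xs, ys, rounds=10, seed=0)
predictions = [predict(ensemble, x) for x in xs]
```

After the first round on this toy data, the three records that the first stump gets wrong together carry half of the total selection probability, so the next sample is dominated by the hard cases.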
Boosting is most often employed with decision trees, where it has produced significant gains in predictive performance. Why does it work? Boosting is a form of ensemble learning, in which predictions from multiple models are averaged, leveraging the “wisdom of the crowd,” reducing variance, and yielding better predictions than most of the individual models. By forcing the model to focus on the harder-to-predict cases and learn from them, boosting avoids the easy path of achieving a high accuracy score by relying mainly on the easy-to-predict cases.
ProfWeight is a technology from IBM that does the reverse of boosting: it down-weights the hard-to-predict cases.
The scenario for using ProfWeight is very specific: You have a black-box neural net model that works well, and you need to find a simpler model, either for interpretability or to minimize computation.
Deep learning neural nets are rapidly gaining popularity as superior performers in a wide variety of situations. However, they do not help you understand or explain the relationships between predictors and outcomes. That understanding is often necessary to communicate the role of statistical models to others in an organization, to see how predictions might be improved with other data, and to improve the processes that give rise to the data. Deep learning is also computationally expensive and can be impossible to implement on some sensors and in other memory-constrained environments.
In situations where the deep learning model generates very accurate predictions, ProfWeight uses information from that model to guide the creation of a simpler model that would otherwise not do as well. Like boosting, ProfWeight makes use of the hard-to-predict cases, but it down-weights them rather than up-weighting them. Specifically, reaching back into the intermediate layers of the neural net, where prediction error is more prevalent, ProfWeight identifies the records that have higher prediction error and then tells the simpler model not to focus on them. The simpler model is thus trained mainly on the easier-to-predict cases, i.e. the less noisy ones, and it improves because it is fit less to noise and more to signal.
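The down-weighting idea can be illustrated with a small sketch. It is not IBM's implementation, and every name and number in it is an assumption made for illustration: it supposes that classifiers attached to three intermediate layers of the neural net have already produced, for each record, a confidence in the true label (`probe_confidences`, invented values), averages those confidences into one weight per record, and then fits a hypothetical one-split model that minimizes weighted error.

```python
def sample_weights(probe_confidences):
    """One weight per record: mean confidence in the true label across layers."""
    return [sum(conf) / len(conf) for conf in probe_confidences]

def fit_weighted_stump(xs, ys, weights):
    """Hypothetical simple model: predict `sign` when x > t, otherwise -sign.

    Returns the (t, sign) pair with the smallest total weight of
    misclassified records.
    """
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            loss = sum(w for x, y, w in zip(xs, ys, weights)
                       if (sign if x > t else -sign) != y)
            if best is None or loss < best[0]:
                best = (loss, t, sign)
    return best[1], best[2]

# Invented toy data: the underlying signal is "y = +1 when x > 4",
# but the records at x = 6 and x = 7 carry noisy labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, -1, 1, -1, -1, 1]

# Invented confidences from three intermediate layers (assumed inputs).
# The two noisy records get low confidence at every layer.
clean = [0.8, 0.9, 1.0]
noisy = [0.1, 0.2, 0.3]
probe_confidences = [clean, clean, clean, clean, clean, noisy, noisy, clean]

weights = sample_weights(probe_confidences)
unweighted_model = fit_weighted_stump(xs, ys, [1.0] * len(xs))
weighted_model = fit_weighted_stump(xs, ys, weights)
```

On this toy data the unweighted stump chases the two noisy records and splits at 7, misclassifying the clean record at x = 5, while the down-weighted fit splits at 4 and recovers the underlying signal.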
How can both up-weighting and down-weighting work? The answer is that they are used in different situations:
Boosting is used when you have a weak model, and an ensemble approach can improve its predictive power. Up-weighting the difficult cases is used as part of the learning process, and focuses the ensemble effort where it is most needed.
ProfWeight is used when you already have a strong but complex black-box model, and you need to replace it with a simpler model. Down-weighting is used to reduce the influence of the noisy cases on the learning process of the simpler model.