Source: Deep Learning on Medium
Model validation is a key component of a machine learning pipeline. After all the work of collecting data and building and training a model, the model has no value if it cannot make accurate predictions (or at least predictions accurate enough for your domain!).
Classic Model Validation
Models are evaluated by their performance on the test set data, which is hold-out data not seen by the model during training and hyperparameter tuning.
Note that tuning should be done using the “validation” set or a cross-validation fold. I only bring this up so we don’t confuse the final model validation, on the test set, with the intermediate evaluation done using the validation set.
But is this final evaluation enough? Can we construct additional verification tests?
From Scientific Theory to Law
In the sciences, there are many statements that are difficult, or impossible, to prove. For example, one formulation of the second law of thermodynamics is that “it is impossible for heat, under its own volition, to transfer from a cold medium to a warmer one.”
This statement — that heat cannot transfer from cold to hot by itself — is a theory. In fact, it’s a theory that cannot be proven.
However, we can conduct a lot of experiments and, to date, no experiment has ever been conducted that violates this theory. So, in science, we take this as sufficient proof and the theory becomes a law.
This idea might seem silly at first. You are probably thinking, “Well, of course heat doesn’t go from cold to hot.” But that intuition wasn’t given to you at birth; you developed it after countless observations of heat transfer throughout your life.
You have conducted countless experiments (read as life experience), and you have not been able to disprove the second law of thermodynamics so you take it as valid.
We noted that we’ve never observed a violation of the second law of thermodynamics, but what if we actively conducted experiments to disprove it?
If all of these experiments failed, we would be inclined to believe the theory. This is the main idea behind adversarial controls.
We have our hypothesis, which we can’t necessarily prove. We generate and test opposing hypotheses, and if these hypotheses fail then we feel better about the original hypothesis.
Physicist J. R. Platt wrote a prominent paper advocating for the use of alternative and opposing hypotheses, an approach he calls strong inference.
Interestingly, Platt argues that some scientific fields progress faster than others not because they are more generously funded or have more tractable problems, but because their community members are more rigorous in their application of the scientific method.
It shouldn’t come as a surprise that machine learning models are often referred to as hypothesis functions. We can estimate our model performance with its score on the test set, but to really feel confident in our model we need to understand what patterns the model is learning.
This concept basically comes down to asking questions:
- On which test cases did the model fail?
- Were these cases “easy” for a human to understand?
- Are all the images of class x out of focus? Did the model simply learn to recognize blurry images?
Once we start to ask these questions, we can create alternative hypotheses (models). Ideally, these stand as “adversaries” to our original model.
Chuang and Keiser wrote a fantastic paper (only two pages!) outlining some of these questions, and their words have helped me switch to an adversarial mindset. 
Endanger Your Model
Ok, ok … ask questions, interrogate the model… but what about an adversarial control that I can start applying today?
A. H. Rushton said,
“A theory which cannot be mortally endangered cannot be alive.” 
If you really want to endanger your model, you should use y-scrambling.
A straw model, like a scarecrow, is brainless. You are probably familiar with some straw models already: a model that always predicts the mean for a quantitative response, or one that always predicts the most frequent class for a categorical response.
These models serve as good baselines; if your model can’t beat these, something is definitely wrong!
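These baselines take one line each in scikit-learn. A minimal sketch on synthetic data (all names and numbers here are illustrative, not from the original article):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_reg = 2.0 * X[:, 0] + rng.normal(size=200)   # quantitative response
y_clf = (X[:, 0] > 0).astype(int)              # categorical response

# Regression straw model: always predicts the training-set mean.
mean_model = DummyRegressor(strategy="mean").fit(X[:100], y_reg[:100])

# Classification straw model: always predicts the most frequent class.
mode_model = DummyClassifier(strategy="most_frequent").fit(X[:100], y_clf[:100])

# Any real model should comfortably beat these scores.
print("mean baseline R²:", mean_model.score(X[100:], y_reg[100:]))
print("mode baseline accuracy:", mode_model.score(X[100:], y_clf[100:]))
```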
In machine learning, and especially when applied to the sciences, we are looking for physical links between the input data and the observed response. That is, there is a correlation between the features (or their interactions) and the response variable and this makes sense because it’s capturing a physical mechanism.
The catch is that, given enough features (even purely random ones!), there is a good chance some of them will correlate with the response variable. The model then learns a link, but that link has no meaning.
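This chance-correlation effect is easy to demonstrate on pure noise. A quick sketch (sample sizes and seed are illustrative): with thousands of random features and only a few dozen samples, the best-correlated feature looks convincingly predictive despite zero true signal.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 50, 5000
X = rng.normal(size=(n_samples, n_features))   # purely random features
y = rng.normal(size=n_samples)                 # purely random response

# Pearson correlation of each feature with the response.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
num = (Xc * yc[:, None]).sum(axis=0)
den = np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())
corr = num / den

# The strongest spurious correlation is typically well above 0.5
# even though every feature is noise.
print("max |correlation|:", np.abs(corr).max())
```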
Y-scrambling is a type of straw model where you intentionally break any potential linkages between the input features and the response variable. This procedure is easy to implement: you simply shuffle the response variable of the training data and repeat the training procedure.
It is important to note that here the “training procedure” includes all the steps used to train the model, including feature selection and hyperparameter tuning using a validation set (which was also shuffled).
This straw model, by definition, cannot have learned any meaningful links between the input features and the response variable. Now compare your original (non-scrambled) model’s performance on the training and test sets to the scrambled model’s. If the original model does not perform significantly better, then there is no reason to believe it has learned anything of value, even if it has “high performance.”
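A minimal y-scrambling sketch, assuming a simple ridge model on synthetic data (in practice you would rerun your full pipeline, including feature selection and tuning, on the shuffled labels):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)  # real signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Original model, trained on the true labels.
real_model = Ridge().fit(X_tr, y_tr)

# Straw model: shuffle the training labels to break any feature-response
# link, then repeat the identical training procedure.
y_scrambled = rng.permutation(y_tr)
straw_model = Ridge().fit(X_tr, y_scrambled)

print("real  R²:", real_model.score(X_te, y_te))
print("straw R²:", straw_model.score(X_te, y_te))
```

If the gap between the two test scores is small, the “high performance” of the original model is not evidence of a meaningful mechanism.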
What About the Test Set?
Isn’t this exactly why we evaluate the model with respect to the test set? How could a model learn useless features and still generalize to unseen examples? Thus, a model trained on noise should perform miserably on the test set. So why use y-scrambling at all?
Unfortunately, it is possible for a model trained on noise to perform well on a test set. Simply put, the test set is, in most cases, drawn at random from the full data. Because of this randomness, there is a chance that the test data happens to follow the same useless pattern the model learned from the training data, or that the error between a particular group of test points and the predicted values is small by sheer luck.
Granted, over-optimistic test set scores are unlikely, but the only way to know for sure would be to evaluate against multiple test sets, which is usually impractical.
You can, however, train and test many y-scrambled straw models by reshuffling the response variable repeatedly.
If Possible, Scramble Many Times
Some of the scrambled permutations will, by chance, end up close to the original order of the response variable. This means that if your proposed model is good (as in, it did learn meaningful features), then a few y-scrambled straw models may have similar performance. But this does not invalidate your model, since the majority of the y-scrambled straw models will perform much worse.
In fact, if you feel so inclined, you can build a distribution of the y-scrambled performances and conduct a classical hypothesis test to determine whether the proposed model’s score is significantly different from the straw models’.
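Such a permutation-style check might look like the following sketch (synthetic data and 100 scrambles for speed; the data and model are illustrative). scikit-learn also ships a ready-made helper, `sklearn.model_selection.permutation_test_score`, that wraps this idea:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Score of the proposed (non-scrambled) model.
real_score = Ridge().fit(X_tr, y_tr).score(X_te, y_te)

# Distribution of scores from repeatedly reshuffled straw models.
straw_scores = np.array([
    Ridge().fit(X_tr, rng.permutation(y_tr)).score(X_te, y_te)
    for _ in range(100)
])

# Empirical p-value: fraction of straw models that match or beat
# the real model (with the usual +1 correction).
p_value = (1 + (straw_scores >= real_score).sum()) / (1 + len(straw_scores))
print(f"real R² = {real_score:.3f}, p ≈ {p_value:.3f}")
```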
An Example: Y-scrambling [not] in Action
Recently, I found this technical comment where y-scrambling was used to invalidate a published model. 
Without going into great detail, the original paper built several models (linear regression, k-nearest neighbors, support vector machine, neural network, and random forest) and achieved R² scores ranging between 0.64 and 0.93. The test set RMSE was also comparable to that computed on the training data.
However, the authors of the technical comment replicated the models from the original paper and found that y-scrambled straw models achieved nearly identical R² and RMSE results. Moreover, the technical comment evaluated two additional test sets and observed a stark decrease in the proposed model’s performance, demonstrating that the originally reported test set underestimated the generalization error.
This quick explanation, paired with the papers cited below, should demonstrate the benefits of using y-scrambling.
Please note that y-scrambling is not meant to replace other validation methods such as cross-validation and test set evaluation, but should be a supporting tool in your model validation toolbox.
In fact, you might just find that it is “[your] most powerful validation procedure.” 
Finally, if you want to learn more about scrambling-based straw models, you could track down this paper. 
p.s. – I have no affiliation with any of the referenced authors — just happy to share their fantastic work.
Cengel, Yunus A., and Michael A. Boles. Thermodynamics: An Engineering Approach. McGraw-Hill, 2002. [ch. 6, pg. 290]
Platt, John R. “Strong inference.” Science 146.3642 (1964): 347–353.
 Chuang, Kangway V., and Michael J. Keiser. “Adversarial controls for scientific machine learning.” (2018): 2819–2821.
 Chuang, Kangway V., and Michael J. Keiser. “Comment on “Predicting reaction performance in C–N cross-coupling using machine learning”.” Science 362.6416 (2018): eaat8603.
 Kubinyi, Hugo. “QSAR in drug design.” Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes (2003): 1532–1554.
Rücker, Christoph, Gerta Rücker, and Markus Meringer. “y-Randomization and its variants in QSPR/QSAR.” Journal of Chemical Information and Modeling 47.6 (2007): 2345–2357.