Original article can be found here (source): Artificial Intelligence on Medium
From Formulae to Code
There are often surprises in how a paper is actually implemented in practice. What’s mathematically convenient to write rarely corresponds to efficient code, and vice versa. Those two sentences don’t describe this paper, however: it’s pretty straightforward to code. Or it would be, if not for a major sin: the authors use the (features, datapoints) convention for data matrices. Statisticians sometimes do so; they are forgiven. Still, I had to change this in my implementation: practical machine learning works with the (datapoints, features) convention, period.
Machine learning has several advantages as a field: open research, lots of funding, great open-source software, and some useful conventions such as this one. As practitioners and researchers, it is our duty to defend those against negative external influences. I played my role as a brick in the wall against the (features, datapoints) convention, because I’m not the hero ̶G̶o̶t̶h̶a̶m̶ the machine learning community needs, but the one it deserves.
We’ll work mostly in a CPU environment, using scikit-learn and NumPy, as well as LightOnML, the Python API for the LightOn OPU. Let’s start with SRP. In the paper, with the — disgusted tone — (features, datapoints) convention, the projection matrix we’re looking for is U=XHΨᵀ, where X is the data matrix of size d x n, with n the number of samples and d that of features. H is the centering matrix of size n x n, and Ψ is a matrix of shape k x n, with k the dimension we are projecting to, containing random features that approximate a kernel applied to the labels. We’ll come back to it. For now, let’s convert the formula to the — angels’ choir — (datapoints, features) convention. First, we transpose everything to obtain a projection matrix of shape k x d: ΨHᵀXᵀ, and then we remember that H is symmetric and that we actually use the transpose of the other two matrices, leaving U=ΨᵀHX.
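As a sanity check, the conversion can be sketched in a few lines of NumPy. The shapes below are toy values of my own choosing, and Psi is just a random placeholder for the label features we build later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10, 3                    # samples, features, target dimension

# (datapoints, features) convention: X is n x d, Psi is n x k
X = rng.standard_normal((n, d))
Psi = rng.standard_normal((n, k))
H = np.eye(n) - np.ones((n, n)) / n    # centering matrix, symmetric

# Paper's (features, datapoints) convention: X_p is d x n, Psi_p is k x n
X_p, Psi_p = X.T, Psi.T
U_p = X_p @ H @ Psi_p.T                # the paper's X H Psi^T, shape d x k

# Converted formula: U = Psi^T H X, shape k x d
U = Psi.T @ H @ X

assert np.allclose(U, U_p.T)           # same matrix, just transposed
```

The assertion passes because H is symmetric, which is exactly the step used in the derivation above.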
What about H and Ψ? H is a special matrix that’s useful in equations: HX is X minus its mean row (and XH would be X minus its mean column). In NumPy, it’s much clearer and more efficient to write
Xc = X - X.mean(0). Ψ contains the random features. In a classification setting, we want the same labels to be close together, so we use the Delta kernel, which is 1 when two entries are equal and 0 otherwise. Or that’s what we do in SPCA. In SRP, the delta function is approximated by a very narrow Radial Basis Function (RBF, a Gaussian). Scikit-learn offers the
sklearn.kernel_approximation.RBFSampler, which we use with a large gamma to obtain narrow RBFs. You can do that and go on with your life. Or you can use an OPU with these three lines instead and, like Gandalf, be reborn from light:
Am I slightly overstating things? Totally, but you could also stop being a killjoy and appreciate that you just used light to approximate a kernel.
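For the CPU path, the whole SRP pipeline fits in a few lines. This is a hedged sketch rather than the snippet from the original post: the toy data and gamma=100.0 are my own choices, picked only to make the RBF narrow enough to mimic the delta kernel:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(0)
n, d, k = 100, 20, 10
X = rng.standard_normal((n, d))        # toy data, n x d
y = rng.integers(0, 3, size=n)         # toy class labels

# Narrow RBF random features on the labels approximate the delta kernel
sampler = RBFSampler(gamma=100.0, n_components=k, random_state=0)
Psi = sampler.fit_transform(y.reshape(-1, 1))   # n x k

Xc = X - X.mean(0)                     # HX, written as mean subtraction
U = Psi.T @ Xc                         # projection matrix, k x d
Z = X @ U.T                            # data projected to k dimensions
```

Swapping the RBFSampler for the OPU amounts to replacing how Psi is produced; the projection formula stays the same.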
KSRP will be quick now. In the paper, the formula is ΦHΨᵀ (I changed the notation slightly to accommodate Medium’s dreadful lack of typesetting features). It translates to ΨᵀHΦ in the (datapoints, features) convention. Φ is obtained in the same way as Ψ, but with the random features applied to the feature matrix X instead of the labels. Note that they generally approximate a different kernel than the one used for the labels. The rest is the same as for SRP, except that X is replaced by Φ. In my code, I also added the option of using the exact kernel matrix K for the features and approximating that of the labels with SRP, in which case the formula is KHΨ.
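A KSRP sketch under the same caveats as before: two RBFSampler instances stand in for the two sets of random features, with a narrow kernel for the labels and a wider one (gamma=1.0, my choice) for the data:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(0)
n, d, k, m = 100, 20, 10, 50
X = rng.standard_normal((n, d))        # toy data
y = rng.integers(0, 3, size=n)         # toy class labels

# Psi: narrow-RBF random features on the labels, n x k
Psi = RBFSampler(gamma=100.0, n_components=k, random_state=0).fit_transform(
    y.reshape(-1, 1))

# Phi: random features on the data, approximating a different kernel, n x m
Phi = RBFSampler(gamma=1.0, n_components=m, random_state=1).fit_transform(X)

U = Psi.T @ (Phi - Phi.mean(0))        # Psi^T H Phi, shape k x m
Z = Phi @ U.T                          # projected random features, n x k
```

To project new points, apply the fitted data sampler to them first and multiply by U.T, since the projection now lives in the random-feature space rather than the original feature space.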