Original article can be found here (source): Artificial Intelligence on Medium

**From Formulae to Code**

There are often surprises in how a paper is actually implemented in practice. What’s mathematically convenient to write rarely corresponds to efficient code, and vice-versa. Those two sentences don’t describe this paper, however: it’s pretty straightforward to code. Or it would be, if not for a major sin: the authors use the *(features, datapoints)* convention for data matrices. Statisticians sometimes do so; they are forgiven. Still, I *had to* change this in my implementation: practical machine learning works with the *(datapoints, features)* convention, period.

Machine learning has several advantages as a field: open research, lots of funding, great open-source software, and some useful conventions such as this one. As practitioners and researchers, it is our duty to defend those against negative external influences. I played my role as a brick in the wall against the *(features, datapoints)* convention, because I’m not the hero ~~Gotham~~ the machine learning community needs, but the one it deserves.

We’ll work mostly in a CPU environment, using scikit-learn and NumPy, as well as LightOnML, the Python API for the LightOn OPU. Let’s start with SRP. In the paper, with the — disgusted tone — *(features, datapoints)* convention, the projection matrix we’re looking for is *U = XHΨᵀ*, where *X* is the data matrix of size *d* x *n*, with *n* the number of samples and *d* the number of features. *H* is the centering matrix of size *n* x *n*, and *Ψ* is a matrix of shape *k* x *n*, with *k* the dimension we are projecting to, containing random features that approximate a kernel applied to the labels. We’ll come back to it. For now, let’s convert the formula to the — angels’ choir — *(datapoints, features)* convention. First, we transpose everything to obtain a projection matrix of shape *k* x *d*: *ΨHᵀXᵀ*, and then we remember that *H* is symmetric and that, in the new convention, we store the transpose of the other two matrices, leaving *U = ΨᵀHX*.
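The convention flip is easy to check numerically. Here is a sketch with random placeholder matrices (the shapes and variable names are mine, not the paper’s): `Psi` is stored as *(datapoints, k)* in the new convention, so `Psi.T` plays the role of the paper’s *Ψ*.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10, 5

X_old = rng.normal(size=(d, n))      # (features, datapoints): the paper's convention
X = X_old.T                          # (datapoints, features): ours
Psi_old = rng.normal(size=(k, n))    # the paper's Ψ, shape (k, n)
Psi = Psi_old.T                      # ours, shape (n, k)

# Centering matrix H = I - (1/n) 11ᵀ, symmetric
H = np.eye(n) - np.ones((n, n)) / n

U_paper = X_old @ H @ Psi_old.T      # paper: U = X H Ψᵀ, shape (d, k)
U_ours = Psi.T @ H @ X               # ours:  U = Ψᵀ H X, shape (k, d)

# The two are transposes of each other, as derived above
assert np.allclose(U_ours, U_paper.T)
```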

What about *H* and *Ψ*? *H* is a special matrix that’s useful in equations: *HX* is *X* minus its mean row (and *XH* would be *X* minus its mean column). In NumPy, it’s much clearer and more efficient to write `Xc = X - X.mean(0)`.
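That identity is quick to verify with a random matrix (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 3))

# Centering matrix: H = I - (1/n) 11ᵀ
H = np.eye(n) - np.ones((n, n)) / n

# HX subtracts the mean row of X; no n x n matrix product needed
assert np.allclose(H @ X, X - X.mean(0))
```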

*Ψ* contains the random features. In a classification setting, we want samples with the same label to be close together, so we use the delta kernel, which is 1 when two entries are equal and 0 otherwise. Or rather, that’s what we do in SPCA. In SRP, the delta function is approximated by a very narrow Radial Basis Function (RBF, a Gaussian). Scikit-learn offers `sklearn.kernel_approximation.RBFSampler`.
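Putting the pieces together, the CPU route looks roughly like this. It’s a sketch: the values `gamma=100.0` and `n_components=512` are illustrative choices of mine, not taken from the paper.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.RandomState(0)
y = rng.randint(0, 3, size=100)     # class labels for 100 samples
X = rng.normal(size=(100, 20))      # data in the (datapoints, features) convention

# Random features approximating a narrow RBF kernel on the labels:
# exp(-gamma * (y_i - y_j)**2) is ~1 for equal labels and ~0 otherwise
# when gamma is large, i.e. it approximates the delta kernel.
sampler = RBFSampler(gamma=100.0, n_components=512, random_state=0)
Psi = sampler.fit_transform(y.reshape(-1, 1))   # shape (n, k)

# Supervised projection matrix U = Ψᵀ H X, with HX written as X minus its mean row
U = Psi.T @ (X - X.mean(0))                     # shape (k, d)
```

New data can then be projected with `X_new @ U.T`, landing in the *k*-dimensional supervised subspace.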

We use it with a large gamma to obtain narrow RBFs. You can do that and go on with your life. Or you can use an OPU with these three lines instead and, like Gandalf, be reborn from light: