Original article can be found here (source): Deep Learning on Medium
Throwing the random kitchen sinks at the problem
The major drawback of kernel methods is that, given n data points, one needs to compute the correlation for all n² pairs of points. Written in matrix form, these correlations give an n×n kernel matrix that is very expensive to compute and store when the number of data points is large. Given current computational capabilities, exact kernel methods are therefore restricted to moderately sized datasets — n ~ 10⁴ points.
Let us assume a problem with n data points of dimension d. Using traditional kernel methods, we would need to compute and store a matrix of n² elements. If n = 10⁶, this matrix has 10¹² entries and requires terabytes of memory (4 TB in single precision). Naively inverting it to perform kernel ridge regression would cost n³ = 10¹⁸ operations — about an exaop. As a result of this difficult scaling, several solutions have been proposed to overcome this limitation.
Ali Rahimi and Ben Recht developed the technique of random features [4, 5]: a distance-dependent kernel (such as the RBF) can be approximated in expectation by the dot product of randomized feature maps, with the randomness sampled from the Fourier transform of the kernel. For instance, for the RBF kernel we have:

k(x, y) = exp(−‖x − y‖² / (2σ²)) = 𝔼[cos(wᵢᵀ(x − y))]

where wᵢ ∼ N(0, 𝕀 / σ²).
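As a quick sanity check, this approximation can be reproduced in a few lines of NumPy (a minimal sketch, with σ = 1 and arbitrary dimensions; D here is the number of random features):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 20000, 1.0

x, y = rng.standard_normal(d), rng.standard_normal(d)

# Frequencies sampled from the Fourier transform of the RBF kernel
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Random Fourier feature map: z(v) = sqrt(2/D) * cos(Wv + b)
def z(v):
    return np.sqrt(2.0 / D) * np.cos(W @ v + b)

approx = z(x) @ z(y)  # dot product of the randomized feature maps
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))
```

With D = 20,000 features the two values typically agree to about two decimal places; the approximation error shrinks as O(1/√D).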
Using D random features, we only need to store the data projected with the random features (an n×D matrix) and the covariance matrix (D×D). Assuming for example n = 10⁶, d = 10³, D = 10⁴, we have reduced the memory requirement by a factor of 100×. Moreover, the ridge regression solution now involves the inversion of a 10⁴×10⁴ matrix, instead of a matrix of size 10⁶×10⁶.
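In code, the whole pipeline amounts to solving a D×D linear system instead of an n×n one. Below is a sketch in NumPy (the data, target, and hyper-parameters are made up for illustration; `lam` is the ridge penalty):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, lam = 2000, 10, 200, 1e-3

# Synthetic regression problem (stand-in data)
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = np.tanh(X @ w_true) + 0.01 * rng.standard_normal(n)

# Random Fourier features: store Z (n x D), never the n x n kernel matrix
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Ridge regression: the linear system solved is D x D (here 200 x 200)
alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
pred = Z @ alpha
```

The solve costs O(D³) instead of O(n³), and only the n×D projected data ever needs to be stored.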
Thus, if the number of random features used is much lower than the number of samples, there is a substantial reduction in the complexity of the problem. Recent work [6] has proven that a number of random features of order

D ∼ √n · log n

is sufficient, under certain reasonable assumptions, to obtain the same statistical accuracy as the exact kernel ridge regression. In our example with n = 10⁶, we would need about 14,000 random features to reach statistical optimality.
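The quoted figure follows directly from the √n·log n scaling:

```python
import math

n = 1_000_000
# Number of random features sufficient for optimal rates: sqrt(n) * log(n)
n_features = math.sqrt(n) * math.log(n)
print(round(n_features))  # 13816 — the ~14,000 quoted above
```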
LightOn’s OPU performs the following operation at the speed of light:

f(x) = |Ux|²

where U ∈ ℂᴰˣᵈ is a matrix whose elements are sampled from the complex Gaussian distribution, x ∈ ℝᵈ, and the absolute value squared is applied element-wise. The exponent can be changed numerically, leading to different kernels. In this randomized feature map, the input and output sizes (d and D) can scale up to a million.
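The same map is easy to simulate in NumPy with an explicit complex Gaussian matrix — a sketch for intuition only, since the whole point of the OPU is that this multiplication happens optically and the matrix is never stored:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 1000, 10000

x = rng.standard_normal(d)

# Complex Gaussian transmission matrix (realized physically by the OPU)
U = (rng.standard_normal((D, d)) + 1j * rng.standard_normal((D, d))) / np.sqrt(2.0)

# Optical random feature map: element-wise squared modulus of Ux
features = np.abs(U @ x) ** 2
```

Each feature |uᵢᵀx|² is a non-negative intensity whose expectation is ‖x‖².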
We computed the kernels associated with the optical random features: they are dot-product kernels that measure the angular correlation between data points. Numerically, we can change the exponent of the feature map, and when the exponent m is even, i.e. m = 2s, the kernel takes the form:

k(x, y) = Σᵢ₌₀ˢ Cₛ,ᵢ (‖x‖ ‖y‖)²⁽ˢ⁻ⁱ⁾ (xᵀy)²ⁱ

where Cₛ,ᵢ is a coefficient depending on s and i. For the feature map of exponent m = 2, the kernel is:

k(x, y) = ‖x‖² ‖y‖² + (xᵀy)²
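The m = 2 case can be checked by Monte Carlo: averaging products of simulated optical features over many random rows converges to ‖x‖²‖y‖² + (xᵀy)² (dimensions and sample count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 200000

x, y = rng.standard_normal(d), rng.standard_normal(d)
U = (rng.standard_normal((D, d)) + 1j * rng.standard_normal((D, d))) / np.sqrt(2.0)

# Empirical m = 2 kernel: average of products of the optical features
k_mc = np.mean(np.abs(U @ x) ** 2 * np.abs(U @ y) ** 2)

# Closed form for m = 2
k_exact = np.linalg.norm(x) ** 2 * np.linalg.norm(y) ** 2 + (x @ y) ** 2
```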
By increasing the exponent of the feature map, we measure higher-order correlations between the data points, thus obtaining a finer similarity measure, at the cost of increased risk of over-fitting.
Sinking the error ship in a sea of dimensions
We started by using optical random features in a kernel ridge regression (KRR) setup: we projected the data using the OPU, then performed penalized linear regression in the new space. We used the Fashion-MNIST dataset and compared optical random features of different exponents to the commonly used RBF random features. We projected the data up to dimension 10⁵ and plotted the classification error as a function of the projection dimension (Figure 2). The exact kernel accuracies for the RBF and the optical (m = 2) kernels are reached at dimension D = 10⁵.
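This setup can be sketched end-to-end in NumPy. The snippet below stands in for the real experiment — synthetic data replaces Fashion-MNIST, RBF random features replace the OPU projection, and only training accuracy is reported — but it shows why the non-linear feature map matters: it separates a class that no linear classifier on the raw inputs can.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, lam = 1000, 20, 2000, 1e-6

# Synthetic stand-in for Fashion-MNIST: an XOR-like, non-linear class
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] * X[:, 1])

# Project with RBF random features, then penalized linear regression on +/-1 labels
W = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
acc_rf = float(np.mean(np.sign(Z @ alpha) == y))

# Linear baseline on the raw inputs: near chance level on this task
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
acc_lin = float(np.mean(np.sign(X @ beta) == y))
```

With D = 2000 features the random-feature classifier fits the training set essentially perfectly, while the linear baseline stays near 50%.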
Convolutional neural networks (CNNs) are best suited for image classification, so we turned to transfer learning. We combined the power of CNNs and of random features by replacing the last dense layer of common pre-trained networks with a kernel ridge regression classifier, where the random projection is performed using the OPU.
We projected to dimension 10⁴ using the OPU. In architectures with a smaller final layer (e.g. ResNet34, d = 512), we obtained an increase in accuracy of more than 2.4%. With a final layer size of the order of the projection dimension (e.g. AlexNet, d = 9216), we conserved the same accuracy. However, when the projection dimension is smaller than the size of the final layer (e.g. VGG16, d = 25088), the projection acts as a non-linear compression, potentially decreasing the accuracy. There is therefore a trade-off to be found between speedup and machine learning performance.
The light energy consumption of light computation
In Figure 3, we compare the energy consumption of the OPU with that of a GPU. The energy consumption of the optical processing is independent of the input and output dimensions and is about 45 J (30 W times 1.5 s). The GPU consumes more energy as the projection dimension increases, up to 7× more (~315 J) than the OPU at D = 56,000, where the GPU hits its memory limit.
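The figures above reduce to one line of arithmetic:

```python
# Energy figures from the text: constant OPU draw vs. dimension-dependent GPU draw
opu_energy = 30 * 1.5          # 30 W for 1.5 s -> 45 J per projection
gpu_energy = 7 * opu_energy    # ~7x more at D = 56,000 -> 315 J
print(opu_energy, gpu_energy)  # 45.0 315.0
```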
A factor of 7× in energy consumption is the difference between the energy required to lift a suitcase and that required to lift a washing machine up two floors (~450 J vs ~3,150 J)!
“My back still hurts.”
Laurent Daudet, CTO at LightOn, about his experience in moving a washing machine up two floors.
It’s your turn now
You can find the code used for the paper here: https://github.com/joneswack/opu-kernel-experiments
LightOn is a hardware company that develops new optical processors that considerably speed up big data computations. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding up your computations? Try out our solution on LightOn Cloud! 🌈
[1] Bernhard Schölkopf, Alexander J. Smola, Francis Bach, et al., “Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond”, MIT Press, 2002.
[2] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco, “FALKON: An Optimal Large Scale Kernel Method”, Advances in Neural Information Processing Systems, 2017, pp. 3888–3898.
[3] Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, chapter 14.4.3, pp. 492–493, MIT Press, 2012.
[4] Ali Rahimi and Benjamin Recht, “Random Features for Large-Scale Kernel Machines”, Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[5] Ali Rahimi and Benjamin Recht, “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning”, Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
[6] Alessandro Rudi and Lorenzo Rosasco, “Generalization Properties of Learning with Random Features”, Advances in Neural Information Processing Systems, 2017.