By increasing the exponent of the feature map, we measure higher-order correlations between the data points, thus obtaining a finer similarity measure, at the cost of an increased risk of over-fitting.
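To make this concrete, here is a minimal numerical sketch of such a feature map, under the assumption that the degree-m optical random feature is the modulus of a complex Gaussian projection raised to the power m; the OPU performs this random projection optically, while the snippet below simply emulates it in NumPy.

```python
import numpy as np

def optical_random_features(X, D, m=2, seed=0):
    """Simulated degree-m optical random features.

    Assumption: each feature is |w_i^T x|^m with w_i an i.i.d. complex
    Gaussian vector. On the OPU the random projection and the squared
    modulus (m=2) happen optically; here everything runs on CPU.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], D)) + 1j * rng.normal(size=(X.shape[1], D))
    return np.abs(X @ W) ** m  # shape (n, D)
```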

Sinking the error ship in a sea of dimensions
We started by using optical random features in a kernel ridge regression (KRR) setup: we projected the data with the OPU, then performed penalized linear regression in the new space. We used the Fashion-MNIST dataset and compared optical random features of different exponents to the commonly used RBF Fourier random features. We projected the data up to dimension 10⁵ and plotted the classification error as a function of the projection dimension D (Figure 2). The exact kernel accuracies for the RBF and the optical (m=2) kernels are reached at D = 100,000.
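As an illustration of this setup (not the authors' exact pipeline, which lives in the repository linked at the end), the sketch below fits a ridge classifier on simulated optical random features and on RBF Fourier random features for a subsample of Fashion-MNIST. The subsample sizes, the projection dimension and the hyperparameters gamma and alpha are illustrative assumptions, and `optical_random_features` is the helper defined in the earlier sketch.

```python
from sklearn.datasets import fetch_openml
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Small subsample of Fashion-MNIST so the sketch runs quickly on CPU.
X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(
    X / 255.0, y, train_size=5_000, test_size=2_000, random_state=0
)

D = 2_000  # projection dimension (the experiments go up to 10^5 on the OPU)

# RBF Fourier random features as the baseline vs simulated optical features.
rbf = RBFSampler(gamma=0.01, n_components=D, random_state=0)
features = {
    "RBF Fourier": (rbf.fit_transform(X_tr), rbf.transform(X_te)),
    "Optical m=2": (optical_random_features(X_tr, D, m=2),
                    optical_random_features(X_te, D, m=2)),
}

# Penalized linear regression (ridge) on top of each random projection.
for name, (F_tr, F_te) in features.items():
    clf = RidgeClassifier(alpha=1.0).fit(F_tr, y_tr)
    print(f"{name}: test accuracy = {clf.score(F_te, y_te):.3f}")
```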

Figure 2: Ridge regression test error on Fashion-MNIST for different RFs and projection dimensions D. Horizontal lines show the test error using the true kernel. Standard deviations across seeds are negligibly small and not shown. Plot (a) compares optical RFs of degree m=2 to RBF Fourier RFs; higher-degree optical RFs are left out for readability. The more slowly converging optical RFs for m=4 are added for larger D in plot (b).
Convolutional neural networks (CNNs) are particularly well suited to image classification, so we turned to transfer learning. We combined the power of CNNs with that of random features to increase the accuracy of a simple CNN: we replaced the last dense layer of common pre-trained networks with a kernel ridge regression classifier, where the random projection is performed by the OPU. We projected to dimension 10⁴ using the OPU. For architectures with a smaller final layer (e.g. ResNet34, d = 512), we obtained an increase in accuracy of more than 2.4%. With a final layer size of the order of the projection dimension, we kept the same accuracy (e.g. AlexNet, d = 9216). However, when the projection dimension is smaller than the size of the final layer (e.g. VGG16, d = 25088), the projection acts as a non-linear compression and can decrease accuracy. There is therefore a trade-off to be found between speedup and machine learning performance.
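A sketch of this transfer-learning pipeline is shown below, again with the optical projection simulated on CPU via the `optical_random_features` helper from the first sketch; on the real device the projection is driven through LightOn's client library and inputs are additionally binary-encoded, a step omitted here. The data loaders and hyperparameters are assumptions for illustration, not values taken from the paper.

```python
import numpy as np
import torch
import torchvision
from torchvision import transforms
from sklearn.linear_model import RidgeClassifier

# Pre-trained ResNet34 with its final dense layer removed: the network
# becomes a d=512-dimensional feature extractor.
backbone = torchvision.models.resnet34(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x).numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# `train_loader` and `test_loader` are assumed to be DataLoaders built with
# `preprocess` on the image dataset of interest.
F_tr, y_tr = extract_features(train_loader)
F_te, y_te = extract_features(test_loader)

# Replace the dense classifier with random features + ridge regression.
# Here the optical projection is simulated; on the OPU the matrix multiply
# and the squared modulus are computed optically.
D = 10_000
Z_tr = optical_random_features(F_tr, D, m=2)
Z_te = optical_random_features(F_te, D, m=2)
clf = RidgeClassifier(alpha=1.0).fit(Z_tr, y_tr)
print("test accuracy:", clf.score(Z_te, y_te))
```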

The light energy consumption of light computation
Figure 3: Time and energy spent computing a matrix multiplication of shapes (n, D) × (D, D). The batch size n is 3,000 (solid line) or 1,000 (dotted line). The OPU is compared to an NVIDIA P100 GPU.
In Figure 3, we compare the energy consumption of the OPU with that of the GPU. The energy consumption of the optical processing is independent of the input and output dimensions and is about 45 J (30 W times 1.5 seconds). The GPU consumes more energy as the projection dimension increases, up to 7× (~315 J) more than the OPU at D = 56,000, where the GPU hits its memory limit.
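For reference, the arithmetic behind these figures:

```python
# Energy = power x time for the OPU; the GPU figure is the measured ~7x ratio.
opu_power_watts, opu_time_s = 30, 1.5
opu_energy = opu_power_watts * opu_time_s   # 45 J, independent of D
gpu_energy = 7 * opu_energy                 # ~315 J at D = 56,000
print(opu_energy, gpu_energy)               # 45.0 315.0
```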

A factor of 7× in energy consumption is the difference between the energy required to lift a suitcase and a washing machine up two floors (~450 J vs ~3,150 J)!

“My back still hurts.” Laurent Daudet, CTO at LightOn, on his experience of moving a washing machine up two floors.

It’s your turn now
You can find the code used for the paper here: https://github.com/joneswack/opu-kernel-experiments

You can try to reproduce it or run your own computations on an OPU through our LightOn Cloud. The Cloud will be available on March 31st, and you can subscribe here.

LightOn and LightOn Cloud support research. From March 31st, you will be able to apply to the LightOn Cloud Program for Research on the LightOn Cloud website.

About Us
LightOn is a hardware company that develops new optical processors that considerably speed up big data computation. LightOn's processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding up your computations? Try out our solution on LightOn Cloud! 🌈

Follow us on Twitter at @LightOnIO, subscribe to our newsletter, and register for our workshop series. We live stream, so you can join from anywhere. 🌍

The author
Ruben Ohana is a PhD student at Ecole Normale Supérieure, INRIA & LightOn. Ruben acknowledges support from Région Ile-de-France.
