TensorFlow.js is a great tool, but it is still too slow to process a video input in real time in the browser for a real use-case.
How can I run or train a deep learning model in the browser?
How can I do deep learning in the browser as efficiently as possible?
Why client-side processing matters
If we want to make the web smarter, we need to integrate machine-learning-based algorithms into web applications. Currently, the state of the art in machine learning is achieved by deep learning. Our web application collects data from the user's web browser. Then it either processes the data directly in the web browser (client side) or sends it to a server, which returns the result later (server side).
Server-side processing is often easier because it does not depend on the user's configuration, and we can use almost unlimited computing power. If the data is lightweight and the application does not require very low latency, we should process the data server side. For example, a conversational agent should be implemented server side: the data consists of text chunks, and a delay of around one second is acceptable.
But if we want to analyze a video stream, server-side processing is not the best solution. The data is heavy, so it requires a large bandwidth and powerful servers. The best way is to process it client side. But since the user may have a cheap device, we need a very efficient deep learning engine. That is why we developed the Jeeliz deep learning technology. We focus mainly on video applications because this is where client-side processing is almost mandatory.
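To give a rough idea of why streaming video to a server is costly, here is a back-of-the-envelope sketch. The resolution, framerate and per-pixel bit budget below are illustrative assumptions, not figures from this article:

```javascript
// Rough estimate of the sustained upload bandwidth needed to stream
// a webcam feed to a server (illustrative numbers).
function uploadBitrateMbps(width, height, fps, bitsPerPixel) {
  // bitsPerPixel is a *compressed* bit budget per pixel;
  // ~0.1 bit/px is a common ballpark for H.264 at decent quality.
  return (width * height * fps * bitsPerPixel) / 1e6;
}

const mbps = uploadBitrateMbps(1280, 720, 30, 0.1);
console.log(mbps.toFixed(1) + ' Mbps'); // "2.8 Mbps" — per user, continuously
```

Even a modest 720p feed demands a continuous multi-megabit upload per user, which the server must also decode and process — while the same pixels are already sitting, for free, in the user's GPU memory.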
Some competitors still bet on server-side processing, arguing that the generalization of 5G networks and cheap cloud GPU computing power will make it attractive. But:
- Even with 5G there will be network cuts, for example if the user enters a metal elevator. Moreover, 5G is very fast only at short range, so it will mostly cover urban areas.
- Virtual reality is coming, with its higher display resolutions: 4K and then 8K are the next standards. As video resolutions and framerates increase, streaming them will require ever higher bandwidth.
- It is true that GPU cloud computing prices are decreasing while the power increases. But the same trend is observed for laptop and mobile GPUs.
- Using the user’s GPU will always be free, while GPUs in the cloud will never be.
Why speed matters
Currently, client-side deep learning demonstrations and applications relying on engines other than Jeeliz don’t have any commercial use. We can enumerate:
- toy demonstrations learning on video,
- weird drawing and music generation experiments,
- image processing demos using GAN models. Such a GAN model weighs more than 100 MB and takes more than 200 ms to process one image on a gaming laptop, so it would be better run server side,
- chatbots, which would run better server side.
They cannot process the video stream efficiently enough for the most interesting and valuable use-cases. This is quite frustrating, because if we look at the state of the art of deep learning video processing we see many incredible applications like deep fakes, style transfer or face generation. But most of them:
- Don’t run in real time, or require a $5000 graphics card with CUDA,
- Run in a controlled environment, using video from a quality external webcam and good scene illumination,
- Use a model with tens of convolutional layers, sometimes weighing gigabytes, impossible to load in a web browser.
Our deep learning technology allows us to get closer to the state of the art in the browser. We were the first company to offer a commercial product based on deep learning in the web browser, with our glasses virtual try-on application in 2016. Since then we have explored other tracks like:
- Expression recognition, similar to Apple’s Animoji application,
- Object recognition and tracking for augmented reality
We keep improving our deep learning engine and our models for these use cases. And at the same time we keep exploring new possibilities.
An integrated environment
For real-time video analysis, the developer who requires a deep-learning-based solution is usually not a deep learning specialist. They need an integrated API more than a deep learning API. If they want to detect and track a face from the webcam video feed to build a funny face filter, they can use Jeeliz FaceFilter without any machine learning knowledge.
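As a sketch of what such an integrated API looks like, here is a minimal setup based on the public jeelizFaceFilter repository. The parameter names and the shape of `detectState` are assumptions from that repository and may differ between versions:

```javascript
// Hypothetical minimal face-filter setup (names based on the public
// jeelizFaceFilter repository; they may differ across versions).
const faceFilterConfig = {
  canvasId: 'jeeFaceFilterCanvas',      // <canvas> element to render into
  NNCPath: './neuralNets/',             // path to the trained network file
  callbackReady: function (errCode) {   // called once the engine has loaded
    if (errCode) {
      console.log('initialization error:', errCode);
      return;
    }
    console.log('face filter ready');
  },
  callbackTrack: function (detectState) { // called every frame
    // detectState.detected is a face-presence score;
    // detectState.x, detectState.y, detectState.s give position and scale.
    if (detectState.detected > 0.8) {
      // draw the funny mask at (detectState.x, detectState.y)...
    }
  }
};

// In the browser, after loading the library script:
// JEELIZFACEFILTER.init(faceFilterConfig);
```

The developer writes only the rendering logic inside `callbackTrack`; the webcam capture, the neural network and the detection loop stay hidden behind the API.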
So we propose solutions which hide the complexity of deep learning. Three aspects are hidden from the final user:
- The complexity of the training: to avoid overfitting, we train our models exclusively using 3D generators. We use a powerful internal training application to monitor the training of network models and to easily change hyperparameters, without having to write, test and maintain hundreds of Python scripts,
- The structure of the neural network: if the final user is not a deep learning expert, they may not be able to guess the best structure for the neural network. In particular, the balance between the accuracy of the result and the computational complexity is hard to determine,
- The implementation of the trained neural network: the outputs of a neural network are always noisy, so we need to filter them. We also need to get the video stream, move the detection window over the full video frames until an object is detected, etc.
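To illustrate the filtering step mentioned above: one common way to stabilize jittery per-frame network outputs is an exponential moving average. This is an illustrative sketch, not Jeeliz's exact filter:

```javascript
// Per-frame neural network outputs (e.g. a face position) jitter, so a
// simple low-pass filter — here an exponential moving average — is one
// common way to stabilize them before rendering.
function makeSmoother(alpha) {
  let state = null; // filtered value, initialized on the first sample
  return function smooth(rawValue) {
    state = (state === null)
      ? rawValue                              // first sample passes through
      : alpha * rawValue + (1 - alpha) * state; // blend new sample with history
    return state;
  };
}

const smoothX = makeSmoother(0.3); // lower alpha = smoother but laggier
console.log(smoothX(10)); // 10
console.log(smoothX(14)); // ≈ 11.2 — the frame-to-frame jump is damped
```

The single `alpha` parameter trades responsiveness against stability, which is exactly the kind of tuning an integrated API can hide from the developer.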
Originally published at jeeliz.com.
Source: Deep Learning on Medium