Source: Deep Learning on Medium
I am writing this post to guide newcomers to AI/data science on which skills to learn. It can be tough to navigate your way through all the technologies, but that is something you learn with experience. I will try to explain the primary concepts of end-to-end machine learning, which technologies you should go after, and what the role of the modern, non-PhD AI engineer looks like.
Production-ready AI is as much about data engineering as it is about machine learning. There are primarily five questions you need to address:
- How to scale your system to handle, potentially, billions of requests?
- How to preprocess huge datasets?
- Machine learning is a vast field. Which algorithms and use-cases are most in demand?
- How much math is enough?
- Do you need a PhD?
All of these are questions that arise once you start thinking about building a real-world AI application. Let's talk about them one by one.
How to scale your system?
Before answering that, let's think about why we need to scale our system at all. Imagine an application like Twitter or Facebook, with real-time data coming in from billions of users. Imagine that you need to classify text as "hate speech", analyze its "sentiment", or see what is inside the photos uploaded to Instagram and produce automatic tags. You need to do all of this, and you need to do it in real time! (For newbies: real time basically means instantly.)
A couple of decades back, all of this was very challenging. You needed to buy a lot of servers that could respond to each request and run the ML model to answer each query. Now, however, you can do this seamlessly thanks to cloud computing, cloud vendors like Google, AWS, and Azure, and research in virtualization. Public cloud vendors give you the ability to spin up huge numbers of nodes on demand. They also allow you to seamlessly add or remove machines based on the load your application is experiencing. A lot of the cloud "infrastructure" is programmable and can be modified just by changing a line in a script. Without going into the details of how cloud vendors achieve this (hint: virtualization), cloud computing is a skill you should possess if you are to build scalable applications that can serve billions of users.
How to preprocess huge datasets?
If you look at the job descriptions of most data science roles, you will see that everyone keeps mentioning "Apache Spark" (although many people mention it just because it's a buzzword). What is Spark? Why do people associate it with machine learning? Why should we care about it?
Spark, at its very core, is a general-purpose data-processing framework. It was made to process large amounts of data using in-memory data structures. A Spark application normally has a driver node (a node is just a fancy way of saying a computer) which dictates "jobs" to different workers/executors (separate machines running the code assigned to them). So, in essence, Spark solves the problem of processing "big data". That is not all, though. Spark has a built-in ML library, a library for handling graphs, a library for SQL-like queries, and a module for streaming data. Spark's ML library trains models by distributing the workload between different nodes, and that makes it extremely powerful.
That's all good. But is Spark the only framework with all that? No! Cloud vendors like Google offer fully managed services that can do all of your big-data processing at scale. You can pre-process and transform data, all in real time.
So… when should we use Spark? Well, that's a question with many answers; it depends on a lot of variables. The ideal scenario for Spark is where the data is sensitive and so can't be moved to the public cloud (Google, AWS, etc.). A lot of banks fall into that category, and a lot of telcos as well. In such a setting, you basically have a private data center and you would like to analyze your data locally. This is where the value of Spark becomes apparent: you can deploy your own clusters and carry out data transformations and machine learning on the data simultaneously. You can build custom data pipelines for real-time machine learning. There will always be a limit to the amount of processing you can do, though, since in such a setting you are limited by your own hardware.
If your data is not sensitive, then moving to the cloud becomes the natural answer. Google's Dataflow, Dataprep, and Pub/Sub all solve the challenges of data processing at scale and can scale far beyond an on-premise Spark cluster.
Nevertheless, Spark is a great tool to master, as there are still a lot of companies that don't want to migrate their data to the cloud.
Machine learning is a vast field. Which algorithms and use-cases are most in demand?
Machine learning is all about prediction. Companies are trying to predict whether the next flight will be cancelled, or whether a given transaction is fraudulent. Why is prediction so important for companies? Well, first of all, the prediction is automatic (no human involved), and secondly, it helps companies adjust resources or gather information accordingly. This is where AI creates value for companies.
Even though there is a lot of AI hype in the world, there are only a few tried-and-tested use-cases for businesses like banks and telcos. For example, customer segmentation (grouping together people with similar interests), churn prediction (predicting when a person will leave the service), and anomaly detection (e.g. fraud detection for transactions) are use-cases employed by almost every bank and telco out there, and thus should be studied thoroughly.
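Customer segmentation, for instance, is often a straightforward clustering problem. Here is a hypothetical sketch using scikit-learn's k-means: the features (monthly spend, support calls) and the customer values are invented purely for illustration.

```python
# Toy customer segmentation with k-means clustering.
# Each row is a (monthly spend, support calls) pair; values are made up.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [20, 1], [22, 0], [25, 2],     # low spend, few support calls
    [90, 8], [95, 10], [88, 9],    # high spend, many support calls
])

# Ask for two segments; in practice you would tune the cluster count.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = model.labels_
print(labels)  # one segment id per customer
```

Churn prediction and fraud detection follow the same pattern with supervised classifiers instead of clustering.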
Remember one thing: the problem these days is not the algorithm, it is the data. People don't have relevant data organized in the right way for ML to be applied. A major reason for this is that a lot of people in the C-suite do not understand AI or its uses. So, practically speaking, you will often come across a case where people want to do something but don't have the data.
This is the point where you should realize that you are not Google. You don't have access to their resources and data, and you cannot beat Google by building a better model with no data. There is no easy way to solve this. However, it does help to know techniques like "transfer learning", and you should know about the open-source pre-trained models available online. What this basically means is that you get access to models trained on much more data than you could ever gather, and you can customize them to your needs. Knowing these techniques lets you handle use-cases where data is the bottleneck.
A modern AI engineer should also be able to handle common ML tasks involving images, text, and speech. You should have foundational knowledge of all three domains and be good at one ML task in each. I say one because that helps you understand the terminology of the other tasks in the same domain. Sample tasks are image classification for images, automatic speech recognition for speech, and sentiment analysis for text. Lastly, you should know some algorithms, such as decision trees, random forests (and their variants), and neural nets, in significant detail. These are the ones most commonly used.
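As a quick illustration of one of those bread-and-butter algorithms, here is a random forest classifier on scikit-learn's built-in iris dataset; the hyperparameters are just sensible defaults, not a tuned setup.

```python
# Random forest classification on the iris dataset (bundled with scikit-learn).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Understanding why the ensemble of trees beats a single tree (variance reduction through bagging) is exactly the kind of detail worth studying for these core algorithms.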
How much math is enough?
As a modern AI engineer, you should know a bit of math (just the important parts of calculus, probability, statistics, and optimization; no need for great detail), and this is mainly to build good intuitions about the algorithms you are using (and to pass interviews). Math also helps you read new research papers and blog posts about the latest trends in ML. You might not use any of it directly, but it keeps you updated, and new knowledge is always good to have.
Do you need a PhD or are these skills enough?
First of all, there is a dearth of quality PhDs who are actually doing work that contributes to the field, and I reckon most of them are already hired by the likes of Google. What this means is that if you are not in the top 5% of PhDs, you will most likely work as an engineer anyway. Secondly, PhDs are experts in one field and often lack the breadth of knowledge required to create a working application. So if you think you are not going to be a top-notch PhD, then it is better to work on the other skills required to create a world-beating AI product.
Right, so that's about it for now. I will probably do some tutorials covering the use-cases mentioned above, but that will take some time. If any of you have questions, feel free to ask me via email or comment on the post.