Source: Deep Learning on Medium
Why should you care?
As machine learning matures, the need for infrastructure that supports running these workflows grows with it. In a large enterprise setting there are, on average, at least 200+ data scientists and DL/ML engineers running model training and inferencing jobs. Ensuring that these users get easy access to the hardware and software they need to train their models is imperative. This may sound like an easy task; I'm here to tell you it is not.
There are multiple challenges. For example:
- Abstracting away job execution and giving data scientists easy access to hardware is important.
- Different users have different hardware and software requirements. Some users train their models with TensorFlow, some with PyTorch, and others with a framework built in-house.
- One team runs large-scale XLNet/BERT training on, say, sixteen V100s, while another fine-tunes a pre-trained EfficientNet on, say, two T4 GPUs.
- How to manage and maintain these resources?
- Who actually controls these hardware resources? Is it the data scientists or the DevOps team?
- Who sets the priority of the jobs? (A job is anything you want to run on the hardware, for example training or inferencing.)
- How to maintain the sanity of resource allocations? We all know everybody wants their model to be trained first.
- How to support data scientists who don't know how to use their allocated resources fully? For example, their GPU utilization across 32 GPUs is below 25%.
- How to handle security issues — for example, ensuring only desired users get access?
- How to ensure all the accelerators and nodes are used effectively without incurring a significant performance penalty?
- Who helps data scientists profile their slow applications? This is sometimes an issue because data scientists are not necessarily building the most efficient model training pipelines. In other words, data scientists are not software engineers.
- Some jobs form a chain of dependencies: in a parameter-server/worker architecture, for example, some workers must be launched before others. How do you schedule these while maintaining the queue?
- Who installs and maintains new tools, for example KubeFlow, Polyaxon, Seldon, Pachyderm, Domino Data Lab, Argo, etc.? New ones seem to appear every month.
- If these problems aren't enough to cause headaches, think about installing and maintaining hardware-level drivers (for example, CUDA drivers), new packages, etc.
- Some errors data scientists run into are very ML-library specific, for example NCCL ring-topology issues.
- Deploying and supporting inferencing jobs is a beast of its own that needs its own infrastructure team.
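To make the utilization point above concrete, here is a minimal sketch of the kind of check an infrastructure engineer might run to flag idle GPUs. It assumes `nvidia-smi` is available on the node; the parsing function is shown against a captured sample of its CSV output so the snippet is self-contained, and the `underutilized_gpus` helper and the 25% threshold are illustrative choices, not part of any particular tool.

```python
# Sketch: flag underutilized GPUs by parsing nvidia-smi CSV output.
# In practice you would capture the text from:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# Here we use a hypothetical captured sample so the snippet runs anywhere.

SAMPLE = """\
0, 96
1, 12
2, 8
3, 91
"""

def underutilized_gpus(csv_text, threshold=25):
    """Return (gpu_index, utilization_pct) pairs below the given threshold."""
    flagged = []
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        if int(util) < threshold:
            flagged.append((int(index), int(util)))
    return flagged

# GPUs 1 and 2 fall below the 25% threshold in this sample.
print(underutilized_gpus(SAMPLE))
```

Run periodically across a fleet, a report like this is often the first signal that a team's 32-GPU allocation is mostly sitting idle and should be resized or investigated.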
A typical DevOps engineer doesn't necessarily have the expertise to support very ML-library-specific issues. On the other hand, data scientists themselves aren't experts in managing large-scale clusters, and handing off a large cluster to data scientists isn't a good idea either. So who does the above work? At this point you might be thinking, "hmm… this sounds a lot like a SysAdmin role," and in a way, it is! However, since the role also requires knowledge of ML concepts, it calls for someone who is a SysAdmin + ML Engineer: enter the ML/DL Infrastructure Engineer. And large on-prem clusters usually bring HPC expertise into the mix as well.
DL Infrastructure Engineers are responsible for managing and maintaining these clusters. This becomes especially true when you move from the cloud to on-prem.
In the near future, we'll see an infrastructure branch that caters to data scientists, staffed by people who understand both ML and DevOps concepts. That way, the scientists can do their science and the infrastructure engineers their DL infrastructure 😉