Step 4: Building and Integrating the AI Microservices
How you build out the AI services will largely depend on the model training and serving architecture you choose, the best practices you follow, and the integrations and optimizations you put in place.
For training, moving your data and your training code to the cloud is an obvious choice to reduce the wait time that could otherwise cost your team hours per day. For deploying your models for inferencing, you should split model inference and business logic into separate micro-services so that you can scale inferencing as needed.
a) Reducing wait-time for training: When I talk to Data Scientists and ask what they spend most of their training time on that I could optimize for them, they largely touch on two things: moving data from their data sources in the cloud to their desktop for development takes a long time, and they have to wait for the model to train before they can try different hyper-parameters or catch a bug in their code. Pay close attention to these pain points and you'll realize that during these activities the scientist is blocked from applying their expertise to actually solving the problem.
ML Infrastructure-as-a-service has come a long way in the last couple of years. Data storage in the cloud is cheap and plugs straight into the cloud machines training the model. Recently, these cloud providers have also started offering IDEs (e.g. Jupyter Notebooks) in the same environment as your data, allowing you to move your code from your desktop nearer to your data storage. Along with the proximity advantage, training your models in the cloud offers the availability of high-performance compute (GPUs or ASICs designed specifically for Deep Learning) that can scale along with your data (thanks to recent advances in distributed learning techniques). I would highly recommend trying these out and structuring your model-training architecture and workflow around them to reduce the time wasted in the efforts above.
b) Split the micro-services for inference and your business-logic: Based on my experience deploying multiple models in production over the past few years, the best way to structure models for deployment is to separate the inferencing from the business logic in the back-end. That is, you want one micro-service that takes the model's input and predicts the output, while the logic for moving data around from the database, business requirements like access control, wrapping the model's input and output into the user experience, etc. lives in a separate micro-service. This also lets you scale only the inference micro-service when you have access to hardware-optimized compute resources dedicated to inferencing (e.g. ASICs like TPUs, etc.).
One way to build the model inference micro-service is by using containerization (e.g. creating your own Docker container, or using TF-Serving, Seldon, etc.) to serve the model. The container orchestration layer can then be tasked with auto-scaling the model based on compute load or the volume of incoming requests. If, in framing the ML problem, you've decided to break the model down into parts, you may want to create multiple inferencing micro-services. Make sure you account for the amount of data (e.g. images) to be transferred between the models, otherwise your model prediction would become network bound. Of course, if you're using a cloud-based MLaaS service, the serving architecture is taken care of by the service.
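To make the split concrete, here is a minimal sketch of the separation, with the model replaced by a stub and both services as plain classes; in practice each class would live in its own container (the inference side behind TF-Serving or a Docker endpoint) and they would talk over HTTP or a queue. All names here are illustrative.

```python
class InferenceService:
    """Owns only the model: takes model input, returns a prediction."""
    def __init__(self, model):
        self._model = model

    def predict(self, features):
        return self._model(features)


class BusinessService:
    """Owns everything else: access control, data wrangling, UX wrapping."""
    def __init__(self, inference_client, allowed_users):
        self._inference = inference_client
        self._allowed = allowed_users

    def handle_request(self, user, raw_input):
        if user not in self._allowed:              # access control stays here
            raise PermissionError(user)
        features = [float(x) for x in raw_input]   # input wrangling stays here
        score = self._inference.predict(features)  # only this call scales out
        return {"user": user, "score": score}      # UX wrapping stays here


# Stub model standing in for a real serving container
stub_model = lambda xs: sum(xs) / len(xs)
svc = BusinessService(InferenceService(stub_model), allowed_users={"alice"})
print(svc.handle_request("alice", ["1", "2", "3"]))
```

Because the business service only holds a client to the inference service, you can later swap the stub for a remote call without touching the access-control or data-handling code.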
Best-practices for Building Models
Having a peer-review process with a “guild” of experts of the trade helps make sure your approach is sound, optimized and keeps up with fast-changing technology. Model, hyperparameter, and data versioning help you improve model performance systematically. Establishing model and data CI/CD ensures your models are deployed more frequently and reliably into production.
I also wanted to share some follow-up best practices for building the models. I apologize if this gets a little technical, but this is the right place for it, since we're talking about building the model services, their architecture, and integration.
a) Peer Review and Guilds: With the recent surge in ML research, powered by the large interest in Deep Learning, the state-of-the-art methods, algorithms, and architectures get updated about twice per year. Your solution could be suboptimal if it is not built with these rapidly changing technologies in mind. Engineering teams have long had a pull-request review process in place before committing code into a master branch. The ML team should employ a similar practice to check their code into main branches or to promote a champion model into production.
A recent trend in developing strong teams in the workplace has been to introduce "guilds": groups of practitioners across teams who share a core competency within the organization (e.g. Deep Learning). If you have a functional organization, think of these as "experts" on a particular topic within a function (e.g. "Computer Vision in Healthcare"), where you can draw from people's interests, previous work experience, and current skills. Having a periodic peer review of the ML solution is key to making sure you have the right operational support in place.
b) Model Architecture, Hyperparameter, and Data versioning: Data Scientists frequently try different model architectures based on their hypotheses about how a model could be built for the data. They also change hyper-parameters for model training. One of the biggest frustrations for a Data Scientist building a model is making changes and then being unable to go back to a previous, better-performing hyperparameter set or model architecture. The same goes for training data that grows over time. A good practice here is to at least maintain a "champion model" and keep track of that model's architecture and hyperparameters using a GitHub-like versioning system. Some recent services like Weights & Biases, MissingLink, etc. have tried to help scientists with this problem by tracking previous versions of the models and data.
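A minimal sketch of what such tracking buys you, assuming runs are recorded as plain JSON files (services like Weights & Biases automate and extend this; the file layout and metric names are illustrative):

```python
import json
import os
import tempfile

def record_run(run_dir, hyperparams, metric):
    """Persist a training run's config and score so it can be recovered later."""
    os.makedirs(run_dir, exist_ok=True)
    path = os.path.join(run_dir, "run.json")
    with open(path, "w") as f:
        json.dump({"hyperparams": hyperparams, "metric": metric}, f)
    return path

def best_run(run_paths):
    """Return the record of the highest-scoring run: your current champion."""
    runs = []
    for p in run_paths:
        with open(p) as f:
            runs.append(json.load(f))
    return max(runs, key=lambda r: r["metric"])

base = tempfile.mkdtemp()
paths = [
    record_run(os.path.join(base, "run1"), {"lr": 0.1}, metric=0.80),
    record_run(os.path.join(base, "run2"), {"lr": 0.01}, metric=0.86),
]
print(best_run(paths)["hyperparams"])  # the configuration you can roll back to
```

The point is simply that every run's architecture and hyperparameters are written down at training time, so "going back to the better version" is a lookup instead of guesswork.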
c) Automated Pipelines: Engineering teams have streamlined the build and deploy process by creating Continuous Integration and Continuous Delivery (CI/CD) pipelines that ensure rapid and reliable code deployment to production. ML model training and deployment needs similar process improvement. At the least, the ML model deployment process should be scripted so that if a new "contender" model beats the "champion" model on your optimizing and satisficing metrics on test data, the model can be auto-deployed. For example, deployment may include building the serving container, pushing the container to the container repository, and deploying the container to your managed instances. On the training side, the CI/CD process could include getting new data, getting it labeled, training the model with automated hyperparameter selection, and then storing the best-performing model in the shared volume for deployments.
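The deployment gate in that script can be as small as a single function. A hedged sketch, where the metric names and thresholds are illustrative placeholders (your pipeline would follow a `True` result with the container build-push-deploy steps above):

```python
def should_deploy(contender, champion, satisficing_threshold=0.95):
    """Deploy only if the contender meets the satisficing metric (here,
    recall) and beats the champion on the optimizing metric (here,
    accuracy), both measured on held-out test data."""
    if contender["recall"] < satisficing_threshold:
        return False
    return contender["accuracy"] > champion["accuracy"]

champion = {"accuracy": 0.91, "recall": 0.97}
contender = {"accuracy": 0.93, "recall": 0.96}

if should_deploy(contender, champion):
    # placeholder for: build serving container, push to registry, roll out
    print("deploying contender")
```

Keeping the gate as an explicit, testable function makes the promotion rule auditable, instead of being buried in a shell script.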
As your team builds out the model and iterates, these best-practices will make sure that your model is deployed to production more frequently and reliably.
Integrating your model with the back-end and front-end services using a combination of synchronous and asynchronous messaging platforms, building end-to-end model training workflows, having Human-in-the-loop failure preventions, and performance optimizations are all techniques for successful integration of your model with the services supporting it.
I also wanted to touch briefly on some integrations and optimizations. While the implementation timeframe for these might vary from immediate to long term, it’s good to know about them beforehand in order to know where you might want to take your engineering efforts in the future.
a) Service integrations: Like any other software system, your front-end will typically talk to your back-end (which has the business logic, database, auth, etc.), and your back-end will then talk to your model. If the user interaction is waiting for a model prediction (e.g. our email text prediction), the interaction between your front-end and back-end will be synchronous (e.g. HTTP/S). Your model serving itself can then be integrated with the back-end either synchronously or asynchronously. If your model is served via an HTTP endpoint (i.e. synchronously), make sure your back-end server uses non-blocking IO (since the prediction time might be large). If you want your front-end to hit the model directly (in case there is no significant business logic), I recommend using a gateway like Ambassador, which can handle authentication. Finally, if your predictions are offline (i.e. there is no UI waiting for a prediction), you might want to think about integrating your model using a message-based architecture (e.g. message queues, Kafka, etc.). I'll talk more about this in the optimization bit below.
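To illustrate why non-blocking IO matters when prediction latency is high, here is a sketch using Python's asyncio, where `call_model` stands in for an HTTP request to the model endpoint (in a real back-end it would be an aiohttp or httpx call); the 0.1-second sleep simulates a slow prediction:

```python
import asyncio
import time

async def call_model(request_id):
    await asyncio.sleep(0.1)  # simulated slow model prediction over HTTP
    return f"prediction-{request_id}"

async def handle_requests(ids):
    # Non-blocking: the event loop overlaps the slow predictions
    # instead of tying up one thread per in-flight request.
    return await asyncio.gather(*(call_model(i) for i in ids))

start = time.perf_counter()
results = asyncio.run(handle_requests(range(5)))
elapsed = time.perf_counter() - start
print(results[0], f"{elapsed:.2f}s")  # roughly 0.1s total, not 0.5s
```

With blocking IO, five 100 ms predictions would occupy a worker for half a second each; with non-blocking IO they complete in roughly the time of one.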
b) Workflow integrations: In the automated-pipeline section above, we touched on how CI/CD can be used to train your model when new data is available and automatically deploy your model when a new "champion" is available. One part that might be important is integrating your data annotation steps, data clean-up, quality checks, A/B UX tests, etc. However, these need a more complicated set-up and are probably achievable only a few months after launching in production (unless, of course, you're using a managed ML-as-a-service).
c) HITL back-ups: Recently, I visited the Amazon Go store in downtown San Francisco and was surprised at the ease of the check-out process. The first time I went to the store, my experience was a bit clumsy as I learned the "interaction", but I can easily imagine checkout being a super-easy process for grab-and-go lunch burritos once you've been to the store a couple of times. An important thing to note is that Amazon Go did not seem to wait for the ML system to be perfect. Instead, they successfully use a Human-in-the-Loop solution: when the system can't decide what I took, it sends the case to a human labeler.
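The core of such a fallback can be a simple confidence gate. A sketch, where the threshold value and the queue are illustrative assumptions (in production the queue would feed a labeling tool rather than a Python list):

```python
def route_prediction(label, confidence, human_queue, threshold=0.9):
    """Return the model's answer when it is confident; otherwise defer
    the case to a human labeler and return nothing for now."""
    if confidence < threshold:
        human_queue.append((label, confidence))  # hand off to a human
        return None
    return label

queue = []
print(route_prediction("burrito", 0.97, queue))  # confident: auto-resolved
print(route_prediction("salad", 0.55, queue))    # uncertain: sent to a human
print(len(queue))                                # one case awaiting review
```

The human's answers can also be fed back as labeled training data, so the share of deferred cases shrinks over time.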
d) Performance Optimizations: One interesting architectural optimization for really advanced cases is 'mini-batch' predictions, which GPUs are optimized for. While your front-end talks to the back-end synchronously, the back-end puts the prediction requests on a message queue. The model service pulls from the queue in batches, with a wait time as low as tens of milliseconds, at a batch size matching what the model can process at once (e.g. 8 to 16 images). It writes the prediction output back asynchronously, and the back-end can then serve the front-end that's waiting on the requests. This architecture works because ML inferencing code is optimized for batches: predicting on a mini-batch of 8 images is faster than 8 requests for individual predictions. One last piece of optimization may be hardware optimizations (e.g. using TPUs, ASICs, or converting floating-point to fixed-integer calculations, etc.).
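The batching worker described above can be sketched with Python's standard `queue` module, where the model is a stub and the batch size and wait time are the illustrative values from the text (a production system would use a message broker and a real GPU-backed model):

```python
import queue

def collect_batch(q, batch_size=8, wait_s=0.01):
    """Drain up to batch_size requests, waiting at most wait_s (tens of
    milliseconds) for stragglers before running one batched prediction."""
    batch = [q.get()]  # block until the first request arrives
    while len(batch) < batch_size:
        try:
            batch.append(q.get(timeout=wait_s))  # brief wait for more
        except queue.Empty:
            break  # timeout expired: run with what we have
    return batch

def batched_predict(inputs):
    # stub: a real model would run one GPU forward pass over the batch
    return [x * 2 for x in inputs]

q = queue.Queue()
for x in range(5):
    q.put(x)

batch = collect_batch(q)
print(batched_predict(batch))  # one prediction call for all five requests
```

The trade-off is explicit: each request pays up to `wait_s` of extra latency in exchange for amortizing the model call across the whole batch.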