Security and Privacy in Artificial Intelligence and Machine Learning — Part 1

There is so much excitement and buzz about Machine Learning (ML) and Artificial Intelligence (AI) and the use cases they enable for almost all walks of life these days. In so many areas, we now have mature solutions in place and, now and again, we hear about interesting and insightful applications of ML & AI towards hitherto unexplored problems that make us go “Oh, wow!”.

As the deployed scenarios become richer and more widespread, we have to make sure that security and privacy aspects of AI & ML are adequately thought through. In order to do so, it is important to understand the different ways security and privacy cross paths with the fields of AI & ML. In this series, we will explore those intersections one ‘facet’ at a time with a goal to understand the nature and extent of risks involved and what protections are available or being explored for each.

While I have several years of work experience in the information security and privacy space, I am relatively new to AI & ML. (We have just started using ML techniques in our cloud security and DevOps projects at work.) So this series is written from the viewpoint of a security person reasoning systematically about security and privacy in the contexts of AI & ML. The early parts of the series will focus on somewhat narrow technical aspects. However, over the course of the series, we will also cover the broader social/societal concerns. As the series develops, I would love to hear from experts in both fields about what they think of the way I have organized the presentation.

In this first part, I will start with the “let us get the basics right” message from a cybersecurity standpoint without getting much into the internals of AI & ML. In the parts to follow, we will start looking closer (inside the box) at the AI & ML components.

Traditional Cybersecurity of AI & ML Solutions

Like any new field, AI & ML bring a bunch of interesting artifacts, associated nuances and caveats with them that pose a challenge from a security standpoint.

The two images below (from the Microsoft Azure documentation) show the typical artifacts and stakeholder groups involved at various stages of an end-to-end AI/ML workflow.

The End-2-End Workflow — Lots of systems, data formats, tools, stakeholders, etc.
The diaspora of development tools, platforms, frameworks, data handling, etc.

Diagrams like these highlight the following characteristics of the nature and types of security challenges we are faced with:

  • There is a huge ecosystem of tools, languages and frameworks (some old, some new), with functions that include data ingestion, data pre-processing, modeling, model evaluation, integration, visualization, and packaging & deployment.
  • Multiple stakeholder groups, ranging from traditional software engineers to machine learning engineers, data scientists and statisticians, are involved in any project.
  • The complete working solution usually involves multiple systems coupled together at various interfaces.

From a security standpoint, that is a lot of new and unknown things: new tools and frameworks, new people (stakeholder groups) and new systems or combinations of systems. Setting aside the specifics/internals of AI & ML for the moment, even traditional cybersecurity tends to be hard to do well in such a context, for three reasons. First, new tools and frameworks often take time to become ‘secure enough’. Second, new stakeholder groups may not always have a uniform understanding or awareness of security. Third, solutions involving many systems and interfaces have a history of interesting security failures at the boundaries/interfaces.

Now let us look at the ‘data security’ side a bit. The following characteristics of data security challenges stand out in the context of AI & ML:

  • In most solutions, huge (HUGE!) amounts of business data are involved. We are talking millions or even billions of records. And, depending upon the problem domain, these can have a tremendous amount of value (e.g., past transactions at a brokerage) or sensitivity (e.g., medical records from a chain of hospitals).
  • Unlike traditional software engineering, one typically can’t use ‘dummy data’ for testing/pre-production stages in AI & ML. (The whole point is ‘learning from data’!) So, at all points in the ecosystem, various stakeholders and various systems are handling precious real data. (E.g., the team of data scientists collaborating to develop a model needs access, during experimentation, to much of the same data that you have painstakingly protected for years on production systems. As a result, business data ends up on systems it typically never reached in the past.) The many iterations and requests for a variety of data can disrupt any existing data governance you may have in place.
  • Given that as a field AI & ML have ventured into all walks of life, we are looking at hundreds of data and record formats (and corresponding readers, encoders and decoders). We have had a colorful history of security bugs just in the handful of the more popular encoding formats. Consider what lies ahead when we account for the multiplier effect.
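
To make the last point a little more concrete, here is a minimal Python sketch of one modest defence against decoder bugs: treat every incoming record as untrusted, cap its size, parse it with a simple and well-exercised decoder, and check the result against an explicit list of expected fields before it goes any further. The field names, the size limit and the `parse_record` helper are all hypothetical and only for illustration; a real pipeline would have its own formats and schemas.

```python
import json

# Hypothetical size cap for a single record (an assumption for this sketch).
MAX_RECORD_BYTES = 64 * 1024

# Hypothetical schema: field name -> required Python type.
EXPECTED_FIELDS = {"record_id": str, "amount": float, "currency": str}


class RejectedRecord(ValueError):
    """Raised when an incoming record fails validation."""


def parse_record(raw: bytes) -> dict:
    """Decode one untrusted JSON record and validate it against the schema."""
    if len(raw) > MAX_RECORD_BYTES:
        raise RejectedRecord("record too large")
    try:
        data = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        raise RejectedRecord("not a well-formed UTF-8 JSON document") from exc
    if not isinstance(data, dict):
        raise RejectedRecord("top-level value must be an object")
    # Reject both missing and unexpected fields instead of silently ignoring them.
    if set(data) != set(EXPECTED_FIELDS):
        raise RejectedRecord("unexpected or missing fields")
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise RejectedRecord(f"field {name!r} has the wrong type")
    return data
```

A check like this does not remove bugs from the decoders themselves. It only narrows what is allowed to reach the rest of the pipeline, and it would have to be repeated for each format the solution accepts.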

While they are indeed daunting, many of the above challenges should not be new for security practitioners. We have been through technology paradigm shifts in the past. We have seen how the foundational principles of good security do not change and how it is all about applying them after carefully understanding the nuances and unique requirements of any new context/ecosystem. At the end of the day, it should all boil down to doing the basics right — viz.,

  • Ensure that all team members/stakeholders have a good basic understanding of security and privacy — things like data classification, data protection techniques, authentication/authorization, privacy principles, applicable regulatory requirements, etc. The goal should be to ensure that all stakeholders have role-appropriate understanding of security and privacy and everyone uses the same terminology and knows and understands relevant policy and standards.
  • Have a good data governance structure in place. Ownership and accountability should be clear for various stakeholders as data changes hands at different stages of each workflow. This is particularly important given the wide circulation of data that will be inevitable in ML & AI projects.
  • Perform diligent threat modeling of solutions, both at the component level and from an end-to-end perspective. This will ensure that security is built into the design and that applicable security requirements are met at every point in the end-to-end system (where valuable data is processed, stored or transmitted).
  • When threat modeling, particular attention should be paid to boundaries and interfaces between the different sub-systems. Assumptions made by either side at those interfaces should be clearly documented and verified.
  • Also, because production data is involved everywhere, be sure to exhaustively cover all workflows in the threat models, starting from the earliest experiments and proofs of concept to the fully operational system as it would be deployed in production.
  • Ensure that good programming practices are followed during implementation and that, depending on the technologies used, the relevant classes of vulnerabilities are mitigated (e.g., if some part of the solution is web-based, then authentication tokens/cookies must be protected from attacks such as XSS, CSRF, etc.).
  • Verify, through a combination of feature security testing and penetration assessments, that all threats/risks identified during threat modeling and considered important enough to address have actually been fixed.
  • Make sure that native security models of different frameworks and sub-systems are adequately understood and rationalized to achieve uniform security across the system. This is particularly important because of the multiple, possibly disparate components and frameworks that will get ‘glued’ together to make up the end to end solution.
  • Be careful about software components that are used by various stakeholders in different stages of the project or imported into the different ‘pipelines’. Use verified and signed components where possible and, where that is not an option, consider other factors such as developer reputation, extent of use, trustworthiness of the repository, comments and reviews, history of security issues/vulnerabilities, etc. These should help assess if the component may be good enough from a security quality perspective.
  • Think about the security of your deployment (CI/CD) pipeline. Do you have good control over who can change the build/release environment? Are you adequately protecting any secrets/connection strings that are required for deployment? Is your production setup locked down enough so that people cannot make ad hoc configuration changes?
  • Exercise good monitoring and security hygiene. Make sure all software components are at their latest security patch level, conduct periodic access reviews, rotate keys/certificates, etc.
  • Lastly, do have a good incident response plan in place so you can deal with a calamity if one does happen.
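
As a small illustration of the point above about imported software components, here is a minimal Python sketch that checks a downloaded artifact against a SHA-256 digest recorded when the component was reviewed. The `is_trusted` and `sha256_of` helpers and the idea of a `pinned` dictionary are assumptions made for this example, not part of any particular tool. Digest pinning of this kind complements signature verification where signed components are available; it does not replace it.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted(path: Path, pinned: dict[str, str]) -> bool:
    """Check a file against a hypothetical allow-list of pinned digests.

    `pinned` maps a file name to the hex SHA-256 digest recorded at review time.
    Files that are not on the list are treated as untrusted.
    """
    expected = pinned.get(path.name)
    if expected is None:
        return False
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of(path), expected.lower())
```

A pipeline step could call `is_trusted(artifact_path, pinned)` before installing or loading a component and stop the run when it returns `False`.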

Ok, so there we are! To a security professional, none of the above should sound new/novel. You must have done some or all of these in various problem contexts in the past. So this first aspect of security and privacy in AI & ML is very much about applying all that expertise in a new context while partnering with all the stakeholders involved.

But realize that we have approached the AI & ML components of the system as a ‘black box’ so far. Things start getting much more interesting when we look closer into that box. That is exactly what we will venture into in the next part of the series. Looking forward to seeing you again!

Source: Deep Learning on Medium