AI Strategy in The Age of Vertical Federated Learning and Data Sharing


As you may know, data sharing can be a challenge for large-scale machine learning (ML). Moreover, a lack of data is often an issue in ML projects. Federated Learning (FL) tries to address siloed and unstructured data, data scarcity, privacy, and the regulation of data sharing, as well as incentive models for data alliances.

Recently, I had the opportunity to oversee the implementation of vertical federated learning based on a “data sharing alliance” with some of our competitors.

Our strategic need to build a data sharing alliance with our competitors can be explained by two reasons. First, we are limited by our own data in many projects. Secondly, foreign tech companies will soon need to meet new requirements in the European Union regarding AI and sharing data with smaller rivals.

In this article, I will share my experience in identifying specific use cases and leveraging Federated Learning to enable us and our competitors to train machine learning models without sharing any raw data while creating new business models.

Today’s reality

A majority of companies share the same centralized approach to machine learning. Concretely, the process of developing accurate models usually starts by collecting as much data as possible from multiple sources (operational data, legacy systems, social media, CRM, IoT data, …) and then developing machine learning models on the collected, pooled data.

This approach comes with several challenges that diminish the potential of AI systems. In practice, only a fraction of the potentially available data is accessible, which limits how accurate machine learning models can become.

If only we could increase our ability to share data between companies, we could unlock new use cases or improve the accuracy of existing ML solutions. Some of you might be thinking about open data when it comes to this issue; however, such data is often of limited quality, unstructured, or inconsistent. Other options include synthetic data or data augmentation techniques.

Reasons limiting data sharing

In most cases, organizations prefer to keep strict control over their data rather than partner or trade with third parties, let alone with competitors. Even though they might occasionally contract with third parties to speed up development, data partnerships or alliances are still very rare.

Why we need more data sharing

Too often the focus is on how an organization is able to leverage its own data, while the biggest opportunities lie in merging multiple datasets, both internal and external.

For instance, the ability to leverage your competitors’ data could be a game-changer. Indeed, making this data available for specified purposes can unlock value for several organizations and the end user. Moreover, in an AI context, joint-effort collaboration with competitors can improve internal machine learning models. This is why data sharing between competitors is crucial.

As of today, our goal is to imagine what we could create using more than just the data at our disposal. This new approach requires us to imagine new business models, use cases, partners, and frameworks.

Depending on the use case, we urgently need more data to train our models. For instance, I’m currently working in the healthcare field. As you can imagine, data acquisition in this specific industry is extremely difficult. As a result, we tend to work on small datasets that have been gathered under strict governance. A data sharing alliance using a Vertical Federated Learning architecture would help us a lot.

We envision a future in which different companies build models together without disclosing their data and share the benefits: more accurate machine learning models and new use cases that improve internal processes or the customer experience. Our idea is not only to use shared data to improve existing applications but also to “co-create” applications that would otherwise not be possible.

This shift fits perfectly the nature of Machine Learning. As most of you know, the machine learning field is by nature a collaborative one. As such, I am not surprised to see the recent shift made by some large tech firms when it comes to data sharing.

We also believe that FL will change the power dynamics in value chains, which could become less dependent on individual data monopolies. Companies that were not monetizing their data gain a new way to generate revenue.

In the end, the best strategy for constructing a vertical federated learning alliance depends on a number of factors, including the use case, the partners involved, the revenue model, and data governance.

Federated Learning

I will not go into too much detail because other articles have already covered the technical aspects of Federated Learning very well. As mentioned before, the main idea of FL is to decentralize the machine learning process so that you can respect privacy while still getting the statistical power of additional data.
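To make the decentralization idea concrete, here is a minimal sketch in plain NumPy, with invented party data and a simple linear model (not any particular production setup): each participant trains locally, and only model weights, never raw rows, are shared with a coordinator.

    import numpy as np

    def local_update(weights, X, y, lr=0.1, epochs=5):
        # One participant trains on its private data; only the weights leave the site.
        w = weights.copy()
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
            w -= lr * grad
        return w

    def federated_average(updates, sizes):
        # The coordinator combines local models, weighted by each participant's data size.
        total = sum(sizes)
        return sum(w * (n / total) for w, n in zip(updates, sizes))

    # Three hypothetical parties; each keeps (X_i, y_i) on-premise and never shares raw rows.
    rng = np.random.default_rng(0)
    parties = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

    global_w = np.zeros(3)
    for _ in range(10):   # communication rounds
        updates = [local_update(global_w, X, y) for X, y in parties]
        global_w = federated_average(updates, [len(y) for _, y in parties])

This illustrates the horizontal (sample-split) flavor of FL; the vertical, feature-split setting discussed below follows the same principle but also requires encrypted coordination between parties.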

Vertical federated learning can be seen as “a B2B model, where multiple organizations join an alliance in building and using a shared ML model. The model is built while ensuring that no local data leaves any sites and maintaining the model performance according to business requirements.” (1) FL ensures full data protection along with rewards to companies for sharing their learning.

Indeed, the federation can allocate part of the revenue to data owners as an incentive. In reality, it can be highly complicated to build a revenue model for all participants/competitors. We have built a payoff-sharing scheme designed specifically for vertical federated learning.

Obviously, all the participants benefit from the global model in their local applications. Moreover, “Federated Learning also benefits cross-border data models, where, in many cases, legislation requires the data to be stored in a particular jurisdiction, and cross-institutional partnerships.” (2)

I expect some startups and consultants to specialize in FL and develop frameworks to help organizations select use cases, identify partners, and agree with them on the ideal setup for their data collaborations (revenue model, data governance, …).

It is key to mention the difference between Vertical Federated Learning and Federated Transfer Learning. The first refers to the setting where parties have many overlapping instances (samples) but few overlapping features. Two companies in different industries, for instance a bank and a retailer, may serve roughly the same customers while each owns different datasets/features about them. In that case, vertical Federated Learning merges the features to create a more powerful feature space for machine learning tasks and uses homomorphic encryption to protect data privacy.
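A minimal sketch of this vertical setting, with a hypothetical bank and retailer and invented columns: both hold different features about an overlapping customer base, so the first step is to align records on a shared identifier. In a real deployment this alignment, and the training that follows, exchanges only encrypted values.

    import pandas as pd

    # Party A (a bank): financial features, keyed by a shared customer id.
    bank = pd.DataFrame({
        "customer_id": [101, 102, 103, 105],
        "savings":     [12000, 3500, 48000, 900],
        "has_loan":    [1, 0, 1, 0],
    })

    # Party B (a retailer): behavioural features for (mostly) the same customers.
    retail = pd.DataFrame({
        "customer_id":   [102, 103, 104, 105],
        "monthly_spend": [220, 890, 130, 60],
        "label":         [0, 1, 0, 0],   # e.g. bought the promoted product
    })

    # Vertical FL starts from the overlapping instances; each party keeps its
    # own feature columns locally and never ships them to the other side.
    shared_ids = set(bank["customer_id"]) & set(retail["customer_id"])
    bank_part = bank[bank["customer_id"].isin(shared_ids)]
    retail_part = retail[retail["customer_id"].isin(shared_ids)]

    # In a real system this intersection is computed privately (encrypted entity
    # alignment), and training then exchanges only encrypted intermediate values.
    print(sorted(shared_ids))   # [102, 103, 105]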

A known issue of FL is that an adversary can infer the local training data from the model updates sent by a device. To mitigate this issue, we rely on Homomorphic Encryption (HE). HE allows data to remain encrypted while it is being processed for training models.

Homomorphic encryption: a form of encryption that allows specific types of computations to be executed on ciphertexts, producing an encrypted result that is the ciphertext of the result of the same operations performed on the plaintext. (3)
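As a small illustration, the snippet below uses the python-paillier (phe) package; this choice of library is my own assumption here, and any additively homomorphic scheme would work similarly. It aggregates two encrypted model updates without ever decrypting them.

    from phe import paillier   # pip install phe (python-paillier)

    public_key, private_key = paillier.generate_paillier_keypair()

    # Each party encrypts its local update before sending it to the coordinator.
    update_a = public_key.encrypt(0.42)
    update_b = public_key.encrypt(-0.17)

    # The coordinator aggregates ciphertexts without ever seeing the plaintexts:
    # Paillier supports adding ciphertexts and multiplying them by plain scalars.
    encrypted_avg = (update_a + update_b) * 0.5

    # Only the holder of the private key can decrypt the aggregated result.
    print(private_key.decrypt(encrypted_avg))   # ~0.125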

Despite these positive elements, many challenges remain for FL: siloed and unstructured data, privacy, the regulation of data sharing, and incentive models for data alliances. I could also mention the maturity of FL-based solutions and internal support from C-level executives (sharing sensitive data with competitors is still hard to explain to some managers).

Use cases using FL

To help you better understand the concrete applications of Federated learning, I have selected three “mature use cases” below:

Smart retail
In this industry, the data collected mostly relates to customers’ purchasing power, personal preferences, and product information. In reality, these three data features are likely to be split between three different departments or companies:

  • Purchasing power can be related to the user’s bank savings
  • Personal preferences can come from social media
  • Product information can be collected on e-shops

According to several researchers from Webank and Hong Kong University (4), we are facing two issues. First, data barriers between these different organizations are difficult to break. As a result, data cannot be directly aggregated to train a model. Secondly, the data stored by the three parties are usually heterogeneous, and traditional ML models cannot directly work on heterogeneous data.

Federated learning and transfer learning bring a solution to these issues. Indeed, by leveraging the characteristics of FL, it becomes possible to build an ML model for the three parties without exporting the company data, which protects data privacy and data security. At the same time, we can use transfer learning to address the data heterogeneity problem and break through the limitations of traditional AI techniques.

Finance
Another interesting use case relates to the detection of multiparty borrowing, which happens when certain users borrow from one bank to repay a loan at another bank.

According to the same researchers from Webank (5), banks can use vertical federated learning to find these users without exposing their user lists. Indeed, we can leverage the encryption mechanism of federated learning, encrypt the user list on each side, and then take the intersection of the encrypted lists in the federation, as sketched below.
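Here is a toy sketch of that idea, with a keyed hash standing in for the real encryption step (production systems use proper private set intersection protocols): each bank blinds its user list locally, and only the blinded identifiers are compared.

    import hashlib
    import hmac

    SHARED_KEY = b"jointly-agreed-secret"   # hypothetical key negotiated by the alliance

    def blind(user_ids, key=SHARED_KEY):
        # Each bank maps its user ids to keyed hashes locally; raw ids never leave the bank.
        return {hmac.new(key, uid.encode(), hashlib.sha256).hexdigest() for uid in user_ids}

    bank_a_users = {"alice", "bob", "carol"}
    bank_b_users = {"bob", "dave", "carol"}

    # Only the blinded sets are exchanged; their intersection flags multiparty
    # borrowers without either bank disclosing its customer list in the clear.
    multiparty = blind(bank_a_users) & blind(bank_b_users)
    print(len(multiparty))   # 2 -> the users borrowing from both banks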

Smart healthcare
Healthcare is another domain that will benefit from vertical federated learning. Data such as medical reports are private and sensitive (for good reason!). In reality, medical datasets are difficult to collect and sit isolated in individual medical institutions and hospitals. From my experience, the insufficiency of data sources and the lack of labels often lead to weak ML models (low accuracy, overfitting, etc.), despite data augmentation techniques. Ideally, if all medical institutions and pharma groups formed a data alliance and shared their data to create a large medical dataset, the performance of the trained ML models would improve significantly.

I believe FL is a great option for production systems at scale, but for research projects, I am still skeptical about the overall efficiency (except in the medical field). Federated Learning does not apply to all Machine Learning projects.

The success of such an approach highly depends on your use case. Lastly, the complexity of debugging a FL system without being able to see the data is something not to be underestimated.

For more information on Federated Learning, I recommend the following links: