Source: Deep Learning on Medium
At Capgemini Invent, we have built a solution named Intelligent Contract Analysis for Regulation (ICARe) that specifically addresses all four of those challenges.
We are going to take you step by step through the main components of the ICARe pipeline described below, from the preprocessing steps to the Deep Learning classifier of legal clauses.
We gathered thousands of contracts. Among them: (i) internal Capgemini contracts, (ii) GDPR addenda to those contracts and Data Processing Agreements (DPAs), (iii) external contracts and templates web-scraped from the internet.
We also collected other legal documents completely unrelated to data privacy, in order to train our model to distinguish GDPR-relevant content from other legal topics (cf. the Relevance classifier part) in a contract base. As the vocabulary distribution of legal documents can be quite different from other content such as news articles or books, one should take care to collect data from the same domain: we don’t want our model to overfit to the legal jargon of specific contracts!
Optical Character Recognition (OCR)
The vast majority of contracts are unfortunately not digitally signed yet, so we have to deal with scanned documents, usually saved as PDF files. Scanned documents can get messy, but most high-quality OCR technologies now do a great job at extracting text, even from misaligned or skewed pages.
We also have to consider multi-language documents: in big master agreement contracts of hundreds of pages, we may find relevant information expressed in annexes in a language other than the one used in the rest of the document. In this particular setup, it’s wise not to assume the whole document is written in a single language!
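As a toy illustration of per-page language detection, here is a minimal stopword-overlap heuristic in Python. A production pipeline would rely on a dedicated library (such as langdetect or fastText’s language identification model) rather than this sketch, and the stopword lists below are purely illustrative:

```python
# Toy per-page language detector based on stopword overlap.
# Illustrative sketch only: real pipelines should use a proper
# language-identification library; stopword lists here are tiny samples.

STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "shall", "agreement", "party", "parties"},
    "fr": {"le", "la", "les", "et", "de", "dans", "contrat", "partie", "parties"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopwords overlap most with the text."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

def languages_per_page(pages: list[str]) -> list[str]:
    # A contract may switch languages between the body and its annexes,
    # so we detect the language page by page rather than once per document.
    return [guess_language(page) for page in pages]
```

Running detection per page (rather than once per document) is what lets the annexes of a master agreement be routed to the right language-specific models downstream.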
Relevance classification, or “How do we de-noise a contract?”
Contracts are noisy! Most of the text we find in a contract brings no real information or value to the task at hand. Therefore, we want to filter out the irrelevant content, i.e. the sections that are not sensitive to GDPR.
There are a couple of approaches to do so: while we could experiment with unsupervised approaches such as Topic Modeling, we can also train a binary text classifier on a corpus of both GDPR and non-GDPR documents to predict whether a given page or section is relevant for a GDPR review.
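To make the idea concrete, here is a minimal sketch of such a binary relevance classifier using TF-IDF features and a logistic regression. The sections and labels are illustrative stand-ins for a real GDPR / non-GDPR corpus, not our actual training data:

```python
# Minimal relevance-classifier sketch: TF-IDF + logistic regression.
# The snippets and labels below are illustrative, not real training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sections = [
    "The processor shall process personal data only on documented instructions.",
    "Data subjects may exercise their right of access and erasure at any time.",
    "Personal data transfers outside the EU require appropriate safeguards.",
    "Invoices are payable within thirty days of the delivery of the goods.",
    "The seller warrants that the equipment conforms to the specifications.",
    "Late payment incurs interest at the statutory commercial rate.",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = GDPR-relevant, 0 = irrelevant

relevance_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance_clf.fit(sections, labels)

# Keep only the sections predicted relevant for the downstream clause classifier.
preds = relevance_clf.predict(sections)
relevant = [s for s, p in zip(sections, preds) if p == 1]
```

In practice the classifier is trained on whole documents or pages from both corpora, and its output acts as a filter feeding only the GDPR-sensitive sections into the next stage.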
We can now tell exactly which contracts contain sensitive content and need to be reviewed!
But we are not finished yet! Once we have gotten rid of the noise in our documents, we are left with the core of our contracts, the part that is actually sensitive to the regulation. This is the set of clauses our legal experts want to have a closer look at.
The classifier should predict for each clause the covered topic(s).
Consider the clause below:
“Company X is a provider of enterprise cloud computing solutions which processes personal data upon the instruction of the data exporter Company Y in accordance with the terms of the Agreement.”
Here we want our model to predict that this clause is actually talking about defining the roles of each of the parties involved.
We built an ad-hoc deep learning architecture that captures both the semantics of the paragraph and the surrounding context to categorize each clause into a set of topics. Our model is trained on more than 4,000 clauses annotated by our GDPR experts, in both French and English.
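For illustration only, here is a heavily simplified, classical stand-in for such a multi-label clause classifier: a one-vs-rest linear model over TF-IDF features. Our actual system is a deep architecture that also encodes the surrounding context, and the clauses and topic names below are made up:

```python
# Simplified multi-label clause classifier: one-vs-rest logistic regression
# over TF-IDF features. A stand-in for the deep architecture described in the
# article; clauses and topic names below are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

clauses = [
    "Company X processes personal data upon the instruction of the data exporter.",
    "The processor shall notify the controller of any personal data breach.",
    "Personal data shall be erased at the end of the provision of services.",
    "The data importer acts as processor on behalf of the data exporter.",
]
topics = [
    {"roles_of_parties"},
    {"breach_notification"},
    {"data_retention"},
    {"roles_of_parties"},
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(topics)  # one binary column per topic

clause_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression()),
)
clause_clf.fit(clauses, y)

def predict_topics(clause: str) -> set:
    """Return the set of predicted topics for a clause (may be empty)."""
    pred = clause_clf.predict([clause])[0]
    return set(mlb.inverse_transform(pred.reshape(1, -1))[0])
```

The one-vs-rest decomposition is what makes the problem multi-label: a single clause can legitimately cover several topics at once, so each topic gets its own yes/no decision.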
Having categorized all the clauses, we are now able to suggest clauses around topics that are not covered by the contract.
This is where we leverage collective intelligence: the annotations of our legal experts, which also served as the training set for our clause classifier. We now have a “clause base” that fulfills both purposes: classifying clauses and recommending missing ones!
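The recommendation step itself can be sketched very simply: compare the topics covered by a contract against a checklist of topics a GDPR-compliant contract should address, and pull a template from the clause base for each gap. Topic names and templates below are illustrative:

```python
# Sketch of the "clause base" recommendation step: once every clause in a
# contract is tagged with topics, suggest a template clause for each required
# topic the contract does not cover. All names and templates are illustrative.

REQUIRED_TOPICS = {"roles_of_parties", "breach_notification", "data_retention"}

CLAUSE_BASE = {
    "roles_of_parties": "The processor acts only on documented instructions of the controller.",
    "breach_notification": "The processor shall notify the controller of any personal data breach without undue delay.",
    "data_retention": "Personal data shall be deleted or returned at the end of the services.",
}

def recommend_missing_clauses(covered_topics: set) -> dict:
    """Map each required topic the contract misses to a suggested template clause."""
    missing = REQUIRED_TOPICS - covered_topics
    return {topic: CLAUSE_BASE[topic] for topic in missing}
```

In the full pipeline, `covered_topics` is simply the union of the topics predicted by the clause classifier over the whole contract, which is how the same annotated clause base serves both classification and recommendation.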