How to Utilize Unstructured Text?
There is no doubt that we live in the era of information explosion. As human beings, we need to process a tremendous amount of text every day (emails, messages, news, etc.).
According to a study conducted at UCSD, people receive about 105,000 words per day during waking hours, or roughly 2.4 words per second over a 12-hour day.
Remember, this is the amount of information processed by a single person. The amount of unstructured text a business accumulates is humongous, since it combines the data generated by all of its employees as well as its customers.
So, how do we utilize unstructured text? This is one of the hottest questions among developers, researchers, investors and leaders. There is no way to enumerate every possible application of text data, since there is an infinite number of them. A better way is perhaps a top-down approach, where we define the basic operations we can perform on text data. Given these building blocks, developers can then come up with an infinite number of creative ways to build interesting applications. This is how technology usually works, isn’t it?
In the rest of this blog, I would first like to go over some of the existing methods and then describe a path that I envision will scale up unstructured text utilization in the near future.
Current State of Natural Language Processing
Natural language processing (NLP) has been one of the fastest growing fields in the last decade. Especially with the rise of deep learning, large datasets, GPU acceleration and pre-training, many NLP applications have evolved from research projects into commercial products used by tens of millions of people. The following NLP tasks are already commoditized as standard APIs accessible to many developers:
- Document/sentence classification (including topic classifications, intent detection, sentiment classifications etc.) that recognizes a categorical label given an input sentence.
- Sequence tagging, e.g. slot detection, entity recognition, POS tag detection etc.
- Sentence parsing that produces a linguistic structure, e.g. a dependency parse tree, for an input sentence.
- Machine translation that translates from one language to another.
We can perform the above tasks fairly well for many languages, provided we have enough training data. Many great libraries have been created to provide easy access to these NLP capabilities, e.g. NLTK, spaCy, Stanford CoreNLP, Google NLP APIs, Microsoft LUIS, and HuggingFace Transformers.
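To make the input/output contract of these commoditized tasks concrete, here is a minimal sketch of sentence-level classification: one sentence in, one categorical label out. The keyword lists are invented purely for illustration; the production services above use trained models rather than word matching.

```python
# Minimal sketch of sentence-level sentiment classification.
# The keyword lists are invented for illustration; real NLP APIs
# use trained models. The point is the contract: one sentence in,
# one categorical label out.

POSITIVE = {"great", "excellent", "delicious", "friendly", "amazing"}
NEGATIVE = {"terrible", "slow", "rude", "bland", "awful"}

def classify_sentiment(sentence: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a single sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("The food was delicious and the staff friendly."))
# -> positive
```

Note that the function handles exactly one sentence per call, which is precisely the limitation discussed next.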
Is the problem of using unstructured text solved? What is the catch? You may notice that all the above APIs/models are designed to process a single sentence or document. Yes, we can convert one sentence into many potential outputs, but isn’t the main challenge of BIG DATA its sheer size? Given 1,000,000 user reviews about a restaurant, or 100,000 long documents, none of the above methods is immediately useful. The lack of developer tools for applying the latest NLP technologies to BIG unstructured text at scale is the missing piece needed to fully unleash the power of unstructured text.
A Proposed Path to the Future
First of all, let’s look at the type of big data that we already handle very well, i.e. structured big data. Any modern database can easily digest millions of rows of data and conduct efficient queries. We can easily find a column or a field given a set of constraints, or get useful insights by analyzing a set of data records and visualizing them with tools like Matplotlib or Tableau.
So going back to first principles, what are the basic operations that empower all the applications built on top of structured data? It turns out there are roughly two (besides lower-level operations like create/delete):
- A filter function: find/select/match/etc.
- An aggregation function: max/mean/count/etc.
The filter function essentially enables the user to zoom in on one document/row or a subset of them, and the aggregation function provides the backend results for all sorts of wonderful analytics. In fact, all popular databases implement these two functions, including MySQL, MongoDB, Elasticsearch, BigTable, etc.
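The two operations can be sketched with SQLite from Python’s standard library; the `reviews` table and its values are invented for illustration.

```python
# Sketch of the two basic operations on structured data, using
# SQLite from the Python standard library. The table and its
# contents are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (restaurant TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("A", 5), ("A", 3), ("B", 4), ("B", 2), ("B", 5)],
)

# Filter: select the subset of rows matching a constraint.
high = conn.execute(
    "SELECT restaurant, rating FROM reviews WHERE rating >= 4"
).fetchall()
print(high)

# Aggregation: summarize many rows into one value per group.
stats = conn.execute(
    "SELECT restaurant, COUNT(*), AVG(rating) FROM reviews GROUP BY restaurant"
).fetchall()
print(stats)
```

Because the data has a schema, both the constraint (`rating >= 4`) and the aggregation key (`restaurant`) are unambiguous — exactly what unstructured text lacks.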
Then the question is: how can we have these two functions for unstructured text? This is actually quite a challenging task. For example, with a “filter” function over a pool of documents, the user may be looking for a subset of documents, a subset of sentences, or a subset of phrases. There are numerous possible ways of dividing text data, and it is not possible for a traditional database to support a true filter function for unstructured text. Moreover, since there is no schema, there is no obvious way to filter unstructured text by field constraints, e.g. column A = X.
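The granularity problem can be illustrated with a toy filter, where a naive substring match stands in for a real semantic matcher: the same query yields different result types depending on whether we filter whole documents or individual sentences.

```python
# Sketch of a "filter" over unstructured text at two granularities:
# whole documents vs. individual sentences. The substring match is
# a stand-in for a real semantic matcher.
def filter_documents(docs, query):
    """Return the documents that mention the query anywhere."""
    return [d for d in docs if query in d.lower()]

def filter_sentences(docs, query):
    """Return the individual sentences that mention the query."""
    hits = []
    for d in docs:
        for s in d.split("."):
            if query in s.lower():
                hits.append(s.strip())
    return hits

docs = [
    "The pasta was great. Service was slow.",
    "Loved the dessert. Would come again.",
]
print(filter_documents(docs, "service"))  # whole matching document
print(filter_sentences(docs, "service"))  # only the matching sentence
```

A traditional database offers no way to express which of these granularities the user means, let alone a phrase-level filter.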
Similarly, for aggregation, it does not make sense to count a set of documents or take the maximum of a set of sentences. Language is discrete, and two sentences can have very different (or very similar) meanings even when they differ by a single word. Aggregation over text should instead be conducted at a semantic level, where pieces of text are grouped or summarized.
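What semantic-level aggregation might look like can be sketched by grouping sentences whose bag-of-words vectors are similar. The cosine-over-word-counts measure and the greedy grouping below are stand-ins; a production system would use learned embeddings and proper clustering.

```python
# Sketch of aggregating text at a semantic level: group sentences
# by similarity instead of counting them or taking a "max". The
# bag-of-words cosine measure is a stand-in for learned embeddings.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_sentences(sentences, threshold=0.5):
    """Greedily assign each sentence to the first group it resembles."""
    groups = []  # list of (representative vector, member sentences)
    for s in sentences:
        vec = Counter(s.lower().split())
        for rep, members in groups:
            if cosine(vec, rep) >= threshold:
                members.append(s)
                break
        else:
            groups.append((vec, [s]))
    return [members for _, members in groups]

reviews = [
    "the pasta was great",
    "the pasta was amazing",
    "service was very slow",
]
print(group_sentences(reviews))
# -> [['the pasta was great', 'the pasta was amazing'],
#     ['service was very slow']]
```

Once text is grouped this way, familiar aggregations (counting per group, picking a representative) become meaningful again.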
Thus, achieving the basic “filter” and “aggregation” functions over a large text data source requires non-trivial AI technologies that can actually read and understand the meaning of text.