Source: Deep Learning on Medium
This article describes the need for, and the approach taken to develop, a solution that applies NLP and deep learning to unstructured data. By combining text extraction, text summarization, and taxonomy-based search over large volumes of historical data, the solution aims to cut business process time from hours to minutes and overcome the limits of manual search.
In a typical presales cycle, this is how things pan out:
The customer comes up with a new, innovative idea and requests a proposal. Senior management contacts the presales team and asks them to come up with something as soon as possible. Presales try to identify the relevant people, talk with each of them, and search for similar past work in tons of previous proposals, case studies, etc. After all this manual work, the team stitches together a deck that can be presented to the customer. Most of the time, the offering has to be tailored to fit the customer’s expectations.
An ideal solution is an intelligent, search-based solution that shortens the presales cycle by making past work easy to search and by providing a short summary of each historical case study. The end result is a customer-focused offering deck in minutes.
Salient features of the solution
- Ability to index various document types such as PDF, Word, video, and audio, and to provide efficient taxonomy-based and semantic search to find relevant documents in the existing repository.
- Provide a summary of each search result, ranked by relevance, along with metadata such as author, date, and content type.
- Ability to merge different search results and create a new presentation with a click.
- Classification of documents into business verticals and technology horizontals.
- A lightweight application that can be integrated with other enterprise solutions on-premise/cloud.
For the extraction of data and metadata from documents, Apache Tika is used. For extraction from video files, the Python FFmpeg module is used to convert video to audio files, and the Google Translate API is used to get the text from the audio files.
Text transcribed from audio needs punctuation to achieve the desired extraction accuracy. The Europarl module is used for this.
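The video-to-audio step can be sketched as below. This is a minimal illustration that assumes the `ffmpeg` command-line tool is installed and on the PATH. The function names, the WAV output format, and the 16 kHz mono settings are assumptions made for the example, not details taken from the solution itself.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that extracts the audio track as a WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", video_path,         # input video file
        "-vn",                    # drop the video stream, keep audio only
        "-acodec", "pcm_s16le",   # uncompressed 16-bit PCM audio
        "-ar", "16000",           # 16 kHz sample rate (assumed, common for speech)
        "-ac", "1",               # single (mono) channel (assumed)
        audio_path,
    ]


def video_to_audio(video_path: str) -> str:
    """Run ffmpeg on a video file and return the path of the generated WAV file."""
    audio_path = str(Path(video_path).with_suffix(".wav"))
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
    return audio_path
```

Calling `video_to_audio("demo.mp4")` would write `demo.wav` next to the video. The resulting audio file can then be passed to a speech-to-text service, and the transcript to a punctuation step.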
A Spring scheduler, which is basically a directory watcher service, monitors additions and updates in the source document directory.
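The idea behind such a watcher can be sketched with a simple polling loop. The solution itself uses a Spring scheduler. The Python version below is only a stand-in to show the logic, and the function names and the 60-second interval are assumptions.

```python
import os
import time
from typing import Callable, Dict, List, Optional, Tuple


def snapshot(directory: str) -> Dict[str, float]:
    """Map every file under `directory` to its last-modified timestamp."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            state[path] = os.path.getmtime(path)
    return state


def diff_snapshots(old: Dict[str, float],
                   new: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (added, updated) file paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    updated = sorted(p for p in new if p in old and new[p] > old[p])
    return added, updated


def watch(directory: str,
          on_change: Callable[[List[str], List[str]], None],
          interval_seconds: float = 60.0,
          max_polls: Optional[int] = None) -> None:
    """Poll `directory` and call `on_change(added, updated)` when files change."""
    previous = snapshot(directory)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval_seconds)
        current = snapshot(directory)
        added, updated = diff_snapshots(previous, current)
        if added or updated:
            on_change(added, updated)
        previous = current
        polls += 1
```

In a pipeline like the one described here, the `on_change` callback would trigger extraction and re-indexing for the new or modified files only.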
Apache OpenNLP is used to summarize text from documents. The Python NLTK module is used to summarize the punctuated text from video/audio files.
The Weka framework is used to classify the documents into business verticals and technology horizontals.
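To give a feel for extractive summarization, here is a minimal frequency-based sketch in plain Python. It does not use OpenNLP or NLTK and is not the summarizer used in the solution. The stopword list, the naive sentence splitter, and the scoring rule are all simplifying assumptions.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (assumption, far from complete)
STOPWORDS = {"a", "an", "the", "is", "are", "was", "be", "of", "to", "in",
             "and", "for", "on", "it", "this", "that", "with", "by", "as"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(text: str) -> List[str]:
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        # Average word frequency, so long sentences are not favored by length alone
        return sum(freq[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)
```

This relies on sentence boundaries, which is why transcripts from audio need punctuation restored before they can be summarized this way.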
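The article does not say which Weka algorithm is used. As one common choice for text classification, a multinomial Naive Bayes classifier can be sketched in plain Python as below. The class name, the training examples, and the labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Iterable, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        for text, label in examples:
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text: str) -> Optional[str]:
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data for the "business vertical" axis
classifier = NaiveBayesClassifier()
classifier.train([
    ("patient clinical trial hospital records", "healthcare"),
    ("hospital patient diagnosis report", "healthcare"),
    ("bank loan credit payment fraud", "banking"),
    ("credit card payment gateway bank", "banking"),
])
```

Since documents are classified along two axes, one such classifier could be trained for business verticals and a second one for technology horizontals.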
Apache Lucene is used to index the text extracted from documents and video files. User search queries are served from the same index. Both semantic and taxonomy-based search are supported.
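The core idea of an index with taxonomy-based query expansion can be sketched as follows. This is a toy inverted index in plain Python, not Lucene. The `TAXONOMY` entries, the class name, and the term-frequency ranking are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical taxonomy: a broad term maps to narrower terms
TAXONOMY = {
    "nlp": ["summarization", "classification"],
    "cloud": ["aws", "azure"],
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> document id -> term frequency."""

    def __init__(self) -> None:
        self.postings = defaultdict(lambda: defaultdict(int))

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self.postings[token][doc_id] += 1

    def search(self, query: str,
               taxonomy: Optional[Dict[str, List[str]]] = None) -> List[str]:
        taxonomy = TAXONOMY if taxonomy is None else taxonomy
        # Expand each query term with its narrower taxonomy terms
        terms = set()
        for token in tokenize(query):
            terms.add(token)
            terms.update(taxonomy.get(token, []))
        # Score documents by the summed frequency of matching terms
        scores = defaultdict(int)
        for term in terms:
            for doc_id, freq in self.postings.get(term, {}).items():
                scores[doc_id] += freq
        # Highest score first, ties broken by document id
        return sorted(scores, key=lambda d: (-scores[d], d))


index = TinyIndex()
index.add("proposal-1", "Text summarization for legal documents")
index.add("case-study-2", "Migrating workloads to AWS and Azure")
index.add("deck-3", "Retail analytics dashboard")
```

With this setup, a query for "cloud" finds the migration case study even though the word "cloud" never appears in it, which is the benefit a taxonomy brings over plain keyword matching.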
The application is built using Spring Boot and Spring Security. Bootstrap and jQuery are used primarily to build the UI/UX features. A word cloud is also generated using the Python word cloud module.
The solution helps merge slides from different decks based on the selection made in the UI. This is achieved using Apache POI.
The sales process is one important use case that demonstrates the need for such a solution. Other relevant use cases include physicians searching past clinical reports, lawyers researching relevant historical cases, and Ph.D. scholars researching topics…
This solution approach/design can be implemented in weeks and start improving business processes. Since only open source tools and technologies are used, no licensing cost is involved in developing and maintaining the solution.