Source: Deep Learning on Medium
This is an executive summary about what we managed to do in approximately 2 months in the Profi.ru DS department (Profi.ru is one of online leading service marketplaces in the CIS region) within a Semantic factory project.
This article mostly focuses on comparing novel NLP techniques applicable in a applied business setting and may serve as a guide if you plan to do something similar in your company/project.
In a nutshell — we had 2 objectives:
- Make search / routing within the main search bar on Profi.ru better (supervised task);
- In doing so — develop unsupervised methods to efficiently search for new services in third-party data (unsupervised task);
We considered many architectures, but in the end a single bi-LSTM (or bi-GRU) with embedding bag layer instead of plain embeddings (essentially fast-text meets RNN) turned out to be the best both for supervised and unsupervised tasks (see charts below). Remarkably, making the model robust to partial input/input with errors increased its performance. Also adding second objective (fake “service” detection) also slightly boosted accuracy. More sophisticated models like transformer showed no real benefits in our domain.
We used internal data from several main sources:
- Actual client search (30k unique queries per 3–6 months) + manual annotation (which produced our main dev-set);
- Internal synonym database (30k unique synonyms);
- Internal marketing SEO databases with annotation (300k);
No one built ML models before us in the company, so we had to improvise when obtaining the data and managing its annotation. Also it should be noted that even though the total number of queries is significant, the unique query count turned out to be very low, which forced us to be creative in obtaining / augmenting data. Also the quality of data-sets and annotation at hand varied quite drastically. The best supervised datasets were, as expected, the manually annotated dev-set and the synonym database. We also tried exploring conversion pipelines (i.e. query — session — order) and distilling previous classifier results — with expected poor performance.
As for external data, we used the following sources:
- Search engine data (data provided by SEO panels of Google and Yandex as well as their statistical services) (11m queries);
- External SEO services (3m queries) ;
- Our competitors’ public data (1.5m queries);
60%with elastic search) accuracy on client routing / intent classification (~
- Search is agnostic to input quality (misprints / partial input);
- Classifier generalizes, the morphologic structure of the language is exploited;
- To be on the safe side — at least
1,000new services were found + at least
15,000synonyms (vs. the current state of
30,000). I expect this figure to double of even triple;
- We thoroughly compared many modern sentence embedding techniques using a downstream classification task + KNN with a database of service synonyms;
- We managed to beat weakly supervised (essentially their classifier is a bag-of-ngrams) elastic search on this benchmark (see details below) using UNSUPERVISED methods;
- We developed a novel way of building applied NLP models (a plain vanilla bi-LSTM + bag of embeddings, essentially fast-text meets RNN) — this takes the morphology of the Russian language into consideration and generalizes well;
- We demonstrated that our final embedding technique (a bottle-neck layer from the best classifier) combined with state-of-the-art unsupervised algorithms (UMAP + HDBSCAN) can produce stellar clusters on external data;
- We demonstrated in practice the possibility, feasibility and usability of: (i) knowledge distillation (ii) augmentations for text data (sic!);
- Training text-based classifiers with dynamic augmentations reduced convergence time drastically (10x) compared to generating larger static datasets (i.e. the CNN learns to generalize the mistake being shown drastically less augmented sentences);
What works in NLP now?
A birds’ eye view on the NLP landscape:
Also, you may know that NLP may be experiencing the Imagenet moment now.
Large scale UMAP hack
When building clusters, we stumbled upon a way/hack to essentially apply UMAP to 100m+ point (or maybe even 1bn) sized datasets. Essentially build a KNN graph with FAISS and then just rewrite the main UMAP loop into PyTorch using your GPU. We did not need that and abandoned the concept (we had only 10–15m points after all), but please follow this thread for details.
What works best
- For supervised classification fast-text meets the RNN (bi-LSTM) + carefully chosen set of n-grams;
- Implementation — plain python for n-grams + PyTorch Embedding bag layer;
- For clustering — the bottleneck layer of this model + UMAP + HDBSCAN;
Best classifier benchmarks
Manually annotated dev-set
Manually annotated dev-set + 1–3 typos per query
Manually annotated dev-set + partial input
Large scale corpora / n-gram selection
We collected the largest corpora for the Russian language:
- We collected a
100mword dictionary using 1TB crawl;
- Also use this hack to download such files faster (overnight);
- We selected an optimal set of
1mn-grams for our classifier to generalize best (
500kmost popular n-grams from fast-text trained on Russian Wikipedia +
500kmost popular n-grams on our domain data);
Stress test of our 1M n-grams on 100M vocabulary:
In a nutshell:
- Take a large dictionary with errors (e.g. 10–100m unique words);
- Generate an error (drop a letter, swap a letter using calculated probabilities, insert a random letter, maybe use keyboard layout, etc);
- Check that new word is in the dictionary;
We brute forced a lot of queries to services like this one (in attempt to essentially reverse engineer their dataset), and they have a very small dictionary inside (also this service is powered by a tree classifier with n-gram features). It was kind of funny to see that they covered only 30–50% of words we had on some corpora.
Our approach is far superior if you have access to a large domain vocabulary.
Best unsupervised / semi-supervised results
KNN is used as a benchmark to compare different embedding methods. This is more of an illustration, but obviously we used these methods for clustering / searching for new services.
(vector size) List of models tested:
- (512) Large scale fake sentence detector trained on 200 GB of common crawl data;
- (300) Fake sentence detector trained to tell a random sentence from Wikipedia from a service;
- (300) Fast-text obtained from here, pre-trained on araneum corpus;
- (200) Fast-text trained on our domain data;
- (300) Fast-text trained on 200GB of Common Crawl data;
- (300) A Siamese network with triplet loss trained with services / synonyms / random sentences from Wikipedia;
- (200) First iteration of embedding bag RNN’s embedding layer, a sentence is encoded as a whole bag of embeddings;
- (200) Same, but first the sentence is split into words, then each word is embedded, then plain average is taken;
- (300) The same as above but for the final model;
- (300) The same as above but for the final model;
- (250) Bottleneck layer of the final model (250 neurons);
- Weakly supervised elastic search baseline;
To avoid leaks, all the random sentences were randomly sampled. Their length in words was the same as the length of services / synonyms they were compared with. Also, measures were taken to make sure that models did not just learn by separating vocabularies (embeddings were frozen, Wikipedia was under-sampled to make sure that there was at least one domain word in each Wikipedia sentence).
An example of how the best embedding technique + UMAP work on one of the external corpora.
Cluster exploration “interface”
Green — new word / synonym.Gray background — likely new word.
Gray text — existing synonym.
Ablation tests and what works, what we tried and what we did not
- See the above charts;
- Plain average/tf-idf average of fast-text embeddings — a VERY formidable baseline;
- Fast-text > Word2Vec for Russian;
- Sentence embedding by fake sentence detection kind of works, but pales in comparison with other methods;
- BPE (sentencepiece) showed no improvement on our domain;
- Char level models struggled to generalize, despite the recant paper from google;
- We tried multi-head transformer (with classifier and language modeling heads), but on the annotation available at hand it performed roughly the same as plain vanilla LSTM-based models. When we migrated to embedding bad approaches we abandoned this line of research due to the transformer’s lower practicality and impracticality of having an LM head along with embedding bag layer;
- BERT — seems to be overkill, also some people claim that transformers train literally for weeks;
- ELMO — using a library like AllenNLP seems counter-productive in my opinion both in research/production and education environments for reasons I will not provide here;
- Docker container with a simple web-service;
- CPU-only for inference is enough;
2.5 msper query on CPU, batching not really necessary;
1GBRAM memory footprint;
- Almost no dependencies, apart from
pandas(and web server ofc).
- Mimic fast-text n-gram generation like this;
- Embedding bag layer + indexes as just stored in a dictionary;
We have shown that you can apply relatively simple / off-the-shelf tools for solving both supervised and unsupervised challenges within a semantic search framework in a real business.
- A more detailed presentation in Russian;
- Parsing Common Crawl and Wikipedia;
- How to mimic fast-text bag of embeddings split in one function;
- More pre-processed common crawl;
- Fast-text word vectors pre-trained on Wikipedia;
- Fast-text and Word2Vec models for the Russian language;
- Seminal word embeddings papers: Word2Vec, Fast-Text, further tuning;
- Some of the current SOTA CNN-based approaches: InferSent / Generative pre-training of CNNs / ULMFiT /Deep contextualized word representations (Elmo);
- Imagenet moment in NLP?
- Sentence embedding baselines 1, 2, 3, 4;
- Fake Sentence Detection as a Training Task for Sentence Encoding;
Originally published at spark-in.me on November 2, 2018.