Source: Deep Learning on Medium
Launchmetrics is a marketing platform and analytics solution that helps Fashion, Luxury and Cosmetics (FLC) professionals discover, activate and measure the voices that matter for their brands. It is the most essential and trusted platform in the industry, with unrivaled penetration among the top seventy fashion and luxury brands worldwide, including Dior, Fendi, NET-A-PORTER and Topshop.
Founded in NYC, with operating headquarters in Paris, offices in London, Milan, Los Angeles, Tokyo, Madrid and Girona, and support in five languages, the company works with over 1,000 brands as well as partners like IMG, the Council of Fashion Designers of America, the British Fashion Council, Pitti Immagine, and Google to accelerate their business and build lasting exposure. The company’s industry communities, GPS Radar and Style Coalition, bring together over 50,000 influencers, editors, buyers and others to share content, events, news and images.
With such a highly specific market, it’s critical that we help our clients identify which products and brands are referenced in a given article or post. In Europe, over 40 data scientists and analysts work with the company’s technology teams to develop proprietary algorithms that leverage AI technology and real machine learning to bring cutting-edge changes to FLC industries. In this article we share the specifics of the product type and brand recognition algorithms that we are developing at Launchmetrics, based on deep learning for the fashion vertical.
NER and its limits for Launchmetrics use
Nowadays, people communicate almost everything in natural language. As a consequence, online and social media generate large amounts of text content on a daily basis. Managing this information correctly is very important to get the most out of each article. AI algorithms can be used for several generic purposes, such as classifying content from providers, making search algorithms more efficient, and powering content recommendations and customer support, amongst others.
Named Entity Recognition (NER) can be described as the problem of locating and classifying named entities (people, places, companies, cities, organizations and others) in a given text. NER is one of the starting points for using Natural Language Processing (NLP) to augment text content. Extracting key entities adds semantic knowledge to unstructured content, and this helps to understand the subject of any given text.
For some specific domains, NER is not enough. Specifically, in the fashion domain it is not enough to know if the article is about a product or a brand. The real value comes from knowing which product and which brand. For example: Is the post talking about Celine bags or about Coach trousers?
The main challenges to address this problem are:
- Multi-language product detection.
- Brand detection and ambiguity discovery.
- Getting enough data to train the neural networks.
- Building a manageable and scalable architecture and code base.
There are several generic options in the market to solve the NER challenge, but we will see that focusing on a specific domain helps us to have good performance and to successfully address the challenges above.
Is deep learning a good option?
The area of Natural Language Processing has been around for decades, with researchers pushing hard to get better and better performance using traditional machine learning algorithms. In the past 30 years, many papers in the area have been published [4, 5, 6, 7]. However, over the past few years, Deep Learning architectures and algorithms have made impressive advances in the field, surpassing the previous state-of-the-art results for some common NLP tasks such as NER, sentiment analysis or machine translation.
There are several reasons for that:
- Deep Learning is able to take advantage of massive amounts of data. The “Big Data Era” of technology provides a huge number of opportunities for innovation in deep learning.
- Deep Learning requires high-end machines, unlike traditional Machine Learning algorithms. GPUs have become an integral part of running any Deep Learning algorithm.
- Deep Learning algorithms learn high-level features from data in an incremental manner. This reduces the need for domain expertise and hard-coded feature extraction.
- Deep Learning provides end-to-end solutions, whereas traditional Machine Learning techniques need to break the problem down into different parts and then combine their results as a final stage.
- Usually, a Deep Learning algorithm takes a long time to train due to the large number of parameters (two weeks could be a good reference measure). Conversely, traditional Machine Learning algorithms need from a few seconds to a few hours to train. The scenario is completely reversed in the testing phase. At test time, Deep Learning algorithms take much less time to run.
So, is it always a good option to use deep learning techniques to solve a problem? The answer is no. Deep learning will perform better if…
- … the dataset is large. With a small dataset, traditional Machine Learning algorithms are preferable.
- … we have high end infrastructure to train models in reasonable time.
- … there is a lack of domain understanding for feature introspection, as we have to worry less about feature engineering.
- … the problem is complex, such as image classification, natural language processing or speech recognition.
Can we use deep learning for product and brand recognition?
As stated before, one of the first premises for using deep learning is to have data, and at Launchmetrics we have huge amounts of it. We crawl 20 million documents per day in 11 Latin languages, and we have 18 years of historical data. We classify all the crawled documents into different categories, and therefore we can build a rich fashion dataset. These documents contain mentions of fashion products (dress, trousers, shoes, jackets, …) and of fashion brands (Celine, Boss, Coach, …).
For this dataset to be usable, we need to clean and annotate the data so that we can train, validate and test our deep learning models. This process is done manually by our data team.
Finally, we harness the Google cloud infrastructure, using several GPU-powered servers to train, validate and test the models. NER is a complex challenge, so deep learning algorithms are a good choice.
Now that we’ve argued why deep learning is a good approach, let’s state our problem a bit more formally and see which metrics are the best to measure the model’s success.
Our problem is stated as follows:
Given an article, we wish to detect all the product mentions and their positions within it, so that we know which products a publication is talking about. Seems easy, doesn’t it? So let’s detect synonyms as well, and in multiple languages, without a dictionary next to us. We currently have 50 product types for fashion alone. Moreover, we also want to detect brand mentions while discarding ambiguities. Boss, Coach or Gap are clear examples of brands with a high degree of ambiguity, and we need to detect such cases in order to provide the correct information to our clients.
Now we need to choose a simple, observable and attributable metric that tells us how far we are from our objective: we will use precision and recall. We need to be sure that our algorithms label all the products and brands present in the articles; if they label nearly all of them, they have a high recall. Besides a high recall, we also need a high precision: whenever the algorithm labels something as a brand or product, the label should be correct.
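To make the two metrics concrete, here is a minimal sketch of how they could be computed for one article. The mention format (start offset, end offset, label), the exact-match criterion and the sample annotations are assumptions made for illustration; the article does not describe the evaluation code actually used.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over labeled mentions.

    Each mention is a (start, end, label) tuple. In this sketch a
    prediction counts as correct only if all three fields match a
    gold annotation exactly (an assumption, not the article's rule).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for one article: character offsets plus a label.
gold = [
    (10, 15, "PRODUCT:dress"),
    (32, 38, "BRAND:Celine"),
    (60, 65, "PRODUCT:shoes"),
    (90, 96, "PRODUCT:jacket"),
]
predicted = [
    (10, 15, "PRODUCT:dress"),   # correct
    (32, 38, "BRAND:Celine"),    # correct
    (70, 75, "BRAND:Coach"),     # false positive, e.g. "coach" as a trainer
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predictions are right (precision) and two of the four annotated mentions are found (recall), which shows why both numbers are needed.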
Current deep learning solutions and limits
There already exist several state-of-the-art technologies based on deep learning to solve the NER problem. Some of the most successful can be summarized as follows:
- spaCy provides several models trained on different corpora. Depending on the corpus used, they follow a more or less fine-grained NER annotation schema, ranging from 4 to 18 entity types (Person, Product, Org, etc.). There’s one model per language.
- Stanford Named Entity Recognizer (Finkel et al., 2005) is a Java implementation that provides a good named entity recognizer for English with 3 entity types (Person, Org and Location). Models for other languages and circumstances are also provided.
- Google Cloud Natural Language API supports NER for 8 different languages. It is able to classify 8 entity types.
- Azure is much more expensive than Google and still in beta; it claims to recognize more than 20 categories.
- Flair is an NLP library by Zalando Research. It can do NER, PoS tagging, sense disambiguation and classification. It can use different word and document embeddings, including its own variation. It has state-of-the-art results on the de facto standard datasets. It builds on top of PyTorch.
- AllenNLP is an Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. It is an open-source project backed by the Allen Institute for Artificial Intelligence (AI2).
The solutions above are able to detect entities such as brands and products. However: (i) they are not able to detect them with the precision and recall that we need to solve our specific problem; (ii) they are not able to detect the specific product or brand; and (iii) they are not able to deal with ambiguities. This is why, in order to produce outstanding results for our clients, we need to build our own solution.
How we do it at Launchmetrics
We followed some of the guidelines published by the Google team for working on machine and deep learning algorithms [1, 2, 3]. Phase 1 is explained in the previous section, and here we will concentrate on Phase 2, the steps of which are summarized in the next figure.
Coding our models
Using deep learning architectures and algorithms, we aim to detect product types in multiple languages. We use English as the reference language and build a model that translates between it and each of the other languages we work with.
The translation process works as follows:
- Given a document, we use fastText to obtain the word embeddings.
- These embeddings are then used as input to the MUSE algorithm to align the desired language to English.
- Finally, we use the aligned model to obtain the translation from the desired language to English.
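The last step can be pictured as a nearest-neighbour lookup: once the source-language vectors live in the same space as the English ones, a word is translated by finding the closest English vector. The sketch below assumes the alignment has already been done and uses made-up 3-dimensional vectors and a hypothetical Spanish vocabulary; training the fastText embeddings and running MUSE are not shown, and plain cosine similarity is an assumption, since the article does not say which retrieval criterion is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(word, source_vectors, english_vectors):
    """Return the English word whose vector is closest to the source word."""
    query = source_vectors[word]
    return max(english_vectors, key=lambda en: cosine(query, english_vectors[en]))


# Toy vectors, assumed to be already aligned to the English space.
# The embeddings described in the article are 300-dimensional.
english_vectors = {
    "dress": [0.9, 0.1, 0.0],
    "shoes": [0.1, 0.9, 0.1],
    "bag": [0.0, 0.2, 0.9],
}
spanish_aligned = {
    "vestido": [0.8, 0.2, 0.1],
    "zapatos": [0.2, 0.8, 0.0],
    "bolso": [0.1, 0.1, 0.8],
}

for word in spanish_aligned:
    print(word, "->", translate(word, spanish_aligned, english_vectors))
```

With this design, a product detector trained on English only needs the lookup table to handle another language, which is what makes English useful as a reference language.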
The product type and brand detection models are built using an LSTM (Long Short-Term Memory) neural network. This network allows us to predict whether each word in a document is an entity or not. We prefer it over a plain RNN (Recurrent Neural Network) because it mitigates the vanishing gradient problem: the cell state is updated additively rather than through repeated multiplication.
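The additive update is easiest to see in a single LSTM time step. The following is a minimal sketch of the standard LSTM cell equations with scalar input, hidden state and cell state; the weights are made up, and it is not the production network, whose architecture the article does not detail.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM cell with scalar input, hidden and cell state.

    `params` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write
    c = f * c_prev + i * g                      # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# Made-up weights, for illustration only.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.9, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # stand-ins for per-word input features
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f} h={h:+.4f} c={c:+.4f}")
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through almost unchanged, so information (and gradient) can survive across many words.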
Train, evaluate, and tune the models
Following common deep learning practice, we used 90% of the data to train, evaluate and tune the models.
For the translation step we first tried the fastText pre-built models. Those models were built using generic data from Wikipedia, and in the evaluation phase we saw that results for the fashion vertical were very poor. Therefore, we retrieved 3 million documents per language from our historical data to train our own word embeddings for the fashion vertical and built our own translation model. The algorithm then converts each word into a dense 300-dimensional vector.
The MUSE algorithm and the neural network need these embeddings as inputs instead of the words themselves.
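A 90/10 split of this kind can be sketched in a few lines. The shuffling, the fixed seed and the placeholder document ids are assumptions for illustration; how the 90% is further divided between training, evaluation and tuning is not specified in the article.

```python
import random


def split_dataset(examples, development_fraction=0.9, seed=13):
    """Shuffle the examples and split them into development and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * development_fraction)
    return shuffled[:cut], shuffled[cut:]


documents = [f"doc_{i}" for i in range(1000)]  # placeholder document ids
development, test = split_dataset(documents)
print(len(development), "documents to train, evaluate and tune")
print(len(test), "documents held out for testing")
```

Keeping the test documents completely separate from anything used for tuning is what makes the final precision and recall figures trustworthy.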
To train the product and brand detection models we used data from online media as well as from social media (Instagram and YouTube). Specifically, we use 2000 mentions per product, labeled as brand/no brand. The no-brand mentions are required to detect ambiguities and discard documents that do not really talk about a brand but about some ambiguous topic (e.g. Celine vs Celine Dion).
We also measured the error rate of our existing algorithms when crawling documents about a specific brand.
For brands with ambiguous names, we have seen that 64% of articles are not really talking about the specific brand, but about other topics.
Testing our model
The remaining 10% of the data is used to test the models. For product type recognition we obtained a precision of 93.07% and a recall of 94.52%. For brand detection we obtained a precision of 87.95% and a recall of 86.51%.
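Precision and recall are often combined into a single F1 score, their harmonic mean. The article does not report F1; the sketch below simply derives it from the precision and recall figures given above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages as reported in the article.
reported = {
    "product type": (93.07, 94.52),
    "brand": (87.95, 86.51),
}
for task, (precision, recall) in reported.items():
    print(f"{task}: F1 = {f1_score(precision, recall):.2f}%")
```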
Deploying and versioning the models
We store our models in the cloud and send prediction requests to them from our products. Currently we have brand detection and disambiguation integrated in one of our products. We monitor the prediction service and compute the metrics per day. For example, with brand detection and disambiguation, now only 18% of the articles are not related to the brand, compared to the 64% that we had without it. The whole system is designed to scale. Since the content of fashion documents changes frequently, we retrain the models monthly.
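A daily metric like the share of unrelated articles can be computed with a simple aggregation. This is only a sketch: the record format, the day labels and the idea that relatedness comes from a manual review are all assumptions, as the article does not describe the monitoring pipeline.

```python
from collections import defaultdict


def unrelated_share_per_day(records):
    """Fraction of brand-attributed articles per day that were not about the brand.

    `records` is an iterable of (day, is_about_brand) pairs, where
    `is_about_brand` is assumed to come from a manual review.
    """
    totals = defaultdict(int)
    unrelated = defaultdict(int)
    for day, is_about_brand in records:
        totals[day] += 1
        if not is_about_brand:
            unrelated[day] += 1
    return {day: unrelated[day] / totals[day] for day in totals}


# Made-up review log: two days, five articles each.
records = [
    ("day 1", True), ("day 1", True), ("day 1", False), ("day 1", True), ("day 1", True),
    ("day 2", True), ("day 2", False), ("day 2", False), ("day 2", True), ("day 2", True),
]
shares = unrelated_share_per_day(records)
for day in sorted(shares):
    print(day, f"{shares[day]:.0%} unrelated")
```

Tracking this number per day makes it visible when the content drifts and the models are due for retraining.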
What is next?
The methods described are language and channel dependent, and their main drawback is that they will not scale well if new languages are added to our platforms. We are working towards models that detect products, brands and celebrities in articles coming from all channels (print, social and online) regardless of the language used.
In order to improve the experience of our customers, we have started to work with Natural Language Understanding (NLU) techniques and have begun investigating text summarization, amongst others. We would like to include more inference and reasoning in our algorithms in order to give better insights to our clients. This will allow us not just to tell them what is going on with their brands but also, for example, how they can improve (based on data) their marketing campaigns or their position in the market relative to their competitors.
Finally, since the web, and especially the FLC vertical, is more visual than textual, we are working on computer vision algorithms to detect products, brands and logos in images. We wish to use both visual and text information to provide richer information and see how these algorithms can benefit from each other. Additionally, for Phase 1, we wish to improve our annotation system by using weak labeling [1,2] and knowledge distillation methods.
Thanks to Josep Roura, Joan Espasa, Arthur Hieulle, Katherine Knight, Fabien Brazic and Alison Levy for their feedback.
Multi-Label Learning with Weak Label. Yu-Yin Sun, Yin Zhang, Zhi-Hua Zhou. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10).
Snorkel: Rapid Training Data Creation with Weak Supervision. Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré. Proceedings of the VLDB Endowment, Vol. 11, No. 3.
Data-Free Knowledge Distillation for Deep Neural Networks. Raphael Gontijo Lopes, Stefano Fenu, Thad Starner. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Deep Learning. Ian Goodfellow, Yoshua Bengio and Aaron Courville. 2016. The MIT Press.
Bates, M. (1995). “Models of natural language understanding”. Proceedings of the National Academy of Sciences of the United States of America. 92 (22): 9977–9982. doi:10.1073/pnas.92.22.9977. PMC 40721.
Christopher D. Manning and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. The MIT Press. ISBN 978-0-262-13360-9.
David M. W. Powers and Christopher C. R. Turk (1989). Machine Learning of Natural Language. Springer-Verlag. ISBN 978-0-387-19557-5.