
How to build a smart search engine (Part II)

Creating an intelligent search service in Python

All images by author

In the first post in this series, we built a search engine in just a few lines of code, powered by the BM25 algorithm that is used in many of the largest enterprise search engines today.

In this post, we want to go beyond this and create a truly smart search engine. This post will describe the process to do this and also provide template code to achieve this on any dataset.

But what do we mean by ‘smart’? We are defining this as a search engine which is able to:

  • Return relevant results to a user even if they have not searched for the specific words within these results.
  • Be location aware; understand UK postcodes and the geographic relationship of towns and cities in the UK.
  • Be able to scale up to larger datasets (we will be moving to a dataset of 212k records, larger than the one in our previous example, but we need to be able to scale to much larger data).
  • Be orders of magnitude faster than our last implementation, even when searching over large datasets.
  • Handle spelling mistakes, typos and previously ‘unseen’ words in an intelligent way.

In order to achieve this, we will need to combine a number of techniques:

  • fastText Word vectors. We will train a model on our data set to create vector representations of words (more information on this here).
  • BM25. We will still be using this algorithm to power our search, but we will need to apply it to our word vector results.
  • Superfast searching of our results using the lightweight and highly efficient Non-Metric Space Library (NMSLIB).

This will look something like the below:

An overview of the pipeline we will be creating in this post

This article will walk through each of these areas and describe how they can be brought together to create a smart search engine.

1. Set-up; Preprocess and tokenize text

The first step in creating a search engine is splitting our documents into individual words or ‘tokens’. The spaCy library makes this both very simple and very fast to achieve. As a reminder, the example we are using in this article is the same as the one in the previous article. It contains UK public sector contract notices, published on the Contracts Finder platform. However, for the purposes of this exercise, we have increased the dataset size (it is now 212k records; previously it was only 50k). In addition to this, we have also brought location data into our dataset. Prior to any processing, the data frame we are using looks like the below:

Some example records of the dataset (212k records in total)

The column we will be using for our search engine is the ‘Text’ column which is an amalgamation of the free text and location fields for each notice.

We can take this column, clean it and tokenise it, all using the spaCy library. The below code does this using a spaCy pipe; this makes the processing as efficient as possible and also allows us to choose only the parts of the tokenizer engine that we want to use (again ensuring that the process is as fast as possible):
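A minimal sketch of this step could look like the following (the model name, batch size and disabled pipeline components are illustrative choices; df is the data frame shown above and ‘Text’ is its search column):

import spacy

# Load a small English model and disable the components we do not need,
# keeping little more than the tokenizer for speed.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

def tokenize_documents(texts):
    tokenized = []
    # nlp.pipe streams the documents through spaCy in batches, which is far
    # faster than calling nlp() on each document individually.
    for doc in nlp.pipe(texts, batch_size=500):
        tokens = [
            token.text.lower()                            # lowercase
            for token in doc
            if not token.is_punct and not token.is_space  # drop punctuation and whitespace
        ]
        tokenized.append(tokens)
    return tokenized

# df is the 212k-record data frame shown above; 'Text' is the search column.
tokenized_docs = tokenize_documents(df["Text"].astype(str))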

The above code splits our documents into a list of tokens whilst performing some basic cleaning operations to remove punctuation, white space and convert the text to lowercase. Running on a Colab notebook, this can process over 1,800 notices a second.

2. Create word vectors; build a fastText model

Why word vectors? Why not BERT/GPT-3/[latest SOTA NLP model]?

Since the introduction of sophisticated transformer models like BERT, word vector models can seem quite old-fashioned. However, they are still relevant today for the following reasons:

  • They are ‘lightweight’ when compared to transformer models in all areas that matter when creating scalable services (model size, training times, inference speed).
  • Due to the above point, they can be trained from scratch on domain-specific texts. In addition to this, they can be trained on relatively small data sets (i.e. thousands of documents rather than the many millions typically used to train transformer models).
  • They are easier to interpret due to the fact that a word vector will remain consistent and will not change based on the context of the surrounding text (both an advantage and a disadvantage, more on this later).

In addition to the above, they are super simple to implement using the Gensim library. Here we are building a fastText model:
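A minimal sketch of the training step, assuming the tokenized_docs list from the previous section; the hyperparameters below are illustrative (vector_size=100 matches the 100-dimensional vectors referred to later in the article):

from gensim.models.fasttext import FastText

# Train a fastText model from scratch on the tokenized notices (Gensim 4 API).
ft_model = FastText(
    sg=1,              # skip-gram
    vector_size=100,   # dimensionality of the word vectors
    window=10,         # context window size
    min_count=5,       # ignore very rare tokens
    negative=15,       # negative sampling
    min_n=2,           # smallest character n-gram
    max_n=5,           # largest character n-gram (sub-words help with typos and unseen words)
)

ft_model.build_vocab(tokenized_docs)
ft_model.train(
    tokenized_docs,
    total_examples=ft_model.corpus_count,
    epochs=6,
)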

Reviewing performance:

Now that we have trained up our model, let’s see how it performs.

It never ceases to amaze me just how effective fastText can be at capturing the relationships between words within a corpus. This is best demonstrated with a few examples:

Most similar words to ‘m4’:

ft_model.wv.most_similar("m4", topn=20, restrict_vocab=5000)
Most similar words to ‘m4’. The model understands the link to UK motorway naming. Higher scores indicate greater similarity

This really is quite astounding; the model has clearly learned that M4 refers to a UK motorway and understands that other large UK motorways are similar to it (M1, M5, M3, M60).

It has also learned that LRN is closely related (this stands for Local Road Network); I did not even know this myself!

The ‘9AT’ token looks quite odd; however, a quick search reveals that this is the postcode of Highways England.

Including postcode and location information in our word vector model was a deliberate design choice. The rationale was that the model would then understand how UK postcodes and locations relate to one another. Let’s put this to the test:

Most similar words to ‘Yorkshire’:
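This can be checked with the same most_similar call as before (the topn and restrict_vocab values are illustrative):

ft_model.wv.most_similar("yorkshire", topn=20, restrict_vocab=5000)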

Similarity scores for the closest words to ‘Yorkshire’. Higher scores indicate greater similarity

The model has learned that Yorkshire is a region in the north of England and has picked up the major cities and towns within it. It also understands the relationship between this region and its sub-regions; ‘Riding’ here refers to the North/East/West Ridings which sit within the county of Yorkshire. But what about postcodes?

Most similar words to ‘RG9’:

RG9 is a postcode (zip code) within the UK which relates to the town of Henley. This is a tricky example as Henley is quite a small town and the RG postcode area is also used for other, larger nearby towns (such as Reading). Will the model be able to correctly associate this postcode with Henley?
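Again, this can be queried in the same way (the topn value is illustrative):

ft_model.wv.most_similar("rg9", topn=10)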

The model knows that the RG9 postcode relates to the town of Henley

It passes with flying colours! Henley is the most similar word and the other results are also highly relevant, representing neighbouring towns and postcodes.

Clearly our word vector model is performing well. In the next step we will supercharge our word embeddings using the BM25 algorithm.

3. Apply BM25 to word vectors

Now that we have our word vectors, we need to find a way of combining these for each document in our search engine.

The simplest way of doing this would be to average the word vectors for each document (and this would work); however, it has been shown that combining word vectors with the BM25 algorithm can produce much higher quality search results [1].

A simple recap of BM25 is below but please review my first post for more details on its inner workings:

A recap on the inner workings of BM25
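For reference, the standard form of the BM25 score of a document D for a query Q with terms q_1, …, q_n is:

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the length of the document, avgdl is the average document length across the corpus, and k_1 and b are free parameters (typically k_1 ≈ 1.2–2.0 and b = 0.75).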

Whilst this looks complex, implementing this with our word vectors is actually very simple and requires just a few lines of code (just like all of the other steps in this article!):

Converting our word vectors into a document vector weighted using BM25
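A minimal sketch of this step, assuming the tokenized_docs list and ft_model from the previous sections; the IDF formulation and the k1/b values are standard BM25 defaults, and the exact weighting used in [1] may differ slightly:

import math
from collections import Counter

import numpy as np

# Standard BM25 parameters
K1, B = 1.5, 0.75

N = len(tokenized_docs)
avgdl = sum(len(doc) for doc in tokenized_docs) / N

# Document frequency of each term, used for the IDF component.
doc_freq = Counter()
for doc in tokenized_docs:
    doc_freq.update(set(doc))

def idf(term):
    return math.log(1 + (N - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))

def bm25_doc_vector(tokens):
    """Weight each token's fastText vector by its BM25 weight within the
    document and sum the results into a single document vector."""
    tf = Counter(tokens)
    doc_len = len(tokens)
    vec = np.zeros(ft_model.vector_size, dtype=np.float32)
    for term, freq in tf.items():
        weight = idf(term) * (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * doc_len / avgdl))
        vec += weight * ft_model.wv[term]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# One 100-dimensional vector per document.
doc_vectors = np.vstack([bm25_doc_vector(doc) for doc in tokenized_docs])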

The output from this will give us a single vector per document in our search engine.

4. Create a super-fast search index with NMSLIB

We now have a single vector for each document in our data set. We can also use the techniques outlined above to create a vector for a search query from a user.

But how do we return relevant results based on this search query? We need to be able to find the closest vectors to our search vector. Given the high number of dimensions (100) in our vectors this is where things could start to fall down with our approach. Search engines need to be fast and searching over 100 dimensions across a data set of over 200k records is very resource intensive.

NMSLIB:

Thankfully this is a fairly common challenge within computer science and solutions exist to massively speed up similarity search problems. NMSLIB is one of the fastest solutions out there [2]. Using this library we can create a search index which will be orders of magnitude faster than finding similar vectors using a brute force approach:
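A minimal sketch of building the index, assuming the doc_vectors array from the previous step; the HNSW parameters are illustrative defaults:

import nmslib

# Initialise an HNSW index that uses cosine similarity over our document vectors.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(doc_vectors)
index.createIndex({"M": 30, "efConstruction": 100, "post": 2}, print_progress=True)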

Putting it all together; a smarter search engine:

Now that we have our search index, it is simply a matter of creating a search vector and returning the closest matches from the index:
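A minimal sketch of the query step, reusing the nlp tokenizer, the bm25_doc_vector helper and the index built above (the helper name and the query-weighting approach are assumptions of this sketch rather than the original code):

import time

def search(query, k=5):
    # Tokenize the query and turn it into a vector in exactly the same way
    # as the documents were vectorised.
    tokens = [t.text.lower() for t in nlp(query) if not t.is_punct and not t.is_space]
    query_vector = bm25_doc_vector(tokens)

    # Return the k nearest document vectors from the NMSLIB index.
    start = time.time()
    ids, distances = index.knnQuery(query_vector, k=k)
    print(f"Searched {len(df):,} records in {time.time() - start:.4f} seconds:")
    for i in ids:
        print(df["Text"].iloc[i])

search("flood defences")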

Using the same query of ‘flood defences’ as in our previous article we now get the following results (top 5):

Searched 212,447 records in 0.0005 seconds:
Q3172 PROPERTY FLOOD MITIGATION SCHEME WASH GREEN, WIRKSWORTH, DERBYSHIRE SUPPLY AND INSTALL CERTIFIED FLOOD PROTECTION PRODUCTS INCLUDING FLOOD DOORS, FLOOD BARRIERS, AIR BRICKS AND OTHER WORKS IDENTIFIED IN THE PROPERTY LEVEL FLOOD PROTECTION SURVEY REPORTS, AS WELL AS SKIMMER PUMPS AND HOSES. Matlock DE4 3AG
WHIMPLE FLOOD DEFENCE IMPROVEMENTS CONSULTANCY SERVICES FOR PREPARATION OF CONTRACT FOR CONSTRUCTION OF FLOOD ALLEVIATION SCHEME. Sidmouth EX10 8HL
FLOOD RISK ASSESSMENT FLOOD RISK ASSESSMENT Woolwich SE186HQ
PAPERMILL DYKE FLOOD DEFENCE WALL CONSTRUCTION OF FLOOD DEFENCE Doncaster DN1 3BU
MVDC - JS - LEVEL 2 STRATEGIC FLOOD RISK ASSESSMENT LEVEL 2 STRATEGIC FLOOD RISK ASSESSMENT TO SUPPORT PREPARATION OF THE FUTURE MOLE VALLEY LOCAL PLAN Surrey RH4 1SJ

Some great results. Also, the search is performed in 0.0005 seconds. This is 122 times faster than our previous search engine despite the dataset being more than 4 times the size.

It is also worth highlighting that a number of the results, whilst highly relevant, do not contain the word ‘defences’. The approach of using word vectors means that exact word matches are now no longer required to return relevant results.

Given that geographic information should also be encoded within the search index, let’s try searching for a contract that has been awarded in a specific area. To do this we will search using the NR2 postcode to find Notices in Norwich: ‘audit services NR2’. Here are the top 3 results:

Searched 212,447 records in 0.0004 seconds:
PROVISION OF EXTERNAL AUDIT SERVICES THE CONTRACT IS A SINGLE LOT FOR THE PROVISION OF EXTERNAL AUDIT SERVICES. Norwich NR4 6TJ
GB-NORWICH: EXTERNAL AUDIT ANNUAL AUDIT OF TRUST FINANCIAL & QUALITY ACCOUNTS AND ANNUAL REPORT. Norwich NR6 5BE
GB-NORWICH: 18-022 - INTERNAL AUDIT SERVICES BROADLAND HOUSING GROUP WISHES TO ENTER INTO A CONTRACT FOR INTERNAL AUDITING. Norwich NR1 1HU

It works! It returns results for both internal and external audit services in Norwich. Note that even though we searched with the NR2 postcode, the model knows that this is similar to the other Norwich postcodes of NR4, NR6 and NR1… pretty smart!

Finally, let’s feed it a typo and see if it can still handle this in an intelligent way. ‘audit services in Norwic’:

Searched 212,447 records in 0.0005 seconds:
PROVISION OF EXTERNAL AUDIT SERVICES THE CONTRACT IS A SINGLE LOT FOR THE PROVISION OF EXTERNAL AUDIT SERVICES. Norwich NR4 6TJ
GB-NORWICH: EXTERNAL AUDIT ANNUAL AUDIT OF TRUST FINANCIAL & QUALITY ACCOUNTS AND ANNUAL REPORT. Norwich NR6 5BE
GB-NORWICH: 18-022 - INTERNAL AUDIT SERVICES BROADLAND HOUSING GROUP WISHES TO ENTER INTO A CONTRACT FOR INTERNAL AUDITING. Norwich NR1 1HU

Same results again, despite the misspelling of the town name.

In conclusion:

In this article we have seen how combining word vectors with BM25, and supercharging this with a fast similarity search index, can create a smart, scalable and performant search engine.

Despite this, it is always important to consider whether this will be of benefit to the end user. For example, we may find that users prefer a simple keyword search as they can easily interpret the results. This also highlights some of the biggest risks of creating ‘smarter’ services; they can often:

  • be less predictable,
  • learn biases that exist within the search data, and
  • be far more difficult to debug due to the increased complexity.

For these reasons they require a significant amount of testing to ensure they behave as expected once in production.

As search continues to get more sophisticated, we will no doubt see even more complex solutions emerging over time. Whilst it is great to see such rapid development in this area, it is also important to remember that often the simplest solution to a problem is the best.

Notebook containing the data and code:

References:

[1] Word Embeddings in Search Engines, Quality Evaluation https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_Eneko_Pinzolas_2017.pdf

[2] Benchmark of similarity search libraries https://github.com/erikbern/ann-benchmarks

Link to the first part of this series:

As always, a big thanks to the TDS editorial team!