Original article was published by Chris Thornton on Artificial Intelligence on Medium
Fuzzy Name Matching with Machine Learning
Stacking Phonetic Algorithms, String Metrics and Character Embedding for Semantic Name Matching
It is often the case when working with external data that a common identifier such as a numerical key does not exist. In place of a unique identifier, a person’s full name can be used as part of a universal or composite key to link data; however, this is not a fail-safe solution.
Take the name Alan Turing, for example: disparate data sources could have recorded the familiar form Al Turing. Data entry staff may innocently record Alan, Allan or Allen, or worse, enter undetected typos (Alam Turing) into their databases. Enterprise document scanning solutions (OCR) are also rife with misreadings.
A human agent could intuitively assign these variations to the same entity, Alan Turing, by applying soft logic to approximate the spelling and phonetic (sound) characteristics. Shortened hypocorisms often lack these characteristics and instead rely on the agent’s learned associations, e.g. Charles → Chip.
What follows is a study of applying machine learning to achieve a semblance of human-like logic and semantics for alternative name identification.
I scraped multiple lists of common alternative spellings for first names, yielding around 17,500 pairings. The names are restricted to ASCII and include many cross-cultural examples transliterated from Unicode, to avoid over-fitting to Western name conventions.
The intuition behind using first names as the core data for the model is to apply ensemble methods to individual name components while requiring exact matches on surnames. This trades some recall for greater precision (fewer false positives).
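As a rough illustration of that split, the sketch below demands identical surnames and only then applies a fuzzy test to the first names. The scorer is a stand-in built on the standard library's `difflib`, not the model described in this article, and the function names and the 0.5 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def split_name(full_name):
    """Split 'First [Middle] Last' into (first, last), lowercased."""
    parts = full_name.lower().split()
    return parts[0], parts[-1]


def first_name_score(a, b):
    """Stand-in scorer (not the article's model): difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def same_person(name_a, name_b, threshold=0.5):
    """Require identical surnames, then fuzzy-match the first names only."""
    first_a, last_a = split_name(name_a)
    first_b, last_b = split_name(name_b)
    if last_a != last_b:
        # Exact surname matching: a surname typo is rejected outright.
        return False
    return first_name_score(first_a, first_b) >= threshold
```

With this rule, "Alan Turing" and "Allan Turing" are linked, while "Alan Turing" and "Alan Turning" are not, which is the precision-over-recall trade described above.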
I decided to make the classes imbalanced (1:4), as under-sampling the negative class led to a noticeable artificial bias towards the positive class. It is difficult to approximate the a priori probabilities for each class, but it is assumed that the classes are imbalanced in favor of the negative class.
There are many string metrics and phonetic algorithms to use as features; the base-level model uses 20+ features, including:
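One way to build such a set is to pair each name with a few names it is not listed against. The sketch below is an assumption about how this could be done, not the procedure used for this project: `build_pairs` and its parameters are hypothetical, and random sampling can occasionally label a true but unlisted alternate as a negative.

```python
import random


def build_pairs(positive_pairs, negatives_per_positive=4, seed=0):
    """Return (name_a, name_b, label) rows with a 1:N positive-to-negative ratio."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positive_pairs}
    names = sorted({name for pair in positive_pairs for name in pair})
    rows = [(a, b, 1) for a, b in positive_pairs]
    for a, _ in positive_pairs:
        # Candidates are names that are not already a listed alternate of `a`.
        candidates = [n for n in names
                      if n != a and frozenset((a, n)) not in known]
        # The ratio is only exact when enough candidates exist.
        count = min(negatives_per_positive, len(candidates))
        rows.extend((a, b, 0) for b in rng.sample(candidates, count))
    return rows
```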
- Levenshtein distance
- Bigram similarity
- Jaro distance
- Editex distance
- Soundex coding
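To make two of these features concrete, here is a minimal pure-Python sketch of Levenshtein distance and a bigram similarity. Bigram similarity has several definitions; this one assumes the Dice coefficient over sets of character bigrams, which may differ from the variant used in the model. The `pair_features` helper and its key names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(name):
    """Set of adjacent character pairs, e.g. 'alan' -> {'al', 'la', 'an'}."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient over character-bigram sets, in [0, 1]."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Both names are too short to have bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def pair_features(a, b):
    """Feature dictionary for one name pair (hypothetical layout)."""
    a, b = a.lower(), b.lower()
    distance = levenshtein(a, b)
    return {
        "levenshtein": distance,
        "levenshtein_norm": 1 - distance / max(len(a), len(b), 1),
        "bigram": bigram_similarity(a, b),
    }
```

Both metrics treat Alan/Allan as close, but neither can relate Charles to Chip, which is why metrics like these serve as inputs to a learned model rather than as the final decision.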
Deep LSTM siamese networks have been shown to be effective at learning text similarities. I used TensorFlow to train these networks on name pairs and used their out-of-fold predictions as a feature of the meta-model.
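Before a character-level network can embed a name, the name has to become a fixed-length sequence of integer ids. The sketch below shows one plausible encoding; the alphabet, the padding id of 0, the unknown-character id and the maximum length of 12 are all assumptions, not the settings used for these networks.

```python
import string

# Assumed alphabet: lowercase ASCII letters plus a few name punctuation marks.
ALPHABET = string.ascii_lowercase + " -'"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is padding
UNKNOWN = len(ALPHABET) + 1  # id for any character outside the alphabet


def encode_name(name, max_len=12):
    """Map a name to `max_len` integer ids, truncating or right-padding with 0."""
    ids = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in name.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def encode_pair(name_a, name_b, max_len=12):
    """Encode both names the same way, one sequence per siamese branch."""
    return encode_name(name_a, max_len), encode_name(name_b, max_len)
```

In a siamese setup both sequences would pass through the same embedding and LSTM weights, so that the distance between the two outputs can be trained to reflect name similarity.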
Names can be transformed to help our model learn new patterns from the same data. Transformations include:
- Splitting names into syllables to acquire meaningful multi-token string metrics (e.g. token-sort and token-set from the fuzzywuzzy package)
- Removing high-frequency name endings
- Removing vowels
- Converting to IPA (International Phonetic Alphabet)
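Two of these transformations are simple enough to sketch in a few lines. The list of endings below is hypothetical, since the endings actually removed are not listed here, and keeping the first character when dropping vowels is likewise an assumption made so that short names retain some signal.

```python
VOWELS = set("aeiou")
# Hypothetical endings, longest first; the real list is not given in this article.
COMMON_ENDINGS = ("ie", "ey", "y", "a", "o")


def remove_vowels(name):
    """Drop vowels after the first character, e.g. 'Alan' -> 'aln'."""
    name = name.lower()
    return name[:1] + "".join(ch for ch in name[1:] if ch not in VOWELS)


def strip_common_ending(name, min_stem=2):
    """Remove one frequent ending if at least `min_stem` characters remain."""
    name = name.lower()
    for ending in COMMON_ENDINGS:
        if name.endswith(ending) and len(name) - len(ending) >= min_stem:
            return name[:-len(ending)]
    return name
```

Each transformed pair can then be fed through the same string metrics as the raw pair, giving the model several views of the same two names.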
I used the AutoML package TPOT to aid in selecting an optimized pipeline and hyperparameters for a base-level model with F1 as the scoring metric.
The base model and character embedding networks were stacked via stratified 10-fold cross-validation to train a logistic regression meta-model. Some features from the base model were included to provide additional context and dimensionality for the meta-model. Grid-search was used to select the optimal parameters and features, affording priority to precision.
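The key property of stacking is that every row's base-level prediction comes from a model that never saw that row. The pure-Python sketch below shows the mechanics under stated assumptions: `fit` is a hypothetical callable that trains on the given rows and returns a predict function, and `fit_threshold` is a toy base learner used only to make the example runnable. In practice a library routine for stratified k-fold splitting would replace the hand-rolled fold assignment.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign each row a fold id so every fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in set(labels):
        rows = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(rows)
        for position, row in enumerate(rows):
            fold_of[row] = position % k  # deal rows of each class round-robin
    return fold_of


def out_of_fold_predictions(features, labels, fit, k=10, seed=0):
    """Predict every row with a model trained on the other k-1 folds."""
    fold_of = stratified_folds(labels, k, seed)
    predictions = [None] * len(labels)
    for fold in range(k):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions


def fit_threshold(xs, ys):
    """Toy base learner: cut halfway between the two class means."""
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: 1.0 if x >= cut else 0.0
```

The resulting prediction list has one entry per training row and can be appended as a column alongside the other features when fitting the meta-model.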
Evaluation metrics for the international alternative first name test-set:
This model was specifically trained to handle alternative names, but transfers well to correctly classify all the aforementioned variants including typographic errors.
The methods used and the resulting model are henceforth dubbed HMNI (Hello my name is). I’ve open-sourced this project (in alpha status) as a Python package under the same moniker.
How to use HMNI in your project:
Install using pip from PyPI
pip install hmni
Quick Usage Guide — Pair Similarity, Record Linkage, Deduplication & Normalization
More to come…
I will keep this post updated with future releases of HMNI, including best-performing models, language-specific configurations and data processing optimizations.
All code is released under the MIT Licence. Copyright 2020, Christopher Thornton