Creating Word Embeddings for Out-Of-Vocabulary (OOV) words such as Singlish


Methodology

In this section, I will first explain where I got the data from. Then I will describe the model used to create the Singlish word embeddings. After that, I will go through my initial findings, the problems I noticed in them and how I rectified them, before finally presenting my finalised results.

Data Collection — Hardware Zone Forum

The data was scraped from our all-time favourite Singapore forum, “Hardware Zone” ( https://forums.hardwarezone.com.sg/). It started out as an IT-oriented forum but, as with all forums, it has since become a place many Singaporeans go to talk about anything and everything under the sun.

The best part?

They mostly use Singlish on the forum. Perfect for any deep learning task. Here is an example of what a thread comment looks like.

['got high ses lifestyle also no use treat chw like a dog if me i sure vote her she low ses man take people breakfast ask people buy ckt already dw pay up tsk tsk tsk\n',
' children big liao no need scare tio pok\n',
'she low ses man take people breakfast ask people buy ckt already dw pay up tsk tsk tsk i dont get her thinking actually if she is rich she could even fly over to kl to eat buffet and then fly back in 1 hour dunno why want to save over this kind of small thing to ppl like her like food\n']

It may seem like just bad, broken English, but Singaporeans reading this understand all the implicit meanings of the slang and English used here.

Data collection-wise, 46 main threads were scraped, for a total of 7,471,930 entries. Below is the breakdown of the number of entries by thread.

Eat-Drink-Man-Woman 1640025
Travel and Accommodation 688253
Mobile Communication Technology 582447
Mass Order Corner 574927
MovieMania 451531
Gaming Arena 426976
Campus Zone 322146
General Merchandise Bazaar 286604
Money Mind 264523
Internet Bandwidth & Networking Clinic 243543
Music SiG 227198
Hobby Lovers 224994
Headphones, Earphones and Portable Media Devices 206141
Notebook Clinic 184116
HomeSeekers and HomeMakers 153683
Apple Clinic 139810
Hardware Clinic 133421
Cars & Cars 115286
The Tablet Den 108793
Fashion & Grooming 98211
Electronics Bazaar 75192
Degree Programs and Courses 49734
Football and Sports Arena 47769
Software Clinic 43553
Health & Fitness Corner 21665
The "Makan" Zone 21358
National Service Knowledge-Base 16545
Current Affairs Lounge 16259
Home Theatre & Audiophiles 14888
The House of Displays 13987
Employment Office 12146
Other Academic Concerns 9927
Tech Show Central (IT Show 2018) 9895
Parenting, Kids & Early Learning 9756
Pets Inner Circle 6525
Wearable Gadgets and IoT 5735
The Book Nook 5238
Digital Cameras & Photography 4714
Ratings Board 3858
IT Garage Sales 3549
Diploma Programs and Courses 2348
Certified Systems, IT Security and Network Training 1798
Post-Degree Programs & Courses 1724
Online Services & Basic Membership Support/Feedback 876
Design & Visual Art Gallery SiG 246
HardwareZone.com Reviews Lab (online publication) 18

Having manually reviewed the data quality of each thread, I found that some threads like “Travel and Accommodation” were used mainly for sales and thus rejected them from the analysis.

I ended up only using data from “MovieMania”, which did not include any form of advertisements or sales within the thread. I also deemed 451,531 entries a large enough set to train decent word embeddings.

The average length of each entry in this thread was 27.08 words, giving me a corpus of 12,675,524 words. The number of unique vocabulary words came to 201,766. I kept the top 50,000 words by frequency and labelled the remainder with an “unknown” token for training.

There is no point keeping infrequent words for training: if a word only appears a couple of times, there is too little signal for any learning to take place. Hence, I replaced all infrequent words with an “unknown” token.

I also cleaned the data by removing all forms of punctuation and standardising the case by lower-casing everything.
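
The post does not include the preprocessing code, but a minimal sketch of the cleaning and vocabulary-truncation steps described above might look like this (the file name and the "<unk>" token string are placeholders of my own):

import re
from collections import Counter

def clean(line):
    # lower-case and strip everything except letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", " ", line.lower()).split()

with open("moviemania_entries.txt") as f:       # hypothetical dump of the scraped entries
    sentences = [clean(line) for line in f if line.strip()]

# keep the 50,000 most frequent words and map everything else to an "unknown" token
counts = Counter(w for s in sentences for w in s)
vocab = {w for w, _ in counts.most_common(50000)}
sentences = [[w if w in vocab else "<unk>" for w in s] for s in sentences]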

Technical Summary:

No. of words: 12,675,524
No. of sentences: 427,735
Avg. No. words per sentence: 27.05
No. of unique vocabulary words used: 50,000

Now that the data is out of the way, the next segment will talk about the actual model used to train word embeddings.

Introducing the Skip-Gram!

The skip-gram model is an unsupervised machine learning technique that learns word representations by training a model to predict the words that appear around a given word.

Take this phrase for example,

Figure 4 — Skip-gram example

The skip-gram model tries to predict the context words given an input word. In this case, given the word “Fox”, predict “quick”, “brown”, “jumps” and “over”.

Now imagine the model scanning through my entire training corpus from Hardware Zone. For each of the 427,735 sentences, it loops through each word, extracts its context words and uses the input word to predict those context words.
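
To make this concrete, here is a small illustrative sketch (not the actual training code) of how (input word, context word) training pairs could be extracted from one sentence with a context size of 2:

def skipgram_pairs(tokens, window=2):
    # pair each word with every word up to `window` positions to its left and right
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox jumps over the lazy dog".split())
# contains ('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over'), ...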

An error function is also calculated per sentence. The goal of the model is to minimise this error function through multiple iterations by slowly adapting the weights within the model.

I iterate this “learning” until the error function starts to stabilise, before extracting the weights within the model as my word embeddings.

What I attempted to explain above is essentially how neural networks work. Architecturally, it looks something like this:

Figure 5 — Skip-Gram Architecture.
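
In code, the architecture boils down to two weight matrices: an input (embedding) matrix whose rows become the word embeddings, and an output matrix used to score context words. A bare-bones numpy sketch of the forward pass, with sizes taken from the specifications below and everything else purely illustrative:

import numpy as np

vocab_size, dim = 50000, 100
W_in = np.random.randn(vocab_size, dim) * 0.01    # one row per word; these rows are the embeddings we keep
W_out = np.random.randn(vocab_size, dim) * 0.01   # only used to score context words during training

def forward(input_word_id):
    h = W_in[input_word_id]                       # the "hidden layer" is just an embedding lookup
    scores = W_out @ h                            # one score per word in the vocabulary
    return np.exp(scores) / np.exp(scores).sum()  # softmax: probability of each word being a context word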

Following so far?

Great! Before I get to the results, let’s review some technical specifications in the model building process.

Technical Specifications:
Context size = 3
Learning Rate = 0.025
Learning Rate Decay = 0.001
No. of Epochs = 25
No. of word dimensions = 100
No. of Negative Samples = 3
Total Training Time = 17 hrs 05 mins (~= 41 mins per Epoch)

For those wondering what this technical jargon refers to: context size is the number of words to predict on each side of the input word. In the example above, “quick”, “brown”, “jumps” and “over” corresponds to a context size of 2, i.e. two words before the input word “Fox” and two words after.

Learning rate and decay refer to how much the model adjusts its weights over the iterations, i.e. how much it learns per iteration.

Epoch refers to how many passes over the entire training set the model makes. For example, I had 427,735 sentences in my training set and I looped through them 25 times.

Word dimensions refers to the number of numerical values used to represent a word. In my example in the introduction, I used 3 dimensions to keep it simple and easy to understand. In practice, we can go up to 300 dimensions.

Negative sample size is a parameter that comes from negative sampling, a technique used to train machine learning models where there are far more negative observations than positive ones. Recall that I kept 50,000 vocabulary words.

If I were to use “Fox” to predict “Quick”, there would be only 1 correct answer versus 49,999 wrong answers. The probability of getting the prediction of “Quick” correct would be insanely low.

Hence, to speed up the learning process, negative sampling is used to reduce the number of negative labels considered. i.e. Instead of looking at all 49,999 wrong answers, I only look at 3 random wrong answers. Hence, No. of Negative Samples = 3.
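
The post does not say which library was used for training. For what it's worth, the specifications above map roughly onto gensim's Word2Vec as follows (a sketch, not the actual training script; mapping the decay onto min_alpha is my assumption):

from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,   # the cleaned, tokenised corpus from earlier
    sg=1,                  # skip-gram rather than CBOW
    vector_size=100,       # no. of word dimensions
    window=3,              # context size
    negative=3,            # no. of negative samples
    alpha=0.025,           # initial learning rate
    min_alpha=0.001,       # learning rate decays towards this value
    epochs=25,
)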

Still with me yeah? Fantastic! Let’s carry on!

Initial Results

Here are the results of the first attempt at generating word embeddings for Singlish.

What you see below are similarity scores (0 to 1) between word pairs, along with the 10 words that tend to be closest to the word of interest. The higher the number, the more similar the words are in terms of context.
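
The similarity score here is (presumably) the cosine similarity between two embedding vectors, and the “closest 10” is simply the ten highest-scoring words against the word of interest, which is why the word itself shows up at the top with a score of roughly 1. A small numpy sketch, assuming a dict emb mapping each word to its 100-dimensional vector:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest(word, emb, topn=10):
    # rank every word in the vocabulary by cosine similarity to `word`
    sims = {w: cosine(emb[word], v) for w, v in emb.items()}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]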

As an initial test, before even looking into Singlish words, I decided to look at the scores for pairs like “her” vs “she” and “his” vs “he”. If these scores were not high, I would have deemed that the model had not trained well enough.

Thankfully, the scores are pretty high and decent.

Similarity between 'her' & 'she': 0.9317842308059494
Closest 10:
her 0.9999999999999994
she 0.9317842308059496
who 0.8088322989667506
face 0.7685887293574792
and 0.731550465085091
hair 0.7196624736651458
shes 0.7191209881379563
when 0.7119862209278394
his 0.7107795929496181
that 0.7091856776526962
********************************************************
Similarity between 'his' & 'he': 0.897577672968669
Closest 10:
his 1.0
he 0.8975776729686689
him 0.8446763202218628
who 0.775987111217783
was 0.7667867138663951
that 0.7528368024154157
father 0.749632881268601
son 0.7281268393201477
become 0.7264880215455141
wife 0.711578758349141
********************************************************
Similarity between 'jialat' & 'unlucky': 0.011948430628978856
Closest 10:
jialat 1.0
sia 0.8384455155248727
riao 0.8266230148176981
liao 0.8242816925791344
sibei 0.814415592977946
hahaha 0.8064565592682809
ya 0.8045512611232027
meh 0.7954521439129846
lol 0.7936809689607456
leh 0.7920613014175707
********************************************************
Similarity between 'jialat' & 'bad': 0.6371130561508843
Closest 10:
bad 1.0
quite 0.8823291887959687
good 0.8762035559199239
really 0.8758630577100476
very 0.8731856141554037
like 0.8728014312651295
too 0.8656864898051815
damn 0.8599010325212141
so 0.8486273610657793
actually 0.8392110977957886
********************************************************
Similarity between 'bodoh' & 'stupid': 0.1524869239423864
Closest 10:
bodoh 0.9999999999999998
628 0.4021945681425326
u4e3au56fdu5148u75af 0.3993291424102916
beck 0.39461861903538475
recieve 0.39110839516564666
otto 0.3839416132228821
gaki 0.34783948936473097
fapppppp 0.3418846453140858
bentley 0.3344963328126833
hagoromo 0.3331640207541007
********************************************************
Similarity between 'bah' & 'ba': 0.5447425470420932
Closest 10:
bah 0.9999999999999998
lei 0.7051290703273838
nowadays 0.698360482336586
dun 0.6968374466521237
alot 0.6767383433113785
type 0.6745085658120278
cos 0.6711909808612231
wat 0.6682283480973521
ppl 0.6675756452507112
lah 0.6671682261049516
********************************************************
Similarity between 'lah' & 'la': 0.8876189066755961
Closest 10:
lah 1.0
meh 0.8877331822636787
la 0.8876189066755962
dun 0.8865821519839381
mah 0.8793885175949425
leh 0.8723455556110296
cannot 0.8686775338961492
ya 0.8661596706378043
u 0.8549447964449902
wat 0.8542029625856831
********************************************************
Similarity between 'lah' & 'leh': 0.8723455556110294
********************************************************
Similarity between 'lah' & 'ba': 0.6857482674200363
********************************************************
Similarity between 'lah' & 'lor': 0.8447135421839688
********************************************************
Similarity between 'lah' & 'hor': 0.722923046216034
********************************************************
Similarity between 'lor' & 'hor': 0.6876132025458188
Closest 10:
lor 1.0
u 0.8925715547690672
cannot 0.865412324252327
dun 0.8509787619825337
leh 0.8508639376357423
lah 0.8447135421839689
ya 0.8438741042009468
la 0.8403252240817168
meh 0.8356571743730847
mah 0.8314177487183335
********************************************************
Similarity between 'walau' & 'walao': 0.4186234210208167
Closest 10:
walau 1.0
nv 0.6617041802872807
pple 0.6285030787914123
nb 0.6248358788358526
la 0.6207062961324734
knn 0.6206045544509986
lah 0.6158839994483083
lo 0.6102554356797499
jialat 0.6079250154571741
sibei 0.6076622192051193
********************************************************
Similarity between 'makan' & 'eat': 0.6577461668802116
Closest 10:
makan 1.0
jiak 0.7007467882204779
go 0.6911439090088933
pple 0.65857421561786
eat 0.6577461668802115
food 0.6575154915623017
kr 0.6545185140294344
sg 0.6473315303433985
heng 0.6422265572697313
beo 0.6354614594882941
********************************************************
Similarity between 'makan' & 'food': 0.6575154915623018
********************************************************
Similarity between 'tw' & 'sg': 0.7339666801539345
Closest 10:
tw 0.9999999999999997
tiong 0.7723149794376185
sg 0.7339666801539344
lidat 0.7330705496009475
hk 0.7258329008490501
taiwan 0.7195021043855226
tiongland 0.7171170137971364
pple 0.7130953678674011
yr 0.7017495747955986
mediacorpse 0.6954931933921777
********************************************************
Similarity between 'kr' & 'sg': 0.7889688950703608
Closest 10:
sg 1.0
go 0.7940974860875252
kr 0.7889688950703608
tiongland 0.7675846859894958
ongware 0.7674045119824121
yr 0.7584581119582794
pple 0.7536492976456339
time 0.7533848231714694
buy 0.751509500730294
tix 0.743339654154326
********************************************************

There are some interesting findings from the above results that I’d like to point out.

Overall, I would say the model has learned and performed decently well on understanding Singlish words.

Not the best, but decent.

It’s no surprise that the model has learned that “Makan” (Malay for “Eat”), “Food” and “Eat” are similar. What’s interesting is that the word “Jiak”, dialect for “Eat” and commonly used in Singlish sentences, was also picked up and deemed similar to “Food”, “Makan” and “Eat”.

This suggests to me that the model has indeed learned some Singlish words well.

The next interesting result comes from the abbreviations of country names. As humans reading the result, we know “sg” refers to “Singapore”, “kr” to “Korea”, “tw” to “Taiwan” and “hk” to “Hong Kong”. To a machine, however, it is not that simple; the model had no prior knowledge of these. Yet it has managed to group these terms together.

Very interesting indeed!

This is another validation point that tells me the model has trained decently well on Singlish words.

Now, instead of just staring at numbers, there is a method that helps us visualise words in the vector space. It is called T-distributed Stochastic Neighbor Embedding (TSNE).

TSNE can be loosely seen as a dimensionality reduction technique for visualising high-dimensional data, e.g. word embeddings of 100 dimensions. The idea is to embed high-dimensional points in a low-dimensional space in a way that respects the similarities between points.

In English: convert 100 dimensions into 2 or 3 dimensions while retaining as much of the embedded information as possible.
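
A quick sketch of how such a plot can be produced with scikit-learn and matplotlib, assuming the same emb dict of word vectors as before, restricted to whichever words you want to plot:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(emb)
vectors = np.array([emb[w] for w in words])
# perplexity must be smaller than the number of words being plotted
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y), fontsize=8)
plt.show()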

If I were to run TSNE (n=2 dimensions) on some of my results above and display it on a scatter plot, it will look like this:

Figure 6 — TSNE results displayed on a scatter plot

Not too shabby a result.

While the results above looked promising, there were some issues that needed addressing.

I only realised this when I was reviewing the results. The following section will explain further.

Problems with the Results

First off, as you can see, Singlish words like “lah”, “lor”, “meh” seem to be similar to each other but…

So what? What does it actually mean?!

After thinking about it, it is not exactly useful.

Singlish phrases usually appear as bigrams (2 words) that really encapsulate the meaning of the context in which they are used. A good example is Figure 3, the power of “Can”: the bigrams “can meh?” and “can lah!” have very different meanings.

If we were to look at the words independently, “can” and “meh”, that meaning is lost.

Therefore, having a vector representation of the words “meh” or “lah” alone is not entirely useful here.

To remedy this issue, I needed a way to keep these words together. Skip-Gram trains on unigrams, yes. But what if I could convert these bigrams to unigrams first? i.e. “Can meh” -> “Can_meh” (note the underscore).
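
The conversion itself is mechanical. Given a set of accepted bigrams, it is just one pass over each sentence, gluing those pairs together with an underscore (a toy sketch, assuming the set of bigrams to keep has already been decided):

def merge_bigrams(tokens, phrases):
    # phrases is a set of (word1, word2) pairs to be glued together, e.g. {("can", "meh")}
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_bigrams("wah this one can meh".split(), {("can", "meh")}))
# ['wah', 'this', 'one', 'can_meh']

Deciding which bigrams make the cut is what the PMI step further down is for.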

The second problem is not as serious as the first, but worth some air time.

It is the inherent problem with dimensionality reduction techniques like TSNE. You see, information loss when going from 100 dimensions down to 2 is inevitable. As much as I love visuals, I realised that I could only get a true understanding of the results by looking at the numbers and not the visual.

Nothing to solve here.

Just food for thought.

Now let’s take a look at my attempt to solve the first problem.

Rectification of the problem: Converting bigrams to unigrams by introducing Pointwise Mutual Information (PMI)

Figure 7 — Pointwise Mutual Information (PMI) formula

After realising the issue with Singlish, I needed a way to convert my bigrams to unigrams while keeping the information or meaning of those words.

The way I could do that was by calculating the PMI of all possible bigrams in the corpus.

For example, the formula in Figure 7 reads: take the log of the probability of the pair of words occurring together (e.g. “can meh”), divided by the product of the probabilities of each individual word occurring.

The breakdown of my process is as such:

If x = “can” and y = “meh”,

  1. Count the number of times “can meh” appears together.
  2. Count the number of times “can” appears alone.
  3. Count the number of times “meh” appears alone.
  4. Apply PMI and get a score.
  5. Set some threshold parameter. If score is above threshold convert all occurrences of “can meh” into “can_meh” i.e. bigram to unigram.
  6. Retrain the entire skip-gram model.

Steps 1 to 5 are actually known as building a phrase model.
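
Here is a minimal sketch of steps 1 to 5, reusing the tokenised sentences from the earlier preprocessing sketch (plain PMI as described above, plus the normalised variant mentioned below, which only changes the scoring function):

import math
from collections import Counter

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
n_words = sum(unigrams.values())
n_pairs = sum(bigrams.values())

def pmi(x, y):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    p_xy = bigrams[(x, y)] / n_pairs
    return math.log(p_xy / ((unigrams[x] / n_words) * (unigrams[y] / n_words)))

def npmi(x, y):
    # normalised PMI: divide by -log p(x, y) to get a score between -1 and 1
    return pmi(x, y) / -math.log(bigrams[(x, y)] / n_pairs)

# step 5: keep only the bigrams whose score clears the threshold
phrases = {
    (x, y) for (x, y) in bigrams
    if unigrams[x] >= 5 and unigrams[y] >= 5   # minimum word count of 5 (see the specs below)
    and npmi(x, y) > 0.5                       # threshold of 0.5
}

The resulting set of phrases is then fed through the merge_bigrams helper sketched earlier, and the skip-gram model is retrained on the merged corpus (step 6).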

And so… the quest to retrain my 17 hour model ensued…

I calculated the PMI scores for all possible bigrams and set a threshold variable. I actually used the normalised version of the PMI formula above for calculation, but the concept is the same.

Technical Specifications of Normalised PMI phrase model:

  1. Minimum count of word = 5
  2. Threshold = 0.5 (range from -1 to 1)
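
For what it's worth, these two parameters line up with gensim's Phrases model using NPMI scoring, which is one way this step could be implemented (an assumption on my part; the post does not name a library):

from gensim.models.phrases import Phrases

phrase_model = Phrases(sentences, min_count=5, threshold=0.5, scoring="npmi")
phrased_sentences = [phrase_model[s] for s in sentences]

Here are a couple of example sentences after the phrase model was applied: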
['as we are left with just 2 days of 2017 hereu2019s a song for everyone here edit to weewee especially lol',
'good_morning click here to download the hardwarezone_forums app']
['wah_piang . why every time lidat one . siao liao can meh . wah_lau . can lah . can la . of course medicated_oil . can anot']

As you can see, the threshold affects how many bigrams get converted to unigrams. I did not manage to get the bigram “can meh” converted to a unigram, because if I were to set the threshold any lower, many other unrelated word pairs would start to be converted into unigrams.

Sadly, there is a trade-off to make here.

Final Results of Singlish word embeddings

After accounting for the bigrams, I retrained the model (which now took 18 hours to train) and obtained the results below.

Notice how many unrelated bigrams became unigrams, e.g. “for_her”, “with_her”? This is what I meant by the trade-off in the phrase modelling step.

Because of all these unrelated bigrams, a number of the scores have fallen.

Similarity between 'her' & 'she': 0.8960039000509769
Closest 10:
her 1.0
she 0.8960039000509765
when_she 0.7756011613106286
she_is 0.7612506774261273
with_her 0.7449142184510621
who 0.7348657494449988
for_her 0.7306419631887822
shes 0.7279985577059225
face 0.7192153872317455
look_like 0.718696491400789
********************************************************
Similarity between 'his' & 'he': 0.8129410509281912
Closest 10:
his 1.0
he 0.8129410509281914
him 0.804565895231623
he_was 0.7816885610401878
when_he 0.7747444761501758
that_he 0.7724774086496818
in_the 0.7622871291423432
himself 0.7611962890490288
was 0.7492507482663726
with_his 0.7241458127853628
********************************************************
Similarity between 'jialat' & 'unlucky': 0.1258840952579276
Closest 10:
jialat 0.9999999999999996
den 0.7104029620075444
liao 0.7050826631244886
chiu 0.6967369805196841
heng 0.686838291277863
hahaha 0.6860075650732084
riao 0.6810071447192776
la 0.6804775540889827
le 0.676416467456822
can_go 0.6756005150502169
********************************************************
Similarity between 'jialat' & 'bad': 0.4089547762801133
Closest 10:
bad 1.0
but 0.8388423345999712
good 0.8244038298175955
really 0.8192112848219635
i_think 0.7970842555698856
very 0.7965053100959192
like 0.785966214367795
feel 0.783452344516318
too 0.7788218726013071
i_feel 0.7721110713307375
********************************************************
Similarity between 'bah' & 'ba': 0.5793487566624044
Closest 10:
bah 1.0
say 0.6595798529633369
lah 0.6451347547752344
dun 0.6449884617104611
sure 0.627629971037843
bo_bian 0.6251418244527653
this_kind 0.6223631439973716
ppl 0.6196652346594724
coz 0.61880214034487
mah 0.6146197262236697
********************************************************
Similarity between 'lah' & 'la': 0.8863959901557994
Closest 10:
lah 0.9999999999999998
la 0.8863959901557996
meh 0.8739974021915318
mah 0.8641084304399245
say 0.8589606487232055
lor 0.8535623035418399
u 0.8304372234546418
leh 0.8275930224011575
loh 0.8189639505064721
like_that 0.8170752533330873
********************************************************
Similarity between 'lah' & 'leh': 0.8275930224011575
********************************************************
Similarity between 'lah' & 'ba': 0.7298744482101323
********************************************************
Similarity between 'lah' & 'lor': 0.8535623035418399
********************************************************
Similarity between 'lah' & 'hor': 0.7787062026673155
********************************************************
Similarity between 'lor' & 'hor': 0.7283051932769404
Closest 10:
lor 1.0
u 0.8706158009584564
lah 0.8535623035418399
mah 0.8300722350347082
meh 0.8271088070439694
or_not 0.8212467976061046
can 0.8202962002027998
la 0.8136970310629428
say 0.8119961813856317
no_need 0.8113510524754526
********************************************************
Similarity between 'walau' & 'walao': 0.37910878579551655
Closest 10:
walau 0.9999999999999998
liao_lor 0.595008714889509
last_time 0.5514972888957848
riao 0.5508554544237825
no_wonder 0.5449206686992254
bttorn_wrote 0.5434895860379704
hayley 0.5418542538935931
y 0.5415132654837992
meh 0.5397241464489063
dunno 0.5377254112246059
********************************************************
Similarity between 'makan' & 'eat': 0.6600454451976521
Closest 10:
makan 0.9999999999999998
go 0.7208286785168599
den 0.697169918153346
pple 0.6897361126052199
got_pple 0.6756029930429751
must 0.6677701662661563
somemore 0.6675667631139202
eat 0.6600454451976518
lo 0.6524129092703767
there 0.649481411415345
********************************************************
Similarity between 'makan' & 'food': 0.4861726118637594
********************************************************
Similarity between 'tw' & 'sg': 0.6679227949310249
Closest 10:
tw 1.0
tiong 0.7219240075555984
taiwan 0.7034844664255471
hk 0.7028979577505058
tiongland 0.6828634228902605
mediacock 0.6751283466015426
last_time 0.6732280724879228
sg 0.6679227949310248
this_yr 0.6384487035493662
every_year 0.6284237559586139
********************************************************
Similarity between 'kr' & 'sg': 0.7700608852122885
Closest 10:
sg 1.0
in_sg 0.8465612812762291
kr 0.7700608852122885
pple 0.7599772116504516
laio 0.7520684090232901
go 0.7509260306136896
sure 0.7156635106866068
come 0.7116443785982034
tiongland 0.7015439847091592
ongware_wrote 0.6974645318958415
********************************************************
Closest 10:
wah_piang 1.0
tsm 0.4967581460786865
scourge_wrote 0.4923443232403448
aikiboy_wrote 0.4887014894882954
sian 0.4871567208941815
ah_ma 0.48058368153798403
lms 0.4790804214522433
ruien 0.47420796750340777
xln 0.46973552365710514
myolie 0.4682823729806439
********************************************************

In short, the results above were worse than the previous results.

I think this issue came from a lack of data. If I had more examples of people using “can meh” or “can lah”, the phrase model would have picked up these bigrams without me having to set a low threshold. A threshold of 0.5 is really low, in my opinion.

I caught too many false positives.

Rubbish in, rubbish out right? This was exactly what happened here.

Future Works

As you can imagine, I still have some way to go in tuning my model to get decent Singlish word embeddings for my future works.

What I would do differently is this:

  1. Recode my scripts to run distributed, or better yet, to run on a GPU. This would allow me to train on a much larger corpus, i.e. the whole of Hardware Zone and not just one thread of it.
  2. Train a better phrase model, i.e. improve the PMI portion.

For now, I will keep the word embeddings for the first model.
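
If you want to hang on to the embeddings in the same way, and assuming gensim was used (my assumption throughout these sketches), the word vectors can be saved and reloaded independently of the full model:

from gensim.models import KeyedVectors

model.wv.save("singlish_embeddings.kv")              # keep only the word vectors from the first model
wv = KeyedVectors.load("singlish_embeddings.kv")
print(wv.most_similar("makan", topn=10))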