Why word embeddings work, Part II

In the last article, we tried to provide an intuitive understanding of how gradients help to form good word embeddings. In this one, we focus directly on implementing the idea in TensorFlow in a few lines of code.

I am using the text8 dataset for this demo implementation. How does multi-label classification come into our context? Recall a portion of the training pairs we generated:

('magazine', 'weekly')
Vocab index 2 3
('magazine', 'read')
Vocab index 2 1

The input (magazine) has more than one output class: it wants to predict [weekly, read], and there will be more as your corpus grows. This allows us to provide multiple labels for a single training example, which is nothing but creating a vector of zeros and replacing the entries at the label indexes with 1.
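
As a minimal sketch of that idea, the snippet below builds the label vector for 'magazine' from the two pairs shown above. The helper name `multi_hot` and the vocabulary size of 5 are assumptions made only for illustration.

```python
def multi_hot(label_indexes, vocab_size):
    """Return a vector of zeros with 1.0 at every label index."""
    vector = [0.0] * vocab_size
    for index in label_indexes:
        vector[index] = 1.0
    return vector


# Hypothetical 5-word vocabulary. 'magazine' (index 2) was paired with
# 'weekly' (index 3) and 'read' (index 1) in the training pairs.
magazine_labels = multi_hot([3, 1], vocab_size=5)
print(magazine_labels)  # [0.0, 1.0, 0.0, 1.0, 0.0]
```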

The first part is to preprocess the data and generate training pairs.

The preprocessing code is straightforward. We load the text8 data using word2vec, iterate over it, and create a vocab with min_count=5. Then we create our label_matrix (this is the multi-label matrix), where each row corresponds to a word in the vocab and each column corresponds to a possible neighbor word; wherever a neighbor is present we put 1.0, else 0.0.
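
The steps just described can be sketched in plain Python as follows. This is an illustrative toy version, not the original script: the function names, the `window` parameter, the tiny corpus, and the choice to drop rare words before windowing are all assumptions.

```python
from collections import Counter


def build_vocab(tokens, min_count):
    """Map every word seen at least min_count times to an index."""
    counts = Counter(tokens)
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}


def build_label_matrix(tokens, vocab, window):
    """Row = input word, column = neighbor word, 1.0 where they co-occur."""
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    # Assumption: words below min_count are dropped before windowing.
    ids = [vocab[token] for token in tokens if token in vocab]
    for pos, center in enumerate(ids):
        start = max(0, pos - window)
        stop = min(len(ids), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:
                matrix[center][ids[ctx_pos]] = 1.0
    return matrix


# Toy corpus; 'the' occurs once and is filtered out by min_count=2.
tokens = "the magazine weekly magazine read weekly paper read paper".split()
vocab = build_vocab(tokens, min_count=2)
label_matrix = build_label_matrix(tokens, vocab, window=1)

for word, index in vocab.items():
    print(word, label_matrix[index])
```

In this toy corpus 'magazine' and 'paper' never appear next to each other, yet they end up with identical label rows because they share the neighbors 'read' and 'weekly'. Words with similar label rows are the ones that receive similar gradients during training.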

Note: The label matrix consumes nearly 20GB of memory, so please be careful; this is for demonstration purposes only. Pre-loading the full label matrix is not an efficient approach.
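
To see where a number of that size comes from, and one way around it, here is a rough sketch. The vocabulary size of 70,000 and the 4 bytes per entry are assumed figures (the article does not state them), and the sparse neighbor-set approach is a suggestion of mine rather than what the demo does.

```python
def dense_matrix_gigabytes(vocab_size, bytes_per_entry=4):
    """Memory needed for a dense vocab_size x vocab_size label matrix."""
    return vocab_size * vocab_size * bytes_per_entry / 1e9


def build_neighbor_sets(ids, window):
    """Store only the neighbors that actually occur, per input word id."""
    neighbors = {}
    for pos, center in enumerate(ids):
        start = max(0, pos - window)
        stop = min(len(ids), pos + window + 1)
        context = ids[start:pos] + ids[pos + 1:stop]
        neighbors.setdefault(center, set()).update(context)
    return neighbors


def batch_label_rows(batch_ids, neighbors, vocab_size):
    """Expand multi-hot rows only for the words in the current batch."""
    rows = []
    for word_id in batch_ids:
        row = [0.0] * vocab_size
        for neighbor_id in neighbors.get(word_id, ()):
            row[neighbor_id] = 1.0
        rows.append(row)
    return rows


# Assumed vocabulary of 70,000 words stored as 4-byte floats.
print(dense_matrix_gigabytes(70_000))

# Toy sequence of word ids over a hypothetical 4-word vocabulary.
ids = [0, 3, 0, 2, 3, 1, 2, 1]
neighbors = build_neighbor_sets(ids, window=1)
print(batch_label_rows([0], neighbors, vocab_size=4))
```

With this layout only the rows for the current batch are ever dense, so memory grows with the number of observed neighbor pairs rather than with the square of the vocabulary size.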

For those familiar with TensorFlow, the model is basic hello-world code. We create two matrices, W_in and W_out, do a simple matrix multiplication to generate the logits, and then use the label_matrix from the preprocessing step to calculate the loss, followed by a simple optimization using Adam. The only thing to notice is that, instead of softmax, we use sigmoid cross-entropy, which is the appropriate loss function for multi-label classification.
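
The mechanics of that model can be sketched without any framework. The version below is a dependency-free illustration under stated assumptions, not the TensorFlow script: it uses plain SGD instead of Adam, a toy 4-word label matrix, and hypothetical names (`forward`, `train_step`, `lr`). The per-label loss is the same numerically stable formula that TensorFlow exposes as `tf.nn.sigmoid_cross_entropy_with_logits`.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def forward(word_id, W_in, W_out):
    """Look up the input embedding and multiply it by W_out."""
    hidden = W_in[word_id]
    vocab_size = len(W_out[0])
    return [sum(hidden[d] * W_out[d][j] for d in range(len(hidden)))
            for j in range(vocab_size)]


def train_step(word_id, labels, W_in, W_out, lr):
    """One SGD update for one input word against all of its labels."""
    hidden = list(W_in[word_id])  # copy: both updates use the old values
    logits = forward(word_id, W_in, W_out)
    loss = sum(sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels))
    errors = [sigmoid(x) - z for x, z in zip(logits, labels)]  # dLoss/dlogit
    for d in range(len(hidden)):
        grad_hidden = sum(errors[j] * W_out[d][j] for j in range(len(errors)))
        for j in range(len(errors)):
            W_out[d][j] -= lr * errors[j] * hidden[d]
        W_in[word_id][d] -= lr * grad_hidden
    return loss


random.seed(0)
vocab_size, dim = 4, 3  # toy sizes, chosen only for the demo
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(dim)]

# Hypothetical label matrix: one multi-hot row per input word.
label_matrix = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]

losses = []
for epoch in range(200):
    total = 0.0
    for word_id in range(vocab_size):
        total += train_step(word_id, label_matrix[word_id], W_in, W_out, lr=0.1)
    losses.append(total)

print(losses[0], losses[-1])
```

Each input word is visited once per epoch and is scored against its whole label row in a single step, which is the point of treating the problem as multi-label classification.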

After 150 epochs, we can see the model start generating useful embeddings. By 200–250 epochs, the embeddings become fairly stable, and we can stop training and freeze the model.

Evaluation against word2vec

I have created a model using word2vec on the same text8 data, with the same min_count=5. Let's see some results.

word = 'politics'
Multilabel class
[('politics', 0.99999994),
('political', 0.86523706),
('democracy', 0.8184589),
('democratic', 0.8141048),
('liberal', 0.8122362),
('social', 0.7826215),
('independence', 0.7800275),
('affairs', 0.7797648),
('relations', 0.7746104),
('opposition', 0.76887476)]
Word2vec
('political', 0.7108944058418274)
('conservatism', 0.65284264087677)
('ideology', 0.6383846998214722)
('socialism', 0.6360189914703369)
('democracy', 0.6324355602264404)
('nationalism', 0.6319522857666016)
('socialist', 0.618427038192749)
('policy', 0.6135483384132385)
('affairs', 0.605991780757904)
('democratic', 0.5972957015037537)
word = 'highly'
Multilabel class
[('highly', 0.99999994),
('extremely', 0.8494898),
('particularly', 0.8420101),
('widely', 0.8152985),
('variety', 0.8053535),
('especially', 0.7947577),
('less', 0.79363227),
('strong', 0.7935199),
('produce', 0.79041016),
('combined', 0.7900724)]
Word2vec
('fairly', 0.6947253942489624)
('extremely', 0.6848607063293457)
('enormously', 0.6423172950744629)
('hugely', 0.6322559714317322)
('somewhat', 0.6279569864273071)
('quite', 0.6213657855987549)
('socially', 0.621038019657135)
('very', 0.6139760613441467)
('moderately', 0.607708215713501)
('remarkably', 0.6053123474121094)
word = 'hate'
Multilabel class
[('hate', 0.9999999),
('homosexuals', 0.76239765),
('hatred', 0.7389945),
('prejudice', 0.73819554),
('racist', 0.73606145),
('crime', 0.7299746),
('crimes', 0.71541023),
('homophobia', 0.7035942),
('racism', 0.70098424),
('fear', 0.6960762)]
Word2vec
('incitement', 0.6599056720733643)
('homophobia', 0.6584235429763794)
('affirmative', 0.6348272562026978)
('bigotry', 0.6310179233551025)
('greed', 0.6309402585029602)
('homosexuals', 0.6294110417366028)
('lgbt', 0.6259943246841431)
('racist', 0.6244571208953857)
('guilt', 0.6208794116973877)
('rape', 0.6207380890846252)
word = "man"
Multilabel class
[('man', 1.0000001),
('she', 0.88904625),
('her', 0.88450146),
('said', 0.8683877),
('love', 0.8510502),
('my', 0.84702027),
('him', 0.84684056),
('story', 0.8413086),
('life', 0.8349931),
('young', 0.82487655)]
Word2vec
('woman', 0.7598437070846558)
('girl', 0.6586174964904785)
('creature', 0.6173745393753052)
('boy', 0.6068246364593506)
('person', 0.5814706087112427)
('mortal', 0.5743921995162964)
('bride', 0.5679938793182373)
('evil', 0.567527174949646)
('thief', 0.5572491884231567)
('demon', 0.5415527820587158)
word = 'italy'
Multilabel class
[('italy', 1.0),
('spain', 0.84999496),
('france', 0.8470675),
('germany', 0.81304777),
('rome', 0.7964344),
('italian', 0.79149246),
('austria', 0.78135693),
('greece', 0.77487767),
('portugal', 0.77276725),
('venice', 0.76478523)]
Word2vec
('france', 0.7863207459449768)
('spain', 0.7676301002502441)
('sicily', 0.7537814378738403)
('greece', 0.7422847151756287)
('austria', 0.7342372536659241)
('germany', 0.7263705730438232)
('gaul', 0.7121081352233887)
('naples', 0.7076730728149414)
('turkey', 0.7016696929931641)
('portugal', 0.6989914178848267)
word = 'computer'
Multilabel class
[('computer', 0.9999999),
('software', 0.90251267),
('programming', 0.88836336),
('digital', 0.84895533),
('computers', 0.8487892),
('systems', 0.84857726),
('program', 0.8431728),
('machine', 0.84080416),
('design', 0.8391893),
('computing', 0.82896864)]
Word2vec
('computers', 0.7370442152023315)
('computing', 0.7154221534729004)
('programmer', 0.6870735287666321)
('digital', 0.662018358707428)
('calculator', 0.6605353355407715)
('mainframe', 0.6580640077590942)
('console', 0.6566474437713623)
('graphics', 0.6503777503967285)
('hardware', 0.6495660543441772)
('programmable', 0.6406388878822327)
word = 'algebra'
Multilabel class
[('algebra', 1.0),
('algebraic', 0.89749193),
('calculus', 0.89501),
('theorem', 0.87045324),
('euclidean', 0.85363406),
('algebras', 0.852213),
('mathematical', 0.8469849),
('abstract', 0.84639347),
('topology', 0.8402641),
('finite', 0.83807576)]
Word2vec
('algebraic', 0.853831946849823)
('boolean', 0.8227413892745972)
('topology', 0.8171056509017944)
('commutative', 0.8079892992973328)
('calculus', 0.807814359664917)
('associative', 0.7952014207839966)
('banach', 0.7888739109039307)
('algebras', 0.783771276473999)
('integrals', 0.7819021344184875)
('cauchy', 0.7753515243530273)
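
Neighbor lists like the ones above are typically produced by ranking every word by its cosine similarity to the query word. The sketch below assumes that is how these scores were computed (the near-1.0 score of each query word against itself suggests so); the function names and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=10):
    """Rank all words, including the query itself, by cosine similarity."""
    query = embeddings[word]
    scored = [(other, cosine(query, vector)) for other, vector in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy embeddings, made up for the example.
toy_embeddings = {
    "italy": [1.0, 0.0],
    "spain": [0.9, 0.1],
    "algebra": [0.0, 1.0],
}
similar_to_italy = most_similar("italy", toy_embeddings, topn=3)
print(similar_to_italy)
```

Like the multi-label lists, this ranking keeps the query word at the top with a score of about 1.0.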

Note: This article is mostly intended to convey how and why word2vec / word embeddings work. There is no guarantee that all of this information is right. The only advantage of multi-label over softmax is that we can compute the loss for an input word once, over all of its labels, instead of running softmax over the same input multiple times, once per (input, output) pair.

Room for improvement

We are not doing any sampling here to down-weight frequent words, as word2vec does. I believe there is still room for improvement: changing the loss function, using energy-based models to map labels into a different space, or pruning neighbors based on tf-idf, occurrence counts, PMI, etc. The essence is that the gradients flowing from the output to the input will be similar for similar words (similar in terms of the output words the input is trying to predict).
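
As one illustration of the pruning idea, the sketch below scores each (input, neighbor) pair by pointwise mutual information, PMI(w, c) = log(p(w, c) / (p(w) p(c))), and keeps only pairs above a threshold. This is just one possible reading of the suggestion; the function names, the threshold of 0.0, and the toy pairs are assumptions.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """PMI for every distinct (word, context) pair in the list."""
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    total = len(pairs)
    scores = {}
    for (word, context), count in pair_counts.items():
        # p(w, c) / (p(w) * p(c)) simplifies to count * total / (n_w * n_c).
        ratio = count * total / (word_counts[word] * context_counts[context])
        scores[(word, context)] = math.log(ratio)
    return scores


def prune_pairs(pairs, threshold=0.0):
    """Keep only pairs whose PMI is strictly above the threshold."""
    return {pair for pair, score in pmi_scores(pairs).items() if score > threshold}


# Toy pairs: 'the' co-occurs with everything, so it carries no information.
pairs = [
    ("magazine", "weekly"), ("magazine", "the"),
    ("paper", "weekly"), ("paper", "the"),
    ("cat", "furry"), ("cat", "the"),
    ("dog", "furry"), ("dog", "the"),
]
kept = prune_pairs(pairs)
print(sorted(kept))
```

In this toy data every pair involving 'the' gets a PMI of exactly 0 and is dropped, which has an effect similar to down-weighting frequent words.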

How you relate the input to the output defines different embeddings. For example, if we try to predict the POS tags of neighboring words, we might get grammatically similar embeddings.

In the next and final part, we will see how we can use this concept in real-world applications, with the help of architectures like CNNs, RNNs, etc.

Source: Deep Learning on Medium