This article was originally published in AI Magazine.
Everything Has Its Price — How to Price Words and Phrases in Online Ad Bidding and More
This article sketches an NLP approach to pricing natural-language words and phrases. It creatively leverages (1) word2vec, a model that learns the contexts of and associations between words from a given corpus, and (2) the Mondovo dataset, which provides the basic building blocks for bootstrapping our application. The solution has interesting applications in fields such as online ad bidding, online marketing, and search engine optimization. This article presents an initial baseline solution to the pricing problem; readers eager for a more in-depth treatment, and a look at how I do it in practice, are welcome to tune in for my follow-up publication.
People are quantifying everything. When we are unable to quantify something, we call it either worthless or mysterious, or dismiss it outright as hallucination; such is the case with things like love, loyalty, and honesty.
The online ad bidding industry is definitely no exception, and one of its biggest problems is how to come up with accurate bid prices for chosen ad keywords or phrases in order to secure hot ad spots on publishers' websites. The quandary goes like this: if you set the bid price too high, you may be sure of getting the ad spot, but you will also have to pay the hefty price you bid; if you set it too low, chances are you will have a hard time getting that ad spot at all. This delicate trade-off calls for creative solutions to the problem of quantifying words and phrases into prices.
Fortunately, there is resoundingly good news: words can be priced too! For this problem we may not have the luxury of a well-crafted recipe like the Black-Scholes model for options pricing, but there are multiple ways to take a crack at it.
In this article, I will sketch out a simple solution to the keyword-pricing problem that makes basic use of a natural language processing technique called word2vec. The following sections show how to handle the data, where to employ word2vec, how to transform our problem into a regression task, and finally the performance of the whole pipeline.
Let us get started.
Brief Intro to word2vec
It might be helpful to trace the evolution of statistical language models. At the beginning we have the naive bag-of-words model, in which each word in the corpus is treated discretely: no context, no dependency, just independent words. The best you can do with such a model is come up with a chart of word frequencies.
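As a quick illustration (my own toy sketch, not part of the original pipeline), counting word frequencies under the bag-of-words view is a one-liner with Python's standard library:

```python
from collections import Counter

# toy corpus: a single sentence stands in for a real document collection
corpus = "the cat sat on the mat and the dog sat too"
freq = Counter(corpus.split())

print(freq.most_common(2))  # [('the', 3), ('sat', 2)]
```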
Next comes the n-gram model. Unigrams, namely individual words, are not that powerful, but we can extend to bigrams, trigrams, 4-grams and beyond, in which every N (2, 3, 4 or more) consecutive words are treated as a unit. Arguably, such models can capture word context of size N and enable more sophisticated predictions and inferences. For example, we can build more powerful probabilistic state-transition models such as Markov chains, which support everyday applications like word autosuggestion and autocomplete.
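To make the idea concrete, here is a minimal sketch (again my own illustration) of extracting n-grams from a tokenized sentence:

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "how much does it cost".split()
print(ngrams(tokens, 2))
# [('how', 'much'), ('much', 'does'), ('does', 'it'), ('it', 'cost')]
```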
In contrast, word embedding is a family of language models in which words or phrases from the vocabulary are represented as vectors, and word2vec is one of the most popular techniques for producing such embeddings. Generally speaking, it uses a neural network to learn word associations and relations from a given corpus, and represents each word with a fixed-length vector such that the semantic similarity between words correlates with the similarity between their vector representations. The Wikipedia page provides a good initial pointer; for a more in-depth treatment of this topic, please stay tuned for my future posts.
Data Preparation
This is an extremely important step. In order to come up with any model, we need data first. Further, for our model to learn any meaningful relationship, the data must contain sample mappings from natural-language words to prices. Unfortunately, there are not many such datasets available on the Internet; the one I was able to find is from Mondovo. This specific dataset contains the top 1000 most-asked questions on Google and their associated global cost-per-click values. Although fairly small, it provides the basic ingredients we need: words and their prices.
It is fairly easy to wrap the 1000 rows of data into a pandas dataframe with two columns, keyword and price; let us call this dataframe df from now on.
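A minimal sketch of what df looks like, using a couple of made-up rows (the keywords and prices below are illustrative, not taken from the Mondovo dataset):

```python
import pandas as pd

# illustrative rows only; the real dataset has 1000 of them
rows = [
    ("what is my ip", 1.35),
    ("how to lose weight", 2.10),
]
df = pd.DataFrame(rows, columns=["keyword", "price"])
print(df.shape)  # (2, 2)
```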
Then let us shuffle the rows to ensure the order of the data is indeed randomized:
df = df.sample(frac=1).reset_index(drop=True)
That is it for our data preprocessing.
Now let us concern ourselves a bit with word2vec. For this task, instead of learning vector representations from our own corpus (namely, the 1000 phrases), we will rely on ready-to-use vectors. The following code snippet loads an out-of-the-box model from Google:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
According to this source, the model was built on ‘pre-trained Google News corpus (3 billion running words), (and contains) word vector model (3 million 300-dimension English word vectors)’.
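The key property, that semantic similarity shows up as vector similarity, is usually measured with cosine similarity. Here is a toy illustration with hand-made 3-dimensional vectors (the real model uses 300 dimensions, and these particular numbers are invented for the example):

```python
import numpy as np

def cosine(u, v):
    # cosine of the angle between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.9, 0.05])
banana = np.array([0.1, 0.0, 0.95])

# semantically related words should end up with more similar vectors
print(cosine(king, queen) > cosine(king, banana))  # True
```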
From Word to Sentence
Here is the catch: the word2vec model contains only vector representations of individual words, but we need vector representations of short sentences or phrases like those in our dataset.
There are at least three ways to get around this problem:
(1) Take the average of the vectors of all words in the short sentence;
(2) Similarly, take the average, but weight each vector using the idf (inverse document frequency) score of the word;
(3) Use doc2vec, instead of word2vec.
Here I am curious to see how a baseline model would perform, so let us go with (1) for now and leave the other options for future explorations.
The following code snippet will provide a straightforward example to implement the averaging function:
import numpy as np

def get_avg(phrase, wv):
    vec_result = []
    tokens = phrase.split(' ')
    for t in tokens:
        if t in wv:
            vec_result.append(wv[t])
        else:
            # 300 is the dimension of the Google wv model
            vec_result.append([0.0]*300)
    return np.average(vec_result, axis=0)
Please note that the if condition is necessary because certain 'stop-words' (those extremely common and generally uninformative words in a given language; in English, think of 'the', 'it', 'which', etc.) have been excluded from the Google model. In the snippet above, I took some leeway and skipped a detailed treatment of missing words and stop-words; a more in-depth discussion will follow in my future posts. Please stay tuned!
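To sanity-check the averaging logic without downloading the sizable Google model, we can run it against a toy "model": a plain dict supports the same `t in wv` and `wv[t]` lookups. The snippet repeats the function (with the dimension as a parameter) so it runs standalone; the words and vectors are made up:

```python
import numpy as np

def get_avg(phrase, wv, dim=3):
    # average the vectors of all words in the phrase;
    # words missing from the model contribute zero vectors
    vec_result = []
    for t in phrase.split(' '):
        if t in wv:
            vec_result.append(wv[t])
        else:
            vec_result.append([0.0] * dim)
    return np.average(vec_result, axis=0)

toy_wv = {"cheap": np.array([1.0, 0.0, 2.0]),
          "flights": np.array([3.0, 2.0, 0.0])}

print(get_avg("cheap flights", toy_wv))      # [2. 1. 1.]
# a word missing from the model ('the') adds a zero vector,
# which pulls the average toward the origin
print(get_avg("the cheap flights", toy_wv))
```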
Regression Problem Setup
Remember that, fundamentally, almost all machine learning algorithms expect numerical inputs: in image processing, for example, black-and-white pictures are fed to algorithms as matrices of 0s and 1s, and colored pictures as RGB tensors. Our problem is no exception, and that is why we took all the trouble to introduce word2vec.
With that in mind, let us build our feature matrix and target vector for use in machine learning algorithms:
X = np.array([get_avg(phrase, wv) for phrase in df['keyword']])
y = df['price']
And since we are predicting some numerical values, this is a regression problem. Let us choose some handy regression algorithm for this task:
from sklearn.ensemble import RandomForestRegressor

# leaving out all parameter tuning to show absolute baseline performance
reg = RandomForestRegressor(random_state=0)
Now we are finally able to see how our absolute baseline model performs. Let us set up a 10-fold cross validation scheme as follows:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# set up 10-fold cross-validation
kf = KFold(n_splits=10)

# loop over each fold and retrieve the result
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    reg.fit(X_train, y_train)
    print(mean_absolute_error(y_test, reg.predict(X_test)))
In my experiment, running the code above gave MAE scores of 1.53, 0.98, 1.06, 1.23, 1.02, 1.01, 1.06, 1.19, 0.96 and 0.96, for an average MAE of 1.1; in other words, on average our estimated price deviates by about $1.10 from the true value.
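The averaging above is straightforward to verify from the per-fold scores:

```python
import numpy as np

# the ten per-fold MAE scores reported above
fold_maes = [1.53, 0.98, 1.06, 1.23, 1.02, 1.01, 1.06, 1.19, 0.96, 0.96]
print(round(np.mean(fold_maes), 2))  # 1.1
```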
Given the scanty data available, the lack of word redundancy in the training data, the sparsity of in-sample data points, and our absolute-baseline setup without any parameter optimization, I am really impressed by how far we have been able to push our current methodology. It is not hard to imagine that avid readers running their own experiments will achieve better results.