Explaining the Field-Aware Factorization Machines (FFMs) for CTR-Prediction

Source: Deep Learning on Medium

“I vividly remember my first encounter with a CTR prediction problem. Before this, I had been learning data science and I was feeling good about my progress. I was starting to feel confident in ML, but the first look at the data-set gave me jitters. The data after unzipping was well over 50 GB! — I had no clue how to predict a click on such a huge data-set. A blessing in disguise, Factorization machines came to my rescue.”

Anyone who has worked on a CTR Prediction problem or Recommendation Systems would have faced a similar crisis. As the data-sets are huge, predicting these data-sets becomes very challenging with limited computation resources at disposal.

However, in most of the cases these data-sets are sparse hence, there are several features which are not important for prediction, this is where factorization comes to rescue — it helps to extract the most important features (hidden or latent) from the existing raw ones.

In this article, I discuss Factorization Machines(FMs) and Field-Aware Factorization Machines(FFMs) which allows us to take advantage of factorization in a regression/classification problem.


Click-through rate (CTR) prediction plays an important role in computational advertising. The most commonly used models for CTR estimates are Poly2 mappings and factorization machines (FMs). In recent times, a variant of factorization machines, the FFMs performed well in the worldwide CTR estimation competition, which is mainly effective in large-scale sparse data classification tasks.


Logistic regression is the most widely used model in CTR estimation. For a data-set with m samples (yi, xi), i = 1,.……,m, yi stands for label, xi is a feature vector, the model parameter w is obtained by solving the following optimization problem:

among them, This is a linear model:

In the CTR predictions, it is crucial to learn the effects of feature conjunctions, such as in the following examples:

Ads from Gucci have a higher click-through rate at Vogue. However, linear models face difficulties in learning this information. Because, linear models learn the two weights Gucci and Vogue separately, to solve this problem, the degree-2 polynomial mapping (Poly2) model and the FMs model were proposed. The Poly2 model is for each feature conjunctions. The FMs model, learns weight by expressing the effects of feature conjunctions by breaking into the inner product of two hidden vectors.


Studies show that the Poly2 model can capture the effects of feature conjunctions well, and when the linear model is applied to the degree-2 mappings, the training time and test time are significantly faster than the method of applying the kernel. The Poly2 model learns a weight for each feature pair:

Where h(j1,j2) is a function that encodes j1 and j2 into a natural number.

The computational complexity of the above formula is O(n²), n, representing the average of the non-zero numbers of each sample.

FM learns an implicit vector representation for each feature, and each hidden vector contains k dimensions, so that the effects of feature conjunctions are modeled as the inner products between the two hidden vectors:

The computational complexity of the above formula is O(n²k).

The O(n²k) computational complexity can be reduced to O(nk) by some computational techniques.

On a sparse data-set,

FM model > Poly2 model

How Factorization Machines trump Polynomial and linear models?

For example, consider the pair (ESPN, Adidas), there is only one unique negative sample. The Poly2 model will learn a large negative weight for this pair. However, for FMs, because it is a hidden vector representation of ESPN and Adidas, all samples containing ESPN and all samples containing Adidas will be used to learn these two hidden vectors, so its prediction will be more accurate.


The idea of ​​the FFM model is derived from the PITF model. The PITF model is used in the recommendation system with personalized tags. In the PITF model, there are 3 fields,

user | item | tag

these 3 fields are decomposed in the independent hidden spaces as,

(user, item), (user, tag), (item, tag).

Mostly for CTR predictions, features can generally be grouped into fields. In the example above,

ESPN, Vogue, and NBC can be grouped into Publisher,

Nike, Gucci, and Adidas belong to the Advertiser.

FFMs will make full use of group information. Consider the following example:

In FMs,

each feature has a unique implicit vector representation. This hidden feature is used to learn the effect of any other feature. Considering ESPN, w(ESPN) is used to learn the implicit impact of,

  1. Nikew(ESPN)*w(Nike)
  2. Male → w(ESPN)*w(Male)

but due to Nike and Male belongs to different fields, their latent effects are different.

In FFMs,

each feature has several different hidden vectors. The FFMs of the above examples are as follows:

Thus, to learn the latent effect of,

  1. (ESPN, NIKE) → w(ESPN, A) & w(Nike,P) is used, because Nike belongs to the field Advertiser and ESPN belongs to the field Publisher.


2. (ESPN, Male) → w(ESPN,G) & w(Male,P) is used.

Mathematically, the overall calculation formula is as follows:

here, f2 represents the field of j2, and f1 represents the field of j1.


Data Sets:

Evaluation method:

logistic loss defined as,

Impact of the Parameters:

It can be evidently seen that the effect of k on logloss is not significantly large; if the regular term ( λ) is relatively large, the model is prone to high deviation, the effect is not good. On the contrary, if the regular term is relatively small, the model gets better results, but it easily overfits the data.

For the learning rate ( η), if it is too small, the model will be slower. If it is too large, the model loss will decrease very quickly, but then it will be fitted, so it needs early stopping,

Early stopping is a strategy to reduce overfitting for many Machine Learning problems, for FFMs the strategy used is:

1. Divide the training data-set into training data and verification data.

2. At the end of each epoch(iteration), use the validation set to evaluate the loss.

3. If the loss rises, record the number of epochs. Stop or implement the Step 4.

4. If requires, retrain the model with the complete data-set, using the number of epochs recorded in Step 3.

The more difficult thing is that logloss is very sensitive to the number of iterations. The number of iterations that perform well on the validation set is not necessarily good on the test set.

Comparison with other models:

It is evident that FFM works well on CTR estimates!


  1. FFMs also supports parallelization, so they are faster.

2. FFM is much better than LM, poly2, and FM in the processing of sparse data.

3. FFMs incorporate the concept of field into FMs and are worth learning.

4. In actual use, it is necessary to adjust the parameters according to the requirements of specific problems for achieving better results.

If this blog helped you in any possible way then please don’t forget to give it a Clap!! Thank you for the read……

If you wish to discuss on relative topics or wanna recommend anything then feel free to reach me out on my e-mail : vinitvssingh5@gmail.com . Adios!!