Using Focal Loss for Deep Recommender Systems

This blog post explains the approach I took during the 1-day hackathon hosted by Analytics Vidhya. I finished 11th on the public leaderboard. The code is in PyTorch.

Deep Learning for Recommender Systems


A student solves a series of challenges (programming exercises) on an online platform. The problem is to predict which challenges a user will solve next, given the first 10 challenges they solved.

  • There are about 5500 challenges in the dataset.
  • There are 69732 users in the train set and 39732 users in the test set. The train-test split is based on users.
  • Every user in the train data has solved 13 challenges, and the order in which they were solved is given. In the test data, the first 10 challenges solved by each user are given, and we are asked to predict the next 3.
  • More about the challenge here


mAP@3 is used as the metric.
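mAP@3 is the mean over users of the average precision of each user's top-3 predictions. A minimal sketch of the metric (function names are my own, not from the competition code):

```python
# Hypothetical sketch of mAP@3: average precision at k per user,
# then the mean across users.
def apk(actual, predicted, k=3):
    """Average precision at k for a single user."""
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a correct item appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)

def mapk(actual_list, predicted_list, k=3):
    """Mean average precision at k over all users."""
    return sum(apk(a, p, k) for a, p in zip(actual_list, predicted_list)) / len(actual_list)
```

With 3 ground-truth challenges per test user, a perfect top-3 prediction scores 1.0 and partial matches are rewarded more when they appear earlier in the ranking.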

Network Architecture:

  • Input to the network is the 10 challenges solved by the user. All challenges are label encoded.
  • An embedding layer represents each challenge as a 50-dim vector (hyperparameter).
  • The vectors of all 10 challenges are concatenated.
  • The concatenated vector is passed through a set of FC layers, with ReLU as the activation function.
  • The final layer has 5501 neurons. The target is a 5501-dim vector with 1s at the positions of the three challenges the user solves next.
  • Sigmoid is used as the activation function at the output.
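The architecture above can be sketched roughly as follows. The embedding size (50), sequence length (10), and output size (5501) come from the post; the hidden-layer width of 512 is my own assumption, since the post doesn't state the FC layer sizes:

```python
import torch
import torch.nn as nn

NUM_CHALLENGES = 5501  # output size, from the post
EMB_DIM = 50           # embedding size (hyperparameter, from the post)
SEQ_LEN = 10           # challenges given per user

class ChallengeNet(nn.Module):
    """Sketch of the described network; hidden width 512 is assumed."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_CHALLENGES, EMB_DIM)
        self.fc = nn.Sequential(
            nn.Linear(SEQ_LEN * EMB_DIM, 512),
            nn.ReLU(),
            nn.Linear(512, NUM_CHALLENGES),
        )

    def forward(self, x):                 # x: (batch, 10) of challenge ids
        e = self.emb(x)                   # (batch, 10, 50)
        e = e.flatten(start_dim=1)        # concat embeddings -> (batch, 500)
        return torch.sigmoid(self.fc(e))  # per-challenge probabilities
```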

Loss Function:

Since the output is highly imbalanced (3 positives out of 5501), I used focal loss to train the network, with 0.25 as the balancing parameter (α) and 2 as the focusing parameter (γ).
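A sketch of binary focal loss (Lin et al.) applied elementwise to the sigmoid outputs; the `eps` term for numerical stability is my own addition:

```python
import torch

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """Binary focal loss on sigmoid outputs with multi-label targets.

    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the
    predicted probability of the true class for each output unit.
    """
    p_t = torch.where(targets == 1, probs, 1 - probs)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy, well-classified examples,
    # so the many easy negatives don't dominate the 3 positives.
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps)
    return loss.mean()
```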


  • Initialized the final-layer bias with -math.log((1 - 0.01)/0.01), as given in the focal loss paper, so the network starts out predicting a probability of roughly 0.01 per class. PyTorch's default initializations were used for all remaining layers.
  • SGD with a learning rate of 0.1, weight decay of 0.0001, and momentum of 0.9 was used as the optimizer.
  • Batch size of 256, trained for 300 epochs. The network converged around epoch 246.
  • Training took approximately 3 hours, which seems slow; I don't know the reason yet and will investigate later.
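The bias initialization and optimizer setup above can be sketched as follows. The focal loss paper sets the final-layer bias to b = -log((1 - π)/π) with prior π = 0.01; the layer shape here is a hypothetical stand-in for the actual final layer:

```python
import math
import torch
import torch.nn as nn

# Prior probability from the focal loss paper: each output unit starts
# out predicting ~0.01, which stabilizes early training under heavy
# class imbalance.
prior = 0.01
final = nn.Linear(512, 5501)  # hypothetical final layer
nn.init.constant_(final.bias, -math.log((1 - prior) / prior))

# Optimizer params stated in the post.
optimizer = torch.optim.SGD(final.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
```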

Results:

  • Train mAP@3: 0.263, Val mAP@3: 0.1877
  • Train Loss: 0.356, Val Loss: 0.4437

Public leaderboard: 0.19233104, Rank: 11


  1. I haven't tested the network extensively, but focal loss does seem to be of great help in optimizing it. The initial results, from only two runs, were quite promising. With larger datasets and more experiments, this should perform much better.
  2. I am not yet sure how to scale this approach to newly added challenges; a different method is probably needed to suggest new challenges to users.
  3. I also haven't used the challenger.csv data provided in the competition. It contains several potentially useful features:
  • Number of students who solved a particular challenge.
  • Number of articles written by the author of a challenge.
  • Programming language used to solve this challenge.


For those who want to experiment, I have hosted the code along with the data.


Thank you, Analytics Vidhya, for the clean dataset.

Clap and share if you like this post. Comment below if you have any feedback or doubt. Thanks for your time. Hope it helps.

Source: Deep Learning on Medium