My first data science contest.

The original article was published on Asicoder’s blog on Artificial Intelligence on Medium.

It was time to hustle. We quickly realized that our opponents were much older and more experienced. In fact, the rest of the teams were made up of master’s degree students in data science, and some of them were already working in the field.

Consequently, if we wanted to finish in a good position, we had to start looking for interesting insights in the data, and maybe try new models.

Visualization tools might help us find patterns that would make the classes easier to tell apart, we thought. So we studied PCA and t-SNE for dimensionality reduction, as well as violin plots for each feature. These were the revealing results we obtained:

Scatterplot of PCA with 2 principal components, training set.
Scatterplot of t-SNE with perplexity of 50.

All classes overlapped a lot, and no pattern or group was found.
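
For readers who want to try the same kind of projection, here is a minimal sketch using scikit-learn. It is not our original code: the `project_2d` helper, the standardization step, and the random stand-in data are assumptions made for illustration. Only the two principal components and the perplexity of 50 come from the plots above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def project_2d(X, perplexity=50, seed=0):
    """Return 2-D PCA and t-SNE embeddings of the feature matrix X.

    Hypothetical helper: scaling the features first is an assumption,
    not something the article states.
    """
    X_scaled = StandardScaler().fit_transform(X)
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(X_scaled)
    tsne_2d = TSNE(
        n_components=2, perplexity=perplexity, random_state=seed
    ).fit_transform(X_scaled)
    return pca_2d, tsne_2d


# Random stand-in for the training set (the contest data is not shown here).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))

pca_2d, tsne_2d = project_2d(X_demo)
print(pca_2d.shape, tsne_2d.shape)
```

Each embedding has one row per sample and two columns, so it can be passed straight to a scatterplot colored by class label. Note that t-SNE needs more samples than the chosen perplexity.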

After this, we tried some minor changes to the data, which seemed to improve our results a bit:

  • Convert the land register quality feature to 12 one-hot variables.
  • Change the construction year to the age of the building in years.
  • Change the null values to 0, instead of -1.
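
The three changes listed above can be sketched with pandas roughly as follows. The column names (`cadastral_quality`, `construction_year`, `area`), the `reference_year` parameter, and the tiny example frame are all hypothetical, since the real dataset schema is not shown in this post.

```python
import pandas as pd


def preprocess(df, reference_year):
    """Apply the three tweaks to a copy of df (column names are assumed)."""
    out = df.copy()

    # Construction year -> age of the building in years.
    out["building_age"] = reference_year - out["construction_year"]
    out = out.drop(columns="construction_year")

    # Land register quality -> one 0/1 column per category.
    out = pd.get_dummies(
        out, columns=["cadastral_quality"], prefix="quality", dtype=int
    )

    # Null values -> 0 (rather than -1).
    return out.fillna(0)


# Tiny made-up example, only to show the shape of the result.
df_demo = pd.DataFrame(
    {
        "construction_year": [1990, 2005, None],
        "cadastral_quality": ["A", "B", "A"],
        "area": [80.0, None, 120.0],
    }
)
clean = preprocess(df_demo, reference_year=2020)
print(clean)
```

With the real data, the quality column would expand into 12 columns instead of the two in this toy example.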

We still had many things to try, and less than one month to run our experiments on our modest laptops.

This is how we split the work.

Mario would try to reduce dimensionality using autoencoders, build some fancy model with neural networks, and test how Extra Trees and Support Vector Machines could help.

I would work mainly on over- and under-sampling techniques, the One-Against-All Data Balancing algorithm, and feature engineering.