Original article can be found here (source): Deep Learning on Medium
Fast and accurate learning with Transfer Learning on Tabular Data? How and Why?
How can we fine-tune a classifier of natural images to perform machine learning tasks on tabular data?
In general, we can categorise our data into unstructured data (data kept in non-uniform formats, like images and text) and structured data (the common tabular kind). In the first category, the winners by a large margin are the Deep Learning models (CNNs, RNNs, etc.). In the latter case, however, things are different: boosting tree-based algorithms (like XGBoost, LightGBM, CatBoost) lead.
Normally, when data scientists deal with tabular data, they spend about 60%-80% of their time on the data-preprocessing step (cleaning the data, exploratory data analysis (EDA), visual graphs, etc.). In contrast, in a task with unstructured data, for example image classification, we just dequantize our images from the discrete space in which they live to a continuous one, in order to be able to perform back-propagation. Another major plus of neural networks is the ability to perform transfer learning: instead of starting from a random initialization, we use models pre-trained on millions of data points, hoping that (mostly) the first convolutional layers have captured some important general concepts of the data.
The question is: Is there another, novel, way for tabular data that
• Is easier to implement,
• Helps us save some time,
• Achieves great results?
The method: SuperTML
Inspired by recent NLP research showing that the two-dimensional embedding of the Super Characters method is capable of achieving state-of-the-art results on large dataset benchmarks, Sun et al. (https://arxiv.org/abs/1903.06246) borrowed this concept to address Tabular Machine Learning (TML) problems!
The idea, which is called SuperTML, is both super simple and crazy at the same time; it is composed of two steps:
- Create a two-dimensional embedding by projecting the features of the tabular data onto generated images;
- Use pre-trained CNN models (trained on ImageNet?!) to fine-tune on the generated SuperTML images.
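The first step can be sketched in a few lines with Pillow. This is my own illustration, not the authors' code: the function name, layout, and font choice are assumptions, but the spirit is the same, each feature value is simply drawn as text onto a blank image.

```python
# A minimal sketch of the SuperTML embedding step (illustrative, not the
# paper's implementation): each feature value of one tabular row is drawn
# as plain text onto a blank image that a CNN can later consume.
from PIL import Image, ImageDraw, ImageFont

def supertml_image(features, size=224):
    """Render one row of tabular features as a 2-D image.

    features: a list of values (numbers, strings, or None for missing).
    """
    img = Image.new("RGB", (size, size), "black")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    cell = size // len(features)  # one horizontal band per feature
    for i, value in enumerate(features):
        # Missing values become '?', categorical values stay as strings.
        text = "?" if value is None else str(value)
        draw.text((10, i * cell + cell // 4), text, fill="white", font=font)
    return img

# One Iris-style row: three measurements plus a missing value.
img = supertml_image([5.1, 3.5, 1.4, None])
```

That is the whole preprocessing pipeline: no scaling, no encoding, just text on a canvas.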
Maybe the craziest part of this whole process is that the images (or two-dimensional embeddings) now have the following properties:
- If some features are more important than others (prior knowledge), just increase the font size of the number!
- Missing data? Just replace it with ‘?’ (!!)
- Categorical features? Just put them in as they are (as strings)
• Core idea: Tabular data can be embedded into a two-dimensional matrix (an image)
• Question: Does this method work? And if yes, WHY?!
Many readers probably have the same reaction as me: “There is no way this works! This is too crazy to work!”. And that creates a nice transition to the experiments and evaluation section.
In this section, apart from the results presented in the paper, I am going to share the results of my implementation (code available here).
We are going to explore 3 datasets: two very small ones (where DNN models are prone to overfit) and one big one. Their details are shown in the following table:
The following results indicate the power of neural networks to adapt to any given task, and they are wonderfully crazy:
For those who are still not impressed, the paper shows results for a very challenging task, the Higgs Boson Kaggle Challenge. A dataset of 30 features and 25,000 training / 55,000 testing samples was not enough to stop SuperTML, which climbed to the top by a large margin (for more details, please refer to the paper)!
A super trivial idea, very easy to implement, that outperformed the winner of the Kaggle competition (which required a lot of feature engineering) by a margin of 0.170! With just an image!
WHY DOES THIS WORK??
Authors’ opinion coming…
It seems that this algorithm learns the visual representation of the features (numbers, text) and learns to compare them with their analogues from other samples.
It learns that 4 is close to 5, but it learns a relative distance between the numbers; even though for a standard algorithm the distance between 4.5 and 4.7 has the same value (or weight) as the distance between 4.1 and 4.3, this algorithm learns their relationship through the data, so we can perhaps argue that this distance is learned (we can think of it as the algorithm learning its own arithmetic)!
• Simple idea that is easy to use and manipulate
• No data normalization, no special treatment of categorical features, no costly fine-tuning via grid search
• Takes advantage of the best CNN classifiers
• No overfitting on small datasets
• Make sure the features on an image do not overlap
• Numerical values may have some hidden relationship behind the shape of their digits, such as 6.01 vs 5.999 (it hasn't stopped us so far!)
Do we really need numerical values in tabular data if the features' relationships are enough? 🤔
As stated before, the complete code can be found here. Feel free to play with it and try more crazy ideas!
Until next time, take care!