Original article was published on Artificial Intelligence on Medium
Blue Book for Bulldozers Competition Part 7 (Continued) — Deep Learning for Tabular Data II
In this part (which is optional), we are going to try deep learning on the original bulldozers dataset to see how far we can go with no feature engineering and a simple fully-connected neural network. Using deep learning on tabular datasets used to be frowned upon because of neural nets’ limitations when dealing with such problems but today, thanks to embeddings (which we’ll discuss in great depth), we can bypass those limitations and achieve state-of-the-art results on tabular datasets using deep learning. You can find the notebook for this tutorial here.
Without further ado, let’s get coding (in Colab)!
P.S: My hope was for this article to be the last part of this “Deep Learning for Tabular Data” miniseries but there wasn’t enough time and space for everything I wanted to talk about and so this part will only include pre-processing the data and making it suitable for neural nets but I promise the next part will be solely devoted to actual model building. Sorry for the inconvenience!
Importing the Right Dependencies
Up until now, we’ve been using FastAI-0.7 for certain tasks such as data preprocessing, feature importance, etc. However, for the task at hand, FastAI-0.7 isn’t our best choice and instead, we’re going to be using its newer and more refined version, FastAI-1.0.X.
So let’s install it:
!pip install "fastai>=1.0.0,<2.0.0"
Just like its older counterpart, this version of FastAI comes with all the other libraries we might need and so there’s no need for us to install them (and unlike FastAI-0.7, which is incompatible with newer versions of certain libraries such as Scikit-learn, FastAI-1.0.X works fine with the most recent versions of most popular machine learning libraries).
Next, let’s import the libraries we need. Fortunately, fastai.tabular includes nearly everything required for dealing with structured datasets (ranging from pandas and NumPy to little-known but useful libraries and functions) and so we don’t need to import anything else:
from fastai.tabular import *
Getting the Data
We obviously can’t do deep learning without data and so naturally our next step is to get the data we need. At first, I thought about feeding our neural net the data we modified using our Random Forest which would most likely yield a better accuracy and also result in a much lower training time. But then I thought let’s see how far we can go with no feature selection/engineering and so we’re going to be using the original data. We can use the exact same code we used in part 1, which is at the top of the Colab notebook.
Next, let’s read our data:
df_raw = pd.read_csv('Train.csv', low_memory=False, parse_dates=['saledate'])
df_raw.sort_values('saledate', inplace=True)
df_raw['SalePrice'] = np.log(df_raw['SalePrice'])
add_datepart(df_raw, 'saledate')
df_train = df_raw.iloc[:-12000].copy()
Please note that df_train includes both our training and validation set.
Continuous and Categorical Features
As we’ve already discussed, in order to use deep learning on structured datasets, we must use embeddings for categorical variables and therefore, we must tell our model which of our columns are categorical ones so it knows whether to create embeddings for them or not. Doing so is an important (albeit easy) part of the process of using deep learning on tabular datasets and thus, you should definitely get comfortable with it. Here’s the process I use which usually gives adequate results:
First, take a look at all the columns you have and see which ones are really continuous. Such features could be the distance of something in kilometers, the temperature, etc. But please note the “really” part: As long as you’re not 100% sure a column is continuous, don’t count it as a continuous variable and put it aside for now.
Next, you should do the same thing but for categorical features (again, keep in mind the really part). Good examples of such columns would be anything that’s a boolean, different types of something (different car models, brands, etc.).
And last but not least, go through the variables that neither belong to the first group nor to the second one and think about some of the following (assume F is one such feature), not necessarily in this order:
- Could the F column of your test set include values which are absent from the F column of your training set? If so, how many such cases might there be? If your answer is a lot, you should probably refrain from treating F as a categorical variable.
- Assume a, b, c, and d all belong to F and furthermore, a-b = c-d (assuming F is an integer column). Is it possible that the relation between a and b is different from the one between c and d? If so, using embeddings is probably the right thing to do.
- What’s the cardinality of F? If it’s too high (relative to your problem and the size of your dataset) and you’re not sure about the first two questions, treat F as a continuous variable. Otherwise, treat it as a categorical one.
This is all very abstract so let’s go through the features we’re dealing with and apply these steps to them:
The only features that I’m extremely confident should be treated as continuous ones are ‘MachineHoursCurrentMeter’ and ‘saleElapsed’ so for now, they’re going to be our only continuous features.
For every remaining feature we have other than ‘Undercarriage_Pad_Width’, ‘Stick_Length’, ‘YearMade’, ‘saleYear’, ‘saleMonth’, ‘saleWeek’, ‘saleDay’, ‘saleDayofweek’, and ‘saleDayofyear’, we can say with certainty that they’re categorical ones, which brings us to step 3.
Before we do the last part, I just wanted to point out if you’re not quite sure what a column actually is (such is the case for me with ‘Stick_Length’ and ‘Undercarriage_Pad_Width’), that’s OK and you can still apply more or less this exact same process to it.
First, let’s grab these columns from df_raw and df_train:
cols = ['Undercarriage_Pad_Width', 'Stick_Length', 'YearMade', 'saleYear', 'saleMonth',
        'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear']
df_train_cols, df_raw_cols = df_train[cols].copy(), df_raw[cols].copy()
Next, let’s find the cardinality of each feature for both our dataframes:
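One way such a comparison could be computed is sketched here. Note that cardinality_report is a hypothetical helper (it is not part of FastAI or pandas), and the two toy dataframes merely stand in for df_raw_cols and df_train_cols so that the snippet runs on its own:

```python
import pandas as pd

def cardinality_report(train, full, cols):
    """For each column, count the distinct values in both dataframes and the
    values that appear in `full` but never in `train`."""
    rows = []
    for col in cols:
        train_vals = set(train[col].dropna().unique())
        full_vals = set(full[col].dropna().unique())
        rows.append({
            'feature': col,
            'n_unique_train': len(train_vals),
            'n_unique_full': len(full_vals),
            # Values the model would only ever meet at test time
            'n_unseen_in_train': len(full_vals - train_vals),
        })
    return pd.DataFrame(rows).set_index('feature')

# Toy stand-ins (assumption): `full_toy` plays the role of df_raw_cols and
# `train_toy` (everything but the last row) the role of df_train_cols.
full_toy = pd.DataFrame({'YearMade': [1990, 1995, 2000, 2005, 2010],
                         'saleMonth': [1, 2, 3, 1, 2]})
train_toy = full_toy.iloc[:-1]

report = cardinality_report(train_toy, full_toy, ['YearMade', 'saleMonth'])
print(report)
```

With the real data, the call would presumably look like cardinality_report(df_train_cols, df_raw_cols, cols). A non-zero n_unseen_in_train for a column means that the held-out rows contain values the training rows never show.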
As we can see, the only feature which has values in the test set that aren’t in the training set is ‘YearMade’. But even ‘YearMade’ has only 1 value that’s present in the test set but not in the training set and so for now, for all of our features, the score is 1–0 in favour of embeddings.
The second test, which is very important, is also easy in our case (we’ll leave out ‘Undercarriage_Pad_Width’ and ‘Stick_Length’ since I don’t know what they are, which is required for this step): figuring out the different relations between different values of a feature. For example, let’s say in ‘saleDayofweek’, 1 represents Monday, 2 Tuesday, and so on. Then Sunday-Saturday is equal to Tuesday-Monday, which basically implies that the relation between Saturday and Sunday is the same as the one between Monday and Tuesday, but that’s obviously not true: Saturdays and Sundays are probably very similar in some aspects but they’re vastly different from Mondays (so, for instance, the sales of a breakfast restaurant on Saturdays and Sundays might be very high but on Mondays, the same restaurant might receive only a few customers). Similar arguments can be made for all the other features as well and hence, it’s another point in favour of embeddings for our date-related columns.
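To make the difference concrete, here is a small illustrative sketch. The integer codes assume Monday = 0 through Sunday = 6, and the 2-dimensional embedding vectors are hand-picked for illustration only (a real model would learn them during training):

```python
import numpy as np

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
codes = {day: i for i, day in enumerate(days)}  # assumed integer encoding

def code_gap(a, b):
    """Distance between two days when the column is treated as continuous."""
    return abs(codes[a] - codes[b])

# Hand-picked (NOT learned) embedding table: one 2-d vector per day.
# Weekdays sit near (1, 0) and the weekend sits near (0, 1).
emb = np.array([
    [1.0, 0.0],  # Mon
    [0.9, 0.1],  # Tue
    [0.9, 0.0],  # Wed
    [0.8, 0.1],  # Thu
    [0.6, 0.4],  # Fri
    [0.0, 1.0],  # Sat
    [0.1, 0.9],  # Sun
])

def emb_gap(a, b):
    """Euclidean distance between the embedding vectors of two days."""
    return float(np.linalg.norm(emb[codes[a]] - emb[codes[b]]))

# As a continuous feature, consecutive days are always exactly 1 apart...
print(code_gap('Fri', 'Sat'), code_gap('Sat', 'Sun'))
# ...whereas an embedding is free to keep Sat and Sun close together
# and place Fri further away from Sat.
print(round(emb_gap('Fri', 'Sat'), 3), round(emb_gap('Sat', 'Sun'), 3))
```

The point is only that an embedding gives the model the freedom to place each category wherever it is most useful, rather than forcing equal spacing between consecutive values.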
And last but not least, we must see if, given the cardinality of our features, it’s a good idea to treat them as categorical variables or not. As we can see, all features except ‘saleDayofyear’ and ‘YearMade’, which have 360 and 72 distinct classes respectively, have a relatively low number of categories. But even the cardinalities of ‘saleDayofyear’ and ‘YearMade’ aren’t that high compared to the size of our dataset (we have almost 400K rows of data and 360 isn’t really high compared to that). So it’s another win for embeddings (also, please note that even though we don’t know what ‘Undercarriage_Pad_Width’ and ‘Stick_Length’ are, we can still talk about them since we know their respective cardinalities).
dep_var = 'SalePrice'
cat_vars = list(set(df_raw.columns) - set(['SalePrice', 'SalesID', 'MachineHoursCurrentMeter', 'saleElapsed']))
cont_vars = ['MachineHoursCurrentMeter', 'saleElapsed']
Suitable Data for Neural Nets
With traditional machine learning libraries such as Scikit-learn, we could directly pass in a dataframe for training, validation, and the like. With FastAI, however, that’s not possible. We have to turn our data into a DataBunch, which is basically an object that comes with some helpful modules and contains our data, and pass that to our neural net. Fortunately for us though, doing so is extremely easy thanks to FastAI’s data block API, which is, in my humble opinion, one of the jewels of FastAI. We’re not going to dive deep into this awesome API (if you’re more interested, I strongly urge you to read this) but in short, it’s basically an extremely flexible way of turning your data into a DataBunch object, which you can in turn use to train different models with ease: You can get your data from a number of different sources (CSV files, images in folders, text files, you name it), split it into a training and validation set in many different ways (random split, validation folder, etc.), label it in virtually whatever way you’d like (the name of your files, regex, etc.) and a lot more. Furthermore, it comes with a lot of pre-existing modules for different types of datasets (vision, tabular, etc.) but you can also customize it for other types of problems.
To turn our dataframe into a DataBunch, we should first think about 2 things:
- The pre-processing we need (handling missing values, strings to numbers, etc.)
- Our validation set
#1 is simple: We need to fill in the missing values our data has, categorify our categorical features, and normalize our continuous variables (we’ve already talked about the first two but if you don’t know what normalization is, you should definitely learn more about it. For now, it’s basically rescaling each continuous column so that it has a mean of 0 and a standard deviation of 1, which keeps all inputs on a comparable scale and matters a great deal when training virtually any type of neural net).
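As a rough illustration of what these three steps amount to, here is a pandas-only approximation on a toy dataframe. This is not FastAI’s implementation (its handling of details such as category codes may differ), and the column names hours and state are made up for the example:

```python
import numpy as np
import pandas as pd

# Toy training data (assumption): one continuous and one categorical column,
# each with a missing value.
train = pd.DataFrame({'hours': [10.0, np.nan, 30.0, 50.0],
                      'state': ['Ohio', 'Texas', 'Ohio', None]})

# 1) Fill missing values: use the training median and keep a flag column
#    so the model can still tell which rows were originally missing.
median = train['hours'].median()
train['hours_na'] = train['hours'].isna()
train['hours'] = train['hours'].fillna(median)

# 2) Categorify: map each distinct string to an integer code
#    (pandas uses -1 for missing values).
train['state'] = train['state'].astype('category')
state_codes = train['state'].cat.codes

# 3) Normalize: subtract the training mean and divide by the training
#    standard deviation. The same two statistics would be reused for the
#    validation and test sets.
mean, std = train['hours'].mean(), train['hours'].std()
train['hours'] = (train['hours'] - mean) / std

print(train)
print(state_codes.tolist())
```

The key design point is that every statistic (the median, the category list, the mean and the standard deviation) is computed on the training rows only and then applied unchanged to the validation and test rows.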
Before we split our validation set, there’s one very important thing we should keep in mind: When dealing with tree-based models, we could hyperparameter tune on a small sample of our data (which, in our case, consisted of 50K rows of data) and train our final model on the entire dataset (which was roughly 6 times the size of our sample data) using more or less the same values for our hyperparameters. Alas, we can’t do that with neural nets: At first, I tried hyperparameter tuning with a small sample (same size as above) but when I tried using the entire dataset to build my final model, my validation score actually dropped (it also gave very inconsistent results). I tried increasing the size of the sample data to 100K rows and although it did help a bit, it was still pretty bad. Eventually, after training the final model on 150K rows of data instead of the full dataset, the validation score seemed to be getting better (so I used a sample of 100K rows to find the best hyperparameters and when I was ready to use the test set, I used 150K rows of data instead of the full dataset). And please note that I tried a few other datasets as well and the same things happened: The sample data can’t be smaller than half the data you use to train your final model.
So before we go further, let’s create our smaller dataset:
# Sample data for hyperparameter tuning
n_sample = 100000
small_train = df_train.iloc[-n_sample:]
Regarding #2, we’ve already discussed that if we’re dealing with temporal data, our training set should chronologically precede our validation set and our validation set should precede our test set (df_train only includes the training and validation sets, so we need not worry about the test set for now). We also talked about how our validation set must be similar to our test set so that our validation score is a good indication of what we might get on the test set. As we saw, an easy way to create a validation set that’s close to our test set is to pick a chunk of our data that comes immediately before our test set and whose size is roughly equal to that of our test set (we also talked about other (possibly better) ways to choose the right validation set but for the purpose of this article, the easy way suffices).
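The idea can be sketched on a tiny made-up dataframe (the dates, prices, and split sizes here are purely illustrative and are not taken from the bulldozers data):

```python
import pandas as pd

# Toy data (assumption): six sales in arbitrary order.
toy = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-03-01', '2010-01-15', '2011-11-20',
                                '2010-07-04', '2011-06-30', '2009-12-31']),
    'SalePrice': [9.1, 10.2, 9.8, 10.5, 9.9, 10.0],
})

# Sort chronologically first so that "last rows" means "most recent rows".
toy = toy.sort_values('saledate').reset_index(drop=True)

n_test, n_valid = 1, 2
test = toy.iloc[-n_test:]             # the most recent rows
train_valid = toy.iloc[:-n_test]      # everything before the test set
valid = train_valid.iloc[-n_valid:]   # the chunk right before the test set
train = train_valid.iloc[:-n_valid]   # everything older than that

# Sanity check: training precedes validation, which precedes test.
assert train['saledate'].max() <= valid['saledate'].min()
assert valid['saledate'].max() <= test['saledate'].min()
print(len(train), len(valid), len(test))
```

Because the dataframe is sorted by date before slicing, taking rows from the end always yields the most recent sales.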
OK, now we’re ready to create a DataBunch:
First, let’s create a list of pre-processes our data needs. FastAI comes with a lot of built-in pre-processing techniques and thus, we don’t need to write anything ourselves (nor do we need to apply them to our dataframe. We’ll just pass this list to the data block API and it’ll do the rest by itself):
procs = [FillMissing, Categorify, Normalize]
Second, we have to split our validation set. There are various ways to tell the data block API what our validation set is going to be, the easiest of which in our case is just giving it the indices of the rows we’d like to have in our validation set (so if we have a dataset consisting of, say, 100 rows and we’d like our validation set to be the last 10 rows, we can simply pass in [90, 91, …, 99]). So let’s create a list that contains the indices of the last 20000 rows (which is the size of our validation set) of our training sample:
# The indices of our validation rows
n_valid = 20000
valid_idx = list(range(n_sample-n_valid, n_sample))  # This is equal to [80000, 80001, ..., 99999]
And last but not least, let’s put all these different pieces together to create a DataBunch we can give our neural net:
small_data = (TabularList.from_df(small_train, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
              .split_by_idx(valid_idx)
              .label_from_df(dep_var)
              .databunch())
As you can see, it’s extremely easy to create a DataBunch suitable for our task using the data block API:
First, we need to tell it what type of data we have (TabularList).
Then, we must tell it where that data comes from which, in our case, is a dataframe. We also need to say which of our features are categorical and which are continuous, and which pre-processing techniques we’d like applied to our data (.from_df).
Thirdly, we tell it we’d like to split our validation set using a list of validation indices, which we pass in (.split_by_idx).
Next, we have to specify how our data should be labeled, which is pretty simple: A column (dep_var) in the dataframe we passed in contains the target value for each row (.label_from_df).
And last but not least, we turn our data into a DataBunch with the default arguments, the most notable one being a batch size of 64, which usually works pretty well (.databunch).
In this part, we went through the various things we need to do in order to tailor our data to suit neural nets, from splitting our features into categorical and continuous ones, to turning our data into a DataBunch, which we can then pass to FastAI’s neural nets. We delved deep into the former and learned about the various things we should keep in mind if we’re at a crossroads and don’t know whether a feature is a categorical one or a continuous one. We also briefly went over the amazing data block API, which makes it very easy to turn almost any type of data into a DataBunch with training, validation, and (optionally) test sets. In the next part, we’ll create a neural net with embeddings and train it on our data, all using FastAI-1.0.X. We’re also going to see how to hyperparameter tune neural nets (for structured datasets) in a reasonable amount of time which is very important due to neural nets’ tendency to take long to train and also the sheer number of hyperparameters we need to tune. After we do those things, we can hopefully achieve a good score on our test set and get at least as good an accuracy as our ensemble model!
To be continued…
Please, if you have any questions or feedback at all, feel welcome to post them in the comments below and as always, thank you for reading!