Mercedes-Benz Greener Manufacturing: Getting into Top 50!

Original article can be found here (source): Artificial Intelligence on Medium


Business Problem:

Safety and reliability testing is a crucial step in the automobile manufacturing process. Every new vehicle design must pass a thorough evaluation before it enters the consumer market. Testing can be time-consuming and cost-intensive, as a full check of vehicle systems requires subjecting the car to all situations it will encounter in its intended use. Predicting the overall time for a vehicle to pass testing is difficult because each model requires a different test stand configuration.

Mercedes-Benz has been a pioneer of numerous vehicle safety and technology features and offers a range of custom options for each model. Every possible vehicle combination must undergo the same rigorous testing to ensure the vehicle is robust enough to keep occupants safe and withstand the rigors of daily use. The large array of options offered by Mercedes means a large number of tests for the company’s engineers to conduct. More tests result in more time spent on the test stand, increasing costs for Mercedes and generating carbon dioxide, a polluting greenhouse gas.

Efforts by Mercedes-Benz and other automakers to improve the efficiency of vehicle testing procedures have mainly focused on developing automated test systems. An automatic test system eliminates the variability inherent in human behavior, is safer than allowing humans in the driver’s seat, and results in an overall more efficient evaluation process.

The Mercedes-Benz “Greener Manufacturing” Competition hosted by Kaggle pursues a related approach to optimizing the vehicle testing process by encouraging the development of a machine learning model that is able to predict the testing duration based on a particular vehicle combination. The stated goal of the competition is to reduce the time vehicles spend on the test stand, which consequently will decrease carbon dioxide emissions associated with the testing procedure. Although the reduction in carbon dioxide may not be noteworthy on a global scale, improvements to Mercedes’s process can be passed along to other automakers or even to other industries, which could result in a significant decrease in carbon dioxide emissions.

Moreover, one of the fundamental tenets of machine learning is that the efficiency of current systems can be improved through the use of the massive quantities of data now routinely collected by companies. Kaggle is a website dedicated to that proposition, where companies create machine learning competitions with a stated objective and provide the public a dataset to apply to the problem. Competitions such as those offered on Kaggle, or the X-Prizes, have been demonstrated to spur innovation and help attract individuals and teams looking to hone their skills, participate in cutting-edge challenges, and perhaps win a modest prize.

For this project, I created a model to participate in the Greener Manufacturing competition. All required data for the competition was provided by Mercedes-Benz. The dataset was collected from thousands of safety and reliability tests run on a variety of Mercedes vehicles.

Problem Statement:

The objective of the Mercedes-Benz Greener Manufacturing competition is to develop a machine learning model that can accurately predict the time a car will spend on the test bench based on the vehicle configuration. The vehicle configuration is defined as the set of customization options and features selected for the particular vehicle. The motivation behind the problem is that an accurate model will be able to reduce the total time spent testing vehicles by allowing cars with similar testing configurations to be run successively. This problem is an example of a machine learning regression task because it requires predicting a continuous target variable (the duration of the test) based on one or more explanatory variables (the configuration of the vehicle). This problem is also a supervised task because the targets for the training data are known ahead of time and the model will learn based on labeled data. The steps to solving the problem are as follows:

  1. Download the data from the source, i.e., Kaggle. The link is given in a later section.
  2. Clean and process the data so that it can be fed to the machine learning models.
  3. Apply various algorithms/models to solve the problem.
  4. Optimize the models.
  5. Compare the results.
  6. Select an appropriate model/approach that fulfills all the requirements and gives the best score.
  7. Predict the values using the selected model and submit it to Kaggle for evaluation.

This is a Regression Problem.

The Metric to be used for evaluation is R².
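To make the metric concrete, here is a minimal sketch of how R² can be computed by hand. The function name `r2_score_manual` and the sample values are my own illustration, not part of the competition code; a library implementation such as scikit-learn's `r2_score` computes the same quantity.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up test durations in seconds, for illustration only.
y_true = [90.0, 100.0, 110.0, 120.0]
perfect = r2_score_manual(y_true, y_true)             # predictions match exactly
mean_only = r2_score_manual(y_true, [105.0] * 4)      # always predict the mean
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than the mean scores below 0.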

Mercedes-Benz will implement the best-performing model into the vehicle design and manufacturing process to increase the overall efficiency of the testing procedure while maintaining high safety and reliability standards.

Source of Data:

The above link can be used to get the Data from Kaggle.

Existing Approaches:



Above are some of the existing approaches to this problem.

My Improvements:

In all the above approaches, the authors have mostly focused on climbing the Public/Private Leaderboard of Kaggle, so some parts of their solutions are not very helpful in a real-world scenario. The following are the improvements that I have tried over the existing approaches:

  1. I have tried to reduce overfitting, as many of these approaches have an overfitting problem.
  2. I have removed all outlier points. Some of these approaches kept the outliers, which helped them get good Kaggle scores.
  3. I have used the categorical features, as some of the approaches dropped them from the data.
  4. Some of the approaches take a long time to train. For our business problem, there is no harm if training takes a few seconds or a few minutes, so I have kept that in mind and chosen algorithms accordingly.

Exploratory Data Analysis:

  1. This dataset contains a set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.
  2. The ground truth is labeled ‘y’ and represents the time (in seconds) that the car took to pass testing for each variable.
  3. File descriptions: Variables with letters are categorical. Variables with 0/1 are binary values.
  4. train.csv — the training set
  5. test.csv — the test set, you must predict the ‘y’ variable for the ‘ID’s in this file
  6. sample_submission.csv — a sample submission file in the correct format
  7. Number of Datapoints in Train Data: 4209
  8. Number of Datapoints in Test Data: 4209
  9. Total Number of Features: 377
  10. Number of Categorical Features: 8
  11. Number of Numerical Features: 369
Plot between the Index and the Y values. From the above plot we can see the existence of some outlier/ noisy points.

12. We can observe that most of the values lie between 90 and 120, so the typical testing time is 90 to 120 seconds.

13. So we have a fairly standard distribution here, centered almost exactly on 100.

14. The fact that ID is not equal to the row index suggests that the train and test sets were randomly sampled.

From the above plot we can see a very slight decreasing trend of y with respect to ID; perhaps cars later in the series took less time on the test bench. This makes ID a potentially useful feature when estimating y.
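The counts and the ID trend described above can be checked with a few lines of pandas. The sketch below runs on a tiny made-up frame that only mimics the layout of train.csv (an ID column, the target y, letter-coded categoricals and 0/1 binaries); the helper name `summarize_features` is hypothetical, and in practice the frame would be loaded from the competition file instead.

```python
import pandas as pd

# Tiny synthetic frame that mimics the column layout of train.csv.
# All values are made up for illustration.
df = pd.DataFrame({
    "ID": [0, 6, 7, 9, 13, 18],
    "y": [110.0, 108.0, 105.0, 104.0, 101.0, 99.0],
    "X0": ["k", "k", "az", "az", "t", "t"],
    "X10": [0, 1, 0, 1, 0, 1],
})

def summarize_features(frame, target="y"):
    """Count rows and split the feature columns into categorical and numerical."""
    features = frame.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in features.columns if c not in numerical]
    return {
        "rows": len(frame),
        "total_features": features.shape[1],
        "categorical": categorical,
        "numerical": numerical,
    }

summary = summarize_features(df)

# A negative Pearson correlation between ID and y indicates a decreasing trend.
id_y_corr = df["ID"].corr(df["y"])
```

Note that this helper treats ID as a numerical feature, so whether ID is counted changes the totals by one.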

NOTE: For exact Feature Engineering, please check my GitHub Repository. The Link is given at the end.

First Cut Solution:

  1. Remove all the outlier points and the unnecessary features from the dataset.
  2. Encode all the categorical features using Label Encoding or One-Hot Encoding.
  3. The numerical features are binary, i.e., 0 and 1, so use them as they are.
  4. After that, combine all the features.
  5. Apply feature engineering based on certain experiments, especially by looking at correlations and combining positively correlated features to create new features.
  6. Apply a baseline model. Since it is a regression problem, use Linear Regression as the baseline.
  7. Apply various other models (ensembles and stacking).
  8. Evaluate all the models and select the model that gives the best score.
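The first few steps above can be sketched roughly as follows. This is not my exact pipeline: the data is synthetic, the outlier cutoff `Y_MAX` is a hypothetical value chosen only for this example, and the correlation-based feature engineering step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data: one categorical and two binary features.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "X0": rng.choice(list("abc"), size=n),
    "X10": rng.integers(0, 2, size=n),
    "X11": rng.integers(0, 2, size=n),
})
df["y"] = 95 + 8 * df["X10"] + 5 * (df["X0"] == "a") + rng.normal(0, 2, size=n)
df.loc[0, "y"] = 265.0  # one artificial outlier

# Step 1: drop outliers. The threshold is an assumption for this sketch.
Y_MAX = 150.0
clean = df[df["y"] < Y_MAX]

# Steps 2-4: one-hot encode the categorical column, keep binaries as they are.
X = pd.get_dummies(clean.drop(columns="y"), columns=["X0"], dtype=int)
y = clean["y"]

# Step 6: Linear Regression baseline, scored with cross-validated R².
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
baseline_r2 = scores.mean()
```

Every later model can be scored with the same `cross_val_score` call, which keeps the comparison against the baseline fair.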

Models Used:

  1. Linear Regression: The first model, used to get a benchmark score, is Linear Regression. It is selected as the baseline model. Since we have 300+ features and it is a regression task, Linear Regression gives us a decent benchmark/baseline score to improve upon.

2. Elastic Net Regressor: From the baseline model and analysis, I had figured out that models tend to overfit this dataset easily. Keeping that in mind, I decided to use the Elastic Net Regressor. Elastic Net linearly combines the L1 and L2 regularization penalties, so I used it to see how much we can improve over the baseline model just by reducing overfitting.
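As a rough illustration of why regularization helps here, the sketch below compares plain Linear Regression with Elastic Net on synthetic data that has more features than samples. The data and the `alpha` and `l1_ratio` values are assumptions for this example, not the hyper-parameters from my solution.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic wide data: 100 binary features, only the first three matter.
rng = np.random.default_rng(4)
n_samples, n_features = 60, 100
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
y = 95 + 5 * X[:, 0] + 5 * X[:, 1] + 5 * X[:, 2] + rng.normal(0, 1, size=n_samples)

# Unregularized baseline.
ols_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

# Elastic Net: alpha sets the overall penalty strength,
# l1_ratio sets the mix between the L1 and L2 penalties.
enet = ElasticNet(alpha=0.2, l1_ratio=0.9, max_iter=10000)
enet_r2 = cross_val_score(enet, X, y, cv=5, scoring="r2").mean()

# The L1 part drives many coefficients to exactly zero.
enet.fit(X, y)
n_nonzero = int(np.count_nonzero(enet.coef_))
```

Comparing `ols_r2` with `enet_r2` shows how much of the gap is due to overfitting alone, and `n_nonzero` shows how many features the penalized model actually keeps.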

3. Random Forest Regressor: The Random Forest employs a number of techniques to reduce the variance of its predictions, thus reducing overfitting. It is also an ensemble, i.e., a Random Forest builds multiple decision trees and merges their predictions to get a more accurate and stable prediction than any individual decision tree. So, this model was used to see how it performs on our data. It is also reasonably interpretable, i.e., we can easily get the feature importances from the fitted model.

Note: For exact hyper-parameters used refer to the solution.
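A minimal sketch of reading feature importances from a Random Forest is shown below. The data is synthetic and the hyper-parameters are placeholders for illustration, not the ones from my solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic binary features; only X12 actually drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.integers(0, 2, size=(300, 10)),
    columns=[f"X{i}" for i in range(10, 20)],
)
y = 95 + 10 * X["X12"] + rng.normal(0, 1, size=300)

# Placeholder hyper-parameters, chosen only to keep the example fast.
rf = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
rf.fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top_feature = importances.index[0]
```

Impurity-based importances are quick to obtain but can be biased toward high-cardinality features, so they are best read as a rough ranking.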

4. XGBoost Regressor: XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. Since the Random Forest Regressor performed well on the data, I wanted to try the XGBoost Regressor as well, as XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

Note: For exact hyper-parameters used refer to the solution.

5. Extra Trees Regressor: Similar to the Random Forest Regressor, we have the Extra Trees Regressor (also known as Extremely Randomized Trees). To introduce more variation into the ensemble, it changes how the trees are built: split thresholds are drawn at random instead of being searched for the best value. This reduces the variance further compared with Random Forests. This model performed well on the data.

Note: For exact hyper-parameters used refer to the solution.

6. Stacking: If you have ever competed in a Kaggle competition, you are probably familiar with combining different predictive models for improved accuracy, which nudges your score up the leaderboard. The approach I have taken here is mostly experimental. I decided to combine all the ensembles that I have used: I stacked Random Forest, XGBoost and Extra Trees and used Ridge as the meta-regressor, with its regularization set to 0 so that it does not shrink the outputs of the stacked models.

Note: For exact hyper-parameters used refer to the solution.
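The stacking setup can be sketched with scikit-learn's `StackingRegressor`. To keep the example dependent on scikit-learn only, I have swapped in `GradientBoostingRegressor` where my solution uses XGBoost; the data is synthetic and all hyper-parameters are placeholders.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic binary features with two informative columns.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(400, 12)).astype(float)
y = 95 + 8 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)),
        # Stand-in for XGBoost in this sketch.
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("et", ExtraTreesRegressor(n_estimators=50, max_depth=4, random_state=0)),
    ],
    # Ridge with alpha=0 applies no shrinkage to the base-model outputs.
    final_estimator=Ridge(alpha=0.0),
    cv=3,  # out-of-fold predictions are used to train the meta-regressor
)
stack.fit(X_train, y_train)
stacked_r2 = stack.score(X_test, y_test)
```

Training the meta-regressor on out-of-fold predictions matters here: fitting it on in-sample predictions would reward whichever base model overfits the most.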

Model Comparison:

The above table gives the comparison of the models. For more information on the categorical encoding and the features, please check my GitHub Repository.
  1. From the above table we can see that the Stacking Model with Correlation Features and One-Hot Encoded categorical features, reduced to fewer dimensions using K-Best, has performed the best with a CV Score of 0.616546.
  2. But for the final predictions, the Stacking Model with Correlation Features and Label Encoded categorical features has given the best Kaggle Score, with a Private LB Score of 0.55316.
The 0.55316 Private LB Score corresponds to the 49th position on the Leaderboard.
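For readers unfamiliar with K-Best, the sketch below shows the general idea using scikit-learn's `SelectKBest`. The scoring function `f_regression`, the value of `k` and the synthetic data are assumptions for this example; see the repository for the settings actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic binary features; columns 3 and 17 are the informative ones.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 40)).astype(float)
y = 95 + 8 * X[:, 3] + 6 * X[:, 17] + rng.normal(0, 1, size=300)

# Keep the k features with the highest univariate F-statistic against y.
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(X, y)

# Column indices of the features that survived the selection.
kept = np.flatnonzero(selector.get_support())
```

Because the selection looks at y, it should be fitted inside each cross-validation fold (for example within a `Pipeline`) to avoid leaking information into the CV score.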

Future Work:

  1. In my solution, I have used PCA to generate new features. The same can be done with other dimensionality reduction techniques such as Truncated SVD (TSVD).
  2. Implementing Neural Nets can also give a different approach to the problem.
  3. Further and better hyper-parameter tuning and more cross-validation can improve results.
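As a starting point for the first item, here is a minimal sketch of generating decomposition features with both PCA and Truncated SVD and appending them to the original matrix. The number of components `N_COMP` and the synthetic data are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Synthetic binary feature matrix standing in for the encoded training data.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 30)).astype(float)

N_COMP = 5  # hypothetical number of components

# PCA centers the data before projecting; Truncated SVD does not,
# which also lets it work directly on sparse one-hot matrices.
pca_feats = PCA(n_components=N_COMP, random_state=0).fit_transform(X)
tsvd_feats = TruncatedSVD(n_components=N_COMP, random_state=0).fit_transform(X)

# Append the new components to the original features.
X_augmented = np.hstack([X, pca_feats, tsvd_feats])
```

When this is applied to the real data, the decompositions should be fitted on the training set and then used to transform the test set, so both share the same projection.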



Github Repository:

LinkedIn Profile: