Source: Deep Learning on Medium
Feature Engineering & Exploratory Data Analysis:
Data accumulation involves collecting data from multiple sources in a centralized, scalable & accessible location. Our offline data pipelines pull data from MongoDB and Snowflake.
While inspecting a pre-owned car, we capture car parameters at six different levels:
- Car Details: Here we capture basic car parameters such as Make, Model, Variant, Fuel Type, Manufacturing Year & Month, Ownership, Odometer Reading, Insurance Type, RC Condition, Road Tax Information, etc.
- Air Conditioning: Availability and current condition of components such as Cooling, Heating and Climate Control, along with their working status.
- Exterior and Tyre: Current condition of parts such as Bonnet, Apron, Pillars, Rooftop, Bumpers, Doors and Tyres, labelled as broken, dented, scratched, repainted, repaired, etc., along with the severity level of each label (type of damage). We also feed the captured images into an image-processing module to extract and highlight the damaged parts on the auction platform.
- Engine and Transmission: Current condition of parts such as battery, clutch, coolant, engine, engine oil and gear shifting. We run audio processing on the engine sound, in both accelerated and idle conditions, to detect the engine condition.
- Steering, Suspension and Brakes: Current condition of Brakes, Steering and Suspension.
Finally, post de-normalization, we end up with ~600 parameters in the inspection report of any car.
Let’s dive into some of the key attributes that stood out as significant during the EDA stage, starting with univariate distributions, e.g. the distribution of odometer reading across car age. A boxplot gives a better feel for the ranges and possible outliers (quite a few vehicles have very low odometer readings!).
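The boxplot above can be reproduced with a few lines of Pandas. The sketch below uses synthetic data (the column names, sample sizes and noise levels are all illustrative, not the real inspection dataset):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# Synthetic inspection data for illustration only:
# odometer grows roughly linearly with age, plus noise.
rng = np.random.default_rng(42)
age = rng.integers(1, 11, size=500)
odometer = (age * 12_000 + rng.normal(0, 8_000, size=500)).astype(int)
odometer = np.clip(odometer, 500, None)
df = pd.DataFrame({"age": age, "odometer": odometer})

# One box per year of age, as in the EDA above.
df.boxplot(column="odometer", by="age")
plt.suptitle("")
plt.title("Odometer reading by car age")
plt.savefig("odometer_by_age.png")

# Group medians confirm the expected trend: older cars read higher.
medians = df.groupby("age")["odometer"].median()
```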
As expected, newer cars have lower odometer readings, while older cars read higher.
Similarly intuitive is the historical price distribution by car age at the time of inspection (and auction).
Let’s look at price by car brand. The plot below shows very large price variation within a given make; this variation captures the spread of manufacturing years as well as the different variants (from inexpensive hatchbacks to premium SUVs) within a brand.
Now let’s get into a more granular view: bi-variate cuts against Make for the true insights! Consider the price distribution by make per year, since it is important to show just how much the price changes over time. And while prices tend to decrease with age, the rate of decrease varies significantly across makes!
Since we have more than 600 variables, it became imperative to reduce the dimensionality and create new aggregated variables.
Listing a few randomly selected features for illustration:
Rating: For the five subsections, viz. A/C, Exterior & Tyre, Engine & Transmission, Electricals & Interior, and Steering, Suspension & Brakes, we predict a rating in a separate ML module. This rating is then fed into all the platforms showing the car details.
OdoDeviation: This is the deviation of odometer/year from the typical expected odometer/year for the given make, model & variant combination. This variable penalizes cars that have run more kilometres than expected for that particular variant.
meanPriceMM: Historical rolling mean price for all Make Model combinations.
CurrentCondition Fields: CountDentsApron, CountDentsPillar, CountScratchesApron, CountRepairsApron, totalPanelsRepaired, totalPanelsRepainted, roadTaxDuration, CountPanelsHighSeverityDamage etc
… and others like Body Type, Transmission Type, Age, TyreTreadDepth etc.
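Derived features like OdoDeviation and meanPriceMM are straightforward groupby transforms. Here is a minimal Pandas sketch on hypothetical records; the column names, the use of the median as the "typical" value, and the rolling window size are all illustrative assumptions, not the production definitions:

```python
import pandas as pd

# Hypothetical inspection records (not the production schema).
df = pd.DataFrame({
    "make":     ["Maruti", "Maruti", "Honda", "Honda", "Honda"],
    "model":    ["Swift",  "Swift",  "City",  "City",  "City"],
    "variant":  ["VXi",    "VXi",    "VX",    "VX",    "VX"],
    "age":      [2, 4, 3, 5, 4],
    "odometer": [30_000, 38_000, 45_000, 80_000, 52_000],
    "price":    [450_000, 380_000, 620_000, 500_000, 560_000],
})

# OdoDeviation: deviation of odometer/year from the typical (here: median)
# odometer/year for the same make-model-variant combination.
df["odo_per_year"] = df["odometer"] / df["age"]
typical = df.groupby(["make", "model", "variant"])["odo_per_year"].transform("median")
df["OdoDeviation"] = df["odo_per_year"] - typical

# meanPriceMM: rolling mean of historical price per make-model pair
# (window of 2 is an illustrative choice).
df["meanPriceMM"] = (
    df.groupby(["make", "model"])["price"]
      .transform(lambda s: s.rolling(window=2, min_periods=1).mean())
)
```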
Background removal and Embedding Extraction:
While photographing the car, an excessive amount of background noise, including trees, other vehicles and people, also gets captured. A prerequisite for any robust image processing and embedding extraction is to remove these distractions and focus only on the image features we want to work on. We used an object detection architecture to detect the ROI and then performed semantic segmentation to remove the remaining background noise.
The images captured for all four sides of the car go through the process stated above. We then flatten each 2D RGB image into a one-dimensional 1×10 representation, which adds 40 columns in total and gives a summarized visual representation of the car body as a predictor in the pricing engine.
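The shape bookkeeping of that step (four sides × a 1×10 vector = 40 columns) can be sketched in plain NumPy. Note the pooling below is a stand-in: the real pipeline derives the 1×10 embedding from the segmented image, and how that summary is computed is not specified here.

```python
import numpy as np

def side_embedding(image: np.ndarray, dims: int = 10) -> np.ndarray:
    """Collapse a background-removed RGB image (H, W, 3) into a 1x10
    summary vector. Mean-pooling over vertical strips is used here purely
    for illustration; the actual embedding method may differ."""
    gray = image.mean(axis=2)                       # (H, W) grayscale
    strips = np.array_split(gray, dims, axis=1)     # 10 vertical strips
    return np.array([s.mean() for s in strips])     # shape (10,)

# One image per side of the car (front, rear, left, right), assumed
# already cropped and segmented as described above.
sides = [np.random.rand(224, 224, 3) for _ in range(4)]

# Concatenating four 1x10 vectors yields the 40 extra predictor columns.
features = np.concatenate([side_embedding(img) for img in sides])
```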
Prototyping and Training:
Tools used: Python machine learning libraries, namely NumPy, Pandas, Scikit-learn, XGBoost, TensorFlow, Keras, OpenCV and Matplotlib.
As with the training dataset above, we often need to perform additional data processing before we can fit a model:
· Data Imputation: We need to check whether any data is null/missing, and whether it is missing at random or not. If not, we need to understand the cause and perform the relevant missing-value imputation using statistical features.
· Encoding Categorical Variables: We perform one-hot encoding for categorical variables with few categories. For high-cardinality fields, we encode by the frequency count of each category.
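Both steps can be sketched in a few lines of Pandas. The columns and fill strategies below (median for numerics, mode for categoricals) are illustrative assumptions; the actual imputation depends on the missingness analysis described above:

```python
import pandas as pd

# Toy frame standing in for the training data.
df = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", None, "Petrol"],   # low cardinality
    "model":     ["Swift", "City", "Swift", "i20"],       # high cardinality in practice
    "odometer":  [30_000, None, 45_000, 52_000],
})

# Imputation: fill numeric gaps with the median, categoricals with the mode
# (only after checking how the values are missing, as noted above).
df["odometer"] = df["odometer"].fillna(df["odometer"].median())
df["fuel_type"] = df["fuel_type"].fillna(df["fuel_type"].mode()[0])

# One-hot encode the low-cardinality categorical.
df = pd.get_dummies(df, columns=["fuel_type"])

# Frequency-encode the high-cardinality field.
df["model_freq"] = df["model"].map(df["model"].value_counts())
```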
Performing Model Selection:
After trying different regression models on our dataset, it is time to summarize all the results we have obtained.
Clearly the XGB model outperformed the RF model, and hence we took it as our final model.
Below are some of the most important features found by the XGB model.
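The comparison loop and the feature-importance extraction look roughly like this. The sketch uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the XGBoost scikit-learn API is analogous), and the data is synthetic, so the scores and rankings are for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Synthetic pricing data standing in for the ~600 inspection features:
# only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = 500_000 + 40_000 * X[:, 0] - 25_000 * X[:, 1] + rng.normal(0, 5_000, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gradient boosting stands in for XGBoost here.
models = {
    "RF":  RandomForestRegressor(n_estimators=100, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    scores[name] = mean_absolute_percentage_error(y_te, m.predict(X_te))

# Top predictors from the boosted model, as listed above in the post.
importances = models["GBM"].feature_importances_
top = np.argsort(importances)[::-1][:3]
```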
Evaluation & Performance:
We then evaluated our model by performing batch prediction on all the historical transaction data in the test set for the past one year. Our model outperformed manual pricing by a wide margin.
As depicted in the plot below, we got 50% of the cases within an error margin of 5% and 80% of the cases within an error margin of 10%.
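The "share of cases within an error margin" metric is easy to compute with NumPy. The numbers below are made up to demonstrate the calculation, not the real test-set results:

```python
import numpy as np

def within_margin(y_true: np.ndarray, y_pred: np.ndarray, margin: float) -> float:
    """Fraction of predictions whose relative error is within `margin`."""
    rel_err = np.abs(y_pred - y_true) / y_true
    return float((rel_err <= margin).mean())

# Illustrative prices only (relative errors: 2%, 8%, 1.25%, 2.2%).
y_true = np.array([500_000, 300_000, 800_000, 450_000])
y_pred = np.array([510_000, 276_000, 790_000, 460_000])

pct_5  = within_margin(y_true, y_pred, 0.05)   # share within 5% error
pct_10 = within_margin(y_true, y_pred, 0.10)   # share within 10% error
```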
Productionizing the model:
Now it was time to take the model into production for real-time pricing. We developed a microservice using the Flask framework in Python and deployed it on an AWS t2.large instance.
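A minimal version of such a Flask microservice looks like the sketch below. The `/price` route name, the payload shape and the constant-returning `predict_price` stub are all hypothetical; the real service would load the serialized model at startup and run inference per request:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_price(features: dict) -> float:
    """Stub standing in for the trained pricing model; the production
    service would deserialize the model once and call predict() here."""
    return 500_000.0

@app.route("/price", methods=["POST"])
def price():
    # Accept a JSON payload of inspection features, return a priced quote.
    payload = request.get_json(force=True)
    return jsonify({"price": predict_price(payload)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```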
We use Bitbucket for version control, Jenkins for deployment and AWS CloudWatch for log monitoring. On a typical Sunday we receive 15–20 requests per minute during peak hours.
Below is the API architecture for the Profecto module (pricing engine).
Pricing is a critical component of our business, enabling more efficient auctions, better inventory control and financing risk management. We at CARS24 are constantly fine-tuning our models, augmenting them with new features and trying to address as many outliers as possible.
The introduction of a deep learning module (Profendus) to refine the input universe is a relatively recent change that has had a massive impact on the overall accuracy of our models, and we have barely scratched the surface there!