A Step-by-Step Case Study on the Mercari Price Suggestion Challenge by Kaggle

Original article was published by Rajarshi Deka on Artificial Intelligence on Medium


A Step-by-Step Case Study on the Mercari Price Suggestion Challenge by Kaggle

Business Problem

Description

Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace.

In this competition, Mercari’s challenging us to build an algorithm that automatically suggests the right product prices. We are provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

Summary

The main purpose of the case study is to predict the price of an item with the help of the given features. For this, we shall be using the dataset as provided in the Mercari Price Suggestion Challenge. It is a regression based machine-learning problem which expects an output of continuous type (not a categorical type as seen in classification based problems). Since the error metric that we need to evaluate is RMSLE, so we need to convert the output target (price) to its corresponding logarithmic version and then evaluate the RMSE

This case study will help us to understand the optimum price that a seller can expect for his/her product based on the various descriptions of the item as provided by the seller

Objectives

  • This is a regression based machine learning problem where we need to predict the price using the given input features
  • The error metric to be evaluated is RMSLE (Root Mean Square Logarithmic Error)

Evaluation Metric

The evaluation metric for this competition is RMSLE (Root Mean Square Logarithmic Error)

where:

  • ϵ is the RMSLE value (score)
  • n is the total number of observations in the (public/private) data set
  • pi is your prediction of price
  • ai is the actual sale price for i
  • log(x) is the natural logarithm of x

Use of ML/DL

As a part of this case study, we have experimented with a number of regression based machine learning as well as deep learning models. Although, some were able to generate a very good RMSLE score while the rest were not satisfactory. Following is the list of models which we used:

Machine Learning Models

  • Linear Regression: We used linear regression as a first cut approach model to get a basic idea on the RMSLE score that is being generated
  • Ridge Regression: Since ridge regression adds a penalty term thereby reducing overfitting which linear regression is unable to do so, hence we experimented with this model also to improve the score
  • Support Vector Machine (SVM): We were able to observe that our training dataset had a high dimension and so SVM was used to avoid the difficulties that were faced by linear functions in high dimensional feature space
  • Decision Tree: As our training dataset consists of more than a million datapoints and since decision trees has a low training time complexity so we decided to use this model as well
  • Random Forest: Random Forest is based on the bagging algorithm and uses Ensemble Learning technique. It creates as many trees on the subset of the data and combines the output of all the trees. In this way it reduces overfitting problem in decision trees and also reduces the variance and therefore improves the accuracy. So we decided to experiment with it
  • XGBoost: XGBoost is highly parallelizable, quick to execute, and typically out performs other algorithms. So, we used it in our case study too
  • Light GBM: Since LGBM itself uses XGBoost as a baseline model and is almost 7 times faster than XGBoost which is a much better approach when dealing with large datasets, so we decided to check the score using this model as well
  • Stacking (Ensemble): Many a times, we have observed that stacking improves the score to a considerable amount and so we decided to use the top 3 performing models as the Base Models and subsequently use Linear Regression as the Meta Model thereby trying to improve the score

Deep Learning Models

  • LSTM+CNN: Although, our case study was strictly confined to machine learning models, but we decided to experiment with an LSTM followed by a 1D ConvNet to observe the score and thereby try to monitor the result.

Source of Data

As a part of this case study, we are using the train.tsv file as provided by Kaggle itself to train our models. On closer analysis, it was observed that the specified file consists of the following columns:

  • train_id – the id of the listing
  • name – the title of the listing. Note that we have cleaned the data to remove text that look like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
  • item_condition_id – the condition of the items provided by the seller
  • category_name – category of the listing
  • brand_name
  • price – the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn’t exist in test.tsv since that is what you will predict.
  • shipping – 1 if shipping fee is paid by seller and 0 by buyer
  • item_description – the full description of the item. Note that we have cleaned the data to remove text that look like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]

Existing Approaches

  • As a part of the feature engineering technique, it was observed that the category_name is encoded into multiple hierarchical levels. So each of the successive hierarchies were treated as categories and their respective subcategories. Once splitted, histograms were used to understand the product distribution in terms of quantity for each category and their subcategories. Consequently, these were used as categorical features
  • The length of each of the sentences from the item_description column after text-preprocessing were used as a feature
  • All the missing values were replaced with the word ‘missing’ especially in case of brand_name, category_name and item_description
  • XGBoost is like a standard algorithm that is being used by almost everybody to solve similar case studies and so in this case also, there is no exception
  • Hyperparameter-tuning was observed to be the mandatory criterion in almost all existing case studies while training any machine learning model

Improvements

  • Since the train_id feature is a column comprising of unique serial numbers and hence does not hold much significance during training, so have dropped it
  • We have added an additional name_length feature similar to the item_desc_length feature
  • The sentiment scores had correlation values much lesser than 0.5 with respect to price. So we decided to drop these features from our dataset
  • We observed a lot of null values in the brand feature and so we tried to check the name feature for any possiblity of finding the null values.
  • In the item_description column, we replaced those cells marked as ‘No description yet’ along with the null values as ‘missing
  • Apart from training with XGBoost, we also tried out Light GBM to get a better accuracy keeping the low time-complexity of training models using LGBM in mind
  • While performing hyperparameter-tuning, instead of using Random Search CV or Grid Search CV, we used the conventional iteration technique. This improved the training time thereby requring less time to train the models
  • We also experimented with Stacking which is an Ensemble technique to try and improve the score

Exploratory Data Analysis

Preliminary Investigation

The dataset as provided by Kaggle consists of 1482535 rows and 8 columns and once loaded is displayed as follows:

Pandas provides us with two functions namely describe() and info() to run a basic preliminary investigation of the loaded dataframe

The above representation provides some valueable insights which are as follows:

  • The highest price that an item can have is 2009 USD
  • 25% of the items have a price of less than 10 USD, 50% of them with less than 17 USD and 75% of all the items of less than 29 USD.
  • The item_condition_id category is basically ordinal in nature with the lowest rating being 1 and the highest rating being 5
  • The shipping category being nominal in nature has two values namely 0 and 1

Following are some of the insights as obtained from the above representation:

  • Although, the total number of rows was found to be 1482535, but we can clearly see that the total number of non-null values of category_name (1476208), brand_name (849853) and item_description (1482531) are signinficantly less and hence we can conclude that the dataset has null values
  • Also the item_condition_id and shipping being categorical features, takes in integer type value

Price

The following code-snippet generates the distribution of price using a histogram:

plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(10,5))
plt.hist(df['price'], bins=20, range=[0,100])
plt.title("Histogram Representation of Price Feature")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

From the graph, we can conlude that the distribution of price follows right-skewness and as per the problem statement, we need to compute the RMSLE (Root Mean Square Logarithmic Error). Hence, we need to transform the price feature to its logarithmic form which can be achieved by means of the following code-snippet:

plt.figure(figsize=(10,5))
plt.hist(np.log(df['price']+1), bins=20, range=[0,10])
plt.title("Histogram Representation of Price Feature")
plt.xlabel("log(price+1)")
plt.ylabel("Count")
plt.show()

Here, we observe that the skewness of the distribution reduces once we apply log transformation. Also, we need to add a value 1 to each of the price feature values since the minimum price was observed to be 0 USD and log (0) is undefined

Using a Box-Plot, we can observe that 50% of all the items lies within the range of 10 USD to 29 USD. Also majority of the items shows the enlisted price to be somewhere around 18 USD which is shown as follows:

Price-Shipping

As analysed, the list of unique categories of the shipping feature are 1 (fee is paid by seller) and 0 (fee is paid by buyer)

From the histogram as shown below, we can conclude that if the price of the item is less, then in most cases the shipping charges is being payed by the seller and if the price is high, then in most cases the shipping charges is being payed by the buyer:

Price-Item Condition ID

We also observed that the list of unique categories of the item_condition_id feature are 1, 2, 3, 4 and 5

From the plot as shown below, we cannot conclude much as we can observe a lot of overlapping among the 5 categories. However, we can say that there is a high imbalance among the item_condition_id categories

We further analysed the percentage of occurence for each of the five categories to understand the distribution better as shown below:

Thus we can say that the category_1 is the most dominant with 43.2% and category_5 trails with the least at 0.2%

A look at the Box-Plot also reveals that the 50th percentile for each of the 5 categories does not show much of a variation that can be useful to us

Item Description

Executing the following code-snippet generates a Word-Cloud that gives us a high level overview of the most frequently occuring words in the item_description feature:

word_cloud = WordCloud(background_color='white', max_words=100, stopwords=set(STOPWORDS))
word_cloud.generate(" ".join(df['item_description'].astype(str)))
plt.figure(figsize=(15, 10))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

From the Word-Cloud plot, we can get a basic view of the 100 most frequently occuring words in the item_desription feature. Since the word ‘Brand’ is being displayed with the largest font-size, hence it occurs the most followed by words like ‘new’, ‘free’ etc.

Item Description Length-Price

The following code-snippet helps us to generate the distribution of price with respect to the length of sentences in the item_description feature

plt.figure(figsize=(15,10))
x = df['item_desc_length']
y = df['price']
plt.scatter(x,y)
plt.xlabel('Item Desription Legth')
plt.ylabel('log(price+1)')
plt.title('Item Description Length vs Log(price+1)')
plt.show()

Here we observe that as the length of the item-description increases, the price seems to decrease. Thus, we can say that products with shorter desrcriptions tends to cost more.

Brand Name

Performing a value-count on the brand_name feature generates the result as follows:

Here we can observe that there are a total of 4810 unique categories of brand_name. But for our convenience, we shall try to observe the top 10 brand names with the help of a pie plot as follows:

From the pie-plot, we an conclude that 71% of the data is missing for brand_name. However, some of the top most frequently occuring brand names are PINK(6%), Nike(6%), Victoria’s Secret(5%) and LulaRoe (3%)

However, brands like PINK, Victoria’s Secret and LulaRoe specializes in products designed for ladies specifically and hence we can say that the sale of female products is much higher as compared to male.

Furthermore, all three of these brands are related to the clothing line industry and hence we get the intuition that the higher sale of products is for women-apparels/accessories

Category Name

A similar value-count on the category_name feature reveals the following insights:

Looking at the data for the category_name feature, we observe that the forward_slash actually acts as a delimiter and so the category names can be split into a number of sub-categories. Our first task is to determine how many such sub categories are present.

For this, we execute the following line of code:

# https://stackoverflow.com/questions/21572870/matplotlib-percent-label-position-in-pie-chartdf['category_name'].value_counts()[:50].plot(kind='pie', figsize=(15, 15), autopct='%1.0f%%');

From the pie-plot as shown below, we can observe that the top 50 unique category names can be split into 3 parts and so we shall consider main_categ, sub_categ_one and sub_categ_two

A closer analysis of each of the three subcategories of the category_name feature presents the following insights:

As expected from the analysis of Brand Name, we observe here that 45% of the total sale was related to Women products followed by Beauty related products which constitutes 14% of the total sales

Here also we observe a similar trend to women related products when considered the top 10 unique values from the sub_categ_one feature. However, it can be observed that apparels, cosmetics, shoes and jewelleries are the major choices for women related products

The sub_categ_two feature gives us even a deeper clarity on the items that most of the women prefers.

Here, if we generate a Word-Cloud plot on the sub_categ_two feature, then we can get a clearer picture of most of the frequently sold items. Also, majority of the words with bigger font-size was observed to be female-oriented, with visible words related to electronics and toys can also be observed as shown below:

First Cut Solution

  • As it was strictly mentioned in the case study objective that we need to evaluate the RMSLE, so the very first thing we did was to transform the price feature into it’s log form. However, it was observed that the minimum price that the dataset consist of was 0 USD and so we needed to add ‘1’ before applying log over it. As such, we can now evaluate the RMSE
  • Also it was observed from many case studies that sentiment scores somestimes plays an important role if the description is present. So the sentiment scores were computed for each of the item-description thereby creating four new features ss_pos, ss_neg, ss_neu and ss_com. However, it should be kept in mind that the sentiment analysis should be performed before text-preprocessing of item-description. It is so, because if we consider an example say ‘Good’, then the ss_pos value of ‘GOOD’ is higher than that of ‘good’
  • Text-preprocessing is another important step we carried out in order to modify the name and item_description feature to create the name_preprocess and item_desc_preprocess feature respectively. Here we performed decontraction, removal of alphanumeric characters as well as stopwords and also conversion to lower case which will help us to perform vectorization over it
  • Another thing we observed was that, as the length of the item-description decreases, the cost of the product seems to increase. Since we observe a fair amount of correlation between the price and length of the item-desription, so we consider item_desc_length as an additional feature in the dataset
  • As observed from the EDA carried out for category_name, we can assume that the three features main_categ, sub_categ_one and sub_categ_two are basically in a hierarchical order. Hence, the first category name can be considered as the main category and the rest of the subcategories can be said to carry less weightage. So, we discard sub_categ_one and sub_categ_two and rename the main_categ as category feature in the dataset.
  • Also for null-value removal of categorial features namely category and brand, we took the help of model-imputation technique whereby we have used the KNN model
  • On training this dataset after featurization using the selected models, we observe the following scores as generated using the Kaggle kernel:
  • Here we observe that our models have performed very poorly with the Kaggle dataset inspite of some of the models having relatively good scores with the Test Dataset.
  • As such we modified our original dataset with some new features in an attempt to generate better Kaggle scores

Final Approach

  • We have added an additional name_length feature similar to the item_desc_length feature
  • As opposed to the first cut solution, here we are considering all the three divisions of categories as main_categ, sub_categ_one and sub_categ_two
  • For the sentiment scores, we observe that each of ss_pos, ss_neg, ss_neu and ss_com has correlation values of -0.021, -0.022, 0.035 and 0.002 respetively with respect to price. So we decided to drop these features from our dataset since none of them have a correlation value closer to 0.5
  • In case of brand, we observe a lot of null values and so we tried to check the name feature for any possiblity of finding the null values. With this technique, we were able to fill 128,886 empty brand values. As for the remaining, we marked it as ‘missing
  • Also we were able to detect 82,517 cells marked as ‘No description yet’ in the item_description column and so we replaced those cells along with the null values by marking them as ‘missing
  • As for the null values in case of category column, we simply marked them as ‘missing
  • Furthermore during feature-transformation of name_preprocess and item_desc_preprocess using CountVectorizer and TfidfVectorizer, we increased the max_features to 10,000 and 100,000 respectively in a bid to improve the scores. Since, this resulted our dataset to be of very high dimension, so due to computational limitations we trained using some of the better performing algorithms which we observed in the first cut solution and discarded the rest

Model Explanation

Stacking

Stacking is an ensemble machine learning algorithm. It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms. It can harness the capabilities of a range of well-performing models on a regression task and make predictions that have better performance than any single model in the ensemble.

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

In our case study, we have observed that the three most best peforming models are LGBM, Ridge Regression and SVM. So, we shall be considering these three models as our Base-Models for buiding the stacking-ensemble model. As for the Meta-Model, we shall be considering a Linear Regression model

Here, instead of using the predefined stacking class from sklearn/mlextend library, we shall be implementing our own mechanism for the stacking-ensemble model as shown:

y_pred_ridge = ridge_reg_model_tfidf.predict(X_train_merged_tfidf)
y_pred_lgbm = lgbm_model_tfidf.predict(X_train_merged_tfidf)
y_pred_svm = svm_model_tfidf.predict(X_train_merged_tfidf)
y_pred_train = np.hstack((y_pred_ridge, y_pred_lgbm.reshape(-1, 1), y_pred_svm.reshape(-1, 1)))
linear_reg = LinearRegression(n_jobs=-1)
linear_reg.fit(y_pred_train, y_train)
y_pred_ridge_test = ridge_reg_model_tfidf.predict(X_test_merged_tfidf)
y_pred_lgbm_test = lgbm_model_tfidf.predict(X_test_merged_tfidf)
y_pred_svm_test = svm_model_tfidf.predict(X_test_merged_tfidf)
y_pred_test = np.hstack((y_pred_ridge_test, y_pred_lgbm_test.reshape(-1, 1), y_pred_svm_test.reshape(-1, 1)))y_pred = linear_reg.predict(y_pred_test)

LSTM+CNN

Following is the list of layers which we used to create our neural network as a part of the first cut solution:

  • Embedding: This layer turns positive integers (indexes) into dense vectors of fixed size.
  • LSTM: Long short-term memory is an artificial recurrent neural network (RNN) architecture composed of a cell, an input gate, an output gate and a forget gate.
  • Conv1D: This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs
  • Flatten: This layer flattens the input and does not affect the batch size.
  • Dense: It is a densely-connected neural network layer.
  • Dropout: This layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting

Here we used an LSTM layer followed by a 1D ConvNet on the item_description feature which we initially passed it through an Embedding layer. For the categorical features comprising of brand_name and category_name, we simply passed them through an Embedding layer. As for the remaining numerical features, we concatenated them and passed it through a Dense layer

The output from all the textual, categorical and numerical features are then concatenated and passed through multiple pairs of Dense and Dropout layers with the final output layer being set as units=1 and activation=linear

Following represents the architecture of the neural-network along with the input and output shape at each successive level:

Code Snippets

  • The text_preprocess(text, eng_stopwords) function is being used for the name and item_description feature to perform text-preprocessing
def text_preprocess(text, eng_stopwords):'''Function to perform text preprocessing'''text = decontracted(text) 
text = re.sub("[\-\\\n\t]", " ", text)
text = re.sub("[^A-Za-z0-9]", " ", text)
text = re.sub('\s\s+', ' ', str(text))
text = text.lower()
text = " ".join([word for word in text.split() if word not in eng_stopwords])
return text
  • The w2v_vectors(preprocessed_essays, glove_words, model) function is being used for computing the word-to-vec representation of each of the sentences in the item_description feature after text-preprocesssing
def w2v_vectors(preprocessed_essays, glove_words, model):'''Function to compute the Word2Vec'''w2v_vectors = []
for sentence in tqdm_notebook(preprocessed_essays):
vector = np.zeros(100)
for word in sentence.split():
if word in glove_words:
vector += model[word]
w2v_vectors.append(vector)
return np.array(w2v_vectors)
  • The predict_scores(model, feature_merged_tfidf, df_test) function is being used to predict the the target values of the test-dataset using the selected model
def predict_scores(model, feature_merged_tfidf, df_test):'''Function to generate the predicted scores'''y_pred = model.predict(feature_merged_tfidf)
y_pred_antilog = pd.Series([np.exp(i) for i in y_pred])
test_id = pd.Series(df_test['test_id'].values)
df = pd.DataFrame({'test_id' : test_id, 'price' : y_pred_antilog})
return df
  • The text_length(text, no_desc_string) function is being used to determine the length of the sentences for which there is a description present or else it returns 0
def text_length(text, no_desc_string):'''Function to compute the text length only for items with descriptions'''try:
if text in no_desc_string:
return 0
else:
return len(text.split())
except:
return 0

Final Models Comparison

As per the new techniques in feature engineering that we have implemented, following is the list of RMSLE values of the test dataset along with the respective Kaggle scores

Here we observe that LGBM performs the best with an RMSLE value of 0.43206 on the Test Dataset and 0.44016 on the Kaggle kernel

Future Work

  • Although, we designed a simple neural network as a first cut solution, attempts can be made to develop more complex and deeper neural networks to improve the evaluation metric
  • While computing the Tf-Idf scores of the item_description feature, we took the n_gram range with a maximum value of 3. However, we can try to increase this range in a bid to improve the score
  • Hyperparameter tuning involving more wider range of values can also lead us in generating better performing models
  • We can also try to increase the number of base models in case of stacking mechanism which might help us in improving the score

References

Conclusion

All files related to the codes for the end-to-end implementation of the whole case study is available on Github which can be accessed by clicking on GitHub and in case of any further query kindly reach out to me on LinkedIn