Market Basket Analysis with Pandas

Original article was published by Soner Yıldırım on Artificial Intelligence on Medium

Market basket analysis focuses on which items are purchased together. One common technique is association rule learning, a machine learning method for discovering relationships among variables. The Apriori algorithm is frequently used for association rule learning.

We will not go into detail about the Apriori algorithm or association rule learning in this post. Instead, I will show you a simple way to check which items are frequently purchased together.

Let’s first create a dataframe that contains the list of items for each shopping trip.

# Collect the items of each (customer, date) shopping trip into one list
items = groceries.groupby(['Member_number', 'Date'])\
.agg({'itemDescription': lambda x: x.tolist()}).reset_index()

The three items purchased by customer 1000 on 2014-06-24 are whole milk, pastry, and salty snack.
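As a minimal sketch of this grouping step, the snippet below builds a small hypothetical dataframe (the rows are made up for illustration; only the column names follow the dataset used in this post) and collects the items of each shopping trip into a list. It passes the built-in list as the aggregation function, which should give the same result as the lambda above.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the article's dataset.
groceries = pd.DataFrame({
    "Member_number": [1000, 1000, 1000, 2000, 2000],
    "Date": ["2014-06-24", "2014-06-24", "2014-06-24", "2015-01-05", "2015-01-05"],
    "itemDescription": ["whole milk", "pastry", "salty snack", "soda", "whole milk"],
})

# One row per (member, date) pair, with that trip's items collected into a list.
items = (
    groceries.groupby(["Member_number", "Date"])
    .agg({"itemDescription": list})
    .reset_index()
)
print(items)
```

Each row of items now represents one shopping trip, which is the shape the rest of the post relies on.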

We need to determine which items frequently appear together in the same row of the “itemDescription” column.

One way is to create combinations of the items in each row and count the occurrences of each combination. Python’s itertools module can be used to accomplish this task.

Here is an example for the first row.

import itertools
list(itertools.combinations(items.itemDescription[0], 2))

There are 3 items in the first row, so we have 3 pairs. The itertools.combinations function does not pair an element with itself (e.g. (‘pastry’, ‘pastry’)), which is what we need.

The following code will do this operation on each row and add the combinations to a list.

combinations_list = []
for row in items.itemDescription:
    combinations = list(itertools.combinations(row, 2))
    combinations_list.append(combinations)

We have created a list of lists, with one inner list of pairs per shopping trip.

We can flatten this list of lists with the explode function of pandas, but we first need to convert it to a pandas series.

combination_counts = pd.Series(combinations_list).explode().reset_index(drop=True)

We can now count the number of occurrences of each combination using the value_counts function and look at the ten most frequent combinations.
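The explode and value_counts steps can be sketched on a few hypothetical baskets as follows. The baskets and the variable names are made up for illustration. Sorting each basket before pairing is my own addition, not part of the code above: it keeps (pastry, whole milk) and (whole milk, pastry) from being counted as two different pairs.

```python
import itertools
import pandas as pd

# Hypothetical baskets, one list of items per shopping trip.
baskets = [
    ["whole milk", "pastry", "salty snack"],
    ["whole milk", "pastry"],
    ["soda", "whole milk"],
]

# Sort each basket first so a pair is always written in the same order
# (an assumption added here, not something the original code does).
pairs_per_basket = [list(itertools.combinations(sorted(b), 2)) for b in baskets]

# explode turns the Series of lists into a Series with one pair per row.
pairs = pd.Series(pairs_per_basket).explode().reset_index(drop=True)

# value_counts sorts by frequency, so head(10) keeps the ten most frequent pairs.
top_pairs = pairs.value_counts().head(10)
print(top_pairs)
```

On the real data, the same call would be applied to combination_counts instead of the toy pairs series.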

The most frequent combination is a surprise because it pairs an item with itself (whole milk and whole milk). We built the combinations without pairing an element with itself, so the dataset itself might contain repeated items. For instance, if a customer buys 2 whole milks on one shopping trip, there will be two rows of whole milk for that trip.
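To see why repeated rows lead to such a pair, here is a minimal sketch with a hypothetical basket. The itertools.combinations function distinguishes elements by position rather than by value, so two whole milk entries in the same basket still form a pair.

```python
import itertools

# Hypothetical basket in which whole milk was recorded twice.
basket = ["whole milk", "whole milk", "rolls/buns"]

# Elements are treated as distinct by position, so equal values can be paired.
pairs = list(itertools.combinations(basket, 2))
print(pairs)
# [('whole milk', 'whole milk'), ('whole milk', 'rolls/buns'), ('whole milk', 'rolls/buns')]
```

Note that the repeated item also doubles the count of the (whole milk, rolls/buns) pair for this basket.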

We can confirm it by counting the number of whole milk rows for each shopping trip.

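# Count the whole milk rows per (customer, date) trip, largest counts first
whole_milk = groceries[groceries.itemDescription == 'whole milk']\
.groupby(['Member_number', 'Date']).count()\
.sort_values(by='itemDescription', ascending=False).reset_index()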

It seems our suspicion is correct. For instance, customer 1994 purchased 4 whole milks on 2015-11-03. Thus, it makes sense that the most frequent combination is whole milk and whole milk.

We should focus on non-repeating combinations. For instance, the second most frequent combination is whole milk and rolls/buns. Whole milk seems to dominate the shopping lists.
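One possible way to restrict the count to non-repeating combinations is to drop repeated items from each basket before building the pairs. The sketch below uses hypothetical baskets and collections.Counter instead of pandas. It is an alternative under stated assumptions, not the method used above, and it changes the meaning of the count: each pair is counted once per shopping trip, no matter how many units were bought.

```python
import itertools
from collections import Counter

# Hypothetical baskets; the first one contains whole milk twice.
baskets = [
    ["whole milk", "whole milk", "rolls/buns"],
    ["whole milk", "rolls/buns", "soda"],
]

pair_counts = Counter()
for basket in baskets:
    # set() drops repeated items; sorted() gives each pair a fixed order.
    unique_items = sorted(set(basket))
    pair_counts.update(itertools.combinations(unique_items, 2))

print(pair_counts.most_common(3))
```

With this approach a pair such as (whole milk, whole milk) can no longer appear, because every item occurs at most once per basket.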