Mastering Joins: Pandas merge, concat, append methods

Original article was published by Bex T. on Artificial Intelligence on Medium


Basic Exploration

Let’s say we have these two tables:

>>> user_usage.head()
>>> user_devices.head()

This data was downloaded from the KillBillerapplication. KillBiller was a free service that compared every mobile tariff in the UK and Ireland. The first, user_usage table contains monthly statistics of the mobile usage of users. user_devices table provides details about each user’s phone such as operating system and phone model.

Question 1:

How many users use Android OS and how many use iOS?

To answer this question, we will need the information from both tables. There is one linking attribute between both tables: use_id. We will use this column in our merge:

Number of users for each OS: android    157
ios 2
Name: platform, dtype: int64

It looks like there is a huge difference between the two operating systems in our dataset.

In the merge above, we used an example of an inner join. In merge() function, how argument is set to inner by default so we did not have to write it out. When merging two tables using the merge() function, we use on argument to specify the common column. If there are multiple, it is also possible to pass a list of columns to the argument and pandas will take care of the rest.

Note that as a right table, I subset the user_devices table to exclude irrelevant columns to the question.

Now, as we explore further, we will notice that number of given users in two datasets is different:

Clearly, the number of matching user IDs in both tables were 159. This means there are user IDs which are in user_devices table and not in user_usage table and vice versa. So, the next question we want to ask is:

Question 2

How many users use Android OS and iOS, including all the users that aren’t in user_usage table?

We can answer this question by using either a left or a right join. First, let’s look at the general case of one-sided joins:

Image by author

A left join will return all of the rows of the left table including matching rows from the right table. A right join will return all of the rows of the right table including matching rows from the left table. Note that in both types of joins, pandas will put an NaN value to the cells with no match.

Number of users for each OS: android    184
ios 88
Name: platform, dtype: int64

The same result can be achieved using a left join. We can just swap the tables and pass left keyword for the how parameter:

Number of users for each OS: android    184
ios 88
Name: platform, dtype: int64

We use an outer join to return the intersection and all the other rows from both tables:

Image by author