When Data meets Burgers — Exploratory Data Analysis

Original article was published on Artificial Intelligence on Medium

Comparing Categories

There are nine unique categories, so the 260 menu items are divided into nine groups. The pie chart shows that the Coffee and Tea category has the most items on the menu, whereas the Salads category has the fewest.
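The category counts behind such a pie chart can be obtained roughly as follows. This is a minimal sketch: the tiny DataFrame and its item names are made up for illustration, and only the `Category` column name is taken from the article.

```python
import pandas as pd

# Hypothetical stand-in for the menu data (the real dataset has 260 items).
df = pd.DataFrame({
    "Category": ["Coffee and Tea", "Coffee and Tea", "Coffee and Tea",
                 "Breakfast", "Breakfast", "Salads"],
    "Item": ["Item A", "Item B", "Item C", "Item D", "Item E", "Item F"],
})

n_categories = df["Category"].nunique()   # number of unique categories
counts = df["Category"].value_counts()    # items per category, largest first

print(n_categories)
print(counts)

# A pie chart of the shares could then be drawn with, for example:
# counts.plot(kind="pie", autopct="%1.1f%%")
```

`value_counts()` sorts in descending order, so the first entry is the dominant category and the last one is the smallest.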

Comparing categories based on nutrients

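Let us analyze which category has the highest amounts of protein, cholesterol, fat, and so on. Before visualizing the data, it is important to understand pivot_table(), which groups the rows by an index column and aggregates the values in each group. For example, a pivot table with Category as the index, Sodium as the values, and np.mean() as the aggregation function gives the mean Sodium value of each category.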
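A pivot table of this kind might look like the sketch below. The data values are invented for illustration; only the `Category` and `Sodium` column names come from the article. The string `"mean"` is used as the aggregation function, which pandas treats the same way as `np.mean`.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration only.
df = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast", "Salads", "Salads", "Desserts"],
    "Sodium": [750, 850, 200, 400, 100],
})

# One row per Category, holding the mean Sodium of that category's items.
sodium_by_category = pd.pivot_table(
    df, index="Category", values="Sodium", aggfunc="mean"
)
print(sodium_by_category)
```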

Instead of np.mean() of the Sodium values, we can also apply other aggregation functions such as np.max() or np.sum(). Since we are comparing nutrient values category-wise and the categories contain different numbers of items, the mean is the better choice here.

Similar to the above example, let us create a pivot table for Category and Protein to understand the distribution, and then draw a bar plot of the mean Protein value for each category.

We can create plots using the built-in Pandas plot() method. Since we are going to create a bar plot, we pass "bar" to the kind argument.
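Putting the pivot table and the plot() call together could look like this. It is a sketch under assumptions: the data values are invented, and the axis label is an illustrative choice rather than the article's own.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration only.
df = pd.DataFrame({
    "Category": ["Chicken and Fish", "Chicken and Fish", "Salads", "Desserts"],
    "Protein": [30, 20, 20, 4],
})

# Mean Protein per Category.
protein_by_category = pd.pivot_table(
    df, index="Category", values="Protein", aggfunc="mean"
)

# kind="bar" draws one vertical bar per row of the pivot table.
ax = protein_by_category.plot(kind="bar", legend=False)
ax.set_ylabel("Mean Protein")
```

In a script, `matplotlib.pyplot.show()` would be needed to display the figure; in a notebook it is rendered automatically.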

We can see that the Chicken and Fish category has the highest mean protein value. Similarly, let us analyze the distribution of all the other nutrients category-wise.
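One way to cover several nutrients at once is to pass a list of columns to pivot_table() and ask plot() for one subplot per column. The sketch below assumes two nutrient columns with invented values; the real dataset has many more.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration only.
df = pd.DataFrame({
    "Category": ["Smoothies", "Smoothies", "Salads", "Salads"],
    "Sugars": [60, 70, 6, 10],
    "Protein": [4, 6, 20, 30],
})

nutrients = ["Sugars", "Protein"]

# One row per Category, one column per nutrient, each holding the mean.
means = pd.pivot_table(df, index="Category", values=nutrients, aggfunc="mean")

# subplots=True draws a separate bar plot for every nutrient column.
axes = means.plot(kind="bar", subplots=True, legend=False)
```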

As expected, Smoothies are high in sugars and Salads are high in Vitamin A and Vitamin C. Calories are higher in Chicken and Fish items, whereas Iron is higher in Beef and Pork. The first bar plot also indicates high amounts of cholesterol in breakfast items, so it may be better to have breakfast at home.

Are bar plots the right tool for comparing discrete categories? Here we computed the mean of all items in each category and drew it as a bar. But means are very sensitive to outliers.
Are outliers playing a role in the bar plots above? To see how the individual items are distributed, we can create swarm plots.
In a swarm plot, every item in a category is drawn as its own dot, instead of all items of that category being collapsed into a single mean as in a bar plot.
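The sensitivity of the mean to outliers can be seen in a small example. The two lists below are invented values for two hypothetical categories: both have a mean of 20, so a bar plot would draw identical bars, even though four of the five items in the first category contain nothing at all.

```python
from statistics import mean, median

# Hypothetical values: mostly zero with one extreme item.
category_a = [0, 0, 0, 0, 100]
# Hypothetical values: all items close to each other.
category_b = [18, 20, 22, 20, 20]

# Same mean, so a bar plot of means cannot tell the two apart.
print(mean(category_a), mean(category_b))

# The median is not pulled up by the single extreme value.
print(median(category_a), median(category_b))
```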

Let us create a swarm plot for the Vitamin C and Category columns and compare it with the bar plot.

We can create a swarm plot using Seaborn's swarmplot() function. The x-axis labels are rotated by 90 degrees for readability.

import seaborn as sns

sns.set(style="whitegrid", color_codes=True)
ax1 = sns.swarmplot(x="Category", y="Vitamin C", data=df)
# Rotate the x-axis labels by 90 degrees so they do not overlap
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)

The swarm plot shows the Vitamin C value of every item in each category. At first glance, we might assume that the Beverages or the Snacks and Sides category has the highest amount of Vitamin C. On closer inspection, however, these two categories contain far more zero-valued items (circled in the swarm plot) than the Salads category, whose values all lie around 20 to 30. This gives Salads a higher mean than the other two categories. Similarly, we can create swarm plots for the other nutrients and analyze them.
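The same observation can be checked numerically by counting the zero-valued items next to the mean of each category. This is only a sketch: the values below are invented to mimic the pattern described above, and the summary column names are illustrative.

```python
import pandas as pd

# Hypothetical toy data: Beverages has one high value but many zeros,
# Salads has no zeros and values around 20 to 30.
df = pd.DataFrame({
    "Category": ["Beverages"] * 4 + ["Salads"] * 3,
    "Vitamin C": [0, 0, 0, 80, 20, 25, 30],
})

# Per category: mean, largest single value, and number of zero-valued items.
summary = df.groupby("Category")["Vitamin C"].agg(
    mean="mean",
    maximum="max",
    zero_items=lambda s: int((s == 0).sum()),
)
print(summary)
```

In this made-up data, Beverages holds the single largest value, yet its three zero-valued items pull its mean below that of Salads.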

Selecting the appropriate plot is a crucial part of insight generation.