Original article was published by Sahithya Swaminathan on Artificial Intelligence on Medium
Elegant way to make data talk stories: Exploratory data analysis
The big data knows everything
Data can tell great stories, and making it convey the right story is an art. The way to acquire this art is exploratory data analysis (EDA): using statistical and probabilistic approaches to understand what the data is trying to tell us.
As a data scientist, you will spend a major share of your time understanding the data and selecting only the necessary characteristics to send to the machine learning model. Only when the input data makes sense will the model be able to perform at its best.
One of the really tough things is figuring out what questions to ask. Once you figure out the question, then the answer is relatively easy — Elon Musk
It’s often challenging to find the right question from a clean slate. But by constantly asking why, we can understand the behaviour of the data and derive insights.
Now we can dive into some common starting points for EDA. Having been a fan of Pokemon since childhood, I will use the Pokemon 🌟 dataset from Kaggle to walk through the process step by step.
Come let’s catch ’em all 🙂
Super tool: Pandas
Before describing the general steps to perform the EDA, let’s take a look at an important tool.
Pandas is a fast, powerful and easy-to-use library built on top of Python. From my personal experience as a data scientist, my everyday bread and butter relies on pandas. Much of the programming logic can be implemented in just one or two lines of code, which is what makes this library so popular. It can handle thousands of rows without heavy computational requirements. Moreover, the functionality it provides is simple and quite effective.
Even for our Pokemon dataset EDA, we will be using pandas for understanding the data and also for visualisation
Common Steps in Exploratory Data analysis:
When I receive data, I perform some of the below steps to get a hold on what I’m actually dealing with.
- Know your data
- Understand the characteristics/columns in the data
- Causal analysis of data
Know your data:
It is important to know the number of data points that we’ll be dealing with. This is because, as the size of the data increases, the code must be written so that it stays efficient and executes in less time.
import pandas as pd
import numpy as np

#Read data
data = pd.read_csv('../pokemon.csv')
The shape attribute in pandas (an attribute, not a function, so it is used without parentheses) gives the number of rows and columns in the format (rows, cols)
data.shape
#Output: (801, 41)
Understand the characteristics/columns in the data
The info() function in pandas provides the complete list of columns, with the total number of non-null items in each column and its datatype. This function is useful for determining whether any columns need to be imputed to deal with missing data.
There are some interesting fields in the Pokemon dataset like generation, ability and type. Using the plotting features of pandas together with seaborn’s styling, we can generate good graphs which help in providing baseline stories.
The value_counts() function in pandas groups the unique values in a field and returns the frequency of each unique value (if the normalize parameter is set to True, it returns each value’s share of the total as a fraction between 0 and 1 instead). The frequency distribution can then be understood using bar plots.
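To make the missing-data check concrete, here is a minimal sketch on a tiny hand-made table. The frame `toy` and its values are my own illustration, not rows read from the Kaggle file; only `info()`, `isna()` and `sum()` are real pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Pokemon table; values are illustrative only.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
    "type1": ["grass", "fire", "water", "normal"],
    "type2": ["poison", None, None, "flying"],
    "height_m": [0.7, 0.6, np.nan, 0.3],
})

# Prints the dtype and the non-null count of every column
toy.info()

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)

# Only the columns that would need imputation (or dropping)
print(missing[missing > 0])
```

The same two lines on the full `data` frame should point to the columns that need attention before modelling.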
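As a quick illustration of what the normalize parameter changes, here is a small sketch on a made-up `generation` series (the six values are an assumption for the example, not the real distribution).

```python
import pandas as pd

# Hypothetical generation labels for six pokemons
generation = pd.Series([1, 1, 1, 2, 3, 3], name="generation")

# Raw frequency of each unique value, ordered by generation number
counts = generation.value_counts().sort_index()

# Same grouping, but each value's share of the total (fractions that sum to 1)
shares = generation.value_counts(normalize=True).sort_index()

print(counts)
print((shares * 100).round(1))  # multiply by 100 to read the shares as percentages
```

Raw counts are handy for spotting tiny groups, while shares are easier to compare across datasets of different sizes.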
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm

color = cm.Pastel1(np.linspace(.4, .8, 30))
fig, ax = plt.subplots(2,2, figsize=(18,18))

data.generation.value_counts(normalize=False).sort_index().plot(kind='bar', color=color, ax=ax[0,0])
ax[0,0].set_ylabel('Frequency of occurrence')

data.type1.value_counts(normalize=False).plot(kind='bar', color=color, ax=ax[0,1])
ax[0,1].set_ylabel('Frequency of occurrence')

data.type2.value_counts(normalize=False).plot(kind='bar', color=color, ax=ax[1,0])
ax[1,0].set_ylabel('Frequency of occurrence')

data.is_legendary.value_counts(normalize=False).plot(kind='bar', color=color, ax=ax[1,1])
ax[1,1].set_xlabel('Legendary - Yes/No')
ax[1,1].set_ylabel('Frequency of occurrence')
The above graph provides the frequency distribution of a few features (columns) from the dataset. It can be inferred from Fig 1 that non-legendary pokemons from generations 1, 3 and 5 form the majority of the population. Also, the most common type1 is water and the most common type2 is flying.
Fig 2 graphs are obtained by filtering rows with type1 = water and type2 = flying respectively. They show that the major type2 constituents for water (type1) pokemons are ground and flying. In the same way, most of the flying (type2) pokemons have normal or bug as their type1.
Going a bit deeper into the type fields, Fig 3 shows that not all pokemons have both types, and that there is not much difference in strength when a pokemon belongs to two types.
Thus, from the initial analysis, we can see that there are quite influential categorical variables in the data, like type, generation and legendary, which might impact the modelling results. There are also some continuous variables, like attack points, defense points, capture rate, base_egg_steps, sp_attack and sp_defense points, whose relationship with the categorical variables has some impact on the capture and strength of the pokemon under study.
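The filtering behind this kind of figure can be sketched with boolean masks. The six-row frame below is a hand-made stand-in, so the counts are illustrative and not the real ones.

```python
import pandas as pd

# Hypothetical sample; the real analysis would use the full `data` frame
toy = pd.DataFrame({
    "name": ["Wooper", "Gyarados", "Squirtle", "Butterfree", "Pidgey", "Quagsire"],
    "type1": ["water", "water", "water", "bug", "normal", "water"],
    "type2": ["ground", "flying", None, "flying", "flying", "ground"],
})

# Rows whose primary type is water, then the spread of their secondary type
water = toy[toy["type1"] == "water"]
water_partners = water["type2"].value_counts()  # missing type2 is skipped by default

# Rows whose secondary type is flying, then the spread of their primary type
flying = toy[toy["type2"] == "flying"]
flying_primaries = flying["type1"].value_counts()

print(water_partners)
print(flying_primaries)
```

Calling `.plot(kind='bar')` on either result would give a bar chart in the same style as the earlier plots.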
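One way to check this kind of claim numerically is to flag whether type2 is present and compare group averages. This is only a sketch: the column `type_count` is a helper name I made up, the numbers are illustrative, and I use attack as a stand-in for "strength", which may not be the measure used in the figure.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; attack stands in for whichever strength measure you prefer
toy = pd.DataFrame({
    "type2": ["poison", None, None, "flying", None, "ground"],
    "attack": [49, 52, 48, 45, 55, 45],
})

# Label each pokemon as single- or dual-typed based on a missing type2
toy["type_count"] = np.where(toy["type2"].notna(), "dual", "single")

# Share of pokemons that have two types
share_dual = (toy["type_count"] == "dual").mean()

# Average attack within each group
mean_attack = toy.groupby("type_count")["attack"].mean()

print(share_dual)
print(mean_attack)
```

If the two group means come out close on the real data, that supports the reading that a second type does not add much strength.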
Before starting the causal analysis, let’s figure out some of the questions that might be helpful in building an outline about the pokemon dataset
- How difficult is it to capture a given pokemon?
- Is it worth capturing it and how will it be useful for us in the battle?
- What are the strengths and weaknesses of a given pokemon?
Now let’s dive deep into understanding the relationship between the columns to unveil some of the hidden stories and form some general hypotheses.
How difficult is it to capture a given pokemon?
This can be gauged using the capture_rate and hp (hit-point) columns in the dataset. Capture rate indicates how likely a pokemon is to be captured once found, so a low capture rate means the pokemon is difficult to capture. Hit-point is the amount of damage a pokemon can withstand, so a pokemon with high hit-points is powerful. Most of the pokemons which are difficult to capture have high hp. The table produced below gives the top 10 pokemons which are most difficult to capture
#clean outlier data (.copy() avoids pandas' SettingWithCopyWarning on the next line)
cap = data[~(data.capture_rate == '30 (Meteorite)255 (Core)')].copy()
cap['capture_rate'] = cap['capture_rate'].astype(int)

#lowest capture rate first, ties broken by highest hp
cap[['name','capture_rate','hp']].sort_values(by=['capture_rate','hp'], ascending=[True,False]).head(10)
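If you would rather not hard-code the odd value, one alternative is to let pandas flag anything that is not a plain number. This is a sketch of that swapped-in approach, not the method used above; the four rows and the column name `capture_rate_num` are my own for the example.

```python
import pandas as pd

# Hypothetical sample containing one non-numeric capture_rate entry
raw = pd.DataFrame({
    "name": ["Mewtwo", "Minior", "Caterpie", "Chansey"],
    "capture_rate": ["3", "30 (Meteorite)255 (Core)", "255", "30"],
    "hp": [106, 60, 45, 250],
})

# Anything that cannot be parsed as a number becomes NaN instead of raising
raw["capture_rate_num"] = pd.to_numeric(raw["capture_rate"], errors="coerce")

# Drop the unparseable rows, then store the rate as an integer
clean = raw.dropna(subset=["capture_rate_num"]).copy()
clean["capture_rate_num"] = clean["capture_rate_num"].astype(int)

# Hardest to capture first; ties broken by highest hp
hardest = clean.sort_values(by=["capture_rate_num", "hp"], ascending=[True, False])
print(hardest[["name", "capture_rate_num", "hp"]])
```

The trade-off is that coercion silently drops every malformed row, so it is worth printing how many rows were lost before moving on.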
Is it worth capturing it and how will it be useful for us in the battle?
Once a pokemon is captured, we need to understand what we can gain from it. So it’s good to know how many points it needs to grow and become more powerful. This can be inferred from a column called experience_growth.
Nearly 65% of pokemons need 1–1.06M experience points to grow. So if a pokemon with high hp is captured, it’s really a jackpot; otherwise, more points are needed to increase its worth.
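A share like this can be computed by testing each row against the range and averaging the resulting True/False values. The eight values below are invented for the sketch, so the printed share is not the real figure.

```python
import pandas as pd

# Hypothetical experience_growth values for eight pokemons
experience_growth = pd.Series(
    [1_000_000, 1_059_860, 1_059_860, 1_250_000,
     800_000, 1_000_000, 600_000, 1_059_860],
    name="experience_growth",
)

# True where the value falls in the 1M to 1.06M band (both ends inclusive)
in_band = experience_growth.between(1_000_000, 1_060_000)

# The mean of a boolean series is the fraction of True values
share_in_band = in_band.mean()
print(f"{share_in_band:.1%} of the sample needs 1-1.06M points")
```

On the real table, the same check would be `data.experience_growth.between(1_000_000, 1_060_000).mean()`.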
Fields like height, weight, hp (hit-point), attack, defense, special attack and special defense define the strength of a pokemon and it is good to understand the interlinking correlation between them.
power = data[['height_m','weight_kg','hp','attack','defense','speed','sp_attack','sp_defense']]
sns.heatmap(power.corr(), annot=True, cmap='Pastel1', cbar=False)
Height and weight of the pokemon seem to have a moderate correlation, followed by the defense and special defense columns. However, no prominent correlation is found between the power columns and the size columns.
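Reading the strongest pairs off a heatmap by eye gets harder as the number of columns grows. Here is a small sketch that ranks the pairs instead, on a made-up three-column sample (the values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of three numeric columns
toy = pd.DataFrame({
    "height_m": [0.3, 0.6, 1.0, 1.7, 2.0],
    "weight_kg": [2.0, 8.5, 30.0, 90.0, 120.0],
    "speed": [70, 40, 65, 50, 60],
})

corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to (column, column) -> correlation and sort, strongest positive first
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Sorting by `pairs.abs()` instead would also surface strong negative relationships.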
Some types of pokemon have better hit-point when compared to others. This relationship might be useful in determining the worth of the pokemon captured or to be captured
The dots in the violin plots mark the median (50th percentile) of each group, and a long tail suggests that a small percentage of the group has much higher values (for example, in the normal type). It can be inferred that flying, fairy and dragon type pokemons have higher HP, so capturing any pokemon of these types might be useful in battles 🙂
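The two things read off a violin plot here, the median and the long tail, can also be tabulated directly. A minimal sketch on invented hp values:

```python
import pandas as pd

# Hypothetical sample; hp values are illustrative only
toy = pd.DataFrame({
    "type1": ["normal", "normal", "normal", "dragon", "dragon", "bug", "bug"],
    "hp": [40, 60, 250, 90, 100, 45, 55],
})

# Median shows the typical hp of a type; max exposes a long upper tail
summary = (
    toy.groupby("type1")["hp"]
    .agg(["median", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```

In this sample the normal type has a modest median but a very large maximum, which is the kind of long tail the violin plot makes visible.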
What are the strengths and weaknesses of a given pokemon?
Among the population, nearly 9% of pokemons are legendary. It’s often said that legendary pokemons have better attack and defense, and we can check whether the data agrees using a box plot
fig, ax = plt.subplots(1,2, figsize=(10,5))
#ax is an array of two axes, so each plot gets its own panel
sns.boxplot(x=data.is_legendary, y=data.attack, ax = ax[0])
sns.boxplot(x=data.is_legendary, y=data.defense, ax = ax[1])
Thus the data says that attack and defense values for legendary pokemons are high in comparison with normal pokemons, which supports our hypothesis.
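The same comparison can be backed with numbers alongside the plot. This sketch uses five invented rows, so the share and medians are illustrative and not the dataset's real values.

```python
import pandas as pd

# Hypothetical sample with one legendary pokemon
toy = pd.DataFrame({
    "is_legendary": [0, 0, 0, 0, 1],
    "attack": [49, 52, 48, 45, 110],
    "defense": [49, 43, 65, 40, 90],
})

# Because is_legendary is 0/1, its mean is the share of legendary pokemons
legendary_share = toy["is_legendary"].mean()

# Median attack and defense for each group (the centre line of a box plot)
medians = toy.groupby("is_legendary")[["attack", "defense"]].median()

print(f"{legendary_share:.0%} legendary")
print(medians)
```

Medians are used here because they match what the box plot draws and are less sensitive to a few extreme pokemons than the mean.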
Next, it will be interesting to see how attack and defense vary across types. To get this result, I’ve grouped all the pokemons by their type1 and calculated the mean attack and defense for each type.
fig, ax = plt.subplots(1,2, figsize=(10,5))

#mean attack per primary type on the left panel
data.groupby('type1')['attack'].mean().plot(kind='bar',color=color, ax = ax[0])
ax[0].set_ylabel('Mean Attack points')

#mean defense per primary type on the right panel
data.groupby('type1')['defense'].mean().plot(kind='bar',color=color, ax = ax[1])
ax[1].set_ylabel('Mean Defense points')
Dragon, fighting and ground type pokemons have the highest mean attack, while steel, rock and dragon have the highest mean defense. On the whole, dragon, steel and ground are the strongest pokemon types.
There are some interesting fields in the dataset like against_electric, against_fire and some other against_ fields which define the amount of damage a pokemon can take in the battle-field, ranging from 0–4 on the scale. For a given type of pokemon, we can find how much damage it takes against each attacking type.
It can be seen that flying pokemon take the highest damage against ice type of pokemon, followed by dark type pokemon taking the damage from the fairy. Hence, this heatmap gives an idea on the type of pokemon to use in the battle-field.
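A table like the one behind this heatmap can be built by averaging the against_ columns per type. The sketch below uses four invented rows and only three against_ columns; treat the numbers as illustrative.

```python
import pandas as pd

# Hypothetical sample; damage multipliers are illustrative only
toy = pd.DataFrame({
    "type1": ["flying", "flying", "dark", "dark"],
    "against_ice": [2.0, 2.0, 1.0, 1.0],
    "against_fairy": [1.0, 1.0, 2.0, 2.0],
    "against_ground": [0.0, 0.0, 1.0, 1.0],
})

# Pick every damage column by its shared prefix
against_cols = [c for c in toy.columns if c.startswith("against_")]

# Average damage taken by each primary type from each attacking type
damage = toy.groupby("type1")[against_cols].mean()

# For each type, the attacking type it takes the most damage from
weakest = damage.idxmax(axis=1)

print(damage)
print(weakest)
```

Passing `damage` to `sns.heatmap(damage, annot=True)` would draw the matrix, with rows as the defending type and columns as the attacking type.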
Tada 💫 we have used the raw pokemon dataset to get answers to some of the basic questions. Likewise, we can build further on this baseline story, for example by finding the best pokemon for each type of battle, and even more.
Hope this article gave you an idea on how to start from a clean slate. Happy learning 🙂