A Beginner’s Guide to Data Analysis in Python

Original article was published by Natassha Selvaraj on Artificial Intelligence on Medium


Pandas Profiling

Pandas Profiling is a very useful tool for analysts. It generates an analysis report on the data frame and helps you better understand the correlations between variables.

To generate a Pandas Profiling report, run the following lines of code:

import pandas_profiling as pp
pp.ProfileReport(df)

This report will give you some overall statistical information on the dataset, which looks like this:

By just glancing at the dataset statistics, we can see that there are no missing or duplicate cells in our data frame.

Finding the information above would usually require running a few lines of code, but Pandas Profiling generates it with far less effort.
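For comparison, here is a minimal sketch of how the same overall checks could be done by hand with plain pandas. The small data frame is a made-up stand-in, not the tutorial's dataset, and only mirrors a few of its column names.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's data frame; the values are made up.
df = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [120, 95, 140, 110],
    "Outcome": [0, 0, 1, 0],
})

# Total number of missing (NaN) cells across all columns
missing_cells = int(df.isnull().sum().sum())

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())

print("Missing cells:", missing_cells)
print("Duplicate rows:", duplicate_rows)

# Count, mean, standard deviation, min, quartiles, and max for each numeric column
print(df.describe())
```

Each check is short on its own, but the report bundles all of them, plus per-variable details, into a single call.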

Pandas Profiling also provides more information on each variable. I will show you an example:

This is information generated for the variable called “Pregnancies.”

For an analyst, this report saves a lot of time, as we don’t have to go through each individual variable and run many lines of code.

From here, we can see that:

  • The variable “Pregnancies” has 17 distinct values.
  • The minimum number of pregnancies a person has is 0, and the maximum is 17.
  • The share of zero values in this column is fairly low (only 14.5%). This means that more than 85% of the patients in the dataset have been pregnant at least once.

In the report, there is information like this provided for each variable. This helps us a lot in our understanding of the dataset and all the columns in it.
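The per-variable figures discussed above (distinct values, minimum, maximum, share of zeros) could also be computed manually. The sketch below uses made-up values for a single column, so its numbers will not match the report.

```python
import pandas as pd

# Made-up values for illustration only; not the tutorial's data.
df = pd.DataFrame({"Pregnancies": [0, 1, 3, 0, 6, 2, 1, 8]})

col = df["Pregnancies"]
summary = {
    "distinct": int(col.nunique()),               # number of different values
    "minimum": int(col.min()),
    "maximum": int(col.max()),
    "zeros_pct": float((col == 0).mean() * 100),  # percentage of rows equal to 0
}
print(summary)
```

Repeating this for every column is exactly the kind of work the report takes off your hands.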

The plot above is a correlation matrix. It helps us gain a better understanding of the correlation between the variables in the dataset.

There is a slight positive correlation between the variables “Age” and “Skin Thickness”, which can be looked into further in the visualization section of the analysis.

Since there are no missing or duplicate rows in the data frame as seen above, we don’t need to do any additional data cleaning.

Data Visualization

Now that we have a basic understanding of each variable, we can try to find the relationship between them.

The simplest and fastest way to do this is by generating visualizations.

In this tutorial, we will be using three libraries to get the job done — Matplotlib, Seaborn, and Plotly.

If you are a complete beginner to Python, I suggest starting out and getting familiar with Matplotlib and Seaborn.

Here is the documentation for Matplotlib, and here is the one for Seaborn. I strongly suggest spending some time reading the documentation, and doing tutorials using these two libraries in order to improve on your visualization skills.

Plotly is a library that allows you to create interactive charts, and requires slightly more familiarity with Python to master. You can find the installation guide and requirements here.

If you follow this tutorial exactly, you will be able to make beautiful charts with these three libraries. You can then use my code as a template for future analysis or visualization tasks.

Visualizing the Outcome Variable

First, after installing the libraries, run the following lines of code to import Matplotlib, Seaborn, NumPy, and Plotly:

# Visualization Imports
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
get_ipython().run_line_magic('matplotlib', 'inline')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
import numpy as np

Next, run the following lines of code to create a pie chart visualizing the outcome variable:

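# Count how many patients fall into each outcome class
dist = df['Outcome'].value_counts()
colors = ['mediumturquoise', 'darkorange']
# Build the pie trace and the layout, then combine them into one figure
trace = go.Pie(values=dist.values, labels=dist.index)
layout = go.Layout(title='Diabetes Outcome')
fig = go.Figure(data=[trace], layout=layout)
# Apply the slice colors and draw a black outline around each slice
fig.update_traces(marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()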

This is done with the Plotly library, and you will get an interactive chart that looks like this:

You can play around with the chart and choose to change the colors, labels, and legend.

From the chart above, we can see that most patients in the dataset are not diabetic. Fewer than half of them have an outcome of 1 (diabetes).
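If you want the exact class shares rather than reading them off the chart, a sketch like the following would work. The series here is made up, so the percentages are not those of the tutorial's dataset.

```python
import pandas as pd

# Made-up outcome labels for illustration (0 = no diabetes, 1 = diabetes).
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Outcome")

counts = outcome.value_counts()                       # rows per class
shares = outcome.value_counts(normalize=True) * 100   # percentage per class

print(counts)
print(shares.round(1))
```

With the real data frame, the same two calls on df['Outcome'] would give the numbers behind the pie chart.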

Correlation Matrix with Plotly

Similar to the correlation matrix generated in Pandas Profiling, we can create one using Plotly:

import plotly.graph_objects as go

def df_to_plotly(df):
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}

dfNew = df.corr()
fig = go.Figure(data=go.Heatmap(df_to_plotly(dfNew)))
fig.show()

The code above will generate a correlation matrix similar to the one from the Pandas Profiling report:

Again, similar to the matrix generated above, a positive correlation can be observed between the variables:

  • Age and Pregnancies
  • Glucose and Outcome
  • SkinThickness and Insulin
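Reading the strongest pairs off a heatmap by eye can be error-prone. As a rough sketch, the pairs could instead be ranked directly from the correlation table. The helper name top_correlations and the demo data are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n most positively correlated column pairs (Pearson)."""
    corr = df.corr()
    # Keep only the cells above the diagonal so each pair appears once
    # and a column is never paired with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Made-up demo data: B rises with A, C mostly falls as A rises.
demo = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 4, 1, 2],
})
result = top_correlations(demo, n=1)
print(result)
```

Calling top_correlations(df) on the tutorial's data frame should surface the same kind of pairs listed above, assuming all columns are numeric.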

To further understand the correlations between variables, we will create some plots:

Visualize Glucose Levels and Insulin

fig = px.scatter(df, x='Glucose', y='Insulin')
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Glucose and Insulin')
fig.show()

Running the code above should give you a plot that looks like this:

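There is a positive correlation between the variables glucose and insulin. This makes sense if the Insulin column records a measured serum insulin level, as it does in the commonly used Pima Indians diabetes dataset that shares these column names: the body releases more insulin when blood glucose rises.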
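To put a number on what the scatter plot shows, the Pearson correlation between the two columns can be computed directly. This is a minimal sketch with made-up values, so the coefficient below says nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Glucose": [85, 100, 120, 140, 160],
    "Insulin": [40, 60, 90, 130, 170],
})

# Pearson correlation coefficient: +1 is a perfect positive linear
# relationship, 0 is no linear relationship, -1 is a perfect negative one.
r = df["Glucose"].corr(df["Insulin"])
print(f"Pearson r = {r:.3f}")
```

A coefficient describes only the linear trend, so it is still worth looking at the scatter plot for clusters and outliers.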

Visualize Outcome and Age

Now, we will visualize the variables outcome and age. We will create a boxplot to do so, using the code below:

fig = px.box(df, x='Outcome', y='Age')
fig.update_traces(marker_color="midnightblue",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Age and Outcome')
fig.show()

The resulting plot will look somewhat like this:

From the plot above, you can see that older people are more likely to have diabetes. The median age for adults with diabetes is around 35, while it is much lower for people without diabetes.

However, there are a lot of outliers. The boxplot shows a few elderly people without diabetes, one of them over 80 years old.
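The medians and outliers read off the boxplot can also be checked numerically. The sketch below uses the common 1.5 × IQR rule for flagging outliers. The data is made up, the helper iqr_outliers is a hypothetical name, and the exact quartile calculation may differ slightly from the one a plotting library uses.

```python
import pandas as pd

# Made-up ages for illustration only.
df = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age":     [21, 22, 24, 25, 81, 30, 35, 36, 41, 45],
})

# Median age within each outcome group
median_age = df.groupby("Outcome")["Age"].median()

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[outside]

# Outliers per outcome group, as plain lists
outliers = {
    outcome: iqr_outliers(group["Age"]).tolist()
    for outcome, group in df.groupby("Outcome")
}

print(median_age)
print(outliers)
```

In this toy example, the single 81-year-old in the group without diabetes is flagged, which mirrors the kind of point visible in the boxplot.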

Visualizing BMI and Outcome

Finally, we will visualize the variables “BMI” and “Outcome”, to see if there is any correlation between the two variables.

To do this, we will use the Seaborn library:

plot = sns.boxplot(x='Outcome',y="BMI",data=df)

The boxplot created here is similar to the one created above using Plotly. However, Plotly is better at creating visualizations that are interactive, and the charts look prettier compared to the ones made in Seaborn.

From the box plot above, we can see that a higher BMI is associated with an outcome of 1. People with diabetes tend to have higher BMIs than people without diabetes.
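The box in a boxplot is drawn from the quartiles of each group, so the comparison above can be backed up with a small table. This sketch uses made-up BMI values and is only meant to show the shape of the calculation.

```python
import pandas as pd

# Made-up BMI values for illustration only.
df = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "BMI":     [22, 25, 27, 30, 26, 30, 33, 35, 38, 41],
})

# First quartile, median, and third quartile of BMI for each outcome group.
# Rows are outcomes; columns are the quantile levels 0.25, 0.5, and 0.75.
quartiles = df.groupby("Outcome")["BMI"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

If the whole row for outcome 1 sits above the row for outcome 0, as it does in this toy data, the boxes in the plot will be visibly shifted upward as well.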