Data Analysis and Data Visualization: Plots, Charts, and HeatMaps

Original article was published by Ananya Paul on Artificial Intelligence on Medium



Data Visualisation

Data analysis is the process of importing, cleaning, wrangling, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. It spans many approaches and techniques under a variety of names and is used across different domains of modern technology. Today, data analysis helps make decisions more evidence-based and helps businesses operate more effectively.

Data integration precedes data analysis: it combines data residing in different sources and gives users a unified view of it. Its importance grows as data volumes increase (big data) and the need to share data expands. Data analysis is closely linked with data visualisation, which plays a big role in helping programmers see the relationships between predictors and judge their statistical significance.

In this blog, I am going to walk through a small project completed under the guidance of a Udemy course. The project takes its data from Kaggle and works with a handy dataset of 911 emergency calls.

Data Analysis

The very first step for any project, as discussed in my previous blog, is to import the libraries and the dataset.

Importing libraries and dataset

NumPy: supports the creation of n-dimensional arrays (1D vectors, 2D matrices, and beyond) and provides high-level functions to operate on them

Pandas: helps in analyzing and manipulating data, with high-level functions for both purposes

Matplotlib: a data visualization library that lets us represent our data graphically through charts, graphs, etc.

Seaborn: a powerful data visualization library built on top of Matplotlib that improves the graphical representation. Plotly is a separate interactive visualization library (it is not built on Matplotlib), and Cufflinks connects Plotly to Pandas DataFrames

In the next step, we will inspect the information the dataset provides: the total number of data points and the datatype of each column. We will also view the first few rows of the dataset in tabular format.

Information and View of the dataset in the tabular format
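As a rough sketch of these two steps, the snippet below imports the libraries and inspects a data frame with info() and head(). The file name 911.csv, the zip column name, and the sample rows are assumptions made for illustration; twp, title, and timeStamp are the columns discussed later in this blog.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # imported here for the plotting steps later on
import seaborn as sns             # imported here for the plotting steps later on

# With the real data you would load the Kaggle file instead, for example:
# df = pd.read_csv("911.csv")  # assumed file name

# Small stand-in frame with invented rows so the snippet runs on its own.
df = pd.DataFrame({
    "zip": [19401.0, 19464.0, np.nan],
    "twp": ["NORRISTOWN", "POTTSTOWN", "ABINGTON"],
    "title": ["EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK", "Traffic: VEHICLE ACCIDENT -"],
    "timeStamp": ["2015-12-10 17:40:00", "2015-12-10 17:40:01", "2015-12-11 08:15:00"],
})

df.info()          # number of rows, column names, dtypes, non-null counts
print(df.head())   # first rows in tabular form
```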

Now let us play with our data and answer a few questions:

Q.1) What are the top 5 zip codes for the 911 calls?

Top 5 zip codes for the 911 calls

Q.2) What are the top 5 townships (twp) for 911 calls?

Top 5 townships(twp) for 911 calls
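Both of the top-5 questions above can be answered with value_counts(), which sorts values by frequency, followed by head(5). This is a minimal sketch on invented rows; the zip column name is an assumption.

```python
import pandas as pd

# Invented sample rows; the real dataset has many more.
df = pd.DataFrame({
    "zip": [19401.0, 19401.0, 19401.0, 19464.0, 19464.0, 19403.0, None],
    "twp": ["NORRISTOWN", "NORRISTOWN", "NORRISTOWN", "POTTSTOWN",
            "POTTSTOWN", "LOWER MERION", "ABINGTON"],
})

# value_counts() sorts by frequency (descending) and skips missing values.
top_zips = df["zip"].value_counts().head(5)
top_twps = df["twp"].value_counts().head(5)

print(top_zips)
print(top_twps)
```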

Q.3) Take a look at the ‘title’ column. How many unique title codes are there?

Unique Title Codes in the ‘Title’ column
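One way to count the distinct values is nunique(). The titles below are invented examples in the "Reason: code" shape described in the next section.

```python
import pandas as pd

# Invented sample titles, one of them repeated.
df = pd.DataFrame({
    "title": [
        "EMS: BACK PAINS/INJURY",
        "EMS: DIABETIC EMERGENCY",
        "Fire: GAS-ODOR/LEAK",
        "Traffic: VEHICLE ACCIDENT -",
        "EMS: BACK PAINS/INJURY",
    ],
})

# nunique() returns the number of distinct non-null values in the column.
n_titles = df["title"].nunique()
print(n_titles)
```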

Data Visualisation

In the ‘title’ column of our dataset, a “Reason/Department” is specified before the title code. The reasons are EMS, Fire, and Traffic. Let us use the .apply() function with a custom lambda expression to create a new column called “Reason” that contains this string value, and then view the count of calls for each reason.

Breaking a column into a new one
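A minimal sketch of that step, assuming the reason and the title code are separated by a colon (as in the invented titles below):

```python
import pandas as pd

# Invented sample titles in the assumed "Reason: code" format.
df = pd.DataFrame({
    "title": [
        "EMS: BACK PAINS/INJURY",
        "EMS: DIABETIC EMERGENCY",
        "Fire: GAS-ODOR/LEAK",
        "Traffic: VEHICLE ACCIDENT -",
        "EMS: BACK PAINS/INJURY",
    ],
})

# Keep only the text before the first colon as the reason.
df["Reason"] = df["title"].apply(lambda title: title.split(":")[0])

# Count how many calls fall under each reason.
reason_counts = df["Reason"].value_counts()
print(reason_counts)
```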

After creating the new feature, we will visualize our dataset using a simple count plot from the seaborn library. A countplot counts the frequency of each category and displays it as a bar graph.

Count Plot distribution on the reason column
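The count plot itself can be sketched in a couple of lines. The rows here are invented, and in a notebook the figure renders automatically.

```python
import pandas as pd
import seaborn as sns

# Invented sample of the Reason column.
df = pd.DataFrame({"Reason": ["EMS", "EMS", "EMS", "Traffic", "Traffic", "Fire"]})

# One bar per reason; the bar height is the number of rows with that reason.
ax = sns.countplot(x="Reason", data=df)
ax.set_title("911 calls by reason")
```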

Now let us focus on the time information in our dataset. The timeStamp column values are stored as strings, so we first convert them to DateTime objects. We then break the time down into 3 separate columns: Hour, Month, and Day of Week.
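Here is one possible way to do the conversion, using pd.to_datetime() and the .dt accessors. The timestamps are invented, and the mapping from day numbers to short day names is my own assumption for readability.

```python
import pandas as pd

# Invented timestamps stored as strings, like the raw column.
df = pd.DataFrame({
    "timeStamp": ["2015-12-10 17:40:00", "2016-01-03 08:05:00", "2016-02-29 23:59:00"],
})

# Convert the strings to datetime values.
df["timeStamp"] = pd.to_datetime(df["timeStamp"])

# Pull out the three pieces with the .dt accessor.
df["Hour"] = df["timeStamp"].dt.hour
df["Month"] = df["timeStamp"].dt.month
df["Day of Week"] = df["timeStamp"].dt.dayofweek   # Monday is 0, Sunday is 6

# Replace the day numbers with short names (assumed labels).
day_names = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}
df["Day of Week"] = df["Day of Week"].map(day_names)

print(df)
```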

Now, we will use the countplot to show the distribution of reasons for emergency for each day of the week. In the code, I have used the ‘viridis’ palette; many other palettes are available. The ‘hue’ parameter is assigned to the ‘Reason’ column, and a legend identifies each reason.

Distribute the reasons for emergency and map it for the week
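A sketch of this plot on invented rows. The legend placement outside the axes is just one reasonable choice, not something the dataset requires.

```python
import pandas as pd
import seaborn as sns

# Invented sample with the two columns the plot needs.
df = pd.DataFrame({
    "Day of Week": ["Mon", "Mon", "Mon", "Tue", "Tue", "Tue", "Wed", "Wed", "Wed", "Wed"],
    "Reason": ["EMS", "Traffic", "Fire", "EMS", "Fire", "Traffic", "Traffic", "EMS", "EMS", "Fire"],
})

# hue splits each day's bar into one bar per reason.
ax = sns.countplot(x="Day of Week", data=df, hue="Reason", palette="viridis")

# Move the legend to the right of the axes so it does not cover the bars.
ax.legend(title="Reason", bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0.0)
```

Passing x="Month" instead of x="Day of Week" gives the same breakdown per month, provided the data frame has a Month column.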

For a better understanding, we will plot the same distribution for each month.

Distribute the reasons for emergency and map it for each and every month

Now, we will create an object using the groupby() function, grouping the DataFrame by the month column and using the count() method for aggregation. We will then call the head() method on the returned DataFrame.

Group the Dataframe by the month column
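In code, the grouping might look like the sketch below. The name byMonth and the sample rows are my own; note that count() only counts non-null values, so columns with missing entries can show smaller numbers.

```python
import pandas as pd

# Invented sample rows; one township is missing on purpose.
df = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 3],
    "twp": ["NORRISTOWN", "POTTSTOWN", "NORRISTOWN", "ABINGTON", None, "POTTSTOWN"],
    "Reason": ["EMS", "EMS", "Fire", "Traffic", "EMS", "Fire"],
})

# Group by month and count the non-null entries in every other column.
byMonth = df.groupby("Month").count()
print(byMonth.head())
```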

Now we will create a simple plot from the data frame showing the count of calls per month.

Simple plot off of the data frame indicating the count of calls per month
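A minimal sketch of this plot, assuming any fully populated column (twp here) can stand in for the call count. The rows are invented.

```python
import pandas as pd

# Invented sample rows: 3 calls in month 1, 2 in month 2, 4 in month 3.
df = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "twp": ["NORRISTOWN"] * 9,
})

byMonth = df.groupby("Month").count()

# Plot one column of the grouped frame as a line; the index (Month) is the x axis.
ax = byMonth["twp"].plot()
ax.set_ylabel("Number of calls")
```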

Now we will see if we can use seaborn’s lmplot() function to create a linear fit on the number of calls per month. Because groupby() moves Month into the index, we may need to reset the index so that Month becomes a column again.

lmplot() to create a linear fit on the number of calls per month
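The sketch below shows the reset_index() step followed by lmplot(). The data is invented, and monthly is a name I made up for the reset frame.

```python
import pandas as pd
import seaborn as sns

# Invented sample: the number of calls falls slightly from month 1 to month 5.
df = pd.DataFrame({
    "Month": [1] * 5 + [2] * 4 + [3] * 4 + [4] * 3 + [5] * 2,
    "twp": ["NORRISTOWN"] * 18,
})

byMonth = df.groupby("Month").count()

# After groupby, Month is the index; reset_index() turns it back into a column
# so that lmplot() can refer to it by name.
monthly = byMonth.reset_index()

# Scatter the monthly counts and draw a linear regression fit through them.
grid = sns.lmplot(x="Month", y="twp", data=monthly)
```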

Now our task is to create a new column called ‘Date’ that contains the date from the timeStamp column. We will need to use .apply() along with the .date() method. Once that is done, we will group by this Date column with the count() aggregate and plot the counts of 911 calls.

Plotting the count of 911 calls on the Date column
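Put together, the step could look like this. The timestamps are invented, and byDate is my own name for the grouped result.

```python
import pandas as pd

# Invented sample: 2 calls on the 10th, 1 on the 11th, 3 on the 12th.
df = pd.DataFrame({
    "timeStamp": [
        "2015-12-10 17:40:00", "2015-12-10 18:02:00",
        "2015-12-11 08:15:00",
        "2015-12-12 09:00:00", "2015-12-12 10:30:00", "2015-12-12 23:59:00",
    ],
    "twp": ["NORRISTOWN"] * 6,
})
df["timeStamp"] = pd.to_datetime(df["timeStamp"])

# .date() drops the time of day and keeps only the calendar date.
df["Date"] = df["timeStamp"].apply(lambda t: t.date())

# Count calls per date and plot the counts as a line.
byDate = df.groupby("Date").count()["twp"]
ax = byDate.plot()
ax.set_ylabel("Number of calls")
```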

For better clarity, we will plot the counts of 911 calls separately for each reason for the emergency. This gives 3 plots, one per reason.

Plotting the count of 911 calls on the Traffic Reason
Plotting the count of 911 calls on the Fire Reason
Plotting the count of 911 calls on the EMS Reason
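One way to produce the three plots is to filter on the Reason column before grouping. The loop and the per_reason dictionary are my own additions for brevity, and the rows are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented sample rows with a Date and a Reason for each call.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-12-10", "2015-12-10", "2015-12-10", "2015-12-11",
        "2015-12-11", "2015-12-12", "2015-12-12", "2015-12-12",
    ]).date,
    "Reason": ["EMS", "EMS", "Traffic", "EMS", "Fire", "Traffic", "Traffic", "Fire"],
    "twp": ["NORRISTOWN"] * 8,
})

reasons = ["Traffic", "Fire", "EMS"]
fig, axes = plt.subplots(len(reasons), 1, figsize=(8, 9))

per_reason = {}
for ax, reason in zip(axes, reasons):
    # Keep only the calls for this reason, then count them per date.
    counts = df[df["Reason"] == reason].groupby("Date").count()["twp"]
    counts.plot(ax=ax)
    ax.set_title(reason)
    per_reason[reason] = counts

fig.tight_layout()
```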

We are almost through with the data visualization section; only heatmaps and cluster maps remain. Let us go over these two topics by walking through the code.

Let us move on to creating heatmaps with seaborn and our data. We’ll first need to restructure the data frame so that the columns become the Hours and the index becomes the Day of the Week. There are lots of ways to do this, but I would recommend combining groupby with the unstack method.

The unstack() method pivots a level of the (necessarily hierarchical) index labels. It returns a data frame with a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series. If this explanation is unclear, I would suggest going over the pandas documentation and other materials available on the internet.

Restructuring the data frame
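A small sketch of the groupby plus unstack combination on invented rows. Passing fill_value=0 is my own choice so that day and hour pairs with no calls show 0 instead of NaN.

```python
import pandas as pd

# Invented sample rows; there is no call on Wed at hour 8.
df = pd.DataFrame({
    "Day of Week": ["Mon", "Mon", "Mon", "Tue", "Tue", "Wed"],
    "Hour": [8, 8, 17, 8, 17, 17],
    "Reason": ["EMS", "Fire", "EMS", "Traffic", "EMS", "Fire"],
})

# Grouping by two columns gives a Series with a two-level index (day, hour).
counts = df.groupby(by=["Day of Week", "Hour"]).count()["Reason"]

# unstack() moves the inner level (Hour) from the index to the columns.
dayHour = counts.unstack(fill_value=0)
print(dayHour)
```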

Now we will visualize the heatmap and the cluster map on our dayHour data frame.

HeatMap using the new data frame
The same HeatMap using the data frame
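Drawing the heatmap is then a single call. The dayHour values below are invented, and the viridis colormap is just one option.

```python
import pandas as pd
import seaborn as sns

# Invented day-by-hour call counts in the shape produced by groupby + unstack.
dayHour = pd.DataFrame(
    [[2, 1], [1, 1], [0, 1]],
    index=pd.Index(["Mon", "Tue", "Wed"], name="Day of Week"),
    columns=pd.Index([8, 17], name="Hour"),
)

# Each cell is colored by its value; rows are days, columns are hours.
ax = sns.heatmap(dayHour, cmap="viridis")
```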

A heatmap is often used to visualize the correlation between predictors, which gives us a very important understanding of which predictors should be used for predicting the label. Here, it shows how the number of calls varies by day of the week and hour.

We will now go over the clustering map for the same data frame.

Clustering map for the data frame
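The cluster map uses the same kind of input as the heatmap, and seaborn additionally reorders rows and columns so that similar ones sit next to each other. This sketch uses invented counts and assumes SciPy is installed, since seaborn relies on it for the hierarchical clustering.

```python
import pandas as pd
import seaborn as sns

# Invented day-by-hour call counts (no missing values, which clustering needs).
dayHour = pd.DataFrame(
    [[5, 1, 3], [4, 2, 3], [1, 6, 2]],
    index=pd.Index(["Mon", "Tue", "Wed"], name="Day of Week"),
    columns=pd.Index([8, 12, 17], name="Hour"),
)

# clustermap() draws a heatmap with dendrograms and reorders rows and columns
# according to the hierarchical clustering.
grid = sns.clustermap(dayHour, cmap="viridis")
```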

There are other visualization techniques in the Matplotlib and Seaborn libraries, and they cannot all be summed up in one blog. Data analysis and visualization range from a simple housing plot to geographical plotting, each with a number of sub-topics. I would request you to go over this publication and post your feedback.

Hope you liked it. Stay safe and keep blogging!

Thank you to Udemy and Kaggle for their considerable support.