Original article was published by Ananya Paul on Artificial Intelligence on Medium
Data Analysis and Data Visualization: Plots, Charts, and HeatMaps
Data analysis is the process of importing, cleansing, wrangling, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. It has many facets and approaches, encompassing diverse techniques under a variety of names, and is used across many domains of modern technology. Today, data analysis helps make decisions more scientific and helps businesses operate in a more dynamic fashion.
Data integration is a predecessor to data analysis: it combines data residing in different sources and gives users a unified view of them. Its importance grows as data volumes (big data) increase and the need to share data explodes. Data analysis is closely linked with data visualization, which plays a big role in helping programmers see the correlations between predictors and their statistical significance.
In this blog, I am going to walk through a small project completed under the guidance of a Udemy course. The project uses a dataset from Kaggle containing 911 calls made during emergencies.
The very first step of any project, as discussed in my previous blog, is to import the libraries and the dataset:
NumPy — supports the creation of 1D arrays and 2D matrices, and provides high-level functions to operate on these vectors/matrices
Pandas — helps in analyzing and manipulating data, and offers high-level functions for these purposes
Matplotlib — a data visualization library that assists us in graphically representing our data through charts, graphs, etc.
Seaborn — a powerful data visualization library built on top of Matplotlib that improves its graphical representation. Cufflinks and Plotly are more advanced visualization libraries: Plotly produces interactive plots, and Cufflinks connects Plotly to Pandas
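The setup step can be sketched as follows (the local filename `911.csv` is an assumption about where the Kaggle file was saved):

```python
# Core libraries for the project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The data would then be loaded with something like:
# df = pd.read_csv('911.csv')   # filename is an assumption
```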
In the next step, we will examine the information the dataset provides: the total number of data points and the datatype of each parameter. We will also view a small part of the dataset in tabular format.
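A minimal sketch of this inspection step, using a hypothetical miniature stand-in with the dataset's column names (the real file is far larger):

```python
import pandas as pd

# Tiny stand-in for the real Kaggle dataframe, same column names.
df = pd.DataFrame({
    'zip': [19401.0, 19446.0],
    'twp': ['NORRISTOWN', 'LANSDALE'],
    'title': ['EMS: BACK PAINS/INJURY', 'Fire: GAS-ODOR/LEAK'],
    'timeStamp': ['2015-12-10 17:10:52', '2015-12-10 17:29:21'],
})

df.info()          # column dtypes and non-null counts
print(df.head())   # first few rows in tabular form
```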
Now let us play with our data and answer a few questions:
Q.1) What are the top 5 zip codes for 911 calls?
Q.2) What are the top 5 townships (twp) for 911 calls?
Q.3) Take a look at the ‘title’ column: how many unique title codes are there?
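All three questions reduce to one-liners with `value_counts()` and `nunique()`. A sketch on a hypothetical mini-sample:

```python
import pandas as pd

# Hypothetical mini-sample with the dataset's column names.
df = pd.DataFrame({
    'zip': [19401.0, 19401.0, 19446.0],
    'twp': ['NORRISTOWN', 'NORRISTOWN', 'LANSDALE'],
    'title': ['EMS: BACK PAINS/INJURY', 'Fire: GAS-ODOR/LEAK',
              'EMS: CARDIAC EMERGENCY'],
})

top_zips = df['zip'].value_counts().head(5)   # Q1: most frequent zip codes
top_twps = df['twp'].value_counts().head(5)   # Q2: most frequent townships
n_titles = df['title'].nunique()              # Q3: number of unique titles
```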
In the ‘title’ column of our dataset, a “Reason/Department” is specified before the title code: EMS, Fire, or Traffic. Let us use the .apply() function with a custom lambda expression to create a new column called “Reason” that contains this string value, and then view the count of calls for each reason.
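Since the reason is the text before the first colon in each title, splitting on `':'` extracts it. A sketch on sample titles:

```python
import pandas as pd

df = pd.DataFrame({'title': ['EMS: BACK PAINS/INJURY',
                             'Fire: GAS-ODOR/LEAK',
                             'Traffic: VEHICLE ACCIDENT -',
                             'EMS: CARDIAC EMERGENCY']})

# The text before ':' in each title is the department/reason.
df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])
print(df['Reason'].value_counts())
```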
Following the creation of the new feature, we will visualize our dataset using a simple count plot from the Seaborn library. A countplot counts the frequency of each category and displays it as a bar graph.
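The countplot itself is a single call; the sample `Reason` values below are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'Reason': ['EMS', 'EMS', 'Fire', 'Traffic', 'Traffic']})

# Bar chart of how many calls fall under each reason.
ax = sns.countplot(x='Reason', data=df)
plt.show()
```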
Now let us focus on the time information in our dataset. The timeStamp column values are strings, so we convert them to DateTime objects. We then break the time down into 3 separate entities: Hour, Month, and Day of Week.
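A sketch of the conversion and the three derived columns; the day names come from a small mapping dictionary (sample timestamps are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'timeStamp': ['2015-12-10 17:10:52',
                                 '2016-01-11 08:01:02']})

# Convert the string timestamps to real datetime objects.
df['timeStamp'] = pd.to_datetime(df['timeStamp'])

# Break each timestamp into hour, month, and a named day of the week.
df['Hour'] = df['timeStamp'].apply(lambda t: t.hour)
df['Month'] = df['timeStamp'].apply(lambda t: t.month)
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
df['Day of Week'] = df['timeStamp'].apply(lambda t: t.dayofweek).map(dmap)
```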
Now, we will use a countplot of the reasons for emergency, split by day of the week. In the code, I have used the ‘viridis’ palette; many other palettes are available. The hue parameter is assigned to the ‘Reason’ column, and the legend is placed with the appropriate syntax.
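A sketch of this plot, assuming the derived `Day of Week` and `Reason` columns already exist (sample values are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Day of Week': ['Thu', 'Thu', 'Mon', 'Tue', 'Sat'],
    'Reason': ['EMS', 'Fire', 'Traffic', 'EMS', 'Traffic'],
})

ax = sns.countplot(x='Day of Week', data=df, hue='Reason', palette='viridis')
# Move the legend outside the plot so it does not cover the bars.
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
plt.show()
```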
For better understanding, we will do the same mapping for each and every month.
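The monthly version only changes the x-axis column (again on hypothetical sample values):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Month': [12, 12, 1, 1, 2],
    'Reason': ['EMS', 'Fire', 'Traffic', 'EMS', 'Traffic'],
})

ax = sns.countplot(x='Month', data=df, hue='Reason', palette='viridis')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
plt.show()
```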
Now, we will create an object using the groupby() function: we group the DataFrame by the Month column and use the count() method for aggregation. We will then call head() on the returned DataFrame.
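A sketch of the grouping step; `count()` fills every column with the number of calls in each month (sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Month': [12, 12, 1, 2, 2],
    'twp': ['NORRISTOWN', 'LANSDALE', 'AMBLER', 'LANSDALE', 'NORRISTOWN'],
})

# One row per month; every column holds the per-month call count.
byMonth = df.groupby('Month').count()
print(byMonth.head())
```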
Now we will create a simple line plot from the DataFrame indicating the count of calls per month.
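Any fully populated column of the grouped DataFrame works as the count; a sketch with hypothetical monthly totals:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-month call counts, indexed by month number.
byMonth = pd.DataFrame({'twp': [7, 5, 9]},
                       index=pd.Index([1, 2, 3], name='Month'))

ax = byMonth['twp'].plot()   # line plot of calls per month
plt.show()
```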
Now we will see if we can use Seaborn’s lmplot() function to fit a linear trend to the number of calls per month. We may need to reset the index to a column first, since lmplot expects column names.
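A sketch of the linear fit, reusing hypothetical monthly totals; `reset_index()` turns the Month index back into a regular column:

```python
import pandas as pd
import seaborn as sns

byMonth = pd.DataFrame({'twp': [7, 5, 9]},
                       index=pd.Index([1, 2, 3], name='Month'))

# lmplot needs 'Month' as a real column, so reset the index first.
g = sns.lmplot(x='Month', y='twp', data=byMonth.reset_index())
```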
Our next task is to create a new column called ‘Date’ that contains just the date from the timeStamp column; we will use .apply() along with the .date() method. Once that is done, we will group by this Date column with the count() aggregate and create a plot of the counts of 911 calls.
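A sketch of the date extraction and the daily-count plot (sample timestamps are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'timeStamp': pd.to_datetime(['2015-12-10 17:10:52',
                                 '2015-12-10 19:02:11',
                                 '2015-12-11 08:01:02']),
    'twp': ['NORRISTOWN', 'LANSDALE', 'AMBLER'],
})

# Extract just the calendar date from each timestamp.
df['Date'] = df['timeStamp'].apply(lambda t: t.date())

# Count calls per day and plot the resulting time series.
daily = df.groupby('Date').count()['twp']
daily.plot()
plt.tight_layout()
plt.show()
```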
For better clarity, we will go over the plot of counts of 911 calls for each and every reason for the emergency. We will have 3 separate plots, for 3 separate reasons.
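The per-reason plots just repeat the daily grouping on a filtered DataFrame, one figure per reason (sample data is hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Date': ['2015-12-10', '2015-12-10', '2015-12-11', '2015-12-11'],
    'Reason': ['EMS', 'Traffic', 'EMS', 'Fire'],
    'twp': ['NORRISTOWN', 'LANSDALE', 'AMBLER', 'LANSDALE'],
})

# One figure per reason, each showing daily call counts for that reason.
for reason in ['EMS', 'Fire', 'Traffic']:
    df[df['Reason'] == reason].groupby('Date').count()['twp'].plot()
    plt.title(reason)
    plt.show()
```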
We are almost through with the data visualization section; we are only left with the heatmaps and the cluster maps. Let us go over these two topics through the code.
Let us move on to creating heatmaps with seaborn and our data. We’ll first need to restructure the data frame so that the columns become the Hours and the Index becomes the Day of the Week. There are lots of ways to do this, but I would recommend trying to combine groupby with an unstack method.
The unstack() method pivots a level of the (necessarily hierarchical) index labels. It returns a DataFrame with a new level of column labels whose innermost level consists of the pivoted index labels. If the index is not a MultiIndex, the output is a Series. If this explanation lacks clarity, I would request you to go over the pandas documentation.
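A sketch of the restructuring: group by both day and hour, count, then unstack the Hour level into columns (sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Day of Week': ['Thu', 'Thu', 'Mon', 'Mon'],
    'Hour': [17, 17, 8, 9],
    'Reason': ['EMS', 'Fire', 'Traffic', 'EMS'],
})

# Rows: day of week; columns: hour of day; values: number of calls.
dayHour = df.groupby(by=['Day of Week', 'Hour']).count()['Reason'].unstack()
print(dayHour)
```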
Now we will visualize the heatmaps and the clustering maps on our dayHour data frame.
The heatmap helps us visualize the intensity of the values in a matrix at a glance; here it shows when during the week the 911 calls are concentrated. Heatmaps are also widely used to visualize the correlation between predictors, which gives us a very important understanding of which predictors should be used for projecting the label.
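The heatmap itself is one call on the day-by-hour matrix; the small matrix below is a hypothetical stand-in for the real `dayHour` DataFrame:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical day-by-hour count matrix (the real one comes from
# the groupby/unstack step).
dayHour = pd.DataFrame([[2, 0, 1], [0, 3, 1]],
                       index=['Mon', 'Thu'], columns=[8, 9, 17])

ax = sns.heatmap(dayHour, cmap='viridis')
plt.show()
```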
We will now go over the clustering map for the same data frame.
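The cluster map takes the same matrix; it additionally reorders rows and columns so that similar days and hours sit next to each other (the matrix below is a hypothetical stand-in, and SciPy must be installed for the clustering):

```python
import pandas as pd
import seaborn as sns

dayHour = pd.DataFrame([[2, 0, 1], [0, 3, 1]],
                       index=['Mon', 'Thu'], columns=[8, 9, 17])

# Hierarchically clusters rows/columns, then draws the heatmap.
g = sns.clustermap(dayHour, cmap='viridis')
```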
There are other visualization techniques in the Matplotlib and Seaborn libraries, not all of which could be summed up in one blog. Data analysis and visualization range from a simple housing plot to geographical plotting, each with a number of sub-sections. I would request you to go over this publication and post your feedback.
Hope you like it, stay safe and keep blogging!
Thank you to Udemy and Kaggle for their considerable support.