A to Z: Master Data Visualization with this Ruleset

Original article was published on Artificial Intelligence on Medium

Intro:

Whether you’re breaking into data analytics or data science, or you’re a product manager, sales leader, or anyone else seeking to understand their business, being able to use data in a meaningful way is key. And whether you work in data visualization software like Tableau, Domo, or Power BI, or in a language like R or Python, a handful of principles and concepts will help you get started.

Purpose of your analysis:

Before anything else, keep in mind that any analysis should have some purpose. Without one, it’s easy to look at a chart and ask yourself, “why am I looking at this?” or “what am I supposed to get here?”. To boil it down to a very simple principle: we want to understand the nature of a given variable & how that variable might relate to others.

Key things to keep in mind:

Dimensionality, Data types

Dimensionality:

Here we’re talking about the number of Ds in 3D. When you played Super Mario in 2 dimensions, you had an X and a Y axis. Most of us have seen a lot of two-dimensional charts and graphs. The way to think about this is, “how many variables do I want to include in a given visualization?”. As a general rule here, less is often more.

Data types:

Know whether each field is numeric (something like age or weight), categorical (gender, hair color, etc.), or time (the date, month, or day something occurred). Once you understand this and have some hypothesis about how certain variables may relate to one another, you can begin to formulate what types of visualizations you might use.
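If you’re working in R, a quick way to check the data type of each field is to ask for its class. This is only a minimal sketch; it assumes ggplot2 is installed, since the economics dataset ships with that package.

```r
library(ggplot2)  # only needed here for the economics dataset

# Class of every column in iris: four numeric measurements and one factor
iris_types <- sapply(iris, class)
print(iris_types)

# A time field: the date column of economics is stored as a Date
date_type <- class(economics$date)
print(date_type)
```

In R, categorical fields often show up as factor or character columns, so both are worth treating as categorical when picking a chart.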

Language and datasets:

All of the visualizations will be made using the ggplot2 package in R with a variety of sample datasets: iris and mtcars, which ship with base R, and mpg and economics, which ship with ggplot2. I won’t be including much R code here as I want this to be broadly applicable, but if you’d like the code for any of these, please comment below or reach out.

Jumping in:

There are other things we’d normally do to understand the data before making any visualizations, but for the sake of getting to the rules around visualization specifically, we’ll jump right in.

We’ll go through a variety of options & rules by datatype & dimensionality, starting with a single dimension.

Number of dimensions: 1

Datatype: Numeric

Dataset: mtcars; sample dataset included in base R that gives a variety of datapoints on cars

Purpose: Understanding distribution & summary statistics

Charts: histogram & boxplot

When trying to understand a numeric variable in isolation, you’ll first seek to understand its distribution. For this, you’ll use a simple histogram, which groups the values into ranges (bins) and shows how many observations fall in each. The first variable that we want to understand is horsepower.
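For anyone following along in R, a histogram like this can be sketched in a few lines of ggplot2. This is not necessarily the exact code behind the original chart; the bin width of 25 and the colors are arbitrary choices for illustration.

```r
library(ggplot2)

# Histogram of horsepower; binwidth = 25 is an assumption, try other values
hp_hist <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25, fill = "steelblue", color = "white") +
  labs(x = "Horsepower (hp)", y = "Number of cars")

print(hp_hist)
```

Changing the bin width can noticeably change the apparent shape, so it is worth trying a couple of values before drawing conclusions.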

What we’re seeing here is that horsepower is right-skewed: the tail on the right side of the peak stretches further than the one on the left. If you think about the typical horsepower for a car, most will be less than 250, but there are certainly still cars being made to push that envelope, albeit far fewer.

This is a box plot (also called a box-and-whiskers plot) for the same variable, hp. Box plots are great for visualizing a number of summary statistics on a given variable. The dark horizontal line is the median. The box the median sits within represents the IQR, or interquartile range: if you break your data into four even quartiles, the IQR is the range between the 1st and 3rd quartiles. The whiskers extending from the box reach toward the min and max; in ggplot2’s default box plot they stop at the most extreme value within 1.5 times the IQR of the box, and anything beyond that is drawn as an individual outlier point.
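To make the pieces of the box plot concrete, here is a small sketch that computes the same summary statistics directly and then draws the plot. The variable names are my own, and the 1.5 × IQR fence is ggplot2’s default outlier rule.

```r
library(ggplot2)

# Quartiles and interquartile range for horsepower
hp_quartiles <- quantile(mtcars$hp, probs = c(0.25, 0.5, 0.75))
hp_iqr <- IQR(mtcars$hp)

# Values above Q3 + 1.5 * IQR are drawn as individual outlier points
upper_fence <- hp_quartiles[["75%"]] + 1.5 * hp_iqr
hp_outliers <- mtcars$hp[mtcars$hp > upper_fence]

print(hp_quartiles)
print(hp_outliers)

# x = "" gives a single box with no category label
hp_box <- ggplot(mtcars, aes(x = "", y = hp)) +
  geom_boxplot() +
  labs(x = NULL, y = "Horsepower (hp)")

print(hp_box)
```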

Here we see a histogram of miles per gallon (mpg), with a slightly right-skewed distribution. It also appears nearly bimodal, meaning there are effectively two peaks. Bimodality often hints that two different populations are overlapping in one distribution. As a hypothetical example, if a dataset mixed gas-powered cars averaging 15–20 mpg with electric cars rated at 30–35, then given enough volume of each we could see two peaks in our distribution. (mtcars itself covers 1973–74 models, so this is only an illustration of the idea.)

We’re now looking at qsec, a car performance metric: the time in seconds it takes the car to travel ¼ of a mile. What we see here is a roughly normal, bell-shaped distribution.

Number of dimensions: 1

Datatype: Categorical

Dataset: mpg; sample dataset included with the ggplot2 package that gives a variety of datapoints on cars

Purpose: Understanding proportions

Charts: bar chart & pie chart

When trying to understand a single categorical variable in isolation, the main thing you want to consider is how many times each category occurs.

Now let’s take a look at the transmission (trans) & class variables from the mpg dataset. We’ll do so by creating a bar chart with the categorical variable on the X-axis & the count of occurrences on the Y-axis.
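In ggplot2, a count-based bar chart like this is a short sketch, since geom_bar() counts the rows in each category for you. The rotated axis labels and hidden legend are my own styling choices, not something the chart requires.

```r
library(ggplot2)

# geom_bar() counts how many rows fall in each transmission type
trans_bar <- ggplot(mpg, aes(x = trans, fill = trans)) +
  geom_bar() +
  labs(x = "Transmission", y = "Count") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

print(trans_bar)

# The same counts as a plain table
trans_counts <- table(mpg$trans)
print(trans_counts)
```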

Here you can get an idea of which transmissions occur frequently versus those that are rarer. Typically you don’t need to include color for a single variable, but I’ve added it here just to make things a tad clearer.

Similarly, we see the count of occurrences charted by class. We can see that 2seaters and minivans occur far less frequently than SUVs or compacts. This bar chart could just as easily be shown as a pie chart. That said, it can be a tad more difficult to judge the size of a given slice in a pie chart than the height of a bar, since each slice sits at a different angle and slices you want to compare may be on opposite sides of the pie.

As mentioned, here is a pie chart of the class variable.
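ggplot2 has no dedicated pie geom, so one common approach (sketched here as an assumption about how you might build it, not necessarily how the original was made) is a single stacked bar wrapped into polar coordinates.

```r
library(ggplot2)

# One stacked bar of class counts, bent into a circle with coord_polar()
class_pie <- ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, fill = "Class") +
  theme_void()

print(class_pie)
```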

Additionally, we can use bar charts to plot other aggregations by categorical variables, for instance the average mpg for each car class, but we’ll get into that later.

Ok so now we’ve looked at numeric & categorical variables in isolation; let’s increase the number of dimensions we’re charting to two and look at some different combinations.

Number of dimensions: 2

Datatype: numeric

Dataset: iris & mpg; iris is a sample dataset included in base R with measurements for three species of iris, and mpg is a sample dataset included with ggplot2 that gives a variety of datapoints on cars

Purpose: Understanding the relationship between two variables

Charts: Scatter plot

Whenever you’re trying to understand the relationship between two numeric variables, a scatter plot is the best-practice starting point.

Here we are trying to observe the relationship between the length (Y Axis) and width (X axis) of a sepal (for the plant anatomy, just run a quick google search.. 🙂 )

I also looked at the correlation between these two variables. Correlation measures how closely two variables move together: 1 means they move perfectly in sync, -1 means they move perfectly inversely, and 0 means no linear relationship. As a rough rule of thumb, values around .5 or -.5 indicate a reasonably strong relationship, values around .3 or -.3 a weak one, and anything much closer to 0 a very weak or random one. Here the correlation was about -.12, suggesting that there is no real linear relationship. While it appears that these two variables are unrelated, it is important to consider the many potential layers of a relationship among variables.
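For reference, here is a minimal sketch of the scatter plot and the overall correlation in R. cor() uses the Pearson correlation by default, which is what I’m assuming here.

```r
library(ggplot2)

# Overall Pearson correlation between sepal width and sepal length
sepal_cor <- cor(iris$Sepal.Width, iris$Sepal.Length)
print(round(sepal_cor, 2))

# Width on the X axis, length on the Y axis
sepal_scatter <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  labs(x = "Sepal width", y = "Sepal length")

print(sepal_scatter)
```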

I’ll talk more about this later, but I’ll give a sneak peek now. While in the previous plot we saw that width and length appeared not to relate, it is important to include as many potential perspectives as possible. Looking at the same plot as before, we are going to add one more dimension to it; Species. We will visualize species by using color.

Once we add in the third dimension, we can see that within each species there is a clear positive linear relationship between length & width.

Below I’ve included the correlation when grouping by species and we can see on the high end a correlation of .74 and on the low end .46, which is still considerable.
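A sketch of how the per-species view could be reproduced: split the data by species, compute the correlation within each group, and map species to color in the plot. The helper names are my own.

```r
library(ggplot2)

# Correlation of sepal width and length within each species
species_cor <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Width, d$Sepal.Length)
)
print(round(species_cor, 2))

# Same scatter as before, with species shown as color
species_scatter <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length,
                                    color = Species)) +
  geom_point() +
  labs(x = "Sepal width", y = "Sepal length")

print(species_scatter)
```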

Now let’s get back to assessing two numeric variables, this time city MPG and highway MPG from the mpg dataset. The scatter moves up and to the right in a linear fashion, indicating a positive relationship between the two. These two variables correlate at .96.
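To check the city versus highway relationship yourself, something like the following should work. The alpha setting is my own addition to make overlapping points easier to see.

```r
library(ggplot2)

# cty and hwy are the city and highway miles-per-gallon columns in mpg
mpg_cor <- cor(mpg$cty, mpg$hwy)
print(round(mpg_cor, 2))

mpg_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  labs(x = "City MPG", y = "Highway MPG")

print(mpg_scatter)
```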

Something to keep in mind if you’re new to statistics: even though these things move together, it doesn’t necessarily mean that one is causing the other. It just indicates they’re related.

To continue evaluating other combinations of two dimensional data, let’s consider how we might analyze the relationship between two categorical variables.

Number of dimensions: 2

Datatype: categorical

Dataset: mpg; sample dataset included with ggplot2

Purpose: Understanding the relationship between two variables

Charts: table, heatmap, bar chart

Before jumping into visualizations, I’m going to show two categorical variables in a table, since any visualization of two categorical dimensions is essentially a picture of what we’d find in that table.

The table below is from the mpg dataset; we’re looking at the class of car and whether the car is four wheel drive, front wheel drive, or rear wheel drive.

Here we can see the number of records for each combination of the two categorical variables.

At a quick glance, we can see that four-wheel-drive SUVs are the most common combination. Another layer you can add to this is looking at each cell as a percent of the whole.
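In R, both views come from base functions: table() for the counts and prop.table() for each cell as a share of the whole. A minimal sketch, with rounding to three decimals as my own choice:

```r
library(ggplot2)  # only needed here for the mpg dataset

# Counts for every class / drive-type combination
# (drv: "4" = four wheel, "f" = front wheel, "r" = rear wheel)
class_drv <- table(mpg$class, mpg$drv)
print(class_drv)

# Each cell as a proportion of all cars in the dataset
class_drv_share <- round(prop.table(class_drv), 3)
print(class_drv_share)
```

prop.table() also accepts a margin argument (1 for rows, 2 for columns) if you’d rather see each cell as a share of its class or of its drive type.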

There is more to do with proportion tables (prop tables), but we’ll save that for another time.

From here we’re looking to visualize what we’re seeing in the table.

An excellent way to turn this table into a visualization is with a heatmap. Take a look at the chart below. Each axis holds one of the categorical variables, and the color of each cell represents the count of occurrences.

As we saw in the original table, SUV, 4 wheel drive is the most common, with midsize front-wheel drive coming in second and compact front wheel drive taking third.

A heatmap presents some difficulty in terms of being able to gauge exactly what the count is. We have a legend, and depending on the software you’re using you may be able to serve that up easily as a tooltip.
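One way to sketch this heatmap in ggplot2 is with geom_tile() on the table of counts. Printing the count on each tile with geom_text() is my own addition, as one possible workaround for the difficulty of reading exact values from color alone.

```r
library(ggplot2)

# Long-format counts: one row per class / drv combination
counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))

heat <- ggplot(counts, aes(x = drv, y = class, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "white") +  # exact count per cell
  labs(x = "Drive type", y = "Class", fill = "Count")

print(heat)
```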

Another option is to go back to the bar chart, with one of the categorical variables on the x-axis, the numeric variable (count) on the y-axis, and the second categorical variable represented by color.
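A stacked bar chart along these lines only needs a fill aesthetic. This sketch puts class on the x-axis and drive type in color, which is an assumption on my part; swapping the two works just as well.

```r
library(ggplot2)

# Bars are counts per class, split by drive type via fill color
stacked_bar <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  labs(x = "Class", y = "Count", fill = "Drive type")

print(stacked_bar)
```

Adding position = "dodge" inside geom_bar() places the colored segments side by side instead of stacking them, which can make individual counts easier to compare.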

Here again, we can see which of the combinations is the most frequently occurring, thus bringing us to a similar end.

This next plotting option is very similar, but rather than just representing the third dimension on color, we can also use faceting. Faceting is a technique that allows you to give each level of a categorical variable its own plot. Take a look below.

As mentioned here we can see a very similar plot to the previous ones but each level of class is broken out in its own plot.
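Faceting in ggplot2 is a one-line addition. The sketch below gives each class its own panel with facet_wrap(); the choice of drv on the x-axis is an assumption for illustration.

```r
library(ggplot2)

# One small bar chart of drive-type counts per class
faceted_bar <- ggplot(mpg, aes(x = drv, fill = drv)) +
  geom_bar() +
  facet_wrap(~ class) +
  labs(x = "Drive type", y = "Count") +
  theme(legend.position = "none")

print(faceted_bar)
```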

At this point, we’ve seen a very similar result with various charts. The considerations you have to weigh are: how well does a given plot convey your message or produce the necessary insight? Is it overly complex? Is it taking up too much space? What screen sizes will potential stakeholders be seeing your charts on?

Now we’re on to plots with more than two dimensions.

Number of dimensions: 3–5

Datatype: numeric & categorical

Dataset: mpg; sample dataset included with ggplot2

Purpose: Understanding the relationship between multiple variables

Charts: Scatter

Here we see the same city vs. highway MPG scatter plot as before with a single modification. To include the third dimension of engine displacement, we now scale the size of each point to correspond to engine displacement. As we can see, displacement seems to relate inversely to gas mileage: the larger points cluster at lower city and highway MPG.
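Mapping a third numeric variable to point size is a small change to the earlier scatter. In this sketch displ is the engine displacement column of mpg, and the transparency is my own addition to reduce overplotting.

```r
library(ggplot2)

# Point size encodes engine displacement (displ, in litres)
size_scatter <- ggplot(mpg, aes(x = cty, y = hwy, size = displ)) +
  geom_point(alpha = 0.4) +
  labs(x = "City MPG", y = "Highway MPG", size = "Displacement")

print(size_scatter)

# A quick numeric check of the inverse relationship
displ_cor <- c(cty = cor(mpg$displ, mpg$cty),
               hwy = cor(mpg$displ, mpg$hwy))
print(round(displ_cor, 2))
```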

Now an alternative option for adding a dimension is color.

Rather than using size, I’ve now included color to indicate engine displacement.

Below I’ve included an additional plot where we do both, which makes it even easier to see.

One thing to keep in mind is that instead of putting engine displacement in color and size, we could actually introduce a fourth numeric dimension.

Here I’ve swapped in cyl (number of cylinders) for the size dimension; it also appears to relate inversely to highway and city mpg.
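A four-variable version might look like the sketch below, with displacement on color and cylinders on size. Which variable goes on which aesthetic is a judgment call; this is just one arrangement.

```r
library(ggplot2)

# x and y: mileage; color: displacement; size: number of cylinders
four_dim <- ggplot(mpg, aes(x = cty, y = hwy, color = displ, size = cyl)) +
  geom_point(alpha = 0.5) +
  labs(x = "City MPG", y = "Highway MPG",
       color = "Displacement", size = "Cylinders")

print(four_dim)

# Cylinder count also moves inversely with mileage
cyl_cor <- c(cty = cor(mpg$cyl, mpg$cty),
             hwy = cor(mpg$cyl, mpg$hwy))
print(round(cyl_cor, 2))
```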

A couple of other options we could introduce here could be to use color to introduce a categorical variable or to facet according to a categorical variable.

Here you can see the same plot now faceted by class, which allows us to see how these numeric variables relate to one another across different levels of a categorical variable.
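As a rough sketch, faceting the multi-variable scatter by class is again a matter of adding facet_wrap(). I’m assuming displacement stays on color here.

```r
library(ggplot2)

# Same scatter, one panel per vehicle class
faceted_scatter <- ggplot(mpg, aes(x = cty, y = hwy, color = displ)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ class) +
  labs(x = "City MPG", y = "Highway MPG", color = "Displacement")

print(faceted_scatter)
```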

Before we wrap up, a final consideration is two-dimensional data where one of the variables represents time. A great rule of thumb for time is to use a line chart.

Number of dimensions: 2

Datatype: numeric & time

Dataset: economics; sample dataset included with ggplot2

Purpose: Understanding the relationship between time and a numeric variable

Chart: line

The data we’ll be looking at comes from the economics sample dataset and represents US unemployment over roughly five decades (1967 to 2015).

When charting a dataset with a time dimension, we are trying to identify the trend in a given numeric variable and to see whether particular events or activity coincide with its movement. Put your time dimension across the X-axis and whatever numeric variable you are measuring on the Y-axis.
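A minimal line chart sketch for this data follows. I’m assuming the unemploy column (number of unemployed, in thousands) is the measure of interest; the economics dataset also has other unemployment-related columns you could plot instead.

```r
library(ggplot2)

# Time on the X axis, the numeric measure on the Y axis
unemp_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

print(unemp_line)

# Span of the time dimension
print(range(economics$date))
```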

There is a lot more that could potentially be done even here, but I’ll save that for another time.

Takeaway:

• 1 dimension
    • numeric: histogram, box plot
    • categorical: table, bar chart, pie chart
• 2 dimensions
    • numeric/numeric: scatter
    • numeric/categorical: bar chart
    • categorical/categorical: bar chart, table, heatmap
    • numeric/time: line chart
• 3 dimensions
    • numeric/numeric/numeric: scatter with size/color
    • numeric/numeric/categorical: scatter with color/facets
    • numeric/categorical/categorical: bar chart with color/facets
• 4+ dimensions
    • Use varying combinations of datatypes and variables with your x & y axis, as well as color, fill, size, facets, etc.

There is so much you can do with data visualization & this is just the start of it. Here’s to hoping this helps you get started!

Add yourself to my email list if this was helpful; also be sure to let me know if you’d prefer code examples, additional information, etc.

Come check out some of my other posts at datasciencelessons.com & happy data science-ing!