Comprehensive Guide to Exploratory Data Analysis of Haberman’s Survival Data Set

Source: Artificial Intelligence on Medium

Exploratory Data Analysis (EDA) is a process of data analysis that aims to unearth the information hidden in a data set using statistical tools, plotting tools, linear algebra, and other techniques. It helps us understand the data better and highlights the main characteristics that may inform later predictions and forecasts.

Understanding data is core to data science, so EDA is an essential step before building accurate machine learning models. In this guide we use Python to perform various EDA steps on Haberman’s Survival Data Set. The data set contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

The various attributes of the data set are:

  1. Age of patient at the time of operation (numerical)
  2. Patient’s year of operation (stored as year minus 1900, so 58 means 1958; numerical)
  3. Number of positive axillary nodes detected (numerical)
  4. Survival status (class attribute) denoted as:
  • 1 — if the patient survived 5 years or longer
  • 2 — if the patient died within 5 years

Just like the medical diagnosis of patients plays a key role in the patient’s treatment lifecycle, EDA plays a vital role in data assessment and the creation of accurate models.

Importing Requisite Python Libraries

Python was chosen for its mature data analysis and machine learning libraries. Here we import the libraries required to perform data analysis and plotting:

  • Pandas (Python Data Analysis Library)
  • Numpy (Python Package for Scientific Computing)
  • Matplotlib (Python Plotting Library)
  • Seaborn (Python Statistical Data Visualization Library)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading Data Set

Haberman’s Survival Data Set is a comma-separated values (csv) file. The read_csv() function of Pandas is used to read the csv file (haberman.csv) into a DataFrame named haberman. A DataFrame is a two-dimensional, size-mutable tabular structure whose columns can hold different data types.

haberman = pd.read_csv('haberman.csv', header=0,
                       names=['Age of Patient', 'Year of Operation',
                              'Positive Axillary Nodes', 'Survival Status'])
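
To see what header=0 combined with names does, here is a minimal sketch that reads an in-memory CSV instead of haberman.csv. The header row (age,year,nodes,status) is an assumption about the file's layout, and the three data rows are taken from output shown later in this article, with the status codes inferred from their labels.

```python
import io
import pandas as pd

# Hypothetical stand-in for haberman.csv: an assumed header row plus three rows.
csv_text = "age,year,nodes,status\n34,59,0,2\n34,58,30,1\n42,62,20,1\n"

columns = ['Age of Patient', 'Year of Operation',
           'Positive Axillary Nodes', 'Survival Status']

# header=0 marks the first line as a header; names= then replaces its labels.
sample = pd.read_csv(io.StringIO(csv_text), header=0, names=columns)
print(sample.shape)
print(list(sample.columns))
```

If the file had no header row, header=None would be needed instead; otherwise the first patient record would be consumed as the header.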

Getting Glimpses of Data Set

Let’s get acquainted with the data set by doing some preliminary analysis. First of all, let’s see what the data set looks like. In the following code snippet, iterrows() is used to iterate through each row of the DataFrame.

for i, j in haberman.iterrows():
    print(j)   # j is the row as a Series; i is its index label
    print()

All the attributes of the data set are self-explanatory. Age of Patient is the age of the patient at the time of the operation. Year of Operation is the year in which the operation was performed. Positive Axillary Nodes denotes the number of positive axillary nodes (lymph nodes) detected in a patient, where zero means none were found. Positive axillary nodes are lymph nodes affected by cancer cells. Finally, Survival Status records whether the patient survived for 5 years or longer.

Observations:

  1. The csv file contains 306 rows and 4 columns, implying that the data set contains information about 306 patients who underwent surgery for breast cancer. Considering the volume of data, the data set is small.
  2. A patient’s diagnosis is based on the symptoms that the patient exhibits. As Positive Axillary Nodes is the only attribute of the data set that describes the disease itself, we can assume that it is a major indicator of how far the cancer has progressed. According to BreastCancer.org, to remove invasive breast cancer, the doctor removes one or more of the underarm lymph nodes (before or during surgery) so that they can be examined under a microscope for cancer cells. The presence of cancer cells is known as lymph node involvement.
  3. A second clinical attribute in the data set would have raised the question of which variable to prioritize in the analysis. As it stands, the preliminary analysis suggests that Positive Axillary Nodes is the most important variable.

The first five rows of data set can be seen by the head() function.

haberman.head()

Now let’s find the total number of data points and features (attributes) of the data set by using the Pandas shape attribute. A data point is a collection of attributes or features, that is, a complete record. In the given data set, a data point comprises the four attributes in one row of the DataFrame. The shape attribute returns a tuple of the form (number of rows, number of columns) representing the dimensionality of the DataFrame.

print(haberman.shape)
Output:
(306, 4)

The data set has 306 data points (rows) and 4 attributes (columns). The names of the attributes can be obtained through the columns attribute of the DataFrame.

print(haberman.columns)
Output:
Index(['Age of Patient', 'Year of Operation', 'Positive Axillary Nodes', 'Survival Status'],
dtype='object')

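Here dtype='object' is the data type of the column labels themselves, which are Python strings; it says nothing about the data stored in the columns. In Pandas, the object data type can hold arbitrary Python objects such as strings or mixed values.

The Survival Status attribute (the dependent variable) stores the classes as the integers 1 and 2. Since these are labels rather than quantities, we convert them to the descriptive values Survived and Died.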

haberman['Survival Status'] = haberman['Survival Status'].apply(
    lambda x: 'Survived' if x == 1 else 'Died')

Let’s verify whether the conversion has occurred.

print(haberman.head(10))

We can see now that the Survival Status has fields marked as Survived or Died.
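
As an alternative, the same relabelling can be sketched with a dictionary lookup followed by a conversion to the Pandas category dtype. The five status codes below are hypothetical and only stand in for the real column.

```python
import pandas as pd

# Hypothetical status codes: 1 = survived 5 years or longer, 2 = died within 5 years.
codes = pd.Series([1, 2, 1, 1, 2], name='Survival Status')

# Map each code to a label, then store the result as a categorical column.
status = codes.map({1: 'Survived', 2: 'Died'}).astype('category')

print(status.tolist())
print(list(status.cat.categories))
```

One difference from the lambda version is that an unexpected code becomes a missing value here instead of silently being labelled Died.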

A concise summary of the data set can be displayed by the info method. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

print(haberman.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
Age of Patient             306 non-null int64
Year of Operation          306 non-null int64
Positive Axillary Nodes    306 non-null int64
Survival Status            306 non-null object
dtypes: int64(3), object(1)
memory usage: 9.7+ KB
None

Observations:

  1. The data type of the first three columns namely Age of Patient, Year of Operation and Positive Axillary Nodes is an integer. Survival Status has an object data type. The data set has four data columns.
  2. The data set has no null values.
  3. Memory used by data set is approximately 9.7 KB
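
The absence of null values can also be checked directly. The sketch below uses a hypothetical miniature frame with one deliberately missing value to show what a non-zero result would look like; on the haberman DataFrame every count would be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one deliberately missing value.
sample = pd.DataFrame({
    'Age of Patient': [34, 39, np.nan],
    'Positive Axillary Nodes': [0, 0, 30],
})

# isnull() flags missing cells; summing the flags counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```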

Pandas describe method generates descriptive statistics that include information that summarizes the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided.

print(haberman.describe(include='all'))
Output:
        Age of Patient  Year of Operation  Positive Axillary Nodes Survival Status
count       306.000000         306.000000               306.000000             306
unique             NaN                NaN                      NaN               2
top                NaN                NaN                      NaN        Survived
freq               NaN                NaN                      NaN             225
mean         52.457516          62.852941                 4.026144             NaN
std          10.803452           3.249405                 7.189654             NaN
min          30.000000          58.000000                 0.000000             NaN
25%          44.000000          60.000000                 0.000000             NaN
50%          52.000000          63.000000                 1.000000             NaN
75%          60.750000          65.750000                 4.000000             NaN
max          83.000000          69.000000                52.000000             NaN

Observations:

  1. The method gives the total count of each attribute
  2. For numeric data(variables like Age of Patient, Year of Operation, Positive Axillary Nodes), the method provides valuable information on standard deviation(std), mean, percentiles(25%, 50%, and 75%), min and max values. The 50th percentile(50%) is median. Hence, a summary of central tendencies(mean and median) and dispersion(standard deviation) is obtained.
  3. Using min and max values, the following inferences can be made:
  • The maximum age of a patient is 83 and the minimum age is 30.
  • The year of operation ranges from 58 (1958) to 69 (1969).
  • One or more patients had 52 positive axillary nodes and one or more patients had zero positive axillary nodes.

  4. For the object data type (Survival Status), the result includes unique, top and freq (frequency). Variables with a numerical data type show NaN in the corresponding fields. Survival Status has two unique values (Survived and Died). top is the most common value, so Survived is the most common survival status. freq is the frequency of the most common value, which here is 225. So the total number of patients who survived is 225.

We can confirm the total number of patients who survived with the value_counts() method.

print(haberman['Survival Status'].value_counts())
Output:
Survived 225
Died 81
Name: Survival Status, dtype: int64

Hence we can conclude that more patients survived breast cancer (225) than died of it (81). With roughly 74% of the data points in one class, the data set is imbalanced.
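
The degree of imbalance can be quantified with value_counts(normalize=True). The sketch below rebuilds the class labels from the two counts reported above rather than reading the file, so it runs on its own.

```python
import pandas as pd

# Rebuild the class labels from the counts reported above (225 and 81).
status = pd.Series(['Survived'] * 225 + ['Died'] * 81, name='Survival Status')

# normalize=True returns proportions instead of raw counts.
shares = status.value_counts(normalize=True)
print(shares.round(4))
```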

Objective

The objective of this EDA is to find out how well the attributes Age of Patient, Year of Operation and Positive Axillary Nodes indicate whether a patient will survive for 5 years or longer.

Different Levels of Analysis

Now let’s dive deeper into the data set. For that, it’s imperative to consider the different levels of analysis that exist. They are:

  • Univariate Analysis
  • Bivariate Analysis
  • Multivariate Analysis

The selection of the data analysis technique ultimately depends on the number of variables, the data types, and the focus of the statistical inquiry.

Univariate Analysis

Univariate analysis is the simplest data analysis technique that deals with only one variable. Being a single variable process, it does not give insights on the cause or effect relationships. The primary objective of the univariate analysis is to simply describe the data to find patterns within the data. The univariate analysis methods being considered are:

  1. 1-D Scatter Plot
  2. Probability Density Function(PDF)
  3. Cumulative Distribution Function(CDF)
  4. Box Plot
  5. Violin Plot

Bivariate Analysis

Bivariate analysis examines the relationship between two variables and is more analytical than univariate analysis. If the data seems to fit a line or curve, then there is a relationship or correlation between the two variables. The bivariate analysis methods being considered are:

  1. 2-D Scatter Plot
  2. Pair Plot

Multivariate Analysis

Multivariate analysis is a more complex statistical analysis. It is the analysis involving three or more variables and is implemented in a scenario where there is a need to understand the relationship between them. The multivariate analysis method being considered is:

  1. Contour Plot

Modus Operandi

The analysis will start with the bivariate analysis. A 2-D scatter plot will be plotted first, followed by observations on it. Then we will move on to the pair plot to see both the distribution of single variables and the relationship between two variables. Afterward, the univariate and multivariate analyses will be conducted.

2-D Scatter Plot

The two-dimensional scatter plot helps to visualize a correlation between two variables using Cartesian coordinates. The values of one variable will be plotted along the x-axis and the other variable on the y-axis. The data will be plotted in the resultant quadrant as an ordered pair(x, y) in which x relates to value on x-axis and y relates to y-axis value.

sns.set_style('whitegrid')
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
.map(plt.scatter, 'Age of Patient', 'Positive Axillary Nodes') \
.add_legend()
plt.show()

FacetGrid is a multi-plot grid for plotting conditional relationships. FacetGrid object takes a DataFrame as input and the names of the variables that will form the row, column, or hue dimensions of the grid. The variables should be categorical and the data at each level of the variable will be used for a facet along that axis. The map() method is responsible for repeating the same plot on each space of the grid. It applies a plotting function to each facet’s subset of the data. The add_legend() method creates the legend of the plot.

Observations:

  1. The majority of patients in the age group 30–40 have survived breast cancer, as there are very few orange dots here.
  2. It is very rare for patients to have more than 25 (or rather 30) positive axillary nodes.
  3. Almost all patients in the age group 50–60 have survived when there is an absence of positive axillary nodes. We can assume this from the absence of orange dots between 50 and 60.
  4. All patients above the age of 80 have died within five years after the operation, as there are no blue dots here.
  5. A few patients with a higher number of positive axillary nodes (greater than 10) have also survived breast cancer (presence of blue dots along Positive Axillary Nodes > 10).

Ascertaining Observations:
We can ascertain the observation no. 1 by carrying out the following operation on the haberman DataFrame.

df_3040 = haberman.loc[(haberman['Age of Patient']<=40) & (haberman['Age of Patient']>=30)]
#print(df_3040)
df_3040_survived = df_3040.loc[df_3040['Survival Status']=='Survived']
print('No. of patients in the age group 30-40 survived: {0}' .format(len(df_3040_survived)))
df_3040_died = df_3040.loc[df_3040['Survival Status']=='Died']
print('No. of patients in the age group 30-40 died: {0}' .format(len(df_3040_died)))
Output:
No. of patients in the age group 30-40 survived: 39
No. of patients in the age group 30-40 died: 4

The output confirms observation no. 1.
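
The same check can be written more compactly with Series.between() and value_counts(). This is only a sketch on a hypothetical six-row frame; with the haberman DataFrame in place of sample it should reproduce the two counts above.

```python
import pandas as pd

# Hypothetical miniature data set; the real analysis uses the haberman DataFrame.
sample = pd.DataFrame({
    'Age of Patient': [30, 34, 38, 40, 41, 52],
    'Survival Status': ['Survived', 'Died', 'Survived', 'Survived', 'Died', 'Survived'],
})

# between() is inclusive on both ends by default, matching >= 30 and <= 40.
in_group = sample['Age of Patient'].between(30, 40)
counts = sample.loc[in_group, 'Survival Status'].value_counts()
print(counts)
```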

We can ascertain the value 25(assumed by the blue dot at the midpoint of 20 and 30) mentioned in the observation no. 2 by:

ax_node = haberman['Positive Axillary Nodes'].unique() #unique values of axillary node
ax_node.sort() #sorted the list
print(ax_node)
Output:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 28 30 35 46 52]

The list has value 25. So we can safely assume that the blue dot is at value 25.

We can ascertain observation no. 4 by the following operation:

age = haberman['Age of Patient']
count = 0
print(len(age))
for i in age:
    if i >= 80:
        count += 1
print('No. of patients whose age is greater than or equal to 80: {0}'.format(count))
Output:
306
No. of patients whose age is greater than or equal to 80: 1

Hence there is only one patient whose age is greater than or equal to 80. The orange dot after 80 must represent this patient.
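
The counting loop can also be expressed as a vectorized comparison. The ages below are hypothetical; in the article's setting the Series would be haberman['Age of Patient'].

```python
import pandas as pd

# Hypothetical ages; in the article this would be haberman['Age of Patient'].
age = pd.Series([34, 52, 65, 78, 83])

# The comparison yields a boolean Series; True counts as 1 when summed.
count = int((age >= 80).sum())
print('No. of patients whose age is greater than or equal to 80: {0}'.format(count))
```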

Pair Plots

Pair plot plots pairwise relationships in a dataset. It will create a grid of Axes such that each numeric variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal axes are different from the rest of the grid of Axes, and they show the univariate distribution of the data for the variable in that column.

sns.set_style('whitegrid')
sns.pairplot(haberman, hue='Survival Status', height=4)
plt.show()

In the data set, there are 3 quantitative variables(Age of Patient, Year of Operation, Positive Axillary Nodes) and 1 categorical variable(Survival Status). Only numeric(integer or float) continuous values will be plotted in pair plot. Hence in the pair plot shown above, we have 3C2 plots (i.e. 3 unique plots). There is an equal number of plots on either side of the diagonal, and they are the mirror image of each other. The diagonal plots(plot 1, plot 5 and plot 9) demonstrate histograms that showcase the distribution of a single variable.
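
The count of unique plots can be verified by enumerating the unordered pairs of numeric columns. The short sketch below uses only the standard library and the column names defined earlier.

```python
from itertools import combinations
from math import comb

numeric_columns = ['Age of Patient', 'Year of Operation', 'Positive Axillary Nodes']

# Each unordered pair of numeric columns yields one unique scatter plot.
unique_pairs = list(combinations(numeric_columns, 2))
for x_name, y_name in unique_pairs:
    print(x_name, 'vs', y_name)

print(comb(len(numeric_columns), 2))
```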

Observations:

  1. There is no linear separation in any of the plots
  2. There is considerable overlapping of data in each plot.
  3. In plot 2, the Year of Operation is on the x-axis and the Age of Patient is on the y-axis. There is so much overlap that it is difficult to classify patients based on this plot. One interesting observation is that the majority of the patients who were operated on in 1961 and 1968 survived 5 years or longer (there are very few orange dots compared to other years). In other words, 1961 and 1968 appear to have the fewest deaths among patients who had undergone breast cancer operations.
  4. In plot 3, Positive Axillary Nodes is on the x-axis and the Age of Patient is on the y-axis. Even though there is an overlapping of dots, there are distinguishable patterns that enable us to make inferences. Plot 3(and Plot 7) appears to be better than the rest of the plots( We have analyzed the same plot in-depth during 2-D Scatter Plot ). So Positive Axillary Nodes and Age of Patient are the most useful features to identify the survival status of a patient.
  5. In plot 6, Positive Axillary Nodes is on the x-axis and the Year of Operation is on the y-axis. The plot has the most overlapping of dots. Hence it will not lead to any meaningful conclusions or classifications.
  6. Plot 4 is a mirror image of plot 2. Plot 7 is a mirror image of plot 3. Plot 8 is a mirror image of plot 6
  7. Finally, plot 7 and plot 3 are the best plots to be considered for data analysis.

Ascertaining Observations:
We can ascertain the observation no. 3 of plot 2 by carrying out the following operations on the haberman DataFrame:

df_1961 = haberman.loc[haberman['Year of Operation']==61]
df_1968 = haberman.loc[haberman['Year of Operation']==68]
#print(df_1961)
df_1961_survived = df_1961.loc[df_1961['Survival Status']=='Survived']
print('No. of patients survived during 1961: {0}' .format(len(df_1961_survived)))
df_1961_died = df_1961.loc[df_1961['Survival Status']=='Died']
print('No. of patients died during 1961: {0}' .format(len(df_1961_died)))
aster = '*'
print(aster*45)
df_1968_survived = df_1968.loc[df_1968['Survival Status']=='Survived']
print('No. of patients survived during 1968: {0}' .format(len(df_1968_survived)))
df_1968_died = df_1968.loc[df_1968['Survival Status']=='Died']
print('No. of patients died during 1968: {0}' .format(len(df_1968_died)))
Output:
No. of patients survived during 1961: 23
No. of patients died during 1961: 3
*********************************************
No. of patients survived during 1968: 10
No. of patients died during 1968: 3
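
Rather than filtering one year at a time, a cross-tabulation gives the survived and died counts for every year in a single table. The six rows below are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical miniature data set; the real analysis uses the haberman DataFrame.
sample = pd.DataFrame({
    'Year of Operation': [61, 61, 61, 68, 68, 65],
    'Survival Status': ['Survived', 'Survived', 'Died', 'Survived', 'Died', 'Died'],
})

# Rows are years, columns are survival labels, cells are patient counts.
table = pd.crosstab(sample['Year of Operation'], sample['Survival Status'])
print(table.loc[[61, 68]])
```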

1-D Scatter Plot

The scatter plot in which a single variable is used to make inferences is a 1-D scatter plot. Here the variable will be on the x-axis and the y-axis will have zeros(as it is impossible to make a plot without two axes). Obviously, it is a univariate analysis.

df_survived = haberman.loc[haberman['Survival Status'] == 'Survived'] 
df_died = haberman.loc[haberman['Survival Status'] == 'Died']
plt.plot(df_survived['Positive Axillary Nodes'], np.zeros_like(df_survived['Positive Axillary Nodes']), 'o')
plt.plot(df_died['Positive Axillary Nodes'], np.zeros_like(df_died['Positive Axillary Nodes']), 'o')
plt.show()

In the code above, haberman.loc[] with a boolean condition selects the rows of the haberman DataFrame that satisfy the condition and returns them as a new DataFrame. np.zeros_like() creates an array of zeros with the same shape as its argument. The format string 'o' tells Matplotlib to draw each data point as a circular marker instead of connecting the points with a line.

Observations:

  1. The 1-D scatter plot is based on one feature: Positive Axillary Nodes.
  2. There is significant overlap in the data, which prevents any meaningful observations.

Histogram

A histogram is an accurate representation of numerical data distribution that was first introduced by Karl Pearson. It gives an estimate of continuous variables’ probability distribution. The histogram is a univariate analysis as it relates to only one variable.

The first step in constructing a histogram is to ‘bin’ (or bucket) the range of values: divide the entire range into a series of intervals, and then count the number of values that fall in each interval. The bins are consecutive, non-overlapping, adjacent intervals of a variable. They are often of equal size, although this is not required. Every bin except the last (right-most) one is half-open, that is, it includes its left edge but not its right edge; the last bin includes both edges.

For equal-sized bins, a rectangle is erected over each bin with height proportional to the number of cases in that bin (frequency or count).
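
The binning rules can be seen on a tiny example. The six values below are hypothetical node counts split into two equal-width bins with np.histogram().

```python
import numpy as np

values = np.array([0, 1, 2, 2, 3, 4])   # hypothetical node counts

# Two equal-width bins over [0, 4]: [0, 2) and [2, 4].
counts, bin_edges = np.histogram(values, bins=2)
print(bin_edges)
print(counts)
```

The value 2 sits exactly on the inner edge and is counted in the second bin, while the value 4 is still counted because the last bin is closed on the right.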

Matplotlib is used to plot histogram and Numpy is used to calculate count and bin edges.

import matplotlib.pyplot as plt
import numpy as np
df_axnodes = haberman['Positive Axillary Nodes'] #Series of Positive Axillary Nodes
count, bin_edges = np.histogram(df_axnodes, bins=25)
print('Bin Edges: ', bin_edges)
print('Counts per Bin: ', count)
plt.hist(df_axnodes, bins=25, color='skyblue', alpha=0.7)
plt.xlabel('Positive Axillary Nodes', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
Output:
Bin Edges: [ 0. 2.08 4.16 6.24 8.32 10.4 12.48 14.56 16.64 18.72 20.8 22.88 24.96 27.04 29.12 31.2 33.28 35.36 37.44 39.52 41.6 43.68 45.76 47.84 49.92 52. ]
Counts per Bin: [197 33 13 14 9 6 9 4 2 5 4 4 1 1 1 0 1 0 0 0 0 0 1 0 1]

Matplotlib plots the histogram using plt.hist(), which accepts array-like input such as a Pandas Series. bin_edges holds the edges of all bins, from the left edge of the first bin to the right edge of the last bin, so 25 bins produce 26 edges. The color parameter sets the color of the bars and the alpha parameter sets their transparency. plt.xlabel() and plt.ylabel() set the labels of the x-axis and y-axis respectively.

Observations

  • 197 out of 306 patients have fewer than 2.08 positive axillary nodes, that is, 0, 1 or 2 nodes. So the majority (64.38%) of patients have a small number of positive axillary nodes.
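
The bin width and the percentage quoted above can be re-derived from the numbers in the output. This is a small arithmetic sketch that assumes the 25 bins span exactly 0 to 52, as the printed edges indicate.

```python
import numpy as np

# 25 equal-width bins over the observed range 0 to 52 need 26 edges.
bin_edges = np.linspace(0, 52, 26)
print(bin_edges[:3])

# Share of patients in the first bin, using the counts printed above.
share_first_bin = 197 / 306 * 100
print(round(share_first_bin, 2))
```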

Probability Density Function(PDF)

PDF is used to specify the probability of the random variable falling within a particular range of values. It is the probability function used to describe a continuous probability distribution. PDF is used to deal with the probabilities of random variables that have continuous outcomes. The height of a person arbitrarily chosen from a population is a typical example.

PDF is a smoothed version of the histogram. The smoothing is done using Kernel Density Estimation (KDE). The area under the PDF curve always equals 1. PDF is a univariate analysis.
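
To make the idea of KDE concrete, here is a minimal hand-rolled sketch using Gaussian kernels in Numpy. The sample ages and the bandwidth of 5 are assumptions chosen for illustration; Seaborn selects its bandwidth automatically, so its curves will not match this one exactly.

```python
import numpy as np

# Hypothetical ages and an assumed bandwidth, chosen only for illustration.
samples = np.array([30.0, 34.0, 38.0, 42.0, 52.0, 60.0, 83.0])
bandwidth = 5.0

# Evaluate the estimate on a grid wide enough to cover the kernel tails.
grid = np.linspace(samples.min() - 5 * bandwidth,
                   samples.max() + 5 * bandwidth, 2001)

# One Gaussian bump per sample, averaged so the total area stays 1.
z = (grid[:, None] - samples[None, :]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
density = kernels.mean(axis=1)

# Approximate the area under the curve with a Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(round(float(area), 3))
```

The printed area should be very close to 1, which is the property of a PDF stated above.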

The code snippet shown below will plot PDF.

PDF based on Age of Patient

sns.set_style('whitegrid')
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
.map(sns.distplot, 'Age of Patient') \
.add_legend()
plt.show()

The bars(orange and blue) are the histograms, and the curves represent the PDF.

Observations:

  1. There is significant overlap in the data, which makes the plot ambiguous.
  2. Patients in the age group 30–40 have more survival chances than other age groups.
  3. Patients in the age group 40–60 have fewer prospects of survival.
  4. The age group of 40–45 recorded the highest number of deaths (have the least possibility of survival).
  5. We cannot make final conclusions about a patient’s survival chances based on the attribute ‘Age of Patient’.

PDF based on Year of Operation

sns.set_style('whitegrid')
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
.map(sns.distplot, 'Year of Operation') \
.add_legend()
plt.show()

Observations:

  1. Major overlapping can be observed.
  2. The plot provides information about the number of successful operations(in which patients survived) and the unsuccessful ones. The success of an operation cannot be based on a year as a factor.
  3. Most unsuccessful operations were performed in the year 1965, followed by 1960.
  4. Most successful operations were performed in the year 1961.

PDF based on Positive Axillary Nodes

sns.set_style('whitegrid')
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
.map(sns.distplot, 'Positive Axillary Nodes') \
.add_legend()
plt.show()

Observations:

  1. The presence of positive axillary nodes (lymph node involvement) can be an obvious manifestation of breast cancer. BreastCancer.org lists it as an important symptom on its website. Positive Axillary Nodes thus carries more significance than the rest of the attributes.
  2. Patients with zero positive axillary nodes have a much higher chance of survival than patients who have them.
  3. Patients with a single positive axillary node also have a good chance of survival.
  4. The likelihood of surviving breast cancer decreases as the number of positive axillary nodes increases.
  5. Only a small number of patients have more than 25 positive axillary nodes.
  6. Positive Axillary Nodes is the preferred attribute for data analysis.

Ascertaining Observations:
We can ascertain the observations by carrying out the following operation on the haberman DataFrame:

df_one = haberman.loc[haberman['Positive Axillary Nodes']<=1]
df_less = haberman.loc[(haberman['Positive Axillary Nodes']<=25) & (haberman['Positive Axillary Nodes']>1)]
df_more = haberman.loc[haberman['Positive Axillary Nodes']>25]
df_one_survived = df_one.loc[df_one['Survival Status']=='Survived']
print('No. of patients survived(with one or no positive nodes): {0}' .format(len(df_one_survived)))
df_one_died = df_one.loc[df_one['Survival Status']=='Died']
print('No. of patients died(with one or no positive nodes): {0}' .format(len(df_one_died)))
aster = '*'
print(aster*65)
df_less_survived = df_less.loc[df_less['Survival Status']=='Survived']
print('No. of patients survived(1<positive nodes<=25): {0}' .format(len(df_less_survived)))
df_less_died = df_less.loc[df_less['Survival Status']=='Died']
print('No. of patients died(1<positive nodes<=25): {0}' .format(len(df_less_died)))
print(aster*65)
df_more_survived = df_more.loc[df_more['Survival Status']=='Survived']
print('No. of patients survived(25<positive nodes<=52): {0}' .format(len(df_more_survived)))
df_more_died = df_more.loc[df_more['Survival Status']=='Died']
print('No. of patients died(25<positive nodes<=52): {0}' .format(len(df_more_died)))
Output:
No. of patients survived(with one or no positive nodes): 150
No. of patients died(with one or no positive nodes): 27
*****************************************************************
No. of patients survived(1<positive nodes<=25): 72
No. of patients died(1<positive nodes<=25): 52
*****************************************************************
No. of patients survived(25<positive nodes<=52): 3
No. of patients died(25<positive nodes<=52): 2

The output has ascertained the observations. We can make the following conclusions:

  1. 85% of patients with one or zero positive axillary nodes (150 of 177) survived five years or longer.
  2. 58% of patients with more than 1 and at most 25 positive axillary nodes (72 of 124) survived five years or longer.
  3. 60% of patients with more than 25 positive axillary nodes (3 of 5) survived five years or longer, although this group is too small to draw firm conclusions from.
  4. These statistics suggest that the survival chances of patients are high if the number of positive axillary nodes is one or zero. If the number is greater than one, the observed survival rate drops to roughly 58% to 60%.
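
The percentages follow directly from the counts in the output. The sketch below recomputes them; the group labels are informal names introduced here for readability.

```python
# (survived, died) counts per group, copied from the output above.
groups = {
    '0 or 1 nodes': (150, 27),
    '2 to 25 nodes': (72, 52),
    'more than 25 nodes': (3, 2),
}

survival_rates = {}
for label, (survived, died) in groups.items():
    total = survived + died
    survival_rates[label] = round(100 * survived / total)
    print('{0}: {1}% of {2} patients survived'.format(label, survival_rates[label], total))
```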

Cumulative Distribution Function(CDF)

The cumulative distribution function (CDF) of a real-valued random variable X, evaluated at x, is the probability that X takes a value less than or equal to x:
F(x) = P(X <= x)
The probability that X lies in the semi-closed interval (a, b], where a < b, is therefore:
P(a < X <= b) = F(b) - F(a)

Integrating the Probability Density Function (PDF) gives the CDF. CDF is also a univariate analysis.

CDF is plotted using the selected variable ‘Positive Axillary Nodes’.

df_axnodes_survived = haberman.loc[haberman['Survival Status']=='Survived']
counts1, bin_edges1 = np.histogram(df_axnodes_survived['Positive Axillary Nodes'], bins=10, density=True)
pdf1 = counts1/(sum(counts1))
print('PDF of patients survived 5 years or longer:', pdf1)
print('Bin Edges: ', bin_edges1)
cdf1 = np.cumsum(pdf1)
aster = '*'
print(aster * 60)
df_axnodes_died = haberman.loc[haberman['Survival Status']=='Died']
counts2, bin_edges2 = np.histogram(df_axnodes_died['Positive Axillary Nodes'], bins=10, density=True)
pdf2 = counts2/(sum(counts2))
print('PDF of patients died within 5 years:', pdf2)
print('Bin Edges: ', bin_edges2)
cdf2 = np.cumsum(pdf2)
print(aster * 60)
line1, = plt.plot(bin_edges1[1:], pdf1, label='PDF_Survived')
line2, = plt.plot(bin_edges1[1:], cdf1, label='CDF_Survived')
line3, = plt.plot(bin_edges2[1:], pdf2, label='PDF_Died')
line4, = plt.plot(bin_edges2[1:], cdf2, label='CDF_Died')
plt.legend(handles=[line1, line2, line3, line4])
plt.xlabel('Positive Axillary Nodes', fontsize=15)
plt.show()
Output:
PDF of patients survived 5 years or longer: [0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444 0.00888889 0. 0. 0.00444444]
Bin Edges: [ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]
************************************************************
PDF of patients died within 5 years: [0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0. 0.01234568 0. 0. 0.01234568]
Bin Edges: [ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]
************************************************************

Numpy’s np.histogram() is used to calculate the counts and bin edges, and Matplotlib is used to plot the resulting curves with plt.plot(). bin_edges holds all the bin edges, from the left edge of the first bin to the right edge of the last bin; the slice bin_edges[1:] plots each value against the right edge of its bin. np.cumsum() is a Numpy function that calculates the cumulative sum. plt.legend() is a Matplotlib function for generating the legend of the graph, and plt.xlabel() labels the x-axis.
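
The relation P(a < X <= b) = F(b) - F(a) can be checked on the survivors' bins. In this sketch the per-bin counts are not printed anywhere in the article; they are recovered by multiplying the proportions in the output above by the 225 survivors, which is an assumption about how those proportions arose.

```python
import numpy as np

# Per-bin counts of surviving patients, recovered by multiplying the
# proportions printed above by the 225 survivors.
counts = np.array([188, 18, 5, 6, 4, 1, 2, 0, 0, 1])
bin_edges = np.linspace(0, 46, 11)        # 0, 4.6, 9.2, ..., 46

pmf = counts / counts.sum()               # probability mass per bin
cdf = np.cumsum(pmf)                      # F evaluated at each right bin edge

# P(4.6 < X <= 9.2) = F(9.2) - F(4.6)
prob_second_bin = cdf[1] - cdf[0]
print(round(float(prob_second_bin), 2))
```

The difference of the two CDF values equals the mass of the second bin, and the last CDF value is 1 because every survivor falls in some bin.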

Observations:

  1. Even patients with a high number of positive axillary nodes have survived breast cancer. Conversely, some patients with no positive axillary nodes died within five years of the operation.
  2. The maximum number of positive axillary nodes for a patient who survived is 46.
  3. 83.56% of patients who survived had positive axillary nodes in the range of 0 to 4.6.
  4. 56.79% of patients who died had positive axillary nodes in the range of 0 to 5.2.

Ascertaining Observations:
We can ascertain the observation no. 1 by carrying out the following operations on the haberman DataFrame:

df_axnodes_died = haberman.loc[haberman['Survival Status']=='Died']
df_no_axnodes_died = df_axnodes_died.loc[df_axnodes_died['Positive Axillary Nodes']==0]
print('No. of patients died with zero Positive Axillary Node: ', len(df_no_axnodes_died))
df_axnodes_survived = haberman.loc[haberman['Survival Status']=='Survived']
df_high_axnodes_survived = df_axnodes_survived.loc[df_axnodes_survived['Positive Axillary Nodes']>=20]
print('No. of patients survived with high Positive Axillary Nodes(>=20): ', len(df_high_axnodes_survived))
print(df_no_axnodes_died) #rows behind the first count
print(df_high_axnodes_survived) #rows behind the second count
Output:
No. of patients died with zero Positive Axillary Node: 19
No. of patients survived with high Positive Axillary Nodes(>=20): 7
     Age of Patient  Year of Operation  Positive Axillary Nodes Survival Status
7                34                 59                        0            Died
34               39                 66                        0            Died
44               41                 64                        0            Died
45               41                 67                        0            Died
54               42                 59                        0            Died
64               43                 64                        0            Died
65               43                 64                        0            Died
81               45                 66                        0            Died
97               47                 62                        0            Died
98               47                 65                        0            Died
114              49                 63                        0            Died
125              50                 64                        0            Died
224              60                 65                        0            Died
230              61                 65                        0            Died
239              62                 58                        0            Died
258              65                 58                        0            Died
268              66                 58                        0            Died
285              70                 58                        0            Died
293              72                 63                        0            Died
     Age of Patient  Year of Operation  Positive Axillary Nodes Survival Status
9                34                 58                       30        Survived
59               42                 62                       20        Survived
174              54                 67                       46        Survived
188              55                 69                       22        Survived
227              60                 61                       25        Survived
252              63                 61                       28        Survived
254              64                 65                       22        Survived

Box Plot

Box plot is a visual representation of the distribution of data based on the five-number summary. These five numbers are the minimum or smallest number, the first quartile (Q1) or 25th percentile, the median (Q2) or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum or largest number. Q1 is the middle value between the minimum and the median. Q3 is the middle value between the median and the maximum. The InterQuartile Range (IQR) is the difference between the third quartile and the first quartile.
IQR = Q3 - Q1
The height of the box represents the IQR. In a vertical box plot, the top line and bottom line of the box represent the third quartile and the first quartile respectively. The line inside the box represents the median. The lines extending from the box are known as the ‘whiskers’, which indicate variability outside the upper and lower quartiles. Outliers are sometimes plotted as individual dots in line with the whiskers. An outlier is a data point that differs significantly from the other observations; it lies outside the overall pattern of the distribution.

The Box and Whisker Plot was first introduced by mathematician John Tukey in 1969. Box Plots can be drawn either vertically or horizontally. Although Box Plots may seem primitive in comparison to a Histogram or Density Plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets.

Seaborn, Python's statistical data visualization library, is used to draw the box plot.

sns.boxplot(x='Survival Status', y='Positive Axillary Nodes', data=haberman)
plt.show()

Violin Plot

A Violin Plot is used to visualize the distribution of the data and its probability density. It is a combination of the box plot with a rotated kernel density plot on either side to show the distribution shape of the data. The white dot in the middle is the median value and the thick black bar at the center represents the interquartile range. The thin black line extending from it represents the rest of the distribution, excluding points treated as outliers under the 1.5 × IQR rule. Violin plots are similar to box plots except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Combining the best of both worlds (Histogram and PDF, Box Plot) gives the violin plot. The violin plot is a univariate analysis.

sns.violinplot(x='Survival Status', y='Positive Axillary Nodes', data=haberman)
plt.show()

Observations:

  1. The IQR shows where the bulk of the values lie. Most patients who survived had fewer than 3 positive axillary nodes, whereas patients who died tended to have more than 2.
  2. The presence of points outside the whiskers indicates the presence of outliers. The number of outliers in the Survived category (patients who survived 5 years or longer) is considerably higher than in the Died category (patients who died within 5 years).
  3. The Q1 and median of the Survived category are almost the same. The median of the Died category and the Q3 of the Survived category lie at roughly the same level. This overlap may result in an error of at least 15% to 20%, so it is difficult to set a threshold that separates patients by their chances of survival.
  4. The majority of patients who had no positive axillary nodes survived breast cancer. Similarly, the majority of patients with a larger number of positive axillary nodes died.
  5. There are exceptions, though: a few patients with a large number of positive axillary nodes survived, and a few patients with no positive axillary nodes died.
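
The difficulty of choosing a threshold can be checked numerically. The sketch below assumes a simple rule, "predict Died when the node count exceeds a threshold", and measures how often that rule disagrees with the recorded status. The function threshold_error and the toy DataFrame are hypothetical and only mimic the column names used in this article, so the numbers it prints say nothing about the real data set.

```python
import numpy as np
import pandas as pd


def threshold_error(df, threshold,
                    value_col="Positive Axillary Nodes",
                    label_col="Survival Status"):
    """Fraction of rows misclassified by the rule 'nodes > threshold means Died'."""
    predicted = np.where(df[value_col] > threshold, "Died", "Survived")
    return float((predicted != df[label_col].to_numpy()).mean())


# Made-up rows for illustration only, not real patient records
toy = pd.DataFrame({
    "Positive Axillary Nodes": [0, 0, 1, 2, 3, 20, 0, 4, 6, 11],
    "Survival Status": ["Survived"] * 6 + ["Died"] * 4,
})

# Try a few candidate thresholds and report the error of each
errors = {t: threshold_error(toy, t) for t in range(0, 6)}
for t, err in errors.items():
    print(f"threshold={t}: error={err:.2f}")
```

Running the same function on the haberman DataFrame for a range of thresholds would show whether any single cut-off gets below the overlap-driven error mentioned above.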

Contour Plot

A contour plot is a multivariate analysis. It is not a normalization technique; rather, it is a graphical technique for representing a three-dimensional surface in a two-dimensional format by plotting slices of constant z, called contours. Seaborn is used to draw the contour plot.

sns.jointplot(x='Age of Patient', y='Positive Axillary Nodes', data=haberman, kind='kde')
plt.show()
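
To see what the contours encode, the sketch below builds the underlying surface by hand: a density value z for every (age, nodes) grid point, obtained by summing Gaussian kernels centered on the observations. This is a simplified stand-in with fixed, hand-picked bandwidths, not Seaborn's actual estimator, and the function kde_grid and the sample points are invented for illustration.

```python
import numpy as np


def kde_grid(x, y, bandwidth_x, bandwidth_y, grid_size=50):
    """Evaluate a product-Gaussian kernel density estimate on a regular grid.

    Hypothetical helper: the bandwidths are fixed inputs (an assumption),
    whereas plotting libraries usually pick them automatically.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    gx = np.linspace(x.min(), x.max(), grid_size)
    gy = np.linspace(y.min(), y.max(), grid_size)
    xx, yy = np.meshgrid(gx, gy)
    # Distance of every grid point to every observation, in bandwidth units
    dx = (xx[..., None] - x) / bandwidth_x
    dy = (yy[..., None] - y) / bandwidth_y
    density = np.exp(-0.5 * (dx ** 2 + dy ** 2)).sum(axis=-1)
    density /= 2 * np.pi * bandwidth_x * bandwidth_y * len(x)
    return gx, gy, density


# Made-up points for illustration only, not real patient records
ages = [50, 52, 53, 54, 55, 70]
nodes = [0, 1, 0, 2, 1, 20]
gx, gy, density = kde_grid(ages, nodes, bandwidth_x=3.0, bandwidth_y=2.0)

# The innermost contour surrounds the grid point with the highest density
iy, ix = np.unravel_index(density.argmax(), density.shape)
peak_age, peak_nodes = gx[ix], gy[iy]
print("Densest region near age", round(peak_age, 1), "and nodes", round(peak_nodes, 1))
```

Passing gx, gy and density to Matplotlib's plt.contour would draw the lines of constant density, which is the kind of picture the joint plot shows in its central panel.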

Observations:

  1. Of all the patients with positive axillary nodes less than or equal to two, the largest concentration falls in the age group of 50–56.

Ascertaining Observation:
We can ascertain the observation by carrying out the following operations on the haberman DataFrame:

df_axnodes_zero = haberman.loc[haberman['Positive Axillary Nodes']<=2]
print('No. of patients with positive axillary nodes<=2: ', len(df_axnodes_zero))
df_axnodes_zero_50 = df_axnodes_zero.loc[(df_axnodes_zero['Age of Patient']>=50) & (df_axnodes_zero['Age of Patient']<=56)]
print('No. of patients in the age group 50-56 who have positive axillary nodes<=2: ', len(df_axnodes_zero_50))
Output:
No. of patients with positive axillary nodes<=2: 197
No. of patients in the age group 50-56 who have positive axillary nodes<=2: 40

So 20.30% (40 of 197) of all the patients with positive axillary nodes less than or equal to two fall in the age group of 50–56.
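
The percentage follows directly from the two counts in the output. As a small sketch, the filter and the ratio can be wrapped in one function; share_in_age_band is a hypothetical name, and the toy DataFrame only borrows the column names used in this article.

```python
import pandas as pd


def share_in_age_band(df, max_nodes, age_low, age_high):
    """Return (count in age band, count with nodes <= max_nodes, percentage)."""
    low_nodes = df.loc[df["Positive Axillary Nodes"] <= max_nodes]
    # between() includes both end points, matching the >= and <= filters
    in_band = low_nodes.loc[low_nodes["Age of Patient"].between(age_low, age_high)]
    if len(low_nodes) == 0:
        return 0, 0, 0.0
    return len(in_band), len(low_nodes), 100 * len(in_band) / len(low_nodes)


# Arithmetic check using the counts reported in the output
reported_share = round(100 * 40 / 197, 2)
print(reported_share)

# Made-up rows for illustration only, not real patient records
toy = pd.DataFrame({
    "Age of Patient": [45, 50, 53, 56, 60, 52],
    "Positive Axillary Nodes": [0, 1, 2, 0, 1, 10],
})
toy_result = share_in_age_band(toy, max_nodes=2, age_low=50, age_high=56)
print(toy_result)
```

Calling share_in_age_band(haberman, 2, 50, 56) should reproduce the counts 40 and 197 if the DataFrame is loaded as described earlier.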

Let’s summarize the important observations we made during Exploratory Data Analysis.

Conclusions:

  1. The majority of patients in the age group 30–40 have survived breast cancer.
  2. The majority of the patients who have undergone operations during the years 1961 and 1968 have survived 5 years or longer after the operation.
  3. The presence of positive axillary nodes (lymph node involvement) can be an obvious manifestation of breast cancer. In general, the survival chances of a breast cancer patient decrease as the number of positive axillary nodes increases.
  4. Patients with zero positive axillary nodes have a much higher chance of survival than patients with one or more positive axillary nodes.
  5. A few patients with a large number of positive axillary nodes survived, and a few patients with no positive axillary nodes died. So the absence of positive axillary nodes is no foolproof assurance of survival.
  6. Only a small number of patients have more than 25 positive axillary nodes.
  7. So based on Exploratory Data Analysis, we can propose a hypothesis about the survival chances of a breast cancer patient.
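
A hypothesis like this can be given a first sanity check by comparing survival rates across node-count groups. The sketch below is one possible way to do that with pandas. The function survival_rate_by_nodes, the bucket edges and the toy rows are all assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd


def survival_rate_by_nodes(df,
                           bins=(-1, 0, 3, 25, float("inf")),
                           labels=("0", "1-3", "4-25", ">25")):
    """Share of 'Survived' rows and row count per node-count bucket.

    Bucket edges are arbitrary illustrative choices (right-inclusive intervals).
    """
    bucket = pd.cut(df["Positive Axillary Nodes"], bins=list(bins), labels=list(labels))
    survived = df["Survival Status"].eq("Survived")
    return survived.groupby(bucket, observed=False).agg(["mean", "count"])


# Made-up rows for illustration only, not real patient records
toy = pd.DataFrame({
    "Positive Axillary Nodes": [0, 0, 0, 1, 2, 5, 10, 30],
    "Survival Status": ["Survived", "Survived", "Died", "Survived",
                        "Survived", "Died", "Died", "Survived"],
})
rates = survival_rate_by_nodes(toy)
print(rates)
```

On the real haberman DataFrame, a falling "mean" column from the first bucket to the last would support the conclusion that survival chances drop as the node count rises, while small "count" values would flag buckets with too few patients to judge.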

References: