Exploratory Data Analysis Tools

Original article was published by Karteek Menda on Deep Learning on Medium


Pandas-Profiling, Sweetviz, D-Tale

Hello Aliens…..

In this article I will explain what EDA is and introduce some packages that do a pretty good job of EDA in a few lines of code, which saves time.

Exploratory Data Analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

In general, most of a project's time is spent on cleaning the data and doing EDA. So, what if we had some ready-made packages that reduce the work and save time? A few such packages come to the rescue of a data scientist. In this article I will take you through three of them, which are very popular and have proved useful.

1. Pandas-Profiling:

It is an open-source Python module that generates profile reports from a pandas DataFrame. The pandas describe() function is great but a little basic for serious exploratory data analysis. Pandas-Profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
  • File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
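To make a few of the items in this list concrete, here is a minimal sketch that computes some of them by hand for a single numeric column, using only the Python standard library. It is not how Pandas-Profiling computes them internally; the function name summarize_column and the sample data are made up for illustration, and I assume None marks a missing value and that the column mean is not zero.

```python
import statistics
from collections import Counter


def summarize_column(values):
    """Return a few of the statistics listed above for one numeric column.

    `values` is a list of numbers in which None marks a missing value.
    (Hypothetical helper, written only for illustration.)
    """
    present = sorted(v for v in values if v is not None)
    # "inclusive" quartiles interpolate linearly between data points.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    mean = statistics.mean(present)
    std = statistics.stdev(present)  # sample standard deviation
    return {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "minimum": present[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": present[-1],
        "range": present[-1] - present[0],
        "iqr": q3 - q1,  # interquartile range
        "mean": mean,
        "std": std,
        "cv": std / mean,  # coefficient of variation, assumes mean != 0
        # median absolute deviation around the median
        "mad": statistics.median(abs(v - median) for v in present),
        # (value, count) of the most frequent value
        "most_frequent": Counter(present).most_common(1)[0],
    }


ages = [2, 4, None, 4, 5, 7, 9, 10, 12]  # made-up sample column
summary = summarize_column(ages)
print(summary)
```

A profiling report gives you all of this for every column at once, plus the plots, which is exactly the time saving this article is about.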

You can install it using pip.

pip install pandas-profiling

To generate the report, run:

from pandas_profiling import ProfileReport
# your_dataset_name is the pandas DataFrame you want to profile
profile = ProfileReport(your_dataset_name, title="profile report")

The report can be viewed through two interfaces: notebook widgets or an HTML report.

# If you want the report as widgets
profile.to_widgets()
# If you want the HTML report
profile.to_file("your_report.html")

The report will be generated after this. See below for the generated report.
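Putting the steps of this section together, a small wrapper might look like the sketch below. The function name profile_to_html and its defaults are my own and hypothetical. The minimal flag is, as far as I understand the library's documentation, an option that switches off the more expensive computations, which can help on large datasets. The import sits inside the function so that the file still loads when the package is not installed.

```python
from pathlib import Path


def profile_to_html(df, out_path="your_report.html", title="profile report",
                    minimal=False):
    """Write a Pandas-Profiling HTML report for `df` and return its path.

    Hypothetical helper for illustration; `df` is expected to be a
    pandas DataFrame.
    """
    # Lazy import: only needed when a report is actually generated.
    # Assumption: the package is installed under the name used in this
    # article (newer releases are published as ydata-profiling, in which
    # case the import line would need to change).
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title=title, minimal=minimal)
    profile.to_file(str(out_path))
    return Path(out_path)
```

You would call it as profile_to_html(your_dataset_name) and then open the returned file in a browser.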

2. Sweetviz:

Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. Output is a fully self-contained HTML application.

It requires Python >=3.6. It is very simple and requires only two lines of code.

Install the package using pip.

pip install sweetviz

To analyze a single dataframe, simply use the analyze() function, then the show_html() function:

import sweetviz as sv

my_report = sv.analyze(my_dataframe)
my_report.show_html()  # Default arguments will generate to "SWEETVIZ_REPORT.html"

The report will be generated (see below).

Retrieved from here.

This is an absolutely beautiful package that can deal with huge datasets, and you can explore much more about it over here.
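If you want a bit more control than the defaults, the two calls from this section can be wrapped as in the sketch below. The function name sweetviz_report is hypothetical. I am assuming the target_feat argument of analyze() and the filepath and open_browser arguments of show_html() behave as described in the Sweetviz README, so please check them against the version you have installed.

```python
def sweetviz_report(df, out_path="SWEETVIZ_REPORT.html", target=None,
                    open_browser=False):
    """Analyze `df` with Sweetviz and write the HTML report to `out_path`.

    Hypothetical helper for illustration; `target` is the optional name
    of the column you want every other feature compared against.
    """
    # Lazy import so this file loads even without Sweetviz installed.
    import sweetviz as sv

    report = sv.analyze(df, target_feat=target)
    # open_browser=False only writes the file instead of opening a tab.
    report.show_html(filepath=out_path, open_browser=open_browser)
    return out_path
```

For example, sweetviz_report(my_dataframe, "my_report.html") would write the report without opening a browser tab, which is handy on a remote machine.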

3. D-Tale:

D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view & analyze Pandas data structures. It integrates seamlessly with ipython notebooks & python/ipython terminals. Currently this tool supports such Pandas objects as DataFrame, Series, MultiIndex, DatetimeIndex & RangeIndex. It produces a very detailed report and is very simple to use.

You can install the package using pip.

pip install dtale

To see the entire report, just run the code below.

import dtale

report = dtale.show(your_data_frame)
report

A very detailed report can be obtained from this.

A very detailed report from D-Tale

There is a lot to explore in this package, and it covers more dimensions of EDA than the other two. It is very cool and detailed.
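Outside a notebook, the object returned by dtale.show() does not render itself, so a small helper like the sketch below can be handy. The name start_dtale is hypothetical. I am assuming, based on the D-Tale README, that the returned instance offers open_browser() and kill() methods, so treat this as a starting point rather than the official way to do it.

```python
def start_dtale(df, open_in_browser=False):
    """Start a D-Tale session for `df` and return the running instance.

    Hypothetical helper for illustration; `df` is expected to be a
    pandas DataFrame (or another object D-Tale supports).
    """
    # Lazy import so this file loads even without D-Tale installed.
    import dtale

    instance = dtale.show(df)
    if open_in_browser:
        # Assumption: the instance exposes open_browser(), as described
        # in the D-Tale README.
        instance.open_browser()
    return instance
```

When you are done exploring, calling kill() on the returned instance should shut the session down again.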

Personally I like all three, but in particular I would go with D-Tale. For more information on D-Tale, refer to this.

I would recommend that all readers go through the documentation of all three packages, as they are simply superb for EDA.

Happy Learning…………

Thanks for reading the article! If you liked it, do 👏 it. If you want to connect with me on LinkedIn, please click here.

This is Karteek Menda.

Signing Off