Loading Different Data Files in Python


From Excel, Pickle, Parquet, Stata, MATLAB, HDF5, and SAS (.sas7bdat) files


Don't we all just love pd.read_csv()… It's probably the most endearing line from our dear pandas library.

No doubt, when it comes to tabular data files, CSVs are the 'coolest cats in the pack'.

Truth is, data files come in different types and extensions, and as Data Professionals, we must handle these as they come.

ETL (Extract, Transform, Load) is at the heart of mainstream data processing. So today we shall explore loading a few varying data types in Python.

First on our list is importing data in NumPy:

Table of Contents

  1. Importing Data in NumPy
  2. Importing Data from Excel Files
  3. Importing Data from SAS Files
  4. Importing Data from Stata Files
  5. Importing Data from HDF5 Files
  6. Importing Data from MATLAB Files
  7. Importing Data from Pickle Files
  8. Importing Data from Parquet Files
  9. Summary

Importing Data in NumPy:


NumPy is ideal if all the data in a given flat file are numerical, or if we intend to import only the numerical features. Hence its name, NumPy (Numerical Python).

But why Numpy?

NumPy arrays are the Python standard for storing numerical data. They are efficient, fast, and clean. NumPy provides the core multidimensional array object that is necessary for most tasks in scientific computing in Python. This is why it's at the root of the SciPy stack.

Let’s explore the NumPy functions for loading data.

  1. numpy.loadtxt() function:

This function is best suited for purely numerical data. Let's put it to work. First, let's import the titanic dataset without np.loadtxt(), just to see what it looks like.

The first five rows of the titanic_df

Now, let’s import it with the np.loadtxt() function, and select only numeric rows and columns.

By passing skiprows=1, we skip the row of column headers, and by passing usecols=[0, 1, 4], we select only columns 0, 1, and 4, which are all numeric.

So we get a good, solid numpy array without the headers and just 3 columns.
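Since the gist isn't shown inline, here's a minimal sketch of that call. It assumes titanic.csv sits in the working directory, is comma-delimited, and that columns 0, 1, and 4 hold plain numeric values (no quoted fields with embedded commas):

import numpy as np

# Skip the header row and keep only the three numeric columns
data = np.loadtxt('titanic.csv', delimiter=',', skiprows=1, usecols=[0, 1, 4])

print(type(data))   # <class 'numpy.ndarray'>
print(data[:5])     # first five rows of the selected columns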

Note that NumPy can also handle data of mixed types, as we'll see next…

2. numpy.genfromtxt() function:

This NumPy function handles mixed data types better. By passing the argument dtype=None, it will figure out the right data type for each column.

The downside is that it returns a 1D structured array where each element is a row of the imported flat file. Let's import the same data using np.genfromtxt().

So we pass the dataset, then we pass delimiter=',' telling NumPy it's a comma-separated-value file. Then we pass names=True telling NumPy the dataset has column headers.

Finally, we pass dtype=None and encoding='utf8'. The former allows NumPy to infer the data types by itself, and the latter enforces 'utf8' encoding to avoid conversion issues in case bytes get corrupted or lost during the conversion. See link
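As a sketch (again assuming titanic.csv is in the working directory), the call looks something like this:

import numpy as np

# Let NumPy infer a dtype for each column and pick up the header names
data = np.genfromtxt('titanic.csv', delimiter=',', names=True,
                     dtype=None, encoding='utf8')

print(data.shape)        # a 1D structured array: one element per row
print(data.dtype.names)  # the column headers picked up via names=True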

So this 1D array allows us to index into the array by calling the names of the columns. For example, if we want to see the first 10 fare costs on the titanic data set from this array we can simply do:

data['fare'][:10]
>>
# And the output is:
array([ 7.25  , 71.2833, 7.925 , 53.1  , 8.05  , 8.4583, 51.8625, 21.075 , 11.1333, 30.0708])

It's not best practice to pass data of mixed types to NumPy, though. Put mixed data types in a pandas DataFrame instead.

3. numpy.recfromcsv() function:

This function behaves just like np.genfromtxt(), except that it sets those parameters by default. Meaning, we don't need to pass the names, delimiter, or dtype arguments.

# np.recfromcsv() has the defaults delimiter=',' and names=True,
# in addition to dtype=None, so we only pass it the file to read.
data = np.recfromcsv('titanic.csv')
# This returns the exact same array we saw from np.genfromtxt().

Importing Data From Excel Files:


Excel files are a little tricky, in that they may contain several sheets. So it’s ideal to parse the data and select relevant sheets.

From the Git-gist above, we take the battle_link Excel file and load it as a pandas ExcelFile object, using the `pd.ExcelFile()` function. Then we print out the sheet names with `print(xls.sheet_names)`, and we get a list of the sheet names: ['2002', '2004'].

This tells us there are just two sheets in this file.
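The gist itself isn't reproduced here, but the steps amount to the sketch below; battle_link is assumed to point at the downloaded workbook (the file name shown is hypothetical):

import pandas as pd

battle_link = 'battledeath.xlsx'   # hypothetical name for the downloaded workbook

# Load the workbook as an ExcelFile object
xls = pd.ExcelFile(battle_link)

# List the sheets it contains
print(xls.sheet_names)
# >> ['2002', '2004']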

  1. The ExcelFile.parse() method:

We can select either of the sheets by name or index, using the method parse(). Let’s load a sheet by its name…

# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
df1.head()
>>
Loading an Excel sheet by sheet name into pandas.

Next, let’s load the other sheet by its index…

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
>>
Loading an Excel sheet by sheet index into pandas.

Customizing your Excel spreadsheet import in pandas:

Just as we did in NumPy, we can use additional arguments to skip rows, rename columns, and select only particular columns with Excel files too.

As before, we'll use the parse() method. This time, however, we'll add the additional arguments skiprows, names, and usecols. Note that the values passed to these parameters must be lists.

A. Let’s parse the first sheet (‘2002’ or index 0) and rename its columns

# Parse the first sheet and rename all the columns:
df1 = xls.parse(0, skiprows=[1], names=['Country', 'AAM due to War (2002)'])
# Print the head of the DataFrame df1
print(df1.head())
>>
Applying skiprows and names parameters to customise Excel imports.

B. Let’s parse only the first column of the second sheet and also rename it.

# Parse the first column of the second sheet and rename the column:
df2 = xls.parse(1, usecols=[0], skiprows=[1], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
>>
Applying usecols, skiprows and names parameters to customise Excel imports.

2. pd.read_excel() function:

This function is more popular for loading Excel files. Instead of first parsing the data, pandas reads the Excel data directly into a DataFrame object. And like the parse() method, pd.read_excel() lets us fully customize imports.

With this function, we don't have to restrict all arguments to lists. We can pass a string to sheet_name and an int to skiprows, but names and usecols must still be lists.

For example, let's select only the first column from the sheet named '2004', skip the first row, and rename the column to 'War_Country'… in one line of code.

data_2 = pd.read_excel(battle_link, sheet_name='2004', usecols=[0], skiprows=0, names=['War_Country'])
data_2.head()
>>
And here’s the output of the one-liner…

Importing Data From SAS Files:


SAS stands for Statistical Analysis System. SAS is a software suite that can mine, alter, manage, and perform statistical analysis on data using the SAS programming language. SAS is a trusted analytical powerhouse with over 40 years of analytics innovation. SAS data files have the extension .sas7bdat.

So let’s import a SAS file in pandas.

From the Git-gist above, we import pandas as pd. Then we install the sas7bdat (all lowercase) package, a Python reader for SAS's native dataset file format.

From this package, we import SAS7BDAT (all uppercase), a context manager we can use to read the sales.sas7bdat file and bind it to a pandas DataFrame.

Note that within the context manager, we read in the sales.sas7bdat file as files and bind it to a pandas DataFrame using the .to_data_frame() method… `df_sas = files.to_data_frame()`.
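A minimal sketch of those steps, assuming the sas7bdat package is installed (pip install sas7bdat) and sales.sas7bdat sits in the working directory:

import pandas as pd
from sas7bdat import SAS7BDAT

# SAS7BDAT is a context manager; read the file and bind it to a DataFrame
with SAS7BDAT('sales.sas7bdat') as files:
    df_sas = files.to_data_frame()

print(df_sas.head())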

First five rows of the sales.sas7bdat file, in a pandas Data Frame.

Let’s see a Histogram of the P-values feature of the SAS Data Frame.

import matplotlib.pyplot as plt

df_sas.P.hist()
plt.ylabel('count')
plt.title('Histogram of P-Values')
plt.show()
Histogram of P-values from the SAS sales Data Frame.

Importing Data From Stata Files:


Stata is a contraction for Statistics and Data. Stata files are popular in academic social sciences research. Stata files have the extension .dta.

The good thing is that pandas has a built-in function for reading Stata files, so we don't need a context manager as we did for SAS files. Let's import a Stata file that contains the extent of infectious diseases in certain countries…

So we simply download the dataset using the wget command, then we create a DataFrame from it by passing it to the pandas read_stata() function.

Let’s see the first five rows and a histogram of the disa10 feature.
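Here's a sketch, assuming the downloaded Stata file is named disarea.dta (substitute whatever wget actually saved):

import pandas as pd
import matplotlib.pyplot as plt

# Read the Stata file straight into a DataFrame
df = pd.read_stata('disarea.dta')   # assumed file name

# First five rows
print(df.head())

# Histogram of the disa10 feature
df['disa10'].hist()
plt.xlabel('disa10')
plt.ylabel('count')
plt.show()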

First five rows of the data frame of the .dta file.
Histplot of disa10 feature.

We can customize the import by passing arguments to several parameters of this function, for example: pd.read_stata() accepts index_col, convert_missing, preserve_dtypes, chunksize, and iterator, among others. See Link

Importing Data From HDF5 Files:

HDF5 Data Structure

In the Python world, a consensus is rapidly converging on Hierarchical Data Format Version 5 AKA HDF5 as the standard mechanism for storing large quantities of numerical data. It’s now becoming common to deal with data sets hundreds of Gigabytes or Terabytes in size. HDF5 itself can scale up to Exabytes of data.

What is the structure of HDF5 files?

We can explore its hierarchical structure just like Python dictionaries using the method keys().

for key in data.keys():
    print(key)

This would return the existing keys in the data set. Each of these is an HDF group. They are like directories. Note that the HDF5 project is actively maintained by the HDF Group, based in Champaign, Illinois, and formerly part of the University of Illinois at Urbana-Champaign.

Let's load the LIGO HDF5 data set. LIGO stands for Laser Interferometer Gravitational-Wave Observatory.
This project led to the first direct detection of gravitational waves, validating Einstein's prediction.

Using h5py to import HDF5 Files.

The h5py package is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data, and easily manipulate that data. See link.

So from the above Git-gist, we downloaded the LIGO data set to our workspace. Then, using the h5py context manager, we read the LIGO data into an h5py.File object named data. This object has a tree-like structure of nested, dictionary-like groups. This means a parent key has values that may, in turn, contain other nested values of compressed data.
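The gist isn't shown here, but the read amounts to something like this sketch; the file name is a stand-in for whatever the wget download is called:

import h5py

filename = 'LIGO_data.hdf5'   # hypothetical name for the downloaded file

# Open the HDF5 file read-only (the original gist wraps this in a `with` block)
data = h5py.File(filename, 'r')

# The top-level keys are HDF groups; they behave like directories
for key in data.keys():
    print(key)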

For example, let’s explore the ‘strain’ key of the data object.

type(data['strain'])
>> h5py._hl.group.Group

So we can see it's an h5py Group object. Let's look: does it have any sub-keys?

data['strain'].keys()
>> 'Strain'
# So it has a 'Strain' sub-key...

Now let’s extract the value in the ‘Strain’ sub-key and save to an array

strain = data['strain']['Strain'].value   # in h5py 3.x, use data['strain']['Strain'][()] instead
print(type(strain))
>> numpy.ndarray

Let’s see the shape of the array

strain.shape
>>(131072,)
# It has 131072 elements...

Finally, let's plot the first 10000 elements of strain as time-series data…

import numpy as np
import matplotlib.pyplot as plt

# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.title('Time/Strain Chart')
plt.show()
Time-series plot of 10000 data points from LIGO HDF5 dataset.

Importing Data From MATLAB Files:


MATLAB needs no introduction. It is the brainchild of Professor Cleve Moler, an expert in the Fortran programming language. As a Professor of Mathematics and Computer Science at the University of New Mexico, he quickly discovered how syntactically heavy and difficult Fortran was for his students. So he created MATLAB (a contraction of Matrix Laboratory), which was commercialized in 1984, and the rest is history.

How to import a MATLAB file in Python…

MATLAB files have the extension .mat. Let's import a MATLAB file from the Albeck lab at UC Davis; it contains gene expression data.

We shall use the Python scipy library to read the MATLAB file. So, we download the `ja_data2.mat` file via wget and then, using the scipy.io.loadmat() function, we load it into a variable named mat.
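A minimal sketch of that step:

import scipy.io

# Read the downloaded MATLAB file into a Python dictionary
mat = scipy.io.loadmat('ja_data2.mat')

print(type(mat))
# >> <class 'dict'>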

Guess what… This variable, mat, is just a Python dictionary. So once again, can we see how powerful dictionaries are?

We have ourselves another tree-like data structure of nested dictionaries, just as we did with HDF5 files.

Let's explore the mat object, starting with its keys.

# It's a dictionary, so let's see the keys
for key in mat.keys():
    print(key)
>>
__header__
__version__
__globals__
rfpCyt
rfpNuc
cfpNuc
cfpCyt
yfpNuc
yfpCyt
CYratioCyt

Let's see what value the last key above (CYratioCyt) holds…

# Let's see what type of object the CYratioCyt value is
print(type(mat['CYratioCyt']))
>>
<class 'numpy.ndarray'>

It contains a NumPy array. Splendid! So we can check its shape, and even make a DataFrame out of it and explore it as normal. Let's do that…

# Since it's a numpy array, let's see its shape
print(mat['CYratioCyt'].shape)
>>
(200, 137)

# Next, let's make a DataFrame out of it
import pandas as pd
Cy_df = pd.DataFrame(mat['CYratioCyt'])
# Let's see the first five rows
Cy_df.head()
>>
Head of the Cy_df Data frame…

Finally, let's subset the CYratioCyt array and plot columns 5 through the end of row 25.

data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
The plot of a subset of the CyratioCyt data.

Importing Data From Pickle Files:


So what is pickling?

“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.

In some programming languages, it’s referred to as serialization, marshalling or flattening. But in Python, we call it pickling.

We can pickle Python lists, dictionaries, trained machine-learning models, even data sets, and pretty much any Python object.

For example, it often makes sense to pickle very large datasets instead of re-parsing the raw files each time we run an analysis on them. Loading a pickled object can be far faster than re-reading a CSV, often by one or more orders of magnitude depending on the size of the data… Link

So how do we load Pickle files in Python…

  1. First, let’s pickle the ratings.csv dataset from GroupLens. This dataset has over twenty-two million rows of data for movie ratings.

From the above Git-gist, we downloaded the ratings.csv file via wget. Then we created a pickle byte-stream file called ratings.pickle. Finally, we dump into the ratings.pickle object using the pickle.dump() function, then close the connection.

Now, we can see the ratings.pickle file saved in our workspace.
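Since the gist isn't reproduced here, this is one minimal sketch of the idea. Note that it pickles the full DataFrame rather than just a file name, which is the more useful variant:

import pickle
import pandas as pd

# Load the downloaded CSV into a DataFrame first
ratings_df = pd.read_csv('ratings.csv')

# Serialize the DataFrame to a byte stream on disk
with open('ratings.pickle', 'wb') as f:
    pickle.dump(ratings_df, f)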

2. Next, let's unpickle the ratings.pickle object and load it into a pandas DataFrame…

So, here, we open the ratings.pickle binary file as f. Then we use the pickle.load() function to unpickle f into a variable named ratings_data, which holds the original object we pickled.

So we've basically pickled the ratings.csv data into the ratings.pickle object, and unpickled ratings.pickle back into a Python object, saved as the variable ratings_data.
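Again as a sketch, assuming the DataFrame itself was pickled as above:

import pickle

# Read the binary pickle file and rebuild the original object
with open('ratings.pickle', 'rb') as f:
    ratings_data = pickle.load(f)

print(ratings_data.head())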

The first five rows of the pickle_df data frame

The real magic of Pickling…

Let’s look a little closer at the pickling exercises we did. Let’s compare the size of the ratings.csv file to the size of the ratings.pickle file.

In this run, ratings.pickle weighs in at only 21 bytes, while ratings.csv has over 620 million bytes. Be careful interpreting that, though: a 21-byte pickle cannot possibly hold twenty-two million rows of ratings; a pickle that small almost certainly contains only a small object such as the file name, not the data itself. Pickling the full DataFrame produces a file of broadly similar size to the CSV; the real win of pickling is much faster loading, not a dramatic reduction in size.

Do note that pickle files also have the extension .pkl in addition to .pickle; the former was common in Python 2. See link

Importing Data From Parquet Files:


The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop.

Parquet files have big-data processing written all over them. In fact, the first time I handled one was while working on a project in PySpark.

So why Parquet Files?

Basically, Parquet is a columnar file format whereas CSV is row-based. Columnar file formats are more efficient for most analytical queries. You can speed up a lot of your Pandas DataFrame queries by converting your CSV files and working off of Parquet files. Plus, Parquet files are more memory-efficient than CSV files.

Ok, let’s create a Parquet file object from the ratings.csv file we’ve been using.

Pandas to the rescue! We define a simple function that takes a file name (a string), creates a DataFrame (df), and converts it to a Parquet file using the DataFrame's to_parquet() method. The function saves the Parquet file to the current working directory; see the sketch below.

So we passed the ratings.csv file to the function and created ratings.parquet.
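A sketch of such a helper, assuming a Parquet engine like pyarrow or fastparquet is installed:

import pandas as pd

def csv_to_parquet(csv_name):
    # Read the CSV and write it back out in Parquet format
    df = pd.read_csv(csv_name)
    parquet_name = csv_name.replace('.csv', '.parquet')
    df.to_parquet(parquet_name)   # requires pyarrow or fastparquet
    return parquet_name

csv_to_parquet('ratings.csv')   # creates ratings.parquet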

Next, we compared the size of ratings.parquet to ratings.csv and just as expected, ratings.csv is over four times the size of ratings.parquet.

Loading a Parquet file in Pandas…

Finally, we simply pass ratings.parquet as an argument to the pd.read_parquet() function. This creates a DataFrame object. Note that this function has its own set of parameters that we can customize; see the pandas documentation.

parquet_df = pd.read_parquet('/content/ratings.parquet')

parquet_df.head()

The head of the parquet_df Data Frame.

Summary:

This has been a pretty detailed article, and some concepts may be relatively new. It's best to save it in your reading list for future reference.

We've seen how to read data from various formats like MATLAB, SAS, Stata, Pickle, Parquet, Excel, HDF5, and flat files via NumPy. We've explored various functions from pandas, SciPy, and the Python standard library, and most importantly, I'm glad to have shown how, as Data Professionals, we can handle virtually all kinds of data files.

See the Git-Repo containing the Notebook for this article here. It also contains a tutorial on ‘working with relational databases in python’.

Finally, feel free to check out the course 'Intro to Importing Data in Python' from DataCamp.

Cheers!