Original article was published on Artificial Intelligence on Medium
Loading Different Data Files in Python
From Excel, Pickle, Parquet, Stata, MATLAB, HDF5, and Sas7bdat files
Don’t we all just love pd.read_csv()… It’s probably the most endearing line from our dear pandas library.
No doubt, when it comes to tabular data files, CSVs are the ‘coolest cats in the pack’.
Truth is, data files come in different types and extensions, and as Data Professionals, we must handle these as they come.
ETL (Extract, Transform, Load) is at the heart of mainstream data processing. So today we shall explore loading a few varying data types in Python.
First on our list is importing data in NumPy:
Table of Contents
- Importing Data in Numpy
- Importing Data from Excel Files
- Importing Data from SAS Files
- Importing Data from STATA Files
- Importing Data from HDF5 Files
- Importing Data from MATLAB Files
- Importing Data from Pickle Files
- Importing Data from Parquet Files
Importing Data in Numpy:
NumPy is ideal if all data in the given flat file are numerical, or if we intend to import only the numerical features. Hence its name, NumPy (Numerical Python).
But why Numpy?
NumPy arrays are the Python standard for storing numerical data. They are efficient, fast, and clean. NumPy provides the core multidimensional array object necessary for most tasks in scientific computing in Python, which is why it sits at the root of the SciPy stack.
Let’s explore the NumPy functions for loading data.
- numpy.loadtxt() function:
This function is best suited for pure numerical data. Let’s put it to work and import the titanic dataset with the np.loadtxt() function, selecting only the numeric rows and columns.
Passing skiprows=1, we skip the column headers, and passing usecols=[0, 1, 4], we select only columns 0, 1, and 4, which are all numeric.
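To make this concrete, here is a minimal, self-contained sketch. The file name and rows below are invented stand-ins for the real titanic.csv, so the example runs anywhere:

```python
import numpy as np

# Write a tiny stand-in CSV so the example is self-contained
# (mini_titanic.csv and its rows are invented; the real titanic.csv is larger).
with open('mini_titanic.csv', 'w') as f:
    f.write('Survived,Pclass,Name,Sex,Age\n')
    f.write('0,3,Braund,male,22\n')
    f.write('1,1,Cumings,female,38\n')

# skiprows=1 skips the header row; usecols keeps only the numeric
# columns 0 (Survived), 1 (Pclass), and 4 (Age).
data = np.loadtxt('mini_titanic.csv', delimiter=',', skiprows=1, usecols=[0, 1, 4])
print(data.shape)  # (2, 3)
```

Because the string columns are never touched, np.loadtxt() happily converts the selected columns to floats.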
Note that NumPy can also handle data of mixed types, as we’ll see next…
2. numpy.genfromtxt() function:
This NumPy function can better handle mixed data types. Passing the argument dtype=None, it will figure out the right data type for each column.
The downside is that it returns a 1D array where each element is a row of the imported flat file. Let’s import the same data using np.genfromtxt().
So we pass the dataset, then we pass delimiter=',', telling NumPy it’s a comma-separated-value file. Then we pass names=True, telling NumPy the dataset has column headers. Finally, we pass dtype=None and encoding='utf8'. The former allows NumPy to infer the data types by itself, and the latter enforces 'utf8' encoding to avoid conversion issues in case bytes get corrupted or lost during the conversion.
So this 1D structured array allows us to index into the array by column name. For example, if we want to see the first 10 fare costs in the titanic dataset from this array, we can simply do:

data['Fare'][:10]
# And the output is:
array([ 7.25 , 71.2833, 7.925 , 53.1 , 8.05 , 8.4583, 51.8625, 21.075 , 11.1333, 30.0708])
Still, it’s not best practice to pass data of mixed types to NumPy. Put mixed data types in a pandas DataFrame instead.
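Here is a small self-contained sketch of np.genfromtxt() on mixed-type data; the file name and rows are invented stand-ins for the titanic dataset:

```python
import numpy as np

# A tiny stand-in CSV with mixed types (invented rows; real titanic.csv is larger).
with open('mini_titanic.csv', 'w') as f:
    f.write('Survived,Name,Fare\n')
    f.write('0,Braund,7.25\n')
    f.write('1,Cumings,71.2833\n')

# dtype=None lets NumPy infer a type per column;
# names=True reads the header so columns can be indexed by name.
data = np.genfromtxt('mini_titanic.csv', delimiter=',', names=True,
                     dtype=None, encoding='utf8')

print(data.shape)    # (2,) -- a 1-D structured array, one element per row
print(data['Fare'])  # index a column by its name
```

Each element of the structured array is one row, and each named field keeps its own inferred type (int, str, float).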
3. numpy.recfromcsv() function:
This function behaves exactly like the np.genfromtxt() function, except that it sets the common parameters by default. Meaning, we don’t need to pass names, delimiter, or dtype.
# np.recfromcsv() has the defaults delimiter=',' and names=True, in addition to dtype=None.
# So we only pass it the file to read.
data = np.recfromcsv('titanic.csv')
# This returns the exact same array we saw from np.genfromtxt().
Importing Data From Excel Files:
Excel files are a little tricky, in that they may contain several sheets. So it’s ideal to parse the data and select relevant sheets.
From the Git-gist above, we take the battle_link Excel file and load it as a pandas ExcelFile object, using the `pd.ExcelFile()` function. Then we print out the sheet names with `print(xls.sheet_names)` and get a list of the sheet names: ['2002', '2004'].
This tells us there are just two sheets in this file.
- The parse() method:
We can select either of the sheets by name or by index, using the ExcelFile method parse(). Let’s load a sheet by its name…
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
df1.head()
Next, let’s load the other sheet by its index…
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
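Since the battle_link workbook isn’t reproduced here, the sketch below first builds a small two-sheet Excel file (an invented stand-in) and then walks through the same pd.ExcelFile() / parse() steps. Writing .xlsx assumes the openpyxl engine is installed:

```python
import pandas as pd

# Build a small two-sheet workbook so the example is self-contained
# (battles.xlsx and its contents are invented stand-ins for the battle_link file).
with pd.ExcelWriter('battles.xlsx') as writer:
    pd.DataFrame({'Country': ['A', 'B'], 'Deaths': [10, 20]}).to_excel(
        writer, sheet_name='2002', index=False)
    pd.DataFrame({'Country': ['A', 'B'], 'Deaths': [30, 40]}).to_excel(
        writer, sheet_name='2004', index=False)

xls = pd.ExcelFile('battles.xlsx')
print(xls.sheet_names)   # ['2002', '2004']

df1 = xls.parse('2004')  # select a sheet by name
df2 = xls.parse(0)       # or by index
print(df2.head())
```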
Customizing your Excel spreadsheet import in pandas:
Just as we did in NumPy, we can use additional arguments to skip rows, rename columns, and select only particular columns with Excel files too.
As before, we’ll use the parse() method. This time, however, we’ll add the additional arguments skiprows, names, and usecols. Note that the values passed to these parameters must be of type list.
A. Let’s parse the first sheet (‘2002’ or index 0) and rename its columns
# Parse the first sheet and rename all the columns (skiprows=[0] skips the header row):
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
# Print the head of the DataFrame df1
print(df1.head())
B. Let’s parse only the first column of the second sheet and also rename it.
# Parse the first column of the second sheet and rename the column:
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
2. pd.read_excel() function:
This function is more popular for loading Excel files. Instead of first parsing the data, pandas reads the Excel data directly into a DataFrame object. And like parse(), we can fully customize the import with the same arguments. With this function, though, we don’t have to restrict all arguments to lists: we can pass a string to sheet_name and an int to skiprows, while usecols still takes a list.
For example, let’s select only the first column from the sheet named ‘2004’, skip the first row, and rename the column to ‘War_Country’… in one line of code.
data_2 = pd.read_excel(battle_link, sheet_name='2004', usecols=[0], skiprows=1, names=['War_Country'])
data_2.head()
Importing Data From SAS Files:
SAS stands for Statistical Analysis System. SAS is a software suite that can mine, alter, manage, and perform statistical analysis on data using the SAS programming language. SAS is a trusted analytical powerhouse with over 40 years of analytics innovation. SAS data files have the extension .sas7bdat.
So let’s import a SAS file in pandas.
From the Git-gist above, we import pandas as pd. Then we install the sas7bdat (all lower case) package, which reads the default SAS dataset file format. From this package, we import SAS7BDAT (all upper case), a context manager we can use to read in the sales.sas7bdat file and bind it to a pandas DataFrame. Within the context manager, we read in the sales.sas7bdat file as files and bind it to a pandas DataFrame using `df_sas = files.to_data_frame()`.
Let’s see a histogram of the P-values feature of the SAS DataFrame.
plt.title('Histogram of P-Values')
Importing Data From Stata Files:
Stata is a contraction of Statistics and Data. Stata files are popular in academic social science research and have the extension .dta.
The good thing is that pandas has a built-in function for reading Stata files, so we don’t need a context manager as we did for SAS files. Let’s import a Stata file that contains the extent of infectious diseases in certain countries…
So we simply download the dataset using the wget command, then create a DataFrame from it by passing it to the pandas pd.read_stata() function. Let’s see the first five rows and a histogram of one of the features.
We can customize the import by passing arguments to several of this function’s parameters, for example:

pd.read_stata(filepath, index_col=None, convert_missing=False, preserve_dtypes=True, chunksize=None, iterator=False)
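As a self-contained sketch, we can round-trip a DataFrame through the .dta format with pandas alone. The file name and data below are invented stand-ins for the disease dataset:

```python
import pandas as pd

# Round-trip a small invented DataFrame through the Stata .dta format.
df = pd.DataFrame({'country': ['Ghana', 'Kenya'], 'disease_rate': [0.5, 0.7]})
df.to_stata('disease.dta', write_index=False)

# pandas reads .dta files directly -- no context manager needed.
stata_df = pd.read_stata('disease.dta')
print(stata_df.head())
```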
Importing Data From HDF5 Files:
In the Python world, a consensus is rapidly converging on Hierarchical Data Format version 5, aka HDF5, as the standard mechanism for storing large quantities of numerical data. It’s now common to deal with datasets hundreds of gigabytes or terabytes in size, and HDF5 itself can scale up to exabytes of data.
What is the structure of HDF5 files?
We can explore its hierarchical structure just like Python dictionaries using the method keys().
for key in data.keys():
    print(key)
This would return the existing keys in the dataset. Each of these is an HDF group; they are like directories. Note that the HDF5 project is actively maintained by the HDF Group, based in Champaign, Illinois, and formerly part of the University of Illinois Urbana-Champaign.
Let’s load the LIGO HDF5 dataset. LIGO stands for the Laser Interferometer Gravitational-Wave Observatory project, which led to the validation of Einstein’s theory of gravitational waves.
Using h5py to import HDF5 Files.
The h5py package is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data and easily manipulate that data.
So, from the above Git-gist, we downloaded the LIGO dataset to our workspace. Then, using the h5py context manager, we read the LIGO data into an h5py.File object named data. This object has a tree-branch structure of nested groups, much like nested Python dictionaries: a parent key holds values that may, in turn, contain other nested groups of compressed data.
For example, let’s explore the ‘strain’ key of the data object. We can see it’s an h5py Group object; does it have any sub-keys?
# So it has a 'Strain' sub-key...
Now let’s extract the value in the ‘Strain’ sub-key and save it to an array:

strain = data['strain']['Strain'][:]  # note: .value was removed in h5py 3.0; slicing returns a NumPy array
print(type(strain))
Let’s see the shape of the array:

print(strain.shape)
# It has 131072 elements...
Finally, let’s plot the first 10000 elements of
strain, like time-series data…
import numpy as np
import matplotlib.pyplot as plt

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.show()
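The LIGO file itself is large, so here is a minimal round-trip sketch that creates a small HDF5 file with the same 'strain'/'Strain' layout and reads it back with h5py (the file name and values are invented):

```python
import numpy as np
import h5py

# Create a small HDF5 file that mimics the LIGO file's 'strain'/'Strain'
# layout (mini_ligo.hdf5 and its values are invented).
with h5py.File('mini_ligo.hdf5', 'w') as f:
    grp = f.create_group('strain')
    grp.create_dataset('Strain', data=np.arange(5, dtype='float64'))

# Read it back: groups nest like dictionaries, and slicing a dataset
# with [:] returns a NumPy array.
with h5py.File('mini_ligo.hdf5', 'r') as data:
    print(list(data.keys()))              # ['strain']
    strain = data['strain']['Strain'][:]
print(strain)
```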
Importing Data From MATLAB Files:
MATLAB needs no introduction. It’s the brainchild of Professor Cleve Moler, an expert in the Fortran programming language. As a professor of math and computer science at The University of New Mexico, he quickly discovered how syntactically heavy and difficult Fortran was for students, so he created MATLAB (a contraction of Matrix Laboratory), and the rest is history.
How to import a MATLAB file in Python…
MATLAB files have the extension .mat. Let’s import a MATLAB file from the Albeck lab at UC Davis; it contains gene expression data.
We shall use the Python SciPy library to read the MATLAB file. So, we download the 'ja_data2.mat' file via wget, and then, using the scipy.io.loadmat() function, we load it into a variable named mat.

Guess what… this variable, mat, is just a Python dictionary. So, once again, can we see how powerful dictionaries are?
We have ourselves another tree-like data structure of nested dictionaries, just as we did with the HDF5 file.
Let’s explore the mat object, starting with its keys.

# It's a dictionary, so let's see the keys
for key in mat.keys():
    print(key)
Let’s see what value the last key above, CYratioCyt, holds…

# Let's see what type of object mat['CYratioCyt'] is
print(type(mat['CYratioCyt']))
It contains a numpy array. Splendid! So we can check its shape and even make a DataFrame out of it and explore as normal. Let’s do that…
# Since it's a numpy array, let's see its shape
print(mat['CYratioCyt'].shape)  # (200, 137)

# Next, let's make a DataFrame out of it
import pandas as pd
Cy_df = pd.DataFrame(mat['CYratioCyt'])
# Let's see the first five rows
Cy_df.head()
Finally, let’s subset the array and plot the 5th to the last columns of the 25th row.

data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
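Since the Albeck lab file isn’t bundled here, this sketch round-trips an invented array through a .mat file with scipy.io to show the same mechanics (file name, key, and values are stand-ins):

```python
import numpy as np
import scipy.io

# Round-trip an invented array through a .mat file
# ('CYratioCyt' mimics the key in the Albeck lab file).
scipy.io.savemat('mini.mat', {'CYratioCyt': np.ones((3, 4))})

mat = scipy.io.loadmat('mini.mat')
print(type(mat))                # loadmat returns a plain Python dict
print(mat['CYratioCyt'].shape)  # (3, 4)
```

Note that loadmat also adds bookkeeping keys such as '__header__' and '__version__' alongside the saved variables.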
Importing Data From Pickle Files:
So what is pickling?
“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
In some programming languages, it’s referred to as serialization, marshalling or flattening. But in Python, we call it pickling.
We can save Python lists, dictionaries, trained-machine-learning models, even data sets, and pretty much any Python object to pickle.
For example, it makes a lot of sense to pickle very large datasets instead of re-parsing them from text each time we run an analysis. Loading a pickled object is far faster, often by orders of magnitude depending on the size.
So How do we Load Pickle Files in Python…
- First, let’s pickle the ratings.csv dataset from GroupLens. This dataset has over twenty-two million rows of movie ratings.
From the above Git-gist, we downloaded the ratings.csv file via wget. Then we opened a binary file called ratings.pickle to hold the pickle byte stream. Finally, we dumped the ratings.csv contents into ratings.pickle using the pickle.dump() function, then closed the connection. Now we can see the ratings.pickle file saved in our workspace.
2. Next, let’s unpickle the ratings.pickle object and load it into a pandas DataFrame…

Here, we read the ratings.pickle binary file as f. Then we use the pickle.load() function to unpickle f into a variable named ratings_data, which holds the original ratings.csv contents again.

So we’ve basically pickled the ratings.csv file into the ratings.pickle object, and unpickled ratings.pickle back into the original data, saved in the variable ratings_data.
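The full ratings.csv is over 600 MB, so here is the same dump/load cycle on a small invented object; the mechanics with pickle.dump() and pickle.load() are identical:

```python
import pickle

# A small invented object standing in for the full ratings data.
ratings = {'movieId': [1, 2, 3], 'rating': [4.0, 3.5, 5.0]}

# Pickle: serialize the object to a byte stream on disk.
with open('ratings.pickle', 'wb') as f:
    pickle.dump(ratings, f)

# Unpickle: read the byte stream back into an equivalent object.
with open('ratings.pickle', 'rb') as f:
    ratings_data = pickle.load(f)

print(ratings_data == ratings)  # True
```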
The real magic of Pickling…
Let’s look a little closer at the pickling exercise by comparing file sizes: ratings.pickle takes up only 21 bytes, while ratings.csv has over 620 million bytes. But don’t be fooled by that number: a 21-byte pickle cannot hold twenty-two million ratings. A pickle that small contains only a trivial object, such as a file path or handle, so be sure to dump the loaded data itself rather than a reference to the file. Pickling the actual dataset produces a file comparable in size to the CSV; the real win is much faster loading.
Do note that pickle files also have the extension .pkl in addition to .pickle; the former was common in Python 2.
Importing Data From Parquet Files:
Parquet files have big-data processing written all over them. In fact, the first time I handled one was while working on a project in PySpark.
So why Parquet Files?
Basically, Parquet is a columnar file format whereas CSV is row-based. Columnar file formats are more efficient for most analytical queries. You can speed up a lot of your Pandas DataFrame queries by converting your CSV files and working off of Parquet files. Plus, Parquet files are more memory-efficient than CSV files.
Ok, let’s create a Parquet file object from the
ratings.csv file we’ve been using.
Pandas to the rescue! We define a simple method that takes a file name (a string), creates a DataFrame (df) from it, and converts that to a Parquet file using the pandas df.to_parquet() method. The method saves the Parquet file to the current working directory.
So we passed the ‘ratings.csv’ file to the method and created ratings.parquet. Next, we compared the sizes of the two files, and just as expected, ratings.csv is over four times the size of ratings.parquet.
Loading a Parquet file in Pandas…
Finally, we simply pass ‘ratings.parquet’ as an argument to the pd.read_parquet() function. This creates a DataFrame object. Note that this function has its own set of parameters we can customize; see the pandas documentation.
parquet_df = pd.read_parquet('/content/ratings.parquet')
This has been a pretty detailed article, and some of the concepts may be relatively new. It’s best to save it in your reading list for future reference.
We’ve seen how to read data from various formats: MATLAB, SAS, Stata, pickle, Parquet, Excel, HDF5, and NumPy-friendly flat files. We’ve explored various functions from pandas, SciPy, and core Python modules, and most importantly, I’m glad I’ve been able to show how, as Data Professionals, we can handle virtually all kinds of data files.
See the Git-Repo containing the Notebook for this article here. It also contains a tutorial on ‘working with relational databases in python’.
Finally, feel free to check out the course ‘Intro to Importing Data in Python’.