Original article was published on Artificial Intelligence on Medium
Processing a billion data points in milliseconds with Vaex
Often, as a data scientist or machine-learning engineer, the key to gaining more insight into a problem is examining the data. It's important to understand the problem before a solution can be determined, and to get the most insight it helps to jump into the data and analyze it from different angles. We are all familiar with tabular data spanning N dimensions with hundreds of thousands or millions of data points.
What is Vaex?
Vaex is a high-performance Python library for lazy out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid for more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).
Pandas DataFrame, a pretty familiar name, right? We know what it is and how useful it is to a data enthusiast like you. One major problem we encounter very often is the memory overhead of processing large datasets with millions of records and hundreds of features. This is where Vaex comes in: a solution that can handle over a billion data points with ease and run complex operations on top of them. Here, we'll discuss how to use Vaex in practice to handle a really huge dataset.
Vaex can be installed with either pip or conda:

$ pip install vaex
$ conda install -c conda-forge vaex
For ease, I’ll be creating a hypothetical dataset using numpy as below.
I imported the couple of libraries required and created a dataset of 1 million records with 500 features. If we run analytics on top of this DataFrame with Pandas alone, trust me, we are going to struggle. Let's check how much memory it eats.
3.7 GB. How's that?
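The setup might look like the following sketch. The column names (col0 through col499) and the value range are my own choices for illustration; the size matches the article's 1 million rows by 500 columns, which at 8 bytes per int64 cell works out to roughly 3.7 GiB.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 1 million rows x 500 columns of random integers.
# Column names "col0" ... "col499" are assumed for this sketch.
n_rows, n_cols = 1_000_000, 500
df = pd.DataFrame(
    np.random.randint(0, 100, size=(n_rows, n_cols)),
    columns=[f"col{i}" for i in range(n_cols)],
)

# 500 million int64 cells at 8 bytes each is about 3.7 GiB.
print(f"{df.memory_usage(deep=True).sum() / 1024**3:.1f} GB")
```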
I'm going to convert the DataFrame into a CSV file and save it to storage, so that we can get started with Vaex by reading the data from a CSV file, just as we do every time with Pandas.
If we run the above block of code, we would get a result similar to the one above. Rather than keeping the file as a CSV or something else, Vaex converts our data into the HDF5 format, an open-source file format that supports large, complex, heterogeneous data.
Let’s open the data which is saved in HDF5 format and do some computations.
What I'm going to do is multiply column 1 by column 3 and save the result in a new column called division_col13.
You guys should definitely try it yourself, because I may not be able to convey how quick the process really was, and remember, we had exactly 1 million records in hand.
If we were using a conventional Pandas DataFrame for the same operation, we would have waited far longer for the task to complete. Luckily, we have Vaex now. Just nanoseconds, that's what Vaex needs to process a million data points, because the new column is virtual: Vaex stores the expression and evaluates it lazily instead of materializing the result. It was just a simple starter, and you can try a lot of operations on it.
Now let's run a filtering operation that selects only those rows having values greater than 70 in col2.
Trust me guys, this is amazingly fast.
Let’s try running a couple more operations.
I tried finding the mean and got the result in less than 0.2 seconds. We can also try different built-in operations like minimum, maximum, minmax (minimum and maximum together), and so on.
Above is an aggregate query run over the same dataset, and it took only 44 milliseconds to complete.
I understand it's hard to convey in writing just how fast this is, so you should definitely try it with your own datasets and see how well it performs.
Thank you for your time and patience.