Part 1 — SQLite DB creation

In this post we will begin the preprocessing step of the medical claims data analysis pipeline.

As described in this project overview, the CMS free medical claims database is unique in the medical domain, as it's the only freely available dataset that contains complete medical data for a significant sample size over a significant period of time, both in and outside of hospitals. This dataset can offer us deep insights into a patient's health status and utilization of health-related resources. It's important to keep in mind that this dataset is synthetic, and as we will see in future posts, we have to be cautious when interpreting our results.

As with most medical datasets, this data is very RAW and needs A LOT of cleaning and preprocessing before it can be used for modeling and prediction; we will go through this process in the following posts. In this first post of the series, we will face the first challenge of this dataset: multiple huge CSV files with 10M+ rows, structured in a way that will not fit our future feature extraction needs.

There are many ways to face this problem. In this project, we take a multi-step approach. First, we will create SQLite database files from the raw CSVs. Since each of the 5 feature classes (diagnosis, procedures, costs, hospitalization, and medication; see the overview) will be extracted in a different manner, we will build a separate database file for each feature class. This keeps later stages of the feature creation pipeline efficient and easy to use: we are going to query the database many times and for many samples during feature extraction, so it's important that those queries are as efficient as possible.
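Before diving into the project code, here is a minimal sketch of the underlying pattern: stream a large CSV into a SQLite file in manageable chunks, then index the column we will query on. The file name, table name, and the DE-SynPUF-style patient ID column are assumptions for illustration, not the project's actual identifiers.

```python
import sqlite3

import pandas as pd

# Illustrative names only; the real paths and table names live in the project.
CSV_PATH = "outpatient_claims.csv"
DB_PATH = "outpatient_claims.db"

con = sqlite3.connect(DB_PATH)

# Stream the huge CSV in chunks so we never hold 10M+ rows in memory at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=500_000):
    chunk.to_sql("outpatient_claims", con, if_exists="append", index=False)

# Index the (assumed) patient ID column so the per-sample queries in the
# feature extraction pipeline stay fast.
con.execute("CREATE INDEX IF NOT EXISTS idx_patient ON outpatient_claims (DESYNPUF_ID)")
con.commit()
con.close()
```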

As this is a huge project, the entire code is available in this GitHub repository; only a brief summary of the code will be presented here.

The DbRunner class manages the SQLite database creation process (it inherits some useful methods from the BaseDB class). As each of the 5 main table types (inpatient, outpatient, beneficiary, carrier, medications; see the overview) has a slightly different processing pipeline, the main method of the class, run_dbs, is a simple wrapper that calls all of the table-specific methods.
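A minimal sketch of the idea (the real class lives in the repository; the per-table method names here are assumptions):

```python
class DbRunner:  # in the project, DbRunner inherits shared helpers from BaseDB
    """Builds the SQLite database files for each of the 5 main table types."""

    def run_dbs(self):
        # A simple wrapper: each table type gets its own processing pipeline.
        self.create_inpatient_db()
        self.create_outpatient_db()
        self.create_beneficiary_db()
        self.create_carrier_db()
        self.create_medications_db()
```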

Let's look at an example of one of the table-specific methods, which handles the creation of the outpatient databases from the outpatient CSV. The create_outpatient_db method is a wrapper that iterates over all the feature classes relevant to the outpatient table, and creates and caches each database if it doesn't already exist:
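In outline, and with assumed helper names (get_db_path, create_feature_class_db) and an assumed feature-class list, the wrapper might look roughly like this:

```python
import os

import pandas as pd

# Assumed subset of feature classes relevant to the outpatient table.
OUTPATIENT_FEATURE_CLASSES = ["diagnosis", "procedures", "costs"]

class DbRunner:  # continuing the sketch above
    def create_outpatient_db(self):
        outpatient_df = pd.read_csv(self.outpatient_csv_path)
        for feature_class in OUTPATIENT_FEATURE_CLASSES:
            db_path = self.get_db_path("outpatient", feature_class)
            # Cache: only build databases that don't already exist on disk.
            if not os.path.exists(db_path):
                self.create_feature_class_db(outpatient_df, feature_class, db_path)
```

Caching at the file level means a rerun of the pipeline only rebuilds whatever is missing.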

After several more layers of wrapper functions, let's look at the actual database creation method for the diagnosis feature class. The main idea of this method is to take the original CSV, which contains several diagnoses in each row (all originating from the same encounter), and transform it into a database in which every row represents a single diagnosis. As we will filter for specific diagnoses during feature extraction later on, this step is crucial. The method iterates over the different diagnosis columns, creates a dataframe from each diagnosis column together with the columns containing necessary information such as date and provider ID, and concatenates all of the non-empty rows into one large table, which will serve us heavily in the future.
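A sketch of that transformation, using DE-SynPUF-style column names (ICD9_DGNS_CD_1..N for the diagnosis slots, plus the claim date and provider ID; treat the exact names as assumptions):

```python
import pandas as pd

# Columns carried along with each diagnosis; names follow the DE-SynPUF style
# and are assumptions here.
ID_COLS = ["DESYNPUF_ID", "CLM_FROM_DT", "PRVDR_NUM"]

def diagnoses_to_long(claims_df: pd.DataFrame) -> pd.DataFrame:
    """Turn one-row-per-claim (many diagnosis columns) into one-row-per-diagnosis."""
    dgns_cols = [c for c in claims_df.columns if c.startswith("ICD9_DGNS_CD_")]
    frames = []
    for col in dgns_cols:
        frame = claims_df[ID_COLS + [col]].rename(columns={col: "DGNS_CD"})
        # Keep only the rows where this diagnosis slot is actually filled in.
        frames.append(frame.dropna(subset=["DGNS_CD"]))
    return pd.concat(frames, ignore_index=True)
```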

Then we use the convenient sqlite3 API to create the database from the pandas dataframe, and cache it:
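Something along these lines (the database path and table name are placeholders):

```python
import sqlite3
from contextlib import closing

import pandas as pd

def write_and_cache(df: pd.DataFrame, db_path: str, table_name: str) -> None:
    # The SQLite file on disk doubles as the cache that later pipeline stages
    # (and reruns) read instead of re-parsing the raw CSV.
    with closing(sqlite3.connect(db_path)) as con:
        df.to_sql(table_name, con, if_exists="replace", index=False)

# e.g. write_and_cache(diagnosis_df, "outpatient_diagnosis.db", "diagnosis")
```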

Our final database looks like this: one row per diagnosis, along with the date and provider ID for each.

Our wrapper methods make sure we carry out this procedure for every table and every feature class. As we have 5 tables and 5 feature classes, all with specific uses and different variable names, this can be quite exhausting, especially as we have to keep track of all database and feature class names for future use. Luckily, the entire process is carefully managed in the project, which can be found in this GitHub repository.
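One way to keep that bookkeeping in a single place (purely a sketch; the mapping below is illustrative, not the project's actual configuration) is a dictionary from table type to its relevant feature classes, plus one naming convention for the database files:

```python
# Illustrative bookkeeping only: which feature classes are extracted from which
# table, and where each SQLite file lives. The actual mapping used by the
# project lives in its GitHub repository.
DB_CONFIG = {
    "inpatient": ["diagnosis", "procedures", "costs", "hospitalization"],
    "outpatient": ["diagnosis", "procedures", "costs"],
    "beneficiary": ["costs"],
    "carrier": ["diagnosis", "costs"],
    "medications": ["medication", "costs"],
}

def db_path(table: str, feature_class: str) -> str:
    return f"dbs/{table}_{feature_class}.db"
```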

In the next post, we will begin to use the databases we created for feature extraction. The first step in our feature extraction process will allow us to parallelize our calculations with multiprocessing, as will be described in the following post.