Top Open Source Tools and Libraries for Deep Learning — ICLR 2020 Experience

Original article was published by Patrycja Jenkner on Deep Learning on Medium


Automunge

Tabular data preprocessing platform

Language: Python

Author: Nicholas Teague

Twitter | LinkedIn | GitHub | Website

Description

Automunge is a Python library for preparing tabular data for machine learning. It applies simple feature engineering transformations that normalize values, encode them numerically, and fill in missing entries (infill). Transformations are “fit” to the properties of a train set and then applied consistently to test data on that basis. Transformations may be selected automatically, assigned from an internal library, or custom-defined by the user. Infill options include “ML infill”, in which automated machine learning models are trained for each column to predict infill values.

Put simply:

  • automunge(.) prepares tabular data for machine learning.
  • postmunge(.) consistently prepares additional data very efficiently.
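
To make the fit-then-apply idea concrete, here is a minimal pure-Python sketch of min-max scaling in which the parameters are learned from the train set and reused for the test set. It is not Automunge's internal code: the helper names fit_minmax and apply_minmax are hypothetical, and filling missing entries with the train mean is an assumption made only for illustration.

```python
# Illustrative sketch only: this is NOT Automunge's implementation.
# fit_minmax and apply_minmax are hypothetical helper names.

def fit_minmax(train_values):
    """Learn min-max scaling parameters from the train set only."""
    present = [v for v in train_values if v is not None]
    low, high = min(present), max(present)
    mean = sum(present) / len(present)
    # Assumption: missing entries are filled with the train mean.
    return {"min": low, "max": high, "infill": mean}


def apply_minmax(values, params):
    """Scale any data set using the parameters learned from the train set."""
    span = params["max"] - params["min"]
    scaled = []
    for v in values:
        if v is None:
            v = params["infill"]  # fill missing entries before scaling
        scaled.append((v - params["min"]) / span if span else 0.0)
    return scaled


train_column = [10.0, 20.0, None, 30.0]
test_column = [15.0, None, 40.0]

params = fit_minmax(train_column)                  # the "fit" step
train_scaled = apply_minmax(train_column, params)
test_scaled = apply_minmax(test_column, params)    # same basis as the train set

print(train_scaled)
print(test_scaled)
```

Because the parameters come from the train set alone, a test value above the train maximum (40.0 here) scales to a number greater than 1. That is the expected consequence of processing test data on the train set's basis rather than refitting.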

Automunge can be installed with pip:

pip install Automunge

Once installed, run the following in a notebook to initialize it:

from Automunge import Automunger
am = Automunger.AutoMunge()

For automated processing of a train set with default parameters, run:

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train)

And for subsequent consistent processing of test data, using the postprocess_dict dictionary populated from the corresponding automunge(.) call, run:

test, testID, testlabels, \
labelsencoding_dict, postreports_dict \
= am.postmunge(postprocess_dict, df_test)

Transformations and infill types can be specified in an automunge(.) call through the assigncat and assigninfill parameters. For example, for a train set with column headers ‘column1’ and ‘column2’, one could assign min-max scaling (‘mnmx’) with ML infill to column1 and one-hot encoding (‘text’) with mode infill to column2. Any columns not explicitly specified will defer to automation.

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               assigncat = {'mnmx':['column1'], 'text':['column2']},
               assigninfill = {'MLinfill':['column1'], 'modeinfill':['column2']})
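
For intuition about what the column2 assignment asks for, the following pure-Python sketch combines mode infill with one-hot encoding, fitting both on the train column and reusing them on the test column. It is a simplified illustration rather than Automunge's implementation: fit_onehot and apply_onehot are hypothetical names, and encoding a category unseen in the train set as all zeros is an assumption of this sketch.

```python
from collections import Counter

# Illustrative sketch only: this is NOT Automunge's implementation.
# fit_onehot and apply_onehot are hypothetical helper names.

def fit_onehot(train_values):
    """Record the train set's categories and its most common value (mode)."""
    present = [v for v in train_values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    categories = sorted(set(present))
    return {"categories": categories, "mode": mode}


def apply_onehot(values, params):
    """Mode-infill missing entries, then one-hot encode on the train categories."""
    rows = []
    for v in values:
        if v is None:
            v = params["mode"]  # mode infill, using the train set's mode
        # Assumption: a category unseen in the train set becomes all zeros.
        rows.append([1 if v == c else 0 for c in params["categories"]])
    return rows


train_column2 = ["red", "blue", None, "red"]
test_column2 = ["blue", None, "green"]

params = fit_onehot(train_column2)
train_encoded = apply_onehot(train_column2, params)
test_encoded = apply_onehot(test_column2, params)

print(params["categories"])
print(train_encoded)
print(test_encoded)
```

The output columns follow the sorted train categories, so the test set always receives the same columns in the same order as the train set, whatever values it contains.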

Resources and links

Website | GitHub | Brief presentation