Original article was published by Rebecca Vickery on Artificial Intelligence on Medium
In 2018 Google released a new tool called BigQuery ML. BigQuery is Google’s cloud data warehouse solution and is designed to give data analysts and scientists fast access to large quantities of data. BigQuery ML is a tool that enables machine learning models to be developed directly from the BigQuery data warehouse using only SQL.
Since its release, BigQuery ML has evolved to include support for most common machine learning tasks, including classification, regression and clustering. You can even import your own TensorFlow models for use within the tool.
From my own experience, BigQuery ML is an extremely useful tool for speeding up model prototyping, and it is also viable as a production system for simple problems.
To give a brief introduction to the tool, I will use the adult income data set to illustrate how to build and evaluate a logistic regression classification model in BigQuery ML.
The dataset can be found on the UCI Machine Learning Repository. I downloaded it and exported it as a CSV file with a short Python script.
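A minimal sketch of such a download script, using only the Python standard library. The URL, the output file name and the column names are assumptions: the names follow the dataset description, with hyphens swapped for underscores so that BigQuery will accept them.

```python
import csv
import io
import urllib.request

# Assumed location of the raw data file on the UCI repository;
# check the repository page if it has moved.
DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# The raw file has no header row, so the column names are supplied here.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]


def parse_rows(raw_text):
    """Parse the raw comma-separated text into rows of stripped strings."""
    reader = csv.reader(io.StringIO(raw_text), skipinitialspace=True)
    # Skip blank lines and any row that does not have one value per column.
    return [[v.strip() for v in row] for row in reader if len(row) == len(COLUMNS)]


def write_csv(rows, path):
    """Write a header row followed by the data rows to `path`."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)


def download_adult_csv(path="adults_data.csv"):
    """Download the raw file and export it as a CSV with a header row."""
    with urllib.request.urlopen(DATA_URL) as response:
        raw_text = response.read().decode("utf-8")
    rows = parse_rows(raw_text)
    write_csv(rows, path)
    return len(rows)
```

Calling `download_adult_csv()` writes `adults_data.csv` to the working directory and returns the number of rows written.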
If you don’t already have a Google Cloud Platform (GCP) account, you can create one here. When you initially sign up you will get $300 in free credit, which is more than enough to try out the following examples.
Once on GCP, navigate to the BigQuery web UI from the drop-down menu. If this is your first time using GCP, you will need to create a project and get set up with BigQuery. The Google quickstart guide gives a good overview here.
The CSV file that I downloaded earlier can be directly uploaded into GCP to create a table.
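The upload can also be scripted rather than done through the web UI. The sketch below is an alternative to the approach described here, not the one the article uses. It assumes the `google-cloud-bigquery` client library is installed and authenticated, and the project, dataset and table names are hypothetical placeholders.

```python
def full_table_id(project, dataset, table):
    """Build the `project.dataset.table` identifier BigQuery expects."""
    return f"{project}.{dataset}.{table}"


def load_csv_to_bigquery(csv_path, table_id):
    """Load a local CSV file that has a header row into a BigQuery table."""
    # Imported here so the helper above works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # the first row holds the column names
        autodetect=True,      # let BigQuery infer the schema
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows
```

For example, `load_csv_to_bigquery("adults_data.csv", full_table_id("my-project", "adults_data", "adults"))` would load the file into a table named `adults`, provided the dataset `adults_data` already exists in that project.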
You can inspect the data in the table by clicking on the table name in the sidebar and selecting preview. This is what the adult data looks like now that it is in BigQuery.
To train a model on this data we simply write a SQL query which selects everything (*) from the table, renames the target variable (income) to label and adds the logic to create a logistic regression model named ‘adults_log_reg’.
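As a rough sketch, the training query could look like the string built below. The dataset and table names are hypothetical. Two details are assumptions rather than the article's exact query: `CREATE OR REPLACE MODEL` is used so that the statement can be re-run, and `* EXCEPT(income)` is used so that the original `income` column is not also passed to the model as a feature alongside `label`.

```python
def create_model_sql(dataset, table, model="adults_log_reg", target="income"):
    """Build a BigQuery ML statement that trains a logistic regression model."""
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model}`\n"
        # 'logistic_reg' selects logistic regression; BigQuery ML treats the
        # column named `label` as the target by default.
        "OPTIONS(model_type='logistic_reg') AS\n"
        # Every column except the target becomes a feature.
        f"SELECT * EXCEPT({target}), {target} AS label\n"
        f"FROM `{dataset}.{table}`"
    )


# Hypothetical dataset and table names.
TRAIN_SQL = create_model_sql("adults_data", "adults")
```

The resulting text in `TRAIN_SQL` can be pasted into the BigQuery query editor and run as is.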
For all the model options see the documentation here.
If you click on the model, which will now appear in the sidebar alongside your data table, you can see an evaluation of the training performance.
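The same kind of evaluation can be requested in SQL with the `ML.EVALUATE` function. The sketch below builds such a query, and the dataset name is a hypothetical placeholder. For a logistic regression model, the result includes columns such as `precision`, `recall`, `accuracy`, `f1_score`, `log_loss` and `roc_auc`.

```python
def evaluate_sql(dataset, model="adults_log_reg"):
    """Build a query that returns evaluation metrics for a trained model."""
    # With no table argument, ML.EVALUATE reports metrics computed
    # on data from the training run.
    return f"SELECT *\nFROM ML.EVALUATE(MODEL `{dataset}.{model}`)"


# Hypothetical dataset name.
EVALUATE_SQL = evaluate_sql("adults_data")
```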
Now we can use the model to make predictions using the ML.PREDICT function.
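A prediction query might be sketched as follows. The dataset and table names are hypothetical, and scoring rows from the training table is done here only to show the syntax. The output contains the input columns plus `predicted_label` and `predicted_label_probs`.

```python
def predict_sql(dataset, table, model="adults_log_reg", target="income", limit=10):
    """Build a query that scores rows from `table` with a trained model."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{dataset}.{model}`,\n"
        # The input rows must supply the feature columns the model was
        # trained on; the target column is left out.
        f"  (SELECT * EXCEPT({target}) FROM `{dataset}.{table}` LIMIT {limit}))"
    )


# Hypothetical dataset and table names.
PREDICT_SQL = predict_sql("adults_data", "adults")
```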