Azure Machine Learning Service — Where is My Data?



Datastores and Datasets

Datastores

Datastores are a data management capability provided by the Azure Machine Learning (AML) service and its SDK. They enable us to connect to various data sources, which can then be used to ingest data into an ML experiment or to write outputs from the same experiments. Azure provides various platform services that can be used as a data source, e.g., Blob storage, Data Lake, SQL Database, Databricks, and many others.

The Azure ML workspace has native integration with the datastores defined in Azure, such as Blob storage and File storage. But running an ML model may require data and its dependencies from other external sources. Hence, the AML SDK gives us a way to register these external sources as datastores for model experiments. The ability to define a datastore enables us to reuse the data across multiple experiments, regardless of the compute context in which the experiment is running.

Register Datastores

As discussed, datastores are of two types: default and user-provisioned, such as Blob storage containers or file storage. To get the default datastore of a workspace:

# get the name of the default datastore associated with the workspace.
default_dsname = ws.get_default_datastore().name
default_ds = ws.get_default_datastore()
print('default Datastore = ', default_dsname)
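
Beyond the default, a workspace can have several registered datastores. A minimal sketch to enumerate them (the names will vary per workspace):

# list every datastore registered with the workspace, marking the default one
for ds_name in ws.datastores:
    print(ds_name, '(default)' if ds_name == default_dsname else '')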

To register a Blob storage container as a datastore using the AML SDK:

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='11223312345cadas6abcde789…')
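
Other Azure storage services can be registered in much the same way. As an illustrative sketch (the share name and key below are placeholders, not values from this article), an Azure file share is registered with a similar call:

# register an Azure file share as a datastore (placeholder names)
file_ds = Datastore.register_azure_file_share(workspace=ws,
                                              datastore_name='file_data',
                                              file_share_name='data_share',
                                              account_name='az_store_acct',
                                              account_key='<storage-account-key>')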

To set or change the default datastore —

ws.set_default_datastore('blob_data')
View from Azure ML — Datastores
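
Once registered, a datastore can be retrieved by name at any time without re-supplying credentials. A minimal sketch using the datastore registered above:

from azureml.core import Datastore

# fetch the registered datastore by its name
blob_ds = Datastore.get(ws, datastore_name='blob_data')
print(blob_ds.name, blob_ds.datastore_type)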

Upload files to Datastores

Upload files from the local system to the remote datastore. This allows experiments to run directly against the remote data location. The target_path is the path of the files at the remote datastore location. A data reference path is returned once the files are uploaded to the datastore. To use a datastore in an experiment script, we must pass this data reference to the script.

default_ds.upload_files(files=['../input/iris-flower-dataset/IRIS.csv'],
                        target_path='flower_data/',
                        overwrite=True,
                        show_progress=True)

flower_data_ref = default_ds.path('flower_data').as_download('ex_flower_data')
print('reference_path = ', flower_data_ref)
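
as_download copies the contents of the referenced folder onto the compute before the run starts. For larger datasets, the same reference can instead be mounted; a sketch of that alternative for the same folder:

# mount the datastore folder instead of downloading it (useful for large data)
flower_data_mount = default_ds.path('flower_data').as_mount()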

Experiment with Data Store

Once we have the reference to the datastore as mentioned above, we need to pass it to the experiment script as a script parameter from the estimator. Inside the script, the value of this parameter can be retrieved and then used like a local folder:

# iris_simple_experiment.py: retrieve the script arguments
import argparse
from azureml.core import Run

run = Run.get_context()

# define the regularization parameter for the logistic regression
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
# define the data_folder parameter for referencing the path of the registered data folder
parser.add_argument('--data_folder', type=str, dest='data_folder', help='Data folder reference')
args = parser.parse_args()
r = args.reg
ex_data_folder = args.data_folder
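
Once the folder reference has been resolved to a local path, the script can read the uploaded file with ordinary file I/O. A minimal sketch, assuming pandas is available and the file keeps the IRIS.csv name used in the upload above:

import os
import pandas as pd

# load the training data from the folder passed in via --data_folder
iris_df = pd.read_csv(os.path.join(ex_data_folder, 'IRIS.csv'))
print(iris_df.shape)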

Define the estimator as —

from azureml.train.sklearn import SKLearn

# the estimator passes the datastore reference to the script via script_params
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    script_params={'--reg_rate': 0.07,
                                   # assigned reference path value as defined above
                                   '--data_folder': flower_data_ref})

The '--data_folder' parameter accepts the datastore folder reference, i.e., the path where the files were uploaded. The script loads the training data from the data reference passed to it as a parameter, so we need to set up the script parameters to pass the file reference when running the experiment.
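
To run it, the estimator is submitted to an experiment in the workspace. A minimal sketch, assuming the hypothetical experiment name 'iris-datastore-experiment':

from azureml.core import Experiment

# submit the estimator as an experiment run and stream the logs
experiment = Experiment(workspace=ws, name='iris-datastore-experiment')
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)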