Azure for AI/ML

Original article was published by Allen Manoj on Artificial Intelligence on Medium


Azure ML Service

Bring the power of containerization and automation.

Fun fact: all of us use Jupyter notebooks, but how did Jupyter get its name?
The name combines Julia (Ju), Python (Py), and R, the three core languages the project was built to support.

Typical Workflow

Data sources

Azure ML can import data from the following sources: Azure Blob Storage, a web URL over HTTP, Hadoop via HiveQL, Azure Table Storage, Azure SQL Database, SQL Server on an Azure VM, an on-premises SQL Server database via the Data Manager, and OData.

Data Formats

  • .csv — Comma-Separated Values with a header
  • .nh.csv — Comma-Separated Values with no header
  • .tsv — Tab-Separated Values with a header
  • .nh.tsv — Tab-Separated Values with no header
  • .txt — Plain text
  • .svmlight — SVMlight format
  • .arff — Attribute Relation File Format
  • .zip
  • .RData — R object or workspace
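The difference between the header and no-header variants can be illustrated with a minimal sketch using Python's standard csv module. The function name read_table and the Col1, Col2 naming for headerless columns are assumptions made for this example, not part of Azure ML.

```python
import csv
import io

def read_table(text, has_header=True, delimiter=","):
    """Parse delimited text into a list of dicts, one per data row.

    When has_header is False, columns are named Col1, Col2, ...
    (a naming choice made for this sketch).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        header, body = rows[0], rows[1:]
    else:
        header = [f"Col{i + 1}" for i in range(len(rows[0]))]
        body = rows
    return [dict(zip(header, row)) for row in body]

# A .csv-style sample (with header) and a .nh.tsv-style sample (no header).
with_header = "age,income\n34,52000\n29,48000\n"
without_header = "34\t52000\n29\t48000\n"

csv_rows = read_table(with_header, has_header=True)
tsv_rows = read_table(without_header, has_header=False, delimiter="\t")
print(csv_rows[0])
print(tsv_rows[0])
```

Reading a headerless file as if it had a header silently drops the first data row, which is why the format has to be declared up front.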

Explore, Create Summaries

Things to keep in mind.

  • Develop an understanding of data.
  • Which features show dependent and independent behavior?
  • Do the features contain outliers?
  • Are there features that only add noise if used for training the model?
  • Are there trends, patterns, or biases?
  • Why do the attributes have missing values?
  • Which values are rare, and why?
  • Can you see any unusual patterns? What might explain them?
  • How are the observations within each cluster similar to each other?
  • How are the observations within separate clusters different from each other?
  • Identify the missing values
  • Find the minimum and maximum value.
  • Draw a correlation plot.
  • Draw a box plot or scatterplot, and identify any skewness.
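Several of the checks above (missing values, minimum and maximum, outliers, correlation) can be sketched with the standard library alone. This is an illustrative example, not what Azure ML runs internally. The 1.5 × IQR outlier rule and the helper names summarize and pearson are assumptions chosen for the sketch.

```python
import statistics

def summarize(values):
    """Summarize one numeric column where None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        # Values outside the 1.5 * IQR fences, the same rule a box plot uses.
        "outliers": [v for v in present if v < low or v > high],
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

column = [12, 15, 14, None, 13, 16, 15, 95, None, 14]
summary = summarize(column)
print(summary)
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```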

Bar Graph: Distribution of categorical variable
This is useful in plotting discrete values.

Histogram: Distribution of continuous variable
– Negative skew: the longer tail is on the left.
– Positive skew: the longer tail is on the right.
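The sign of the skew can also be checked numerically. The sketch below computes population skewness (the mean cubed deviation divided by the cubed standard deviation). The sample data is made up for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation over the cubed std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # a few large values stretch the right tail
left_tailed = [-v for v in right_tailed]    # mirror image stretches the left tail

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```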

Prepare and Clean Data

Missing values can be handled in several ways:

  • Replace using MICE.
  • Replace using Probabilistic PCA.
  • Substitute a custom value.
  • Replace with the mean, median, or mode.
  • Remove the entire row or column.
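The simpler options in this list (mean, median, mode, custom substitution, and row removal) can be sketched in a few lines. The function names impute and drop_incomplete_rows are hypothetical, and None is assumed to mark a missing value. MICE and Probabilistic PCA need a modeling library and are not shown here.

```python
import statistics

def impute(values, strategy="mean", custom=None):
    """Return a copy of values with each None replaced by a fill value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "custom":
        fill = custom
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def drop_incomplete_rows(rows):
    """Keep only the rows that contain no missing value."""
    return [row for row in rows if None not in row]

ages = [10, None, 20, 20, None, 50]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "custom", custom=0))
print(drop_incomplete_rows([(1, 2), (None, 3), (4, None)]))
```

Mean imputation is sensitive to outliers (here the value 50 pulls the fill up to 25), which is one reason to prefer the median for skewed columns.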

MICE: Multiple Imputation by Chained Equations
Each variable with missing data is modeled conditionally on the other variables in the data, and the missing values are then filled in from that model.

Probabilistic PCA: Probabilistic Principal Component Analysis
Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed.

Preprocessing

With standard or advanced preprocessing, the data is automatically scaled or normalized to help the algorithms perform well.

Preprocessing steps include:

  • Drop high-cardinality or no-variance features
  • Impute missing values
  • Generate additional features
  • Transform and encode
  • Word embeddings
  • Cluster distance
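A few of these steps can be sketched by hand to show what they do to the data. This is a minimal illustration under simple assumptions (numeric lists, a dict of columns), not the implementation Azure ML uses. All function names are hypothetical.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def drop_constant_columns(table):
    """Drop no-variance features: columns holding a single distinct value."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}

def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors (sorted categories)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

table = {"constant": [1, 1, 1], "income": [10, 20, 30]}
cleaned = drop_constant_columns(table)
print(list(cleaned))
print(min_max_scale(cleaned["income"]))
print(standardize(cleaned["income"]))
print(one_hot(["red", "blue", "red"]))
```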

Cross-Validation

  • Uses more test data, because every row is held out for testing in one of the folds.
  • Evaluates the dataset as well as the model.
  • Shows how well the model generalizes to new datasets.
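The mechanics of k-fold cross-validation can be sketched as follows. To stay self-contained, the "model" is a baseline that predicts the mean of the training targets, and the score is the mean absolute error. Both are stand-ins chosen for this example. A real run would usually shuffle the rows first, which this sketch omits.

```python
import statistics

def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test
        start = stop

def cross_validate(targets, k):
    """Return the mean absolute error of a mean-predictor baseline per fold."""
    scores = []
    for train, test in k_fold_indices(len(targets), k):
        prediction = statistics.fmean([targets[i] for i in train])
        errors = [abs(targets[i] - prediction) for i in test]
        scores.append(statistics.fmean(errors))
    return scores

targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
scores = cross_validate(targets, k=3)
print(scores)
print(statistics.fmean(scores))
```

Every row lands in a test fold exactly once, and the spread of the per-fold scores hints at how consistent the dataset is.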

Model Deployment

Deploy the model for consumption!

The supported deployment targets are:

  • Docker image
  • Azure Container Instance
  • Azure Kubernetes Service
  • Azure IoT Edge
  • Field-Programmable Gate Array (FPGA)

For deployment, you need:

  • An environment file that specifies the package dependencies.
  • A configuration file that requests the required resources for the container.
  • A scoring script that tells the deployed service how to load and call the model.
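The scoring script can be sketched as below. Azure ML entry scripts conventionally define an init() function that runs once at startup and a run() function that handles each request. Everything else here is an assumption made so the sketch runs on its own: the hard-coded linear model stands in for a registered model loaded from disk, and the {"data": [...]} payload shape is only one common choice.

```python
import json

model = None  # populated once by init()

def init():
    """Called once when the container starts; load the model here."""
    global model
    # Assumption: a real script would deserialize a registered model from disk.
    # A hypothetical linear model stands in so this sketch is self-contained.
    weights, bias = [2.0, -1.0], 0.5
    model = lambda row: sum(w * x for w, x in zip(weights, row)) + bias

def run(raw_data):
    """Called per request; raw_data is a JSON string like {"data": [[...], ...]}."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": [model(row) for row in rows]}
    except Exception as exc:
        # Return the error to the caller instead of crashing the service.
        return {"error": str(exc)}

# Local smoke test, mimicking what the hosting service would do.
init()
response = run(json.dumps({"data": [[1.0, 2.0], [3.0, 0.0]]}))
print(response)
```

Loading the model in init() rather than in run() keeps the per-request cost down, since the model is loaded only once per container.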