Azure ML inference pipelines with clustering models

Alibek Jakupov
May 16, 2021

Updated: Nov 19, 2021

Clustering is a form of machine learning in which related objects are grouped together based on their characteristics. It is an example of unsupervised machine learning, in which you train a model to group objects based solely on their characteristics, or attributes. The model cannot be trained using any previously defined cluster value (or label).

In Azure Machine Learning, an inference pipeline uses a trained model to make predictions on new input data, typically by assigning it to labels the model has learned. It generally forms the template for a web service that you can publish for other services and applications to consume.

Most of you are familiar with both, but did you know that we can combine them? I didn't know that until now, so I thought it was worth sharing this small tip with you. If someone finds it useful, then my work was not in vain. Up we go!

For this post, I've followed the tutorial on Microsoft Learn, so you can refer to their materials to reproduce exactly the same steps as in this blog post.

Create an Azure Machine Learning workspace

This step is quite straightforward: simply sign in to the Azure portal, create a resource by searching for 'Machine Learning', and provide the following information:

Subscription: Your Azure subscription
Resource group: Create or select a resource group
Workspace name: Enter a unique name for your workspace
Region: Select the geographical region closest to you
Storage account: Note the default new storage account that will be created for your workspace
Key vault: Note the default new key vault that will be created for your workspace
Application insights: Note the default new application insights resource that will be created for your workspace
Container registry: None (one will be created automatically the first time you deploy a model to a container)

Once the resource is created, simply launch the studio (either from the workspace page in the portal or by going directly to https://ml.azure.com).

Create compute resource

Again, nothing special here, but there is one important thing to know. In Azure Machine Learning, the price is driven mainly by the compute resources attached to the workspace. They vary by configuration and VM family, and should be chosen according to the usage context. For instance, large language models like BERT generally call for GPU instances because of their size.

If you train your model locally and use Azure Machine Learning for model deployment via the SDK, then your Azure ML workspace is only used for model deployment. More precisely, the steps performed by Azure Machine Learning in this case are:

Build a Container Image for the trained model
Deploy the model to "dev" using Azure Container Instances (ACI)
Deploy the model to production using Azure Kubernetes Service (AKS)

None of these steps involves creating a compute instance, so no compute cost is incurred, and only three resources will incur additional charges:

Azure Container Registry Basic account
Azure Block Blob Storage (general purpose v1)
Key Vault

A managed compute resource is created and managed by Azure Machine Learning. This compute is optimized for machine learning workloads. Azure Machine Learning compute clusters and compute instances are the only managed computes.

You can create Azure Machine Learning compute instances or compute clusters from:

Azure Machine Learning studio.
The Python SDK and CLI:
- Compute instance.
- Compute cluster.
The R SDK (preview).
An Azure Resource Manager template. For an example template, see Create an Azure Machine Learning compute cluster.
A machine learning extension for the Azure CLI.

When created, these compute resources are automatically part of your workspace, unlike other kinds of compute targets.

Note: When a compute cluster is idle, it autoscales to 0 nodes, so you don't pay when it's not in use. A compute instance is always on and doesn't autoscale. You should stop the compute instance when you aren't using it to avoid extra cost.

For more details on pricing for computational instances, please refer to the official documentation.

Create a Dataset

Data for model training and other operations is normally encapsulated in an entity called a dataset in Azure Machine Learning. In this tutorial, you'll work with a dataset that contains observations of three different penguin species.

Create a dataset from web files, using the following settings:

Basic Info:
- Web URL: https://aka.ms/penguin-data
- Name: penguin-data
- Dataset type: Tabular
- Description: Penguin data
Settings and preview:
- File format: Delimited
- Delimiter: Comma
- Encoding: UTF-8
- Column headers: Use headers from the first file
- Skip rows: None
Schema:
- Include all columns other than Path
- Review the automatically detected types
Confirm details:
- Do not profile the dataset after creation

For various observations of penguins, this data reflects measurements of the culmen (bill) length and depth, flipper length, and body mass. Adelie, Gentoo, and Chinstrap penguins are among the species described in the dataset.
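To make the Delimited / Comma / "Use headers from the first file" settings a bit more concrete, here is a minimal sketch using only Python's standard library. It is not what the studio runs internally: the helper name read_delimited is hypothetical, and the inline sample only holds three rows of the four measurement columns instead of downloading the real file.

```python
import csv
import io

# Illustrative sample only: three rows with the four measurement columns.
SAMPLE_CSV = """CulmenLength,CulmenDepth,FlipperLength,BodyMass
39.1,18.7,181,3750
49.1,14.8,220,5150
46.6,17.8,193,3800
"""


def read_delimited(text, delimiter=","):
    """Parse delimited text, taking the column names from the first line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # Convert every field to float, a crude stand-in for type detection.
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = read_delimited(SAMPLE_CSV)
print(list(rows[0].keys()))
print(len(rows), "rows parsed")
```

Each row comes back as a dictionary keyed by column name, which is roughly how a tabular dataset exposes its schema.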

Create a Pipeline

To use Azure Machine Learning Designer, you must first build a pipeline and then add the dataset you want to use.

In Azure Machine Learning studio for your workspace, view the Designer page and create a new pipeline.
In the Settings pane, change the default pipeline name (Pipeline-Created-on-date) to Train Penguin Clustering (if the Settings pane is not visible, click the ⚙ icon next to the pipeline name at the top).
Note that you need to specify a compute target on which to run the pipeline. In the Settings pane, click Select compute target and select the compute cluster you created previously.
In the pane on the left side of the designer, expand the Datasets section, and drag the penguin-data dataset you created in the previous exercise onto the canvas.
Right-click (Ctrl+click on a Mac) the penguin-data dataset on the canvas, and on the Visualize menu, select Dataset output.
Review the schema of the data, noting that you can see the distributions of the various columns as histograms. Then select the CulmenLength column.

Apply Transformations

We'll only use the measurements to cluster the penguin observations, so the species column will be removed with a Select Columns in Dataset module. We'll also need to delete any rows with missing values, using a Clean Missing Data module, and normalize the numeric measurement values so they're all on the same scale.

Finally, normalize your data by selecting the Normalize Data module, and in its Settings pane on the right, set the Transformation method to MinMax and select Edit column. Then in the Select columns window, select With rules and include All columns.
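If you are curious about what these two transformations do to the numbers, the following is a small sketch in plain Python, assuming the usual MinMax formula (value - min) / (max - min). The function names are hypothetical, the third row has a gap that I made up to show the cleaning step, and the designer modules may handle edge cases differently.

```python
def drop_incomplete(rows):
    """Keep only the rows in which every value is present (not None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fit_min_max(rows):
    """Record the (min, max) of each column so they can be reused later."""
    columns = rows[0].keys()
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


def apply_min_max(rows, stats):
    """Rescale every column to the [0, 1] range using the recorded statistics."""
    scaled = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            low, high = stats[column]
            # A constant column has no spread, so map it to 0.0.
            new_row[column] = 0.0 if high == low else (value - low) / (high - low)
        scaled.append(new_row)
    return scaled


observations = [
    {"CulmenLength": 39.1, "CulmenDepth": 18.7, "FlipperLength": 181.0, "BodyMass": 3750.0},
    {"CulmenLength": 49.1, "CulmenDepth": 14.8, "FlipperLength": 220.0, "BodyMass": 5150.0},
    # Made-up missing value, only here to exercise drop_incomplete.
    {"CulmenLength": 46.6, "CulmenDepth": None, "FlipperLength": 193.0, "BodyMass": 3800.0},
]

clean = drop_incomplete(observations)
stats = fit_min_max(clean)
normalized = apply_min_max(clean, stats)
print(normalized)
```

Keeping the statistics separate from the scaling step matters later: the inference pipeline reuses the minimum and maximum learned during training instead of recomputing them on new data.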

These steps are only performed to prepare your data for training. We are now ready to launch the training.

Create and run a training pipeline

Open the Train Penguin Clustering pipeline, if it's not already open.
In the pane on the left, in the Data Transformations section, drag a Split Data module onto the canvas under the Normalize Data module. Then connect the left output of the Normalize Data module to the input of the Split Data module.
Select the Split Data module, and configure its settings as follows:
1. Splitting mode: Split Rows
2. Fraction of rows in the first output dataset: 0.7
3. Random seed: 123
4. Stratified split: False
Expand the Model Training section in the pane on the left, and drag a Train Clustering Model module to the canvas, under the Split Data module. Then connect the Result dataset1 (left) output of the Split Data module to the Dataset (right) input of the Train Clustering Model module.
The clustering model should assign clusters to the data items by using all of the features you selected from the original dataset. Select the Train Clustering Model module and in its settings pane, on the Parameters tab, select Edit Columns and use the With rules option to include all columns.
The model we're training will use the features to group the data into clusters, so we need to train the model using a clustering algorithm. Expand the Machine Learning Algorithms section, and under Clustering, drag a K-Means Clustering module to the canvas, to the left of the penguin-data dataset and above the Train Clustering Model module. Then connect its output to the Untrained model (left) input of the Train Clustering Model module.
The K-Means algorithm groups items into the number of clusters you specify - a value referred to as K. Select the K-Means Clustering module and in its settings pane, on the Parameters tab, set the Number of centroids parameter to 3.
After using 70% of the data to train the clustering model, you can use the remaining 30% to test it by using the model to assign the data to clusters. Expand the Model Scoring & Evaluation section and drag an Assign Data to Clusters module to the canvas, below the Train Clustering Model module. Then connect the Trained model (left) output of the Train Clustering Model module to the Trained model (left) input of the Assign Data to Clusters module; and connect the Results dataset2 (right) output of the Split Data module to the Dataset (right) input of the Assign Data to Clusters module.
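The Split Data settings used above (Split Rows, 0.7, random seed 123, no stratification) can be pictured with a short sketch like the one below. It is only an assumption of how such a split behaves: split_rows is a hypothetical helper, and it will not reproduce the exact row selection made by the designer module.

```python
import random


def split_rows(rows, fraction=0.7, seed=123):
    """Shuffle a copy of the rows with a fixed seed and cut it at `fraction`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder row identifiers standing in for real observations.
rows = list(range(10))
train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed is what makes the split repeatable: calling split_rows again with the same arguments returns the same two subsets.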

After all these steps, your training pipeline should look like this.

If it seems similar to what you have in your experiment, then you are ready to run the training pipeline.

Training in our case simply means running the algorithm, iteration by iteration, until it converges, i.e. until every item stays assigned to the same nearest centroid from one iteration to the next, which leaves the clusters compact (low intra-class inertia) and well separated (high inter-class inertia). As there is no a priori knowledge, and the result depends on how the initial centroids are chosen, convergence may be reached in multiple ways, which is what makes this kind of approach extremely interesting.
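For readers who like to see the loop itself, here is a minimal sketch of a textbook K-Means iteration in plain Python (Python 3.8+ for math.dist). It is not the implementation behind the K-Means Clustering module; the function names, the seeded random initialization and the nine two-dimensional points are all made up for illustration.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k=3, seed=123, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    assignments = None
    for _ in range(max_iter):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no item changed cluster
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Made-up, already normalized points forming three loose groups.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
    (0.0, 1.0), (0.1, 1.0), (0.0, 0.9),
]
centroids, assignments = k_means(points, k=3)
print(assignments)
```

Changing the seed changes the starting centroids, and with it possibly the final grouping, which is the "multiple ways" of converging mentioned above.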

The fact that there are no previously established true values for the cluster assignments makes evaluating a clustering model difficult. We need metrics to help us calculate the separation since an effective clustering model achieves a reasonable degree of separation between the items in each cluster.

In the pane on the left, in the Model Scoring & Evaluation section, drag an Evaluate Model module to the canvas, under the Assign Data to Clusters module, and connect the output of the Assign Data to Clusters module to the Scored dataset (left) input of the Evaluate Model module.
Select Submit, and run the pipeline using the existing mslearn-penguin-training experiment.
Wait for the experiment run to finish.
When the experiment run has finished, select the Evaluate Model module and in the settings pane, on the Outputs + Logs tab, under Data outputs in the Evaluation results section, use the Visualize icon to view the performance metrics. These metrics can help data scientists assess how well the model separates the clusters. They include a row of metrics for each cluster, and a summary row for a combined evaluation. The metrics in each row are:
1. Average Distance to Other Center: This indicates how close, on average, each point in the cluster is to the centroids of all other clusters.
2. Average Distance to Cluster Center: This indicates how close, on average, each point in the cluster is to the centroid of the cluster.
3. Number of Points: The number of points assigned to the cluster.
4. Maximal Distance to Cluster Center: The maximum of the distances between each point and the centroid of that point’s cluster. If this number is high, the cluster may be widely dispersed. This statistic in combination with the Average Distance to Cluster Center helps you determine the cluster’s spread.
Close the Evaluate Model result visualization window.
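The four metrics listed above can be approximated with a few lines of Python. This is a sketch based only on the descriptions in this post, so the exact formulas used by the Evaluate Model module may differ; evaluate_clusters and the toy points are hypothetical.

```python
import math


def evaluate_clusters(points, assignments, centroids):
    """Compute per-cluster distance metrics from points and their assigned cluster."""
    results = []
    for i, center in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            results.append({"cluster": i, "number_of_points": 0})
            continue
        # Distances from each member to its own centroid.
        own = [math.dist(p, center) for p in members]
        # Distances from each member to the centroids of all other clusters.
        others = [
            math.dist(p, c)
            for p in members
            for j, c in enumerate(centroids)
            if j != i
        ]
        results.append({
            "cluster": i,
            "number_of_points": len(members),
            "avg_distance_to_cluster_center": sum(own) / len(own),
            "max_distance_to_cluster_center": max(own),
            "avg_distance_to_other_center": sum(others) / len(others) if others else None,
        })
    return results


# Toy example: two tight clusters that sit far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

metrics = evaluate_clusters(points, assignments, centroids)
for row in metrics:
    print(row)
```

A well-separated model shows a small average distance to its own center compared with the average distance to the other centers, as in this toy example.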

Create an inference pipeline

After you've created and run a pipeline to train the clustering model, you can build an inference pipeline that uses the model to assign new data observations to clusters. This will serve as the foundation for a predictive service that you can make available to applications.

In Azure Machine Learning Studio, open the Train Penguin Clustering pipeline you created previously.
In the Create inference pipeline drop-down list, click Real-time inference pipeline. After a few seconds, a new version of your pipeline named Train Penguin Clustering-real time inference will be opened. If the pipeline does not include Web Service Input and Web Service Output modules, go back to the Designer page and then re-open the Train Penguin Clustering-real time inference pipeline.
Rename the new pipeline to Predict Penguin Clusters, and then review the new pipeline. It contains a web service input for new data to be submitted, and a web service output to return results. The transformations and clustering model in your training pipeline are encapsulated in this pipeline based on the statistics from your training data, and will be used to transform and score the new data. You will make the following changes, which the remaining steps walk through in detail:
1. Replace the penguin-data dataset with an Enter Data Manually module that does not include the Species column.
2. Remove the Select Columns in Dataset module, which is now redundant.
3. Connect the Web Service Input and Enter Data Manually modules (which represent inputs for data to be clustered) to the first Apply Transformation module.
4. Remove the Evaluate Model module.
The inference pipeline assumes that new data will match the schema of the original training data, so the penguin-data dataset from the training pipeline is included. However, this input data includes a column for the penguin species, which the model does not use. Delete both the penguin-data dataset and the Select Columns in Dataset module, and replace them with an Enter Data Manually module from the Data Input and Output section. Then modify the settings of the Enter Data Manually module to use the CSV input given at the end of this section, which contains feature values for three new penguin observations (including headers).
Connect the outputs from both the Web Service Input and Enter Data Manually modules to the Dataset (right) input of the first Apply Transformation module.
Delete the Evaluate Model module.
Submit the pipeline as a new experiment named mslearn-penguin-inference on your compute cluster. This may take a while!
When the pipeline has finished, visualize the Results dataset output of the Assign Data to Clusters module to see the predicted cluster assignments and metrics for the three penguin observations in the input data.
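The key idea in this pipeline is that new observations are scaled with the statistics learned during training and then attached to the nearest trained centroid. The sketch below illustrates that idea only. The minimum and maximum values and the three centroids are placeholders that I invented for the example (they are not taken from the penguin dataset or from a trained model), and the function names are hypothetical.

```python
import math

FEATURES = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]

# Placeholder (min, max) pairs standing in for the training statistics.
TRAINING_MIN_MAX = {
    "CulmenLength": (30.0, 60.0),
    "CulmenDepth": (10.0, 25.0),
    "FlipperLength": (170.0, 240.0),
    "BodyMass": (2500.0, 6500.0),
}

# Placeholder centroids in normalized feature space, one per cluster.
CENTROIDS = [
    (0.3, 0.6, 0.2, 0.3),
    (0.6, 0.3, 0.8, 0.7),
    (0.6, 0.6, 0.3, 0.3),
]


def apply_transformation(observation):
    """Scale a raw observation with the stored training min/max values."""
    scaled = []
    for name in FEATURES:
        low, high = TRAINING_MIN_MAX[name]
        scaled.append((observation[name] - low) / (high - low))
    return tuple(scaled)


def assign_to_cluster(observation):
    """Return the index of the nearest stored centroid."""
    point = apply_transformation(observation)
    return min(range(len(CENTROIDS)), key=lambda i: math.dist(point, CENTROIDS[i]))


new_observations = [
    {"CulmenLength": 39.1, "CulmenDepth": 18.7, "FlipperLength": 181, "BodyMass": 3750},
    {"CulmenLength": 49.1, "CulmenDepth": 14.8, "FlipperLength": 220, "BodyMass": 5150},
    {"CulmenLength": 46.6, "CulmenDepth": 17.8, "FlipperLength": 193, "BodyMass": 3800},
]
predicted = [assign_to_cluster(obs) for obs in new_observations]
print(predicted)
```

Nothing is refitted at inference time: both the scaling and the centroids are frozen, which is why the new data must follow the same schema as the training data.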

Your final inference pipeline should look similar to the following

N.B. When entering your data manually, be sure the input looks similar to the following:

CulmenLength,CulmenDepth,FlipperLength,BodyMass
39.1,18.7,181,3750
49.1,14.8,220,5150
46.6,17.8,193,3800

This is it. You are now ready to deploy a predictive service and start consuming your API from your applications.
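Once the service is deployed, consuming it mostly comes down to sending the observations as JSON. Here is a hedged sketch that only builds the request body, without making any network call. The payload shape and the input name WebServiceInput0 are assumptions on my side, so check them against the sample code shown on the Consume tab of your own endpoint; build_request_body is a hypothetical helper.

```python
import csv
import io
import json

NEW_OBSERVATIONS = """CulmenLength,CulmenDepth,FlipperLength,BodyMass
39.1,18.7,181,3750
49.1,14.8,220,5150
46.6,17.8,193,3800
"""


def build_request_body(csv_text, input_name="WebServiceInput0"):
    """Turn CSV text into a JSON request body (assumed payload shape)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{name: float(value) for name, value in row.items()} for row in reader]
    payload = {"Inputs": {input_name: rows}, "GlobalParameters": {}}
    return json.dumps(payload).encode("utf-8")


body = build_request_body(NEW_OBSERVATIONS)
print(body.decode("utf-8"))
```

The resulting bytes would then be posted to the scoring URI of your endpoint, together with its key in an authorization header.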

Hope this was useful. Please refer to Microsoft Learn for more insights, tips and tricks.