SMOTE on Azure ML Studio
Updated: May 17
How often do we have a fully balanced dataset with an almost equal number of instances per class? In real-world cases, with data collected from a user-facing website, it is almost impossible to guarantee an equal distribution among the classes. Thus it may be useful to balance your dataset.
One of the most popular solutions is to remove some instances from the majority class. However, what if your dataset is already small and removing instances from the majority class would hurt your model?
In this case, oversampling may be the only viable solution. The most common method is known as SMOTE: Synthetic Minority Oversampling Technique.
A quote from Wikipedia:
To illustrate how this technique works consider some training data which has s samples, and f features in the feature space of the data. Note that these features, for simplicity, are continuous. As an example, consider a dataset of birds for classification. The feature space for the minority class for which we want to oversample could be beak length, wingspan, and weight (all continuous). To then oversample, take a sample from the dataset, and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between one of those k neighbors, and the current data point. Multiply this vector by a random number x which lies between 0, and 1. Add this to the current data point to create the new, synthetic data point.
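The interpolation step described in the quote can be sketched in a few lines of Python (for instance, in an Execute Python Script module). This is only an illustrative sketch of the idea, not any library's API; the function name and signature here are made up:

```python
import numpy as np

def smote_sample(X, k=5, rng=None):
    """Generate one synthetic minority sample.

    X is an (s, f) array of minority-class samples with
    continuous features, as in the bird example above.
    """
    rng = rng or np.random.default_rng()
    i = rng.integers(len(X))              # pick a random minority sample
    point = X[i]
    # find its k nearest neighbors in feature space (excluding itself)
    dists = np.linalg.norm(X - point, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)             # pick one neighbor at random
    # multiply the vector to the neighbor by a random x in [0, 1)
    # and add it to the current point
    x = rng.random()
    return point + x * (X[j] - point)
```

Because the new point is an interpolation between a real sample and one of its neighbors, it always lies inside the convex hull of the minority class.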
Fortunately, there is a pre-built SMOTE module in Azure Machine Learning Studio.
You can follow all the steps from the official documentation to give it a try.
Here is a short summary of the steps:
1. Add the SMOTE module to your experiment.
2. Connect the dataset you want to boost.
3. Ensure that the column containing the label, or target class, is marked as such.
4. In the SMOTE percentage option, type a whole number that indicates the target percentage of minority cases in the output dataset.
5. Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses when building new cases.
6. Type a value in the Random seed textbox if you want to ensure the same results over runs of the same experiment with the same data.
7. Run the experiment.
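As a quick sanity check on the SMOTE percentage option: the module's documentation describes the value as a multiple of 100, where each 100% adds one minority class's worth of synthetic cases (100 doubles the minority class, 200 triples it, and so on). A tiny illustration, assuming that documented behavior:

```python
def smote_output_size(minority_count, smote_percentage):
    """Expected minority count after SMOTE, assuming the option adds
    smote_percentage% synthetic cases (100 -> double, 200 -> triple).
    Verify against your own experiment's output."""
    return minority_count + minority_count * smote_percentage // 100
```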
Nevertheless, as you may have noticed, the module only accepts two classes. But what if you want a multiclass balancer?
There is an excellent example of using SMOTE with a multiclass dataset on the Azure ML Gallery.
The logic is quite straightforward: split your dataset into two-class subsets (the majority class plus one of the minority classes), run SMOTE on each, then remove the majority class instances from all the subsets except one, and finally bind them all into a single dataset.
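That split-boost-bind procedure can be sketched in Python. Here `oversample_binary` stands in for whatever two-class SMOTE you apply to each subset (the module above, or any implementation); the function names and the pandas layout are assumptions for illustration only:

```python
import pandas as pd

def multiclass_smote(df, label_col, majority_class, oversample_binary):
    """Balance a multiclass DataFrame using a binary-only SMOTE.

    For each minority class, build a two-class subset
    (majority + that class), oversample it, keep only that
    class's rows so the majority is not duplicated, then bind
    everything back together with one copy of the majority class.
    """
    parts = [df[df[label_col] == majority_class]]  # keep majority once
    for cls in df[label_col].unique():
        if cls == majority_class:
            continue
        pair = df[df[label_col].isin([majority_class, cls])]
        boosted = oversample_binary(pair, label_col)
        parts.append(boosted[boosted[label_col] == cls])
    return pd.concat(parts, ignore_index=True)
```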
But what if you have a large dataset with more than 10 classes?
Well, at least you have Python and R script modules.
I didn't succeed in installing the imblearn package on Azure ML Studio because of a scikit-learn version mismatch.
Fortunately, the DMwR package is pre-installed on Azure ML Studio.
And if you run:

library(DMwR)

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame

# Drop the free-text column and make sure the label is a factor
clean_ds <- subset(dataset1, select = -c(text_column))
clean_ds$label_column <- as.factor(clean_ds$label_column)

# Oversample the minority class by 7000%, with no undersampling
new_ds <- SMOTE(label_column ~ ., clean_ds, perc.over = 7000, perc.under = 0)

maml.mapOutputPort("new_ds")
Works like a charm.