If you are reading this article, you have probably already run into this issue while running your own oversampling experiment. Here I share my personal experience of solving it. Even though these techniques worked in my experiment, that does not mean they will work for everyone; still, I hope someone finds this write-up useful.
Let's start with a brief introduction.
SMOTE stands for Synthetic Minority Oversampling Technique, a method for oversampling imbalanced data. In other words, if there is a huge difference in the distribution of instances across your classes (for instance, 10k elements of class A against 1k elements of class B), you may want to generate synthetic instances that do not simply duplicate existing samples but interpolate between a minority instance and its nearest neighbours in feature space to produce new ones.
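To make this concrete, here is a rough sketch of the interpolation idea (simplified and with my own variable names; the actual algorithm picks randomly among the k nearest minority-class neighbours):

import numpy as np

# a synthetic sample lies on the segment between a minority
# instance and one of its nearest minority-class neighbours
x_i = np.array([1.0, 3.0, 5.0])   # a minority-class instance
x_nn = np.array([2.0, 2.0, 6.0])  # one of its nearest neighbours
lam = np.random.rand()            # random factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)  # the new synthetic point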
In Python there is an excellent package called imbalanced-learn (imported as imblearn) that, besides SMOTE, implements several other oversampling algorithms.
As a result, the majority class does not dominate the training process, and all classes end up properly represented in the decision function.
The code is quite simple: initialize the SMOTE class and run the oversampling.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> sm = SMOTE(random_state=42)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 900, 1: 900})
However, when I ran my own experiment, it crashed with the following error:
MemoryError
Naturally, I started searching for this error and found a similar issue reported by other users. Even though it was not exactly the same error, I figured out that in my case it was caused by the high dimensionality (5,000 features) combined with the huge number of classes (157 in total).
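A quick back-of-the-envelope calculation shows why this blows up. SMOTE's default strategy resamples every minority class up to the majority-class size, so with a hypothetical majority class of 10,000 samples (an assumed figure, purely for illustration) the balanced dense matrix alone would be enormous:

# rough size of a fully balanced, dense float64 feature matrix
n_classes = 157
n_features = 5000
majority_size = 10000  # assumed figure, for illustration only
bytes_needed = n_classes * majority_size * n_features * 8  # 8 bytes per float64
print('%.0f GiB' % (bytes_needed / 1024 ** 3))  # roughly 58 GiB

And that is before the nearest-neighbour computations that SMOTE needs on top of it.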
So here are the steps I followed:
First, initialize the oversamplers. Besides SMOTE, I also set up a RandomOverSampler as a fallback for extremely small classes (pandas is imported here as well, since it is used below):

# initialize oversamplers
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
smote = SMOTE(random_state=42)
randomSampler = RandomOverSampler(random_state=42)
Then create a function for binary oversampling; if the minority class is too small for SMOTE's nearest-neighbour step, it falls back to random oversampling:

def normalize_binary(x, y, label):
    # remember the original feature columns
    x_columns = x.columns
    y_column = 'label_column'
    # get class statistics
    counter = Counter(y)
    # get the size of the minority class
    label_count = counter[label]
    # SMOTE needs more samples than k_neighbors, so for
    # extremely small classes apply random oversampling instead
    if label_count <= smote.k_neighbors:
        res_x, res_y = randomSampler.fit_resample(x, y)
    else:
        res_x, res_y = smote.fit_resample(x, y)
    result_frame = pd.DataFrame(data=res_x, columns=x_columns)
    result_frame[y_column] = res_y
    # leave only the (now oversampled) minority class
    result_frame = result_frame[result_frame.label_column == label]
    print(Counter(result_frame['label_column']))
    return result_frame
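To illustrate how the function behaves, here is a hypothetical toy call (the column names and class sizes are made up):

import numpy as np

# toy data: 5 samples of minority class 'A' against 25 of class 'B'
toy_x = pd.DataFrame(np.random.rand(30, 4), columns=['f0', 'f1', 'f2', 'f3'])
toy_y = pd.Series(['A'] * 5 + ['B'] * 25)

# with only 5 'A' samples (not more than k_neighbors) the random
# fallback kicks in; the result holds 25 rows of class 'A'
minority_a = normalize_binary(toy_x, toy_y, 'A')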
And here is the tip: I split the dataset into pairs consisting of the majority class and one minority class from the list. For each pair I ran the binary oversampler, kept only the minority class from its output, and appended the newly generated dataframe to the base frame.
# initial dataframe to which we will append synthetic classes:
# leave only the "majority class"
initial_output = feature_set[feature_set.label_column == 'majority class']
# filter out the raw text column
initial_output = initial_output.loc[:, initial_output.columns != 'text_column']
# get the list of all classes
all_labels = set(feature_set['label_column'])

for label in all_labels:
    if label != 'majority class':
        temp_classes = [label, 'majority class']
        temp = feature_set[feature_set.label_column.isin(temp_classes)]
        # features: drop the text and label columns
        x = temp.loc[:, temp.columns != 'text_column']
        x = x.loc[:, x.columns != 'label_column']
        y = temp['label_column']
        # get a dataframe with the current class and synthetic samples
        temp_dataframe = normalize_binary(x, y, label)
        # add to the initial output
        initial_output = pd.concat([initial_output, temp_dataframe], sort=True)
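Finally, it is worth sanity-checking that the resulting class distribution is balanced:

# verify the resulting class distribution
print(Counter(initial_output['label_column']))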
Hope you find this useful.